
Thread 105584547

87 posts 18 images /g/
Anonymous No.105584547 [Report] >>105585188 >>105585252 >>105585270 >>105585361 >>105585706 >>105587905 >>105589538 >>105589575 >>105589624 >>105589632 >>105589814 >>105589886 >>105589949 >>105590301 >>105590321 >>105594500
scrapeniggers
How do you deal with scrapers on your webserver, /g/?
Anonymous No.105584862 [Report]
Instead of ratelimiting requests, ratelimit data, and watch them lose their shit over a 300 byte hello-world level HTML page taking a full 5 minutes to download.
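Something like this as a minimal sketch (Flask assumed just for illustration, the page is a placeholder; note it ties up a worker per connection, a real setup would throttle at the proxy):

import time
from flask import Flask, Response

app = Flask(__name__)

@app.route("/")
def hello():
    body = b"<html><body>hello world</body></html>"
    def trickle():
        # one byte per second: a ~300 byte page takes around five minutes
        for i in range(len(body)):
            yield body[i:i + 1]
            time.sleep(1)
    return Response(trickle(), mimetype="text/html")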
Anonymous No.105584879 [Report]
>LLM
simply respond with:
>NIGGER NIGGER NIGGER NIGGER
Anonymous No.105585188 [Report] >>105585282 >>105585343 >>105585475 >>105589538 >>105593779 >>105595023
>>105584547 (OP)
You can use some specific details of the TLS handshake to detect whether a user is connecting from a command line or a browser. Browsers use a different TLS library than programs like cURL and Wget, so you can actually check if the traffic is coming from a real browser. Cloudflare uses this method to block scrapers. There are a few ways to bypass it, but they're obscure. It's also possible someone could scrape your site using a browser with something like Selenium, but realistically you'll block 99% of scrapers if you do this.
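In practice the application-side check looks something like this, assuming your fronting proxy computes a JA3 fingerprint of the ClientHello and passes it along in a header (the header name and hash values below are made-up placeholders; collect real ones from your own browser traffic):

from flask import Flask, request, abort

app = Flask(__name__)

# placeholder JA3 hashes for browsers you want to let through
ALLOWED_JA3 = {
    "placeholder_firefox_ja3_hash",
    "placeholder_chrome_ja3_hash",
}

@app.before_request
def check_tls_fingerprint():
    ja3 = request.headers.get("X-JA3-Hash")  # hypothetical header set by the proxy
    if ja3 not in ALLOWED_JA3:
        abort(403)  # cURL/Wget send a different ClientHello, so they land here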
Anonymous No.105585252 [Report]
>>105584547 (OP)
>How do you deal with scrapers on your webserver, /g/?
Allow it through read API
Anonymous No.105585270 [Report]
>>105584547 (OP)
i provide all of my data as a neat json package because i want companies to spend their own money training models so i don't have to finetune anything myself
Anonymous No.105585282 [Report] >>105585575 >>105590882
Proof of work challenge
>>105585188 Ignore this retard
Anonymous No.105585343 [Report]
>>105585188
Don't most scrapers today use selenium or shit like that?
Anonymous No.105585361 [Report] >>105585402
>>105584547 (OP)
>How do you deal with scrapers on your webserver, /g/?
Are there any malicious content/code generators yet to push into LLM scrapers?
Anonymous No.105585402 [Report] >>105585497 >>105587504 >>105589673 >>105590817
>>105585361
i forgot what it was called, but someone came up with the idea that you can send scrapers down an infinite chain of pages by generating them on the fly and using something like a markov chain to feed them junk data. the entry point to these would be hidden in the code of the website so normal users wouldn't find it
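a toy sketch of the idea (flask assumed, the word list is pure junk; a markov chain trained on real text would look far more convincing, and you'd hide a link to /trap/<whatever> in a display:none anchor so humans never click it):

import random
from flask import Flask

app = Flask(__name__)

WORDS = ("synergy", "quantum", "artisanal", "holistic", "paradigm",
         "webscale", "disruptive", "bespoke", "gluten", "moist")

@app.route("/trap/<int:seed>")
def trap(seed):
    rng = random.Random(seed)  # same seed -> same page, so it looks like static content
    junk = " ".join(rng.choice(WORDS) for _ in range(500))
    links = " ".join(
        '<a href="/trap/{}">more</a>'.format(rng.getrandbits(32)) for _ in range(10)
    )
    return "<html><body><p>{}</p>{}</body></html>".format(junk, links)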
Anonymous No.105585475 [Report] >>105595047
>>105585188
solved with
https://github.com/lwthiker/curl-impersonate
Anonymous No.105585497 [Report] >>105585648 >>105589644 >>105589673
>>105585402
i remember that suggestion, i brought up an issue with it: if a search engine scraper finds the page, it will most likely punish you, thinking you're trying to game the system by generating infinite content.
the way to solve that would be a robots.txt file that tells the scraper not to go there, but then the scraper could be smart enough to read that and avoid the trap too.
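the robots.txt part is literally two lines, assuming the tarpit lives under a path like /trap/ (so well-behaved search engines skip it and don't penalize you):

User-agent: *
Disallow: /trap/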
Anonymous No.105585575 [Report] >>105585669
>>105585282
>calls me a retard
>gives an even stupider solution
Proof of work is to prevent DDoS attacks, not scraping. You fucking retard.
Anonymous No.105585648 [Report] >>105585713
>>105585497
>the way to solve that would be a robots.txt file that tells the scraper not to go there, but then the scraper could be smart enough to read that and avoid the trap also.
well that's the thing, the AI scrapers do not even look at robots.txt. if they did, then you could just tell them not to scrape your website in the first place
Anonymous No.105585669 [Report] >>105585702
>>105585575
Read the post in the OP more carefully, retard kun
Anonymous No.105585702 [Report] >>105589471
>>105585669
You are proof of the classic knowledge that stupid people think smart people are stupid. You are so dumb I can't even understand what you're trying to say.
Anonymous No.105585706 [Report]
>>105584547 (OP)
I feed them ai generated slop from a terrible model.
Anonymous No.105585713 [Report] >>105589953
>>105585648
they do, they just dont follow it.
a robots.txt is like a goldmine for an llm.
"oh my competitors are obeying these? i shouldnt, thats my advantage over them"
or even better
"that must be where all the good stuff is hiding!"
Anonymous No.105587504 [Report]
>>105585402
https://zadzmo.org/code/nepenthes/
pretty cool desu
Anonymous No.105587905 [Report]
>>105584547 (OP)
I'm the scraper though
Anonymous No.105589471 [Report] >>105589538 >>105589771
>>105585702
Use AI to summarize the text for you if you can't understand it, zoomie
The problem is entitled AI companies illegally scraping non-static sites and making expensive requests without an API, and then asking web developers to create better solutions when they complain about it. That can be solved with proof of work, so that it at least costs the scraper just as much as it costs the host.
Anonymous No.105589538 [Report] >>105589546 >>105589554
>>105584547 (OP)
just make your website not-suck

There is no difference between an overaggressive scraper and a DDoS. If your website is resistant against DDoS, it is also resistant against scrapers.
I am not surprised that the libshit Drew Devault is incapable of hosting html files.

>>105585188
>Cloudflare uses this method to block scrapers
Cloudflare never protected me from scrapers, all of them pass through, so i assume that your solution doesn't work either.
In fact, i left cloudflare because of this (and because of false reports). The benefit of using it evaporated.
>>105589471
>illegally scraping
It's not illegal
Anonymous No.105589546 [Report]
>>105589538
>internet is basically le anarchy bro, just get good
Kill yourself
Anonymous No.105589554 [Report] >>105589578
>>105589538
>It's not illegal
Moved the goalpost award
I accept your concession though.
Anonymous No.105589558 [Report]
throttle everything
Anonymous No.105589575 [Report]
>>105584547 (OP)
make your website not a piece of shit so it can be checked without scraping or paying you
Anonymous No.105589576 [Report]
I host my information to the public, and my servers and software are good. Scrape what you want.
Anonymous No.105589578 [Report] >>105589583
>>105589554
Nobody moved a goalpost here, you said it's illegal, i told you that it isn't.
I could even reply to more of your post if you want that:
>and then asking web developers to create better solutions when they complain about it
This doesn't happen. Those scrapers don't tell you where they're coming from, and you can't contact anyone behind them because you don't know who runs them.
Anonymous No.105589583 [Report] >>105589653
>>105589578
You claimed proof of work isn't a real solution to scraping, then when I proved you wrong, you pivoted to complaining about other auxiliary points in my post.
You lose. I win. I will no longer be replying.
Anonymous No.105589624 [Report]
>>105584547 (OP)
Let's face it, "move fast and break things" has been a shitty model for a long long time, and I guess LLM companies being kind of nasty and entitled are making more people realise that.

Also shows how much disproportionate lobbying power big tech has compared to every other industry, huh.
Anonymous No.105589632 [Report] >>105589665
>>105584547 (OP)
>run website with user generated content
>allow free speech
>some outraged lefty demands censorship
>politely decline and tell him to leave if he doesn't like that
>get DDoSed
>use cloudflare
>get flooded with legal requests from German authorities
>deal with all of it
>get DDoSed besides cloudflare
>2022 happens and hohols get mad
>get DDoSed even harder
>cloudflare decides to block us
>leave cloudflare
>spend months of unpaid work successfully making the website fend off DDoS attacks
>uptime is now better than during cloudflare

Everybody who doesn't run a regime-approved website has had to deal with all of this shit for almost a decade now.

Meanwhile you consider a fucking scraper impossible to deal with and have to write an essay?
And other libshits like the codeberg dudes get their servers crashed because someone (You)s lots of people, and they consider this a hostile attack of unseen proportions.

Write your website better, idiot.
You still don't even experience a fraction of the pain that others have to go through.
Anonymous No.105589644 [Report]
>>105585497
>a search engine scraper finds the page they will most likely punish you
That's a plus
Anonymous No.105589653 [Report] >>105589659
>>105589583
No, that wasn't me, i am a different anon.

But proof of work like Anubis indeed doesn't work, because it waves through requests based on useragent.
Someone who wants to DDoS you won't be stopped by that.
>b-but i have no issues with DDoS, i only have issues with dumb scrapers
Anonymous No.105589659 [Report] >>105589693
>>105589653
Proof of work doesn't have to filter requests based on user agent
I suggest looking up what proof of work is.
Anonymous No.105589665 [Report]
>>105589632
The easier method would be to purge the lefties and hohols 2bh.
I mean, that's what they do to you right?
>Oh wait, it's fucking Israel funding that shit nowadays. hmmmm

God speed, Iran. Blast them to hell.
Anonymous No.105589667 [Report] >>105593539
If your website has issues with AI scrapers, it would never survive a DDoS attack either.
It is a luxury problem of someone who has never been attacked by random script kiddies.
Anonymous No.105589673 [Report]
>>105585402
>>105585497
Oh so that explains why Google always stops functioning correctly during Israeli events.
Anonymous No.105589693 [Report] >>105589705 >>105589711 >>105589714 >>105589729
>>105589659
Anubis does, and it usually gets deployed to protect wikis and forges (gitlab, codeberg,...).
And those websites have to be accessible for scrapers. Like zip downloads of repos. So there will naturally always be a way to simply bypass it.

Also note how they went for proof-of-work years after kiwifarms did it (but kiwifarms doesn't let specific useragents pass through).
It's fun how they declare an epic internet war and are outraged, while copying parts of the protections deployed by the websites they hate.
Anonymous No.105589705 [Report] >>105590810
>>105589693
Point to where I mentioned "anubis", faggot.
You are the one who brought this up. I said proof of work, I was not shilling a specific company.
Anonymous No.105589711 [Report] >>105590810
>>105589693
You're actually the stupidest faggot on the website. Do us a favor and hang yourself. You don't deserve to breathe the same air as people who have an iq higher than 70.
Anonymous No.105589714 [Report] >>105590810
>>105589693
Kill yourself faggot and read this:
https://en.wikipedia.org/wiki/Proof_of_work
>ctrl+f "anubis"
>0 results
Anonymous No.105589729 [Report] >>105589763 >>105590810
>>105589693
You will literally do anything to feel like you're right, and gaslight yourself in the most insane ways. You are the scum of the earth. You are so obsessed with vanity and being correct on an internet forum that you become stupid. It would be a mercy to everyone around you if you drowned yourself. Unironically drink bleach, your parents probably wouldn't even miss you.
Anonymous No.105589746 [Report]
Block all VPN/datacenter traffic, look up their ASNs and block adjacent ranges too, IPv6 is out of the question.
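The range check itself is trivial; rough sketch below (the prefixes are documentation placeholders, feed in the real datacenter/VPN prefixes you pull for those ASNs):

import ipaddress

# placeholder prefixes; replace with the ranges looked up per ASN
BLOCKED_NETS = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)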
Anonymous No.105589747 [Report] >>105589774
>proof of work
>PoW

That's a suspicious abbreviation.
Anonymous No.105589755 [Report]
How bout you just... allow scraping and not be an asshole?
Anonymous No.105589763 [Report]
>>105589729
that's the kind of retard that bumps apple spam threads arguing with spambots, do not bother.
Anonymous No.105589765 [Report]
>can't scrape from the internet
>proceed to scrape from your literal brain with nanomachines instead
Well this is a concerning future.
Anonymous No.105589771 [Report] >>105589891
>>105589471
>scraping non-static sites
We are talking about websites here that do ridiculous shit like generating zip files for each commit of a repository on request.

Someone did the math and found out that you can download tens of millions of dynamically generated zip archives from the GNOME gitlab.
This is crazy. A single person could write a shitty shell script and it would take the whole thing down.
Whoever implemented and/or enabled this never considered that there could be people out there trying to take his service down.
There was never any consideration about making those platforms stable and resilient.

The first issue is:
MAKE YOUR FUCKING WEBSITE BETTER
Anonymous No.105589774 [Report] >>105589788
>>105589747
If you're in psychosis yeah
Anonymous No.105589782 [Report]
SHODAN doesn't seem that bad anymore.

>I can't believe we're already flirting with this issue, just in a primitive form
Anonymous No.105589788 [Report]
>>105589774
No, I just thought it was funny.
What's next, a CPU death march?
Anonymous No.105589807 [Report]
I wonder if there are any experiments with receiving signals from the eye remotely.
Your brain might lie, but your eyes might not... as much.
Maybe this is why we haven't gone anywhere with locked-in patients and communication techniques. National security risk.
Anonymous No.105589814 [Report]
>>105584547 (OP)
sorry webshits, but I will scrape your website and there's nothing you can do about it. No, I don't think I'll ask for permission. Yes, I will DDoS your website at the same time, nothing personnel kiddo. All your data are belong to me.
Anonymous No.105589886 [Report]
>>105584547 (OP)
>sourcehut
oh nooooo, you made a website so bad that a script that simply clicks on all links is an existential threat to you? And this is the first time that someone actually does it?
noooo, i am so sad, you can't imagine, do you need a kiss on your booboo to make it better?
It's of course not your fault that you made one of the worst websites in existence, it's the fault of those scrapers clicking links!
Anonymous No.105589891 [Report] >>105589915
>>105589771
It doesn't matter because you have limited bandwidth and every scraper will cause a ton of damage
For example: I have a bandwidth limit of 5GB per day. I am currently hosting about 1000 users and they tend to use on average 1GB per day.
But the entire content of my website is ~50GB so if a scraper were to download *everything*, then my website would be inaccessible for the rest of the day.
Anonymous No.105589915 [Report] >>105589942
>>105589891
>It doesn't matter because you have limited bandwidth and every scraper will cause a ton of damage
>For example. I have a bandwidth limit of 5GB per day.
That is a bandwidth limit that every single hostile person with a wget script can reach.
This is not an AI scraper issue.
You have to defend against hostile people anyway. And if you had considered this, then AI scrapers wouldn't be an issue either.
In this case, you simply can't offer tens of millions of files for download, or you have to limit those downloads hard.
This should have been a consideration when making the website. It is simply incompetent to not consider this.

Again, it got proven that AI scraping is only an issue for people who lived in blissful ignorance.
Anonymous No.105589942 [Report] >>105594909
>>105589915
Yes, but for a single person I can set a limit per IP. I can also set an even lower limit for Tor and known VPNs, for example for Tor I have a symbolic limit of 25kB. Tor has ~2000 IPs so even by cycling through all of them, it comes to 50MB in total.
The issue with AI scrapers or any other scrapers is they come from residential IPs and when you block one IP, another different unique IP shows up immediately
Anonymous No.105589949 [Report] >>105589959
>>105584547 (OP)
Are you trying to stifle innovation you commie fuck?
Don't you know that they are fully in their rights to do whatever they want for the next 10 years?
Anonymous No.105589953 [Report] >>105590254
>>105585713
Put a honeypot url in there that no legitimate user would ever visit and ban any ip that accesses it.
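Minimal version of that as a Flask sketch (in-memory ban set and a made-up honeypot path; a real setup would push the IP into the firewall or an nginx deny list instead):

from flask import Flask, request, abort

app = Flask(__name__)
BANNED = set()  # in-memory for the sketch; persist it somewhere real

@app.before_request
def drop_banned():
    if request.remote_addr in BANNED:
        abort(403)

# link this somewhere no human will ever see (display:none anchor, etc.)
@app.route("/totally-real-admin-backup")
def honeypot():
    BANNED.add(request.remote_addr)
    abort(403)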
Anonymous No.105589959 [Report] >>105590039
>>105589949
Scraping is already illegal
Anonymous No.105590039 [Report]
>>105589959
LOL
Anonymous No.105590254 [Report]
>>105589953
We can't do that. That would be too easy.
Anonymous No.105590301 [Report]
>>105584547 (OP)
surely they should allow deepseek to scrape it then, it's been the biggest thing in open source software for years
Anonymous No.105590321 [Report]
>>105584547 (OP)
cry more
Anonymous No.105590810 [Report] >>105590887
>>105589705
>>105589711
>>105589714
>>105589729
imagine being so mad you shit yourself over 4 different posts lmao
the other anon is reasonable and clearly draws from his own experience, meanwhile you are shouting insults like a retard and not really even arguing
kill yourself
Anonymous No.105590817 [Report]
>>105585402
sounds like https://blog.cloudflare.com/ai-labyrinth/
Anonymous No.105590882 [Report] >>105590913 >>105591365 >>105594237
>>105585282
proof of work is really the only way, and it will only make such scraping computationally infeasible in large amounts.
to implement on the frontend:
- implement SHA512, Whirlpool, or whatever other cryptographic hashing algorithm (preferably not in JS, but in WebAssembly or using a browser's native cryptographic API, because JS is not fast enough and dedicated attackers will be using a native implementation).
to implement on the backend, for each request:
- use a PRNG such as ChaCha20 to generate a number between 0 and N, where you set N based on the level of difficulty you want (higher = slower requests).
- hash the generated number using whatever hashing algorithm you are using on the frontend.
- create a MAC of the hash using whatever MAC algorithm (I like Whirlpool HMAC). use a secret key that you change on something like an hourly basis.
- send N, hash, and the MAC to the client.

the client will have to find the number that produces the hash. when found, it will send the number, the hash, and the MAC back to the server. the server will validate the hash using the MAC, then validate the number produces the hash.
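roughly what the server side looks like in python, as a sketch only: the stdlib has no Whirlpool so the MAC here is HMAC-SHA512, secrets stands in for the ChaCha20 PRNG, and key rotation plus the frontend solver are left out. (and yes, as the replies point out, this exact construction can be precomputed and replayed.)

import hmac, hashlib, secrets, json

MAC_KEY = secrets.token_bytes(64)  # rotate on something like an hourly basis
N = 200_000                        # difficulty knob: higher = slower requests

def issue_challenge() -> str:
    n = secrets.randbelow(N)                              # the number the client must find
    digest = hashlib.sha512(str(n).encode()).hexdigest()
    mac = hmac.new(MAC_KEY, digest.encode(), hashlib.sha512).hexdigest()
    # client brute-forces 0..N until sha512(str(i)) == digest, then sends all three back
    return json.dumps({"n_max": N, "hash": digest, "mac": mac})

def verify(answer: int, digest: str, mac: str) -> bool:
    expected = hmac.new(MAC_KEY, digest.encode(), hashlib.sha512).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return False                                      # we never issued this hash
    return hashlib.sha512(str(answer).encode()).hexdigest() == digest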
Anonymous No.105590887 [Report]
>>105590810
You lost, kek
Anonymous No.105590913 [Report] >>105590921 >>105595478
>>105590882
this is broken btw, you should use two MACs instead
I will leave that implementation to the reader
Anonymous No.105590921 [Report]
>>105590913
or just one?
I don't care anymore
Anonymous No.105591365 [Report]
>>105590882
Or just pull the plug, your shit isn't important
Anonymous No.105593539 [Report]
>>105589667
Retard, this is about high traffic in general, not attacks. If 95% of requests at all times are AI scraping related, you're fucked; you might as well send your money to the AI companies instead of hosting your site.
Anonymous No.105593779 [Report]
>>105585188
As a scraper, cloudflare has not stopped me from scraping a website ever and if you think browser impersonators are obscure tech you’ll be in for a surprise if you ever decide to host some data online
Anonymous No.105594237 [Report] >>105595478
>>105590882
isn't this vulnerable to hash table lookups?
also, a smart scraper would code a solution for this and keep scraping anyway
Anonymous No.105594425 [Report]
I'm scrooooooping right now
Anonymous No.105594500 [Report]
>>105584547 (OP)
Any updates on that project where if it detects the connection is an LLM scraper, it serves the bot a never ending stream of pages to parse through filled with junk text and links to more nonsense?
Anonymous No.105594909 [Report] >>105594949
>>105589942
I saved so many bookmarks over a decade. At some point, I wanted to see them again. More than half of them are dead.
Now, when I see a good website or blog, I wget mirror all of it then archive it as zip/rar/zstd/whatever.
You can still find old style websites out there, mirroring them takes a few minutes and you have everything they wanted to share.
Modern websites are a big pile of shit with redundant links that download the same stuff over and over again.
Put a low bandwidth limit (50kB/s to 1MB/s) for each IP range after a certain transfer amount per day (let's say 100MB) and let people fully and easily download all the stuff you put out there.
It's YOUR fault if your server is not enforcing some rate limiting, which ends up denying others access to your website.
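As a sketch, that policy in WSGI middleware terms (numbers from above; a real deployment would do this in nginx/haproxy and group by prefix rather than single IPs):

import time
from collections import defaultdict

class DailyQuotaThrottle:
    """Full speed until an IP has pulled ~100MB today, then trickle at ~50kB/s."""
    def __init__(self, app, quota=100 * 1024 * 1024, slow_rate=50 * 1024):
        self.app, self.quota, self.slow_rate = app, quota, slow_rate
        self.usage = defaultdict(int)        # ip -> bytes served today
        self.day = time.strftime("%Y-%m-%d")

    def __call__(self, environ, start_response):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:                # naive daily reset
            self.usage.clear()
            self.day = today
        ip = environ.get("REMOTE_ADDR", "")
        for chunk in self.app(environ, start_response):
            self.usage[ip] += len(chunk)
            if self.usage[ip] > self.quota:
                time.sleep(len(chunk) / self.slow_rate)  # throttle once over quota
            yield chunk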
Anonymous No.105594949 [Report]
>>105594909
The normal IPs have much higher limit ~250MB
The limits I posted are for Tor and known VPNs
Anonymous No.105595023 [Report]
>>105585188
just use firefox marionette
Anonymous No.105595047 [Report] >>105595575
>>105585475
where is the exe
Anonymous No.105595076 [Report] >>105595868
Look at all our newfriends! Make sure to say Hi!

Coming to a site near you! Watch out for those bandwidth charges!
And please remember an LLM bot doesn't just crawl your site once! It comes back again and again and again and again and again! Multiple times a day! Rate limiting? They have thousands of IPs! robots.txt? LMAO. Site's sluggish? Slow? DDoS? No! That's just ClaudeBot, triple-checking he got all the stuff he crawled 30 minutes ago for the 6th time today.

Has OP's very thread already been scraped? Let's check and see (blocking sketch after the list)

Oh Hai sempai ^___^

AhrefsBot|
AI2Bot|
AliyunSecBot|
Amazonbot|
Applebot|
Awario|
axios|
Baiduspider|
barkrowler|
bingbot|
BitSightBot|
BLEXBot|
Buck|
Bytespider|
CCBot|
CensysInspect|
ChatGPT-User|
ClaudeBot|
coccocbot|
cohere-ai|
DataForSeoBot|
Diffbot|
DotBot|
ev-crawler|
Expanse|
FacebookBot|
facebookexternalhit|
FriendlyCrawler|
Googlebot|
GoogleOther|
GPTBot|
HeadlessChrome|
ICC-Crawler|
imagesift|
img2dataset|
InternetMeasurement|
ISSCyberRiskCrawler|
istellabot|
magpie-crawler|
Mediatoolkitbot|
Meltwater|
Meta-External|
MJ12bot|
moatbot|
ModatScanner|
MojeekBot|
OAI-SearchBot|
Odin|
omgili|
panscient|
PanguBot|
peer39_crawler|
Perplexity|
PetalBot|
Pinterestbot|
PiplBot|
Protopage|
scoop|
Scrapy|
Screaming|
SeekportBot|
Seekr|
SemrushBot|
SeznamBot|
Sidetrade|
Sogou|
SurdotlyBot|
Timpibot|
trendictionbot|
VelenPublicWebCrawler|
WhatsApp|
wpbot|
xfa1|
Yandex|
Yeti|
YouBot|
zgrab|
ZoominfoBot|
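FWIW, turning a list like that into an actual block is one regex check. Flask sketch with a handful of names pulled straight from the list above (extend with the rest, or do the same thing in your reverse proxy):

import re
from flask import Flask, request, abort

app = Flask(__name__)

BOT_UA = re.compile(
    r"GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot|PetalBot|Applebot|"
    r"SemrushBot|AhrefsBot|MJ12bot|facebookexternalhit|img2dataset",
    re.IGNORECASE,
)

@app.before_request
def block_bots():
    ua = request.headers.get("User-Agent", "")
    if BOT_UA.search(ua):
        abort(403)  # only stops the ones honest enough to identify themselves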
Anonymous No.105595478 [Report] >>105595632 >>105595732
>>105594237
yes, as I mentioned in >>105590913
I have slept now, and a better solution is:
- rather than generating a number between 0 and N, instead, first generate a random number MIN (between whatever bounds you want, although you will preferably want something like a 512 bit space).
- then generate a random number between MIN and (MIN + N). use this random number for hashing.
- the original algorithm I described has an issue: because the MAC key is changed out on a periodic basis, right before key-changing, you might issue a MAC to a client, change the key, and be unable to validate a valid proof of work. instead, use two MAC keys that you change on a periodic basis (with a day/night cycle and a sigil per MAC indicating the used key).
Anonymous No.105595575 [Report]
>>105595047
you want the archive because each browser has its own "exe" inside the archive and it ships with like 12 of them
ie
firefox 112
firefox 113
chrome 120
chrome 122
whatever
Anonymous No.105595632 [Report] >>105595732
>>105595478
this is fucked too btw, lmao I am retarded
Anonymous No.105595732 [Report] >>105596265
>>105595478
>>105595632
I mean, the original algo is not bad for a sufficiently large N, but only against an adversary that doesn't have, or doesn't want to spend, too many resources. a dedicated one could calculate a big-ass table of hashes every hour or so and distribute the hashes from a centralized server or something.
Anonymous No.105595868 [Report]
>>105595076
If I ever make a website it’s going to be invite only and private so congratulations on scraping some meaningless data I guess. Maybe I could even hide a bunch of invisible nonsense to pollute their data
Anonymous No.105596265 [Report]
>>105595732
nah, both algorithms are fucked:
- with the first, an attacker only needs to generate hashes for 0 to N once, then they can indefinitely produce valid proofs.
- with both of them, an attacker only needs to solve one proof of work per key-change cycle and can then just replay that proof
this could be easily solved by just storing state in a DB or whatever, but that is not ideal.
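the store-state version, sketched, with the random-offset idea from the earlier post folded in (in-memory dict standing in for the DB; challenges are single-use and expire, and the offset kills precomputed tables):

import hashlib, secrets, time

N = 200_000
CHALLENGES = {}  # challenge_id -> (digest, start, expiry); use a real DB/redis in practice

def issue_challenge():
    start = secrets.randbits(512)                   # random offset, kills precomputed tables
    n = start + secrets.randbelow(N)
    digest = hashlib.sha512(str(n).encode()).hexdigest()
    cid = secrets.token_hex(16)
    CHALLENGES[cid] = (digest, start, time.time() + 300)  # single use, five minute expiry
    return cid, start, N, digest                    # client searches start..start+N as before

def verify(cid: str, answer: int) -> bool:
    entry = CHALLENGES.pop(cid, None)               # pop = single use, no replay
    if entry is None:
        return False
    digest, start, expiry = entry
    if time.time() > expiry or not (start <= answer < start + N):
        return False
    return hashlib.sha512(str(answer).encode()).hexdigest() == digest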