
Thread 105584547

87 posts 18 images /g/
Anonymous No.105584547 [Report] >>105585188 >>105585252 >>105585270 >>105585361 >>105585706 >>105587905 >>105589538 >>105589575 >>105589624 >>105589632 >>105589814 >>105589886 >>105589949 >>105590301 >>105590321 >>105594500
scrapeniggers
How do you deal with scrapers on your webserver, /g/?
Anonymous No.105584862 [Report]
Instead of ratelimiting requests, ratelimit data, and watch them lose their shit over a 300 byte hello-world level HTML page taking a full 5 minutes to download.
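Something like this as a minimal sketch (Flask assumed just for illustration, the page is a placeholder; note it ties up a worker per connection, a real setup would throttle at the proxy):

import time
from flask import Flask, Response

app = Flask(__name__)

@app.route("/")
def hello():
    body = b"<html><body>hello world</body></html>"
    def trickle():
        # one byte per second: a ~300 byte page takes around five minutes
        for i in range(len(body)):
            yield body[i:i + 1]
            time.sleep(1)
    return Response(trickle(), mimetype="text/html")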
Anonymous No.105584879 [Report]
>LLM
simply respond with:
>NIGGER NIGGER NIGGER NIGGER
Anonymous No.105585188 [Report] >>105585282 >>105585343 >>105585475 >>105589538 >>105593779 >>105595023
>>105584547 (OP)
You can use some specific details of the TLS handshake to detect whether a user is connecting from a command line or a browser. Browsers use a different TLS library than programs like cURL and Wget, so you can actually check if the traffic is coming from a real browser. Cloudflare uses this method to block scrapers. There are a few ways to bypass it, but they're obscure. It's also possible someone could scrape your site using a browser with something like Selenium, but realistically you'll block 99% of scrapers if you do this.
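In practice the application-side check looks something like this, assuming your fronting proxy computes a JA3 fingerprint of the ClientHello and passes it along in a header (the header name and hash values below are made-up placeholders; collect real ones from your own browser traffic):

from flask import Flask, request, abort

app = Flask(__name__)

# placeholder JA3 hashes for browsers you want to let through
ALLOWED_JA3 = {
    "placeholder_firefox_ja3_hash",
    "placeholder_chrome_ja3_hash",
}

@app.before_request
def check_tls_fingerprint():
    ja3 = request.headers.get("X-JA3-Hash")  # hypothetical header set by the proxy
    if ja3 not in ALLOWED_JA3:
        abort(403)  # cURL/Wget send a different ClientHello, so they land here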
Anonymous No.105585252 [Report]
>>105584547 (OP)
>How do you deal with scrapers on your webserver, /g/?
Allow it through read API
Anonymous No.105585270 [Report]
>>105584547 (OP)
i provide all of my data as a neat json package because i want companies to spend their own money training models so i don't have to finetune anything myself
Anonymous No.105585282 [Report] >>105585575 >>105590882
Proof of work challenge
>>105585188 Ignore this retard
Anonymous No.105585343 [Report]
>>105585188
Don't most scrapers today use selenium or shit like that?
Anonymous No.105585361 [Report] >>105585402
>>105584547 (OP)
>How do you deal with scrapers on your webserver, /g/?
Are there any malicious content/code generators yet to push into LLM scrapers?
Anonymous No.105585402 [Report] >>105585497 >>105587504 >>105589673 >>105590817
>>105585361
i forgot what it was called, but someone came up with the idea that you can send scrapers down an infinite chain of pages by generating them on the fly and using something like a markov chain to feed them junk data. the entry point to these would be hidden in the code of the website so normal users wouldn't find it
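a toy sketch of the idea (flask assumed, the word list is pure junk; a markov chain trained on real text would look far more convincing, and you'd hide a link to /trap/<whatever> in a display:none anchor so humans never click it):

import random
from flask import Flask

app = Flask(__name__)

WORDS = ("synergy", "quantum", "artisanal", "holistic", "paradigm",
         "webscale", "disruptive", "bespoke", "gluten", "moist")

@app.route("/trap/<int:seed>")
def trap(seed):
    rng = random.Random(seed)  # same seed -> same page, so it looks like static content
    junk = " ".join(rng.choice(WORDS) for _ in range(500))
    links = " ".join(
        '<a href="/trap/{}">more</a>'.format(rng.getrandbits(32)) for _ in range(10)
    )
    return "<html><body><p>{}</p>{}</body></html>".format(junk, links)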
Anonymous No.105585475 [Report] >>105595047
>>105585188
solved with
https://github.com/lwthiker/curl-impersonate
Anonymous No.105585497 [Report] >>105585648 >>105589644 >>105589673
>>105585402
i remember that suggestion, i brought up an issue with it: if a search engine scraper finds the page, it will most likely punish you, thinking you're trying to game the system by generating infinite content.
the way to solve that would be a robots.txt file that tells the scraper not to go there, but then the scraper could be smart enough to read that and avoid the trap too.
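the robots.txt part is literally two lines, assuming the tarpit lives under a path like /trap/ (so well-behaved search engines skip it and don't penalize you):

User-agent: *
Disallow: /trap/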
Anonymous No.105585575 [Report] >>105585669
>>105585282
>calls me a retard
>gives an even stupider solution
Proof of work is to prevent DDoS attacks, not scraping. You fucking retard.
Anonymous No.105585648 [Report] >>105585713
>>105585497
>the way to solve that would be a robots.txt file that tells the scraper not to go there, but then the scraper could be smart enough to read that and avoid the trap also.
well that's the thing, the AI scrapers do not even look at robots.txt. if they did, then you could just tell them not to scrape your website in the first place
Anonymous No.105585669 [Report] >>105585702
>>105585575
Read the post in the OP more carefully, retard kun
Anonymous No.105585702 [Report] >>105589471
>>105585669
You are proof of the classic knowledge that stupid people think smart people are stupid. You are so dumb I can't even understand what you're trying to say.
Anonymous No.105585706 [Report]
>>105584547 (OP)
I feed them ai generated slop from a terrible model.
Anonymous No.105585713 [Report] >>105589953
>>105585648
they do, they just dont follow it.
a robots.txt is like a goldmine for an llm.
"oh my competitors are obeying these? i shouldnt, thats my advantage over them"
or even better
"that must be where all the good stuff is hiding!"
Anonymous No.105587504 [Report]
>>105585402
https://zadzmo.org/code/nepenthes/
pretty cool desu
Anonymous No.105587905 [Report]
>>105584547 (OP)
I'm the scraper though
Anonymous No.105589471 [Report] >>105589538 >>105589771
>>105585702
Use AI to summarize the text for you if you can't understand it, zoomie
The problem is entitled AI companies illegally scraping non-static sites and making expensive requests without an API, and then asking web developers to create better solutions when they complain about it. That can be solved with proof of work, so that it at least costs the scraper just as much as it costs the host.
Anonymous No.105589538 [Report] >>105589546 >>105589554
>>105584547 (OP)
just make your website not-suck

There is no difference between an overaggressive scraper and a DDoS. If your website is resistant against DDoS, it is also resistant against scrapers.
I am not surprised that the libshit Drew Devault is incapable of hosting html files.

>>105585188
>Cloudflare uses this method to block scrapers
Cloudflare never protected me from scrapers, all of them pass through, so i assume that your solution doesn't work either.
In fact, i left cloudflare because of this (and because of false reports). The benefit of using it evaporated.
>>105589471
>illegally scraping
It's not illegal
Anonymous No.105589546 [Report]
>>105589538
>internet is basically le anarchy bro, just get good
Kill yourself
Anonymous No.105589554 [Report] >>105589578
>>105589538
>It's not illegal
Moved the goalpost award
I accept your concession though.
Anonymous No.105589558 [Report]
throttle everything
Anonymous No.105589575 [Report]
>>105584547 (OP)
make your website not a piece of shit so it can be checked without scraping or paying you
Anonymous No.105589576 [Report]
I host my information to the public, and my servers and software are good. Scrape what you want.
Anonymous No.105589578 [Report] >>105589583
>>105589554
Nobody moved a goalpost here, you said it's illegal, i told you that it isn't.
I could even reply to more of your post if you want that:
>and then asking web developers to create better solutions when they complain about it
This doesn't happen. Those scrapers don't tell you where they're coming from, and you can't contact anyone behind them because you don't know who runs them.
Anonymous No.105589583 [Report] >>105589653
>>105589578
You claimed proof of work isn't a real solution to scraping, then when I proved you wrong, you pivoted to complaining about other auxiliary points in my post.
You lose. I win. I will no longer be replying.
Anonymous No.105589624 [Report]
>>105584547 (OP)
Let's face it, "move fast and break things" has been a shitty model for a long long time, and I guess LLM companies being kind of nasty and entitled are making more people realise that.

Also shows how much disproportionate lobbying power big tech has compared to every other industry, huh.
Anonymous No.105589632 [Report] >>105589665
>>105584547 (OP)
>run website with user generated content
>allow free speech
>some outraged lefty demands censorship
>politely decline and tell him to leave if he doesn't like that
>get DDoSed
>use cloudflare
>get flooded with legal requests from German authorities
>deal with all of it
>get DDoSed besides cloudflare
>2022 happens and hohols get mad
>get DDoSed even harder
>cloudflare decides to block us
>leave cloudflare
>spend months of unpaid work successfully making the website fend off DDoS attacks
>uptime is now better than during cloudflare

Everybody who doesn't run a regime-approved website has had to deal with all of this shit for almost a decade now.

Meanwhile you consider a fucking scraper impossible to deal with and have to write an essay?
And other libshits like the codeberg dudes get their servers crashed because someone (You)s lots of people, and they consider this a hostile attack of unseen proportions.

Write your website better, idiot.
You still don't even experience a fraction of the pain that others have to go through.
Anonymous No.105589644 [Report]
>>105585497
>a search engine scraper finds the page they will most likely punish you
That's a plus
Anonymous No.105589653 [Report] >>105589659
>>105589583
No, that wasn't me, i am a different anon.

But proof of work like Anubis indeed doesn't work, because it waves through requests based on useragent.
Someone who wants to DDoS you won't be stopped by that.
>b-but i have no issues with DDoS, i only have issues with dumb scrapers
Anonymous No.105589659 [Report] >>105589693
>>105589653
Proof of work doesn't have to filter requests based on user agent
I suggest looking up what proof of work is.
Anonymous No.105589665 [Report]
>>105589632
The easier method would be to purge the lefties and hohols 2bh.
I mean, that's what they do to you right?
>Oh wait, it's fucking Israel funding that shit nowadays. hmmmm

God speed, Iran. Blast them to hell.
Anonymous No.105589667 [Report] >>105593539
If your website has issues with AI scrapers, it would never survive a DDoS attack either.
It is a luxury problem of someone who has never been attacked by random script kiddies.
Anonymous No.105589673 [Report]
>>105585402
>>105585497
Oh so that explains why Google always stops functioning correctly during Israeli events.
Anonymous No.105589693 [Report] >>105589705 >>105589711 >>105589714 >>105589729
>>105589659
Anubis does, and it usually gets deployed to protect wikis and forges (gitlab, codeberg,...).
And those websites have to be accessible for scrapers. Like zip downloads of repos. So there will naturally always be a way to simply bypass it.

Also note how they went for proof-of-work years after kiwifarms did it (but kiwifarms doesn't let specific useragents pass through).
It's fun how they declare an epic internet war and are outraged, while copying parts of the protections deployed by the websites they hate.
Anonymous No.105589705 [Report] >>105590810
>>105589693
Point to where I mentioned "anubis", faggot.
You are the one who brought this up. I said proof of work, I was not shilling a specific company.
Anonymous No.105589711 [Report] >>105590810
>>105589693
You're actually the stupidest faggot on the website. Do us a favor and hang yourself. You don't deserve to breathe the same air as people who have an iq higher than 70.
Anonymous No.105589714 [Report] >>105590810
>>105589693
Kill yourself faggot and read this:
https://en.wikipedia.org/wiki/Proof_of_work
>ctrl+f "anubis"
>0 results
Anonymous No.105589729 [Report] >>105589763 >>105590810
>>105589693
You will literally do anything to feel like you're right, and gaslight yourself in the most insane ways. You are the scum of the earth. You are so obsessed with vanity and being correct on an internet forum that you become stupid. It would be a mercy to everyone around you if you drowned yourself. Unironically drink bleach, your parents probably wouldn't even miss you.
Anonymous No.105589746 [Report]
Block all VPN/datacenter traffic, look up their ASNs and block adjacent ranges too, IPv6 is out of the question.
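The range check itself is trivial; rough sketch below (the prefixes are documentation placeholders, feed in the real datacenter/VPN prefixes you pull for those ASNs):

import ipaddress

# placeholder prefixes; replace with the ranges looked up per ASN
BLOCKED_NETS = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)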
Anonymous No.105589747 [Report] >>105589774
>proof of work
>PoW

That's a suspicious abbreviation.
Anonymous No.105589755 [Report]
How bout you just... allow scraping and not be an asshole?
Anonymous No.105589763 [Report]
>>105589729
that's the kind of retard that bumps apple spam threads arguing with spambots, do not bother.
Anonymous No.105589765 [Report]
>can't scrape from the internet
>proceed to scrape from your literal brain with nanomachines instead
Well this is a concerning future.
Anonymous No.105589771 [Report] >>105589891
>>105589471
>scraping non-static sites
We are talking about websites here that do ridiculous shit like generating zip files for each commit of a repository on request.

Someone did the math and found out that you can download tens of millions of dynamically generated zip archives from the GNOME gitlab.
This is crazy. A single person could write a shitty shell script and it would take the whole thing down.
Whoever implemented and/or enabled this never considered that there could be people out there trying to take his service down.
There was never any consideration about making those platforms stable and resilient.

The first issue is:
MAKE YOUR FUCKING WEBSITE BETTER
Anonymous No.105589774 [Report] >>105589788
>>105589747
If you're in psychosis yeah
Anonymous No.105589782 [Report]
SHODAN doesn't seem that bad anymore.

>I can't believe we're already flirting with this issue, just in a primitive form
Anonymous No.105589788 [Report]
>>105589774
No, I just thought it was funny.
What's next, a CPU death march?
Anonymous No.105589807 [Report]
I wonder if there are any experiments with receiving signals from the eye remotely.
Your brain might lie, but your eyes might not... as much.
Maybe this is why we haven't gone anywhere with locked-in patients and communication techniques. National security risk.
Anonymous No.105589814 [Report]
>>105584547 (OP)
sorry webshits, but I will scrape your website and there's nothing you can do about it. No, I don't think I'll ask for permission. Yes, I will DDoS your website at the same time, nothing personnel kiddo. All your data are belong to me.
Anonymous No.105589886 [Report]
>>105584547 (OP)
>sourcehut
oh nooooo, you made a website so bad that a script that simply clicks on all links is an existential threat to you? And this is the first time that someone actually does it?
noooo, i am so sad, you can't imagine, do you need a kiss on your booboo to make it better?
It's of course not your fault that you made one of the worst websites in existence, it's the fault of those scrapers clicking links!
Anonymous No.105589891 [Report] >>105589915
>>105589771
It doesn't matter because you have limited bandwidth and every scraper will cause a ton of damage
For example: I have a bandwidth limit of 5GB per day. I am currently hosting about 1000 users and they tend to use on average 1GB per day.
But the entire content of my website is ~50GB so if a scraper were to download *everything*, then my website would be inaccessible for the rest of the day.
Anonymous No.105589915 [Report] >>105589942
>>105589891
>It doesn't matter because you have limited bandwidth and every scraper will cause a ton of damage
>For example. I have a bandwidth limit of 5GB per day.
That is a bandwidth limit that every single hostile person with a wget script can reach.
This is not an AI scraper issue.
You have to defend against hostile people anyway. And if you had considered this, then AI scrapers wouldn't be an issue either.
In this case, you simply can't offer tens of millions of files for download, or you have to limit those downloads hard.
This should have been a consideration when making the website. It is simply incompetent to not consider this.

Again, it got proven that AI scraping is only an issue for people who lived in blissful ignorance.
Anonymous No.105589942 [Report] >>105594909
>>105589915
Yes, but for a single person I can set a limit per IP. I can also set an even lower limit for Tor and known VPNs, for example for Tor I have a symbolic limit of 25kB. Tor has ~2000 IPs so even by cycling through all of them, it comes to 50MB in total.
The issue with AI scrapers or any other scrapers is they come from residential IPs and when you block one IP, another different unique IP shows up immediately
Anonymous No.105589949 [Report] >>105589959
>>105584547 (OP)
Are you trying to stifle innovation you commie fuck?
Don't you know that they are fully in their rights to do whatever they want for the next 10 years?
Anonymous No.105589953 [Report] >>105590254
>>105585713
Put a honeypot url in there that no legitimate user would ever visit and ban any ip that accesses it.
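Minimal version of that as a Flask sketch (in-memory ban set and a made-up honeypot path; a real setup would push the IP into the firewall or an nginx deny list instead):

from flask import Flask, request, abort

app = Flask(__name__)
BANNED = set()  # in-memory for the sketch; persist it somewhere real

@app.before_request
def drop_banned():
    if request.remote_addr in BANNED:
        abort(403)

# link this somewhere no human will ever see (display:none anchor, etc.)
@app.route("/totally-real-admin-backup")
def honeypot():
    BANNED.add(request.remote_addr)
    abort(403)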
Anonymous No.105589959 [Report] >>105590039
>>105589949
Scraping is already illegal
Anonymous No.105590039 [Report]
>>105589959
LOL
Anonymous No.105590254 [Report]
>>105589953
We can't do that. That would be too easy.
Anonymous No.105590301 [Report]
>>105584547 (OP)
surely they should allow deepseek to scrape it then, it's been the biggest thing in open source software for years
Anonymous No.105590321 [Report]
>>105584547 (OP)
cry more
Anonymous No.105590810 [Report] >>105590887
>>105589705
>>105589711
>>105589714
>>105589729
imagine being so mad you shit yourself over 4 different posts lmao
the other anon is reasonable and clearly draws from his own experience, meanwhile you are shouting insults like a retard and not really even arguing
kill yourself
Anonymous No.105590817 [Report]
>>105585402
sounds like https://blog.cloudflare.com/ai-labyrinth/
Anonymous No.105590882 [Report] >>105590913 >>105591365 >>105594237
>>105585282
proof of work is really the only way, and it will only make such scraping computationally infeasible in large amounts.
to implement on the frontend:
- implement SHA512, Whirlpool, or whatever other cryptographic hashing algorithm (preferably not in JS, but in WebAssembly or using a browser's native cryptographic API, because JS is not fast enough and dedicated attackers will be using a native implementation).
to implement on the backend, for each request:
- use a PRNG such as ChaCha20 to generate a number between 0 and N, where you set N based on the level of difficulty you want (higher = slower requests).
- hash the generated number using whatever hashing algorithm you are using on the frontend.
- create a MAC of the hash using whatever MAC algorithm (I like Whirlpool HMAC). use a secret key that you change on something like an hourly basis.
- send N, hash, and the MAC to the client.

the client will have to find the number that produces the hash. when found, it will send the number, the hash, and the MAC back to the server. the server will validate the hash using the MAC, then validate the number produces the hash.
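roughly what the server side looks like in python, as a sketch only: the stdlib has no Whirlpool so the MAC here is HMAC-SHA512, secrets stands in for the ChaCha20 PRNG, and key rotation plus the frontend solver are left out. (and yes, as the replies point out, this exact construction can be precomputed and replayed.)

import hmac, hashlib, secrets, json

MAC_KEY = secrets.token_bytes(64)  # rotate on something like an hourly basis
N = 200_000                        # difficulty knob: higher = slower requests

def issue_challenge() -> str:
    n = secrets.randbelow(N)                              # the number the client must find
    digest = hashlib.sha512(str(n).encode()).hexdigest()
    mac = hmac.new(MAC_KEY, digest.encode(), hashlib.sha512).hexdigest()
    # client brute-forces 0..N until sha512(str(i)) == digest, then sends all three back
    return json.dumps({"n_max": N, "hash": digest, "mac": mac})

def verify(answer: int, digest: str, mac: str) -> bool:
    expected = hmac.new(MAC_KEY, digest.encode(), hashlib.sha512).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return False                                      # we never issued this hash
    return hashlib.sha512(str(answer).encode()).hexdigest() == digest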
Anonymous No.105590887 [Report]
>>105590810
You lost, kek
Anonymous No.105590913 [Report] >>105590921 >>105595478
>>105590882
this is broken btw, you should use two MACs instead
I will leave that implementation to the reader
Anonymous No.105590921 [Report]
>>105590913
or just one?
I don't care anymore
Anonymous No.105591365 [Report]
>>105590882
Or just pull the plug, your shit isn't important
Anonymous No.105593539 [Report]
>>105589667
Retard, this is about high traffic in general, not attacks. If 95% of requests at all times are AI scraping related, you're fucked; you might as well send your money to the AI companies instead of hosting your site.
Anonymous No.105593779 [Report]
>>105585188
As a scraper, cloudflare has not stopped me from scraping a website ever and if you think browser impersonators are obscure tech you’ll be in for a surprise if you ever decide to host some data online
Anonymous No.105594237 [Report] >>105595478
>>105590882
isn't this vulnerable to hash table lookups?
also, a smart scraper would code a solution for this and keep scraping anyway
Anonymous No.105594425 [Report]
I'm scrooooooping right now
Anonymous No.105594500 [Report]
>>105584547 (OP)
Any updates on that project where if it detects the connection is an LLM scraper, it serves the bot a never ending stream of pages to parse through filled with junk text and links to more nonsense?
Anonymous No.105594909 [Report] >>105594949
>>105589942
I saved so many bookmarks over a decade. At some point, I wanted to see them again. More than half of them are dead.
Now, when I see a good website or blog, I wget mirror all of it then archive it as zip/rar/zstd/whatever.
You can still find old style websites out there, mirroring them takes a few minutes and you have everything they wanted to share.
Modern websites are a big pile of shit with redundant links that download the same stuff over and over again.
Put a low bandwidth limit (50kB/s to 1MB/s) for each IP range after a certain transfer amount per day (let's say 100MB) and let people fully and easily download all the stuff you put out there.
It's YOUR fault if your server is not enforcing some rate limiting, which ends up denying others access to your website.
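As a sketch, that policy in WSGI middleware terms (numbers from above; a real deployment would do this in nginx/haproxy and group by prefix rather than single IPs):

import time
from collections import defaultdict

class DailyQuotaThrottle:
    """Full speed until an IP has pulled ~100MB today, then trickle at ~50kB/s."""
    def __init__(self, app, quota=100 * 1024 * 1024, slow_rate=50 * 1024):
        self.app, self.quota, self.slow_rate = app, quota, slow_rate
        self.usage = defaultdict(int)        # ip -> bytes served today
        self.day = time.strftime("%Y-%m-%d")

    def __call__(self, environ, start_response):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:                # naive daily reset
            self.usage.clear()
            self.day = today
        ip = environ.get("REMOTE_ADDR", "")
        for chunk in self.app(environ, start_response):
            self.usage[ip] += len(chunk)
            if self.usage[ip] > self.quota:
                time.sleep(len(chunk) / self.slow_rate)  # throttle once over quota
            yield chunk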
Anonymous No.105594949 [Report]
>>105594909
The normal IPs have much higher limit ~250MB
The limits I posted are for Tor and known VPNs
Anonymous No.105595023 [Report]
>>105585188
just use firefox marionette
Anonymous No.105595047 [Report] >>105595575
>>105585475
where is the exe
Anonymous No.105595076 [Report] >>105595868
Look at all our newfriends! Make sure to say Hi!

Coming to a site near you! Watch out for those bandwidth charges!
And please remember an LLM bot doesn't just crawl your site once! It comes back again and again and again and again and again! Multiple times a day! Rate limiting? They have thousands of IPs! robots.txt? LMAO. Site's sluggish? Slow? DDoS? No! That's just ClaudeBot, triple-checking he got all the stuff he crawled 30 minutes ago for the 6th time today.

Has OP's very thread already been scraped? Let's check and see (blocking sketch after the list)

Oh Hai sempai ^___^

AhrefsBot|
AI2Bot|
AliyunSecBot|
Amazonbot|
Applebot|
Awario|
axios|
Baiduspider|
barkrowler|
bingbot|
BitSightBot|
BLEXBot|
Buck|
Bytespider|
CCBot|
CensysInspect|
ChatGPT-User|
ClaudeBot|
coccocbot|
cohere-ai|
DataForSeoBot|
Diffbot|
DotBot|
ev-crawler|
Expanse|
FacebookBot|
facebookexternalhit|
FriendlyCrawler|
Googlebot|
GoogleOther|
GPTBot|
HeadlessChrome|
ICC-Crawler|
imagesift|
img2dataset|
InternetMeasurement|
ISSCyberRiskCrawler|
istellabot|
magpie-crawler|
Mediatoolkitbot|
Meltwater|
Meta-External|
MJ12bot|
moatbot|
ModatScanner|
MojeekBot|
OAI-SearchBot|
Odin|
omgili|
panscient|
PanguBot|
peer39_crawler|
Perplexity|
PetalBot|
Pinterestbot|
PiplBot|
Protopage|
scoop|
Scrapy|
Screaming|
SeekportBot|
Seekr|
SemrushBot|
SeznamBot|
Sidetrade|
Sogou|
SurdotlyBot|
Timpibot|
trendictionbot|
VelenPublicWebCrawler|
WhatsApp|
wpbot|
xfa1|
Yandex|
Yeti|
YouBot|
zgrab|
ZoominfoBot|
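FWIW, turning a list like that into an actual block is one regex check. Flask sketch with a handful of names pulled straight from the list above (extend with the rest, or do the same thing in your reverse proxy):

import re
from flask import Flask, request, abort

app = Flask(__name__)

BOT_UA = re.compile(
    r"GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot|PetalBot|Applebot|"
    r"SemrushBot|AhrefsBot|MJ12bot|facebookexternalhit|img2dataset",
    re.IGNORECASE,
)

@app.before_request
def block_bots():
    ua = request.headers.get("User-Agent", "")
    if BOT_UA.search(ua):
        abort(403)  # only stops the ones honest enough to identify themselves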
Anonymous No.105595478 [Report] >>105595632 >>105595732
>>105594237
yes, as I mentioned in >>105590913
I have slept now, and a better solution is:
- rather than generating a number between 0 and N, instead, first generate a random number MIN (between whatever bounds you want, although you will preferably want something like a 512 bit space).
- then generate a random number between MIN and (MIN + N). use this random number for hashing.
- the original algorithm I described has an issue: because the MAC key is changed out on a periodic basis, right before key-changing, you might issue a MAC to a client, change the key, and be unable to validate a valid proof of work. instead, use two MAC keys that you change on a periodic basis (with a day/night cycle and a sigil per MAC indicating the used key).
Anonymous No.105595575 [Report]
>>105595047
you want the archive because each browser has its own "exe" inside the archive and it ships with like 12 of them
ie
firefox 112
firefox 113
chrome 120
chrome 122
whatever
Anonymous No.105595632 [Report] >>105595732
>>105595478
this is fucked too btw, lmao I am retarded
Anonymous No.105595732 [Report] >>105596265
>>105595478
>>105595632
I mean, the original algo is not bad for a sufficiently large N, but only against an adversary that doesn't have, or doesn't want to spend, too many resources. a dedicated one could calculate a big-ass table of hashes every hour or so and distribute the hashes from a centralized server or something.
Anonymous No.105595868 [Report]
>>105595076
If I ever make a website it’s going to be invite only and private so congratulations on scraping some meaningless data I guess. Maybe I could even hide a bunch of invisible nonsense to pollute their data
Anonymous No.105596265 [Report]
>>105595732
nah, both algorithms are fucked:
- with the first, an attacker only needs to generate hashes for 0 to N once, then they can indefinitely produce valid proofs.
- with both of them, an attacker only needs to solve one proof of work per key-change cycle and can then just replay that proof
this could be easily solved by just storing state in a DB or whatever, but that is not ideal.
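the store-state version, sketched, with the random-offset idea from the earlier post folded in (in-memory dict standing in for the DB; challenges are single-use and expire, and the offset kills precomputed tables):

import hashlib, secrets, time

N = 200_000
CHALLENGES = {}  # challenge_id -> (digest, start, expiry); use a real DB/redis in practice

def issue_challenge():
    start = secrets.randbits(512)                   # random offset, kills precomputed tables
    n = start + secrets.randbelow(N)
    digest = hashlib.sha512(str(n).encode()).hexdigest()
    cid = secrets.token_hex(16)
    CHALLENGES[cid] = (digest, start, time.time() + 300)  # single use, five minute expiry
    return cid, start, N, digest                    # client searches start..start+N as before

def verify(cid: str, answer: int) -> bool:
    entry = CHALLENGES.pop(cid, None)               # pop = single use, no replay
    if entry is None:
        return False
    digest, start, expiry = entry
    if time.time() > expiry or not (start <= answer < start + N):
        return False
    return hashlib.sha512(str(answer).encode()).hexdigest() == digest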