
Thread 105739969

22 posts 2 images /g/
Anonymous No.105739969 >>105739976 >>105740020 >>105740042 >>105743352 >>105743706
Why do stackoverflow and github now block AI scraping attempts when their entire body of text has already been used as training material by all the major LLMs and no new stackoverflow questions have been allowed on the site since 2013?
Anonymous No.105739976
>>105739969 (OP)
To virtue signal of course
Anonymous No.105740020
>>105739969 (OP)
To prevent new and upcoming independent LLMs from scraping them too so only the big dogs can benefit.
Anonymous No.105740042 >>105740201
>>105739969 (OP)
Because llm scraping is basically a ddos
Anonymous No.105740117
All the big boys already got all the data they need you are not allowed to compete, be faster next time
Anonymous No.105740165
why would you scrape a site that has its entire database dump uploaded to archive.org in the first place?
Anonymous No.105740201 >>105741158
>>105740042
SO could release an archive torrent of all the old posts, but that's not the real reason they block scraping
Anonymous No.105741158 >>105741504
>>105740201
It should be mandated by law to make archives of all public posts easily accessible to everyone.
Anonymous No.105741504 >>105741517
>>105741158
I don't like mandating someone has to do more work. This should only be required if you implement anti-scraping measures
Anonymous No.105741517 >>105741739
>>105741504
Don't like it? Don't host a website where 100% of your data is user generated.
Anonymous No.105741739 >>105741846
>>105741517
Or you could stop being lazy. I'm basically a no coder and whipped up a web scraper in a couple of days. Luckily the sites I was going after work without JS
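A no-JS scraper really is a couple-of-days job. A minimal sketch using only the Python standard library's `html.parser` (the page snippet and tag choices here are illustrative, not from any site in the thread):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Works on static server-rendered HTML -- no JS engine needed:
page = '<html><body><a href="/q/1">first</a> <a href="/q/2">second</a></body></html>'
print(extract_links(page))  # prints ['/q/1', '/q/2']
```

For a real crawl you would fetch `page` with `urllib.request` and feed each response through the same parser; the parsing side stays this small as long as the site renders without JS.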
Anonymous No.105741846 >>105741923
>>105741739
So don't cry about web scraping then, subhuman.
Anonymous No.105741923 >>105741960
>>105741846
Some websites make it nearly impossible to scrape. In those cases I support an easily accessible archive. Big sites don't want people accessing that data at all, cause in their eyes data = money
Anonymous No.105741960 >>105742008
>>105741923
No such thing as an unscrapeable website.
Anonymous No.105742008 >>105742037
>>105741960
You have horrible reading comprehension.
You are either:
>1. a retard
>2. esl
>3. a paid disruptor/bot
Anonymous No.105742037
>>105742008
I accept your concession.
Anonymous No.105743352
>>105739969 (OP)
Because it's a pay to play market, fuck-o
Anonymous No.105743706 >>105744924
>>105739969 (OP)
You can easily bypass that using kiwix.
Online kiwix library at https://browse.library.kiwix.org/viewer#stackoverflow.com_en_all_2023-11/questions
...Or install it locally, but be aware the entire stackoverflow archive weighs 80GB
Anonymous No.105743750 >>105743765
How do they even stop scrapers? Why wouldn't you just write a bot with human characteristics, i.e. sleep times?
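The sleep-time idea is just randomized request pacing. A sketch of what that looks like; `fetch_page` is a hypothetical stand-in for whatever HTTP call the bot actually makes:

```python
import random
import time

def human_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep for base plus a random amount, so request timing has no
    fixed interval a rate-pattern detector could latch onto.
    Returns the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

def crawl(urls, fetch_page, base: float = 2.0, jitter: float = 1.5):
    """Fetch each URL with a human-ish pause before every request."""
    for url in urls:
        human_delay(base, jitter)
        fetch_page(url)
```

Pacing alone defeats naive rate limits, but not fingerprinting (TLS/HTTP header signatures, IP reputation, JS challenges), which is how the bigger sites actually catch bots.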
Anonymous No.105743765
>>105743750
You just add a small proof-of-work cryptominer on each page load: small enough that people will tolerate it, but significant enough that you are essentially paying to scrape
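What the post describes is essentially hashcash-style proof of work: the server hands the client a challenge, and the client must find a nonce whose hash meets a difficulty target before its request is served. A minimal sketch (challenge string and difficulty values are illustrative):

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) starts with
    `difficulty` zero hex digits. Expected work grows ~16x per digit."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Verification is a single hash while solving took thousands.
    That asymmetry is what makes one page load tolerable for a human
    but bulk scraping expensive."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("page-token-123", 4)  # ~65k hashes on average, a moment of CPU
assert verify("page-token-123", nonce, 4)
```

Tuning `difficulty` is exactly the trade-off the post names: low enough that a visitor's browser solves it imperceptibly, high enough that a scraper hitting millions of pages pays real compute for the corpus.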
Anonymous No.105744924
>>105743706
>the entire stackoverflow archive weighs 80GB
I wonder how much it is once you delete all comments (worthless), HTML (non-information) and all accepted answers (always wrong)
Anonymous No.105747146
bump