
Thread 105739969

22 posts 2 images /g/
Anonymous No.105739969 >>105739976 >>105740020 >>105740042 >>105743352 >>105743706
Why do stackoverflow and github now block AI scraping attempts when their entire body of text has already been used as training material by all the major LLMs and no new stackoverflow questions have been allowed on the site since 2013?
Anonymous No.105739976
>>105739969 (OP)
To virtue signal of course
Anonymous No.105740020
>>105739969 (OP)
To prevent new and upcoming independent LLMs from scraping them too so only the big dogs can benefit.
Anonymous No.105740042 >>105740201
>>105739969 (OP)
Because llm scraping is basically a ddos
Anonymous No.105740117
All the big boys already got all the data they need you are not allowed to compete, be faster next time
Anonymous No.105740165
why would you scrape a site that has its entire database dump uploaded to archive.org in the first place?
Anonymous No.105740201 >>105741158
>>105740042
SO could release an archive torrent of all the old posts, but that's not the real reason they block scraping
Anonymous No.105741158 >>105741504
>>105740201
It should be mandated by law to make archives of all public posts easily accessible to everyone.
Anonymous No.105741504 >>105741517
>>105741158
I don't like mandating someone has to do more work. This should only be required if you implement anti-scraping measures
Anonymous No.105741517 >>105741739
>>105741504
Don't like it? Don't host a website where 100% of your data is user generated.
Anonymous No.105741739 >>105741846
>>105741517
Or you could stop being lazy. I'm basically a no coder and whipped up a web scraper in a couple of days. Luckily the sites I was going after work without JS
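A no-JS scraper really is a couple-of-days job. A minimal sketch using only the Python standard library's `html.parser` (the page snippet and tag choices here are illustrative, not from any site in the thread):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Works on static server-rendered HTML -- no JS engine needed:
page = '<html><body><a href="/q/1">first</a> <a href="/q/2">second</a></body></html>'
print(extract_links(page))  # prints ['/q/1', '/q/2']
```

For a real crawl you would fetch `page` with `urllib.request` and feed each response through the same parser; the parsing side stays this small as long as the site renders without JS.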
Anonymous No.105741846 >>105741923
>>105741739
So don't cry about web scraping then, subhuman.
Anonymous No.105741923 >>105741960
>>105741846
Some websites make it nearly impossible to scrape. In those cases I support an easily accessible archive. Big sites don't want people accessing that data at all, cause in their eyes data = money
Anonymous No.105741960 >>105742008
>>105741923
No such thing as an unscrapeable website.
Anonymous No.105742008 >>105742037
>>105741960
You have horrible reading comprehension.
You are either:
>1. a retard
>2. esl
>3. a paid disruptor/bot
Anonymous No.105742037
>>105742008
I accept your concession.
Anonymous No.105743352
>>105739969 (OP)
Because it's a pay to play market, fuck-o
Anonymous No.105743706 >>105744924
>>105739969 (OP)
You can easily bypass that using kiwix.
Online kiwix library at https://browse.library.kiwix.org/viewer#stackoverflow.com_en_all_2023-11/questions
...Or install it locally, but be aware the entire stackoverflow archive weighs 80GB
Anonymous No.105743750 >>105743765
How do they even stop scrapers? Why wouldn't you just write a bot with human characteristics, i.e. sleep times?
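The sleep-time idea is just randomized request pacing. A sketch of what that looks like; `fetch_page` is a hypothetical stand-in for whatever HTTP call the bot actually makes:

```python
import random
import time

def human_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep for base plus a random amount, so request timing has no
    fixed interval a rate-pattern detector could latch onto.
    Returns the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

def crawl(urls, fetch_page, base: float = 2.0, jitter: float = 1.5):
    """Fetch each URL with a human-ish pause before every request."""
    for url in urls:
        human_delay(base, jitter)
        fetch_page(url)
```

Pacing alone defeats naive rate limits, but not fingerprinting (TLS/HTTP header signatures, IP reputation, JS challenges), which is how the bigger sites actually catch bots.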
Anonymous No.105743765
>>105743750
You just add a small proof-of-work cryptominer on each page load: small enough that people will tolerate it, but significant enough that you are essentially paying to scrape
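What the post describes is essentially hashcash-style proof of work: the server hands the client a challenge, and the client must find a nonce whose hash meets a difficulty target before its request is served. A minimal sketch (challenge string and difficulty values are illustrative):

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) starts with
    `difficulty` zero hex digits. Expected work grows ~16x per digit."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Verification is a single hash while solving took thousands.
    That asymmetry is what makes one page load tolerable for a human
    but bulk scraping expensive."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("page-token-123", 4)  # ~65k hashes on average, a moment of CPU
assert verify("page-token-123", nonce, 4)
```

Tuning `difficulty` is exactly the trade-off the post names: low enough that a visitor's browser solves it imperceptibly, high enough that a scraper hitting millions of pages pays real compute for the corpus.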
Anonymous No.105744924
>>105743706
>the entire stackoverflow archive weighs 80GB
I wonder how much it is once you delete all comments (worthless), HTML (non-information) and all accepted answers (always wrong)
Anonymous No.105747146
bump