• tarknassus@lemmy.world · 2 days ago

    I don’t see a problem here. Maybe Perplexity should consider the reasons WHY Cloudflare have a firewall…?

  • Amberskin@europe.pub · 3 days ago

    Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

    Isn’t that a literal computer crime?

  • Electricd@lemmybefree.net · edited · 3 days ago

    They do have a point though. It would be great to let per-prompt searches through, but not mass scraping.

    I believe a lot of websites don’t want either, though.

  • Dr. Moose@lemmy.world · 3 days ago

    It’s insane that anyone would side with Cloudflare here. To this day I can’t visit many websites like nexusmods, just because I run Firefox on Linux. The Cloudflare Turnstile just refreshes infinitely, and has for months now.

    Cloudflare is the biggest cancer on the web, fucking burn it.

    • CatDogL0ver@lemmy.world · 3 days ago

      It happened to me too, until I did a Google search. It was my VPN’s web protection, which was too “overprotective”.

      Check your security settings, antivirus, and VPN.

    • Dremor@lemmy.world · edited · 2 days ago

      Linux and Firefox here. No problem at all with Cloudflare, despite running about as many privacy-preserving add-ons as possible. I even spoof my user agent to the latest Firefox ESR on Linux.

      Something may be wrong with your setup.

      • COASTER1921@lemmy.ml · 3 days ago

        I suspect a lot of it comes down to your ISP. Like the original commenter, I also frequently can’t pass the Cloudflare Turnstile when on Wi-Fi, although refreshing the page a few times usually gets me through. Worst case, I switch to my phone’s hotspot, where I pass much more consistently. It’s super annoying, and combined with their recent DNS outage it has totally ruined any respect I had for Cloudflare.

        Interesting video on the subject: https://youtu.be/SasXJwyKkMI

      • Dr. Moose@lemmy.world · 3 days ago

        That’s not how it works. Cloudflare uses thousands of variables to estimate a trust score and block people, so just because it works for you doesn’t mean it works for everyone.

        • Dremor@lemmy.world · edited · 3 days ago

          Same goes the other way: it’s not because it doesn’t work for you that it should go away.

          That technology has its uses, and Cloudflare is probably aware that there are still some false positives, and probably working on them as we speak.

          The decision is the website owner’s to make, weighing the advantage of filtering out the majority of bots against the disadvantage of losing some legitimate traffic to false positives. If you get a Cloudflare challenge, chances are they decided the former vastly outweighs the latter.

          Now there are some self-hosted alternatives, like Anubis, but business clients prefer SaaS like Cloudflare to maintaining their own software. Once again, it is their choice and liberty to do so.

          • Dr. Moose@lemmy.world · 3 days ago

            lmao imagine shilling for corporate Cloudflare like this. Also, false positives vs false negatives are fundamentally not equal.

            Cloudflare is probably aware that there are still some false positive, and probably is working on it as we write.

            The main issue with Cloudflare is that it’s mostly bullshit. It does not report any stats to the admins on how many users were rejected, or any false-positive rates, and happily puts everyone under the “evil bot” umbrella. So people from low-trust-score environments, like Linux users or IPs from poorer countries, are at a significant disadvantage and left without a voice.

            I’m literally a security dev working with Cloudflare anti-bot myself (not by choice). It’s a useful tool for corporations but a really fucking bad one for the health of the web, much worse than any LLM agent or crawler, period.

    • dodos@lemmy.world · 3 days ago

      I’m on Linux with Firefox and have never had that issue (including on nexusmods, which I use regularly). Something else is probably wrong with your setup.

      • Dr. Moose@lemmy.world · 3 days ago

        “Wrong with my setup”: that’s not how the internet works.

        I’m based in Southeast Asia and often work on the road, so IP reputation is probably what sinks my fingerprint score.

        Either way, this is in no way acceptable.

  • poopkins@lemmy.world · edited · 3 days ago

    I’ve developed my own agent for assisting me with researching a topic I’m passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I’m a human using a web browser. (For my network requests, I’ve defined my own user agent.)

    So I use that as a signal that the website doesn’t want automated tools scraping their data. That’s fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
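
    In practice that detection can be a couple of response heuristics. A minimal sketch of the idea (the `cf-mitigated` header and the interstitial page title are the markers I key on; treat both as assumptions rather than a stable API, and the fetcher here is injected so nothing touches the network):

```python
def looks_like_bot_challenge(status: int, headers: dict, body: str) -> bool:
    """Guess whether a response is an anti-bot interstitial rather than content."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    if h.get("cf-mitigated") == "challenge":   # marker Cloudflare uses for managed challenges
        return True
    # Challenge pages typically arrive as a 403/503 with a telltale title.
    return status in (403, 503) and "just a moment" in body.lower()

def fetch_or_defer(url: str, fetch) -> dict:
    """Return page content, or hand the human a deep link to follow instead."""
    status, headers, body = fetch(url)
    if looks_like_bot_challenge(status, headers, body):
        return {"content": None, "deep_link": url}  # let the user visit manually
    return {"content": body, "deep_link": None}
```

    The point is that the agent backs off on the first sign of a challenge instead of trying to solve it.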

    I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity’s are awful for the web.

    (Edited for clarity)

    • pyre@lemmy.world · 4 days ago

      yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.

  • Glitchvid@lemmy.world · 4 days ago

    When a firm outright admits to bypassing, or trying to bypass, measures taken to keep them out, you’d think that would be a slam-dunk case of unauthorized access under the CFAA, with felony enhancements.

    • GamingChairModel@lemmy.world · 4 days ago

      Fuck that. I don’t need prosecutors and the courts to rule that accessing publicly available information in a way the website owner doesn’t want is literally a crime. That logic would extend to ad blockers and to editing HTML/JS via “inspect element”.

      • Encrypt-Keeper@lemmy.world · 4 days ago

        That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.

        • GamingChairModel@lemmy.world · 4 days ago

          gaining unauthorized access to a computer system

          And my point is that defining “unauthorized” to include visitors using unauthorized tools/methods to access a publicly visible resource would be a policy disaster.

          If I put a banner on my site that says “by visiting my site you agree not to modify the scripts or ads displayed on the site,” does that make my visit with an ad blocker “unauthorized” under the CFAA? I think the answer should obviously be “no,” and that the way to define “authorization” is whether the website puts up some kind of login/authentication mechanism to block or allow specific users, not to put a simple request to the visiting public to please respect the rules of the site.

          To me, a robots.txt is more like a friendly request to unauthenticated visitors than it is a technical implementation of some kind of authentication mechanism.
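
          That advisory nature shows up in how clients consume the file: nothing enforces it, a polite client simply consults it before fetching. A sketch using Python’s stdlib parser (the bot name and rules below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is asked to stay out; everyone else may fetch.
rules = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

assert rp.can_fetch("ExampleScraper", "https://example.com/page") is False
assert rp.can_fetch("SomeBrowser", "https://example.com/page") is True
```

          The flip side of "friendly request" is that nothing stops a client from simply never calling `can_fetch` at all.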

          Scraping isn’t hacking. I agree with the Third Circuit and the EFF: If the website owner makes a resource available to visitors without authentication, then accessing those resources isn’t a crime, even if the website owner didn’t intend for site visitors to use that specific method.

          • finitebanjo@lemmy.world · 2 days ago

            Site owners currently do, and should, have the freedom to decide who is and is not allowed to access their data, and to decide what purpose it gets used for. Idgaf if you think scraping is malicious or not; it is, and should be, illegal to violate clear and obvious barriers against it at the owners’ expense, for the unsanctioned profit scrapers make off the site owners’ work.

            • GamingChairModel@lemmy.world · 2 days ago

              to decide what purpose it gets used for

              Yeah, fuck everything about that. If I’m a site visitor I should be able to do what I want with the data you send me. If I bypass your ads, or use your words to write a newspaper article that you don’t like, tough shit. Publishing information is choosing not to control what happens to the information after it leaves your control.

              Don’t like it? Make me sign an NDA. And even then, violating an NDA isn’t a crime, much less a felony punishable by years of prison time.

              Interpreting the CFAA to cover scraping is absurd and draconian.

              • finitebanjo@lemmy.world · edited · 2 days ago

                If you want anybody and everyone to be able to use everything you post for any purpose, right on, good for you. But don’t try to force your morality on others who rely on their writing, programming, and artwork to make a living.

                • GamingChairModel@lemmy.world · 2 days ago

                  I’m gonna continue to use ad blockers and yt-dlp, and if you think I’m a criminal for doing so, I’m gonna say you don’t understand either technology or criminal law.

          • Glitchvid@lemmy.world · edited · 4 days ago

            When sites put up challenges like Anubis or other measures to verify that the viewer isn’t a robot, and scrapers then employ measures to thwart that verification (via spoofing or other means), I think that is reasonably a violation of the CFAA in spirit, especially since these mass-scraping activities are getting attention for the damage they cause to site operators (another factor in the CFAA, and one that would promote this to felony activity).
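
            For context, Anubis-style challenges are essentially hashcash-style proof-of-work: the visitor’s browser burns a little CPU finding a nonce before the page is served, which is cheap once per human but expensive at scraper scale. A toy sketch of the idea (parameters and token are illustrative, not Anubis’s actual protocol):

```python
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty` leading zero hex digits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash to check what took the client many hashes to find."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("session-token-123", difficulty=2)  # ~256 attempts on average
assert verify("session-token-123", nonce, difficulty=2)
```

            Raising the difficulty scales the per-request cost for crawlers without needing to fingerprint anyone.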

            The fact is these laws are already on the books, we may as well utilize them to shut down this objectively harmful activity AI scrapers are doing.

            • ubergeek@lemmy.today · 4 days ago

              The fact is these laws are already on the books, we may as well utilize them to shut down this objectively harmful activity AI scrapers are doing.

              Silly plebe! Those laws are there to target the working class, not to be used against corporations. See: Copyright.

            • Aatube@lemmy.dbzer0.com · 4 days ago

              That same logic is how Aaron Swartz was cornered into suicide for scraping JSTOR. That interpretation is widely agreed to be a bad idea by a wide range of legal experts, including SCOTUS, whose 2021 decision Van Buren v. United States struck it off the books.

            • tomalley8342@lemmy.world · 4 days ago

              Nah, that would also mean using Newpipe, YoutubeDL, Revanced, and Tachiyomi would be a crime, and it would only take the re-introduction of WEI to extend that criminalization to the rest of the web ecosystem. It would be extremely shortsighted and foolish of me to cheer on the criminalization of user spoofing and browser automation because of this.

              • Glitchvid@lemmy.world · edited · 4 days ago

                Do you think DoS/DDoS activities should be criminal?

                If you’re a site operator and the mass AI scraping is genuinely causing operational problems (not hard to imagine, I’ve seen what it does to my hosted repositories pages) should there be recourse? Especially if you’re actively trying to prevent that activity (revoking consent in cookies, authorization captchas).

                In general I think the idea of “your right to swing your fists ends at my face” applies reasonably well here — these AI scraping companies are giving lots of admins bloody noses and need to be held accountable.

                I really am amenable to arguments wrt the right to an open web, but look at how many sites are hiding behind CF and other portals, or outright becoming hostile to any scraping at all; we’re already seeing the rapid death of the ideal because of these malicious scrapers, and we should be using all available recourse to stop this bleeding.

                • tomalley8342@lemmy.world · 4 days ago

                  DoS attacks are already a crime, so of course the need for some kind of solution is clear. But any proposal that gatekeeps the internet and restricts the freedoms with which the user can interact with it is no solution at all. To me, the openness of the web shouldn’t be something that people merely consider, or are amenable to. It should be the foundation that every reasonable proposal treats as a first principle.

          • Encrypt-Keeper@lemmy.world · 4 days ago

            If I put a banner on my site that says “by visiting my site you agree not to modify the scripts or ads displayed on the site,” does that make my visit with an ad blocker “unauthorized” under the CFAA?

            How would you “authorize” a user to access assets served by your systems based on what they do with them after they’ve accessed them? That doesn’t logically follow, so no, that would not make an ad blocker unauthorized under the CFAA. Especially because you’re not actually taking any steps to deny those people access either.

            AI scrapers, on the other hand, are a type of user you’re not authorizing to begin with, and if you’re using Cloudflare’s bot protection you’re putting in place a system to deny them access. To purposely circumvent that system would be considered unauthorized access.

            • GamingChairModel@lemmy.world · 4 days ago

              That doesn’t logically follow so no, that would not make an ad blocker unauthorized under the CFAA.

              The CFAA also criminalizes “exceeding authorized access” in every place it criminalizes accessing without authorization. My position is that mere permission (in a colloquial sense, not necessarily technical IT permissions) isn’t enough to define authorization. Social expectations and even contractual restrictions shouldn’t be enough to define “authorization” in this criminal statute.

              To purposefully circumvent that access would be considered unauthorized.

              Even as a normal non-bot user who sees the cloudflare landing page because they’re on a VPN or happen to share an IP address with someone who was abusing the network? No, circumventing those gatekeeping functions is no different than circumventing a paywall on a newspaper website by deleting cookies or something. Or using a VPN or relay to get around rate limiting.

              The idea of criminalizing scrapers or scripts would be a policy disaster.

    • tempest@lemmy.ca · 3 days ago

      CloudFlare has become an Internet protection racket and I’m not happy about it.

      • Laser@feddit.org · 3 days ago

        It’s been this way from the very beginning. But they don’t fit the definition of a protection racket, as they’re not the ones attacking you if you don’t pay up. So they’re more like a security company that has no competitors because of the investment needed to operate at that scale.

        • A1kmm@lemmy.amxl.com · 2 days ago

          Cloudflare are notorious for shielding cybercrime sites. You can’t even complain to Cloudflare about abuse; they’ll just forward your complaint to the (likely dodgy) host of the cybercrime site. They don’t even have a channel for complaints about network abuse of their DNS services.

          So they certainly are an enabler of the cybercriminals they purport to protect people from.

          • MithranArkanere@lemmy.world · 2 days ago

            Any internet service provider needs to be completely neutral, not only in their actions but also in their liability.
            Same goes for other services, like payment processors.
            If companies that provide content-agnostic services are allowed to police the content, that opens the door to really nasty stuff.

            You can’t chop off everyone’s arms to stop a few people from stealing.

            If they think their services are being used in a reprehensible manner, what they need to do is alert the authorities, not act like vigilantes.

  • kreskin@lemmy.world · edited · 3 days ago

    They can’t get their AI to check a box that says “I am not a robot”? I’d think that’d be a first-year comp-sci-student-level task. And robots.txt files were basically always voluntary compliance anyway.

      • kreskin@lemmy.world · 2 days ago

        You’re not wrong, but it also lets more than 99.8% of the bot traffic through on text challenges. It’s like the TSA of website security. It’s mostly there to keep the user busy while Cloudflare places itself as a man-in-the-middle of your encrypted connection to a third party. The only difference between Cloudflare and a malicious attacker is Cloudflare’s stated intention not to be evil. With that and 3 dollars I can buy myself a single hard-shell taco from Taco Bell.

    • Dr. Moose@lemmy.world · 3 days ago

      Cloudflare actually fully fingerprints your browser and even sells that data: your IP, TLS handshake, operating system, full browser environment, installed extensions, GPU capabilities, etc. It’s all tracked before the box even shows up; in fact, the box is there to give the runtime more time to fingerprint you.

      • tempest@lemmy.ca · 3 days ago

        Yeah and the worst part is it doesn’t fucking work for the one thing it’s supposed to do.

        The only thing it does is stop the stupidest low effort scrapers and forces the good ones to use a browser.

    • sunbeam60@lemmy.ml · 3 days ago

      They’re not. They’re using this as an excuse to become paid gatekeepers of the internet as we know it. All that’s happening is that Cloudflare is using this to maneuver into a position where they can say “nice traffic you’ve got there; it would be a shame if something happened to it”.

      AI companies are crap.

      What Cloudflare is doing here is also crap.

      And we’re cheering it on.

  • Kissaki@feddit.org · edited · 4 days ago

    Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.

    So, I assume Perplexity uses appropriately identifiable User-Agent headers, to allow hosts to decide whether to serve them one way or another?
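
    Identifying yourself costs exactly one header. A sketch of what an honest client would send (the bot name and info URL here are invented):

```python
from urllib.request import Request

# A self-describing User-Agent: name, version, and a URL where site operators
# can read what the bot does and how to block it. All values are hypothetical.
UA = "ExampleAssistant/1.0 (+https://example.com/bot-info; user-initiated fetch)"

req = Request("https://example.com/article", headers={"User-Agent": UA})
assert req.get_header("User-agent") == UA  # urllib normalizes the header name
```

    A host that sees such a header can then choose to serve, throttle, or refuse by name, which is the whole point.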

    • Dr. Moose@lemmy.world · 3 days ago

      It’s not up to the host to decide whom to serve content to. The web is intended to be user-agent agnostic.

    • ubergeek@lemmy.today · 3 days ago

      And I’m assuming that if the robots.txt states their user agent isn’t allowed to crawl, it obeys it, right? :P

      • Kissaki@feddit.org · 3 days ago

        No. As per the article, their argument is that they are not web crawlers generating an index; they are user-action-triggered agents working live for the user.

  • Ekybio@lemmy.world · 4 days ago

    Can someone with more knowledge shed a bit more light on this whole situation? I’m out of the loop on the technical details.

    • BetaDoggo_@lemmy.world · edited · 4 days ago

      Perplexity (an “AI search engine” company with $500 million in funding) can’t bypass Cloudflare’s anti-bot checks. For each search, Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks Perplexity’s scrapers because they ignore robots.txt and mimic real users to get around Cloudflare’s blocking features. Perplexity argues that its scraping is acceptable because it’s user-initiated.

      Personally I think Cloudflare is in the right here. The scraped sites get zero revenue from Perplexity searches (unless the user decides to go through the sources section and click the links), and Perplexity’s scraping is unnecessarily traffic-intensive since they don’t cache the scraped data.
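
      The caching point is worth dwelling on: even a naive memoizing wrapper would keep repeat queries from re-hitting the origin. A sketch of the idea (a real system would add TTLs and honor cache headers; the fetcher is injected so the example stays offline):

```python
class CachingFetcher:
    """Wrap a fetch function so each URL hits the origin at most once."""

    def __init__(self, fetch):
        self.fetch = fetch        # the real fetcher, e.g. an HTTP GET
        self.cache = {}           # url -> body
        self.origin_hits = 0      # how many requests actually reached the site

    def get(self, url: str) -> str:
        if url not in self.cache:
            self.origin_hits += 1
            self.cache[url] = self.fetch(url)
        return self.cache[url]

# Two identical user searches, one origin request.
pages = {"https://example.com/a": "article body"}
fetcher = CachingFetcher(lambda url: pages[url])
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/a")
assert fetcher.origin_hits == 1
```

      Skipping even this much is what makes repeated per-prompt scraping so traffic-heavy for the origin site.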

      • lividweasel@lemmy.world · 4 days ago

        …and Perplexity’s scraping is unnecessarily traffic intensive since they don’t cache the scraped data.

        That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

        • rdri@lemmy.world · 3 days ago

          First we complain that AI steals and trains on our data. Then we complain when it doesn’t train. Cool.

          • ubergeek@lemmy.today · 3 days ago

            I think it boils down to “consent” and “remuneration”.

            I run a website that I do not consent to being accessed by LLMs. However, should LLMs use my content, I should be compensated for such use.

            So, these LLM startups ignore both consent, and the idea of remuneration.

            Most of these concepts have already been figured out for the purposes of law, if we consider websites akin to real estate: then the typical trespass laws and compensatory usage apply, and hell, even eminent domain if needed (i.e., a city government can “take over” the boosted-post feature to make sure alerts get pushed as widely and quickly as possible).

            • rdri@lemmy.world · 3 days ago

              That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon. Being accessed for LLM? What does it mean for you and how is it different from being accessed by a user? Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything who accesses that data?

              Is it okay for a person to access your site? Is it okay for a script written by that person to fetch data every day automatically? Would it be okay for a user to dump a page of your site with a headless browser? Would it be okay to let an LLM take a look at it to extract info required by a user? Have you heard about changedetection.io project? If some of these sound unfair to you, you might want to put a DRM on your data or something.

              Would you expect a compensation from me after reading your comment?

              • ubergeek@lemmy.today · 2 days ago

                That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon.

                It already has been captured, properly in law, in most places. We can use the US as an example: Both intellectual property and real property have laws already that cover these very items.

                What does it mean for you and how is it different from being accessed by a user?

                Well, does a user burn up gigawatts of power to access my site every time? That’s a huge difference.

                Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything who accesses that data?

                Depends on the terms of service I set for that service.

                Is it okay for a person to access your site?

                Sure!

                Is it okay for a script written by that person to fetch data every day automatically?

                Sure! As long as it doesn’t cause problems for me, the creator and hoster of said content.

                Would it be okay for a user to dump a page of your site with a headless browser?

                See above. Both power usage and causing problems for me.

                Would it be okay to let an LLM take a look at it to extract info required by a user?

                No. I said, I do not want my content and services to be used by and for LLMs.

                Have you heard about changedetection.io project?

                I have now. And should a user want to use that service, the service, which charges $8.99/month, needs to pay me a portion of that or risk being blocked.

                There’s no need to use it, as I already provide RSS feeds for my content. Use the RSS feed if you want updates.
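
                A polite watcher needs only the stdlib to consume such a feed instead of re-scraping pages. A sketch (the sample feed content below is invented, standing in for a real feed URL’s body):

```python
import xml.etree.ElementTree as ET

# A minimal invented RSS 2.0 document, a stand-in for a real feed's body.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>Post one</title><link>https://example.com/1</link></item>
  <item><title>Post two</title><link>https://example.com/2</link></item>
</channel></rss>"""

def feed_items(feed_xml: str):
    """Return (title, link) pairs for each item in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]

assert feed_items(SAMPLE_FEED) == [
    ("Post one", "https://example.com/1"),
    ("Post two", "https://example.com/2"),
]
```

                One small fetch tells the watcher whether anything changed, with no headless browser and no page dumps.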

                If some of these sound unfair to you, you might want to put a DRM on your data or something.

                Or, I can just block them via a service like Cloudflare. Which I do.

                Would you expect a compensation from me after reading your comment?

                None. Unless you’re wanting to access it via an LLM. Then I want compensation for the profit-driven access to my content.

                • rdri@lemmy.world · 2 days ago

                  Both intellectual property and real property have laws already that cover these very items.

                  And it causes a lot of trouble for many people, and pains me specifically. Information should not be gated or owned in a way that makes it illegal for anyone to access it under proper conditions: license expiration causing digital works to die out, DRM causing software to break, idiotic license owners not providing appropriate service, etc.

                  Well, does a user burn up gigawatts of power, to access my site every time?

                  Doing a GET request doesn’t do that.

                  As long as it doesn’t cause problems for me, the creator and hoster of said content.

                  What kind of problems that would be?

                  Both power usage and causing problems for me.

                  ?? How? And what?

                  do not want my content and services to be used by and for LLMs.

                  You have to agree that at one point “be used by LLM” would not be different from “be used by a user”.

                  which charges 8.99/month

                  It’s self-hosted and free.

                  Use the RSS feed, if you want updates.

                  How does that prohibit usage and processing of your info? That sounds like “I won’t be providing any comments on Lemmy website, if you want my opinion you can mail me at a@b.com

                  I can just block them, via a service like Cloud Flare. Which I do.

                  That will never block all of them. Your info will be used without your consent, and you will not feel troubled by it. So you might not feel troubled if more things do the same.

                  None. Unless you’re wanting to access if via an LLM. Then I want compensation for the profit driven access to my content.

                  What if I use my locally hosted LLM? Anyway, the point is, selling text can’t work well: you’re going to spend more resources collecting and summarizing data about how your text was used and how others benefited from it, in order to get compensation, than it’s worth.

                  Also, it might be the case that some information is actually worthless when compared to a service provided by things like LLM, even though they use that worthless information in the process.

                  I’m all for killing off LLMs, btw. Concerns of site makers who think they are being damaged by things like Perplexity are nothing compared to what LLMs do to the world. Maybe laws should instead make it illegal to waste energy. Before energy becomes the main currency.