AI bots

         

Scooter24

5:25 pm on Jan 31, 2025 (gmt 0)

10+ Year Member Top Contributors Of The Month



I have a travel photography site with more than 40,000 images and countless pages about travelling.

There is a honeypot mechanism which I implemented years ago to prevent people from downloading the entire site. If people try to do so, their IP address gets blocked in .htaccess.

Until last September-October there was a limited number of such events per month. But since last November this has ballooned to very high levels.

My guess is that these are bots of LLMs which scan the web for content with which to train their models.
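
For readers who want to try something similar, here is a minimal sketch of a honeypot of this kind. The thread does not describe the actual implementation, so everything here is an assumption: the trap URL `/trap/` is hypothetical, the function names are made up for illustration, and the rule format assumes Apache 2.4 with the generated lines placed inside a `<RequireAll>` section of .htaccess.

```python
import ipaddress
from typing import Optional, Set

# Hypothetical trap URL: disallowed in robots.txt and never linked visibly,
# so only a crawler that ignores the rules should ever request it.
TRAP_PATH = "/trap/"


def block_rule(ip: str) -> str:
    """Format one Apache 2.4 deny rule; raises ValueError on an invalid address."""
    return f"Require not ip {ipaddress.ip_address(ip)}"


def handle_request(path: str, ip: str, blocked: Set[str]) -> Optional[str]:
    """Return a new deny rule when an unblocked client touches the trap, else None."""
    if not path.startswith(TRAP_PATH) or ip in blocked:
        return None
    blocked.add(ip)
    return block_rule(ip)
```

The returned line would then be appended to .htaccess by whatever script serves the trap URL. Validating the address first keeps a forged or malformed value out of the server configuration.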

not2easy

6:21 pm on Jan 31, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I moved your thread to the Crawler, Spider, and User Agent ID forum because you are more likely to get useful help in this forum.

These may well be AI bots. There are a few threads on their user agents, with info about which ones can reliably be blocked in robots.txt rather than via UA blocks.

This AI UA thread was started in this forum, then expanded in the AI forum: [webmasterworld.com...]

Of course it helps to download and look through your server's logs to identify the culprits. I'd be surprised if the AI bots had waited this long to start scraping your content. Today, though, you may also see people who have used AI to write their own scraping scripts, and those can appear to be human visitors. Their behavior can help you determine whether they are human or not: most bots only scrape the text and don't request all the files that make up a page.
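
The "bots don't request all the files for a page" heuristic can be checked against an access log. Below is a rough sketch, assuming the common/combined Apache log format. The asset extension list and the `min_pages` threshold are illustrative assumptions, not values from this thread.

```python
import re
from collections import defaultdict

# Matches the start of a common/combined log line: client IP and request path.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+) [^"]*"')

# Assumed list of "supporting file" extensions a real browser would also fetch.
ASSET_EXT = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".webp", ".ico", ".svg")


def text_only_clients(lines, min_pages=5):
    """Return sorted IPs that requested at least min_pages pages and no assets."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        ip, url = m.groups()
        path = url.split("?", 1)[0].lower()
        if path.endswith(ASSET_EXT):
            assets[ip] += 1
        else:
            pages[ip] += 1
    return sorted(ip for ip, n in pages.items() if n >= min_pages and assets[ip] == 0)
```

Treat the result as a list of candidates to inspect, not a verdict: a returning human visitor with a warm browser cache may also fetch few or no assets.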

Scooter24

7:11 pm on Jan 31, 2025 (gmt 0)

10+ Year Member Top Contributors Of The Month



Well, the user agents are totally neutral (Mozilla/5.0 (Windows NT 6.2) AppleWebKit...), but only bots trigger the honeypot mechanism.

BTW, it got so bad that I'm temporarily blocking entire IP ranges.

lucy24

10:07 pm on Jan 31, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW, it got so bad that I'm temporarily blocking entire IP ranges.

Nothing wrong with that, unless your scrapers are coming from compromised machines smack dab in the middle of some major ISP that sends you legitimate traffic as well.

SumGuy

3:50 am on Feb 1, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



My general IP blocking list has reached the maximum capacity of my router - 64k entries (65535). Those are CIDR's, some of them /24, one or two are /9 or even /8. I'm talking IPv4 - my server is not accessible on IPv6 (and I plan to keep it that way). I've been building this list for the past 6 or 7 years. It probably represents close to half of all IPv4 IP space. I'm glad I had so much of this list in place when the AI bots came on the scene.

There's no such thing as a temporary entry in my blocking list. I don't remove anything, unless I've proved to myself that I likely am, or have been, blocking legit human access.

For the last maybe 2 months of 2024, the list had stabilized and I wasn't adding very much to it. Then this past month, things changed. One big thing I'm seeing is a major influx of African IP's. When I see misbehavior from an IP, I generally block THE ENTIRE AUTONOMOUS SYSTEM. That could be a few hundred IP's, a few thousand, or a few million. And I'll go down a rabbit hole checking out the peers, and block them too. You would not believe how many IPv4's are assigned to complete garbage outfits. The exception (to my blocking strategy) is residential ISP's from G7, maybe G20 countries. But I am seeing increasingly problematic behavior from that source, which I attribute to increased usage of VPN's which promise users anonymity and "protection" but end up using their internet connections for proxy use by black hats.
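
As an aside on managing a CIDR list of this size: when the device has a hard cap on entries, merging adjacent and overlapping ranges can free up slots. The sketch below uses Python's standard `ipaddress` module. The function names are made up for illustration, and it assumes an IPv4-only list, as described above.

```python
import ipaddress


def build_blocklist(cidrs):
    """Parse CIDR strings and merge adjacent or overlapping networks."""
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    return list(ipaddress.collapse_addresses(nets))


def is_blocked(ip, blocklist):
    """True if the address falls inside any network on the list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocklist)


def coverage(blocklist):
    """Fraction of the whole IPv4 address space covered by the list."""
    return sum(net.num_addresses for net in blocklist) / 2**32
```

For example, two neighbouring /25s collapse into a single /24 entry, and a /32 already covered by a wider range disappears. `coverage()` gives a way to check an estimate like "close to half of all IPv4 space" against the actual list.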

I'm seeing new patterns every week. I've never seen this happen before. Something is changing, something is going on. I look at everything hitting me, not just on ports 80 and 443, or 25. I see a lot of weird #*$! now.

lucy24

6:44 am on Feb 1, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some years ago, I switched over to header-based blocking. Over the years I have had to put back some IPs as well, but it's a comparatively short list. To this day, plenty of robots engage in no-brainers like not sending a User-Agent.
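
A minimal sketch of what a header-based check can look like, for illustration only. The only rule taken from this post is "no User-Agent"; the `required` parameter and the idea of also demanding other headers are assumptions, not a description of the actual rule set.

```python
def should_block(headers, required=("user-agent",)):
    """Return True if any required request header is missing or empty.

    headers:  mapping of header name to value, as received from the client.
    required: lower-case header names that must be present (assumed default).
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return any(not normalized.get(name) for name in required)
```

Usage: `should_block({"User-Agent": "Mozilla/5.0"})` is False, while an empty header set is blocked. Passing `required=("user-agent", "accept")` would be one hypothetical way to tighten the check.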

blend27

7:12 pm on Feb 14, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@SumGuy
...which I attribute to increase usage of VPN's which promise users with anonymity and "protection" but end up using their internet connections for proxy use by black hats.....

Is there a source of that out there?

SumGuy

3:25 pm on Feb 15, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



@blend27

"A residential IP address is compromised – hijacked – during a cyber-attack or is harvested when users sign up for a free VPN or DNS proxy service without reading the terms and conditions. Failure to read the T&Cs allows the free VPN provider to sublease and sell the IP address to an unknown person or entity.

Through its industry monitoring, GeoComply has identified 17 companies selling residential proxy IPs. In total, we estimate more than 200 million users of free VPN services have unknowingly had their home IP addresses compromised."

[geocomply.com...]

Unless proven otherwise, I'm going to assume that this includes heavily-advertised paid VPN's, like Nord VPN and others. I see far too much rogue activity from residential IP space to attribute this to obscure / no-cost VPN's.

Somewhat less related:

[blog.koddos.net...]

There are proxy IP providers who boast the fact that their networks include millions of residential IP's, without explaining how exactly they have such access. I've posted some examples of who they are and their claims on this board over the past few years as I discover them.

Have a look at this:

[brightdata.com...]

Note that there is now a category - "ethically sourced residential IP's". You can believe that or not, but it implies that there are also "un-ethically" sourced residential IP's.

lucy24

6:02 pm on Feb 15, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Failure to read the T&Cs allows the free VPN provider to sublease and sell the IP address to an unknown person or entity.

I tend to doubt that this is literally true, unless the VPN is doing some really fancy retinal scanning.

SumGuy

2:13 am on Feb 16, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



If you want to take a deeper dive into this, try here:

[intel471.com...]

Under the "How Are Real Residential IP Addresses Acquired?" heading, I find this particularly interesting:

==========
2. ISP partnerships: Proxy providers may form agreements with ISPs to lease residential IP addresses. This method is more transparent and often involves clear contracts that outline how the IP addresses will be used, thus maintaining legitimacy.
==========

Will any residential ISP's admit to leasing their IP's for proxy use? I believe Windstream does - by way of this:

[geofeed.windstream.com...]

I have identified Windstream Communications (Little Rock, Arkansas, AS7029), which is (or claims to be) a residential ISP, as having dedicated specific /24 IPv4 ranges for use by third parties (web scraping, from my direct experience).

Emerging AI systems apparently have a need to digest vast amounts of natural text, the scraping of which does not present website owners with any direct upside or benefit in the way that public-usage search engines do. The owners or "trainers" of these AI systems will increasingly resort to using residential proxies to scrape this text material, and also likely images, photographs, etc, off the internet if they are blocked by standard access rules (based on their IP address or user agent for example).

Jonesy

1:11 pm on Feb 23, 2025 (gmt 0)

10+ Year Member Top Contributors Of The Month



@SumGuy -- Oh, this is scary!

Kendo

12:55 am on Feb 24, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am currently experimenting with fingerprinting that ignores IP. It may be a way to identify those using rotated IPs. Still too early to say if it will be worthwhile or not. I do see that some hits send no user-agent and thus produce no fingerprint, but that in itself lets them be treated accordingly.
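
The method itself is not described here, so the following is only a guess at the general shape of an IP-independent fingerprint: hash a fixed set of request headers and ignore the client address. The header list and the 16-character truncation are assumptions made for illustration.

```python
import hashlib

# Assumed set of headers to fingerprint; the real selection is not stated in the thread.
FINGERPRINT_HEADERS = ("user-agent", "accept", "accept-language", "accept-encoding")


def fingerprint(headers):
    """Return a short hash of selected headers, or None when no User-Agent is sent."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    if not normalized.get("user-agent"):
        return None  # no user-agent, no fingerprint: handle these hits separately
    material = "\n".join(
        f"{name}:{normalized.get(name, '')}" for name in FINGERPRINT_HEADERS
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]
```

Two requests arriving from different IPs with identical headers get the same value, which is what would make rotated IPs visible. A client that sends only a stock browser user-agent and nothing else also hashes to one stable value.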

thecoalman

10:55 am on Feb 24, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been using Cloudflare for years. If you go into the WAF section they have an automagical setting, "Bot fight mode". This will block the most aggressive malicious bots Cloudflare has identified; these are usually from IPs like AWS where the IP does not change. The effectiveness of this depends on your plan: the more you are paying the more effective it is, but even the free plan provides some protection. There is also an option for AI bots, but this blocks any known AI bot, so if you want ChatGPT and others it's not an option.

Custom rules are where it really becomes effective: you can set up a variety of blocking mechanisms based on multiple criteria, such as country, ASN, user agent, IP range, etc. For example, traffic for my site is expected from two countries. Those are whitelisted and the rest get the "Checking your browser..." page. Only about 1.5% of those challenged get through.
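
To make the rule logic concrete, here is the same decision written as a plain Python function. This is not Cloudflare's rule syntax or API, just a sketch of the logic. The country codes are placeholders, and the ASN is one reserved for documentation.

```python
# Placeholder values: the two expected countries are not named in this post.
ALLOWED_COUNTRIES = {"US", "CA"}
# 64496 is an ASN reserved for documentation (RFC 5398), used here as a stand-in.
BLOCKED_ASNS = {64496}


def decide(country, asn):
    """Return 'block', 'allow' or 'challenge' for one request."""
    if asn in BLOCKED_ASNS:
        return "block"      # banned networks lose even if the country is allowed
    if country in ALLOWED_COUNTRIES:
        return "allow"      # expected traffic passes without a challenge
    return "challenge"      # everyone else gets the browser check
```

Checking the ASN first is a design choice: it keeps a banned hosting or VPN network out even when it geolocates to one of the allowed countries.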

blend27

7:46 pm on Mar 14, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



..... Emerging AI systems apparently have a need to digest vast amounts of natural text, the scraping of which does not present website owners with any direct upside or benefit in the way that public-usage search engines do....

and now asking "GunVerntMaht" to allow them doing so as a "Fair Use", using scare tactics....

[arstechnica.com...]

thecoalman

8:23 pm on Mar 15, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's a valid point and an issue that affects just about any Western industry. It's hard to remain competitive if you are operating under a strict set of laws/regulations and your competition is operating under wild west rules.

Kendo

4:46 am on Mar 16, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The custom rules is where it really becomes effective, you can set up variety of blocking mechanisms based on multiple criteria. Country, ASN, user agent, IP range, etc.

I wonder how they cope with VPNs and TOR when an IP address is required for geolocation?

For example traffic for my site is expected from two countries. Those are whitelisted and the rest get the "Checking your browser..." page. Only about 1.5% of them are successful.

I see in some discussion forums that some web browsers are either failing to display the captcha or crashing while loading it.

thecoalman

10:51 am on Mar 16, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Tor traffic gets its own country code. The link is to the documentation; apparently you can enable onion routing through CF.

[developers.cloudflare.com...]

VPN traffic, AFAIK, is treated like any other IP. I know I have banned at least one ASN where an obscure VPN was using their services.

CF was added to phpbb.com about a year ago at my urging. There were some minor issues for a few days but no complaints since then.

Bewenched

6:00 pm on Apr 10, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm having issues with OpenAi's bot.
I have it forbidden in my robots.txt file, but this bot will grab the robots.txt, then go to our homepage and then some random page anyway. Totally ignoring the robots.txt directive.
We deal in a lot of very specific data regarding pieces and parts of things that we've compiled over 25yrs and unless they want to pay me for MY data they can stay away.
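
Before concluding that a bot ignores robots.txt, it can be worth confirming that the file parses the way you intend. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content is an example, not the file from this site, and it assumes the crawler identifies itself with the `GPTBot` token.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: block GPTBot everywhere, allow everyone else.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the bot's product token (e.g. "GPTBot"), not the full header string:
    # the parser only compares the part of the name before the first "/".
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS, "GPTBot", "https://example.com/")` is False. This only verifies what the file says; it cannot make a crawler comply, so a server-side block is still needed for bots that ignore it.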

Kendo

12:13 am on Apr 11, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm having issues with OpenAi's bot.

Blocking them will never keep them out.

If I were artificially intelligent and found no content using my usual user-agent and IP, I would change the user-agent and hop onto a VPN just like any scraper can.

After fingerprinting (without JavaScript) 1000s of visitors per day, I am logging a lot of visitors that have nothing more than a "common browser" user-agent.

lucy24

2:14 am on Apr 11, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have it forbidden in my robots.txt file, but this bot will grab the robots.txt, then go to our homepage and then some random page anyway. Totally ignoring the robots.txt directive.

If you know who it is--whether by IP or UA--how do they get in?