Hi all.
I am a new member and this is my first post.
I am "web mastering" (a steep learning curve for me) my own personal web site that features some of my artworks and photographs.
I have visited Webmaster World a few times before joining for helpful guidance, particularly about the User Agent abuse from: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0.
I am pretty sure, from the behavior I see on my site (but I could be wrong), that a scraper (possibly human, but probably an unnamed/unknown bot) is behind this U/A. It ignores the block request in my robots.txt file. It switches IPs frequently. Most of the IPs show up on the AbuseIPDB website as known dodgy IPs, but some of the IPs it uses are alarming, the latest being the French Atomic Energy Agency! There are also a lot of universities/schools, cloud proxies, and Amazon AWS.
The reason I think this is a scraper comes from my log reports. Here is an example:
8 Jul 2023, 01:34:47 | 104.219.213.35 | GET | 1.1 | 200 | 162,241 | 425 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
8 Jul 2023, 01:31:58 | 44.229.15.165 | GET | 1.1 | 403 | 16,369 | 0 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
8 Jul 2023, 01:30:28 | 44.229.15.165 | GET | 1.1 | 403 | 16,369 | 0 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
When it encounters a 403 Response Code (my IP block), it switches to a new IP, and when it gets a 200 Response Code it then takes anything from a few hundred to over two thousand Time Taken (ms) to GET what it wants. It usually does this in blocks of 3 attempts, and maybe only 5 or 6 attempts in one 24 hour period, before moving on to a different target on my website. It seems to be concentrating on GETTING individual images (.jpeg), of which there are hundreds on the site. There are only 13 different HTML pages on the site. There is no advertising on the site and it is NOT a commercial site, as nothing is offered for sale on it.
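For anyone who wants to check the same pattern in their own logs, here is a minimal Python sketch of how the "403, then a new IP" behavior could be counted for one user agent. It is only an illustration: the pipe-separated layout (timestamp | ip | status | user agent), the function names and the `SAMPLE_LOG` text are assumptions made for this example, not the real export format of any host. The three sample rows are the ones from the log excerpt above, oldest first.

```python
from collections import defaultdict

TARGET_UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"

# Hypothetical layout, oldest request first: timestamp | ip | status | user agent
SAMPLE_LOG = """\
8 Jul 2023, 01:30:28 | 44.229.15.165 | 403 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
8 Jul 2023, 01:31:58 | 44.229.15.165 | 403 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
8 Jul 2023, 01:34:47 | 104.219.213.35 | 200 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
"""


def parse_line(line):
    """Split one pipe-separated row into a dict, or return None if it does not fit."""
    parts = [p.strip() for p in line.split("|", 3)]
    if len(parts) != 4 or not parts[2].isdigit():
        return None
    timestamp, ip, status, user_agent = parts
    return {"ts": timestamp, "ip": ip, "status": int(status), "ua": user_agent}


def summarize(lines, user_agent):
    """Count distinct IPs, status codes and IP changes right after a 403 for one UA."""
    ips = set()
    status_counts = defaultdict(int)
    switches_after_403 = 0
    last_blocked_ip = None  # IP of the previous request, only if that request got a 403

    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["ua"] != user_agent:
            continue
        ips.add(rec["ip"])
        status_counts[rec["status"]] += 1
        # The previous request was blocked and this one comes from another address.
        if last_blocked_ip is not None and rec["ip"] != last_blocked_ip:
            switches_after_403 += 1
        last_blocked_ip = rec["ip"] if rec["status"] == 403 else None

    return {
        "distinct_ips": len(ips),
        "status_counts": dict(status_counts),
        "switches_after_403": switches_after_403,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG.splitlines(), TARGET_UA))
```

On the three sample rows this reports two distinct IPs, two 403s, one 200, and one switch of IP straight after a block. A real log would need its own parsing step, since the column order and separators will differ from this made-up layout.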
Of course I wondered at first whether this U/A was legitimate, soon after it appeared a few months ago, so when I noticed this Forum message about the botnet coming back I became more suspicious. As I blocked the IPs, it just seemed to switch to new IPs as fast as I blocked them.
I also noticed many of the IPs were associated with China, North Korea and Hong Kong, but as I blocked these the IP switching went worldwide: USA, UK, etc. So I tried blocking the countries China and HK, and then there was a marked increase in the U/A string using international IPs.
So far I have blocked probably a hundred different IPs, and incidents now seem to be slowing down. Most now come out of the USA.
I have not used the .htaccess file to attempt a block, as I am pretty sure X11 (the U/A) will ignore that too.
A few days ago I decided, as an experiment, to lift the country block for China, and I got over 30 hits from X11 in 24 hours, so I blocked China again. I don't get any audience traffic from China, other than hosting companies like 10 cent, so I thought it would be no great loss of traffic and worth a shot to see what happened.
So, X11 seems to originate from China, but what is behind it?
I notice on GitHub that A LOT of people learning or using scraping use the X11 user agent string, and there is advice there for them to switch it often to another UA!
Legitimate traffic to my site does not seem to be down much, and it usually fluctuates up and down anyway, but I do fear the X11 trouble could get much worse, as others posting on WM World have indicated has occurred on their web sites. I don't want this to happen to me. My host does not have an anti-scrape tool yet, and *loudflare has other problems I don't want to touch.
I thought to post my experience here and welcome all comments and suggestions from you guys who are more experienced.
Thanks.
[edited by: not2easy at 1:55 pm (utc) on Jul 8, 2023]
[edit reason] split thread cleanup [/edit]