|Suspicious referrals in my web stats|
I couldn't seem to find an appropriate forum for this post, so this one seemed the most appropriate of the topics given since it does have something to do with referral and tracking. On another forum thread at this site, someone was asking about SE spider-bots and if they always visited the robots.txt file first. I noticed something rather strange and suspicious and thought it would be more appropriate to put it here:
I just noticed in my site's referral stats (in cPanel under "Web / Ftp Statistics", then under that "Latest visitors" area), the "Aipbot" spidering several pages, but it didn't stop at my robots.txt file. Could it be that it saw the file on some previous visit?
<I snipped some text that wasn't necessary here>
Regarding the Aipbot, I checked again and this time the Aipbot entries DO show it checked my robots.txt file. I also see that the Aipbot is not a search engine bot; aipbot[dot]com appears to be the robotstxt.org site! I don't know why that website would be crawling my pages, does anyone?
Hmmm, this is interesting. If you go to aipbot[dot]com, you see that name in the address bar, and the link on the page is a clickable URL that says it's on robotstxt.org. If you click that URL it ALSO goes to "aipbot[dot]com", which still shows in the address bar, but it's a webpage from robotstxt.org! It appears to be a "frame site spoof" (even though it is NOT a framed page), since if you right click the aipbot[dot]com page and "open frame in new window" it then shows my.name-services[dot]com/79900/page99.htm in the address bar! My, this is really confusing. I did the header check of that page and aipbot[dot]com and they both show 200 OK. I back checked that URL to just the domain of my.name-services[dot]com and it says: "The service for this site has been canceled." HUH? Looks like something fishy is going on here. I put that URL in Google and came up with one hit that said: Your Site
The service for this site has been canceled.
And clicking the blue text goes to the green URL, but it also says the exact same "The service for this site has been canceled". Enom is a webhosting company, and it appears they canceled the account of "my", whoever that was. But name-services[dot]com goes to some cover page, and it is hosted by Enom. Yadda yadda yadda.
Does anyone know what's going on here? Has anyone ever heard anything about "aipbot"...are they some type of scam site, or do they have a "questionable" spider? Should I block that spider from my sites, and if my.name-services[dot]com has been canceled, then what is it doing spidering my site?
Ahh, I did some G searches and found this:
[webmasterworld.com...] . Yes, it does appear questionable and it should be blocked in cPanel. I'm leaving this post here so others may want to check their stats for access by this bot. A G search for Aipbot shows some rather interesting info. I'd like to know how Aipbot[dot]com and my.name-services[dot]com/79900/page99.htm can still exist.
Beautiful summer in Norway.
Yes, it is not easy. The other link is good. I have also seen interesting things in the cPanel. I am not a specialist. To me it seemed that some spiders entered my site via subpages. But I did not study it in detail.
"Tracing search engine IP addresses is handy for a number of reasons. You might want to see when your web site was spidered by search engines. You might want to know which pages were spidered. You might want to ban some bots and let others in. You might want to see which bots are obeying your robots.txt files. You might want to redirect bots to other pages or serve them different content (cloaking).
Keeping track of the various IP addresses belonging to search engine spiders is a difficult enterprise and requires the assistance of a large number of people. I'm constantly receiving reports of new spiders and doing the research necessary to verify that they are indeed new and that they do indeed belong to search engines. The lists are as up to date as possible, but there is no guarantee that they include all of the IP addresses of all search engines (it would be impossible to guarantee such a thing). Also, there is no guarantee that all of the IP addresses contained in the lists belong to search engines (they are sometimes reassigned or abandoned)".
I cut from a site listed in the thread.
What about advanced site crawlers that can steal much of your code in seconds? Is it possible to exclude them? As I said in another post, as long as there is two-way communication, I think "everything is possible".
Different browsers, different web server operating systems, and dynamic (random) IP addresses make it more difficult.
Make it simple, as simple as possible, but no simpler.
Discussion of this here:
Well I haven't seen the bot back since I put its IP address in my "IP Deny" area. However, now I see another one that might be bad! "Nicebot". I knew from the name alone to check it out. I found this page [webmasterworld.com...] but the only thing it checked at my site was the robots.txt file. Nicebot[dot]com looks pretty flaky. What do you know about it Fiestagirl? I couldn't find anything on it at SE's.
I also just noticed the Googlebot is NOT obeying robots meta tags. I wanted to block it on ONE of my "links" pages (experimented a few weeks ago to see what would happen) where in the <head> tag I placed <meta name="googlebot" content="noindex, nofollow"> and the G-bot did indeed visit the page as it does any of my other pages. I don't want to use the robots.txt file to do that since I tried that for one PDF file and after I did that I started to drop in G again.
Heya Clint, there are many thousands of funky spiders and bots and you'll just waste a lot of time trying to figure out each one. On an established site they can be a third of all traffic. Block 'em, filter 'em out, or live with them if they're not doing any harm, which most of them don't other than using a little bandwidth. They're interesting the first few times but the only effort you should put into 'em is making new filters to keep them out of your stats. If you block their IPs they'll just move on and you might end up blocking legitimate traffic.
Clint, for Googlebot to read those Meta tags, it has to actually request the page! If you don't want Googlebot to visit the page you have to do it in robots.txt.
What the noindex will do is keep the page out of the search results, and the nofollow means it won't harvest the links on that page (however, if it finds those links elsewhere without a nofollow it will try to visit them).
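To illustrate the difference: robots.txt is what decides whether a well-behaved bot requests a page at all, while the meta tag can only be read after the page has been fetched. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `/links.html` path are made up for illustration, and this only models what an obedient crawler would do, not what Googlebot actually does internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /links.html is an invented path.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /links.html

User-agent: *
Disallow:
"""


def may_fetch(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if an obedient crawler with this user agent may request path."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [("Googlebot", "/links.html"),
                        ("Googlebot", "/index.html"),
                        ("SomeOtherBot", "/links.html")]:
        print(agent, path, may_fetch(agent, path))
```

A page that is only protected by the meta tag will still show up in your logs as visited, which matches what Clint saw.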
Hi all: Aipbot crawled my site 2 days ago.
It started with robots.txt, then sucked in about a quarter of my html pages.
Hits were spaced out, it looked well behaved to me.
The simple text page at [aipbot.com...] just says:
" aipbot honors robots.txt files (for info on how to
exclude pages from aipbot, see:
user agent string aipbot "
There's no mention of a search engine, just the crawler. The URL
[aipbot.com...] gives an identical page. Another mystery box. -Larry
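For anyone who wants to check from raw logs whether a bot started with robots.txt, here is a rough sketch. It assumes Apache "combined" log format lines, and the sample lines, IPs and bot names below are invented. It can only see the log window you give it, so a bot that read robots.txt on an earlier visit will look like it skipped the file.

```python
import re
from typing import Dict, Iterable, List

# Assumes Apache "combined" log format; adjust if your host logs differently.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def first_request_by_agent(lines: Iterable[str]) -> Dict[str, str]:
    """Map each user agent to the first path it requested in these lines."""
    first: Dict[str, str] = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        first.setdefault(match.group("agent"), match.group("path"))
    return first


def agents_not_starting_with_robots(lines: Iterable[str]) -> List[str]:
    """User agents whose first logged request was not /robots.txt."""
    first = first_request_by_agent(lines)
    return sorted(a for a, path in first.items() if path != "/robots.txt")


# Invented sample data (documentation IP ranges, made-up bot names).
SAMPLE_LOG = [
    '192.0.2.10 - - [28/Jun/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "polite-bot"',
    '192.0.2.10 - - [28/Jun/2005:10:00:30 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "polite-bot"',
    '198.51.100.7 - - [28/Jun/2005:10:01:00 +0000] "GET /links.html HTTP/1.1" 200 2048 "-" "rude-bot"',
]

if __name__ == "__main__":
    print(agents_not_starting_with_robots(SAMPLE_LOG))
```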
Make the following search on Google:
advanced site crawler
I get a lot of hits. The description for one advanced site crawler is as follows:
"Advanced Site Crawler #*$!x is a Windows-based shareware that has two main functions. The first one is to search INSIDE a website that you will choose and will follow one link after the other to search for information. The second function allows you to search a website and download images, videos, documents, sounds and much more! You can download files into separate categories or create a duplicate of the original website."
That is how easy it is to create a duplicate of (steal) a website. Is it possible to block them?
Make it simple, as simple as possible, but no simpler.
I checked my logs again and saw that the Aipbot was back. It took a hit on my robots file; however, it got a "403 forbidden" on the file thanks to the IP Deny I added.
Cgrantski, I keep a list of search engine IP addresses and I check it to be sure I'm not blocking anything legit. The only reason I'd want to block some bots is if they are on APNIC, LACNIC or RIPE due to Spam (they spider a site for email addresses to attack). I know a lot of you are on those IP ranges, and unfortunately the decent and good have to suffer for the illegalities of the spamming and scamming parasites that are on those ranges. I also can only sell in the USA and Canada (also due to fraud), so I'm not missing any sales traffic. I would also block any known USA bots that are known as email harvesters and the like. I had a list of "known bad bots" somewhere.
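Checking an address against a private list like this before adding an IP Deny could look roughly like the sketch below. The networks listed are reserved documentation ranges standing in for a real list, not actual search engine addresses, and the function name is made up.

```python
import ipaddress

# Placeholder ranges only (reserved for documentation); a real list of
# search engine networks would have to be maintained by hand.
KNOWN_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_known_crawler(ip: str, networks=KNOWN_CRAWLER_NETWORKS) -> bool:
    """Return True if ip falls inside any network on the list."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    for candidate in ["192.0.2.55", "198.51.100.200", "203.0.113.9"]:
        verdict = "on the list, do not block" if is_known_crawler(candidate) else "not on the list"
        print(candidate, verdict)
```

As the quoted site says, addresses get reassigned or abandoned, so a hit or a miss against such a list is a hint and not proof.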
Dijkgraaf, thanks, that makes sense. ;)
Larry, read my post again about the Aipbot. From the evidence I found, they are not only not legit, but appear to be spoofing another website.
Kgun, I don't know anything about any "Advanced Site Crawler"; it could be legit and simply the way crawlers work, up to a point. You could block it of course, but only if you could find a common IP address to block, and it doesn't sound like you could since it sounds like it's software that's on the user's side. The IP address would of course change for every user that used the 'ware. The only thing I could suggest is using it on your own site, then looking at your logs to see if there is a distinguishable characteristic of it that you could block. I doubt there would be, unless you see something like "AdvancedSiteCrawler-bot" as the user agent, in which case you could deny that in your .htaccess file.
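One way to look for such a distinguishing characteristic is to tally the user agent strings in the log after running the tool against your own site. A rough sketch, assuming the user agent is the last quoted field on each line (as in Apache "combined" format); the sample lines and the "AdvancedSiteCrawler-bot" string are hypothetical, taken from the example name above and not from any real tool.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# The user agent is assumed to be the last quoted field on the line.
AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines: Iterable[str]) -> Counter:
    """Count how many requests each user agent string made."""
    counts: Counter = Counter()
    for line in lines:
        match = AGENT_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def matching_agents(counts: Counter, needle: str) -> Dict[str, int]:
    """Return the agents whose string contains needle (case-insensitive)."""
    needle = needle.lower()
    return {agent: n for agent, n in counts.items() if needle in agent.lower()}


# Invented sample data.
SAMPLE_LOG = [
    '203.0.113.5 - - [29/Jun/2005:09:00:00 +0000] "GET / HTTP/1.1" 200 900 "-" "AdvancedSiteCrawler-bot"',
    '203.0.113.5 - - [29/Jun/2005:09:00:01 +0000] "GET /a.html HTTP/1.1" 200 800 "-" "AdvancedSiteCrawler-bot"',
    '192.0.2.44 - - [29/Jun/2005:09:05:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    counts = count_user_agents(SAMPLE_LOG)
    print(counts.most_common())
    print(matching_agents(counts, "sitecrawler"))
```

If the tool sends an ordinary browser user agent, nothing will stand out here, which is the likely outcome Clint describes.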
"it sounds like it's software that's on the user's side".
Think you are right.
I get 447 000 hits, when I search for
list of bad bots
on Google. FUNDER have a fairly large list.
Your private list is perhaps good enough?
Good enough is best.
"Setting a Spider-trap
The best method of identifying bad bots is to create what is known as a Spider-trap. Create a directory, block that directory to all agents using robots.txt and link to the directory from a page (usually as a small 1x1 pixel link).
Only bad bots will access that directory (ie they've ignored our robots.txt exclusion). These bots can then be directed to a script that will immediately grab their IP address, User Agent or Referrer and add it to an .htaccess file - so that they're banned from the site".
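The quoted spider-trap idea can be sketched offline against an access log instead of a live script. This is only an illustration under assumptions: the trap directory `/trap/` is a made-up name that would be disallowed for all agents in robots.txt, the log is in Apache "combined" format, and the output uses the older `Deny from` style of .htaccess rule (newer Apache versions use `Require` directives instead). The sample lines are invented.

```python
import re
from typing import Iterable, List

# Hypothetical trap directory; robots.txt would carry
#   User-agent: *
#   Disallow: /trap/
TRAP_PREFIX = "/trap/"

# Only the client IP and the requested path are needed here.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)')


def trapped_ips(lines: Iterable[str], trap_prefix: str = TRAP_PREFIX) -> List[str]:
    """Return the sorted, de-duplicated IPs that requested anything under the trap."""
    ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(trap_prefix):
            ips.add(match.group("ip"))
    return sorted(ips)


def deny_lines(ips: Iterable[str]) -> List[str]:
    """Format one old-style .htaccess deny rule per IP."""
    return [f"Deny from {ip}" for ip in ips]


# Invented sample data.
SAMPLE_LOG = [
    '192.0.2.10 - - [30/Jun/2005:08:00:00 +0000] "GET /trap/index.html HTTP/1.1" 200 10 "-" "rude-bot"',
    '192.0.2.10 - - [30/Jun/2005:08:00:02 +0000] "GET /trap/more.html HTTP/1.1" 200 10 "-" "rude-bot"',
    '198.51.100.7 - - [30/Jun/2005:08:01:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    for rule in deny_lines(trapped_ips(SAMPLE_LOG)):
        print(rule)
```

Review the list by hand before pasting anything into .htaccess, since a human who follows the hidden link would land in it too.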
You may also study my
.htaccess (not implemented) name htaccess.txt
Make it simple, as simple as possible, but no simpler.
Interesting, thanks, looks like a good link too.
Go to an SE and search for "list of bad bots" in quotes and you will find a more exacting search.
[edited by: Receptional at 2:23 pm (utc) on June 30, 2005]
[edit reason] No specifics please [/edit]