I just noticed in my site's referral stats (in cPanel under "Web / Ftp Statistics", then under that the "Latest visitors" area) that "Aipbot" was spidering several pages, but it didn't stop at my robots.txt file. Maybe it saw it on some previous visit?
<I snipped some text that wasn't necessary here>
Regarding the Aipbot, I checked again and this time the Aipbot entries DO show it checked my robots.txt file. I also see that the Aipbot is not a search engine bot; aipbot[dot]com appears to be the robotstxt.org site! I don't know why that website would be crawling my pages. Does anyone?
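For anyone who would rather check the raw access log than the cPanel summary, here is a rough sketch of how to see whether a given bot asked for robots.txt at all. It assumes the log is in Apache "combined" format; the sample lines, the IP addresses and the "aipbot/1.0" user agent string are made up for illustration and are not taken from real logs.

```python
import re
from collections import defaultdict

# Apache "combined" format: ip ident user [time] "request" status bytes "referrer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_report(lines, needle):
    """Count page hits and robots.txt hits for user agents containing `needle`."""
    report = defaultdict(lambda: {"pages": 0, "robots_txt": 0})
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or needle.lower() not in match["agent"].lower():
            continue
        entry = report[match["agent"]]
        # Ignore any query string when comparing the path.
        if match["path"].split("?")[0] == "/robots.txt":
            entry["robots_txt"] += 1
        else:
            entry["pages"] += 1
    return dict(report)


# Made-up sample entries; 192.0.2.x is a documentation-only address range.
sample = [
    '192.0.2.10 - - [30/Jun/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "aipbot/1.0 (made-up UA)"',
    '192.0.2.10 - - [30/Jun/2005:10:00:05 +0000] "GET /links.html HTTP/1.1" 200 5120 "-" "aipbot/1.0 (made-up UA)"',
    '192.0.2.10 - - [30/Jun/2005:10:00:09 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "aipbot/1.0 (made-up UA)"',
    '192.0.2.77 - - [30/Jun/2005:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 900 "-" "Mozilla/4.0 (compatible)"',
]

print(robots_report(sample, "aipbot"))
```

A robots.txt count of zero in one log window does not prove the bot ignores the file, since it may have fetched it on an earlier visit.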
Hmmm, this is interesting. If you go to aipbot[dot]com, you see that name in the address bar, and the link on the page is a clickable URL that says it's on robotstxt.org. If you click that URL it ALSO goes to "aipbot[dot]com", which still shows in the address bar, but it's a webpage from robotstxt.org! It appears to be a "frame site spoof" (even though it does NOT look like a framed page), since if you right click the aipbot[dot]com page and choose "open frame in new window" it then shows my.name-services[dot]com/79900/page99.htm in the address bar! My, this is really confusing. I did a header check of that page and of aipbot[dot]com and they both show 200 OK. I trimmed that URL back to just the domain of my.name-services[dot]com and it says: "The service for this site has been canceled." HUH? Looks like something fishy is going on here. I put that URL in Google and came up with one hit that said:
The service for this site has been canceled.
And clicking the blue text goes to the green URL, but it also says the exact same "The service for this site has been canceled". Enom is a webhosting company, and it appears they canceled the account of "my", whoever that was. But name-services[dot]com goes to some cover page, and it is hosted by Enom. Yadda yadda yadda.
Does anyone know what's going on here? Has anyone ever heard anything about "aipbot"...are they some type of scam site, or do they have a "questionable" spider? Should I block that spider from my sites, and if my.name-services[dot]com has been canceled, then what is it doing spidering my site?
Ahh, I did some G searches and found this:
[webmasterworld.com...] . Yes, it does appear questionable and it should be blocked in cPanel. I'm leaving this post here so others may want to check their stats for access by this bot. A G search for Aipbot shows some rather interesting info. I'd like to know how Aipbot[dot]com and my.name-services[dot]com/79900/page99.htm can still exist.
Yes, it is not easy. The other link is good. I have also seen interesting things in the cPanel. I am not a specialist. To me it seemed that some spiders entered my site via subpages. But I did not study it in detail.
"Tracing search engine IP addresses is handy for a number of reasons. You might want to see when your web site was spidered by search engines. You might want to know which pages where spidered. You might want to ban some bots and let others in. You might want to see which bots are obeying your robots.txt files. You might want to redirect bots to other pages or serve them different content (cloaking).
Keeping track of the various IP addresses belonging to search engine spiders is a difficult enterprise and requires the assistance of a large number of people. I'm constantly receiving reports of new spiders and doing the research necessary to verify that they are indeed new and that they do indeed belong to search engines. The lists are as up to date as possible, but there is no guarantee that they include all of the IP addresses of all search engines (it would be impossible to guarantee such a thing). Also, there is no guarantee that all of the IP addresses contained in the lists belong to search engines (they are sometimes reassigned or abandoned)".
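As a rough illustration of how such an IP list can be used, the sketch below checks a visitor's address against a table of ranges. The bot names and ranges here are hypothetical placeholders (documentation-only address blocks), not real search engine addresses, and as the quote says, no such list can be guaranteed complete or current.

```python
import ipaddress

# Hypothetical list: documentation-only ranges, NOT real crawler addresses.
KNOWN_SPIDER_RANGES = {
    "ExampleBot": ["192.0.2.0/24"],
    "OtherBot": ["198.51.100.0/25", "203.0.113.16/28"],
}


def identify_spider(ip, ranges=KNOWN_SPIDER_RANGES):
    """Return the name of the spider whose listed range contains `ip`, else None."""
    address = ipaddress.ip_address(ip)
    for name, networks in ranges.items():
        for network in networks:
            if address in ipaddress.ip_network(network):
                return name
    return None


print(identify_spider("192.0.2.44"))
print(identify_spider("203.0.113.40"))
```

A match only says the address falls inside a listed range; since ranges get reassigned or abandoned, it is worth treating the result as a hint rather than proof.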
I cut from a site listed in the thread.
What about advanced site crawlers that can steal much of your code in seconds? Is it possible to exclude them? As I said in another post, as long as there is two-way communication, I think "everything is possible".
Different browsers, different web server operating systems and dynamic (random) IP addresses make it more difficult.
Make it simple, as simple as possible, but no simpler.
I also just noticed the Googlebot is NOT obeying robots meta tags. I wanted to block it on ONE of my "links" pages (experimented a few weeks ago to see what would happen) where in the <head> tag I placed <meta name="googlebot" content="noindex, nofollow"> and the G-bot did indeed visit the page as it does any of my other pages. I don't want to use the robots.txt file to do that since I tried that for one PDF file and after I did that I started to drop in G again.
What the noindex will do is keep the page out of the search results, and the nofollow means it won't harvest the links on that page (however, if it finds those links elsewhere without a nofollow it will try to visit them).
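Worth noting: a crawler has to fetch the page before it can read a meta tag, so a Googlebot visit in the logs is expected even when the tag is honoured. To double check that the tag on a page actually says what you intended, something like this sketch can help. It uses Python's standard HTML parser; the sample page is made up, and the function and class names are just illustrative.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def robots_directives(html):
    """Return a dict mapping meta name to the set of directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up page using the same tag as in the post above.
page = (
    '<html><head><meta name="googlebot" content="noindex, nofollow">'
    "</head><body>links</body></html>"
)
print(robots_directives(page))
```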
There's no mention of a search engine, just the crawler. The URL
[aipbot.com...] gives an identical page. Another mystery box. -Larry
advanced site crawler
I get a lot of hits. The description for one advanced site crawler is as follows:
"Advanced Site Crawler #*$!x is a Windows-based shareware that has two main functions. The first one is to search INSIDE a website that you will choose and will follow one link after the other to search for information. The second function allows you to search a website and download images, videos, documents, sounds and much more! You can download files into separate categories or create a duplicate of the original website."
That is how easy it is to create a duplicate of (steal) a website. Is it possible to block them?
Make it simple, as simple as possible, but no simpler.
Cgrantski, I keep a list of search engine IP addresses and I check it to be sure I'm not blocking anything legit. The only reason I'd want to block some bots is if they are on APNIC, LACNIC or RIPE due to Spam (they spider a site for email addresses to attack). I know a lot of you are on those IP ranges, and unfortunately the decent and good have to suffer for the illegalities of the spamming and scamming parasites that are on those ranges. I also can only sell in the USA and Canada (also due to fraud), so I'm not missing any sales traffic. I would also block any known USA bots that are known as email harvesters and the like. I had a list of "known bad bots" somewhere.
Dijkgraaf, thanks, that makes sense. ;)
Larry, read my post again about the Aipbot. From the evidence I found, they are not only not legit, but appear to be spoofing another website.
Kgun, I don't know anything about any "Advanced Site Crawler"; it could be legit, and up to a point that is just the way crawlers work. You could block it of course, but only if you could find a common IP address to block, and it doesn't sound like you could since it sounds like it's software that's on the user's side. The IP address would of course change for every user that used the 'ware. The only thing I could suggest is to run it against your own site, then look at your logs and see if it has a distinguishing characteristic that you could block. I doubt there would be one, unless you see something like "AdvancedSiteCrawler-bot" in the user agent, which you could then deny in your .htaccess file.
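If the logs do show a recognisable user agent, the deny rules could be generated along these lines. This is only a sketch: it assumes Apache 2.2-style access control (SetEnvIfNoCase plus Order/Allow/Deny), the "bad_bot" variable name is arbitrary, and "AdvancedSiteCrawler" is just the example string from the post, not a confirmed user agent.

```python
import re


def htaccess_block_rules(agent_fragments, env_name="bad_bot"):
    """Build Apache 2.2-style rules denying any User-Agent containing a fragment."""
    lines = []
    for fragment in agent_fragments:
        if '"' in fragment:
            raise ValueError("fragment must not contain a double quote")
        # Escape regex metacharacters so the fragment is matched literally.
        pattern = re.escape(fragment)
        lines.append(f'SetEnvIfNoCase User-Agent "{pattern}" {env_name}')
    lines += ["Order Allow,Deny", "Allow from all", f"Deny from env={env_name}"]
    return "\n".join(lines)


print(htaccess_block_rules(["AdvancedSiteCrawler"]))
```

Keep in mind that the user agent is whatever the client chooses to send, so this only stops tools that identify themselves honestly.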
"it sounds like it's software that's on the user's side".
Think you are right.
I get 447 000 hits, when I search for
list of bad bots
on Google. FUNDER have a fairly large list.
Your private list is perhaps good enough?
Good enough is best.
Only bad bots will access that directory (ie they've ignored our robots.txt exclusion). These bots can then be directed to a script that will immediately grab their IP address, User Agent or Referrer and add it to an .htaccess file - so that they're banned from the site".
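The "add it to an .htaccess file" step in that quote might look roughly like the sketch below. It works on the file contents as a string so it is easy to try out; in a real trap script the address would typically come from the server (for example the CGI REMOTE_ADDR variable) and the result would be written back to disk. The function name and the Apache 2.2-style "Deny from" rule are my assumptions, not something the quoted site spells out.

```python
import ipaddress


def ban_ip(htaccess_text, ip):
    """Return htaccess_text with a 'Deny from <ip>' line appended, unless already present."""
    # ip_address() raises ValueError on malformed input, so junk never reaches the file.
    address = ipaddress.ip_address(ip)
    rule = f"Deny from {address}"
    if rule in htaccess_text.splitlines():
        return htaccess_text
    if htaccess_text and not htaccess_text.endswith("\n"):
        htaccess_text += "\n"
    return htaccess_text + rule + "\n"


start = "Order Allow,Deny\nAllow from all\n"
print(ban_ip(start, "192.0.2.15"))
```

Banning by IP alone can catch innocent visitors later on if the address is dynamic, so it may be worth expiring old entries.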
You may also study my
.htaccess (not implemented) name htaccess.txt
Make it simple, as simple as possible, but no simpler.
Go to an SE and search for "list of bad bots" in quotes and you will get more exact results.
[edited by: Receptional at 2:23 pm (utc) on June 30, 2005]
[edit reason] No specifics please [/edit]