Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Just a heads up

 1:00 am on May 25, 2006 (gmt 0)

Recently two ranges that I knew had previously been denied were allowed access, due to a syntax error I had made.
(Funny thing about syntax errors: some result in 500s and take your entire site(s) down, while others may linger unnoticed for months or longer until we see something eye-opening.)
The syntax error and its correction made me more careful about verifying that the correction had actually solved the problem. In the process I stumbled across this strange correlation.

The following may just be coincidence; however, I don't believe so.

ALL the bots read robots.txt and then immediately requested the same folder and page.
There must be some correlation or connection, either between the bots or perhaps on the page itself.

I went over and over the page's HTML and could not find any weakness.

68.178.242.xxx - - [22/May/2006:03:23:29 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
68.178.242.xxx - - [22/May/2006:03:23:36 -0700] "GET /myfolder/mypage.html HTTP/1.1" 403 - "-" "-"

- - [07/May/2006:00:20:10 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"
- - [07/May/2006:00:20:11 -0700] "GET /Same Folder/Same Page.html HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"

64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

216.195.47.xxx - - [23/May/2006:07:13:41 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
216.195.47.xxx - - [23/May/2006:07:13:46 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

64.71.167.xx - - [24/May/2006:09:19:40 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
64.71.167.xx - - [24/May/2006:09:19:43 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

Most of these bots visited the same pages over and over in the exact same order.
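For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch. It assumes the Apache combined log format shown in the entries above; the function names and the 10-second window are my own choices for illustration, not anything standard. It reports each page request that follows a robots.txt fetch from the same IP within the window.

```python
import re
from datetime import datetime, timedelta

# Apache combined log format. The IP is kept as an opaque string because
# the entries quoted in this thread are partly masked (e.g. "64.62.228.xx").
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>.+?) (?P<proto>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<ref>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def robots_then_page(lines, window_seconds=10):
    """List (ip, path, gap_in_seconds) for every request that is the first
    one an IP makes after fetching /robots.txt, if it arrives in the window."""
    last_robots = {}  # ip -> timestamp of its most recent robots.txt fetch
    hits = []
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        if rec["path"] == "/robots.txt":
            last_robots[rec["ip"]] = rec["ts"]
        elif rec["ip"] in last_robots:
            gap = rec["ts"] - last_robots.pop(rec["ip"])
            if timedelta(0) <= gap <= timedelta(seconds=window_seconds):
                hits.append((rec["ip"], rec["path"], int(gap.total_seconds())))
    return hits


# Two of the entries quoted above, used as sample input.
sample = [
    '64.71.167.xx - - [24/May/2006:09:19:40 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"',
    '64.71.167.xx - - [24/May/2006:09:19:43 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
]
print(robots_then_page(sample))
```

Lines that do not parse (such as entries with the IP stripped out) are simply skipped, so this undercounts rather than guessing.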

Any thoughts?
Other than it's time for my medication ;)




 10:35 am on May 25, 2006 (gmt 0)

Sorry Don,
I cannot confirm this from my log archives, but several IPs from 64.71.167.* repeatedly attempted to crawl using
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322). (This is a Hurricane E. range and therefore denied anyway.)



 4:29 pm on Jun 1, 2006 (gmt 0)

HTTP/1.1" 200 3727 "-" "-"
- - [01/Jun/2006:06:31:58 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

only log entries.

Interesting in that the backbone of this IP range has the same name as the HE pest.

The actual IP is registered to a Moscow address.

I've added the backbone range to my denies.
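Since a syntax error is what let two denied ranges back in at the start of this thread, it may be worth checking a deny list separately from the server config. The sketch below uses Python's standard ipaddress module. The /24 networks are only my assumption about how the ranges mentioned in this thread would be written, and the test addresses are made up for illustration.

```python
import ipaddress

# Hypothetical deny list: two ranges from this thread, assumed to be /24s.
DENIED = [
    ipaddress.ip_network("68.178.242.0/24"),
    ipaddress.ip_network("64.71.167.0/24"),
]


def is_denied(ip, denied=DENIED):
    """True if the address falls inside any denied network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in denied)


def mismatches(expectations, denied=DENIED):
    """Compare expected outcomes with the deny list.

    expectations maps an IP string to True (should be denied) or False.
    Returns a list of (ip, expected, actual) for every disagreement.
    """
    bad = []
    for ip, expected in expectations.items():
        actual = is_denied(ip, denied)
        if actual != expected:
            bad.append((ip, expected, actual))
    return bad


# Made-up addresses: two inside the assumed ranges, one outside.
checks = {
    "68.178.242.10": True,
    "64.71.167.200": True,
    "192.0.2.1": False,
}
print(mismatches(checks))
```

An empty list means the deny list behaves as expected for the addresses tested. It says nothing about whether the server itself parsed the rules correctly, so requesting a page from a denied address and looking for the 403 is still the real test.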



 6:27 am on Jun 8, 2006 (gmt 0)

The range 68.178.242. is GoDaddy hosting and I've blocked them due to various creepy crawlers from their servers.

What you may be seeing is what I predicted months ago: scrapers/crawlers are building distributed networks so they can't be caught downloading too many pages from a single source.

AdSense and other monetization programs are a very good incentive for this type of activity.
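One rough way to look for that kind of distributed activity is to group requests by target page and count how many unrelated ranges asked for it. The Python sketch below is only an illustration: the function names and the threshold of three sources are my own assumptions, and the first three octets are used as a stand-in for "range" because the addresses in this thread are masked.

```python
from collections import defaultdict


def prefix24(ip):
    """First three octets; also works on masked addresses like '64.62.228.xx'."""
    return ".".join(ip.split(".")[:3])


def shared_targets(requests, min_sources=3):
    """Find pages requested from several distinct ranges.

    requests is an iterable of (ip, path) pairs. Returns a dict mapping
    each path to the sorted list of prefixes that requested it, keeping
    only paths seen from at least min_sources different prefixes.
    """
    sources = defaultdict(set)
    for ip, path in requests:
        if path != "/robots.txt":  # every well-behaved bot fetches this
            sources[path].add(prefix24(ip))
    return {
        path: sorted(prefixes)
        for path, prefixes in sources.items()
        if len(prefixes) >= min_sources
    }


# (ip, path) pairs taken from the log entries quoted earlier in the thread.
requests = [
    ("68.178.242.xxx", "/robots.txt"),
    ("68.178.242.xxx", "/myfolder/mypage.html"),
    ("64.62.228.xx", "/robots.txt"),
    ("64.62.228.xx", "/Same Folder/Same Page.html"),
    ("216.195.47.xxx", "/Same Folder/Same Page.html"),
    ("64.71.167.xx", "/Same Folder/Same Page.html"),
]
print(shared_targets(requests))
```

A popular page will naturally be requested from many ranges, so this only becomes suspicious when combined with other signs such as blank user agents and the robots.txt-then-page timing.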

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved