Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Strange?
Just a heads up
wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3263 posted 1:00 am on May 25, 2006 (gmt 0)


Recently, two ranges that I knew were previously denied were being allowed access because of a syntax error I had made.
(Funny thing about syntax errors: some result in 500s and take your entire site(s) down, while others may linger unnoticed for months or longer until we see something eye-opening.)
The syntax error and its correction made me more alert to verifying that the correction had actually solved the problem. In the process I stumbled across this strange correlation.
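
For anyone wanting to run that kind of check against a raw access log, here is a minimal Python sketch. It assumes Apache combined-format lines and that a denied range should only ever see 403s. The names `leaked_requests` and `LOG_RE` are made up for illustration, and the sample addresses are documentation placeholders (192.0.2.0/24 and 198.51.100.0/24), not ranges from my logs.

```python
import ipaddress
import re

# Assumption: Apache combined log format. Captures client IP, method,
# path (which may contain spaces), protocol and status code.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (.*?) (HTTP/[\d.]+)" (\d{3}) ')

def leaked_requests(lines, denied_networks):
    """Return (ip, path, status) for requests from denied networks
    that were answered with anything other than 403."""
    nets = [ipaddress.ip_network(n) for n in denied_networks]
    leaks = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # not a combined-format line
        ip_text, _method, path, _proto, status = m.groups()
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # masked or malformed address, e.g. 64.62.228.xx
        if status != "403" and any(ip in net for net in nets):
            leaks.append((ip_text, path, int(status)))
    return leaks

# Hypothetical sample lines using placeholder addresses.
sample = [
    '192.0.2.10 - - [22/May/2006:03:23:29 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"',
    '192.0.2.10 - - [22/May/2006:03:23:36 -0700] "GET /myfolder/mypage.html HTTP/1.1" 403 - "-" "-"',
    '198.51.100.7 - - [22/May/2006:03:24:00 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "-"',
]

print(leaked_requests(sample, ["192.0.2.0/24"]))
# [('192.0.2.10', '/robots.txt', 200)]
```

Whether a 200 on robots.txt from a denied range counts as a leak depends on how the denies are written; if robots.txt is deliberately left open to everyone, those hits can be ignored.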

The following may just be coincidence; however, I don't believe so.

ALL of the bots read robots.txt and then immediately requested the same folder and page.
There must be some correlation or connection either between the bots or perhaps on the page itself.

I went over and over the page HTML and could not find any weakness.

68.178.242.xxx - - [22/May/2006:03:23:29 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
68.178.242.xxx - - [22/May/2006:03:23:36 -0700] "GET /myfolder/mypage.html HTTP/1.1" 403 - "-" "-"

66.154.103.150 - - [07/May/2006:00:20:10 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"
66.154.102.96 - - [07/May/2006:00:20:11 -0700] "GET /Same Folder/Same Page.html HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"

64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

216.195.47.xxx - - [23/May/2006:07:13:41 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
216.195.47.xxx - - [23/May/2006:07:13:46 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

64.71.167.xx - - [24/May/2006:09:19:40 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
64.71.167.xx - - [24/May/2006:09:19:43 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

Most of these bots visited the same pages over and over in the exact same order.
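
The pattern above (robots.txt, then a page a few seconds later from the same range) can be pulled out of a log mechanically. This is only a sketch under some assumptions: the lines are in chronological order, "same range" means the same first three octets, and a 10-second window is close enough. `robots_then_page` and `prefix24` are names invented for the example.

```python
import re
from datetime import datetime, timedelta

# Assumption: Apache combined log format. Captures IP, timestamp, path, status.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (.*?) HTTP/[\d.]+" (\d{3}) ')

def prefix24(ip_text):
    """First three octets as text, so masked IPs like 64.62.228.xx still group."""
    return ".".join(ip_text.split(".")[:3])

def robots_then_page(lines, window_seconds=10):
    """Return (prefix, path, gap_seconds) for each page request that follows
    a robots.txt fetch from the same prefix within the window."""
    last_robots = {}  # prefix -> time of the most recent robots.txt fetch
    pairs = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip_text, stamp, path, _status = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        key = prefix24(ip_text)
        if path == "/robots.txt":
            last_robots[key] = when
        elif key in last_robots:
            gap = when - last_robots.pop(key)
            if gap <= timedelta(seconds=window_seconds):
                pairs.append((key, path, int(gap.total_seconds())))
    return pairs

# Log lines copied from the entries above.
sample = [
    '66.154.103.150 - - [07/May/2006:00:20:10 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"',
    '66.154.102.96 - - [07/May/2006:00:20:11 -0700] "GET /Same Folder/Same Page.html HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"',
    '64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"',
    '64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
    '216.195.47.xxx - - [23/May/2006:07:13:41 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"',
    '216.195.47.xxx - - [23/May/2006:07:13:46 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
]

for hit in robots_then_page(sample):
    print(hit)
# ('64.62.228', '/Same Folder/Same Page.html', 0)
# ('216.195.47', '/Same Folder/Same Page.html', 5)
```

The Gigabot pair is not reported because its two requests came from different /24s (66.154.103 and 66.154.102); grouping on the first two octets instead would catch it, at the cost of more false matches.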

Any thoughts?
Other than it's time for my medication ;)

Don

 

bull

10+ Year Member



 
Msg#: 3263 posted 10:35 am on May 25, 2006 (gmt 0)

Sorry Don,
I cannot confirm this from my log archives, but several IPs from 64.71.167.* repeatedly attempted to crawl using
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322). (This is a Hurricane E. range and therefore denied anyway.)

Cheers
Jan

wilderness




 
Msg#: 3263 posted 4:29 pm on Jun 1, 2006 (gmt 0)

HTTP/1.1" 200 3727 "-" "-"
208.66.195.6 - - [01/Jun/2006:06:31:58 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

only log entries.

Interesting in that the backbone of this IP range has the same name as the HE pest.

The actual IP is registered to a Moscow address.

I've added the backbone range to my denies.

Don

incrediBILL

WebmasterWorld Administrator incredibill us a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3263 posted 6:27 am on Jun 8, 2006 (gmt 0)

The range 68.178.242. is GoDaddy hosting and I've blocked them due to various creepy crawlers from their servers.

What you may be seeing is what I predicted months ago in that scrapers/crawlers are building distributed networks so they can't be caught by downloading too many pages from a single source.
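
One rough way to look for that kind of distributed behavior in a log is to group requests by page and user agent and count how many different ranges asked for the same thing. The Python below is a sketch only: it assumes Apache combined-format lines, treats the first three octets as "a range", and uses a made-up function name (`distributed_hits`) and threshold.

```python
import re
from collections import defaultdict

# Assumption: Apache combined log format. Captures IP, path and user agent.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (.*?) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def distributed_hits(lines, min_prefixes=3):
    """Map (path, user_agent) -> sorted list of /24-style prefixes, keeping only
    combinations requested from at least min_prefixes different prefixes."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip_text, path, agent = m.groups()
        if path == "/robots.txt":
            continue  # every crawler fetches this, so it says nothing
        prefix = ".".join(ip_text.split(".")[:3])
        seen[(path, agent)].add(prefix)
    return {key: sorted(p) for key, p in seen.items() if len(p) >= min_prefixes}

# Page requests copied from the log entries posted earlier in this thread.
sample = [
    '66.154.102.96 - - [07/May/2006:00:20:11 -0700] "GET /Same Folder/Same Page.html HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"',
    '64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
    '216.195.47.xxx - - [23/May/2006:07:13:46 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
    '64.71.167.xx - - [24/May/2006:09:19:43 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
    '208.66.195.6 - - [01/Jun/2006:06:31:58 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"',
]

print(distributed_hits(sample))
# {('/Same Folder/Same Page.html', '-'): ['208.66.195', '216.195.47', '64.62.228', '64.71.167']}
```

A popular page will trip this on ordinary traffic too, so it is only a starting point; a blank user agent combined with several unrelated ranges is what makes the entries in this thread stand out.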

AdSense and other monetization programs are a very good incentive for this type of activity.
