
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Strange?
Just a heads up
wilderness




msg:396076
 1:00 am on May 25, 2006 (gmt 0)


Recently, two ranges that I knew had previously been denied were allowed access, due to a syntax error I had made.
(Funny thing about syntax errors: some result in 500s and take your entire site(s) down, while others may linger undetected for months or longer until we see something eye-opening.)
The syntax error and its correction made me more alert to verifying that the correction had actually solved the problem. In the process I stumbled across this strange correlation.

The following may just be coincidence; however, I don't believe so.

ALL of the bots were reading robots.txt and then immediately requesting the same folder and page.
There must be some correlation or connection either between the bots or perhaps on the page itself.

I went over and over the page html and could not find any weakness.

68.178.242.xxx - - [22/May/2006:03:23:29 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
68.178.242.xxx - - [22/May/2006:03:23:36 -0700] "GET /myfolder/mypage.html HTTP/1.1" 403 - "-" "-"

66.154.103.150 - - [07/May/2006:00:20:10 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"
66.154.102.96 - - [07/May/2006:00:20:11 -0700] "GET /Same Folder/Same Page.html HTTP/1.0" 403 - "-" "Gigabot/2.0/gigablast.com/spider.html"

64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
64.62.228.xx - - [18/May/2006:07:41:53 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

216.195.47.xxx - - [23/May/2006:07:13:41 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
216.195.47.xxx - - [23/May/2006:07:13:46 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

64.71.167.xx - - [24/May/2006:09:19:40 -0700] "GET /robots.txt HTTP/1.1" 200 3727 "-" "-"
64.71.167.xx - - [24/May/2006:09:19:43 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

Most of these bots visited the same pages over and over in the exact same order.
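For anyone who wants to look for the same pattern in their own logs, here is a minimal Python sketch, assuming Apache combined log format like the excerpts above. The function names, the 10-second window, and the choice to group clients by their first three octets (because the last octet is masked in these excerpts) are all assumptions made for illustration, not anything taken from the logs themselves.

```python
import re
from datetime import datetime, timedelta

# Matches the leading fields of an Apache combined log line.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>.+?) HTTP/[\d.]+" (?P<status>\d{3}) \S+'
)

def parse(line):
    """Return (prefix, timestamp, path, status) or None if the line does not match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    # Group by the first three octets, since the last one is masked above.
    prefix = ".".join(m.group("ip").split(".")[:3])
    return prefix, ts, m.group("path"), int(m.group("status"))

def robots_then_page(lines, window=timedelta(seconds=10)):
    """List the first request each prefix makes shortly after fetching robots.txt."""
    last_robots = {}  # prefix -> time of its most recent robots.txt fetch
    hits = []
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        prefix, ts, path, status = rec
        if path == "/robots.txt":
            last_robots[prefix] = ts
        elif prefix in last_robots:
            # Only the first request after robots.txt is considered.
            if ts - last_robots.pop(prefix) <= window:
                hits.append((prefix, path, status))
    return hits
```

Feeding it the lines of an access log (in time order) gives a list of (prefix, path, status) tuples. Note that the Gigabot pair above would not be matched by this sketch, because its two requests came from different /24 blocks.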

Any thoughts?
Other than it's time for my medication ;)

Don

 

bull




msg:396077
 10:35 am on May 25, 2006 (gmt 0)

Sorry Don,
I cannot confirm this from my log archives, but several IPs from 64.71.167.* repeatedly attempted to crawl using
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322). (This is a Hurricane E. range and therefore denied anyway.)

Cheers
Jan

wilderness




msg:396078
 4:29 pm on Jun 1, 2006 (gmt 0)

HTTP/1.1" 200 3727 "-" "-"
208.66.195.6 - - [01/Jun/2006:06:31:58 -0700] "GET /Same Folder/Same Page.html HTTP/1.1" 403 - "-" "-"

only log entries.

Interesting in that the backbone of this IP range has the same name as the HE pest.

The actual IP is registered to a Moscow address.

I've added the backbone range to my denies.
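
Given how the syntax error earlier in this thread went unnoticed, it may be worth checking a deny list against known addresses. Below is a rough Python sketch using the standard ipaddress module. The backbone range itself is not given here, so the networks in DENIED are just the two prefixes quoted in this thread written as /24 blocks, purely as stand-ins; is_denied is a hypothetical helper name.

```python
import ipaddress

# Stand-in ranges: prefixes quoted in this thread, assumed to be /24 blocks.
DENIED = [ipaddress.ip_network(n) for n in ("64.71.167.0/24", "68.178.242.0/24")]

def is_denied(ip, networks=DENIED):
    """Return True if the address falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

Calling is_denied with an address that should be blocked, and with one that should not, gives a quick sanity check independent of the server configuration.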

Don

incrediBILL




msg:396079
 6:27 am on Jun 8, 2006 (gmt 0)

The range 68.178.242. is GoDaddy hosting and I've blocked them due to various creepy crawlers from their servers.

What you may be seeing is what I predicted months ago in that scrapers/crawlers are building distributed networks so they can't be caught by downloading too many pages from a single source.
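
One way to look for that kind of distributed activity is to turn the usual question around: instead of counting pages per IP, count distinct source networks per page. The following is only a sketch under assumptions; shared_targets and the threshold of three networks are made up for illustration, and grouping by the first three octets is a crude approximation of a /24.

```python
from collections import defaultdict

def shared_targets(records, min_networks=3):
    """records: iterable of (ip, path) pairs taken from an access log.

    Returns the paths requested from at least min_networks distinct
    three-octet prefixes, mapped to the sorted list of those prefixes.
    """
    nets = defaultdict(set)
    for ip, path in records:
        if path == "/robots.txt":
            continue  # every well-behaved bot fetches this, so it tells us nothing
        nets[path].add(".".join(ip.split(".")[:3]))
    return {p: sorted(n) for p, n in nets.items() if len(n) >= min_networks}
```

A page that shows up here with many unrelated networks and blank user agents is a candidate for a closer look, though legitimate popular pages will show up too.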

AdSense and other monetization programs are a very good incentive for this type of activity.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved