Forum Moderators: DixonJones


Forged user agents

Is this a site scraper... whose scraper?

         

idoc

9:23 pm on Oct 1, 2004 (gmt 0)

10+ Year Member



This appears to be a site scraper following the linked pages of my site. The interesting thing is that if you plug the i.p. into your browser, it shows content from a well-known website. Traceroute the well-known website, though, and it traces to a different i.p. address than the one in my log. I replaced the file paths with * to obfuscate my site paths. Their i.p. address is blocked in the Apache config, so they get 403s. I have *a lot* of these with numerous user agents. I am not sure what to make of it all yet, though the well-known site does have an affiliate program.

198.65.155.205 - - [01/Oct/2004:16:09:40 -0400] "GET /*/*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; btiv6download)"

198.65.155.205 - - [01/Oct/2004:11:37:20 -0400] "GET /*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; AltaVista 1.00.05)"

198.65.155.205 - - [01/Oct/2004:12:50:51 -0400] "GET /*/*.html HTTP/1.0" 403 1167 "-" "Mozilla/5.0 (Windows; U; Win98; en-US; m18) Gecko/20001108 Netscape6/6.0"

198.65.155.205 - - [01/Oct/2004:17:19:46 -0400] "GET /*/*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"

198.65.155.205 - - [01/Oct/2004:17:20:00 -0400] "GET /*/*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.7 [en] (Win98; U)"

198.65.155.205 - - [01/Oct/2004:12:52:10 -0400] "GET /*/*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95; MSNIA)"

198.65.155.205 - - [01/Oct/2004:12:57:41 -0400] "GET /* HTTP/1.0" 403 1167 "-" "Mozilla/4.7 [en] (WinNT; I)"

198.65.155.205 - - [01/Oct/2004:12:57:41 -0400] "GET /*/*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.7 [en] (WinNT; I)"
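For anyone who wants to check their own logs for the same pattern (one i.p. cycling through many user agents), here is a minimal Python sketch. It assumes the Apache combined log format shown above; the function names and the `min_agents` threshold are made up for illustration, and the regex does not handle escaped quotes inside a user-agent string.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def agents_by_ip(lines):
    """Map each client i.p. to the set of user-agent strings it sent."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the format
            seen[match.group("ip")].add(match.group("agent"))
    return seen


def suspicious_ips(lines, min_agents=3):
    """Return i.p.s that used at least min_agents distinct user agents.

    min_agents is an arbitrary illustrative threshold, not a rule.
    """
    return {
        ip: sorted(agents)
        for ip, agents in agents_by_ip(lines).items()
        if len(agents) >= min_agents
    }


# Three of the entries quoted above, used as sample input.
SAMPLE = [
    '198.65.155.205 - - [01/Oct/2004:16:09:40 -0400] "GET /*/*.html HTTP/1.0" '
    '403 1167 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; btiv6download)"',
    '198.65.155.205 - - [01/Oct/2004:17:20:00 -0400] "GET /*/*.html HTTP/1.0" '
    '403 1167 "-" "Mozilla/4.7 [en] (Win98; U)"',
    '198.65.155.205 - - [01/Oct/2004:12:57:41 -0400] "GET /* HTTP/1.0" '
    '403 1167 "-" "Mozilla/4.7 [en] (WinNT; I)"',
]

if __name__ == "__main__":
    for ip, agents in suspicious_ips(SAMPLE).items():
        print(ip, len(agents), "distinct user agents")
```

A shared proxy or NAT gateway can also show many user agents from one address, so a hit here is a reason to look closer rather than proof of forgery.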

idoc

5:22 pm on Oct 8, 2004 (gmt 0)

10+ Year Member



Another interesting finding from my log files:

198.65.155.205 - - [07/Oct/2004:19:21:22 -0400] "GET /*/*-*-*.html HTTP/1.0" 403 1167 "-" "Mozilla/4.5 [en]C-CCK-MCD CIN.NET (Win95; U)"

207.44.196.107 - - [08/Oct/2004:10:09:52 -0400] "GET /*/*-*-*.html HTTP/1.0" 200 12769 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Visitor 1's i.p. resolves to a copy of the #*$! website but is not the i.p. you get if you traceroute to the domain. Visitor 2's i.p. belongs to a cheap web host, whose entire net block I have now 403'd. Both requested the identical page: not a top-level or common page, but a hyphenated three-keyword second-level page from a large site. Neither sent a referrer or fetched any images, .css files, etc. Coincidence? I doubt it very highly. The question is: is this an affiliate of #*$!, or, if you are a subscriber, can you use them as a proxy? I don't use them... any ideas?
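The "no referrer, no images, no .css" signature can be checked mechanically. The sketch below is only an illustration: it assumes the log has already been parsed into (i.p., path, referrer) tuples, and the function name, the extension list, and the third sample address (from the 192.0.2.0/24 documentation range) are invented for the example.

```python
from collections import defaultdict

# Extensions treated as page assets; an illustrative list, not exhaustive.
ASSET_EXTENSIONS = (".css", ".js", ".gif", ".jpg", ".jpeg", ".png", ".ico")


def page_only_clients(hits):
    """Return i.p.s that fetched pages but never an asset and never sent a referrer.

    hits is an iterable of (ip, path, referer) tuples, where a missing
    referrer is logged as "-" or an empty string.
    """
    stats = defaultdict(lambda: {"pages": 0, "assets": 0, "referred": 0})
    for ip, path, referer in hits:
        entry = stats[ip]
        if path.lower().endswith(ASSET_EXTENSIONS):
            entry["assets"] += 1
        else:
            entry["pages"] += 1
        if referer not in ("", "-"):
            entry["referred"] += 1
    return sorted(
        ip
        for ip, s in stats.items()
        if s["pages"] > 0 and s["assets"] == 0 and s["referred"] == 0
    )


# The two visitors from the log above, plus a made-up ordinary browser
# (192.0.2.10) that loads a stylesheet and sends a referrer.
HITS = [
    ("198.65.155.205", "/*/*-*-*.html", "-"),
    ("207.44.196.107", "/*/*-*-*.html", "-"),
    ("192.0.2.10", "/index.html", "-"),
    ("192.0.2.10", "/style.css", "http://example.com/index.html"),
]

if __name__ == "__main__":
    print(page_only_clients(HITS))
```

Text-mode browsers and some legitimate crawlers behave the same way, so this narrows the list of candidates rather than identifying scrapers outright.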

jdMorgan

5:36 pm on Oct 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is your competition looking for keywords using the services of the site you found at that IP address. The server running the script is in the given hosting provider's network. When a keyword search is done, the script goes out and loads pages from related sites to count word frequency.

Just another annoyance... Block it and forget it.

Jim
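To make the "count word frequency" step concrete, here is a rough Python sketch of what such a keyword tool might do with a page once it has been fetched. This is a guess at the general technique, not the actual script behind that i.p.; the class and function names are hypothetical, and the fetching step is left out on purpose.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags.

    Note: text inside <script> and <style> is collected too; a real
    tool would probably filter those out.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def word_frequency(html, top=5):
    """Return the `top` most common words in the page text with their counts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z0-9']+", text)
    return Counter(words).most_common(top)


if __name__ == "__main__":
    page = (
        "<html><body><h1>Blue widgets</h1>"
        "<p>Cheap blue widgets and red widgets.</p></body></html>"
    )
    print(word_frequency(page, top=2))
```

A tool like this only needs the HTML itself, which would fit the log entries showing no image or .css requests.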

pendanticist

2:17 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



198.65.155.205 - - [19/Oct/2004:17:42:54 -0700] "GET /blah<b>blah</b>blah.html HTTP/1.0" 403 480 "-" "Mozilla/4.75 [en] (Win98; U)"

A search of this IP Number led me to this thread...

Can anyone explain the reasoning behind the html coding located in my log files?

Oh, it's red in my files too.

Thanks.

idoc

3:07 am on Oct 20, 2004 (gmt 0)

10+ Year Member



When I get to the office tomorrow I need to look over these logs again... I am almost certain I saw the bold tags in places in my logs too. I had literally hundreds of logged requests from this one i.p. across three websites. While plugging referring i.p.s into a browser, I also saw a directory-style site with links coded like that (with bold tags). I think the site builder believes the bold tags in the links give an advantage somehow. The interesting thing is that I don't use them, but I also saw them in my referrer logs. I also see requests for dynamic pages, which I don't use either... all of mine are served completely static.

jdMorgan

3:16 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There was a post several years ago about this, if I recall correctly... It looks like they copied a URL from a page with keywords highlighted in the URL. So, it either wasn't a proper "Links page" with normal link-text, or maybe the URL was copied and pasted from a keyword-highlighted list of URLs, not links.

It's possible they copied the link from something like a Google search results page. If they copied and pasted the URL that appears below your page's listing, and they had searched for keywords that appear in your URL, you'd get these bolding tags. But Google's highlighted URLs are green, so it must have been another search engine with a similar feature.

Jim
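If you want to find other requests like this in your own logs, a small Python sketch along these lines could flag request paths that carry pasted-in markup. The regex and function names are assumptions for illustration; the check also decodes percent-escapes, in case the tags arrive as %3Cb%3E rather than as literal angle brackets.

```python
import re
from urllib.parse import unquote

# Matches simple opening or closing HTML tags such as <b> or </b>.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def embedded_tags(path):
    """Return the HTML tags found in a request path, after percent-decoding."""
    return TAG_RE.findall(unquote(path))


def strip_tags(path):
    """Return the percent-decoded path with any embedded HTML tags removed."""
    return TAG_RE.sub("", unquote(path))


if __name__ == "__main__":
    requested = "/blah<b>blah</b>blah.html"
    print(embedded_tags(requested))
    print(strip_tags(requested))
```

Stripping the tags shows which real page the visitor was presumably after, which helps when tracing where the highlighted URL was copied from.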