Forum Moderators: open


Do scrapers forge HTTP Referer?

Wondering what the consensus is...


dataguy

2:01 am on May 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This past week I decided it was time to install a function to block screen scrapers from pulling down massive sections of my sites, mostly by counting page requests per IP address. It's been a bittersweet process, because this adds a lot of overhead to simple page views, but I'm hoping that in the end I will come out ahead because I will have fewer malicious bots making requests on my sites.
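The per-IP request counting described above could be sketched roughly like this. A minimal sliding-window counter; the 60-second window and 30-request threshold are invented for illustration, since the thread gives no actual numbers:

```python
import time
from collections import defaultdict, deque

# Assumed limits for illustration; the thread doesn't state real values.
WINDOW_SECONDS = 60
MAX_REQUESTS = 30

_hits = defaultdict(deque)  # ip -> timestamps of recent requests


def allow_request(ip, now=None):
    """Return True if this IP is under the rate limit, False if it
    should be served a blocked/bogus page instead."""
    now = time.time() if now is None else now
    window = _hits[ip]
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```

In a real deployment this state would live somewhere shared (a database or cache) rather than in process memory, which is part of the per-page-view overhead dataguy mentions.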

I've struggled with this because I hate the thought of possibly blocking a legitimate visitor. It's possible that a visitor really likes one of my sites and goes through it page by page; after so many page requests from the same IP, I start to serve up bogus pages, which would not look good to a serious web user.

It's been a real eye-opener to see just how many bots crawling my sites are not associated with any search engine. Many have user agent strings that make them appear to be normal web surfers, except that a normal surfer couldn't possibly view as many pages as quickly as these bots do.

Anyway, while watching the stats this week, I've wondered how many bots forge the HTTP REFERER so as to appear more legitimate. Would it be a good idea to give different weighting to page requests made without a referrer, or is this too random?
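One way to read the weighting question above is a running per-IP score where referer-less requests count more heavily toward the block threshold. The 1.5 weight is an invented number, purely illustrative:

```python
# Illustrative only: requests without a referer add more to a per-IP
# score than requests with one. The weights here are made up.
def weighted_hit(score, has_referer):
    """Add one request to a running per-IP score."""
    return score + (1.0 if has_referer else 1.5)


score = 0.0
for has_ref in [True, False, False, True]:
    score = weighted_hit(score, has_ref)
# score is now 5.0
```

As a later reply in the thread points out, bookmark users and some browsers also send no referer, so a weight like this penalizes them too.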

Thanks for your thoughts....

SimpleEnigma

3:22 pm on May 31, 2005 (gmt 0)

10+ Year Member



In looking through my own log files, trying to find the referers that send me the most traffic (or any traffic), I have seen referers that are awful scraper sites.

Links to my sites are never included on the page and I find the same format on multiple domains and IPs.

My read on the situation:

Some bots forge the HTTP_REFERER variable to get their web page into your log files. I think they do this on the off chance that your log files or a stats report will link back to their site. I've always thought it was a scam for free link-backs.

If you give a different weight to a page request without an HTTP_REFERER, then you are going to do that for both spiders and people who use bookmarks (on some browsers).

What about putting in a part of the script that will bypass your check if the HTTP_REFERER is from your own domain?
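That bypass could look something like the sketch below. The domain names are placeholders, and note that since the Referer header is entirely client-supplied, a scraper that forges an on-site referer (as the next post reports seeing) sails straight through this check:

```python
from urllib.parse import urlparse

# Placeholder domains for illustration.
OWN_DOMAINS = {"example.com", "www.example.com"}


def referer_is_internal(referer):
    """True if the HTTP_REFERER header points at one of our own pages.
    The header is client-supplied, so bots can forge it."""
    if not referer:
        return False
    host = urlparse(referer).hostname
    return host in OWN_DOMAINS if host else False
```

A safer variant of the same idea is to use an internal referer only to *lower* a request's suspicion score, never to skip the rate limit entirely.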

dataguy

4:19 pm on May 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for your post, I was thinking that I must be the only one dealing with this problem.

What about putting in a part of the script that will by pass your script if the http_referer is from your domain?

I've been watching the logs and I'm seeing quite a few scrapers that show a referrer from my own domain, probably 1 out of 10. I see the link-spam bots as well, but they usually only hit a few pages and then stop. The scrapers often hit thousands of pages, and I assume they would keep going if I didn't shut them out by blocking their IP. Often they'll even keep changing their referrer to other URLs within my domain, which makes it even more difficult.

I've been using my scraper trap for over two weeks now and I can see that I've blocked a few legitimate surfers... the biggest problem seems to come from AOL users as there are apparently quite a few that use the same IP and the exact same user agent string.

I can see how blocking scrapers could become a full-time job, and I'm already not getting my day-to-day tasks completed. The upside is that there is a noticeable reduction in server load and bandwidth usage, even with the added overhead of checking the IP on each page view.

wilderness

6:53 pm on May 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"screen scrapers from pulling down massive sections of my sites"

Perhaps I'm dense ;)
What's a screen scraper?

The majority of us in this forum have been dealing with visitors for some years.

There are a couple of threads which define the use of scripts to stop pests in their tracks:

Ban With a Script
[webmasterworld.com...]

Reduce Harvests (Msg#16)
[webmasterworld.com...]

I'm not sure how to identify a "spidering process" from AOL (with multiple rotating IPs) or visitors that fake IPs.
Any success requires extensive knowledge of both your pages' content and the frequency and interests of your visitors when dealing with unidentified crawls, at least when making these determinations by viewing logs manually.
Still, it's very easy to make a mistake and flag a visitor who has spent hours or days looking for particular words or content, travelling through tons of pages and spending very little time on each page before proceeding to the next.

I frequently find myself going back and re-opening extensive ranges that I had previously denied in disgust.
It's a delicate balance between what both you and your visitors desire and what could be deemed an excess.

Don