Forum Moderators: coopster & phranque


Disguising automated searches

How far to go?

         

sugarkane

11:25 am on Mar 13, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm thinking of writing a simple automated search tool, more as an exercise than anything else. Now, I know that such tools can be a big no-no as far as some SEs are concerned.

The question is, how far to go to disguise the tool? Would a combination of spoofed user agent and simulated real-user timings be enough, or would spoofing the referral string be a good idea as well? Anything else?

littleman

5:25 pm on Mar 13, 2001 (gmt 0)



This is what I do:
1 Rotate UAs per session
2 Rotate IPs at random per session
3 Emulate a browser in the header information
4 Send the proper referrer
5 Variably delay requests (not by much)
<added>
6 I also accept and give back cookies when I do submissions
</added>
I've never had a problem, and at this point I've done a few million SE crawls.
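Steps 1, 3, 4 and 5 above can be sketched roughly as follows. This is a Python illustration using only the standard library, not the LWP code littleman describes. The user-agent strings, the helper names (`new_session`, `build_request`, `next_delay`) and the delay values are all assumptions made up for the example. IP rotation and cookies (steps 2 and 6) are left out.

```python
import random
import urllib.request

# Hypothetical pool of browser strings; a real tool would supply its own.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)",
]


def new_session(rng):
    """Step 1: pick one user agent and keep it for the whole session."""
    return {"user_agent": rng.choice(USER_AGENTS), "last_url": None}


def build_request(session, url):
    """Steps 3 and 4: browser-like headers plus the previous URL as referrer."""
    headers = {
        "User-Agent": session["user_agent"],
        "Accept": "*/*",
        "Accept-Language": "en",
    }
    if session["last_url"]:
        headers["Referer"] = session["last_url"]
    session["last_url"] = url
    # Only builds the request object; nothing is sent here.
    return urllib.request.Request(url, headers=headers)


def next_delay(rng, base=60.0, jitter=10.0):
    """Step 5: a base delay in seconds, varied a little at random.

    The 60 and 10 second defaults are assumptions, not figures from the list.
    """
    return base + rng.uniform(-jitter, jitter)
```

A caller would create one session per run, call `build_request` for each query, and sleep for `next_delay(rng)` seconds between requests.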

sugarkane

10:58 pm on Mar 13, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you do this using the various LWP / HTTP etc modules or go a little deeper?

littleman

11:06 pm on Mar 13, 2001 (gmt 0)



No, that is about as deep as I get :)
I've done some stuff via IO::Socket, but it is all built into LWP.
One thing that was driving me crazy with LWP was finding the documentation on using and flushing cookies. I finally got it right through trial and a lot of error.
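For readers who want to see the idea of keeping and flushing cookies, here is a minimal Python sketch with the standard `http.cookiejar` module. It is not the LWP code discussed here (in LWP the equivalent pieces are, as far as I know, the `HTTP::Cookies` module and the user agent's cookie jar). The function names `make_cookie_session` and `flush_cookies` are hypothetical.

```python
import http.cookiejar
import urllib.request


def make_cookie_session():
    """Return an opener that stores cookies it receives and sends them back."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def flush_cookies(jar):
    """Forget every stored cookie, e.g. at the start of a new session."""
    jar.clear()
```

Requests made through `opener.open(url)` would then accept cookies and give them back on later requests until `flush_cookies(jar)` is called.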

bobriggs

3:55 am on Apr 2, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



littleman:

1. How to rotate IPs? (wow!)
(Or is that just a dialup [dynamic] IP changing by itself?)

2. What is LWP?

TIA

Brett_Tabke

4:24 am on Apr 2, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Dialup IP, Bob.
LWP = the libwww-perl module.

I never could get the LWP cookies figured out last year, so I went to all socket-based code. The only trouble I ever had there was with HotBot's MS boxes. I finally figured out that if I sent an "Accept: */*" line as part of the header, it would work fine (just something about the way I'd set up the code).
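To make the header point concrete, here is a small Python sketch of a hand-built request of the kind a socket-based tool would send, including the `Accept: */*` line. The function name `build_raw_get`, the HTTP/1.0 request line and the default user-agent string are assumptions for illustration, not Brett's actual code.

```python
def build_raw_get(host, path, user_agent="ExampleClient/0.1"):
    """Return the bytes of a minimal GET request with an Accept: */* header."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        "Accept: */*",
        "",  # blank line that ends the header block
        "",
    ]
    # HTTP header lines are separated by CRLF.
    return "\r\n".join(lines).encode("ascii")
```

The returned bytes could be written to a plain TCP socket connected to port 80 on the host.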

If you do write an automated tool, do try to stay respectful in the number of searches you run per day. As someone who runs many sites that are targets for spiders, I can sympathize with the SEs on the issue. I don't know what an acceptable figure is, but try to keep it under 1 request a minute. Slow and steady in off hours works best. Let it run while you sleep. I have a personal limit of 500 a day max per engine. Most days I don't even run it.
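The limits mentioned above (no more than 1 request a minute, 500 a day per engine) could be enforced with a small pacing helper. The Python sketch below is one possible way to do it. The class name `Pacer`, its methods and the choice to count days in UTC are assumptions made for the example.

```python
import time


class Pacer:
    """Enforce a minimum gap between requests and a daily cap for one engine."""

    def __init__(self, min_interval=60.0, daily_cap=500, clock=time.time):
        self.min_interval = min_interval  # seconds between requests
        self.daily_cap = daily_cap        # max requests per (UTC) day
        self.clock = clock                # injectable for testing
        self.day = None
        self.count = 0
        self.last = None

    def _roll_day(self, now):
        # Reset the counter when a new UTC day starts.
        day = int(now // 86400)
        if day != self.day:
            self.day, self.count = day, 0

    def wait_time(self):
        """Seconds to wait before the next request, or None if the cap is hit."""
        now = self.clock()
        self._roll_day(now)
        if self.count >= self.daily_cap:
            return None
        if self.last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - self.last))

    def record(self):
        """Call right after a request has been sent."""
        now = self.clock()
        self._roll_day(now)
        self.last = now
        self.count += 1
```

A loop would call `wait_time()`, stop for the day if it returns `None`, otherwise sleep that long, send one search, then call `record()`. Keeping one `Pacer` per engine matches the per-engine limit.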

bobriggs

5:17 am on Apr 2, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, now I get it: LWP, with you guys being on *nix machines. I'm on a Win box, but I do have a secure shell account on a *nix box. I can duplicate 1, 3, 4, 5, and 6 on my Windows box, but rotating IPs is out of my control. Can this be handled in Unix with LWP?

littleman

5:45 am on Apr 2, 2001 (gmt 0)



It all could be done in Windows too. It is relatively simple to use public proxies. If you have only one IP, just use someone else's.

jhee

5:13 pm on Apr 3, 2001 (gmt 0)



Mr. Kane,

Is this search tool for finding keywords, or for finding how your competition is ranked?

I am not understanding why one would do this.

-jhee

sugarkane

10:11 pm on Apr 3, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi jhee, and welcome to WebmasterWorld.

I'd be using it for two things: firstly, to check my own rankings for a range of keywords on various sites (I'm totally disorganised about keeping a watch on my rankings, so it'd be nice to automate the process), and secondly, to gather information about the search results for a large number of randomly chosen and popular keywords. This would provide the raw data for analysis to try and figure out an engine's algo.

(Plus, of course, I'd have fun doing the scripting ;)
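The rank-checking half of that plan comes down to finding where a site first appears in an ordered list of result URLs. A minimal Python sketch, assuming the result URLs have already been extracted from the results page (that parsing step is not shown, and the name `find_rank` is hypothetical):

```python
from urllib.parse import urlparse


def find_rank(result_urls, domain):
    """Return the 1-based position of the first result on `domain`, or None."""
    domain = domain.lower()
    for position, url in enumerate(result_urls, start=1):
        host = (urlparse(url).hostname or "").lower()
        # Match the bare domain and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return position
    return None
```

Running this once per keyword and logging the returned positions over time would give the kind of ranking record described above.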