Forum Moderators: Robert Charlton & goodroi


Companies running a business spidering Google - how and why?

         

pavlovapete

12:27 am on May 19, 2009 (gmt 0)

10+ Year Member



Just came across another "report" from a business that "crawled" Google results "for x thousand keywords" across the US as part of a survey.

Yesterday I got an email from a business that has "servers across the US" working out which keywords businesses are ranking for.

The other day I was looking at a huge database of adwords advertisements and keywords used by businesses worldwide.

What is going on?

Obviously (to me) these are the products of automated queries - supposedly something Google will not allow.

How are these "crawling" companies hoping to build a long term business based on activities that are outside the Google TOS?

I'm not understanding this at all.

tedster

1:23 am on May 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "why" seems clear to me - they feel there is commercial value in the intelligence, especially for large aggregated data-sets.

Whether this can be a viable long-term business model, well, that's anyone's guess. And how do they do it? Well, I'd say the phrase "servers across the US" gives a big clue: many servers on dissimilar IP addresses, throttled way back to avoid detection.
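The approach described above can be sketched roughly as follows. This is only a speculative illustration of "many servers, throttled way back," not anyone's actual system; all function names, pool sizes, and delay values here are invented assumptions.

```python
# Hypothetical sketch: spread a keyword list across many servers/IPs,
# and give each query a randomized delay so no single IP issues
# queries fast enough, or regularly enough, to stand out.
# All names and numbers below are invented for illustration.
import random


def split_keywords(keywords, num_servers):
    """Round-robin the keyword list across the server pool so no
    single IP carries enough volume to look automated."""
    buckets = [[] for _ in range(num_servers)]
    for i, kw in enumerate(keywords):
        buckets[i % num_servers].append(kw)
    return buckets


def throttled_schedule(bucket, min_gap=45.0, max_gap=180.0):
    """Pair each query with a randomized pause (in seconds) so the
    cadence from any one IP never looks machine-regular."""
    return [(kw, random.uniform(min_gap, max_gap)) for kw in bucket]
```

With, say, 10,000 keywords split over 200 servers, each IP would issue only 50 queries, spaced minutes apart — the kind of volume a single human searcher might plausibly generate.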

ogletree

1:33 am on May 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you know what you are doing, it is easy to spider Google. All Google does is fight amateurs; the pros will always get what they need. There are people who have spidered Google every day for years.

pavlovapete

4:52 am on May 19, 2009 (gmt 0)

10+ Year Member



I guess I wonder why Google lets these businesses continue. It would be pretty easy to prove that the data was coming from automated queries, wouldn't it? Alternatively, couldn't Google release the data themselves and become the authoritative source of aggregated reports? Collectively, I'd assume these businesses are making a lot of money. Not to mention the computational load.

throttled way back
...
every day for years

- that makes me shiver. I can't imagine being banned from using Google.

Maybe I am naive about how clever they are?

fishfinger

10:27 am on May 19, 2009 (gmt 0)

10+ Year Member



It would be pretty easy to prove that the data was coming from automated queries wouldn't it?

I know nothing about how it's done, but I'd imagine that you make sure that you

(a) don't fire off your queries too fast, and
(b) have a huge number of random IPs (either non-existent or spoofing genuine ones)

Impossible to spot if you mix it up.
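Points (a) and (b) together amount to randomizing everything per request. A minimal sketch of that "mix it up" idea might look like this — the proxy pool, user-agent strings, and delay range are all placeholder assumptions, not real values from any scraper:

```python
# Hypothetical "mix it up" sketch: for every query, pick a random
# exit IP and user agent, and a randomized pause, so no per-IP or
# per-agent pattern emerges. All values below are placeholders.
import random

PROXIES = ["10.0.0.%d" % i for i in range(1, 6)]   # placeholder IP pool
USER_AGENTS = ["UA-one", "UA-two", "UA-three"]     # placeholder strings


def next_request_plan(rng=random):
    """Return a randomized (proxy, user agent, delay) plan for the
    next query, addressing both (a) pacing and (b) IP diversity."""
    return {
        "proxy": rng.choice(PROXIES),
        "user_agent": rng.choice(USER_AGENTS),
        "delay": rng.uniform(30.0, 120.0),  # seconds before firing
    }
```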

I've been temporarily blocked from Google for skimming through SERPs too fast in too short a space of time.

Perhaps they have a threshold of so many queries per IP per 24-hour period too. But anything that is programmed can be measured and counter-programmed.

I'd imagine these guys are constantly testing the limits with sacrificial bots that occasionally get caught, then making sure their production bots run below that threshold.
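That probe-then-back-off idea reduces to one line of arithmetic: run production well below the lowest rate at which a sacrificial bot ever got blocked. A hedged sketch (the function name and safety factor are invented for illustration):

```python
# Hypothetical sketch: sacrificial probe bots record the query rates
# (queries/hour) at which they got blocked; production bots then run
# at a safety fraction of the lowest observed blocking rate.
# Name and default factor are invented assumptions.
def safe_rate(observed_block_rates, safety_factor=0.5):
    """Return a production query rate comfortably below the lowest
    rate at which any probe bot was ever blocked."""
    return safety_factor * min(observed_block_rates)
```

For example, if probes got blocked at 80, 100, and 120 queries per hour, production bots would run at 40 — half the worst case — and the limit would be re-measured as Google's counter-measures change.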