Getting GoogleBot to Spider a Page Now, Before It's Scraped

Google has said it is trying to do a better job of attributing information to its original source, and I know there have been many threads and comments here about 'hoping GoogleBot gets it before the scrapers do'. I've done some things in the past, like submitting a page and leaving it unlinked, but I've found a much faster method you can use if you aren't set up to 'ping' GoogleBot.

You have to use a bit of mod_rewrite to make sure GoogleBot gets it first, but here's what I've found works ... It's simple:

Tweet it!

Yep, no joke. When you tweet a URL, GBot hits it in a hurry if it's the first time it has seen the URL or it hasn't seen it in a while. So if you have a page, or even a section of a site, that you NEED GoogleBot to get to before the scrapers hit it, so there's a better chance of you being considered the originator of the content, here's what I would consider doing:

1.) Set your .htaccess to forbid access to the page for anything that is NOT GoogleBot.

(You can get more detailed if you want or are worried about spoofing, but this should be fairly effective... It's what I use. Quick, simple, easy.)

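RewriteEngine on
# Match any request whose user agent does NOT contain "googlebot" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
# ...and answer 403 Forbidden for ThePage.ext
RewriteRule ^ThePage\.ext$ - [F]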

2.) Upload ThePage.ext
3.) Tweet a link to the page.
4.) Check your logs 2 to 3 seconds (or less) after you tweet.
(Usually by the time I can refresh my stats there's a visit, but once in a while it takes longer.)

5.) Give it a bit and let Google churn on it.
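
For step 4, if you'd rather not eyeball the raw log, here's a rough Python sketch that pulls out the GoogleBot requests for the page. It assumes your server writes the common "combined" access log format; the function name, the sample lines, and the IPs in them are made up for illustration, so adjust the pattern to whatever your logs actually look like.

```python
import re

# Assumed layout (Apache "combined" format):
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path="/ThePage.ext"):
    """Return (ip, time, status) for each request to `path` whose
    user agent contains "googlebot" (case-insensitive)."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        if m.group("path") == path and "googlebot" in m.group("agent").lower():
            hits.append((m.group("ip"), m.group("time"), int(m.group("status"))))
    return hits


# Invented sample lines, only to show the shape of the input.
sample = [
    '66.249.66.1 - - [23/Feb/2011:08:01:02 +0000] "GET /ThePage.ext HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [23/Feb/2011:08:01:05 +0000] "GET /ThePage.ext HTTP/1.1" '
    '403 210 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

print(googlebot_hits(sample))
```

A 200 status on a GoogleBot line means the page was actually served; the 403 lines are everyone else bouncing off the block. To run it against a real log, pass an open file object instead of `sample`.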

How long you should leave the block in place probably depends on your specific situation... If it's news, you'd probably want to pull the block down as soon as it gets spidered. If it's a whole section of a site that you have blocked, you tweeted the index page, and you have the time, maybe give it a few days.
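
On the spoofing worry from step 1: the user-agent check lets in anything that merely claims to be GoogleBot. Google's documented way to confirm a crawler is a reverse DNS lookup on the IP, a check that the hostname is under one of Google's domains, and then a forward lookup to confirm the hostname maps back to the same IP. Here's a minimal Python sketch of that check for IPs pulled from your logs. The function names and the suffix list are my assumptions (check Google's current documentation for the domains it uses), and the forward lookup shown here only handles IPv4.

```python
import socket

# Assumed list of hostname suffixes; verify against Google's current docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """True if the hostname sits under one of the assumed Google domains."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def is_real_googlebot(ip, reverse=None, forward=None):
    """Reverse-then-forward DNS check for a claimed GoogleBot IP.

    `reverse` and `forward` are hypothetical hooks so the lookups can be
    replaced in tests; by default they use the standard socket module.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse(ip)
        if not is_google_hostname(hostname):
            return False
        # The hostname must resolve back to the IP we started from.
        return ip in forward(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

Calling `is_real_googlebot(ip)` on an address from a log line does the two live DNS lookups. This runs after the fact on your logs; it doesn't replace the .htaccess rule, which still only looks at the user agent.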

Anyway, I hope this idea helps out some people who need content spidered right away, or before the scrapers get to it, and aren't set up to ping G. I also hope G is really set on giving original content the thumbs up and removing the dups that are found afterward, because it would be really cool to write a quality page and get 'extra credit' every time it gets scraped rather than worrying about it being replaced by a copy. (Yeah, OK, I'm dreaming again, but the part about not having to worry about it getting replaced would be enough for me. ;)

[edited by: TheMadScientist at 8:08 am (utc) on Feb 23, 2011]
