Getting GoogleBot to Spider a Page Now, Before It's Scraped

         

TheMadScientist

7:52 am on Feb 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google has said they are trying to attribute the original source of information better, and I know there have been many threads and comments here about 'hoping GoogleBot gets it before the scrapers do'. I've done some things in the past, like submitting a page and leaving it unlinked, but I've found a much faster method you can use if you aren't set up to 'ping' GoogleBot.

You have to use a bit of mod_rewrite to make sure GoogleBot gets it first, but here's what I've found works ... It's simple:

Tweet it!

Yep, no joke: when you tweet a URL, GBot hits it in a hurry if it's the first time the URL has been seen or it hasn't been seen in a while. So if you have a page, or even a section of a site, that you NEED GoogleBot to get to before the scrapers hit it, so there's a better chance of you being considered the originator of the content, here's what I would consider doing:

1.) Set your .htaccess to forbid access to the page for anything that is NOT GoogleBot.

(You can get more detailed if you want or are worried about spoofing, but this should be fairly effective... It's what I use. Quick, simple, easy.)

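# Return 403 Forbidden for ThePage.ext unless the User-Agent contains "googlebot" (case-insensitive)
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
RewriteRule ^ThePage\.ext$ - [F]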

2.) Upload ThePage.ext
3.) Tweet a link to the page.
4.) Check your logs 2 to 3 seconds (or less) after you tweet.
(Usually by the time I can refresh my stats there's a visit, but once in a while it takes longer.)

5.) Give it a bit and let Google churn on it.

How long you should leave the block in place probably depends on your specific situation... If it's news, you'd probably want to pull the block down as soon as it gets spidered. If it's a section of a site and you tweet the index page and have the whole section blocked and you have the time, maybe give it a few days.
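
For the whole-section case, a rule along these lines might work. This is only a sketch: the directory name new-section/ and the cutoff timestamp are made-up placeholders, and the time-based condition is an optional addition (not part of the tip as described) that lets the block lapse on its own rather than relying on someone remembering to remove it.

```apache
# Sketch only: "new-section/" and the cutoff time below are placeholders.
RewriteEngine on
# Apply only to requests whose User-Agent does NOT contain "googlebot"
RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
# Optional: stop blocking once server time passes 2011-02-26 00:00:00
# (%{TIME} is YYYYMMDDhhmmss; "<" is a lexicographic comparison)
RewriteCond %{TIME} <20110226000000
# Forbid everything under new-section/ (path is relative to this .htaccess)
RewriteRule ^new-section/ - [F]
```

With this in the site root's .htaccess, tweeting the section's index URL should let GoogleBot in while other clients get a 403 for anything under that directory, until the rule is deleted or the cutoff passes. The cutoff is compared against the server's local clock, so check the server's time zone before relying on it.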

Anyway, I hope this idea helps some people who need content spidered right away, or before the scrapers get to it, and aren't set up to ping G. I also hope G is really set on giving original content the thumbs up and removing the dups that are found after, because it would be really cool to write a quality page and get 'extra credit' every time it gets scraped rather than worrying about it being replaced by a copy. (Yeah, OK, I'm dreaming again, but the part about not having to worry about it getting replaced would be enough for me. ;)

[edited by: TheMadScientist at 8:08 am (utc) on Feb 23, 2011]

TheMadScientist

7:56 am on Feb 23, 2011 (gmt 0)



Just a note:

You're probably pretty safe even without the mod_rewrite, as long as the page isn't linked from anywhere else... The normal 'in a blip' set of bots aren't the ones that take the whole page and republish it, as far as I've noticed, but I throw in the mod_rewrite just to be safe.

tedster

8:19 am on Feb 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting tip - don't you worry about messing with your Twitter followers by blocking their access to the link?

TheMadScientist

2:10 pm on Feb 23, 2011 (gmt 0)



It works with any account ... and I have a delete button. ;)

It's not like a page, where the link needs to stay up for the bot to see it; it's the opposite.
The longer the tweet stays up, the more 'nofollow' it becomes, afaik.

Also, in the 'news situation' you could have your .htaccess open, click tweet, refresh stats, delete the code, and save probably in less than 2 seconds if you're good with a mouse and on a fast connection. Not many followers are fast enough to hit the block if you do that, afaik.

ADDED: It's a bit of a 'poor man's ping service'.

[edited by: TheMadScientist at 2:29 pm (utc) on Feb 23, 2011]

goodroi

2:12 pm on Feb 23, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Why not just tweet it and forget about playing user agent blocking?

By tweeting your url, Google will quickly follow the link and give you credit. Plus, having a link posted on Twitter gives you a little ranking boost; combined with your website ranking power you likely can handle any scraper. I don't see a significant need for the user agent blocking.

Even if you use user agent blocking, a scraper bot can simply pretend to be googlebot. In that case, the only people you are blocking are your Twitter followers, who won't be happy (unless you set up a fake Twitter account to avoid upsetting real humans).

Bottom line - I like the twitter tip but not a fan of combining with user agent blocking.

TheMadScientist

2:22 pm on Feb 23, 2011 (gmt 0)



"By tweeting your url, Google will quickly follow the link and give you credit."

We would think so, but Google has a notoriously tough time figuring out origination.

"...combined with your website ranking power you likely can handle any scraper."

Depends on the site and its power relative to that of the would-be scrapers, imo...

"Even if you use user agent blocking, a scraper bot can simply pretend to be googlebot."

1.) You can stop that with more detailed conditions.
2.) I haven't seen one of those from the 'pipe' hit one of my sites.
3.) The longer the link stays up, the bigger a possibility that becomes.
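
On point 1, one way to get 'more detailed' is to stop trusting the User-Agent string and check where the request actually comes from. The sketch below is an assumption, not necessarily the rule the poster has in mind: it uses Apache 2.4's Require host, which does a reverse DNS lookup on the client IP and then a forward lookup to confirm the name maps back to that IP. The googlebot.com and google.com suffixes follow Google's published advice for verifying GoogleBot; ThePage.ext is the same placeholder used earlier in the thread.

```apache
# Sketch for Apache 2.4 (mod_authz_host); in .htaccess this needs
# AllowOverride AuthConfig. Not the rule from the original post.
<Files "ThePage.ext">
    # Allow only clients whose IP reverse-resolves to a host under one of
    # these domains and whose hostname resolves back to the same IP.
    # Everyone else gets 403 Forbidden, whatever User-Agent they send.
    Require host googlebot.com google.com
</Files>
```

This would replace the User-Agent rule rather than sit alongside it, and it admits any crawler hosted under those domains, not only GoogleBot. The DNS lookups add some latency to each request for the file, which is probably acceptable for a block that only stays up briefly. On Apache 2.2 the rough equivalent uses the Order, Deny and Allow directives instead of Require.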

"I like the twitter tip but not a fan of combining with user agent blocking."

There's no rule that says you have to use the block ;)
Glad you think the tip might be useful though.