We have a comparison shopping engine with about 12,000 products listed. The "buy now" link used to go to a URL structured like:
/track/productid#
That page just records the IP address of the clicker, the date, the time, and the item that was clicked on. It then automatically redirects to the appropriate page on the merchant's website.
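For anyone picturing it, the handler is conceptually something like this (a rough sketch in Python/Flask, not our actual code; get_merchant_url() and the log format are made-up stand-ins for the real thing):

    # Minimal sketch of a click-logging redirect endpoint (hypothetical).
    from datetime import datetime, timezone
    from flask import Flask, redirect, request

    app = Flask(__name__)

    def get_merchant_url(product_id):
        # Hypothetical lookup of the merchant's landing page for this product.
        return f"https://merchant.example.com/products/{product_id}"

    @app.route("/track/<int:product_id>")
    def track_click(product_id):
        # Record who clicked, when, and on what, then bounce to the merchant.
        with open("clicks.log", "a") as log:
            log.write(f"{request.remote_addr}\t"
                      f"{datetime.now(timezone.utc).isoformat()}\t{product_id}\n")
        return redirect(get_merchant_url(product_id), code=302)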
In robots.txt, we blocked everything in the track directory, since it's pretty useless for search engines to index those pages. While tweaking a few minor things, we decided that the word "track" has some negative connotations. Although few people actually look at the link they're clicking on, we thought some paranoid users might object to the idea that they've been "tracked". So, we decided to change the URL structure to:
/buy/productid#
Today we made that change, and I immediately updated robots.txt to block everything in the buy directory as well.
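For reference, the relevant robots.txt lines at that point looked roughly like this (paths simplified):

    User-agent: *
    Disallow: /track/
    Disallow: /buy/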
Two hours later, I log into the admin section of the site to see how many click-throughs we've gotten today. I see that the number is insanely high and that most of the clicks are coming from a single IP address. Thinking I've got another rogue Chinese bot on my hands, I do a reverse lookup. Lo and behold, it's Googlebot.
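If you want to run the same check, here's a rough Python sketch of the reverse-plus-forward lookup (the usual way to confirm it really is Googlebot, since a user agent alone can be faked; the IP below is just an example):

    import socket

    def is_googlebot(ip):
        # Reverse DNS: genuine Googlebot IPs resolve to *.googlebot.com or *.google.com.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname should resolve back to the same IP.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    print(is_googlebot("66.249.66.1"))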
So, I head on over to my Webmaster Tools account to find that the most recent version of robots.txt they have cached is from 1:00 this morning...before we uploaded the changes. There's also a little message that says, in effect, "Yeah, we download robots.txt about once a day."
So, as of now I've got two separate Googlebots crawling all the parts of my site they're not supposed to be touching. Hopefully, they'll fetch robots.txt again soon and figure out that they've wasted the last 7 hours crawling blocked pages.
Anyway, I hope I don't wind up with some penalty for adding 12,000 pages to my site in the course of an hour, but here's fair warning: if you're going to do something similar, block the new URLs in robots.txt first, wait until Google has fetched the updated robots.txt, and only then make the change.
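A quick sanity check before flipping the switch could look like this (a sketch using Python's stdlib robots.txt parser; the domain and product ID are placeholders). Note it only tells you what your live robots.txt says, so you still have to wait for Webmaster Tools to show that Google has actually picked up the new version:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    # Don't rename /track/ to /buy/ until this reports False for Googlebot.
    print(rp.can_fetch("Googlebot", "https://www.example.com/buy/12345"))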