I discovered another Google indexing problem (it doesn't seem to happen with MSN, Yahoo/Inktomi or others) caused by content theft.
Googlebot suddenly started trying to crawl nonexistent links. To give you an idea, let's assume a directory structure:
www.widgets.com
www.widgets.com/blue
www.widgets.com/blue/us
www.widgets.com/blue/france
www.widgets.com/red
www.widgets.com/red/italy
www.widgets.com/red/canada
Google tried to crawl links such as www.widgets.com/blue/us/red/italy/blue/france/widget_order.html
Obviously this page didn't exist, BUT presumably because of the Apache lookback function, the request would ultimately return a 200, resolving to the correct "www.widgets.com/widget_order.html"
I went crazy trying to figure out the source and finally traced it to a site stealing content. Because the page it stole (not the widgets.com index page) did not have a base href tag, all the relative links on the stolen page became screwed up. I don't know if a base href would have solved it (at least the bad links, not the theft), but I took the drastic step of deleting my original page and using the Google removal tool to get it out of the index. Note that Googlebot is still trying to crawl these links, as the removal is still pending.
Point is, I don't know how Google interpreted this (duplicate content? redirects?) but it's another example of other sites being able to screw your PR and SERPs.
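A rough sketch of how relative links can compound into URLs like these, and what a base href would change. This is only an illustration in Python using the standard `urllib.parse.urljoin`. The starting URL (with a trailing slash after widget_order.html), the list of relative links, and the helper names `follow_relative` and `follow_with_base` are all assumptions made up for the example. It is not a claim about how Googlebot or the scraper site actually behaved.

```python
from urllib.parse import urljoin

def follow_relative(start_url, relative_links):
    """Resolve each relative link against the URL reached by the previous one."""
    url = start_url
    for link in relative_links:
        url = urljoin(url, link)
    return url

def follow_with_base(base_href, relative_links):
    """With a base href, every relative link resolves against the same base."""
    return urljoin(base_href, relative_links[-1])

# Hypothetical starting point: the server answers 200 for this URL even
# though everything after widget_order.html is extra path.
start = "http://www.widgets.com/widget_order.html/"

# Hypothetical relative links, as they might appear in the site's pages.
links = ["blue/us/", "red/italy/", "blue/france/", "some_valid_widget_page.html"]

print(follow_relative(start, links))
print(follow_with_base("http://www.widgets.com/", links))
```

Without a base, each hop is appended to whatever path the crawler is already on, so the bogus URLs keep getting deeper. With a base href pointing at the site root, the last link resolves to a normal root-level URL no matter where the page appears to live.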
To be precise, what it would try to crawl is URLs like:
www.widgets.com/widget_order.html/blue/us/red/italy/blue/france/some_valid_widget_page.html
"widget_order.html" being the stolen page.
I assume the Apache lookback would just keep discarding the end of the path until it ended up with the valid "www.widgets.com/widget_order.html", which would return a 200.
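The lookback behavior assumed above can be modeled with a small Python sketch. This is a toy model of the description in this post, not Apache's actual implementation. The function name `resolve_with_lookback` and the set of existing files are hypothetical. For reference, Apache calls the trailing part of such a URL PATH_INFO, and its `AcceptPathInfo` directive controls whether requests carrying it are accepted.

```python
def resolve_with_lookback(path, existing_files):
    """Drop trailing segments until the remaining path names a real file.

    Returns (status, matched_file, extra_path).
    """
    segments = [s for s in path.split("/") if s]
    # Try the full path first, then progressively shorter prefixes.
    for cut in range(len(segments), 0, -1):
        candidate = "/" + "/".join(segments[:cut])
        if candidate in existing_files:
            extra = "/".join(segments[cut:])
            return 200, candidate, extra
    return 404, None, ""

# Hypothetical set of files that really exist on the server.
files = {"/widget_order.html", "/some_valid_widget_page.html"}

bogus = "/widget_order.html/blue/us/red/italy/blue/france/some_valid_widget_page.html"
print(resolve_with_lookback(bogus, files))
```

Under this model the bogus URL is answered with the content of /widget_order.html and a 200, so a crawler has no signal that the URL is invalid. A path where the real file name only appears at the very end, after directories that do not lead to it, finds no matching prefix and gets a 404.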