When clicking on the live URL at Google, even though it's an alltheweb URL it brings up the web page that had been searched for - exactly the page as it is on the site.
It comes out of here, which can't be accessed
[click.alltheweb.com...]
Why isn't that being excluded at alltheweb from robots crawling altogether? It doesn't seem that should be publicly available for crawling at all if the subdomain root is forbidden.
I don't think I could think up a better way to mess up Google's search if I tried. What it amounts to is that alltheweb is putting web pages out there with one of their own URLs that's actually swiped content belonging to someone else.
As a side note, I haven't begun to check out all the details, but the site I was checking out when I found this appears to have a penalty for the homepage at Yahoo.
Is this what you're talking about?
Even better: every result loads as a frameset, with one frame showing ads from the hijacking domain and another frame loading the actual redirect's content. Looks like FAST already did something about it - the redirect frame now loads a FAST error page, "Invalid redirect URL" ...
<added>
It's all coming from one "meta search engine". The search results' listings (in this case FAST's redirect URLs) are apparently cached, mirrored to tons of domains and made available to Google and other crawlers. This is obviously becoming a new sport.
</added>
Yep, ATW should add /urlinfo to the robots.txt.
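For anyone who wants to see what that rule would do, here is a minimal sketch using Python's standard `urllib.robotparser`. The `/urlinfo` path is the one mentioned above; the hostname `www.example.com`, the `ROBOTS_TXT` contents and the `is_crawlable` helper are assumptions made up for illustration, not ATW's actual setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the /urlinfo path comes from the thread,
# the rest is assumed for the sake of the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /urlinfo
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Placeholder host; substitute the real one when checking a live site.
    for url in ("http://www.example.com/urlinfo/some-page",
                "http://www.example.com/search"):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Keep in mind this only tells you what a well-behaved robot would do with the rule; it does nothing about copies that have already been crawled and mirrored elsewhere.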
That's probably enough to make them aware of the problem. However, it might speed things up to contact the sales people at Google - there's money involved ... I have received good feedback in the past when I used my AdWords account to send quality complaints to Google.
If Google one day starts to unintentionally penalize or drop the original version of such mirrored pages instead of the redundant mirror, they might end up penalizing a lot of the best-known sites - including WebmasterWorld and Google itself. If you search Google for exact page titles or URLs from WebmasterWorld and/or Google, you'll find a lot of highly ranked anonymizing proxies that cache every requested page and make the copies available to robots.
A bad side effect of Google's improved spidering of dynamic content, imho. They are presumably already working on a solution at the plex.