Indexed AlltheWeb pages causing Google duplicates

Marcia

11:20 pm on Aug 13, 2003 (gmt 0)

It's a big, ultra-long URL containing the results of a search conducted at alltheweb.com, and it turns up at Google with an allinurl: search for sitename.com.

When clicking on the live URL at Google, even though it's an alltheweb URL it brings up the web page that had been searched for - exactly the page as it is on the site.

It comes out of here, which can't be accessed:

[click.alltheweb.com...]

Why isn't that being excluded at alltheweb from robots crawling altogether? It doesn't seem that should be publicly available for crawling at all if the subdomain root is forbidden.

I don't think I could think up a better way to mess up Google's search if I tried. What it amounts to is that alltheweb is putting web pages out there with one of their own URLs that's actually swiped content belonging to someone else.

As a side note, I haven't begun to check out all the details, but the site I was checking out when I found this appears to have a penalty for the homepage at Yahoo.

Yidaki

7:05 pm on Aug 14, 2003 (gmt 0)

Marcia, I tried to replicate what you noticed using the allinurl search. It turns out that these are not links provided by alltheweb but by a bunch of different domains (17,000+ overall results at Google at the moment). All the domains are registered by the same person/company. Obviously another "search engine results for search engine results" spam method. Quite bad, since it's caching redirects. Uhhh, weird ...

Is this what you're talking about?

Even better: every result loads as a frameset, using one frame to show ads coming from the hijacking domain while another frame loads the actual redirect's content. Looks like Fast already did something against it - the redirect frame loads a Fast error page "Invalid redirect URL" ...

<added>
It's all coming from one "meta search engine". The search results' listings (in this case Fast's redirect URLs) are obviously cached, mirrored to tons of domains and made available to Google and other crawlers. This is obviously becoming a new sport.
</added>

mcavic

2:14 am on Aug 15, 2003 (gmt 0)

Marcia,

I don't know if this answers your whole question, but the robots.txt on alltheweb is missing a line. It disallows /search, but doesn't disallow /urlinfo.

Compare these [google.com] two [google.com] Google serps.

mcavic

2:18 am on Aug 15, 2003 (gmt 0)

Oh - this [google.com] is what you're talking about, right?

Yep, ATW should add /urlinfo to the robots.txt.
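To illustrate the point about the missing line, here is a minimal sketch using Python's standard urllib.robotparser. The two robots.txt bodies are assumptions reconstructed from the description above (a file that disallows /search only, and one that also disallows /urlinfo); they are not copies of alltheweb's actual file, and the query strings are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Assumed content: only /search is disallowed, as described in the post.
current_rules = build_parser(
    "User-agent: *\n"
    "Disallow: /search\n"
)

# Assumed fix: the same file with /urlinfo added.
fixed_rules = build_parser(
    "User-agent: *\n"
    "Disallow: /search\n"
    "Disallow: /urlinfo\n"
)

# Hypothetical paths, used only to exercise the rules.
search_path = "/search?q=example"
urlinfo_path = "/urlinfo?q=example"

for label, rules in (("current", current_rules), ("fixed", fixed_rules)):
    for path in (search_path, urlinfo_path):
        allowed = rules.can_fetch("*", path)
        print(f"{label}: {path} crawlable = {allowed}")
```

Under these assumed rules, the /urlinfo path stays crawlable until the extra Disallow line is added, while /search is blocked in both cases.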

Yidaki

7:27 am on Aug 15, 2003 (gmt 0)

mcavic, I think the Fast search result pages are not what Marcia described. They are not producing Google duplicates. Search for allinurl:click.alltheweb.com instead and you'll get the picture ...

Marcia

9:27 am on Aug 15, 2003 (gmt 0)

Exactly, Yidaki. I just got an email originating out of Yahoo corporate, so I did the same type of search on that domain and came up with a few similar results, including the allinurl: search result with that Yahoo domain being inserted onto the resulting page. The kicker on this one is that they're running AdSense on the page.

Yidaki

6:22 am on Aug 22, 2003 (gmt 0)

>The kicker on this one is that they're running AdSense on the page

That's really a kicker. Did you eMail the AdSense team about that?

Marcia

6:32 am on Aug 22, 2003 (gmt 0)

>>Did you eMail the AdSense team about that?

I didn't even think of it, Yidaki. In fact, I should have filled in the form, but I got distracted and forgot. I did send a long email with details to search-quality though to check out what's going on. It was far from a normal thing to be happening.

MarkHutch

6:57 am on Aug 22, 2003 (gmt 0)

Marcia, maybe it's my IE setup, but all those links that come up under that type of search crash my IE browser every time I click on them. Very strange.

Yidaki

6:57 am on Aug 22, 2003 (gmt 0)

>I did send a long email with details to search-quality

That's probably enough to make them aware of the problem. However, it might speed things up to contact the sales people at Google - there's money involved ... I received good feedback in the past when I used my AdWords account to send quality complaints to Google.

Yidaki

12:02 pm on Aug 24, 2003 (gmt 0)

Although it's somewhat disturbing, I doubt that Google will penalize either the original or the mirrored pages. However, it could cause trouble if Google tries to merge the results and keeps the wrong (mirrored) page in its index.

If Google one day starts to unintentionally penalize or drop the original version of such mirrored pages instead of the redundant mirror, they might end up penalizing a lot of the best-known sites - including WebmasterWorld and Google itself. If you search Google for exact page titles or URLs from WebmasterWorld and/or Google, you'll find a lot of highly ranked anonymizing proxies that cache every requested page and make them available to robots.

Bad side effect of Google's improved spidering of dynamic content, imho. They are presumably already working on a solution at the plex.