
funky Yahoo urls


greennature

5:35 am on Nov 28, 2005 (gmt 0)



I notice that Yahoo spidered and indexed one page on my site 1,200 times by giving it different urls, e.g.,
page.php/directory/directory/page.php
page.php/directory/directory/directory/article.html
ad nauseam.

The page is a single static page called
page.php

Why do all of those different garbage extensions still serve the same page? I've never seen that before. Is there something I can add in .htaccess, or in the page meta tags, so that any url with extra characters or directories after the normal page.php returns a 404 error?

Thus far the only thing that has worked for me was to use a 301 redirect for

/page.php/ [domain...]

It's a two-step process: Slurp now requests the funky urls, gets a 301 redirect, and then lands on a 404 error page.
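
Roughly, the rule looks like this; example.com and the target file here are stand-ins, since the forum truncated my real url above:

# step 1: the funky url gets a 301 pointing at a page that doesn't exist...
RedirectMatch 301 ^/page\.php/ http://www.example.com/no-such-page.html
# ...step 2: Slurp follows the redirect, and that second request gets the 404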

Also, what is the proper name for this phenomenon? I've tried about ten different searches on phrases such as "improper url extensions" and "extra query strings in url" to see if I could find an answer. I cannot even properly identify the problem well enough to search for it.

Thanks.

jdMorgan

4:19 pm on Nov 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you have content-negotiation/MultiViews enabled? If so, Apache looks for a 'best match' between the requested URL and the files on your server.
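
If it turns out to be enabled and your host allows the override, a single line in .htaccess turns it off. This is just a sketch; whether Options overrides are permitted on your account is something you'd have to confirm with Pair (a 500 error on the next request means they aren't):

# disable MultiViews 'best match' content negotiation
Options -MultiViews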

Where did these malformed and invalid URLs come from? That's a question you'll need to research as well. Did they appear on your pages due to a script error and Slurp picked them up from there? Did another webmaster link to your pages incorrectly? Or did Slurp somehow 'invent' them on its own? Do some searches on Yahoo for your site's pages, and see if you can determine what happened. There are too many possibilities to get started on a solution without knowing more about how this happened.

Also, in what way is this problem different from what you posted [webmasterworld.com] about previously?

Jim

greennature

4:33 pm on Nov 28, 2005 (gmt 0)



"Do you have content-negotiation/MultiViews enabled? If so, Apache looks for a 'best match' between the requested URL and the files on your server."

I'm on a shared server from Pair so I do not know if they have it enabled.

Where did it come from? I'm not sure. The page has been up for four years and I never had this problem before. I kept that particular static page in place on my site because it has a DMOZ link from many years ago; I just wrapped it in my site theme.

I have changed the directives in my .htaccess for that particular section of the site a few times, and I'm wondering if that had anything to do with the issue; e.g., at times I had an AddType text/html command included, etc.

How is it different from the problem I previously posted? The previous post concerned a page that was dynamically generated with a mod_rewrite rule attached to it. All I had to do was end the rewrite pattern with a "$" to solve the problem.
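
For illustration only (the pattern and target here are made up, not my actual rule), the difference is just the anchor:

# unanchored: also matches article/123/any/garbage/after.html
RewriteRule ^article/([0-9]+) show.php?id=$1 [L]
# anchored with "$": matches the exact url only
RewriteRule ^article/([0-9]+)$ show.php?id=$1 [L]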

I just discovered that the mod_rewrite change did not fix anything for the static page. Nor should it. More precisely, I discovered I had two different types of url problems.

greennature

5:58 am on Nov 29, 2005 (gmt 0)



Question:

I tried:

RedirectMatch 301 /page\.php/(.*) [domain.com...]

and it does work. The pages resolve to page.php, and the header checker shows a 301 (moved permanently) response for them.

In theory, does this get rid of all 1,200 duplicate content pages if all 1,200 pages now resolve to one single url?

And in theory, will any new urls that Slurp might spider for that page also resolve to the one page?

My goal is to eliminate duplicate content, and I'm not sure whether I do that by sending all of those Yahoo urls to a 404 page... or by using this method and sending all of the duplicate urls to the original page.

jdMorgan

6:32 pm on Nov 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The best approach is to 301 all duplicate pages to whatever (smaller) group of pages represents the non-duplicated content. In other words, if you have 120 duplicates of 10 pages, then you need to redirect each of the 120 duplicates to the appropriate one of the 10 'real' pages.
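
For a single-page case like yours, that 301 can be sketched in one line; substitute your own hostname for example.com:

# send any path-info variant back to the one real page
RedirectMatch 301 ^/page\.php/.+ http://www.example.com/page.php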

You could 404 all of them, but that might lose any link popularity they have accrued. It all depends on how and why this happened.

Jim

greennature

11:16 pm on Nov 29, 2005 (gmt 0)



I see what you are saying.

I decided on a RedirectMatch gone instead.

There were over 1,200 pages that were generated purely by Slurp. They never had links from anywhere else. I wanted Slurp to know that those 1,200 pages, and any more that it might generate, are gone forever, never to return, finito, etc. :)
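
In case anyone lands here later, the whole fix is one line (the pattern is sketched to match my setup, so adjust it for yours):

# return 410 Gone for anything appended after /page.php;
# the bare /page.php keeps serving as before
RedirectMatch gone ^/page\.php/.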