Accidental Duplicate content - Now Redirect Issue

Google didn't obey my robots.txt file. Spidered both cgi-bin and PHP...

         

johnpinochet

4:03 pm on Oct 12, 2004 (gmt 0)

10+ Year Member



Google did not obey my robots.txt file.

They spidered my cgi-bin Perl script, which displays a database-driven directory, AND they spidered a later-developed PHP version (not in the cgi-bin). The PHP version is 1000 times better in terms of performance, the way it presents the URL, and the META tags for each page. I developed it myself.

Bottom line: after 3 months of incredible traffic on the PHP directory, all of my PHP pages suddenly disappeared from Google. I couldn't figure out why, as the old Perl script can no longer be reached from any link on my existing site, so I mistakenly thought there was no duplicate content. However, after more research, I discovered that the original Perl directory script pages remain in Google. Sure enough, the XX,XXX pages of the Perl version exactly matched the XX,XXX pages of the PHP version just before the PHP pages got pulled from the Google index.

So, now what? Do I ask that the Perl version pages get removed, risking that the PHP version pages may never get spidered again?

Can I simply redirect each Perl version page to its corresponding PHP version? If so, how do I do it to avoid the redirect pitfalls mentioned here and in other places?

Please help! This has really put me in a bind, as I went from a rather hefty monthly paycheck (almost what I make at work) to essentially zero.

Thanks in advance!

jdMorgan

4:24 pm on Oct 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



johnpinochet,

Welcome to WebmasterWorld!

I'd recommend that you remove the PERL script from your server, or rename it and mark its file permissions as inaccessible.

Then redirect the Perl-driven URLs to the corresponding PHP-driven URLs using 301 Moved Permanently redirects.

The exact implementation details will depend on your server type, and the URL and directory structure of your site.

There are known problems with some search engines' handling of redirects, but we can only do what we can do.

Jim
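
To make the one-to-one redirect idea concrete, here is a minimal Python sketch of the URL mapping only. Both URL schemes are hypothetical (the thread never gives the real ones): the old script is assumed to live at /cgi-bin/directory.pl with cat and page query parameters, and the new pages at /directory/<cat>/<page>.php. The function names are made up for illustration, and in practice the same mapping would more likely be expressed in the web server's own redirect configuration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL schemes, assumed for illustration only:
#   old: /cgi-bin/directory.pl?cat=widgets&page=2
#   new: /directory/widgets/2.php
OLD_SCRIPT_PATH = "/cgi-bin/directory.pl"


def redirect_target(old_url):
    """Return the new PHP-style path for an old CGI URL, or None if it does not map."""
    parts = urlsplit(old_url)
    if parts.path != OLD_SCRIPT_PATH:
        return None
    query = parse_qs(parts.query)
    cat = query.get("cat", [""])[0]
    page = query.get("page", ["1"])[0]
    # Reject anything that is not a plain category name and page number.
    if not cat.isalnum() or not page.isdigit():
        return None
    return f"/directory/{cat}/{page}.php"


def redirect_response(old_url):
    """Return (status, headers): a 301 with a Location header, or a 404 if unmapped."""
    target = redirect_target(old_url)
    if target is None:
        return 404, {}
    return 301, {"Location": target}
```

The point of the sketch is that every old URL maps to exactly one new URL and is answered with a permanent (301) status, rather than sending everything to the home page.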

Hester

8:49 am on Oct 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google did not obey my robots.txt file.

Did you make sure it was 100% valid?

Webmaster World do a robots.txt validator. I thought I'd bookmarked it but can't find the link. Anyone?

johnpinochet

9:19 am on Oct 13, 2004 (gmt 0)

10+ Year Member



Hester and JDMorgan,

Thanks for the advice and the questions.

Yes, I copied the robots.txt format of Google itself, with my custom changes, so the CGI script and resulting directory pages should never have been spidered to begin with. In fact, that is why I went the PHP route: to create the appearance of static pages from a dynamic site so that my site would be more search engine friendly. It never occurred to me that Google would consider my site as having two duplicate directories, as the CGI-generated directory pages were never meant to be spidered and this was correctly indicated in the robots.txt file. As stated earlier, when the number of PHP-generated directory pages reached the number of CGI-generated directory pages, my newer PHP-generated pages were dropped.

Looks like I'll do the 301 redirect... can't say for sure though. I have XX,XXX pages from the CGI script spidered. It is scary, to say the least, to let them go and redirect them to my PHP version, when it is the PHP version that got dropped.

I might just work on modifying the cgi version to mimic the PHP version. This is a shame as the cgi version is sloppy compared to the PHP version.

Hester

9:39 am on Oct 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, I copied the robots.txt format of Google itself.

This is no guarantee of validity. Google have been criticised for poor HTML markup on their home page before. Maybe their spider found the custom changes you made to your robots.txt file slightly invalid and failed to process them? Why else would they ignore a robots file when they say they don't?

I'm not getting at you, just trying to define this one. I ran my own robots.txt through the Webmaster World validator once and found it wasn't valid! It was just a slight tweak I had to make.

I'd also look at the robots.txt forum here. This thread shows how a missing slash let Google in:

[webmasterworld.com...]
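
As a rough local check of the kind of slip described above, the sketch below uses Python's standard urllib.robotparser to compare two rule sets. The robots.txt contents and the test path are assumptions made up for illustration, and the example assumes the missing slash was the leading one (the post does not say which). Python's parser is also not Googlebot, so a result here only shows how one parser reads the file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="Googlebot"):
    """Return True if this parser would let `agent` fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical rule sets: the second one drops the leading slash.
WITH_SLASH = "User-agent: *\nDisallow: /cgi-bin/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: cgi-bin/\n"

# Hypothetical script path used only for this check.
TEST_PATH = "/cgi-bin/directory.pl"

blocked_as_intended = not allowed(WITH_SLASH, TEST_PATH)
let_in_by_typo = allowed(WITHOUT_SLASH, TEST_PATH)
```

With this parser, the first rule set blocks the path while the second does not, because "cgi-bin/" never matches a path that begins with "/".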

Hester

8:50 am on Oct 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ah wait, found that link to the validator. Sorry, it wasn't from Webmaster World after all! But here it is anyway:

[searchengineworld.com...]

Highway61

11:38 pm on Oct 15, 2004 (gmt 0)

10+ Year Member



Hester,

Thanks for the info. I went to the Google "Remove pages/directories" form and entered a lot of pages and directories. /cgi-bin/ has been removed.

site:mydomain.com

reflects the removal of /cgi-bin/ plus the other pages.