index.html page redirecting screwing up the search engines

         

Tonearm

3:55 am on Dec 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A few days ago I employed some MOD_REWRITE code in my .htaccess file with the help of jdMorgan. The code redirected the following two URLs:

www.mystore.com/cgi-bin/catalog
www.mystore.com/cgi-bin/catalog/

to this one:

www.mystore.com/cgi-bin/catalog/index.html

and returned a 301.

This was supposed to unify the PageRank and link popularity of the three URLs above.
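
A rule of that kind might look roughly like the sketch below. It is an illustration, not the exact code that was used. It assumes the .htaccess file sits in the document root, that mod_rewrite is enabled and allowed there, and that requests under /cgi-bin/ actually pass through that file (a cgi-bin mapped outside the document root with ScriptAlias would need the rule placed elsewhere).

```apache
# Sketch only: assumes this .htaccess lives in the document root
RewriteEngine On

# Match /cgi-bin/catalog with or without the trailing slash
# and send a permanent (301) redirect to the index.html URL
RewriteRule ^cgi-bin/catalog/?$ http://www.mystore.com/cgi-bin/catalog/index.html [R=301,L]
```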

Ever since I implemented this change, though, Googlebot and ia_archiver have been acting very strangely, from what I've seen in my logs. They hit the robots.txt file after almost every file access, and Googlebot has started requesting GIFs and JPEGs but not a single HTML file. In general, they are not behaving as they did before.

Looking over my logs I realized that I was being indexed in all the search engines based on:

www.mystore.com/cgi-bin/catalog/

and never:

www.mystore.com/cgi-bin/catalog/index.html

So I did a search on Google for "index.html" and not a single result came up that had "index.html" in the URL. Does Google just chop off the index.html and index the URL as a directory no matter what? If that is true, I wouldn't think it would be very happy with my redirecting system.

It would be great if I could get anyone's opinion on this. Thanks!

jdMorgan

5:25 am on Dec 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Tonearm,

"A few days ago"

Realistically, I would expect it to take at least one crawl cycle for any major differences to show, unless the fresh bot visits most of your pages.

"started hitting gifs and jpgs and not a single html file"

Is this Google's image-bot? It sounds like it...

"Looking over my logs I realized that I was being indexed in all the search engines based on:
www.mystore.com/cgi-bin/catalog/
and never:
www.mystore.com/cgi-bin/catalog/index.html"

The 'bot is going to continue to spider based on your old links until it has had a chance to find and digest the new ones.

If you see
"GET /cgi-bin/catalog/ HTTP/1.x 301 ..."
Then the redirect is working, and eventually, it'll come back and ask for /cgi-bin/catalog/index.html
But it will take time - 3 more weeks if you made the change before the last deep crawl and up to 7 if you didn't.

Paraphrasing Douglas Adams in The Hitchhiker's Guide to the Galaxy - "Don't panic!" :)

Jim

Tonearm

3:58 am on Dec 18, 2002 (gmt 0)




jdMorgan, thanks for the advice. I do think you're right about all that, but doesn't it make sense that a redirect moving "www.mystore.com/cgi-bin/catalog" to "www.mystore.com/cgi-bin/catalog/index.html" would drive a search engine wacky if that engine doesn't list index.html URLs and only uses the directory URL in that case? That seems to be what Google does. It seems like maybe I should be redirecting "www.mystore.com/cgi-bin/catalog/index.html" to "www.mystore.com/cgi-bin/catalog" instead of the other way around. What do you think?

jdMorgan

5:02 am on Dec 18, 2002 (gmt 0)




Tonearm,

You can do the redirect either way, but I'd suggest getting your incoming links and your internal links updated to whichever URL you plan to stick with.

However, I would recommend dropping the "index.html" and going with just www.domain.com/ or domain.com/ as the URL for your home page, for the simple reason that it makes the URL simpler to remember and shorter (and therefore makes your pages slightly smaller). It also "concentrates" any benefit to be derived from a keyword-in-domain-sensitive search engine over a shorter string - I have no idea if that's a benefit, but it could be.

I use a redirect on my sites that takes any request for a variant of the "official" domain name and 301-redirects it to, say, www.domain.com/. So, a visitor accessing alternate TLDs such as .org or .net which point to the site, or accessing the site without the "www." in the URL gets redirected to one "standard" domain - www.domain.com/. You can do it either way - with "www." or "www."-less. Whether I "standardize" on www. or not depends on what the site owner wants - I'd tend to leave it off given free choice.
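
A minimal sketch of such a domain-standardizing redirect follows. It assumes mod_rewrite is available in the document root's .htaccess and uses www.domain.com as a placeholder for whichever "official" name you pick; swap the hostname in both places if you standardize on the "www."-less form.

```apache
# Sketch only: www.domain.com is a placeholder for the "official" hostname
RewriteEngine On

# Act only when the requested host is something other than the official one
RewriteCond %{HTTP_HOST} !^www\.domain\.com$ [NC]
# Leave requests with no Host header alone (old HTTP/1.0 clients)
RewriteCond %{HTTP_HOST} !^$
# 301-redirect to the same path on the official hostname
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]
```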

Most server set-ups get confused if you redirect from /index.html to "/". That's because they have been directed to interpret a request for "/" as a request for /index.html, so you end up in a loop. If you are permitted to control this (using DirectoryIndex in mod_dir) on your server, you can tell Apache that the default index file to use for "/" is /index.htm instead of /index.html. Then rename your existing /index.html file to /index.htm and 301-redirect all requests for /index.html to "/". Apache will then serve /index.htm in response to that request for "/", and the 'bots will see the 301 redirect if they do request /index.html. You can actually use any name you like for your "home" page file: index.html, index.htm, main.html, and main.htm are the "standard options", but not required.
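
As a rough sketch of that arrangement, assuming your host lets you set DirectoryIndex and use mod_rewrite from .htaccess, that the file sits in the document root, and that www.domain.com stands in for your real hostname:

```apache
# Sketch only: run this after renaming index.html to index.htm
# Serve index.htm whenever a directory such as "/" is requested
DirectoryIndex index.htm

RewriteEngine On
# Requests for the old file name get a 301 to "/".
# No loop occurs, because "/" now maps to index.htm, not index.html.
RewriteRule ^index\.html$ http://www.domain.com/ [R=301,L]
```

Placed in the document root, this pattern only covers the home page; other directories would need their own rule or a broader pattern.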

The above will work if you really want to solve your problem using a redirect, but by far the easiest thing to do here is to remove the redirects you put in place, and change all the on-site links to request "/" instead of /index.html. Then get the most important sites that link to you to similarly update their links (taking the opportunity to ask for better link text, too, if needed) and then the problem will take care of itself after two Google updates. While you do have to deal with outside parties to implement this solution, it is the "simpler and cleaner" approach.

Jim