Forum Moderators: Robert Charlton & goodroi

Header response for folder link structure, 301 or 404?

adresanet

6:13 pm on Jan 8, 2015 (gmt 0)

10+ Year Member



Hi,

For one of my sites I have the following link structure:

www.site.com/folder/subfolder

At www.site.com/folder I have nothing to show and it shouldn't be indexed by googlebot, but it is indexed.

How should I proceed?

1. 301 redirect to homepage
2. return 404 error page and header

I have the same question for www.site.com/folder/WRONG_Indexed-Subfolder

Thanks for helping

lucy24

7:30 pm on Jan 8, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At www.site.com/folder I have nothing to show and it shouldn't be indexed by googlebot, but it is indexed.

If the page doesn't exist, what content is indexed? What do you see when you request the URL? If your site is returning valid content for an invalid URL, search-engine indexing may turn out to be the least of your problems.

Are you using an off-the-shelf CMS, or are you manually rewriting URLs?

For comparison purposes: if every / in your URLs represented a real, physical directory, and the site didn't do any rewriting, a request for
/folder/subfolder
would first be redirected to
/folder/subfolder/
and then if there's no physical index page, you would get either an auto-generated index or a 403, depending on your site settings. So if you wanted the site to look as if everything was a real, physical file, that's the approach you would take.

1. 301 redirect to homepage

A redirect to the home page is generally the absolute last resort, and will lead to search-engine accusations of "soft 404". It will also lead to annoyed (human) users, since they'll have no way of knowing whether they simply made a mistake in their original request. Stick with a 403 or 404, depending on preference.
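
If the URLs are rewritten by hand in .htaccess, a hard 404 for the bare folder might look like the sketch below. It assumes mod_rewrite is enabled, that the rules live in the site-root .htaccess, and that this rule sits above whatever rules hand requests to the script; "folder" is just the placeholder name from the question.

```apache
# Sketch only, not a drop-in rule set.
RewriteEngine On

# Answer /folder and /folder/ with 404 Not Found.
# The "-" substitution means "do not rewrite the URL".
RewriteRule ^folder/?$ - [R=404,L]

# Alternatives (use one of these INSTEAD of the rule above):
# 410 Gone
# RewriteRule ^folder/?$ - [G]
# 403 Forbidden
# RewriteRule ^folder/?$ - [F]
```

The closing $ keeps the rule from touching /folder/subfolder, which is the URL that should stay reachable.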

adresanet

7:44 pm on Jan 8, 2015 (gmt 0)

10+ Year Member



The script is hand-made and the links are rewritten through the .htaccess file.
www.site.com/folder/subfolder is a virtual path. folder doesn't really exist.

www.site.com/folder doesn't exist, but in WMT under Crawl Errors, in the Not Found section, I see it listed with Response Code 404.

www.site.com/folder/subfolder is the correct link and it is indexed.
I just noticed that www.site.com/folder/subfolder/ does a 301 redirect to the homepage, and I don't think that's the best choice.

not2easy

8:42 pm on Jan 8, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What lucy24 told you is correct. Don't 301 to the root, because a 301 is supposed to deliver the content that was at one URL and can now be found at another. A "soft 404" is far worse than a real 404, which is natural.

Don't worry about seeing 404s; just mark them as fixed and prevent future 404s by disallowing the folder in robots.txt. I am guessing they find that URL by following links on your site to the virtual directories. It is best to block crawling of folders that don't really exist. You don't want to index internal search result pages that are generated by an action but don't physically exist anywhere.

adresanet

8:58 pm on Jan 8, 2015 (gmt 0)

10+ Year Member



Ok, I understand. Thanks!

Can having 2000 pages reported as 404 in WMT be a problem on a site with 300,000 indexed pages?

not2easy

10:48 pm on Jan 8, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



404s happen as a natural part of how the internet works: pages get outdated, old pages go away. There is no penalty for 404s. It is just to your benefit to mark them as "fixed" so they don't stay in your GWT account. That makes it easier to see new 404s and verify that you are aware of them by marking them as fixed.

The process for marking them fixed is a little bothersome as they will only let you mark off 1000 a day. Be sure to look for the drop down menu just above the listings and change their default of 25 pages to a more reasonable number.

adresanet

5:51 pm on Jan 9, 2015 (gmt 0)

10+ Year Member



For a photo gallery, what is the best way to proceed?

Physically I have www.site.com/main-photo-folder/photo-folder-1/photo1.jpg

When googlebot tries to crawl www.site.com/main-photo-folder/ what header should I return? 404, 410 or 403?

I have the same question for www.site.com/main-photo-folder/photo-folder-1/ ... www.site.com/main-photo-folder/photo-folder-20/

Thanks!

lucy24

8:15 pm on Jan 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is
/main-photo-folder/photo-folder-1/

a real, physical directory with no index page? Have you set
Options -Indexes

(or equivalent in IIS)? If so, a request for the bare directory should automatically return a 403 and you need not take any further action.

adresanet

9:49 am on Jan 10, 2015 (gmt 0)

10+ Year Member



Well, I was aware of Options -Indexes. Thanks for the tip.

Now, if I add that line to .htaccess, my rankings in Google Images won't be affected, right?

lucy24

10:12 am on Jan 10, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It shouldn't have any effect whatsoever. The imagebot only asks for images, so it will never "know" what else is going on. Besides, if an image is going to come up in search, you want it associated with a real page, not an auto-generated directory index.

If, for some reason, you don't want to set
Options -Indexes

for the whole site, then make a supplementary htaccess for the
/main-photo-folder/

directory, containing only this directive. The setting will be inherited downward through all subdirectories.
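
As a minimal sketch, the supplementary file would hold a single line. This assumes the server configuration lets .htaccess files change Options (an AllowOverride setting that includes Options, or All); if it does not, the directive will cause an error rather than take effect.

```apache
# File: /main-photo-folder/.htaccess
# Turn off auto-generated directory listings for this directory
# and every subdirectory below it.
Options -Indexes
```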

adresanet

7:53 pm on Jan 14, 2015 (gmt 0)

10+ Year Member



Until now I didn't have Options -Indexes set, and now I saw in GWT that many clicks from google image search were going to /main-photo-folder/photo-folder-1/ and those weren't counted in Google Analytics.

The images in the folders are displayed through listimages.php.

Is there any way to write a rule in .htaccess that checks whether the requested URL is /main-photo-folder/photo-folder-1/ and makes the necessary redirects?

If I use
RewriteRule main-photo-folder/(.*)$ /photo-folder-handle.php?$photo-folder1=$1
my listimages.php can't display the images from that path.

lucy24

8:32 pm on Jan 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



now I saw in GWT that many clicks from google image search were going to /main-photo-folder/photo-folder-1/ and those weren't counted in Google Analytics

Best guess: If auto-indexing is enabled, google has been crawling
/main-photo-folder/photo-folder-1/

which gives it a listing of all image files that live in that directory. For reasons best known to itself, it has decided that the individual images "belong" to this auto-generated index page, instead of to the page they really live on. (Uh... the images are used by some actual, crawlable page, right?) People who find a picture in image search and nicely choose to view the whole "page" will then find themselves on the auto-generated directory index instead of the page you want them to see. Since the page was auto-generated, it doesn't include Analytics code.

Here's one possible remedy; you'll need some further tweaking:

RewriteCond %{HTTP_REFERER} google
RewriteRule ^main-photo-folder/photo-folder-\d+/$ http://www.example.com/my-new-page.php [R=301,L]


This says: If someone got to this now-nonexistent page from google, redirect them to this other page that you've just created for their benefit. It can be your existing listimages.php if appropriate. Note the $ closing anchor in the rule; here it's essential because you only want to redirect requests for the directory itself, not for the images it contains.

You may also need to redirect the googlebot itself. It depends on whether all images are already reachable from some other crawlable page. If they are, you probably don't need to do anything.
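
If the crawler does need redirecting too, one way to sketch it is a second condition joined with [OR], so the rule fires either for visitors arriving from Google or for requests that identify themselves as Googlebot. The user-agent test is an assumption added here for illustration; a User-Agent header can be faked, so treat this as a convenience and not as verification.

```apache
RewriteEngine On

# Condition 1: the visitor followed a link from a Google page ...
RewriteCond %{HTTP_REFERER} google [NC,OR]
# Condition 2: ... or the requester calls itself Googlebot.
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]

# Only the bare directory is redirected; the closing $ leaves
# requests for the image files themselves alone.
RewriteRule ^main-photo-folder/photo-folder-\d+/$ http://www.example.com/my-new-page.php [R=301,L]
```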



Edit: I got curious and checked my own logs. No search engine has ever asked for anything ending in /images/ (I've got plenty of directories with this name, each containing only image files). They do periodically ask for hypothetical files such as
/directory
and
/directory/index.html
where they know the "correct" URL
/directory/

As a cross-check, I looked for any cases of the googlebot (the real one from 66.249.blahblah) getting a 403 response. Nothing there either. So they don't just blindly follow any and all URLpaths.

So it would be interesting to know how Google first thought of asking for these image directories. Do they contain any files other than images?

adresanet

8:51 pm on Jan 14, 2015 (gmt 0)

10+ Year Member



Well, under main-photo-folder I have many photo-folder-N.
I need to get the name of the folder and from my-new-page.php I want to redirect the user to the corresponding page.

I need something like
www.example.com/main-photo-folder/photo-folder-1
to be handled as
www.example.com/main-photo-folder/my-new-page.php?folder=photo-folder-1

At the same time, keeping Options -Indexes in the .htaccess returns 403 Forbidden.
If I make a 301 redirect as in your rule ([R=301,L]), what will googlebot understand and how will it act?

So it would be interesting to know how Google first thought of asking for these image directories. Do they contain any files other than images?


I am not sure. Long time ago I installed an auto-generating sitemap script and I think it gave me tons of problems and duplicate pages, combined with bad mod_rewrite rules and so on...

lucy24

9:14 pm on Jan 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you want to capture the exact request and send it along to your php page, it would look something like

RewriteCond %{HTTP_REFERER} google
RewriteRule ^main-photo-folder/(photo-folder-\d+)/$ http://www.example.com/my-new-page.php?folder=$1 [R=301,L]


where \d+ means "one or more numerals".

Come to think of it, you may not even need the condition. It may be easier to point all directory requests to the same place.

Are you worried about a conflict between the redirect created by your new RewriteRules, and the 403 created by the -Indexes directive? It's OK: the redirects will kick in before mod_dir gets a chance to do its thing.

Long time ago I installed an auto-generating sitemap script

Oh, yuk. If the script lists all directories, including ones that contain no pages, it's totally possible this is where the Googlebot first learned of these URLs. You may be able to get information from WMT after they start reporting your new 403s. Sometimes they say where the offending page was linked from; see if it says "in sitemap". You could also open your sitemap-- the physical file-- and see if it lists anything along the lines of /photo-folder-1/. See if you can tweak the script so it only lists directories that contain a physical "index.html" (or "index.php") file.

adresanet

9:30 pm on Jan 14, 2015 (gmt 0)

10+ Year Member



I'm sorry I didn't mention it, I thought you would assume it, but photo-folder-1 stands for names made up of different words. I need to capture the whole folder name, not only the numbers. Many don't even have numbers in them,
e.g. 2012-blue-pictures, moon-2012-july, and so on. All these folders are in main-photo-folder.

If no rule can be found, one solution is to create an index.php in each photo-folder (I will do it dynamically), but I still don't know what header response I should send to real visitors and to googlebot.

lucy24

12:40 am on Jan 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All these folders are in main-photo-folder.

Does main-photo-folder contain any folders that are not photo-folders? If yes, are there just a few of them?

RewriteCond %{REQUEST_URI} !^/main-photo-folder/(realfolder|otherfolder)
RewriteRule ^main-photo-folder/([^/]+)/$ http://www.example.com/my-new-page.php?folder=$1 [R=301,L]


If main-photo-folder does not contain any subfolders except the ones with images, you don't need a Condition.
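
Without the Condition, the same rule stands on its own. The sketch below reuses the placeholder host and page name from the rule above and notes which requests the pattern does and does not catch.

```apache
RewriteEngine On

# ([^/]+) captures exactly one path segment, i.e. the folder name.
#
# Matches:
#   /main-photo-folder/2012-blue-pictures/
#     -> /my-new-page.php?folder=2012-blue-pictures
#   /main-photo-folder/moon-2012-july/
#     -> /my-new-page.php?folder=moon-2012-july
#
# Does not match:
#   /main-photo-folder/                      (no folder name)
#   /main-photo-folder/moon-2012-july/a.jpg  (does not end in /)
RewriteRule ^main-photo-folder/([^/]+)/$ http://www.example.com/my-new-page.php?folder=$1 [R=301,L]
```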

adresanet

12:37 pm on Jan 15, 2015 (gmt 0)

10+ Year Member



Ok, this is working for me in the way I need. Thank you!

I am still confused because I don't know which approach is better:

Should I redirect 301-Permanent Moved all the requests including googlebot for www.site.com/main-photo-folder/photo-folder/ to my-new-page.php, which in turn redirects to the corresponding page? The corresponding page may or may not be the exact one, because photo-folder holds many pictures that aren't all displayed by the same page. Picture1.jpg, Picture3.jpg and Picture4.jpg may be listed under PicPAGE1.html while Picture2.jpg, Picture5.jpg and Picture8.jpg may be listed under PicPage8.html.

Or maybe, if it is googlebot, should I return 403 Forbidden? Or 302 Found?


For a few days, since I added Options -Indexes to .htaccess, I see a lot of 403 folders listed under Crawl Errors -> Access Denied. So, I am not sure, in future those auto-generated pages by google will disappear from image search results? Will they be replaced by the corresponding pages from my site, or will I only lose them in favor of other websites?

lucy24

7:22 pm on Jan 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Should I redirect 301-Permanent Moved all the requests including googlebot

This one's really a judgement call on your part. A global redirect is certainly the easiest approach. Keep the RewriteRule but remove the Condition.

Or 302 Found?

Don't be misled by the name "Found". This is simply a non-301 redirect: either [R] or [R=302] because 302 is the default. Here a 302 is not appropriate, because it implies that the old URL will definitely be coming back at some later date.

I see a lot of 403 folders listed under Crawl Errors -> Access Denied. So, I am not sure, in future those auto-generated pages by google will disappear from image search results?

The new Crawl Errors is exactly what you'd expect. If you drop the condition and redirect everyone, then this group of "Errors" will disappear. You could potentially remove all those auto-index pages from google's index so they disappear right away. But get information from someone knowledgeable about how GWT works. You don't want to inadvertently deindex all your images when you only meant to deindex the auto-generated pages. I'd suggest starting a fresh thread if you want to ask about this. "Removing pages but not images in GWT" -- something like that.