
Forum Moderators: Ocean10000 & phranque


Am I Confusing Googlebot?

Faulty .htaccess file?

     
11:59 pm on Mar 14, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5


Hi,

I just made the switch from HTTP to HTTPS. At the same time, I also activated WordPress in the "root" directory as I rebuild an older, static website "page by page."

Since Google treats HTTPS as a "new site," I now have both the HTTP and HTTPS versions listed in Google Webmaster Tools. I'm taking advantage of the "new site" and cleaning up 404 errors Google shows for the HTTPS version as they arise. As far as the user goes, all is well.

However, Googlebot is showing a lot of 404 errors for pages that never existed. It appears to be constructing these URLs by stripping folders out of the directory structure of real pages. Example:

Actual page url might be:
www.mysite.com/folder1/filename.php

Google is showing a 404 error for this page in Webmaster Tools as follows:

www.mysite.com/filename.php

And for folders, multiple deep, I'm seeing errors such as these:

Actual page url:
www.mysite.com/folder1/folder2/filename.php

Google shows errors for this page as follows:
www.mysite.com/folder2/filename.php

Hence, the "top-level folder" this file is stored in has gone missing. Sub-folders seem unaffected.

At first, I thought it might be a specific folder problem. But I'm seeing more and more 404 errors pop up across multiple folders that hold my old static pages (so far, my new Wordpress pages seem immune to this).

The reason I'm thinking this might be a .htaccess problem is that when I view more detail about these errors in Webmaster Tools, I'm NOT seeing any page on my website (or on external websites) linking to these URLs that Googlebot thinks exist outside the folder structure on the web server. In short, there are no internal links on my site pointing to these pages, nor are there any external links. Thus, I'm very confused about why Googlebot is "nuking" the top-level folder of some, but not all, older, static pages.

Which brings me back to .htaccess, as it's the only thing I can think of that might be confusing Googlebot.

My .htaccess file is a bit of a mess (15 years old), but I am cleaning it up. I was hoping someone might chime in and see whether my .htaccess file is somehow causing this problem.

This is the beginning of my .htaccess file:


RewriteOptions inherit
Options +Includes
RewriteEngine on

RewriteCond %{HTTP_HOST} !^www\.
RewriteRule .* https://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
RewriteCond %{HTTPS} !=on
RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R,L]

AddHandler server-parsed .shtm
AddType image/svg+xml svg svgz
AddEncoding gzip svgz

# Begin Cache Control

Header unset Pragma
FileETag None
Header unset ETag

# cache images/pdf/css docs for 1 Month
<FilesMatch "\.(ico|pdf|jpg|jpeg|png|gif|svg|css)$">
Header set Cache-Control "max-age=2629000, public, must-revalidate"
Header unset Last-Modified
</FilesMatch>

# cache html/htm/xml/txt files for 2 Days
<FilesMatch "\.(xml|txt|xsl|js|woff)$">
Header set Cache-Control "max-age=172800, must-revalidate"
</FilesMatch>

#End Cache Control

# compress text, html, javascript, css, xml:
AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/htm
AddOutputFilterByType DEFLATE text/shtm
AddOutputFilterByType DEFLATE text/php
AddOutputFilterByType DEFLATE text/xml
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE application/xml
AddOutputFilterByType DEFLATE application/xhtml+xml
AddOutputFilterByType DEFLATE application/rss+xml
AddOutputFilterByType DEFLATE application/javascript
AddOutputFilterByType DEFLATE application/x-javascript
#End Compression


Beneath this lie hundreds of redirects. And beneath that pile of redirects, I have this:


# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /wordpress/
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /wordpress/index.php [L]
</IfModule>

# END WordPress


So my question is this - is my .htaccess file somehow causing Googlebot to go off on strange tangents when visiting my HTTPS version of my site?
12:25 am on Mar 15, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11468
votes: 174


if i had to guess, i would assume googlebot is testing your url structure to make sure the directory structure is actually meaningful.

if your paths are like:
/actual/directory/or/virtual/path/page.php
any request without the path information should provide a 404 status code response.
this is what you are doing and what googlebot "wants".

on the other hand, if your paths are like:
/arbitrary/path/info/or/what/ev/page.php
requests without the path information or an altered path would provide a 200 status code response and provide the same content.

a typical example of this problem is found with amazon urls.
for example:
https://www.amazon.com/all-your-urls-are-belong-to-us/dp/B07965L1FS/
https://www.amazon.com/dp/B07965L1FS/

this is a problem from the search crawling and indexing perspective, since you are serving content for non-canonical url requests, which in some cases is essentially a soft 404.
12:29 am on Mar 15, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11468
votes: 174


RewriteCond %{HTTP_HOST} !^www\.
RewriteRule .* https://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
RewriteCond %{HTTPS} !=on
RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R,L]


that second ruleset will cause a 302 status code and you should specify a 301 for that response.
i would also combine the two rulesets into one - something like this:
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC,OR]
RewriteCond %{HTTPS} !on
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
4:19 am on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


Beneath this lie hundreds of redirects.
Please include a few examples of those redirects. Since they come after the canonical rewrites - which should be the last rules before the WordPress section - I would be concerned.
5:05 am on Mar 15, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11468
votes: 174


Beneath this lie hundreds of redirects.
Please include a few examples of those redirects.

indeed! i missed this part.

are these redirects using mod_rewrite or mod_alias directives?

if anything, the more specific external redirects should precede the hostname canonicalization redirect.
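a sketch of that ordering, with hypothetical paths rather than your actual rules:

```apache
# specific external redirects first
RewriteRule ^folder1/old-page\.php$ https://www.example.com/folder1/new-page/ [R=301,L]

# hostname + https canonicalization last, as a catch-all
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC,OR]
RewriteCond %{HTTPS} !on
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```

this way a request for an old non-canonical url gets one redirect straight to its final destination, instead of a chain (first to the canonical hostname, then to the new page).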
3:11 pm on Mar 15, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5


Hi,

Like I said, my .htaccess file is a mess - but getting better. I initially (10+ years ago) used cPanel to make redirects, and it used the syntax shown below:

Redirect permanent /folder/file.php https://www.domain.com/folder/page-name/ 

[this is a redirect from an old static page to a new WordPress page]

Older redirects from "static" to "static" pages look like this:
Redirect permanent /folder-name/file-name.shtm https://www.mydomain.com/folder-name/file-name.php 


Later (like 5+ years ago), cPanel used the following format:

 RedirectMatch permanent ^/folder-name/another-folder/file-name.htm$ https://www.domain.com/folder-name/another-folder/file-name.shtm


Since then, I've been manually inserting redirects into the .htaccess file using the format cPanel initially used, i.e.:
 Redirect permanent /folder/file.php https://www.domain.com/folder/page-name/ 


The reason I'm using that format is simple: it's easy to copy/paste old and new URLs into the .htaccess, and there isn't a myriad of strange characters that are easy to mess up when typing them in. I'm no .htaccess expert, so simpler is always better. These old-style redirects have worked well enough all these years - or at least I've thought they have!
3:16 pm on Mar 15, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5


Also, I can post my entire rather lengthy .htaccess file here if someone wants to see it. But it would be a royal pain in the butt to change all the references in the file that show my domain name and individual page names. So if the admins are ok with me doing a bit of "spamming" (cough, cough), I can post the entire .htaccess file if you think it will help with this problem.

At a minimum, I'm sure .htaccess experts will get a kick out of what a mess I've made of the file!
3:33 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


Unfortunately, those redirects may be the source of some of the errors you're seeing. Redirect and RedirectMatch use the Apache mod_alias module, which does not play well with the mod_rewrite rules you're also using.

Since you need to use the WordPress block as it is supplied by WP, you should replace all those "Redirect" lines with RewriteRules. You can't use Redirect for the http->https canonicalization anyway, so it is best to replace them. Using rewrite patterns, you may be able to save many lines of redirects.

We happen to have a similar situation going on here: [webmasterworld.com...] where you can find one easy way to replace all those lines using regex find/replace in a text editor.
3:35 pm on Mar 15, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5


One final thing that might help troubleshoot: I have WordPress installed in a sub-directory. The WordPress part of the site - which is now activated in "root" - is working just fine. New pages I've put in are showing properly in Google search results. WordPress pages on the site are limited to the index.php (home) page and about 50 other pages I've put up. The remaining 3000+ pages, more or less, are still old static files.
3:48 pm on Mar 15, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5



Unfortunately, those redirects may be the source of some of the errors you're seeing. Redirect and RedirectMatch use the Apache mod_alias module, which does not play well with the mod_rewrite rules you're also using.

Since you need to use the WordPress block as it is supplied by WP, you should replace all those "Redirect" lines with RewriteRules. You can't use Redirect for the http->https canonicalization anyway, so it is best to replace them. Using rewrite patterns, you may be able to save many lines of redirects.

We happen to have a similar situation going on here: [webmasterworld.com...] where you can find one easy way to replace all those lines using regex find/replace in a text editor.


Thanks. I read through that thread and am now totally lost. .htaccess and Apache are far from my strong suit, I'm afraid.

Can you suggest a redirect format that simply works? If worst comes to worst, once I know what format to use I can manually change things, one by one if needed (hopefully not). And even if I don't change existing redirects, I'll at least put good redirects in from old static pages to the new WordPress pages.
4:12 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


The part of that discussion that addresses changing all those Redirect and RedirectMatch rules to Rewrite Rules is this part, posted there by lucy24:
Then it's time to fire up a text editor that does Regular Expressions, make a copy of your htaccess file, and apply these global changes (replace \1 with $1 depending on your RegEx engine):


^Redirect(?:Match)? 301 /(.+)
TO
RewriteRule \1 [R=301,L]

^Redirect(?:Match)? 410 /(.+)
TO
RewriteRule \1 - [G]

^Redirect(?:Match)? 403 /(.+)
TO
RewriteRule \1 - [F]

That tells you how to use a decent text editor - one that can use regular expressions (regex) to "Find" and "Replace" - to change all your Redirects to Rewrites in a few minutes. First save a copy of your current .htaccess file to use as a backup and a record of changes. Then copy your list of redirects to a new .txt file and edit that list. In the text editor, using regex, you use
^Redirect(?:Match)? 301 /(.+)
to find, and
RewriteRule \1 [R=301,L]
to replace and it should replace that old list of Redirect and RedirectMatch with a new list of RewriteRules.

Because there are different types of regex syntax, you should first experiment with one rule to see whether your editor uses \1 or $1 as the capture format.
4:30 pm on Mar 15, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5


Thanks.

For complete clarity, would this be the correct format then for new redirects?


^Redirect(?:Match)? 301 /(.+) /folder-name/sub-folder-name/file-name.php RewriteRule \1 [R=301,L] https://www.mydoman.com/folder/sub-folder/file-name/


Or am I missing something very obvious?

As for a text editor, I usually use Dreamweaver CS6 for this. Will that work? Or do you suggest a different text editor (I use a Mac)?
5:20 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


The RewriteRules are to replace the Redirect and RedirectMatch lines.

On Mac, you can definitely use the \1 examples (rather than $1) in either TextWrangler or BBEdit.
5:35 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


You are using the text editor's search function to find "^Redirect(?:Match)? 301 /(.+)" and replace it with "RewriteRule \1 [R=301,L]", with the GREP checkbox checked. These examples were formulated to match the searches in the other thread; your redirect lines are in a different format which doesn't use "301" but rather "Redirect permanent", so you would need to alter the search terms. For the format you use, I would change the "^Redirect(?:Match)? 301 /(.+)" search to "^Redirect permanent /(.+)".
5:46 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


I just tried it out. If you use
Redirect permanent /(.+)
to find, and
RewriteRule /\1 [R=301,L]
to replace, it converts this line:
Redirect permanent /folder/file.php https://www.example.com/folder/page-name/
to this line:
RewriteRule /folder/file.php https://www.example.com/folder/page-name/ [R=301,L]


(using BBEdit on Mac in GREP mode)
5:57 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


What I mentioned about Rewrite patterns saving you lines of rules is that you can - for example - capture all requests for .php files in a directory and send them to the new "/folder/page-name/" format with just one rule. If you are using hundreds of Redirect lines (one for each file), you could replace them all with one Rewrite line. It depends on whether they have similar patterns (/folder/filename.php) in their old and new formats.
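As a sketch - assuming a hypothetical folder where every old "/folder/filename.php" URL maps to a new "/folder/filename/" URL - one rule can stand in for all the per-file Redirect lines:

```apache
# Send any /folder/name.php request to /folder/name/ in one rule;
# ([^/.]+) captures the filename without its extension and is reused as $1.
RewriteRule ^folder/([^/.]+)\.php$ https://www.example.com/folder/$1/ [R=301,L]
```

That only works where the old and new names follow the same pattern; pages that were renamed still need their own individual rules.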
8:53 pm on Mar 15, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15314
votes: 709


it would be a royal pain in the butt to change all references in the file that show my domain name and individual page names
Er, do you not have a text editor? It doesn't even need to be a good one that does Regular Expressions; the barebones text editor that came preinstalled on your computer will do fine.

-- Copy htaccess file into text editor.

-- Globally replace all occurrences of “your-site-name.tld” with “example.com”.

-- Select part of the edited text and paste into a post. It is never really necessary to paste in the entire htaccess--and if you've got hundreds of redirects, we don't need to see all of them to make suggestions.

Incidentally, I've revised my redirect-syntax-changing boilerplate to account for anchors. The other thread has further discussion about paths and capturing--but if you've got hundreds of redirects, it is safe to say each one is for a specific individual URL.

RewriteRule /folder/file.php
Patterns with leading slash are only used when lying loose in config (most likely in a VirtualHost envelope). In htaccess, omit the leading slash or the rule will not execute.
9:48 pm on Mar 15, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4085
votes: 257


In htaccess, omit the leading slash or the rule will not execute.
Oh, that's right lucy24. I added that error by trying to match the Redirect's format. Oops, my bad.

To convert the old "Redirect permanent" lines without the leading slash, use
RewriteRule \1 [R=301,L]
instead of
RewriteRule /\1 [R=301,L]
as suggested above for the "Replace" part of that search.
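If you want to sanity-check that substitution outside the editor, the same find/replace can be run with sed (hypothetical paths; sed -E uses \1 for the capture, just like BBEdit's GREP mode):

```shell
# Convert one old cPanel-style "Redirect permanent" line to a RewriteRule,
# dropping the leading slash since the rule lives in .htaccess.
echo 'Redirect permanent /folder/file.php https://www.example.com/folder/page-name/' |
sed -E 's|^Redirect permanent /(.+)$|RewriteRule \1 [R=301,L]|'
# prints: RewriteRule folder/file.php https://www.example.com/folder/page-name/ [R=301,L]
```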
11:03 pm on Mar 15, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2002
posts: 512
votes: 5


I appreciate all the help.

But I think I might have tracked down my problem. Somehow, a 2nd .htaccess file snuck into the folder that was having 99.5% of the major issues. No idea how it got there; it's probably been there for many years. Guess I'll need to check other folders on the site too, to see if any more are lingering around.
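If anyone else needs to do the same check, a quick way to find every stray copy (assuming shell access, run from the site's document root) is:

```shell
# List every .htaccess under the current directory; a copy in any
# subfolder can add or override rules for that part of the site.
find . -name '.htaccess' -print
```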