
Google SEO News and Discussion Forum

    
Google returns both correct and incorrect versions of 301ed url
chms




msg:4623584
 6:40 pm on Nov 15, 2013 (gmt 0)

Hello,

For some time now Google has been indexing my URLs incorrectly.

For example, my URLs look like www.domain.com/id/title-with-hyphenated, but many of them are indexed as www.domain.com/id/ instead. When someone accesses that short URL, I redirect (301) to the correct URL.

Curiously, for two different searches I've seen Google show the same page indexed under both the correct and the incorrect URL.

Thank you

 

aakk9999




msg:4623653
 11:36 pm on Nov 15, 2013 (gmt 0)

A few questions:

- How long has the redirect been in place?
- Are you maybe blocking the short URL that should redirect in your robots.txt?
- Are you certain that Googlebot has requested the page that redirects? It has to request it in order to see the redirect
- Have you verified that Google sees the 301 redirect? (For example, use "Fetch as Googlebot" and, if you get "Success", click on it to see the result of the fetch.)

Also, I have seen cases where a URL sometimes returned a 301 redirect and sometimes did not. This was happening owing to a threading error in a custom script that was doing the redirects. In this case Google may be getting mixed messages on how to handle the URL and may leave it in the index. To see whether you have this kind of problem you would need to inspect your logs over a period of time.
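
One way to check the logs for this kind of mixed response is sketched below. It is only a rough sketch: it assumes an Apache-style Combined Log Format access log and picks out the crawler by the substring "Googlebot" in the user-agent field (which can be spoofed). The function names and the sample log lines are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Combined Log Format (assumed):
# host ident user [time] "METHOD path protocol" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def statuses_by_path(lines, agent_substring="Googlebot"):
    """Count the status codes returned per requested path for one user agent."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_substring in match.group("agent"):
            counts[match.group("path")][match.group("status")] += 1
    return counts

def mixed_status_paths(lines, agent_substring="Googlebot"):
    """Return only the paths that answered with more than one status code."""
    return {
        path: dict(codes)
        for path, codes in statuses_by_path(lines, agent_substring).items()
        if len(codes) > 1
    }

# Hypothetical sample lines, not taken from any real log.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'66.249.66.1 - - [15/Nov/2013:11:29:27 +0000] "GET /articles/3525/ HTTP/1.1" 301 16855 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [16/Nov/2013:09:10:02 +0000] "GET /articles/3525/ HTTP/1.1" 200 16855 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [16/Nov/2013:09:12:40 +0000] "GET /articles/3526/ HTTP/1.1" 301 0 "-" "{GOOGLEBOT}"',
]

print(mixed_status_paths(sample_log))
```

Any path that shows up in the result answered Googlebot with different status codes at different times, which is the "mixed messages" situation described above.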

chms




msg:4623663
 12:41 am on Nov 16, 2013 (gmt 0)

Hello,

The redirect is instantaneous, robots.txt is blocking only individual URLs, and I checked with "Fetch as Googlebot" and it sees the redirection perfectly.

I found a link on the site with malformed URLs, but it is curious that Google indexes a URL that redirects and is never actually loaded.

Thank you

netmeg




msg:4623672
 1:42 am on Nov 16, 2013 (gmt 0)

No, she meant how long ago did you set up these redirects?

lucy24




msg:4623677
 2:26 am on Nov 16, 2013 (gmt 0)

Sometimes things get garbled when you exemplify for posting. Is the longer URL the correct one and the shorter URL is the incorrect one?

Was the old/incorrect URL ever used? Or did you or someone else link to it by mistake? In general the redirect wouldn't even need to exist unless you had reason to think someone somewhere would ask for the "wrong" name.

I found a link on the site with malformed urls

Do you mean, the specific URLs that you're asking about? Is that why the redirect is in place? Did you set up the redirect after the wrong URL had already been requested?

Fact of nature: If you make a mistake in a link, and correct the mistake within 15 minutes, and it's on a page that is typically crawled once a week ... some search engine will make its weekly crawl during that 15-minute window. (Look, bing, there's no such thing as "innuuniq". Get over it, willya?)

chms




msg:4623758
 2:01 pm on Nov 16, 2013 (gmt 0)

Hello,

These redirects have always been working.

The redirection is executed before the page loads. It was created because, after the id, the URL carries the page title separated by hyphens. Since the page is loaded by the id alone, the redirect avoids duplicate URLs.

We also have a canonical URL pointing to the correct URL on each page.

In WMT, fetching the wrong URL as Google, I see this:


HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Nov 2013 11:29:27 GMT
Server: Apache
Set-Cookie: PHPSESSID=5f8565786e1261c0c5f12720b7ad15b5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
Pragma: no-cache
Set-Cookie: bb_lastvisit=1384514967; expires=Sat, 15-Nov-2014 11:29:27 GMT; path=/; domain=.mydomain.com
Set-Cookie: bb_lastactivity=0; expires=Sat, 15-Nov-2014 11:29:27 GMT; path=/; domain=.mydomain.com
Location: http://www.mydomain.com/articulos/3525/title-with-hyphenated
Set-Cookie: vuart=3525; expires=Sat, 16-Nov-2013 11:29:27 GMT; path=/
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 16855
Connection: close
Content-Type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
....



Thank you

aakk9999




msg:4623760
 2:08 pm on Nov 16, 2013 (gmt 0)

If you have 301 redirect, you should not be sending the content. So you should be seeing:

Content-Length: 0

And there should be no

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
.... etc

Maybe this is what is confusing Google? Although Google *should* ignore the content having received 301.

Can you change your 301 redirect to only send back HTTP headers? Then give it some time until it is recrawled and see what happens.

<ADDED>
In fact, thinking about it, since Googlebot received the content, it may think that 301 response was a mistake and it may have assumed it should be 200 OK and therefore may be ignoring 301.

I would also not set cookies, session id etc. when responding with 301
</ADDED>

rainborick




msg:4623787
 3:34 pm on Nov 16, 2013 (gmt 0)

Content length doesn't have to be 0 to make a valid 301 response. For example, Apache sends a default error document for most 3xx, 4xx, and 5xx response codes. It's up to the User Agent whether or not to respond automatically or present the document to the user.

aakk9999




msg:4623819
 6:32 pm on Nov 16, 2013 (gmt 0)

@rainborick, are you sure that default errordocument is sent for 301 by Apache? I have never heard of this and cannot see the point (4xx and 5xx I agree with you).

I know it is up to the User-agent to follow the rules, but talking about Googlebot, it is not such a stretch that they may conclude that the 301 was sent in error if page content is returned too.

rainborick




msg:4623823
 6:36 pm on Nov 16, 2013 (gmt 0)

Yes.

aakk9999




msg:4623829
 7:04 pm on Nov 16, 2013 (gmt 0)

@rainborick
Thank you, you are right that the content can be included according to the protocol (I should have checked this before asking you the question!):

HTTP/1.1: Status Code Definitions
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.2

The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).


But for GET (and sometimes HEAD) which Googlebot uses, the User-agent would automatically follow 301 as specified in the "Location:" without displaying this short content page and the short content page would be ignored.

@chms
Is the content returned with your 301 (the part starting with "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">") the content of your page as it would be if you were not redirecting, or is it the short message with a hyperlink to the redirect target, as quoted above?

If it is your original page that now redirects, I would remove (that is, not send with the response) the content sent on 301, wait until redirect page is re-crawled by Googlebot, wait some time after that (perhaps some weeks) and see if the page that is redirecting disappears from the index. This is just to discount that sending original page content together with 301 response is the issue that confuses Googlebot.

I have many sites with 301 and Google has none of these pages in index, but then I am sending back only HTTP headers and not the page content.

lucy24




msg:4623853
 9:24 pm on Nov 16, 2013 (gmt 0)

Tangentially: are you supposed to send cookies with a 30x response? Seems like you should wait until the user requests the real page.

Content always has some length. It's the number you see in logs. Though why google's request for robots.txt is 581 bytes while some other passing robot's is 597 must remain a mystery, since I use a static text file. (I pulled a random day's raw logs and asked it to show me any 301.)

But if you've got dynamically generated pages it's possible that there was a coding goof and you're sending out the full page with the header. (Admittedly more common with a 404/410 response.) 16k really can't be anything but a page.

Look at your logs. For any given human visitor, what comes immediately after the 301 response? Is it a request for the correct page, or a request for non-page files-- and if so, what page do they belong to? If you don't get humans requesting the old URL, just type it in yourself. If you're using LiveHeaders or similar, make sure it's set to show everything. That includes things like images that you might normally exclude to reduce clutter.
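
The "what comes immediately after the 301" check can be sketched roughly as below. This assumes the log has already been parsed into chronological (client, path, status) tuples; that structure, the function name, and the sample entries are hypothetical, chosen only to show the idea.

```python
def requests_after_redirect(entries):
    """For each 301 entry, return (redirected_path, next_path_from_same_client).

    `entries` is a chronological list of (client, path, status) tuples.
    next_path is None when the client made no further request.
    """
    results = []
    for index, (client, path, status) in enumerate(entries):
        if status != 301:
            continue
        next_path = None
        # Scan forward for the first later request from the same client.
        for later_client, later_path, _ in entries[index + 1:]:
            if later_client == client:
                next_path = later_path
                break
        results.append((path, next_path))
    return results

# Hypothetical entries: the first visitor follows the redirect, the second
# goes straight to a non-page file without ever requesting the target page.
entries = [
    ("203.0.113.7", "/articles/3525/", 301),
    ("198.51.100.9", "/articles/3525/", 301),
    ("203.0.113.7", "/articles/3525/title-with-hyphenated", 200),
    ("198.51.100.9", "/images/logo.png", 200),
]

for redirected, followed_by in requests_after_redirect(entries):
    print(redirected, "->", followed_by)
```

A 301 followed by a request for the correct page is the expected pattern. A 301 followed directly by images or other non-page files would suggest the client treated the body of the 301 response as the page.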

aakk9999




msg:4623860
 10:03 pm on Nov 16, 2013 (gmt 0)

Content always has some length.

Agreed - when you are sending some. But this is the length of the content, excluding HTTP headers.

What I see for a 301 that I send is below: no content is sent, only the response headers.


HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Sat, 16 Nov 2013 14:12:52 GMT
Content-Length: 0
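
To compare the two sets of headers posted in this thread, a small sketch like the one below could be used. It only flags the two things being speculated about here (cookies and a non-empty body on a 301). These are not protocol violations, and the function names are invented for the example.

```python
def parse_response_head(raw_head):
    """Split a raw HTTP response head into (status_code, [(name, value), ...])."""
    lines = [line for line in raw_head.strip().splitlines() if line.strip()]
    status_code = int(lines[0].split()[1])
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip().lower(), value.strip()))
    return status_code, headers

def redirect_warnings(raw_head):
    """List the points this thread suggests avoiding on a 301 response."""
    status_code, headers = parse_response_head(raw_head)
    if status_code != 301:
        return ["not a 301 response"]
    warnings = []
    if not any(name == "location" for name, _ in headers):
        warnings.append("no Location header")
    cookies = sum(1 for name, _ in headers if name == "set-cookie")
    if cookies:
        warnings.append(f"{cookies} Set-Cookie header(s) sent")
    for name, value in headers:
        if name == "content-length" and value.isdigit() and int(value) > 0:
            warnings.append(f"body of {value} bytes sent")
    return warnings

# Headers as posted by chms earlier in the thread.
chms_head = """HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Nov 2013 11:29:27 GMT
Server: Apache
Set-Cookie: PHPSESSID=5f8565786e1261c0c5f12720b7ad15b5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
Pragma: no-cache
Set-Cookie: bb_lastvisit=1384514967; expires=Sat, 15-Nov-2014 11:29:27 GMT; path=/; domain=.mydomain.com
Set-Cookie: bb_lastactivity=0; expires=Sat, 15-Nov-2014 11:29:27 GMT; path=/; domain=.mydomain.com
Location: http://www.mydomain.com/articles/3525/title-with-hyphenated
Set-Cookie: vuart=3525; expires=Sat, 16-Nov-2013 11:29:27 GMT; path=/
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 16855
Connection: close
Content-Type: text/html"""

# Headers-only 301 as posted above.
headers_only_head = """HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Sat, 16 Nov 2013 14:12:52 GMT
Content-Length: 0"""

print(redirect_warnings(chms_head))
print(redirect_warnings(headers_only_head))
```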


For any given human visitor, what comes immediately after the 301 response? Is it a request for the correct page, or a request for non-page files

Worth checking, but also worth remembering that human visitors would use a browser. The browser would probably discard the content received and just redirect based on the HTTP response code and the Location in the HTTP headers. In his second post, chms says that his page redirects and that the redirection is executed before the page load, which kind of confirms this.

If the page that redirects is truly indexed, and the redirect is implemented long enough ago for Google to have seen and processed the redirect + some extra time, then what I am speculating is that Googlebot gets confused - because despite receiving 301 response, Googlebot is also receiving cookies and 16k of content.

It is not so far-fetched that in this case Googlebot may assume that the 301 response was returned in error and treat it somehow like 200 OK, since it received cookies and 16K of page content.

There is another way to test this - to add something in this 16k of content that is unique to the page that redirects and then after it was recrawled etc, to check Google cache.

@chms, this leads me to another question: The page that you are seeing in SERPs - is there a cache version? If so, when you view the cache, what URL is shown after the words This is Google's cache of, is it the URL that should redirect or the target URL?

chms




msg:4623989
 3:54 pm on Nov 17, 2013 (gmt 0)

Hello,

I'm going to explain the situation as best I can.

Correct URL: www.mydomain.com/articles/3525/title-with-hyphenated
Incorrect URL: www.mydomain.com/articles/3525/

Google is indexing both types of URL; in many cases it is not following the 301 redirect.

When someone tries to access the incorrect URL, the system redirects (301) to the correct URL.

The redirect is done in PHP before the page loads; the PHP code comes before the HTML.

Google is indexing both types of URL, and in both cases there is a Google cache. For the incorrect URL, the cache shows the incorrect URL after the words "This is Google's cache of", but the web page loads correctly.

When I use the "Fetch as Google" tool in WMT and enter the incorrect URL, I receive "Success", followed by the headers below and then the correct page. In other words, I ask for the incorrect URL and the Location header shows the correct URL (Location: www.mydomain.com/articles/3525/title-with-hyphenated):

URL: http://www.mydomain.com/articles/3525/
Date: Friday, November 15, 2013 03:26:01 GMT-8
Googlebot type: Web
Download time (in milliseconds): 564
HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Nov 2013 11:29:27 GMT
Server: Apache
Set-Cookie: PHPSESSID=5f8565786e1261c0c5f12720b7ad15b5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
Pragma: no-cache
Set-Cookie: bb_lastvisit=1384514967; expires=Sat, 15-Nov-2014 11:29:27 GMT; path=/; domain=.mydomain.com
Set-Cookie: bb_lastactivity=0; expires=Sat, 15-Nov-2014 11:29:27 GMT; path=/; domain=.mydomain.com
Location: http://www.mydomain.com/articles/3525/title-with-hyphenated
Set-Cookie: vuart=3525; expires=Sat, 16-Nov-2013 11:29:27 GMT; path=/
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 16855
Connection: close
Content-Type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
....
(the web page loads correctly)
....

[edited by: aakk9999 at 6:38 pm (utc) on Nov 17, 2013]
[edit reason] Obscured URL [/edit]

aakk9999




msg:4624020
 7:16 pm on Nov 17, 2013 (gmt 0)

Mods note:
I have been privy to the domain URL, which helped in detecting this problem. Here is a description of the issue without going into specifics. This may help other members who have problems with Google processing 301 redirects.



For the sample URL whose headers are in the post above, Google is in fact displaying the correct (target) URL in SERPs. However, there are a number of other URLs on that site which behave exactly as chms has described: Google shows the short version of the URL in SERPs, and when this URL is clicked, the page is redirected.

This would be the normal state of play for newly introduced redirects which may not yet have been processed by Google. However, chms says the redirects have been in place for some time. I have checked the cache date of a few such URLs that should not be indexed by Google because they redirect, and the cache date is very recent (a few days ago).

What I have noticed however on the sample problematic URLs that I have checked is that the "Location:" returned in HTTP headers did not include protocol and domain name, it only included path from the root down. This was obviously good enough for the browser to redirect correctly, but it seems not for Googlebot.

I am speculating that incorrectly specified redirect Location, together with returned cookies and the full HTML of the page caused Google to ignore the 301 response code, especially since it had enough information to index the page (HTML returned, redirect location incorrect).
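
The "Location:" problem can be checked mechanically. The sketch below is a minimal illustration using Python's standard urllib.parse; the function names are invented here. It also shows how a lenient client such as a browser would resolve a path-only value against the requested URL, which would explain why the redirect worked for visitors. For background, RFC 2616 (the specification quoted earlier in this thread) defines the Location value as an absolute URI.

```python
from urllib.parse import urljoin, urlsplit

def is_absolute_location(location):
    """True when the Location value carries both a protocol and a domain name."""
    parts = urlsplit(location.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def resolved_target(request_url, location):
    """Resolve a Location value the way a lenient client would, relative to the request."""
    return urljoin(request_url, location.strip())

requested = "http://www.mydomain.com/articles/3525/"
path_only = "/articles/3525/title-with-hyphenated"
full_url = "http://www.mydomain.com/articles/3525/title-with-hyphenated"

print(is_absolute_location(path_only))        # path from the root down only
print(is_absolute_location(full_url))         # protocol and domain included
print(resolved_target(requested, path_only))  # what a browser would end up requesting
```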

Chms says that the redirect is done from within PHP. I suggest the following change to the PHP redirect code (the changes are the protocol and domain name in "Location:" and the added "exit;"):

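<?php
// Check whether to redirect or not.

// ... some existing PHP code ...

// If redirecting, send the headers and then exit!
header("HTTP/1.1 301 Moved Permanently");
// Location carries the full URL: protocol, domain name, and path.
header("Location: http://www.example.com/your-target-url");
// Stop here so that no page HTML is generated and sent with the 301.
exit;
?>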

The "exit;" is added so that the script would at this point terminate and not continue on and create and send HTML of the page when responding with 301.

As I said, I am only speculating that these omissions (no protocol and domain name in "Location:", and no "exit;") caused Google not to follow the redirect and to keep the original URL in the index.

@chms, if you do change your redirect code on all problematic pages (in line with what is specified above), wait a few weeks to give Google the chance to re-request these URLs and process the redirects. You may want to monitor your logs for these problematic URLs to see if Google has attempted to crawl them. It would be nice if you could report back after a few weeks and let us know whether this has solved your redirect problems.

chms




msg:4632872
 1:48 pm on Dec 22, 2013 (gmt 0)

Hello again,

I want to update this thread because finally we have fixed the problem with redirections.

These are changes we have done:

- changed "Location:" to include the protocol and domain name
- changed the response headers so that they do not send any cookies
- stopped the 301 response from sending the HTML content of the page
- changed the non-www redirect to go directly to the longer URL version instead of redirecting via the short version

After all these changes Google has begun to follow the redirects and to index the correct URLs.

I want to acknowledge the assistance of aakk9999, who has been helping me with the problem these days. Thanks aakk9999, without you I would not have been able to solve the problem.

I hope this can help other people with the same problem.
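
The last change in the list above (sending the non-www URL straight to the long URL) can be pictured as shortening a redirect chain. The sketch below models redirects as a plain dictionary and counts hops. The non-www URL shape and the function name are assumptions made for illustration, not details taken from the site.

```python
def redirect_chain(start_url, redirects, max_hops=10):
    """Follow a {source: target} redirect map and return every URL visited."""
    chain = [start_url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        if target in chain:
            raise ValueError(f"redirect loop at {target}")
        chain.append(target)
    return chain

# Hypothetical URLs following the pattern used in this thread.
NON_WWW = "http://mydomain.com/articles/3525/"
SHORT = "http://www.mydomain.com/articles/3525/"
LONG = "http://www.mydomain.com/articles/3525/title-with-hyphenated"

# Before: non-www goes to the short www URL, which redirects again.
before = {NON_WWW: SHORT, SHORT: LONG}
# After: both the non-www and the short URL go directly to the long URL.
after = {NON_WWW: LONG, SHORT: LONG}

print(len(redirect_chain(NON_WWW, before)) - 1, "hops before")
print(len(redirect_chain(NON_WWW, after)) - 1, "hops after")
```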
