Forum Moderators: Robert Charlton & goodroi

Site Relaunch with URL changes - Redirects & Robots Implementation Question


Jeff_Fidler

4:22 pm on Apr 25, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



We have a client that we're helping with SEO elements in a larger replatforming project.

Since all the URLs were going to be changing, 301 redirects were required.

Then on launch, they went and implemented something completely different than what was in the project plan, and now they won't change it, as they are convinced they are doing it right. Before I get too pushy with them, I just wanted to get some feedback from the SEO community on the below setup.

So here's what they've set up:
1.) www 301 redirects to www1.
2.) www1 has a robots.txt that blocks search engines.
3.) The robots.txt on www still allows search engines.
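In robots.txt terms, the setup described above would look something like this (example.com is a stand-in for the real domain):

```
# http://www.example.com/robots.txt  (old host - still open to crawlers)
User-agent: *
Disallow:

# http://www1.example.com/robots.txt  (new host - blocks all crawlers)
User-agent: *
Disallow: /
```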

Will search engines crawl www, or will the robots.txt on www1 be the one that's obeyed?

I suspect the robots.txt on www1 will be obeyed. And since www1 blocks search engines, all their indexed pages will be dropped once the search engines catch up with it.

Am I wrong in this instance? Will Google actually fetch a robots.txt file when the URLs are being redirected like this?

Really appreciate feedback.

Cheers - Jeff

aakk9999

4:47 pm on Apr 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Welcome to WebmasterWorld, Jeff!

To clarify:
- old URLs (which do not exist any more) redirect to www1.
- what does www1 contain? The same pages as existed under the old URLs on www?
- what are they trying to achieve with these redirects?

What will most likely happen is that:
- over time, the old URLs will drop from the index.
- the www1 URLs will probably stay in the index marked "this page is blocked by robots.txt", and are unlikely to rank for anything other than perhaps a "site:" query.
- the new www1 URLs will be treated as brand new pages, with no link power inherited from the same pages under the old URLs.

You should really get them to implement page-to-page redirects. Maybe they will listen when the traffic starts to drop.

Andy Langton

5:34 pm on Apr 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the robots.txt itself redirects, Google will use the one that is redirected to:

Redirects will generally be followed until a valid result can be found

[developers.google.com...]

So a full disallow will apply to both www and www1 if the www robots.txt redirects to the www1 version.

If www does not redirect its robots.txt file, then aakk9999 is exactly right - every URL on www is valid and can be indexed. But if those URLs just redirect, all value goes to the www1 version (which is disallowed). The www URLs get dropped (there's no content to index), and the www1 URLs will not rank for much and will display "A description for this result is not available because of this site's robots.txt".

Non-redirected URLs on www can still rank.
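The robots.txt-redirect case described above can be sketched roughly like this. The redirect map, URLs, and rule bodies below are hypothetical, simulating the HTTP fetches in memory:

```python
# Sketch of the rule quoted above: if the robots.txt URL itself 301s,
# a Googlebot-like fetcher uses whatever body it eventually lands on.
REDIRECTS = {
    "http://www.example.com/robots.txt": "http://www1.example.com/robots.txt",
}
BODIES = {
    "http://www1.example.com/robots.txt": "User-agent: *\nDisallow: /",
}

def fetch_robots(url, max_hops=5):
    # Follow redirects until a body is found; give up after a few hops
    # (Google likewise follows only a limited number of redirect hops).
    for _ in range(max_hops):
        if url in BODIES:
            return BODIES[url]
        url = REDIRECTS.get(url, url)
    return None

body = fetch_robots("http://www.example.com/robots.txt")
print(body)  # the www1 rules - a full disallow now governs both hosts
```

So with a redirected robots.txt, the www host effectively inherits the www1 disallow, which is the "full disallow will apply to both" outcome described above.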

Jeff_Fidler

6:09 pm on Apr 25, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month




The robots.txt file on the old site (www) isn't redirected itself, but all other URLs do redirect to the new site (www1).

In this instance, wouldn't Google listen to the robots file on the destination URL (www1) versus listening to the robots file on the old site?

As for their rankings, with everything redirecting to the new site (www1), it's my feeling that Google will begin to drop previously indexed pages (www) due to the fact that these URLs now redirect to a blocked URL. I can't see them maintaining these rankings, with this setup, but perhaps I'm wrong?

I should also mention that when they first launched, they had only one redirect set up, for the homepage (/). It was like that for several weeks in different geographic locations (I didn't know this for 3 weeks).

Basically everything I didn't want to happen on this launch, happened - sigh... One of those launches. Complete disaster.

Andy Langton

6:57 pm on Apr 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As for their rankings, with everything redirecting to the new site (www1), it's my feeling that Google will begin to drop previously indexed pages (www) due to the fact that these URLs now redirect to a blocked URL


Exactly. You can't rank a 301 redirected URL. Google will rank the destination - in this case, not at all, because the destination is robots excluded. However, in the short term, Google may well regard this as a mistake, and hold onto those www pages longer than usual.

In this instance, wouldn't Google listen to the robots file on the destination URL (www1)


No. An un-redirected robots.txt tells Google which URLs on that specific host (www) it may crawl. For instance, you might use this to selectively block some of the redirects on the www host.

For example:

Google crawls www.example.com/url (allowed) which redirects to www1.example.com/newurl (disallowed). Google is allowed to discover the redirects via the robots.txt on www.example.com. But neither URL will appear in search (OK, the www1 can appear as a "URL only" listing, but it's unlikely to rank very well - it might do OK in the very short term).

Jeff_Fidler

8:57 pm on Apr 25, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



No. An un-redirected robots.txt is telling Google which URLs on that specific host (www) to crawl. For instance, you might do this to selectively block redirects from the www.

I thought because the root URL on www is being redirected, they wouldn't even get to the robots.txt.

Interesting - and good to know!

Andy Langton

9:13 pm on Apr 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I thought because the root URL on www is being redirected, they wouldn't even get to the robots.txt.


Ah, well search engines won't even know whether they should grab the root unless they check robots.txt first :)

lucy24

9:37 pm on Apr 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Will search engines crawl the www or will the robots on the www1 be the one that's obeyed?

Each hostname has its own robots.txt. That is, they could all be the same physical file on your server, but as far as a compliant robot is concerned,
sub1.example.com/robots.txt
lays out the rules for
sub1.example.com
and
sub2.example.com/robots.txt
lays out the rules for
sub2.example.com
et cetera, while
example.com/robots.txt
only lays out the rules for
example.com

So you could, in theory, maintain entirely different robots.txt files for example.com and www.example.com, but it would only be meaningful if a search engine was allowed to crawl both forms. Think about this too long and your head will start to hurt. Keep in mind that a redirect means "make a separate request", so a compliant robot has to ask for robots.txt in both places. It's not like a human who automatically goes where you send them.
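The per-host rule can be seen in action with Python's stdlib robots.txt parser; the hostnames and rules below are hypothetical, mirroring the setup in this thread:

```python
from urllib import robotparser

# Each hostname gets its own parser, fed its own (hypothetical) robots.txt.
www = robotparser.RobotFileParser()
www.parse(["User-agent: *", "Disallow:"])        # www: allow everything

www1 = robotparser.RobotFileParser()
www1.parse(["User-agent: *", "Disallow: /"])     # www1: block everything

# The same path gets opposite answers depending on which host's rules apply.
print(www.can_fetch("*", "http://www.example.com/page"))    # True
print(www1.can_fetch("*", "http://www1.example.com/page"))  # False
```

A compliant robot keeps one rule set per hostname exactly like this, which is why a redirect from one host to another always triggers a separate robots.txt lookup for the target.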

Jeff_Fidler

9:44 pm on Apr 25, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



Ok, so search engines will read the robots files on both www and www1, but it's kind of meaningless, since in this case, the pages driving the traffic organically redirect to www1, which is blocking search engines. So any pages they had ranking in Google will wind up being removed when they recognize the robots file on www1.

Do I have that right?

And does any one have any ideas as to how long it might take before Google updates/refreshes and we see all those listings go bye bye?

Jeff_Fidler

9:46 pm on Apr 25, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



Oh and Lucy24 - my head already hurts thanks to IT listening so well to our best practices recos. :-)

Andy Langton

11:10 pm on Apr 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So any pages they had ranking in Google will wind up being removed when they recognize the robots file on www1.

Do I have that right?


Pretty much. Google will grab the robots.txt for each site first. Then it will ask for pages from www.example.com (which is allowed), but it won't follow the redirects to www1 (because it is disallowed there). Googlebot will probably keep grabbing the www pages (and checking the www1 robots.txt) in case this is an error. Unless that changes, however, the simplified end result is that the www1 pages cannot be crawled and so will be dropped from search results. The www "pages" don't exist any more (they point somewhere else), so they won't rank either.

dipper

11:25 pm on Apr 25, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Presented in the right way, this is an opportunity. Lay out the risks to the business owner: "if we do this, there is a very real chance that pages could be totally blocked from Google. If that happens, then no rankings, no traffic, and no income - I don't want to see that happen." That alone might motivate them in the right way - and if they don't change it and lose rankings, you've just gained a new best friend.

dipper

11:28 pm on Apr 25, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



re: "And does any one have any ideas as to how long it might take before Google updates/refreshes and we see all those listings go bye bye?"

Variable, but you can hasten crawling and indexing of pages or robots.txt by using the Google Search Console. Gives you a way to test their new setup too.

Jeff_Fidler

1:14 pm on Apr 26, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



dipper - that's exactly what we did - emailed the lead on the project and told them, hey you're going to lose 30% of all site traffic and 40% of all site revenue if you keep things this way. Despite that, they are going to leave it this way for another 3 days. Hopefully they remember who told them when the *&^% hits the fan. At least I have it in writing. :-)

Jeff_Fidler

1:15 pm on Apr 26, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks for the feedback everyone, was great to be able to bounce my ideas off someone else who knows what they are talking about. Appreciate all the advice! Cheers - Jeff

tangor

11:52 pm on Apr 26, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do let us know how it turns out! Always like to hear tales of clients who actually LISTEN to what's been offered. Or laugh at those who don't! :)

AussieWebmaster

5:42 am on Apr 27, 2016 (gmt 0)

10+ Year Member



the spider gets hit by the first redirect before it reads the initial robots.txt - it has to verify the site is live, hence the first page fetched is the redirected domain, and that means the www1 robots.txt is what gets read

Andy Langton

9:52 am on Apr 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the spider gets hit by the first redirect before it reads the initial robots.txt - it has to verify the site is live, hence the first page fetched is the redirected domain, and that means the www1 robots.txt is what gets read


This isn't correct. Search engines don't need to "test" that a site is live via any method other than retrieving robots.txt. Indeed, doing so would attract a lot of ire from site owners who have blocked whatever URL search engines might be "testing" with. Of course, there's no difference between requesting robots.txt and requesting any other file, so there's no technical reason why a request for any other URL should come first.

Jeff_Fidler

1:57 pm on Apr 27, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



Would be great to actually get a Google rep to tell us how they do it - from the horse's mouth so to speak... Mental note for a Google Webmaster Tools forum question to post.

ergophobe

10:36 pm on Apr 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You don't need a Google rep. Google is very clear about this, and it is exactly as Andy Langton and lucy24 have said:

A robots.txt on a subdomain is only valid for that subdomain.

See: [developers.google.com...]

They specifically cover the case you're asking about. If you request a URL on your www1 host, Google will only consider what's in www1.example.com/robots.txt, which in your case says "don't crawl", so it won't.

If you request a URL that is on the www host and that is redirected to the www1 host, then it's on the www1 host and the same rules apply. Simple as that.


As a few have pointed out, they may index, but that's another story.

Jeff_Fidler

1:05 pm on Apr 28, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes, so if I'm getting it right, everything will be removed due to the www1 robots.txt and the redirects that are setup.

not2easy

2:26 pm on Apr 28, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Depending on how the
1.) www 301 redirects to www1.
was set up, AussieWebmaster may be right. Can you visit the robots.txt file at the www. version? If you are 301 redirected to the www1. robots.txt, Googlebot will be redirected as well.

Andy Langton

2:50 pm on Apr 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This was actually queried and answered further up:

The robots file on the old site (www) doesn't have a direct redirect, but all other URLs do redirect to the new site (www1).

lucy24

7:45 pm on Apr 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've just remembered something that may help illustrate how it works. Remember, again, that for robots.txt purposes every hostname is its own site, so rules for example.com do not apply to sub.example.com

Here we go: I used to be at example.net. The site included a roboted-out subdirectory, so robots.txt for example.net said
Disallow: /directory/subdir

Then most of the site, including the roboted-out subdirectory, moved to example.com. Since the subdirectory-- and, for that matter, its parent directory-- no longer existed at the old site, I removed the robots.txt denial from example.net, and included the identical line in the robots.txt for example.com instead.

Immediately after this, the major search engines started requesting everything in example.net/directory/subdir/ that they knew about. (There were links from elsewhere in the site.) These requests received the same 301 response that a human request would have received.

But there were no requests at the redirect target, example.com, because example.com/directory/subdir is denied in robots.txt

AussieWebmaster

2:26 pm on May 3, 2016 (gmt 0)

10+ Year Member



I think I misread - but my intent was: if the spider comes in to the redirected domain with /robots.txt appended, and the entire site is redirected, it would not be served the old site's robots.txt file, or anything else from that site.

lucy24

8:17 pm on May 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can choose not to redirect robots.txt. But if everything else on the site is getting redirected, it would make absolutely no difference in crawling.

Say you're a compliant search engine, requesting robots.txt every day or so. (Or every hour or so if you are the bingbot, but this is the Google subforum.) One day you request
example.com/directory/subdir
and meet with a redirect to
example.net/directory/subdir

At this point, you don't re-request example.com/robots.txt to see if it, too, gets redirected to example.net. You don't care; all that matters is that you now have example.net/directory/subdir on your shopping list. (If you are the googlebot, or most other major search engines, you will in fact follow this redirect within a few seconds, unless you happen to have crawled it recently. But that's an individual robot behavior choice; it's not automatic as with a human using a browser.)

Instead, since the redirect target lives on a different host, you check your records to see whether you have a robots.txt on file for example.net. If you do, you proceed directly to the file request. If you don't, you'll first make a direct request for
example.net/robots.txt
before racing through all your other redirect targets.
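The fetch logic just described can be sketched as follows. All hostnames, rules, and the redirect map are hypothetical stand-ins, replacing real HTTP fetches with in-memory tables:

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical per-host robots.txt contents and one site-move redirect.
ROBOTS = {
    "example.com": ["User-agent: *", "Disallow:"],      # allow all
    "example.net": ["User-agent: *", "Disallow: /"],    # block all
}
REDIRECTS = {
    "http://example.com/directory/subdir": "http://example.net/directory/subdir",
}

_cache = {}  # per-host robots.txt cache, as described above

def allowed(url):
    host = urlsplit(url).hostname
    if host not in _cache:                  # no rules on file for this host?
        rp = robotparser.RobotFileParser()  # ...then fetch its robots.txt first
        rp.parse(ROBOTS[host])
        _cache[host] = rp
    return _cache[host].can_fetch("*", url)

def crawl(url):
    if not allowed(url):
        return "blocked by robots.txt"
    target = REDIRECTS.get(url)
    if target:
        # The redirect target is judged against ITS OWN host's robots.txt.
        return crawl(target)
    return "fetched"

print(crawl("http://example.com/directory/subdir"))  # → blocked by robots.txt
```

The source URL passes the example.com rules, but the hop to example.net is checked against example.net's own rules and blocked there, matching the behaviour described in the post above.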

if the spider comes in to the redirected domain with the robots.txt appended

What does this mean?

Edit: Here's the opposite scenario. Suppose you redirect robots.txt requests, only, to some other hostname, while leaving everything else unchanged. Congratulations. You have now made your entire site uncrawlable, because the compliant robot is unable to access the site's own robots.txt. (Is this even in the docs? I can't imagine the scenario ever happening, except as the result of some horrible blunder.)