Forum Moderators: goodroi


Domain redirect & robots Disallow

         

BlaxEye

10:51 pm on Jun 4, 2019 (gmt 0)

5+ Year Member



Site A has been redirected to Site B for a few years, but Google still has 74K+ search results for Site A. We'd like to Disallow: / in robots.txt for Site A, in hopes that'll remove those 74K+ in the index (and replace with the Site B URLs). But since Site A is being redirected at F5, will the bot respect the Disallow (or even see it)?

In other words, can we Disallow: / in the robots file for a subdomain that's been redirected to a subdirectory for several years?
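For reference, the entire file we'd serve from Site A's root would be just:

```
User-agent: *
Disallow: /
```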

Thanks in advance!

lucy24

12:55 am on Jun 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



since Site A is being redirected at F5, will the bot respect the Disallow (or even see it)?
It will certainly see the directive, unless robots.txt requests are also getting redirected to the new domain--which does not strike me as a good idea.* Don't confuse robots.txt directives with X-Robots headers or a "robots" meta, either of which is only seen when the visitor actually receives the page.

If material is getting redirected to an entirely different hostname (domain or subdomain), law-abiding robots should end up requesting robots.txt twice: for the old URL and then again for the new one.


* When I started moving to https, I exempted robots.txt requests from the global redirect, in part because I don’t want to give robots any excuse whatsoever to say “I tried, but you wouldn’t let me see it.”
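For anyone wanting to do the same, a minimal Apache sketch of that exemption (this assumes mod_rewrite and a global http-to-https redirect; adjust the rule to match your own setup):

```apache
RewriteEngine On
# Serve robots.txt as-is so crawlers can always read it
RewriteRule ^robots\.txt$ - [L]
# Everything else gets the permanent redirect to https
RewriteRule (.*) https://%{HTTP_HOST}/$1 [R=301,L]
```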

tangor

4:07 am on Jun 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@BlaxEye ... Your redirect, if working properly, will do the job. Meanwhile, enjoy whatever Site A love G is still showing, since THAT USER ends up on Site B anyway. One of those "don't overthink this" situations ... why? G never forgets a URL it has met.

MEANWHILE, Site A will need to exist FOREVER for the redirects to work. Give it time (about 10 years) ...

tangor

4:09 am on Jun 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you really want Site A to disappear, start returning 410 on all content. (not recommended!)
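If you did go that route (again: not recommended while the redirects still matter), the Apache form would be something like this, using mod_alias:

```apache
# Return 410 Gone for everything on Site A.
# This replaces the redirect -- only do it once you
# truly want the content treated as dead.
RedirectMatch gone ^/(.*)$
```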

lucy24

5:47 am on Jun 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, whoops, double-checking question that should have been asked in the first place: These are all 301 permanent redirects, right? Google does say somewhere that if they keep meeting the same 302 year after year they'll eventually start treating it as a 301, but you should do it the right way in any case.
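If you want to confirm which status your server actually sends, here's a self-contained stdlib sketch -- the throwaway local server and the /old-page path are stand-ins for your real host:

```python
# Sketch: inspect a redirect's status code without following it.
# The local throwaway server stands in for the real Site A.
import http.client
import http.server
import threading

class OldSite(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        # 301 = permanent; a 302 here would be the mistake to look for
        self.send_response(301)
        self.send_header("Location", "/new-page")
        self.end_headers()
    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), OldSite)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does NOT auto-follow redirects, so we see the raw status
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
conn.request("HEAD", "/old-page")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # expect: 301 /new-page
server.shutdown()
```

Against a live site, `curl -I` on the old URL shows the same thing.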

And we won't talk about Bing, which continues to send image requests (only) to URLs that were redirected in December 2013 (really).

tangor

6:41 am on Jun 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't get me started on Bing! I'm about THAT close to banning the OVER ENERGETIC MS ROBOT(S) trashing my logs with 70-100k bogus requests (all 404, 406, 408, or 301-to-404) per month ... on a site with fewer than 1000 page URLs! My .htaccess has grown in the last two months from all the BS ... to make sure this stuff goes to 403!

That said, Site A to Site B must be properly redirected, and Site A must live forever to keep those redirects working. OP should be grateful that g still has 74k+ URLs from Site A ---- as all of those SHOULD END UP AT SITE B.

OP, check your logs, make sure everything is working as planned!
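A quick sanity check on those logs might look like this -- the sample lines below are made up; point the same tally at your real access log:

```python
# Tally response codes from a combined-format access log.
# On a cleanly redirected site, 301s should dominate and 404s be rare.
from collections import Counter

sample_log = """\
10.0.0.1 - - [05/Jun/2019:10:00:00 +0000] "GET /page1 HTTP/1.1" 301 0
10.0.0.2 - - [05/Jun/2019:10:00:01 +0000] "GET /gone HTTP/1.1" 404 512
10.0.0.3 - - [05/Jun/2019:10:00:02 +0000] "GET /page2 HTTP/1.1" 301 0
"""

# Field 9 (index 8) of a combined-log line is the status code
statuses = Counter(line.split()[8] for line in sample_log.splitlines())
print(statuses.most_common())  # expect: [('301', 2), ('404', 1)]
```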

BlaxEye

3:01 pm on Jun 5, 2019 (gmt 0)

5+ Year Member



First, thanks to both of you for taking the time to reply with quality advice & professionalism.

Yes, it was a global 301, robots file and all, and in most cases the Site A robots file was actually moved to a Site B second-level subdirectory. So that's the dilemma: should we remove the redirect on the robots file from Site A to Site B, then place a new file on Site A edited to Disallow: /? Further, some of these redirects are on their 3rd or 4th hop (http > subdirectory > https > subdomain > subdirectory).

From what I’m hearing:

- There should be two robots files: Site A & Site B (which isn’t happening now; A=No, B=Yes)
- I’m overthinking (no worries; I’d rather overthink than not think at all; apparently my predecessors weren’t thinking at all).
- Leave it alone. Visitors wind up at the intended page anyway.
- Don’t return a 410
- Bing sucks

Does that sum things up, or did I open another can ‘o worms?

lucy24

6:17 pm on Jun 5, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Site A robots file was actually moved to a Site B 2nd level subdirectory
In spite of all the malign robots requesting example.com/blog/robots.txt, you cannot have robots.txt files in a subdirectory. The file has to live at the host root. Note that this is about URLs, not physical directories:

example.com/robots.txt
is different from
blogs.example.com/robots.txt
If you have subdomains, both versions need to be reachable at their respective URLs. They might happen to be the same physical file, but the requester doesn't know--in fact cannot know--this.

If you have special access rules for certain directories, you need to put those in the primary robots.txt file, like
Disallow: /images
because you can't give /images/ its own robots.txt
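Python's standard-library parser illustrates that host-root rule -- a sketch with made-up hostnames:

```python
# Directory rules live in the host-root robots.txt: /images is blocked
# by a rule in the top-level file, not by a robots.txt inside /images.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images",
])

print(rp.can_fetch("*", "https://example.com/images/pic.jpg"))  # False
print(rp.can_fetch("*", "https://example.com/page.html"))       # True
```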

If law-abiding robots cannot see the robots.txt file for Site A, then they may never request files from that site and will therefore never see the redirect.

Illustrative example: Some years ago, I moved a bunch of material from example.old to example.new. This included one roboted-out subdirectory:
Disallow: /directory/subdir
Since it was gone from example.old, I removed the robots.txt disallow on that site and added it to example.new. Over the following months, major search engines put in a flurry of requests for
example.old/directory/subdir/page.html
because they of course had known about the pages all along (linked from elsewhere) but formerly hadn't been allowed to crawl. All these requests received the ordinary redirect to
example.new/directory/subdir/page.html
But the pages were not requested at example.new, because that site's robots.txt told them they weren't allowed.

See how that works?

Don’t return a 410
I think you misread that. If a given piece of content is genuinely gone, no equivalent at a different URL, then do return a 410. Among other things, it will make the Googlebot (only) stop requesting the file sooner.