
Removing robots.txt That Disallows All


Made In Sheffield

3:58 am on Jan 23, 2017 (gmt 0)




I am about to start working with a client who has a fairly sizeable site with a robots.txt that blocks everything. It has probably been in place for ten years. My concern is that if I just delete it, a whole load of pages will suddenly become available to Google that were not previously, which may trigger a sandbox-type penalty for the sudden big change.

I would like advice, please, on what to be concerned about, whether those concerns are justified, and if so, what course of action would be best.

aristotle

7:34 pm on Jan 23, 2017 (gmt 0)




Before you make any decisions, I suggest that you take a look at the raw logs to see what the bots are doing now.

Also, if googlebot has been totally blocked for ten years, it would likely show a message about this in the search results, and in any case the algorithm wouldn't be able to rank the pages properly. So you might take a look at rankings and any google traffic it may be getting now.

So if you investigate these types of things, you might get a better picture of the situation as it stands now, which could help you decide what to do.

Made In Sheffield

9:26 pm on Jan 23, 2017 (gmt 0)




Thanks. I'll do that when I can.

The only page listed on a "site:" search is the homepage, with "A description for this result is not available because of this site's robots.txt" beneath it. None of the other pages are listed at all. It says "similar results were excluded". If I show them, I simply get another two entries for the homepage with different titles, all with the same robots.txt message.

Checking the logs is a good idea; I'll do that as soon as I get access to them. Webmaster Tools will give useful info too once I can get in. The pages come up for a company-name search, but I'm not expecting them to rank for much else.

Dimitri

10:06 pm on Jan 23, 2017 (gmt 0)




Maybe you should also identify why the robots.txt was blocking all the pages. There might be a reason, which should be taken into consideration too.

lucy24

10:29 pm on Jan 23, 2017 (gmt 0)




I simply get another two entires for the homepage with different titles

That's characteristic of any site that's fully roboted-out. Since they can't crawl, they don't know about the site's preferred name (with or without www) or protocol (https or not). They may even offer up a nonexistent /m/ version.

A "penalty" isn't really relevant, since there has never been anything to apply a penalty to. What you're really talking about is a site that was previously not indexed and is now entering the index--almost as if it's a brand-new site that just happens to have the same name as a pre-existing site. In fact, for all Google knows, it is a brand-new site.

phranque

10:57 pm on Jan 23, 2017 (gmt 0)




The server logs won't tell you much. Once googlebot has received the robots.txt file, it will not request any URLs that are excluded.

Made In Sheffield

11:05 pm on Jan 23, 2017 (gmt 0)




Thanks. So in terms of it being effectively a new site, is it better to just drop the robots.txt, or release the pages steadily over a period of time?

aristotle

1:33 am on Jan 24, 2017 (gmt 0)




The server logs won't tell you much. Once googlebot has received the robots.txt file it will not request any URLs that are excluded.

My thought was that the logs would tell you how often googlebot comes around to check. If robots.txt hasn't changed in ten years, and the site hardly gets any traffic, then googlebot might not show up very often. This might be useful information.
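One way to measure that from the raw logs is to count robots.txt fetches per day, per user-agent. A minimal sketch, assuming an Apache/Nginx "combined" log format and a local file named access.log (both are assumptions, not details from this thread):

```python
import re
from collections import Counter

# Matches the date portion of a combined-format log line for a
# GET /robots.txt request, e.g.:
# 66.249.66.1 - - [23/Jan/2017:10:06:01 +0000] "GET /robots.txt HTTP/1.1" ...
LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] "GET /robots\.txt HTTP/[^"]+"'
)

def robots_fetches_per_day(path="access.log", ua="Googlebot"):
    """Count how many times `ua` requested /robots.txt on each day."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if ua in line:            # crude UA filter on the raw line
                m = LINE.match(line)
                if m:
                    counts[m.group("day")] += 1
    return counts
```

Run the same function with ua="bingbot" or any other token to compare crawlers; if Googlebot is only checking robots.txt every few days, that gives a rough idea of how quickly a change will be noticed.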

rainborick

1:58 am on Jan 24, 2017 (gmt 0)




Don't worry about unblocking the site. Google isn't a giant game of "gotcha" with secret rules. It isn't an unheard-of situation for a website to have a complete block for an extended period and then later purposely become unblocked. But don't delete the robots.txt file. Change it to:

User-agent: *
Disallow:

so that you send an affirmative signal that you deliberately removed the blocking instruction and didn't just accidentally delete the robots.txt file. I'd follow this up by using the robots.txt Tester in the Crawl menu of GSC to have Google immediately fetch the updated file, and then use the Fetch As Google tool to fetch and submit the home page to the index.
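Before deploying the replacement file, it's worth sanity-checking that it really does allow crawling. A small sketch using Python's standard urllib.robotparser; the example URL and user-agent string are illustrative only:

```python
from urllib.robotparser import RobotFileParser

# The "allow everything" file from the post above: a wildcard
# User-agent line followed by an empty Disallow directive.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

For example, is_allowed(ALLOW_ALL, "https://example.com/any/page") should return True, while the old blanket block ("Disallow: /") would return False for the same URL.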

lucy24

2:12 am on Jan 24, 2017 (gmt 0)




If robots.txt hasn't changed in ten years, and the site hardly gets any traffic, then googlebot might not show up very often.

I checked logs for my test site, which is 100% roboted-out. On average, the Googlebot (and also the bingbot) continues to get robots.txt once every day or so. That's for a site that has existed in the same form for several years; I only looked at the past year's logs.

If you make a substantial change in robots.txt, it may take a day or so for the major search engines to notice--but when they do, you can expect a full top-to-bottom crawl almost immediately.

Made In Sheffield

2:27 am on Jan 24, 2017 (gmt 0)




Thanks all for the advice.

@rainborick great plan, I'm going to follow that, thank you.