Forum Moderators: Robert Charlton & goodroi


Disallowing Google from indexing test website before launching it?


xpro

2:26 am on Jul 26, 2019 (gmt 0)

5+ Year Member



I am launching a new website which is very large. About 250 million pages. I would like to test things before Google can index it. Is the best way to achieve this by putting the following into robots.txt?

User-agent: *
Disallow: /


Would the site get penalized by Google in any way by disallowing indexing in the beginning?

lucy24

3:40 am on Jul 26, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You are not disallowing indexing. You are disallowing crawling. This is perfectly normal, and is definitely preferable to having a lot of temporary pages and invalid links messing up the index.

:: quick detour to check something ::

Oh yes indeed. Google will keep asking for your robots.txt every day or so in perpetuity (I checked logs on my test site, which has always been roboted-out), and they will undoubtedly ask far more often in the beginning. You need not fear that they will go away sulking because they didn't get in the first time.

not2easy

3:51 am on Jul 26, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Disallowing Googlebot's crawling will not necessarily prevent your pages from being indexed. If a link anywhere on the web points to one of your URLs, Google can index that URL without ever crawling the page.

There's no requirement to allow indexing, and no penalty for noindexing at the start. Indexing can be prevented sitewide using X-Robots-Tag response headers.
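For anyone setting this up, a sitewide header can be added in the server config or .htaccess. A minimal Apache sketch (assumes mod_headers is enabled; adapt to your own server):

```apache
# Send a noindex directive on every response from this vhost.
# Requires mod_headers; remove this line before the real launch.
Header set X-Robots-Tag "noindex, nofollow"
```

Unlike a meta tag, the header also covers non-HTML resources (PDFs, images), which matters on a site this size.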

dennisjensen

7:12 am on Jul 26, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



Following up on not2easy: we're in the same situation as we speak. We deployed a sitewide X-Robots-Tag: noindex header and it works just fine.

phranque

7:49 am on Jul 26, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I would like to test things before Google can index it. Is the best way to achieve this by putting the following into robots.txt?


i would set up HTTP Basic Authentication for the test site so googlebot and other unauthorized requests get a 401 response.
this will prevent crawling of any content and indexing of any urls whether they are externally linked or not.

here is apache's guide:
Password protect a directory using basic authentication [wiki.apache.org]
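a minimal sketch of what that guide describes, for apache 2.4 (the directory path and password-file location are placeholders, and it assumes mod_auth_basic/mod_authn_file are loaded):

```apache
# Password-protect the whole test site; unauthenticated requests
# (including googlebot) get a 401 response.
# Create the password file first with:
#   htpasswd -c /path/to/.htpasswd testuser
<Directory "/var/www/testsite">
    AuthType Basic
    AuthName "Test site - authorized users only"
    AuthUserFile "/path/to/.htpasswd"
    Require valid-user
</Directory>
```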

phranque

7:55 am on Jul 26, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Would the site get penalized by Google in any way by disallowing indexing in the beginning?

there should be zero long-term disadvantage to using HTTP Basic Authentication or other standard methods (e.g. noindex directives) to temporarily prevent indexing.

topr8

9:35 am on Jul 26, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



i agree with phranque.

alternatively, if you have a firewall, just block all ip addresses for that site and allow only yours (or whoever is testing)
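if you don't have a separate firewall, the web server can do the same job. a sketch in apache 2.4 syntax (the path and IP are placeholders, 203.0.113.10 being a documentation address):

```apache
# Allow only the tester's IP; everyone else (bots included) gets a 403.
<Directory "/var/www/testsite">
    Require ip 203.0.113.10
</Directory>
```

from an indexing standpoint the resulting 403 works out the same as the 401 from basic auth.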

Mark_A

10:54 am on Jul 26, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



a new website which is very large. About 250 million pages.

I can't think of any subject for a website that could legitimately command so many pages.

Best of luck with it :-)

Dimitri

1:35 pm on Jul 26, 2019 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



It depends what you need for your "test". If this is only for you to be able to browse your site, you can restrict the access (IP, password, etc...) .

Personally, when I do major changes to my site, I serve the new version from a subdomain and restrict the access to my IP.

In any event, it's good practice to keep crawlers out while you debug a new site. Googlebot especially is eager for new stuff, so as soon as it finds the site it can crawl hundreds of thousands of pages in 24 hours, which can have all kinds of unpredictable consequences. A site should be ready before going live.

I can't think of any subject for a website that could legitimately command so many pages.

It's 43 times bigger than Wikipedia (EN) :)
Might be an amazon affiliate site, or ebay, or both ... or a mix of all affiliate programs :)

Kendo

1:41 pm on Jul 26, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you need to keep search engines and spiders from the site while testing, limit access to your IP address only because that is the only way to prevent uninvited guests.

If that is not feasible, then while testing do not use Chrome or any other browser that treats URLs as search requests, because simply typing a URL into the address bar can add it to the indexing queue. One site that I had only just put online, one that only I knew existed, was being spidered by Google within a couple of hours.

tangor

2:25 am on Jul 27, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This kind of testing calls for a dev machine running a closed system, with access limited to specific IPs for testing purposes.

I would NEVER put a website on any outward facing host and "hope" that any declarations of "no index" etc. would be sufficient.

Period.

XAMPP, WAMP ... a local dev install ... whatever, and UNCONNECTED to the web except for your tester IPs ...

Meanwhile, 250 million pages (or 250 million products) is not going to index well, or quickly, given crawl budgets, even with the top five search engines. Expect some delays when the site is finally opened to the public!

Note: "fancy widget in 32 colors" is not 32 PAGES, it is one page with 32 options!

Then again, who knows what might happen? :)

Dimitri

8:01 am on Jul 27, 2019 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Note: "fancy widget in 32 colors" is not 32 PAGES, it is one page with 32 options!

damn!

lucy24

7:23 pm on Jul 27, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You forgot the “basic” widget which comes in the same 32 colors but doesn't allow you to change the whangdoodle setting, making 64 pages, and the “advanced” widget with a further array of ... oh, never mind.

But seriously: Even if you are physically blocking everyone (firewall, Require all denied, whatever it may be), retain the robots.txt Disallow. Law-abiding robots--notably most search engines--will honor the Disallow, and the only thing better than a blocked request is a request that isn't made in the first place.

phranque

9:27 pm on Jul 27, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Law-abiding robots--notably most search engines--will honor the Disallow, and the only thing better than a blocked request is a request that isn't made in the first place.

A disallowed URL may still be indexed, which is precisely what the OP is trying to avoid.

lucy24

12:25 am on Jul 28, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But, realistically, how many URLs will float to the top of a SERP even though the search engine hasn’t crawled the page and knows nothing about its content? For that matter, how would it even know the URL exists, unless someone (who? how?) has linked to it?

phranque

2:00 am on Jul 28, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



how many URLs will float to the top of a SERP

the question was about indexing, not ranking.

lucy24

4:07 am on Jul 28, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My point is that there is no reason to worry about being “indexed”, in and of itself, when there is nothing to index. My test site is, in the broadest technical sense, “indexed” (it is roboted-out and has no GSC account), but that doesn't mean it is inundated with unwanted traffic. I don't think you are suggesting that OP should allow search engines to crawl 250 million not-ready-for-prime-time pages just so they can collect noindex directives for all of them.

phranque

4:33 am on Jul 28, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I don't think you are suggesting that OP should allow search engines to crawl

if you reread my first reply you will notice i suggested a 401 response rather than a robots exclusion directive or a 200 response containing a noindex signal.

others have suggested a 403 response (via IP access controls) which would be equivalent to a 401 from an indexing perspective.

phranque

8:27 am on Jul 29, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



it is clearly described here:
A robotted page can still be indexed if linked to from from (sic) other sites
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

(source: Understand the limitations of robots.txt [support.google.com])

[edited by: phranque at 8:49 am (utc) on Jul 29, 2019]

tangor

8:44 am on Jul 29, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did we run off xpro?

xpro! Come back! Amid all these opinions and comments there is value to be shared!

CoffeeOrDeathPlease

3:19 pm on Aug 2, 2019 (gmt 0)

5+ Year Member



Robots.txt is not always obeyed. If you have the capacity to build 250 million unique pages, then restricting access to only your IPs should be a cinch.