
Forum Moderators: Robert Charlton & goodroi

Disallowing Google from indexing test website before launching it?

     
2:26 am on Jul 26, 2019 (gmt 0)

New User

joined:June 2, 2019
posts: 11
votes: 0


I am launching a new website which is very large. About 250 million pages. I would like to test things before Google can index it. Is the best way to achieve this by putting the following into robots.txt?

User-agent: *
Disallow: /


Would the site get penalized by Google in any way by disallowing indexing in the beginning?
3:40 am on July 26, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


You are not disallowing indexing. You are disallowing crawling. This is perfectly normal, and is definitely preferable to having a lot of temporary pages and invalid links messing up the index.

:: quick detour to check something ::

Oh yes indeed. Google will keep asking for your robots.txt every day or so in perpetuity (I checked logs on my test site, which has always been roboted-out), and they will undoubtedly ask far more often in the beginning. You need not fear that they will go away sulking because they didn't get in the first time.
3:51 am on July 26, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4504
votes: 347


Disallowing Googlebot's crawling will not necessarily prevent your pages from being indexed. As soon as a link anywhere on the web points to one of your pages, Google can pick up that URL and index it without ever crawling it; robots.txt controls crawling, not indexing.

There's no requirement to allow indexing, and no penalty for noindexing at the start. Indexing can be prevented sitewide with an X-Robots-Tag response header.
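
For reference, a minimal sketch of a sitewide noindex header on Apache (assuming mod_headers is enabled; adjust for other servers):

# Sends a noindex signal with every response from the test site.
# Goes in the vhost config or .htaccess; requires mod_headers.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>

Remember to remove it at launch, or the live site will stay out of the index.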
7:12 am on July 26, 2019 (gmt 0)

Junior Member from DK 

Top Contributors Of The Month

joined:Oct 24, 2018
posts: 48
votes: 4


Following up on not2easy: we're in the same situation as we speak. We've deployed a sitewide X-Robots-Tag: noindex header and it works just fine.
7:49 am on July 26, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


I would like to test things before Google can index it. Is the best way to achieve this by putting the following into robots.txt?


i would set up HTTP Basic Authentication for the test site so googlebot and other unauthorized requests get a 401 response.
this will prevent crawling of any content and indexing of any urls whether they are externally linked or not.

here is apache's guide:
Password protect a directory using basic authentication [wiki.apache.org]
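
a minimal sketch of that setup on apache 2.4 (the directory path, password file location and username are placeholders):

# create the password file once:
#   htpasswd -c /path/to/.htpasswd testuser
<Directory "/var/www/testsite">
    AuthType Basic
    AuthName "test site - authorized users only"
    AuthUserFile /path/to/.htpasswd
    Require valid-user
</Directory>

everything without credentials, googlebot included, gets a 401.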
7:55 am on July 26, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


Would the site get penalized by Google in any way by disallowing indexing in the beginning?

there should be zero long-term disadvantage to using HTTP Basic Authentication or other standard methods (e.g. noindex directives) to temporarily prevent indexing.
9:35 am on July 26, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 19, 2002
posts:3505
votes: 82


i agree with phranque.

alternatively, if you have a firewall, just block all ip addresses for that site and allow only yours (or whoever is testing)
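
if a firewall isn't available, roughly the same allow-list can be set at the web server instead; a sketch for apache 2.4 (path and IP are placeholders):

<Directory "/var/www/testsite">
    # only the tester's address gets in; everyone else receives a 403
    Require ip 203.0.113.42
</Directory>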
10:54 am on July 26, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 15, 2001
posts:1835
votes: 66


a new website which is very large. About 250 million pages.

I can't think of any subject for a website that could legitimately command so many pages.

Best of luck with it :-)
1:35 pm on July 26, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


It depends what you need for your "test". If it's only so that you can browse the site yourself, you can restrict access (by IP, password, etc.).

Personally, when I make major changes to my site, I serve the new version from a subdomain and restrict access to my IP.

In any event, it's good practice to keep crawlers out while you debug a new site. Googlebot especially is eager for new stuff, so as soon as it finds the site it can crawl hundreds of thousands of pages in 24 hours, which can have all kinds of unpredictable consequences. A site should be ready before going live.

I can't think of any subject for a website that could legitimately command so many pages.

It's 43 times bigger than Wikipedia (EN) :)
Might be an amazon affiliate site, or ebay, or both ... or a mix of all affiliate programs :)
1:41 pm on July 26, 2019 (gmt 0)

Preferred Member from AU 

10+ Year Member Top Contributors Of The Month

joined:May 27, 2005
posts:468
votes: 20


If you need to keep search engines and spiders off the site while testing, limit access to your IP address only, because that is the only way to prevent uninvited guests.

If that is not feasible, then when testing do not use Chrome or any other browser that treats URLs as search requests, because simply typing a URL into the address bar can add it to the indexing queue. One site that I had only just put online, one that only I knew existed, was being spidered by Google within a couple of hours.
2:25 am on July 27, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


This kind of testing calls for a dev machine running a closed system, with access limited to specified IPs for testing purposes.

I would NEVER put a website on any outward facing host and "hope" that any declarations of "no index" etc. would be sufficient.

Period.

XAMPP, WAMP ... a local dev install ... whatever, and UNCONNECTED to the web except for your tester IPs ...

Meanwhile, 250m pages (or 250m products) is not going to index well, or quickly, given crawl budgets, even for the top five search engines. Expect a few delays when the site is finally opened to the public!

Note: "fancy widget in 32 colors" is not 32 PAGES, it is one page with 32 options!

Then again, who knows what might happen? :)
8:01 am on July 27, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 285


Note: "fancy widget in 32 colors" is not 32 PAGES, it is one page with 32 options!

damn!
7:23 pm on July 27, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


You forgot the “basic” widget which comes in the same 32 colors but doesn't allow you to change the whangdoodle setting, making 64 pages, and the “advanced” widget with a further array of ... oh, never mind.

But seriously: Even if you are physically blocking everyone (firewall, Require all denied, whatever it may be), retain the robots.txt Disallow. Law-abiding robots--notably most search engines--will honor the Disallow, and the only thing better than a blocked request is a request that isn't made in the first place.
9:27 pm on July 27, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


Law-abiding robots--notably most search engines--will honor the Disallow, and the only thing better than a blocked request is a request that isn't made in the first place.

A disallowed URL may still be indexed, which is precisely what the OP is trying to avoid.
12:25 am on July 28, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


But, realistically, how many URLs will float to the top of a SERP even though the search engine hasn’t crawled the page and knows nothing about its content? For that matter, how would it even know the URL exists, unless someone (who? how?) has linked to it?
2:00 am on July 28, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


how many URLs will float to the top of a SERP

the question was about indexing, not ranking.
4:07 am on July 28, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15867
votes: 869


My point is that there is no reason to worry about being “indexed”, in and of itself, when there is nothing to index. My test site is, in the broadest technical sense, “indexed” (it is roboted-out and has no GSC account), but that doesn’t mean it is inundated with unwanted traffic. I don't think you are suggesting that OP should allow search engines to crawl 250 million not-ready-for-prime-time pages just so they can collect <noindex> labels for all of them.
4:33 am on July 28, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


I don't think you are suggesting that OP should allow search engines to crawl

if you reread my first reply you will notice i suggested a 401 response rather than a robots exclusion directive or a 200 response containing a noindex signal.

others have suggested a 403 response (via IP access controls) which would be equivalent to a 401 from an indexing perspective.
8:27 am on July 29, 2019 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11842
votes: 242


it is clearly described here:
A robotted page can still be indexed if linked to from from (sic) other sites
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

(source: Understand the limitations of robots.txt [support.google.com])
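
for concreteness, the two noindex forms that passage mentions look like this (a sketch; the page must be crawlable, i.e. not disallowed in robots.txt, for google to see either one):

in the <head> of each page:
<meta name="robots" content="noindex">

or as a response header:
X-Robots-Tag: noindex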

[edited by: phranque at 8:49 am (utc) on Jul 29, 2019]

8:44 am on July 29, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10450
votes: 1087


Did we run off xpro?

xpro! Come back! Amid all these opinions and comments there is value to be shared!
3:19 pm on Aug 2, 2019 (gmt 0)

New User

joined:July 25, 2019
posts:6
votes: 3


Robots.txt is not always obeyed. If you have the capacity to build 250 million unique pages, then restricting access to only your IPs should be a cinch.