Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

How to Handle Indexing of Staging Environment


Rysk100

11:04 am on Jul 9, 2017 (gmt 0)

5+ Year Member



I need some advice regarding the indexing of a site's staging environment.

Both the new site and the new CMS were indexed around 4 months ago. There are about 90 'bad' URLs which have been indexed and are returning 200 response codes. In addition to the indexing problem, the company also wants to do an http > https migration. The real URLs, i.e. the site's content pages, number about 200, of which maybe 20 have been indexed.

Breakdown of these URLs

1. 35 URLs
- URLs like /sites/test.com /widget/tom (test page)
- Have no corresponding / relevant URLs
- Some cached with the 404 error page, some returning Google's 404 page, some not cached

-> I will return a 410 for these URLs

2. 44 URLs
- URLs like /widget/blue, which should have been indexed as /gizmo/blue, or /widget/blue, which should have been indexed as /widget/blue-new
- Some cached with the correct content but the wrong URL, some cached, some not cached at all

-> I will return a 301 for these URLs

3. 28 URLs
- URLs of proprietary CMS assets including CSS, JPEG, Sprite files
- About half are returning a 403 response code, some are returning a 200 response code

-> I will return a 403 for those URLs

My thinking

1. 301s - although some of these URLs were indexed without content, each has a relevant, logical corresponding URL. 301s are a strong signal and should force the indexing of the correct pages and the removal of the bad URLs more quickly

2. 403s - eventually Google will stop trying with these URLs
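The three response plans above can be sketched in Apache .htaccess, assuming an Apache server with mod_alias enabled; the test URL in rule 1 and the asset folder in rule 3 are placeholder paths, not the site's real URLs:

```apache
# 1. Test/staging URLs with no corresponding page -> 410 Gone
#    (placeholder path)
Redirect gone /widget/tom-test

# 2. URLs indexed under the wrong path -> 301 to the correct page
Redirect permanent /widget/blue /gizmo/blue

# 3. Proprietary CMS asset folder -> 403 Forbidden
#    (placeholder folder name)
RedirectMatch 403 ^/cms-assets/
```

With mod_alias, omitting the target URL is correct for non-redirect statuses like 410 and 403.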

Process

1. Return the 410, 403 and 301 responses as above
2. Request temporary removal of all the 410 and 403 URLs via Google Search Console
3. Request re-crawling and indexing of the good URLs on the site via Google Search Console
4. Acquire good links to assist in deep crawling and indexing
5. Wait for indexation issues to be rectified
6. Site migration and server re-direct 301 from http to https
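Step 6 can eventually be handled with a server-level redirect; a minimal mod_rewrite sketch, assuming Apache:

```apache
# 301-redirect all http requests to their https equivalents
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```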

Questions

1. How does the plan sound?
2. What else can I do to expedite the indexing and de-indexing of the URLs in question?
3. What is the extent of the damage?

not2easy

3:38 pm on Jul 9, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hi Rysk100 and Welcome to WebmasterWorld [webmasterworld.com]

Too late to prevent now, but this question would have been much simpler to answer before the horses were out of the barn. (Keep the doors closed)

It is not terribly bad, but since Google obeys robots.txt, why show them 403s?

There is no need to bother with #2 (2. Request temporary removal of all the 410 and 403 URLs via Google Search Console ) as none of those are in the index. You should submit a sitemap to provide a list of URLs that you would like to have indexed, that is done in GSC. It is not required, but it has always helped in my experience.

Since it appears that this is a "new" site, why go through a change just after debut? Start it off on https. (If it is ready) You could easily have valid reasons.

Rysk100

3:56 pm on Jul 9, 2017 (gmt 0)

5+ Year Member



All the URLs I referred to are in the index. Some cached, some not.

It seems that the URLs I wanted to 410 are just error pages being returned with a 200 code - i.e. soft 404s. I could just leave these and EVENTUALLY Google will de-index them, but won't the removal tool help things along? Regarding the asset files/URLs (e.g. CSS, JPEG files) - there are some asset file URLs in the index being returned with a 200 response code and some with a 403.

So in addition to returning a 403 response for those asset folders/files, should I also block crawling via robots.txt?

Would this i) force those indexed URLs out of the index and ii) stop other ones being crawled in the first place?

Yes the sitemap is being added soon. The site was launched on http, but we will wait on the migration.

not2easy

6:50 pm on Jul 9, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If G receives a genuine 410 response from your server when it requests the files, it will stop looking for them, because 410 is a status code that means Gone. If you need help making that work, visit the Apache Forum here: [webmasterworld.com...]

If you Disallow those files in robots.txt then no 403s will be served, at least not to Google. Of course, if you disallow CSS and JS files from Google, they have a hard time determining where your content belongs in search results, and it is harder for them to tell what a visitor would see if they send one there.
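For example, a robots.txt along these lines keeps compliant bots out of a private area while leaving CSS/JS crawlable so rendering is unaffected; the folder names are placeholders:

```
# Block crawling of the private CMS area, but keep
# stylesheets and scripts crawlable for rendering.
User-agent: *
Disallow: /cms-private/
Allow: /assets/css/
Allow: /assets/js/
```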

Using robots.txt works fine with Google. Some 'good' bots don't always interpret things the same way, but Google is quite faithful in that regard. IF those site assets are in their own folder or folders, it is fairly simple to "no-index" the entire folder - again that is discussed in the Apache forum because it uses .htaccess to set a header for bots.
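A sketch of that approach, assuming Apache with mod_headers enabled; the file extensions are illustrative, and note that Google can only see this header on URLs it is still allowed to crawl:

```apache
# Send a noindex directive for asset files via the
# X-Robots-Tag response header (placed in the assets
# folder's .htaccess, or scoped by extension as here)
<IfModule mod_headers.c>
    <FilesMatch "\.(css|js|jpe?g|png)$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>
</IfModule>
```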

There's a decent site search here; you may find related information using the search in the upper right corner. Try "X-Robots" for examples.

phranque

8:31 pm on Jul 9, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You should set up HTTP Basic Authentication on your staging server which would send a 401 status code with the response.
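A minimal .htaccess sketch of Basic Auth, assuming Apache with mod_auth_basic; the realm name and password-file path are placeholders:

```apache
# Require a login for the staging site; unauthenticated
# requests receive 401 with a WWW-Authenticate header
AuthType Basic
AuthName "Staging"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

The password file itself is created with the `htpasswd` utility, e.g. `htpasswd -c /etc/apache2/.htpasswd username`.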

ipco

12:51 pm on Jul 10, 2017 (gmt 0)

10+ Year Member



Apologies if I appear to be hijacking this, but I had a similar situation where both my production and my dev sites were indexed, ergo duplication.
I password protected the dev site to stop it being crawled. To me, it seemed easier than redirecting everything.
Did I do wrong?
Should I have done more?
Would the same work for Rysk100?

martinibuster

2:17 pm on Jul 10, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Answer to Rysk100 ;)

Putting those URLs into robots.txt only stops Google from crawling those URLs. It won't remove those URLs from the index.

I think the 410 error response code plus using the GSC removal tool to remove those URLs from Google's index one at a time is a good strategy.

Adding the URLs to robots.txt will encourage rogue bots to crawl them. What happens after that is unpredictable, but could include getting added to a scraper page (a long shot). I would feel more comfortable not having those URLs picked up by rogue bots.

[edited by: martinibuster at 2:30 pm (utc) on Jul 10, 2017]

Rysk100

2:26 pm on Jul 10, 2017 (gmt 0)

5+ Year Member



Thanks Martinibuster (like your writing by the way). Are you referring to me or ipco?

Why do you suggest requesting URL removal one at a time?

The CMS is now returning 404s for those error pages (group 1 in my list above). I can do 410s for them, but it's problematic. Will Google take the same action on a GSC removal request for a page returning a 404 as for one returning a 410?

martinibuster

2:36 pm on Jul 10, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Why do you suggest requesting URL removal one at a time?


There's no bulk removal via GSC, afaik. If you know of one let me know, I'll be very grateful!

Will google take the same action on a removal request on GSC on a page returning a 404 as a 410?


It should be the same, in theory. I like the message of permanence that a 410 conveys. But in your case, a 404 plus the GSC manual removal should be enough.

Thanks Martinibuster (like your writing by the way).


Thanks! :)
I've a backlog of articles to complete for SEJ that'll hopefully be getting published within the next few weeks!

phranque

2:16 am on Jul 11, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I password protected the dev site to stop it being crawled.
...
Would the same work for Rysk100?


One method of "password protecting" a dev or staging site is HTTP Basic Authentication.
Does your method return a 401 status code with a WWW-Authenticate header field to unauthenticated requests?

ipco

1:50 pm on Jul 11, 2017 (gmt 0)

10+ Year Member



@phranque, I must admit, a lot of this is new ground for me so I'm stumbling along and frequently tripping myself up :-)
I got the idea to password protect from this forum so I did that and site:dev.mysite doesn't return anything so I assumed all ok.