Forum Moderators: Robert Charlton & goodroi


Having trouble getting rid of unwanted item pages in Google


gomer

7:02 pm on Sep 21, 2014 (gmt 0)

10+ Year Member



We are having trouble getting pages out of the Google index and I would appreciate some thoughts.

We have an item page that takes in parameters as follows:

item.php?param=A&param=B&param=C&param=D ...

It was never our intention to get our item page indexed but unfortunately it did. With a whole bunch of parameters and combinations that got crawled from external sites, we currently have 149,000 pages in the index for item.php.

To remove the pages from the index, we have added the following to robots.txt:

/item.php
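For reference, a bare path on its own line is not standard robots.txt syntax; the entry sits under a Disallow directive, along these lines:

```
User-agent: *
Disallow: /item.php
```

Disallow matches by prefix, so this also covers every /item.php?param=... variation.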

In Google Webmaster Tools, we have also put in a request to remove:

/item.php

Google has processed the request, however, when we do:

site:domain.com/item.php

We are still seeing the 149K pages. We have now added noindex, nofollow tags to the pages themselves, but Google will need to crawl them again to see that. And unfortunately, all those combinations of parameters may never be requested in exactly the same way again.
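For reference, the tag we added is the standard robots meta tag in the page head:

```html
<meta name="robots" content="noindex, nofollow">
```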

We've also added:
<link rel="canonical" href="http://www.domain.com/item.php" /> to the page.


My questions are:

1. /item.php in robots.txt should have removed all item.php pages, including ones with parameters, from Google, correct?

2. Why are we seeing 149K results with: site:domain.com/item.php if Google has processed our request in Webmaster Tools and also our robots.txt change? Is there a lag between site: command and Google saying they processed our request?

3. Does anyone have any direct experience with site: still showing results, including cached pages, where those results are not really in the Google index?

4. Is there anything else we can do to stop those pages appearing in site: results?

Thanks.

netmeg

9:55 pm on Sep 21, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1. No. robots.txt directs crawling, not indexing.

2. Pages removed in GWT are taken out almost instantly, in my experience. Sounds like it didn't go through correctly.

3. No.

4. Remove the robots.txt entry so the pages can be crawled again, leave the NOINDEX on them, and don't bother about robots.txt after that.

not2easy

10:04 pm on Sep 21, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yes, netmeg is right. With crawling blocked in robots.txt, Google can't see the noindex tags or the canonical tags. URL removal in GWT works right away, but if Google can't crawl those URLs and see the noindex tags, they will show up again after a few months. Blocking in robots.txt is especially ineffective when external links point to the URLs you want de-indexed. The rest is a matter of time. One thing you can do, if there are a few real pages that generate all those parameters, is to add the noindex tag and then use "Fetch as Google". But if that means 149K fetches, it will simply take time.

aakk9999

10:24 pm on Sep 21, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



2. Because, from what you wrote above, you asked Google to remove a single URL, /item.php, and not every /item.php?param=A&param=B&param=C&param=D ... variation. It is important to understand that the WMT removal tool does not do partial URL matching the way robots.txt does. You can remove a whole directory (a whole folder), but /item.php is not a directory, it is a page, so you asked Google to remove only that one page from its index.

3. If item.php?etc. URLs show when you run the site: command, they are in the Google index. In some cases, unless you run a "site:" command, you do not see these URLs listed in the SERPs because there are many other URLs that Google prefers to show first. Sometimes you cannot even see them with a plain "site:" command and you need to supplement it with the "inurl:" operator, e.g.
site:example.com inurl:/item.php

4. You can do as netmeg said above. If you do, remove the canonical link element as well, since it should not coexist with meta robots noindex. Alternatively:

a) Leave the URLs blocked in robots.txt and do not worry about them being in the index, because they only ever show when you use the site: command. After a while, these URLs will all show "A description for this result is not available because of this site's robots.txt".

b) Use the URL Parameters section in WMT to tell Google to ignore URLs with these parameter(s) (you can select "No URLs"). NOTE: you can only do this if these parameters are not used on pages you do wish to be indexed.

c) Unblock in robots.txt, but do not add noindex; let the canonical link element do its job. This will take some time, as Google will have to re-crawl each URL in order to see the canonical link element. You can monitor progress by running site:example.com inurl:/item.php and watching the number of results go down over time. Note that it can take months for Google to recrawl 149K URLs. NOTE: Google will not see the canonical unless you unblock item.php in robots.txt.

gomer

12:31 am on Sep 22, 2014 (gmt 0)

10+ Year Member



netmeg, not2easy, aakk9999, very helpful, thank you.

netmeg, good point about removing /item.php from robots.txt. We needed that there before this problem ever happened, but now it is too late for that, considering we want the pages to be crawled so the noindex can be found.

aakk9999, I agree that if it were /item/ then the pages could have been removed quickly in WMT and the whole situation would have been much easier.

Here is the plan, and I still have a few questions.

We are going to remove /item.php from robots.txt so the pages can be crawled, and we are going to serve noindex on the pages. The problem, though, is that we can't really recreate these combinations of parameters. Crazy, but there are probably 8 parameters, which leads to a ridiculous number of combinations.
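To give a sense of the scale (with made-up numbers, since I haven't counted the real values per parameter): if each of the 8 parameters could take, say, 5 values or be absent entirely, the number of distinct URLs would be:

```python
# Illustration only: 8 parameters, each with 5 assumed possible values
# plus the option of being absent entirely.
combos = (5 + 1) ** 8
print(combos)  # 1679616 distinct parameter combinations
```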

A few more questions.

5. Removing the item page and serving up a 410 (Gone) is an option, but I don't think it would help. Having the item page serve noindex is just as good, since that combination of parameters needs to be requested anyway. Do you agree that a 410 Gone does nothing over what a noindex would do? Noindex is probably better, since it is a specific directive to remove the page from the index.
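At the HTTP level, the two options would look like this (X-Robots-Tag being the header equivalent of the meta noindex tag, as I understand it):

```
HTTP/1.1 410 Gone

-- versus --

HTTP/1.1 200 OK
X-Robots-Tag: noindex, nofollow
```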

6. How much can this hurt us? Unfortunately, this item page is similar in content to other pages that are important to our site, and I'm worried about duplicate content. Our site is supposed to have about 30K pages, but now it has these 149K pages we don't want.

7. Is there anything else I can do to remove these pages? I looked at all the contact forms for Google and I'm not seeing anything that will help us, but I thought I'd ask here in case there is something else. I agree that the noindex tag on the page, rel canonical and time are what is needed, but I don't want to miss anything, so I thought I'd ask again.

UPDATE:

8. Was thinking about this more, and since we don't even want item.php in the index, I'm not seeing the need for rel-canonical. Agree?

aakk9999

3:04 am on Sep 22, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From your questions above, I presume these URLs are not linked internally? I did not realise this before.

5. If the pages should really not exist, and they are not linked internally, then I would personally go for 410 Gone. Google removes 410 Gone pages fairly promptly from the index. Note that 410 Gone pages will be reported in the WMT Errors section (alongside 404s), which you should just ignore. I am speculating that 410 Gone may restore your crawl budget faster than noindex, although I do not have evidence of this. (If the pages are linked internally, I would use noindex, as 410 Gone would then not be an option.)

6. Difficult to say. I have seen some sites hurt by leaking a large number of duplicate URLs and also other sites where this did not seem to make any impact at all. Sometimes "hurt" is just that Google does not have the time to visit your more important pages more often because it is busy re-crawling 149K of other non-important pages. Or it may hurt because Google sees influx of pages with no value. Or it may dilute page rank too thinly so your important pages suffer. Or it may not hurt at all.

7. Not really. Choices are noindex and 410 Gone. Canonical is not the right choice here as you do not want canonical version in index either.

8. Correct. If you decide to use noindex, then remove rel-canonical from the head section. Noindex and canonical should not be used together anyway.

gomer

3:26 am on Sep 22, 2014 (gmt 0)

10+ Year Member



aakk, thanks.

Yes, we do not link to the item page internally; it is actually a page that our affiliates use. We could go the 410 route, serving up a 410 Gone but then redirecting to a new page which serves up the content affiliates are looking for. Since there seems to be little difference, if any, between 410 and noindex, we might stay with noindex.

Agree that rel-canonical is not needed since we are not having any variation of the page in the index.

I am considering having our developer write a program to do the following in an automated way. Do site:domain.com/item.php, request a batch of urls and then login to WMT and submit those url's to be removed.This would need to be done many times of course to remove 149K pages but since this is automated, we think we can pull it off unless we get blocked by Google for automated requests, we've done some of this in the past. Will keep this thread posted if we attempt this.