Forum Moderators: Robert Charlton & goodroi

Best Way to Remove Indexed Search Result Pages


networkliquidators

6:38 pm on Apr 2, 2014 (gmt 0)

10+ Year Member



Hello,

I currently have a site with over 20,000 search result pages indexed. What is the best method for getting rid of these pages? The search page and these search result pages do not share the same file path: I have search.html as the normal search page, which is not indexed, and /widget-thing/keyword-query as the indexed search result.

Believe me, this is something I would never have allowed in the first place; I just walked into it already being set up this way.

A.) Adding a Disallow rule in the robots.txt file covering the whole /widget-thing/ directory. Although I've heard that Google may not remove the pages, only stop re-visiting them for updates.

B.) 301 these pages back to the search page to kill them off, and remove the section of the website where these popular query pages exist.

C.) Allow them all to 404

D.) Adding a Meta NOINDEX,NOFOLLOW Rule to these pages

Maybe it's a combination of some of the above.
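For reference, option A would look something like this in robots.txt (assuming all the indexed results really do live under /widget-thing/):

    User-agent: *
    Disallow: /widget-thing/

Note this only stops compliant crawlers from fetching those URLs; it says nothing about removing entries that are already indexed.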

lucy24

9:22 pm on Apr 2, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Adding a Disallow rule in the Robots.txt file based on the whole directory of /widget-thing/. Although, I heard that Google may not remove pages, only not re-visit them again to update.

Roboting-out a page will stop it from ever being crawled again, but won't fully remove it from search results. This may or may not be an issue, depending on whether the page could ever come up naturally-- that is, somewhere other than a "site:" search you're doing yourself to test whether the page shows up.

If a page is roboted-out, the search engine will never request it, so it will never see a response header or in-page meta.

If you can readily attach a "noindex" label to all search-result pages, that seems the best approach. Google is always saying they don't want to index search-result pages; they just aren't very good at identifying them.
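For anyone following along, the "noindex" label described here can be applied either as an in-page meta tag or as a response header; a minimal sketch:

    <meta name="robots" content="noindex">

or, for non-HTML resources, the equivalent HTTP header:

    X-Robots-Tag: noindex

Either works only if the page remains crawlable, which is why it can't be combined with a robots.txt Disallow on the same URLs.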

roshaoar

10:32 pm on Apr 2, 2014 (gmt 0)

10+ Year Member



I had the same problem. I did D), but also A): I disallowed it in robots.txt a few days after I'd set "no URLs" against zoom_query and zoom_sort in Webmaster Tools > Crawl > URL Parameters. Google has now removed all but 1 (out of a few hundred) after 3 months.

Andem

11:02 pm on Apr 2, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



This might not be considered orthodox, but here's my solution (in PHP):

if (isset($_SERVER['HTTP_USER_AGENT'])
        && strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false) {
    // Tell Googlebot (only) the page doesn't exist; normal visitors are unaffected.
    header('HTTP/1.1 404 Not Found');
    exit;
}


Note this is after Google repeatedly ignored robots.txt and noindex. You might even consider "HTTP/1.1 403 Forbidden".

lucy24

11:18 pm on Apr 2, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robots.txt and noindex

Concurrently on one page, or first one and then the other? They're mutually exclusive for search-engine purposes.

rainborick

12:26 am on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you're certain that you want these pages removed from the index, the easiest solution is to use the URL Removal Tool in Webmaster Tools where you can remove an entire directory with a single command.

Andem

10:04 am on Apr 3, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Concurrently on one page, or first one and then the other? They're mutually exclusive for search-engine purposes.


They aren't mutually exclusive because robots.txt is often ignored.

7_Driver

11:06 am on Apr 3, 2014 (gmt 0)

10+ Year Member



I've removed a lot of pages - and the NoIndex meta tag seems to be the best way.

It will (eventually) get rid of most pages - but some always seem to stick around no matter what.

If you get sick of waiting, you can expedite the process using the URL removal tool (maximum 1,000 URLs per day) - and there's a Chrome plugin that will let you process a batch. But having provided the tool, Google say you shouldn't use it. Or only in emergencies. (Your business dying as a result of SEO problems due to having the wrong pages in the index doesn't count as an emergency.)

I'd stay away from robots.txt. First, it stops Google from seeing the NoIndex meta tags - and second, it doesn't tell Google to remove the page from the index, just to stop crawling it.

You may also find that pages blocked by robots.txt still appear in the serps/index - but just as bare URLs - with the message "We'd like to show you the contents of this page - but the web site won't let us". Which is probably not the effect you're looking for.

aakk9999

11:08 am on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They aren't mutually exclusive because robots.txt is often ignored.

This is a statement often voiced, but every time a particular reported case has been examined closely, it has been found not to be true.

Have you got evidence of Googlebot ignoring robots.txt, and if so, what is it?

Andem

12:32 pm on Apr 3, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Uh yes. Ever seen "A description for this result is not available because of this site's robots.txt"?

netmeg

12:45 pm on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's a question of crawl vs. index. They're not the same thing.

robots.txt tells a spider not to *crawl* - and that's exactly what happens there: Google can't crawl the page to generate a snippet. That doesn't mean it won't be indexed.

NOINDEX controls indexing, and I for one have never had Google ignore that directive, over millions of pages. Sometimes it takes them a while to find it, but once found, it stays NOINDEXed until I change it.

This is not peculiar to Google either; Bing and Yahoo behave the same way - I assume Yandex does too.

aakk9999

12:55 pm on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As netmeg said. Here is also what Google says about this:

Block or remove pages using a robots.txt file
https://support.google.com/webmasters/answer/156449?hl=en [support.google.com]
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

Andem

1:04 pm on Apr 3, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I have trouble believing Google's line on this topic, because they haven't respected my rules for years now, as my logs show. I could find unique text on pages I'd told them not to crawl, and decided enough was enough; the 404 route solved that for me.

The robots.txt exclusion protocol was created to tell bots to stay the hell out, and I don't think they do. I have had mixed success when using noindex.

networkliquidators

1:13 pm on Apr 3, 2014 (gmt 0)

10+ Year Member



Thanks for the replies. I feel a bit safer moving forward now.

lucy24

6:30 pm on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ever seen "A description for this result is not available because of this site's robots.txt"?

That means robots.txt IS BEING FOLLOWED. The question was about the googlebot crawling roboted-out pages.

afaik, the only documented instances of google disregarding robots.txt are when the page is named in a sitemap.

Andem

8:25 pm on Apr 3, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



@lucy24: Fair enough. Wording, meaning and execution are not clear.

You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).

<snip>

While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.


(Emphasis mine) [support.google.com]

Google seems to be contradicting themselves. What to do if you don't want your content crawled or indexed? It appears there is no clear policy that is respected.

The rules, and Google's ability to follow them, seem to be a moving goal post, and I don't disagree with anybody's interpretation either. NOINDEX has been hit and miss in my experience either way.

JD_Toims

8:34 pm on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google seems to be contradicting themselves. What to do if you don't want your content crawled or indexed? It appears there is no clear policy that is respected.

Emphasis Added
your site includes *content* that you don't want search engines to index.

Google could easily index the content without violating robots.txt by associating it with the URL(s) if there's a rel=canonical on the page(s). E.g. a lazy scraper forgets to change the canonical URL on the page(s) to point at their own site when they steal someone's content, and Google follows the canonical directive.

In that case, they don't need to access the blocked URL for the content of the page, and it's not content on the blocked page they indexed either; it's content on someone else's site, stolen from the site with the blocked page, then associated with the disallowed URL via the rel=canonical directive on the stolen page of content.

If someone has a Google +1 button on the page(s) as well as rel=canonical it could get really confusing to try and figure out what's going on.
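To illustrate that scenario (URLs hypothetical): the stolen copy still carries the original's canonical link, so its content gets associated with the disallowed URL even though Google never fetched it:

    <!-- on the scraper's page, copied verbatim from the original site -->
    <link rel="canonical" href="http://example.com/widget-thing/blue-widgets">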

[edited by: JD_Toims at 9:03 pm (utc) on Apr 3, 2014]

networkliquidators

8:56 pm on Apr 3, 2014 (gmt 0)

10+ Year Member



I have another question regarding this matter.

Would it be safe to 301 this entire section of automated query pages back to the main search page, considering the main search page is not indexed by Google?

These automated pages hold little value, as users do not directly use these pages as part of their navigational path. I then want to remove this section from being linkable on my site.

The ultimate goal is to get these pages out of Google's index so that Google does not see these URLs as part of my main site.

I recently took a dive in traffic on the 24th of March, and after having more than 50% of my higher quality pages taking a traffic loss, everything is pointing to a Panda Update Algo change.

I can only imagine their rules became stricter and these automated query pages are hurting the site as a whole, since they account for 87% of the site's indexed pages.

JD_Toims

9:17 pm on Apr 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



These automated pages hold little value, as users do not directly use these pages as part of their navigational path. I then want to remove this section from being linkable on my site.

I'd 410 Gone 'em with a custom error page that has something like a 7-to-10-second meta refresh to the main search page -- you can't just meta refresh 0, or Google will treat it as essentially a 301, which will result in a soft 404 when there are mass redirects to a single location.
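A sketch of that delayed refresh, on the custom 410 error page (the target URL here is an assumption):

    <meta http-equiv="refresh" content="7; url=http://example.com/search.html">

The multi-second delay keeps it a convenience for humans rather than something Google would treat as a redirect.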

roshaoar

9:27 pm on Apr 3, 2014 (gmt 0)

10+ Year Member



@andem - It's been my experience that Google contradict themselves in a fair number of ways, not just here. I think it's to do with the fact that Google has so many mini-teams working on things, and they're not always aware of each other.


lucy24

1:09 am on Apr 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wording, meaning and execution is not clear.

Can't argue with you there. The one that exasperates me the most is the Remove From Index section, where they say outright that roboting-out a file will keep it out of the index after your 90 days are up.

Google won't crawl or index the content of pages

"crawl the content" simply doesn't make sense. I know what they mean, but they're not expressing it very well.

phranque

11:58 am on Apr 4, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I could find unique text on pages which I told them not to crawl and decided enough was enough

that's not proof, that's coincidence.
proof is when you look in your web server access log file and see googlebot requesting your robots.txt file followed by a request for an excluded url.
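a log check along those lines can be sketched with grep; the log lines and path below are fabricated purely for illustration, since the real path varies by host:

```shell
# Two fabricated access-log lines (real logs live somewhere like
# /var/log/apache2/access.log): Googlebot fetching robots.txt, then
# a URL under the disallowed /widget-thing/ directory.
cat > /tmp/sample_access.log <<'EOF'
66.249.66.1 - - [03/Apr/2014:12:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [03/Apr/2014:12:00:05 +0000] "GET /widget-thing/blue-widgets HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Googlebot requests for excluded URLs -- matches here, *after* the
# robots.txt fetch, would be real evidence of the rule being ignored.
grep 'Googlebot' /tmp/sample_access.log | grep 'GET /widget-thing/'
```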

You need a robots.txt file only if your site includes content that you don't want search engines to index.

i changed the emphasis slightly to show that it is still consistent with google's distinction between indexing content and indexing urls.

when someone links to you, the url and anchor text is their content, not yours.


Would it be safe to 301 this entire section of automated query pages back to the ... page?

no.

the ultimate goal is getting these pages out of Google's Indexed and Google does not see these URLs as part of my main site.

410 Gone

networkliquidators

2:54 pm on Apr 4, 2014 (gmt 0)

10+ Year Member



@phranque

Thanks, I am working on implementing this solution into my site. I'm hoping this topic helps others in this situation, and I truly appreciate the feedback.

Now, if I can only get the custom 410 page to resolve in my .htaccess file. [G] is supposed to do it, but I probably have the rule in the wrong position, as in something else is interfering with it.

:P

lucy24

7:54 pm on Apr 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you have an ErrorDocument directive? Do the 410 pages currently lead to the Apache-default 410 page (which admittedly is alarming) or to something else?

For humans, it's often fine to use the same physical page for both 404 and 410. It just depends what kind of site you've got. But you have to specify an ErrorDocument.
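A minimal .htaccess sketch of the setup under discussion (paths are assumptions; note the error page itself must not live under the blocked directory):

    ErrorDocument 410 /errors/410.html
    RewriteEngine On
    # Serve 410 Gone for the whole automated-query section
    RewriteRule ^widget-thing/ - [G]

If another RewriteRule higher up matches these URLs first, the [G] rule never fires -- consistent with the "wrong position" suspicion mentioned above.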

networkliquidators

6:13 pm on Apr 7, 2014 (gmt 0)

10+ Year Member



Yes, I have an ErrorDocument specified specifically for a 410 error page. But it goes to a generic 404 Page Not Found that isn't even the custom 404 page I have in place for the rest of the site.

It's odd. I put in a ticket with my hosting provider though.

networkliquidators

7:56 pm on Jul 30, 2014 (gmt 0)

10+ Year Member



Hello,

I realize this is an old thread, but since July 8th I have had a massive traffic recovery from the loss around March 24th. Since that time, 90% of these automated query pages have exited Google's index, and only about half are still showing in Google Webmaster Tools under Crawl Errors - Not Found.

My conclusion, after seeing 85% of our quality pages take a traffic dive on March 24th and now fully rebound, is that the Google Panda penalty (or just the way the algorithm naturally works) has been lifted.

I definitely ruled out seasonality, as July is one of our slower months and our analytics from previous dates coincide with the rebound. Of course, I have been adding quality content (2,000+ words) to our category pages, and that seems to be working for rankings. In addition, we have had multi-language domains set up for about 1.5 months now, and a mobile version which has been up since May.

FYI, I have an E-commerce website in the Fashion Industry. I hope this insight helps others.