|How to take random site searches out of the Google index|
I posted about this a few months ago, but I want to start a new thread to consider possible new ideas.
I have a search page, say search.php, that returns dynamic results. It is a GET, not a POST. Apparently, that was one of my first mistakes, yet I do note that many sites use GET for their search functionality. Anyhow, I do have some links on my site to popular searches - maybe 100 or so over time. I did NOT use the canonical tag, and hence I had many drilldown and sort options indexed by Googlebot. My bad.
However, I did put rel="nofollow" on all those links, but that was ignored. So now I set the robots meta tag to "noindex" for any search except a plain old keyword search. I will note that in another thread, I was cautioned to block search from Google indexing completely. But when I compare traffic for search.php?q=brandx to my /brand/brandx flat HTML page, the search page gets way more traffic.
So I'm inclined to not 301 redirect search to brand pages (or other) or to "noindex" all my search pages just yet. Still, what I've done in terms of my internal searches and what I consider valid searches will certainly help things out.
What really irks me, though, is all the random searches to my site. If I'm selling baseball items, why do I get 100's of searches a day for topiary items, bathroom products, etc. etc. etc.? I feel like Googlebot churns making all these useless hits on my site that aren't relevant. I can't track where they are coming from. Someone said, from a search toolbar. That doesn't entirely make sense - why would such searches lead to my site?
The bottom line is, I want to get rid of all these junk searches. What I've done is capture 14,000 of them (yes, 14,000!).
Now, at the top of my search.php code, I check the search, and if it matches one of these weird searches, I return an empty page with a 410 header. UNLESS, as I've hard-coded, there is a single word in a weird search which has relevance to my site, in which case I search that single word, show products, and return the usual 200 header, but put in the robots noindex meta tag to keep the page out of the index.
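For anyone following along, here is a rough sketch of that check. It is written in Python rather than PHP purely for illustration, and every name and sample value in it (JUNK_SEARCHES, RELEVANT_WORDS, classify_search) is made up, not taken from the actual search.php.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical sample data; the real captured list is said to hold about 14,000 entries.
JUNK_SEARCHES = {"topiary frames", "bathroom faucet", "topiary bat stand"}
# Hypothetical hard-coded words that do have relevance to the site.
RELEVANT_WORDS = {"bat", "glove", "helmet"}


def normalize(query):
    """Lower-case the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def classify_search(url):
    """Return (status, robots_meta, effective_query) for a search URL."""
    params = parse_qs(urlsplit(url).query)
    query = normalize(params.get("q", [""])[0])
    if query in JUNK_SEARCHES:
        relevant = [word for word in query.split() if word in RELEVANT_WORDS]
        if relevant:
            # Search on the one relevant word, but ask engines not to index the page.
            return 200, "noindex", relevant[0]
        # Nothing relevant at all: answer 410 Gone with an empty page.
        return 410, None, None
    # Not a known junk search: render normally.
    return 200, None, query
```

With the sample data above, a junk search with no relevant word gets a 410, a junk search containing "bat" is served as a noindexed search for "bat", and anything else renders as usual.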
All said, I think these searches are a waste of my time, Google's time, and have no value to my site. Plus, I think they are bogging down my genuine listings in Google.
Can someone suggest how to "get rid of" all these bogus searches that are coming from Googlebot?
[edited by: tedster at 12:57 am (utc) on Mar 28, 2012]
[edit reason] no specific keywords, please [/edit]
Are you saying that Google Search is sending traffic directly to search results pages on your site that have no content? Or are you saying that googlebot is making those searches?
Hey Tedster, Googlebot is crawling 100's of bizarre search pages on my site every day. They are not Google Search referral pages. Where Googlebot got all these weird searches, I have no idea! I've tried to find the source, but no luck. I estimate that of my 28,000 indexed pages, 8,000 are for search.php. In my opinion, that's too many, but what's worse is that, based on crawl percentages, a huge number of those 8,000 are bogus pages. The search terms have no meaning on my site and, yes, they return 0 results. So I want 'em gone!
A small sample of some weird words:
$ zcat acc* | grep -i bullseye | grep -i googlebot | wc -l
$ zcat acc* | grep -i bullseye | grep -i -v googlebot | wc -l
$ zcat acc* | grep -i akadema | grep -i googlebot | wc -l
$ zcat acc* | grep -i akadema | grep -i -v googlebot | wc -l
So of those 13 "bullseye" that aren't Googlebot, here's one:
220.127.116.11 - - [11/Mar/2012:15:38:22 -0400] "GET /search.php?q=bullseye+crystal+clear+stained+glass HTTP/1.1" 200 18921 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; WOW64; SV1; .NET CLR 2.0.50727)"
That IP is:
inetnum: 18.104.22.168 - 22.214.171.124
status: ASSIGNED PA
remarks: ABUSE REPORTS:
source: RIPE # Filtered
role: Dedicated Server Contact Admin Role
address: Dedicated Server Contact
address: 2 Frater Gate Business Park
address: Aerodrome Road
address: PO13 0GW
address: UNITED KINGDOM
Not sure if that's good or bad, but that's what it is. I think it's bad, though. For March, they've crawled my site hitting 11,549 pages so far.
$ zcat acc* | grep -i 126.96.36.199 | wc -l
Scanning the 5,398 search.php requests among those, oddly, many look ok. But many look weird!
188.8.131.52 - - [11/Mar/2012:15:27:00 -0400] "GET /search.php?q=http%3A%2F%2Fqymdvpbatyml.com%2F&lp=cTYZUNMGJkrVSZak&hp=nKczBHUCBkuikNj HTTP/1.1" 200 16436 "http://www.mysite.com/search.php?q=cookie+sunglasse" "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)"
What the? Why is the referral that weird search? And what is that q= value? Another weird one:
184.108.40.206 - - [11/Mar/2012:15:37:27 -0400] "GET /search.php?q=dichroic+primary+color+starter+pack+clear HTTP/1.1" 200 22943 "-" "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0
So there's a whole lot that aren't from Googlebot. But my daily gathering of weird searches is specifically from Googlebot, at least going by the user-agent string.
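A sketch of how that daily gathering could be automated, assuming the logs are in the Apache combined format shown in the samples above. The function name and regex are illustrative, not the poster's actual script. Note that it trusts the user-agent string, which anyone can fake, so a hit that claims to be Googlebot is not proof that it is.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Combined log format:
# ip - - [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_search_terms(lines):
    """Count q= values on /search.php requests whose user agent claims to be Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "googlebot" not in match["agent"].lower():
            continue
        url = urlsplit(match["path"])
        if url.path != "/search.php":
            continue
        # parse_qs decodes "+" and %XX escapes, so terms come back as plain text.
        for term in parse_qs(url.query).get("q", []):
            counts[term.lower()] += 1
    return counts
```

For gzipped logs, the lines can be fed in with gzip.open(path, "rt") from the standard library, and counts.most_common() then lists the most frequently crawled search terms first.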
|And what is that q= value? |
It's a spin on the fake referer, isn't it? Instead of putting in the bogus referer, it gets switched one step up the line to become the search string.
In any case, the first thing you gotta do is switch off a few parameters in Google Webmaster Tools (GWT). They claim they don't want to index search-results pages. I hope they're right, because it drives me bonkers when I land on one by mistake. Hmm. Maybe Preview is good for something.
And if there's nothing in particular on the search page itself, why not yank the whole thing? Once users have arrived on your site they might find a search useful as a backup. ("I bet this site has what I want, but I can't find it.") But they're not going there for the search are they?
|Hey Tedster, Googlebot is crawling 100's of bizarre search pages on my site every day. They are not Google Search referral pages. Where Googlebot got all these weird searches, I have no idea! I've tried to find the source, but no luck. |
I suspect that Google no longer relies only on following links to find new pages. Google may have found your search URLs by seeing what users with the Chrome browser actually land on when they hit the search button. It is widely thought that Google snoops on the contents of Gmail emails to look at any URLs in them that it hasn't seen before; I suspect it's using data from Chrome and Android users to do the same.
I'm not clear why you want ANY site search results indexed at all. If this was my site, I would disallow search.php in robots.txt and end any Google issue. Rogue bots are something else entirely - that's why we call them "rogue".
Powering crawlable pages with search results is NOT a good SEO strategy anymore. I've seen evidence that:
1) Google does not rank pages that use the word "search" in the page title.
2) Having many search result pages crawled can lead Google to believe that your site is "low quality" and get you a Panda penalty.
3) Google will apply a manual penalty to your site for having an "infinite number of pages" based on crawlable search results if a human at Google takes a close look at your site.
If I were you, I would rename search.php to sitesearch.php and put sitesearch.php in robots.txt. Then take your top search.php?q= urls that are actually bringing in referral traffic and 301 redirect them to the appropriate product page.
Yea, if you're bound and determined to keep the search stuff in Google, then I got nothing. But even if that's where most of your traffic is coming in now, don't count on that always being the case.
OK, I did it. Copied search.php to look.php, and blocked look.php in robots.txt as well as putting in the robots meta noindex, nofollow. Updated much code for my site to use look.php, and will have to finish that migration over the next few hours.
For search.php, I added code at the top that tries to find a proper brand or category page, and if found, 301 redirect to it. This is for any page request, not just Googlebot. If not found, the search.php renders as usual for now. In a bit, I'll add more redirection code to either 301 to look.php or handle the searches in other ways.
Now I have to update any and all precaching and reporting related to this.
Any other gotchas in terms of SEO I might want to consider at this point? It'll be interesting to see what happens in about 2 weeks with regard to Googlebot, other bots, and my site traffic.
Well if you blocked it in robots.txt, how is Google gonna catch the NOINDEX?
He put look.php into robots.txt but search.php should still be crawlable for the redirects.
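The point in the two posts above (a crawler that obeys a robots.txt block never fetches the page, so it never reads a noindex tag or follows a redirect on it) can be checked with a quick sketch using Python's standard urllib.robotparser. The robots.txt content here is a guess at the setup described, not the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the setup described: only look.php is blocked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /look.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(path, agent="Googlebot"):
    """Return True if the given agent may fetch this path under ROBOTS_TXT."""
    return parser.can_fetch(agent, "http://www.mysite.com" + path)


# look.php is blocked, so its meta noindex will never be read by a compliant bot.
print(crawlable("/look.php?q=bat"))
# search.php stays crawlable, so the 301 and 410 responses can be seen.
print(crawlable("/search.php?q=bat"))
```

So under these assumed rules, the meta tag on look.php is effectively a backup, while the redirects on search.php remain visible to the crawler.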
Update: it's been 5 days, and my search.php count in the Google index is now 2,900, down from 8,000. I'm measuring with the same query as before: site:mysite.com search.php.
At the same time, the total indexed count has gone up to 31,300. Traffic has remained the same.
I wrote code to "smart" 301 redirect to category or brand pages based on the keywords. Where my logic doesn't provide a target page, I've hard-coded the new mappings, about 400 in all so far. About half of those 400 keywords (phrases) map to "nothing" - that is, I let those pages 410.
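A minimal sketch of that lookup order, again in Python for illustration only. The page tables, the manual map and the function name are all hypothetical placeholders, and the real matching logic is surely smarter than an exact-match lookup.

```python
# Hypothetical inventories of existing flat pages, keyed by normalized keyword.
BRAND_PAGES = {"brandx": "/brand/brandx"}
CATEGORY_PAGES = {"gloves": "/category/gloves"}
# Hypothetical hard-coded mappings for phrases the automatic lookup misses.
# None means "no sensible target": let the page answer 410 Gone.
MANUAL_MAP = {"brand x gloves": "/brand/brandx", "topiary frames": None}


def resolve_search(query):
    """Return (status, location) for an old search.php query.

    (301, url) redirects, (410, None) marks the page gone, and
    (200, None) means render the search results as usual.
    """
    key = " ".join(query.lower().split())
    # 1) Automatic match against brand and category pages.
    for pages in (BRAND_PAGES, CATEGORY_PAGES):
        if key in pages:
            return 301, pages[key]
    # 2) Hard-coded fallback mappings.
    if key in MANUAL_MAP:
        target = MANUAL_MAP[key]
        return (301, target) if target else (410, None)
    # 3) No rule applies.
    return 200, None
```

Keeping the "maps to nothing" phrases in the same table as the real targets means one lookup decides between 301, 410 and a normal render.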
I checked Google for my new look.php search page, and thankfully, none have been indexed so far. So let's see what happens 2 weeks from now - I'll check in again.
Hey sftriman, Thanks for sharing. I am mainly replying to be notified of your progress in the future :) I am facing a nearly identical situation and may be following in your footsteps shortly. Please keep us posted.
Sounds like you implemented it perfectly. Glad it seems to be working out so far.