Forum Moderators: Robert Charlton & goodroi


Issue of Multiple Parameters of Dynamic Website Indexed in Google

         

Imansoor

4:04 pm on Aug 9, 2014 (gmt 0)

10+ Year Member



Hello All,


I have a dynamic website with nearly 18k URLs indexed in Google,
many of them with query strings like example.com/?=ck and sometimes example.com/?=ck&gs

I have blocked /*?* in robots.txt

My concern is what to do with the URLs that are already indexed with these parameters. What should I do, and what should I avoid doing?

Please be specific with the answer.

[edited by: brotherhood_of_LAN at 4:19 pm (utc) on Aug 9, 2014]
[edit reason] Using example.com [/edit]

not2easy

6:45 pm on Aug 9, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Since they are not files that live in a directory, I don't know of any way to noindex them. The robots.txt block is the best I can suggest, unless you have a way to generate a header for those particular URLs (only). The rule you have blocks Google's access to any and all URLs whose "?" query string contains "something", but the "*" at the end can go away; it is unneeded and confuses the MSN bots. Have you tested this in your GWT account against the URLs you want to block from crawling?

The trailing "*" wildcard can be left off; as /*? the rule will block crawling of any URL, in any directory, that has a "?" query in it. It is always a good idea to compare robots.txt blocking rules against your sitemaps, to be sure you don't accidentally block something you didn't plan to - and to verify in GWT that the rule prevents crawling of the URLs you intend to block.
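Put together, a minimal robots.txt along those lines might look like this (a sketch, assuming you want to block every crawler from all query-string URLs):

```
User-agent: *
Disallow: /*?
```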

lucy24

7:01 pm on Aug 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If certain parameters are entirely meaningless, you can go into the URL Parameters area of GWT and tell Google to ignore them. Caution! Your definition of "meaningless" may differ from Google's. Things like sort order count as "affects page content", but you would only want one variant to be indexed.

With any luck, all the parameters you use will already be listed. So you just need to tick the appropriate box for each one.

rainborick

7:29 pm on Aug 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But you can't use the URL Parameters tool when the query string is malformed, as in the example shown here, because there is no valid parameter name to set.

My advice would be to remove the block in your robots.txt file, and insert a rule in your .htaccess file to set the X-Robots-Tag in the response header. Something like:

# If the query string starts with "=ck", set an env var as a flag
RewriteCond %{QUERY_STRING} ^=ck
RewriteRule ^(.*)$ - [E=NOINDEX:1,L]
# Send the noindex header only on requests where that flag is set
Header set X-Robots-Tag "noindex" env=NOINDEX

While I tested this code briefly, you should ask the folks in the Apache forum here whether it's really correct, just to be safe.

With this rule, all HTTP requests with a query string that starts with "=ck" should automatically get a "noindex" header. Once it is installed, you can use the "Fetch and Render" option of the Fetch As Googlebot tool to check that Googlebot sees the X-Robots-Tag when it fetches one of these bad URLs.
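To sanity-check which URLs the ^=ck condition would match before deploying it, here is a small Python sketch that mirrors the test (the function name is mine, for illustration only):

```python
from urllib.parse import urlsplit

def matches_noindex_rule(url: str) -> bool:
    # Mirrors the RewriteCond above: query string starting with "=ck"
    # (hypothetical helper, not part of the Apache rule itself)
    return urlsplit(url).query.startswith("=ck")

print(matches_noindex_rule("http://example.com/?=ck"))     # True
print(matches_noindex_rule("http://example.com/?=ck&gs"))  # True
print(matches_noindex_rule("http://example.com/?page=2"))  # False
```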

not2easy

7:49 pm on Aug 9, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I used to think that the parameter settings were enough, but Google has indexed thousands of my URLs with /SEARCH/ in them, even though that parameter is clearly listed in the settings and has been for several years. I took to blocking them in robots.txt and the problem disappeared. GWT was showing me pages of 404s whenever a /SEARCH/ URL came back "Not Found" to the crawler. Duh. The pages do not exist until generated by a search.
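For reference, the robots.txt block described above would look something like this (a sketch based on the path mentioned in this post):

```
User-agent: *
Disallow: /SEARCH/
```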

seoholic

4:25 am on Aug 10, 2014 (gmt 0)

10+ Year Member



Does anyone have an explanation for Google's habit of saving ("URLs monitored") hundreds of thousands of URLs that shouldn't be crawled according to the parameter handling in WMT, and that are also blocked by robots.txt?

Over the years the number for one parameter climbed to over half a million, dropped to zero, and is now at 100k again. These URLs are clearly part of an infinite URL space, and saving them seems to be a waste of resources. This is not about crawling or indexing, but I really can't find any good explanation for it.