A bit of background:
The site has a number of products listed across a number of pages, with 10 products per page. The product listing can be sorted by four criteria, let's say A, B, C and D, each ascending or descending. Pagination used to be done in JavaScript, and pages past page 1 were never crawled by Google.
We asked for the JavaScript to be removed and for a NOINDEX, FOLLOW meta tag to be put on all pages with sort parameters in the URL.
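For clarity, that means something along these lines in the head of each such page (the standard meta robots tag):

<meta name="robots" content="noindex, follow" />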
We then noticed that "URLs restricted by robots.txt" in WMT started to grow, most of them being URLs with search parameters. We calculated that there could be anything up to 300,000 such URLs.
We then asked for the sort parameters not to be passed back to the server as part of the URL, and for a 301 redirect to be done for every requested page that has a sort parameter in the URL, pointing to the same base URL without the sort parameter. We also excluded URLs with the search parameter via robots.txt.
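For illustration, the redirect is roughly along these lines in the page's C# code-behind (a simplified sketch using the page and Sort parameter names from the example URLs further down, not the actual code - note that a plain Response.Redirect would send a 302, so the 301 status is set explicitly):

protected void Page_Load(object sender, EventArgs e)
{
    // If the request still carries a Sort parameter, permanently redirect
    // to the same URL without it (sketch only - error handling omitted)
    if (!String.IsNullOrEmpty(Request.QueryString["Sort"]))
    {
        string page = Request.QueryString["page"] ?? "1";
        string target = Request.Url.AbsolutePath + "?page=" + Server.UrlEncode(page);
        Response.Clear();
        Response.StatusCode = 301;
        Response.Status = "301 Moved Permanently";
        Response.AddHeader("Location", target);
        Response.End();
    }
}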
So now the sort parameters are posted back to the server using JavaScript. All the JavaScript does is call Form.submit with a reference to sort button A, B, C or D, depending on which one was clicked. The server side works out what should be shown and whether the order is ascending or descending, as it knows what it showed before (it keeps sessions). The site runs on IIS (not sure if this matters).
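Roughly, the server side does something like this (again a simplified C# sketch - the button wiring, session keys and BindProducts call are made-up names to show the idea, not our actual code):

protected void SortButton_Click(object sender, EventArgs e)
{
    // Hypothetical handler wired to sort buttons A, B, C and D
    string clicked = ((Button)sender).CommandArgument;   // "A", "B", "C" or "D"
    string previous = (string)Session["SortColumn"];
    bool ascending = (previous == clicked)
        ? !(bool)Session["SortAscending"]                 // same column clicked again: flip direction
        : true;                                           // new column: start ascending
    Session["SortColumn"] = clicked;
    Session["SortAscending"] = ascending;
    BindProducts(clicked, ascending);                     // re-bind the listing; the URL stays parameter-free
}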
However, even though it is now a week after the change, the number of URLs restricted by robots.txt reported in WMT is still growing.
I would have expected the number to slowly start to drop, as Google should not be able to find URLs with sort parameters any more - there is no reference to such URLs anywhere on the page.
I have also tested the redirect using Fiddler and it all seems fine, e.g.
www.example.com/Product.aspx?page=1&Sort=ABCD
redirects to
www.example.com/Product.aspx?page=1
and there is no reference to a URL with &Sort=ABCD anywhere on the page.
How can Google find pages with &Sort=ABCD when these URLs no longer exist on the site and the only way to reach them is to actually type them into the address bar? Or did Google perhaps already have all these URLs somewhere in its index, and now that it tries to check them, it hits robots.txt and reports this? Theoretically, someone could have linked to some of these URLs, but not to 10,000 of them!
Or do we just remove the sort parameter from robots.txt and let the 301 do the job over time, once Google realises that there are no more such URLs?
Or is it just a waiting game and eventually they will start to drop?
Using the URL removal tool would be a nightmare with such a high number of URLs.
Any advice?
Many thanks
remove the sort parameter from robots.txt and let the 301 do the job over time
I would definitely do that. One benefit will be that Google can actually request and crawl those sort URLs and get rid of them eventually - it may take many, many weeks or months to complete. Just be 100% sure that the HTTP status for the redirect is 301 in all cases.
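A quick way to verify is a HEAD request from the command line (assuming curl is available; the URL is just the example from the post above):

curl -I "http://www.example.com/Product.aspx?page=1&Sort=ABCD"

The first response line should read HTTP/1.1 301 Moved Permanently, with a Location header pointing at the parameter-free URL. A plain ASP.NET Response.Redirect returns a 302, which Google handles differently, so it is worth checking.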
How can Google find pages with &Sort=ABCD when these URLs no longer exist on the site
Google has mysterious ways when it comes to URL discovery. Those URLs may have been queued up for the future, to be crawled as your site's crawl budget allows. And yes, once Google gets a taste of a parameter like that, it may "invent" other values for it to see how your server handles them.