Last week I got the bright idea that I'd try to add rel canonical [en.wikipedia.org] support to my crawler/search engine (available at seekquarry.com).
I thought this would be relatively straightforward: as part of the usual parsing, check for a canonical link; if one is present, check whether the URL in the canonical link matches the current URL. If it does, index the page and use the extracted links. If it doesn't, treat it like a redirect: index the page, but don't use the links.
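The logic above can be sketched roughly as follows. This is just an illustration, not SeekQuarry's actual code; the return values are hypothetical stand-ins for whatever the crawler's indexing pipeline really does.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the href of <link rel="canonical" ...>, if any."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if attrs.get("rel", "").lower() == "canonical":
                self.canonical = attrs.get("href")

def handle_page(url, html):
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None or finder.canonical == url:
        # No canonical link, or it points at itself:
        # index the page and follow its extracted links.
        return ("index", "use_links")
    # Canonical points elsewhere: treat it like a redirect --
    # index the page but don't use its links.
    return ("index", "skip_links")
```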
So I am testing this and I am coming across sites like http://example.com/index.html (hover over the link to see the full link) whose rel canonical is http://example.com/index.htm, which gives a 404 error. I noticed this because it was happening for major sites (which I won't shame here), and I would check my index and wonder: why isn't http://example.com/ getting properly indexed?
Ideally, from the crawler's perspective, we want to ignore the rel canonical info in this case.
So this left me with the problem of how to implement rel canonical without too much overhead. When you are crawling in a distributed way, it can be a pain to do a lookup to see if one of the other crawler processes has already fetched the canonical page and, if so, whether the response was a 404. That suggests it would be better for the current crawler process to make an ancillary request for the canonical page to make sure it is okay. However, that approach might violate any Crawl-Delay value the site has set, and then the crawler process has to get into the scheduling business.
I still like the idea of using rel canonical to reduce the crawling needed to get a decent crawl of a web site. I am guessing Google does something super clever to implement this. What I decided to do was check the edit distance between the current URL and the canonical one; if it is less than three, I ignore the rel canonical info.
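A quick sketch of that heuristic, assuming plain Levenshtein distance over the URL strings (the threshold of three and the function names are just for illustration): near-identical URLs like index.html vs index.htm are probably webmaster typos, so the hint gets ignored.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def should_trust_canonical(current_url, canonical_url, threshold=3):
    # URLs within a couple of edits of each other are likely typos,
    # so only trust the canonical hint when they differ substantially.
    return edit_distance(current_url, canonical_url) >= threshold
```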
[edited by: goodroi at 5:15 pm (utc) on Jul 19, 2014]
[edit reason] fixed urls [/edit]