Need Help Finding Inbound Links to Old URLs

Forum Moderators: Robert Charlton & goodroi

Planet13

6:56 am on Aug 14, 2010 (gmt 0)

Hi there, Everyone:

This may be a stupid question, but...

I need help finding inbound links to my old (query-style) URLs.

My site has been around for ten years and has gone through two URL format changes (from query-string style to a "search-friendly" style about five years ago, and from search-friendly to SEO style last year).

I am sure there are sites out there that link to my old query string style URLs. How can I find those links (so I can do 301 redirects)?

The old query style URLs have been blocked in my robots.txt file to avoid duplicate content.

Any suggestions on how to do this would be great.

phranque

10:36 am on Aug 14, 2010 (gmt 0)

you can find referred requests or crawler requests that get a 404 response by looking through a significant sample of your server access log files.
you can find urls that are excluded by robots.txt in your GWT dashboard in the Crawl errors area.
not all of these will be 404 candidates but you should be able to recognize the urls.
there are several link databases and link analysis tools, both free and paid, that will show urls that have been discovered.
some will tell you which are 404 and some won't but you can always check them yourself and again you're looking for a certain recognizable type of url.
note that some will also report links that have been discovered in the past but no longer exist.
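
A minimal sketch of the access-log approach described above, assuming the logs are in Apache combined format. The function name `referred_404s`, the host `www.example.com`, and the rule "a legacy URL is one containing a `?`" are all hypothetical placeholders, not anything from the original site. It only collects requests that carry an external referrer; crawler requests with no referrer are skipped.

```python
import re
from collections import defaultdict

# Matches the request, status, size and referrer fields of a
# combined-format access log line (format is an assumption).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)

def referred_404s(lines, own_host="www.example.com"):
    """Map each 404'd query-style URL to the external pages that referred to it."""
    hits = defaultdict(set)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("status") != "404":
            continue
        url = match.group("url")
        referrer = match.group("referrer")
        if "?" not in url:
            continue  # not a query-style (legacy) URL
        if referrer in ("", "-") or own_host in referrer:
            continue  # no referrer, or an internal link
        hits[url].add(referrer)
    return dict(hits)
```

Run over a large enough sample of log lines, the result is a list of old URLs that still receive clicks from other sites, which are the natural candidates for 301 redirects.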

Planet13

6:21 pm on Aug 14, 2010 (gmt 0)

Thank You, Phranque:

you can find urls that are excluded by robots.txt in your GWT dashboard in the Crawl errors area.

I see there are about 150 listed there. I doubt, though, that there are 150 inbound links pointing to those old query-string URLs.

I think that Google might have seen some internal links from several years back and indexed them, and because I wasn't able to 301 or 404 them, Google still keeps trying to crawl them (even with the disallow in robots.txt).

They DO have a canonical link tag with the new SEO URL, so I don't know why Google would still be trying to crawl them if it is just following OLD internal links.

I wonder if I should stop blocking them with the robots.txt? I am worried about having duplicate URLs (even though there are canonical link tags). And I don't really want to bog down my .htaccess file with a lot of 301 redirects if I don't have to.

phranque

10:56 pm on Aug 14, 2010 (gmt 0)

just because they are disallowed by robots.txt doesn't mean google won't index the urls or continue trying to request the urls.
also, if you have the urls excluded by robots.txt, the canonical link tags will never be seen by the SE's.
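
A small illustration of that point, using Python's standard `urllib.robotparser`. The robots.txt rule and the URLs are made up for the example. A crawler that obeys the disallow never fetches the page, so it never gets to read a canonical link tag inside it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking the old query-style script.
ROBOTS_TXT = """\
User-agent: *
Disallow: /product.php
"""

def is_blocked(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if a compliant crawler may not fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)

# The old URL is blocked, so its HTML (and canonical tag) is never read.
old_blocked = is_blocked("http://www.example.com/product.php?id=12")
# The new-style URL is not covered by the rule and can be crawled.
new_blocked = is_blocked("http://www.example.com/products/12/")
```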

legacy urls should generally get a 404/410 or 301 status code response.
also note that if there is a discernible pattern to the url transformation from old to new you can use regular expressions and fewer RewriteRules.
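
As a rough sketch of the pattern idea, here is one regular expression covering a whole family of old URLs instead of one rule per URL. The old format `/product.php?id=N` and the new format `/products/N/` are invented for illustration; the real formats on the site are not given in this thread.

```python
import re

# Hypothetical old format: /product.php?id=<digits>
OLD_PATTERN = re.compile(r"^/product\.php\?id=(\d+)$")

def redirect_target(old_url):
    """Return the new-style URL for an old query-style URL, or None if no match."""
    match = OLD_PATTERN.match(old_url)
    if match is None:
        return None
    # Hypothetical new format: /products/<id>/
    return f"/products/{match.group(1)}/"
```

Testing the pattern this way against URLs pulled from the logs is a cheap check before writing the rule for the server. Note that in Apache mod_rewrite the query string is not part of what the RewriteRule pattern matches; it is tested with a RewriteCond against %{QUERY_STRING}.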

g1smd

11:25 pm on Aug 14, 2010 (gmt 0)

doesn't mean Google won't index the URLs

Just to clarify: in this context "index" simply means "record the fact that these URLs exist" rather than "index the content on those pages". In this case, the URLs will appear as URL-only entries in the SERPs.

Planet13

6:38 am on Aug 15, 2010 (gmt 0)

also, if you have the urls excluded by robots.txt, the canonical link tags will never be seen by the SE's.

Doh! Guess I forgot about that. Get me another doughnut, Marge.

phranque

10:19 am on Aug 16, 2010 (gmt 0)

yes to what g1smd said, although Google may use other means besides crawling and indexing your content on that page to discover a suitable title and snippet as observed in this thread [webmasterworld.com].