g1smd - 10:57 pm on Jun 18, 2012 (gmt 0)
I'm trying to figure out something vaguely similar on a CMS-driven site (not WordPress, though).
The site has a couple of thousand products. In the redesign, the URL scheme changed from multiple parameters (product ID, category details and parent-category details) to a simple extensionless URL containing just the product ID and product name.
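Roughly this sort of change, with made-up URLs (the real parameter names and slug format differ):

Old:  /product.php?id=1234&cat=22&parent=5
New:  /1234/blue-widget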
In the old URLs only the ID parameter decided which content was displayed; the other parameter values went mostly unchecked (altering maybe one or two words on the page or in the metadata), which led to infinite duplicate content on the old site. The other parameters could contain random values and the page would still display. Old URLs now redirect to the new format via a rewrite to a special PHP script hooked up to the database, which checks whether each old product ID is still valid and where it should redirect to. Removed products now return 410. Invalid IDs now return 404.
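The logic in that script is roughly this (only a sketch: the table, column and parameter names here are invented, and the real script does more):

<?php
// redirect.php - old product URLs are rewritten to this script
$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

if ($id <= 0) {
    http_response_code(404);            // no usable product ID at all
    exit;
}

// Hypothetical mapping table: old product ID -> new URL, or flagged as removed
$pdo  = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$stmt = $pdo->prepare('SELECT new_url, removed FROM product_redirects WHERE old_id = ?');
$stmt->execute([$id]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row === false) {
    http_response_code(404);            // ID never existed
} elseif ($row['removed']) {
    http_response_code(410);            // product deleted, gone for good
} else {
    header('Location: ' . $row['new_url'], true, 301);   // one canonical new URL per product
}
exit;

Whatever junk parameters are tacked onto the old URL simply get ignored, so every old variant collapses to the same 301 target.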
WMT reports and analytics data from the last few years indicated a number of duplicate-content URLs for each product, in a variety of formats. These all now redirect.
It's become apparent from the logs since the new site launched that each product has 40 or 50 or more (sometimes a LOT more) old-format URLs that now all redirect to a single new URL. After a few weeks of Googlebot crawling the new site and chewing through a massive number of redirects, crawling suddenly flatlined to a couple of hundred URLs per day. Google still requests large numbers of redirected URLs, hits the odd old URL that returns 404 or 410, and fetches just a few of the new URLs.
I'm wondering if the large number of redirects has triggered some sort of crawl budget limiter. I have not seen this behaviour before, but then again I've never seen a site with quite such a mess of a (previous) URL structure as this one.