We have developed a plugin that allows us to display used vehicle listings from a centralized, third-party database. The functionality works similarly to autotrader.com or cargurus.com, and there are two primary components:
1. Vehicle Listings Pages: this is the page where the user can use various filters to narrow the vehicle listings to find the vehicle they want.
2. Vehicle Details Pages: this is the page where the user actually views the details about said vehicle. It is served up via Ajax, within the Vehicle Listings Pages.
We do want the Vehicle Listings pages (#1) indexed and ranked. These pages have additional content besides the vehicle listings themselves, and those results are randomized or sliced/diced in different and unique ways. They're also updated twice per day. Google seems to like these pages and has already begun ranking them well.
We do not want to index #2, as these pages appear and disappear all of the time, based on dealer inventory, and don't have much value in the SERPs. Additionally, other sites such as autotrader.com, Yahoo Autos, and others draw from this same database, so we're worried about duplicate content. We did not originally think that Google would even be able to index these pages, as they are served up via Ajax. However, it seems we were wrong, as Google has already begun indexing them. Not only is duplicate content an issue, but these pages are not meant for visitors to navigate to directly! If a user were to navigate to the URL directly from the SERPs, they would see a page that isn't styled right.
Now we have to determine the right solution: robots.txt or a noindex meta tag for these pages. Below is my analysis so far of the pros and cons of each.
Robots.txt
Advantages:
Super easy to implement
Conserves crawl budget for large sites
Ensures crawler doesn't get stuck. After all, if our website only has 500 pages that we really want indexed and ranked, and vehicle details pages constitute another 1,000,000,000 pages, it doesn't seem to make sense to make Googlebot crawl all of those pages.
Disadvantages:
Doesn't prevent pages from being indexed, as we've seen, probably because there are internal links to these pages. We could nofollow these internal links, thereby minimizing indexation, but this would lead to 10-25 nofollowed internal links on each Vehicle Listings page (bad for SEO?)
Noindex
Advantages:
Does prevent vehicle details pages from being indexed
Allows ALL pages to be crawled (advantage?)
Disadvantages:
Difficult to implement (vehicle details pages are served using Ajax, so they have no <head> tag. The solution would have to involve the X-Robots-Tag HTTP header and Apache, sending a noindex directive based on querystring variables.)
Crawler could get stuck/lost in so many pages
Cannot be used in conjunction with robots.txt. After all, the crawler never reads the noindex directive if the page is blocked by robots.txt
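To make the X-Robots-Tag idea concrete, here is a minimal sketch of the decision logic in Python, not the Apache configuration itself. It assumes vehicle details requests can be recognized by a querystring parameter; the name `vehicle_id` and the example URLs are hypothetical, since the plugin's real parameters aren't stated here.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical querystring parameter that marks a vehicle details request.
# The plugin's real parameter name is an assumption for this sketch.
DETAILS_PARAM = "vehicle_id"


def extra_headers(url: str) -> dict:
    """Return the extra response headers to send for the requested URL."""
    query = parse_qs(urlsplit(url).query)
    if DETAILS_PARAM in query:
        # The noindex directive travels in an HTTP header, so it works even
        # for Ajax fragments that have no <head> tag. The crawler must still
        # be allowed to fetch the URL in order to see it.
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    print(extra_headers("https://example.com/used-cars?vehicle_id=123"))
    print(extra_headers("https://example.com/used-cars?make=ford"))
```

The same rule would need to be expressed in whatever layer actually serves the Ajax responses (Apache, or the plugin itself); the point is only that the header is chosen per request from the querystring.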
Initially, we implemented robots.txt. We figured that we'd have a happier crawler this way, as it wouldn't have to crawl zillions of thin/partially duplicate vehicle details pages, and we wanted it to be like these pages didn't even exist. However, Google seems to be indexing at least a handful of these pages, and we don't know whether robots-disallowed content is excluded from Panda/dupe content filters (if it is, we don't really have a problem, but I doubt this). We could nofollow the links pointing to these pages, but we don't know if this would be problematic in terms of on-page SEO. If it isn't, this might be the best solution.
If we implement noindex on these pages (and doing so is a difficult task in itself), then we will be certain these pages aren't indexed. However, to do so we will have to remove the robots.txt disallow rule, in order to let the crawler read the noindex directive on these pages. Intuitively, it doesn't make sense to me to make Googlebot crawl zillions of vehicle details pages, all of which are noindexed, and it could easily get stuck/lost/etc. It seems like a waste of resources, and in some shadowy way bad for SEO.
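The conflict between the two approaches can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the `/vehicle-details/` path and the example URLs are made up for illustration and are not our real URL structure.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real disallow pattern is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /vehicle-details/
"""


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A disallowed URL is never fetched, so any noindex directive on it
    # (meta tag or X-Robots-Tag header) is never seen by the crawler.
    print(can_crawl(ROBOTS_TXT, "https://example.com/vehicle-details/123"))
    print(can_crawl(ROBOTS_TXT, "https://example.com/used-cars/"))
```

A URL for which `can_crawl` returns False is one where a noindex directive cannot take effect, which is why the disallow rule has to come out before noindex can work.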
Any thoughts or advice you guys have would be hugely appreciated, as I've been going in circles on this for a couple of days now.