Optimizing site for crawling

I am creating a new dynamic website with a very simple UX. There is a search form with one select menu and a text input; you fill in the form, press submit, and a single matched result is returned. The pages are fully rendered server side, so no AJAX or JS is involved in delivering content.

The problem, as I have faced before, is that all the content of the website is hidden behind this form. Googlebot is unlikely to go beyond the form, so nothing gets crawled. Since I have encountered this before, I have learned my lesson: I will add "static" links to a subset of the pages so that the bot can find the content. I will also create a sitemap, but from experience I doubt that a sitemap alone will be enough to get Google to crawl the site, especially considering the size of the site.

So the question is how to design and lay out the links so that Googlebot (or other search engine bots) can crawl efficiently. Let me add that there is a lot of content: the db contains hundreds of millions of unique records, and each record can be a result page. So simply placing a static link to every page in some sort of tree structure of static pages is not feasible. Even if this were done, there would be far too many links for Google to crawl efficiently, and all those pages would be low quality and utterly useless to users.

So then what? My idea is to break the data down into groups based on two levels of facets (e.g. Level 1 -> T-Shirts, Pants, Dresses; Level 2 -> Black, Blue, Green), then for each subcategory pull random samples from the db.

Level 1 - The home page. It has links to the top-level facet pages (50 links).
Level 2 - Each facet page has up to 10 tables, one per second-level facet value. The values are selected randomly, since a second-level facet can have more than ten different values (e.g. more than 10 colors). Each table has ten randomly selected links to result pages with those facets (100 links total).
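
A minimal sketch of the Level 2 selection described above, written against an in-memory dict rather than the real db. The function name `pick_facet_page_links` and the `records_by_facet2` mapping (second-level facet value -> list of record ids) are assumptions made for illustration only.

```python
import random


def pick_facet_page_links(records_by_facet2, max_tables=10, links_per_table=10, rng=random):
    """Pick the tables and links for one facet page.

    records_by_facet2: hypothetical mapping of second-level facet value
    (e.g. a color) to the list of record ids that carry it.
    Returns {facet value: [record ids]} with at most max_tables keys and
    at most links_per_table ids per key.
    """
    facet_values = list(records_by_facet2)
    # Choose up to max_tables second-level facet values at random.
    chosen = rng.sample(facet_values, min(max_tables, len(facet_values)))
    page = {}
    for value in chosen:
        ids = records_by_facet2[value]
        # Choose up to links_per_table records at random within that value.
        page[value] = rng.sample(ids, min(links_per_table, len(ids)))
    return page
```

With the defaults this yields at most 10 x 10 = 100 links per facet page, and a facet value with fewer than ten records simply gets a shorter table.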

On each page request the facet pages change: the random selection is re-run and the links change. On some of the larger collections, generating a facet page can take several seconds, as the 100 links need to be pulled from the db. How much of a problem is a slow load time for Googlebot at the top level?

On each result page I include up to ten links to other pages with similar characteristics. But given the diversity of the data, at times there are no similar items (as defined by my similarity algo). This creates a dead end for the bot, since the page has no onward links. Need I worry about such a dead end?

I was thinking of adding, in addition to (or in the absence of) the similar links, a link to a random location in the data. This will prevent dead ends but will likely teleport the bot to some location that is unrelated to its source. The other obvious solution is to broaden the similarity criteria, but this slows the db lookup and therefore the page load speed on these critical landing pages. I could also narrow the random selection to a record within the same two levels of facets, but again such a search in the db is expensive.
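
The random-link fallback could look roughly like this. It is only a sketch: `outbound_links` and the `random_id` callable (standing in for whatever db lookup returns one random record id) are hypothetical names, not part of the existing site.

```python
def outbound_links(similar_ids, random_id, max_links=10):
    """Build the link list for one result page.

    similar_ids: ids returned by the similarity lookup (may be empty).
    random_id: hypothetical callable returning one random record id.
    Keeps up to max_links similar links and always appends one random
    link, so the page has an onward link even with no similar items.
    """
    links = list(similar_ids[:max_links])
    extra = random_id()
    # Skip the random link if it happens to duplicate a similar one.
    if extra not in links:
        links.append(extra)
    return links
```

When the similarity lookup comes back empty, the page still carries exactly one link, which is the "teleport" behavior described above.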

From a technical perspective, I am using MongoDB and Python/pymongo. All my data is in a single collection, and I am using the $sample stage to pull the random records. But to pull the records based on two levels of facets, $sample has to be the second stage of the aggregation pipeline, with $match being the first, so it uses a COLLSCAN instead of the indexes. As a result it is terribly slow on large collections. I could change my schema and divide the data into separate collections based on the top-level facet, drop the second-level facet, and then do $sample as the first stage of the aggregation pipeline, which would greatly speed things up. But this would mean that my page would display 100 random links as opposed to 10 times 10 random but related links.
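
For reference, the two pipeline shapes being compared can be sketched as below. The field names `facet1` and `facet2` and both function names are assumptions, since the real schema is not shown. As far as I understand the MongoDB documentation, $sample only takes its fast pseudo-random cursor path when it is the first stage, the requested size is under 5% of the collection, and the collection has more than 100 documents; otherwise it reads every document that reaches it and does a random sort, which matches the slowdown described above.

```python
def related_sample_pipeline(facet1, facet2, size=10):
    """Current approach: filter by both facets, then sample.

    'facet1' and 'facet2' are hypothetical field names. Because $sample
    is not the first stage here, it has to process every document that
    $match lets through.
    """
    return [
        {"$match": {"facet1": facet1, "facet2": facet2}},
        {"$sample": {"size": size}},
    ]


def top_level_sample_pipeline(size=100):
    """Alternative: one collection per top-level facet, $sample first."""
    return [{"$sample": {"size": size}}]


# With pymongo, either pipeline would be run along these lines:
#   links = list(collection.aggregate(related_sample_pipeline("T-Shirts", "Black")))
```

The first pipeline is run once per table (ten times per facet page), the second once per page, which is where the trade-off between 10 x 10 related links and 100 unrelated links comes from.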

Any ideas?

NickMNS
