Optimizing site for crawling

NickMNS

2:11 pm on Jun 22, 2017 (gmt 0)


I am creating a new dynamic website with a very simple UX. There is a search form with one select menu and a text input; the user fills in the form, presses submit, and a single matched result is returned. The pages are fully rendered server side, so no AJAX or JS is involved in delivering content.

The problem, which I have faced before, is that all the content of the website is hidden behind this form. Googlebot is unlikely to go beyond the form, so nothing gets crawled. Having encountered this before, I have learned my lesson: I will add "static" links to a subset of the pages so that the bot can find the content. I will also create a sitemap, but from experience I doubt that a sitemap alone will be enough to get Google to crawl the site, especially considering the size of the site.
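For a site of this size, the sitemap itself has to be split up: the sitemaps.org protocol caps a single sitemap file at 50,000 URLs, with a sitemap index file pointing at the individual files. A minimal sketch of that chunking follows; the function names and the example.com URLs are made up for illustration and are not part of the site described above.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit in the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive slices of `urls`, each at most `size` long."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def sitemap_xml(urls):
    """Build one sitemap file body for a list of page URLs."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'
    )


def sitemap_index_xml(sitemap_urls):
    """Build the index file that lists the individual sitemap files."""
    entries = "".join(
        f"<sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<sitemapindex xmlns="{SITEMAP_NS}">{entries}</sitemapindex>'
    )
```

In practice the URL list would be streamed from the db rather than held in memory, and only the subset of pages worth crawling would be listed.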

So the question is how to design and lay out the links so that Googlebot (or other SE bots) can crawl efficiently. Let me add that there is a lot of content: the db contains hundreds of millions of unique records and each record can be a results page. So simply placing a static link to every page in some sort of tree structure of static pages is not feasible. Even if this were done, there would be far too many links for Google to crawl efficiently, and all those pages would be low quality and utterly useless to users.

So then what? My idea is to break the data down into groups based on two levels of facets (e.g. Level 1 -> T-Shirts, Pants, Dresses; Level 2 -> Black, Blue, Green), then for each subcategory pull random samples from the db.

Level 1 -- The home page. It has links to the top-level facet pages (50 links).
Level 2 -- Each facet page has up to 10 tables, one per second-level facet. The facets shown are selected randomly, as the second-level facet can have more than ten different values (e.g. more than 10 colors), and each table has ten randomly selected links to results pages with those facets (100 links total).

On each page request the facet pages change: the random selection re-occurs and the links change. On some of the larger collections, generating a facet page can take several seconds, as the 100 links need to be pulled from the db. How much of a problem is a slow load time for Googlebot at the top level?
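To make the facet page layout concrete, here is a rough sketch of the selection logic in plain Python. The `records` list is a hypothetical in-memory stand-in for the database (pairs of second-level facet and URL for one top-level facet), and `build_facet_page` is an illustrative name, not code from the actual site.

```python
import random


def build_facet_page(records, max_tables=10, links_per_table=10, rng=None):
    """Pick up to `max_tables` second-level facets at random, then up to
    `links_per_table` random result links for each chosen facet.

    `records` is a list of (level2_facet, url) pairs for ONE top-level facet.
    Returns a dict mapping each chosen facet to its list of links.
    """
    if rng is None:
        rng = random.Random()

    # Group the URLs by second-level facet.
    by_facet = {}
    for level2, url in records:
        by_facet.setdefault(level2, []).append(url)

    # Randomly choose which second-level facets get a table on this request.
    chosen = rng.sample(sorted(by_facet), min(max_tables, len(by_facet)))

    # Randomly choose the links shown in each table.
    return {
        facet: rng.sample(by_facet[facet],
                          min(links_per_table, len(by_facet[facet])))
        for facet in chosen
    }
```

Because the selection is re-run on every request, each crawl of the same facet page exposes a different set of up to 100 result pages.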

Now, on the result page I include up to ten links to other pages with similar characteristics. But given the diversity of the data, at times there are no similar items (as defined by my similarity algo). This creates a spider trap or dead end for the bot. Need I worry about such a dead end?

I was thinking of adding, in addition to (or in the absence of) the similar links, a random link to a random location in the data. This will prevent dead ends but will likely teleport the bot to some location that is unrelated to its source. The other obvious solution is to broaden the similarity criteria, but this slows the db lookup and therefore the page load speed on these critical landing pages. I could also narrow the random selection to records that share the same two levels of facets, but again such a search in the db is expensive.
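The "similar links plus one random link" idea can be sketched as below. `pick_random_url` is a hypothetical callable that returns one random result URL (on the real site it would presumably be backed by a db lookup); the function name and signature are illustrative only.

```python
def outgoing_links(similar_urls, pick_random_url, max_similar=10):
    """Return up to `max_similar` similar links plus one random link.

    The random link is added whether or not similar links exist, so a
    result page with no similar items still has one outgoing link.
    """
    links = list(similar_urls[:max_similar])
    extra = pick_random_url()
    # Avoid listing the same URL twice if the random pick is already present.
    if extra not in links:
        links.append(extra)
    return links
```

For example, `outgoing_links([], lambda: "/result/123")` yields a single link, so the page is never a dead end even when the similarity lookup finds nothing.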

From a technical perspective, I am using MongoDB and Python/pymongo. All my data is in a single collection, and I am using the $sample operator to pull the random records. But to pull the records based on two levels of facets, $sample occurs in the second stage of the aggregation pipeline, $match being the first, so it uses a COLLSCAN instead of the indexes. As a result it is terribly slow on large collections. I could change my schema and divide the data into separate collections based on the top-level facet, drop the second-level facet, and then do $sample at the top level of the aggregation pipeline, which would greatly speed things up. But this would mean that my page would display 100 random links as opposed to 10 times 10 random but related links.
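For reference, the two-stage pipeline described above would look roughly like this. The sketch only builds the pipeline as plain Python dicts; the field names `facet1` and `facet2` are assumptions, since the actual schema is not given.

```python
def facet_sample_pipeline(level1, level2, size=10):
    """Build a $match -> $sample aggregation pipeline.

    `facet1` and `facet2` are hypothetical field names for the two
    facet levels; substitute the real ones from the collection.
    """
    return [
        # Stage 1: narrow the input to one (level 1, level 2) facet pair.
        {"$match": {"facet1": level1, "facet2": level2}},
        # Stage 2: draw `size` random documents from what stage 1 passed on.
        {"$sample": {"size": size}},
    ]


# With pymongo, the pipeline would be run as:
#   cursor = collection.aggregate(facet_sample_pipeline("T-Shirts", "Black"))
# A compound index on the two facet fields (an assumption, not something
# mentioned in the post) could be created with:
#   collection.create_index([("facet1", 1), ("facet2", 1)])
```

Whether the $match stage actually uses an index can be checked with the query planner's explain output; the sketch makes no claim about how the planner will behave on this particular data.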

Any ideas?

not2easy

4:47 pm on Jun 22, 2017 (gmt 0)


This probably isn't the answer you need, but you asked for "any" ideas... Google seems to love crawling and indexing search results pages; I have been trying everything I can think of in "Parameters" to make them stop. I don't want those pages because they show the same thing as using Category links or Brand links. Maybe going the other way could help? Adding Parameters in GSC is an option, and you choose whether they should be followed or not. Of course mine are not using Python but PHP/JS.

NickMNS

10:55 pm on Jun 22, 2017 (gmt 0)


not2easy wrote: "Google seems to love crawling and indexing search results pages"

That is interesting, but it doesn't reflect my experience. I think that is because (pure speculation here) Google may have two crawl modes: "discovery", to find new content, and "feed me more", for sites that they know and trust and that already rank well but for which they would like to index more content.

Discovery mode applies to new sites and sites with few trust signals. They crawl the site to discover the content, but it may never rank, so the vigor with which the site is crawled is limited. This is my case, so I feel that I need to spoon-feed the bot just to be sure it will crawl as much content as possible.

In the "Feed me more" mode, the bot will try anything and everything to find pages because it assumes that any content will rank well and draw in users. In this case the bot may actually fill in forms or parameters.

Regardless, at this stage in a project I need to assume that the bot is in "Discovery" mode, and if not then great.

Back to the technicals...
I played around a bit with the code and determined that what I mentioned above is wrong. $sample will do a COLLSCAN regardless of where it is in the aggregation pipeline. So the only way to speed things up is to limit the items it scans by refining the search in the first stage of the pipeline. This is what I have done, and it speeds things up a little bit. So how crucial is page load time for the first page in the crawl chain? Again, the first page with all the links will be slow to load, but each subsequent link will load in a reasonable time. Do I need to stress about this?