|Is Googlebot indexing content loaded via JS lazy load?|
I'm developing a mobile version of a site that will most likely use responsive design and become the only version of the site. As a UI feature I find lazy loading very convenient. Just in case that's not the commonly accepted technical term: you load only part of the page, and when you scroll close to the page's end, JS loads another piece of content via AJAX and stitches it to the bottom of the page. It could be automatic (hence "lazy") or triggered by pressing a "Load More" button - the Googlebot accessibility implications would be the same, I presume.
I have some monstrously long pages and breaking them into several separately loading chunks makes them much quicker to load and just basically more usable. But the issue of course is how to make sure Googlebot reads the page in its entirety and not just the tiny piece in the beginning.
I'm guessing that Googlebot might try to run the JS and perhaps even get the content in small chunks, although I am not too sure about that either - my AJAX requests are POST. They could be GET, but I don't want them to have their own URLs, so that chunks of a whole page cannot be indexed on their own. Besides, without knowing the framework of the page as a whole, how would it know how to stitch the pieces together?
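To make the mechanics concrete, here is a rough sketch of the pattern described above (the /load-chunk endpoint, the container id, and all names are my own assumptions, not anything from an actual site):

```javascript
// A pure helper decides when the visitor has scrolled close enough to the
// bottom of the page to warrant fetching the next chunk.
function shouldLoadMore(scrollTop, viewportHeight, documentHeight, threshold) {
  return scrollTop + viewportHeight >= documentHeight - threshold;
}

// Browser-only wiring: POST for the next chunk and stitch it onto the page end.
if (typeof window !== 'undefined') {
  let nextChunk = 1;
  window.addEventListener('scroll', function () {
    if (shouldLoadMore(window.scrollY, window.innerHeight,
                       document.body.scrollHeight, 200)) {
      const xhr = new XMLHttpRequest();
      xhr.open('POST', '/load-chunk');              // hypothetical endpoint
      xhr.setRequestHeader('Content-Type',
                           'application/x-www-form-urlencoded');
      xhr.onload = function () {
        // Append the returned HTML to an assumed #content container.
        document.getElementById('content')
                .insertAdjacentHTML('beforeend', xhr.responseText);
        nextChunk++;
      };
      xhr.send('chunk=' + nextChunk);
    }
  });
}
```

Because the chunk arrives via POST, it has no crawlable URL of its own, which is exactly the indexing question raised here.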
Some of the particular issues in my mind are:
- cloaking concerns in case a page gets specifically crafted for Googlebot
- Should I allow robots.txt access to the script that AJAX posts to?
- Should I use special markup that might make it easier for Googlebot to stitch parts together?
- If Gbot gets content in chunks, what's going to happen to my crawl budget? Am I going to end up with a dozen times fewer pages indexed because each may require a dozen visits to collect all the chunks?
- I may be missing a few more issues, perhaps obvious ones, please clue me in
I know I am embarrassingly late to the (mobile) party and many people have dealt with this already. Can someone share advice on the proper way to have a lazy-loading page indexed by Google and Bing in its entirety? Well, G and B, as well as anyone who really-really wants to read long pages.
There are two separate aspects here:
1) Will you have a full "desktop" version of the site alongside the mobile site, and if so, will the structure of the mobile site (apart from how the page is loaded) be the same as the desktop version's?
Since there will be no problem with Google indexing if you have two sites and follow Google best practices, let's have this discussion focus on whether (and how) Google can index the full content of a page that has part of its content loaded via lazy load JS.
This question pertains specifically to the situation where there is no separate mobile version of the site. Otherwise I would simply insert rel="canonical" pointing to the full non-mobile version into the header and be done with it. However, there are several considerations that make me want to do one-version-fits-all instead of an m.example.com type of "shadow" site:
- Despite Google publishing guidelines for this type of situation, I can never be sure what they say is the same as what they do and I would still be concerned about any possible massive intra-site duplication.
- I don't want to support two distinctly separate sites - design, updates, security, two completely different frameworks etc., etc.
- The desktop version may use some design update anyhow. Besides, I find lazy loading a convenient feature on desktop as well.
- I don't know if a reliable way to identify mobile users exists. I would always be wondering whether I was sending a certain percentage of mobile users to desktop and vice versa.
So, I would really like to stay with one responsive layout site for both mobile and desktop users.
Thanks for the explanation, 1script. This means that your question is not really focused on a mobile site and JS lazy load; instead, the question is valid for both desktop and mobile:
Is the content loaded via JS lazy load attributed to the parent page, and if so, how?
You say you would like to keep the same URL; this would therefore exclude having separate hash bang (#!) fragments, as they would be treated as separate URLs from a crawling and indexing point of view.
I am not sure myself, and I am speculating here, but perhaps including a <noscript> element that contains the whole page content would do the trick?
Unfortunately, since this is built on jQuery Mobile, using hashes anywhere in the URI is not a good idea because it may confuse JQM's internal navigation within the DOM. But regardless, hash bang or not, the link to the next chunk of content is not present until someone interacts with the page in some way (scrolls or clicks "Load More").
Putting the entire content of the page into <noscript> is out of the question - it would defeat the purpose of splitting the very long content into chunks. Mobile users would end up having to download the entire page yet wouldn't be able to see it until they download it again - the opposite of bandwidth reduction.
Is it possible to know which JS Google runs when they arrive on the page? Everything that is in document.ready but nothing that requires an interaction (presumably)? How about external libraries like JQ and JQM - do they load those, too?
I was going to say "everybody uses it" but then I realized that all my lazy load examples - FB, Twitter, Flickr, eBay - are almost all gated communities. And even when they aren't exclusively gated (eBay, Flickr), they are probably not in the least bit concerned about Google's crawlability.
So, is this a show-stopper?
I think that what you are trying to achieve is what hash bang is for. However, Google instructions [developers.google.com...] also have a section "Handle pages without hash fragments"
|In order to make pages without hash fragments crawlable, you include a special meta tag in the head of the HTML of your page. The meta tag takes the following form: |
<meta name="fragment" content="!">
I am not an expert here, but from what I can understand, in the above case Googlebot will request the page with ?_escaped_fragment_= appended to the URL.
This would mean that the full big page would only go to Googlebot, and the users (mobile phone and desktop) would initially get only the part of the page rendered before the lazy-load JS is triggered.
So in essence, if the pageA has lazy load partB, then:
Browsers, that is, mobile/desktop users
They request www.example.com/pageA and, upon hitting the bottom of the page, the lazy load JS requests partB and appends it to pageA.
Provided there is <meta name="fragment" content="!"> on pageA, Googlebot will request www.example.com/pageA?_escaped_fragment_=
The script generating pageA should look for the parameter _escaped_fragment_ and, if it exists, generate HTML that contains pageA + partB and send this back to Googlebot.
Or, if pageA has more parts, then server-side you would generate ALL parts and pass them back in the HTML to Googlebot when a URL with this parameter is received.
At least this is my understanding of the method, and I believe it may give you exactly what you want.
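In other words, the dispatch described above could be sketched like this (the renderPage helper and the parts array are assumptions for illustration; only the _escaped_fragment_ parameter itself comes from Google's scheme):

```javascript
// Decide what HTML to return for pageA, given the raw query string and the
// page's content parts (pageA's own chunk first, then partB, partC, ...).
// (Names and the /lazy-load.js script reference are placeholders.)
function renderPage(queryString, parts) {
  // Googlebot signals itself by adding the _escaped_fragment_ parameter.
  const isBot = /(^|&)_escaped_fragment_=/.test(queryString || '');
  if (isBot) {
    // Bot: return the fully assembled page in one response.
    return parts.join('\n');
  }
  // Regular visitor: first chunk only, plus the lazy-load trigger script.
  return parts[0] + '\n<script src="/lazy-load.js"></script>';
}
```

The key point is that the same URL serves both audiences; only the presence of the parameter (added by the bot itself) selects the assembled version.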
Thanks again, aakk9999: I will look into _escaped_fragment_
Hope it's a reliable enough way for G to read it. It sounds like it will probably cut my crawl budget in half: first get the header of the page, recognize there's a meta fragment tag in it, then come back again and read the page with ?_escaped_fragment_= added to the URI. Ah well, two is better than the 20 or more that some of the pages might have required if Gbot had to return for each chunk individually.
I've looked into this <meta name="fragment" content="!"> thing and it sounds as if it opens up a whole slew of cloaking issues. I mean, Google is practically inviting you to cloak your page for Gbot. So, where do they draw the line?
For example, I'm thinking of this scenario: since the new mobile-friendly version of the site is going to be laid out completely differently, traffic can probably be expected to go down (a lot) until they re-crawl, re-index etc. So, would it be considered fair and within the ToS to simply keep serving Gbot the old version of the site on ALL of the site's URLs with ?_escaped_fragment_= added to them? To kind of soften the transition period from old to new? Too much cloaking? How about if I remove the ads (and why would you serve ads to a bot in the first place?) ...
Does anyone know if this is a convention Bing also follows?
I've been looking into this recently too, and the top result (that I see anyway) for 'infinite scroll' has a lot of good information and seems to indicate that it can be SEO-friendly.
|I mean, Google is practically inviting you to cloak your page for Gbot. |
With <meta name="fragment" content="!"> you are only inviting Google to see the HTML the way your user would see it if they scrolled down the page. Since Googlebot has to get the page in one request (rather than multiple requests, as a user does), you are just giving Gbot everything in advance.
Care must be taken if you use GET rather than POST, because a GET URL can be indexed as a separate URL. I am not really sure what the recommendation is here: whether in this case the GET URL has to be blocked by robots.txt, or robots noindex used, or rel canonical added? Or should POST always be used when requesting the further bits of the page?
Otherwise - that is, if GET is used to grab the additional page content - Googlebot will see one complete page (where your server has returned all the "bits" together), but parts of this content will also exist on separate URLs, which can mean duplicate content.
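One belt-and-braces way to keep such a chunk endpoint out of the index, as a sketch (function and endpoint names are mine; X-Robots-Tag: noindex is a standard header search engines respect):

```javascript
// Hypothetical guard for the AJAX chunk endpoint: reject anything but POST
// so chunks can't be fetched (and indexed) as standalone URLs, and mark any
// response as noindex via the X-Robots-Tag header just in case.
function chunkEndpointResponse(method) {
  if (method !== 'POST') {
    return { status: 405, headers: { 'Allow': 'POST' }, body: '' };
  }
  return {
    status: 200,
    headers: { 'X-Robots-Tag': 'noindex', 'Content-Type': 'text/html' },
    body: '<div class="chunk">...next chunk of content...</div>'
  };
}
```

A GET to the endpoint then simply returns 405, so there is no separate indexable page to worry about.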
Thank you for bringing up an important question, aakk9999. I don't think you can use POST here - they are looking for a URL crafted in a particular way, so it has to be GET.
|Care must be taken if you use GET rather than POST, because a GET URL can be indexed as a separate URL. |
I would assume you cannot block http://www.example.com/?_escaped_fragment_= or else Gbot won't be able to read it. You also cannot (as far as my limited understanding of the process goes) meta noindex it, because you do want it to be processed. So which URL, then, will show in the SERPs?
It looks like it's going to affect the site's ranking in at least two ways:
- The crawl budget will effectively be slashed in half because each page needs to be downloaded twice (once for each of the two versions)
- There is a chance the ?_escaped_fragment_= version will show in the SERPs instead of the normal URL, in which case you would still have to redirect mobile users.
Has anyone implemented this yet? Any experiences to share?
@FranticFish: Thanks for the tip! Strange, I didn't think this was a "Let Me Google It For You" (LMGIFY) type of forum. Anyway, I accept that Google has the info, and yet my searches yielded almost exclusively the UI point of view. Maybe I should have tried Bing :) LOL.
When I look for any infinite-scrolling SEO-related info, it seems to mostly refer to infinite loading of lists of links (like category listings and forum listings), and even though this is important from a UI standpoint, I don't even have those types of pages indexed and don't really care whether Google can read them. The best info related to the SEO of lazy loading (I prefer that to "infinite scroll" because when it comes to content it is not exactly "infinite") was a warning that it "may be an SEO problem". Dah!
<snip> I would be especially happy if anyone can share real world experience with lazy loading of content pages.
[edited by: aakk9999 at 3:58 pm (utc) on Aug 23, 2013]
[edit reason] ToS [/edit]
|I don't think you can use POST here - they are looking for a URL crafted in a particular way, so it has to be GET. |
I am not talking about the URL of your "parent" page. I am talking about the URL that is requested (via JS) to obtain the extra content on the holding page. This must not be indexable as a separate page. Perhaps that only happens if the solution is not implemented properly from a technical standpoint.
I will try to explain:
From what I have read, googlebot will request www.example.com/pageA?_escaped_fragment_=
and you will return the whole page (made up from all the bits that would lazy load for a visitor, such as partB, partC etc...)
Googlebot will index this HTML under URL www.example.com/pageA
What I am saying is that you should make sure that partB and partC cannot be requested via GET and viewed as a separate page.
The ?_escaped_fragment_= is something Googlebot will add to the URL itself when requesting it, provided it finds <meta name="fragment" content="!"> within the page HTML. So whilst I am not sure (not tested), I believe you will not have a problem with URLs being indexed with this parameter. In fact, other than technical people, nobody would even know they can request your page with this parameter and that in this case the whole page would be returned.
|The crawl budget will effectively be slashed in half because each page needs to be downloaded twice (once for each of the two versions) |
If instead of a lazy loading scroll you had pagination, then you would use at least the same, if not a bigger, crawl budget. On the other hand, I am not sure how Google has implemented it. It may be that Google gets the "small" page only the first time and thereafter keeps requesting the "full version" with the extra parameter, because the "full version" would also contain this meta tag. Please bear in mind this is just speculation on my side.
Mod's note: Real world experience can be shared, but please observe ToS with regards to domain name, niche and keywords.
|I would be especially happy if anyone can share real world experience with lazy loading of content pages. |
@aakk9999: yes, the script that delivers the separate chunks is definitely going to be both POST-only and blocked in robots.txt. I thought you were talking about the ?_escaped_fragment_= page.
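For what it's worth, blocking a chunk-serving script in robots.txt could look like the following (the /ajax/load-chunk path is purely a placeholder):

```
User-agent: *
Disallow: /ajax/load-chunk
```

Combined with POST-only delivery, that should keep the individual chunks out of any index.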
Pagination is still important: some pages without pagination would be so long (1MB+) that feeding them in their entirety to any bot would just be silly - most of the content would be ignored (Bing is already generating page-size errors as it is).
So, in reality, we are talking about an even more complicated setup.
Real users will only see:
http://www.example.com/page1.html (if they scroll far enough, it will eventually be 1MB of text)
But bots will have to get several pages:
http://www.example.com/page1.html?_escaped_fragment_= (first 200kB chunk)
http://www.example.com/page1-1.html?_escaped_fragment_= (second 200kB chunk)
http://www.example.com/page1-2.html?_escaped_fragment_= (third 200kB chunk)
http://www.example.com/page1-3.html?_escaped_fragment_= (fourth 200kB chunk)
http://www.example.com/page1-4.html?_escaped_fragment_= (fifth 200kB chunk)
And all these extra pages will have to be linked from the first, canonical if you will, fragment page http://www.example.com/page1.html?_escaped_fragment_=
Real users won't need those links, so you see, this is where the idea that bots and people see the same content starts to break down. I know, this is a little more than a wee bit complicated, and that's why I'm almost certain Gbot will not get it right (and Bing will simply be lost).
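The bot-facing URL list above could be generated like so (the function name is mine; the URL pattern just mirrors the examples in this post):

```javascript
// Build the list of bot-facing chunk URLs for a paginated page, following
// the page1.html / page1-1.html / page1-2.html ... naming used above.
function botChunkUrls(basePath, chunkCount) {
  const urls = [basePath + '?_escaped_fragment_='];
  for (let i = 1; i < chunkCount; i++) {
    // e.g. /page1.html -> /page1-1.html?_escaped_fragment_=
    urls.push(basePath.replace(/\.html$/, '-' + i + '.html?_escaped_fragment_='));
  }
  return urls;
}
```

The first fragment page would then link to every URL in this list so bots can discover all the chunks.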
@FranticFish: sorry for my previous comment, I forgot that this actually is a LMGIFY type forum due to the ToS policy against external links. Just wanted to say that I did Google it, but nothing useful turned up or, more accurately, the answers I got created more open questions than I had when I started, hence this thread.
No worries; I didn't properly appreciate the distinction between lazy load and infinite scroll (which, as you say, is for amalgamating paginated pages).
Why not just serve Google (and/or other bots) the pages without the lazy load code? That's no more cloaking than a responsive site is - the spiderable content and the human-readable content are exactly the same.
@FranticFish: the more I'm thinking about it, the more I am leaning towards a hybrid approach. I didn't want to call it infinite scrolling because I despise truly infinite scrolling myself. Lazy loading 10-20 items of content may be fine, and even useful on mobile phones, but if it really never ends, it becomes very hard to navigate and you can never go back and find anything again. And don't even get me started on Flickr's implementation of lazy loading huge pictures, which simply crashes my browser!
So, I am thinking I will end up still using pagination to split content into larger but still manageable chunks but will lazy-load those to real users in 1/20th increments. This is a mobile-first type of site, I just want to be polite to people that pay for bandwidth.
So, I think good old cloaking per se may not be required but if the ?_escaped_fragment_= mechanism works reliably, I would use it to serve bots the entire page, already assembled server-side.
Now if I could only get a word back from Bing about whether they support it!
Personally, I think in this type of situation I might let the pagination be indexed rather than trying to associate the separate pages all with one URL, and then use pushState() to change the displayed URL to the corresponding fragment of the main page, so any links or bookmarks take visitors there rather than to the pagination.
So, basically I'm saying I think I'd consider:
1.) Paginate and let the SEs index the separate pages.
2.) Set up all paginated pages to "lazy load" the next and previous content so if a visitor "landed in the middle" of a set of pages they could still have the same experience they would have if they navigated to that content via the "main page".
3.) "Mask" the paginated URL as the main URL with a corresponding "identifier" using pushState() [EG /the-section/the-info/27 would show as /the-section/the-info#27] so visitors land on the correct location next time they visit or if they link, etc.
4.) Make sure I used rel=prev, rel=next and rel=start on the pagination.
5.) I might even run rel=canonical href=http://www.example.com/the-section/the-info#27 on the http://www.example.com/the-section/the-info/27 locations in one section to test and see what the SEs did with it.
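Step 3 of the list above, as a rough sketch (function name and regex are mine; history.pushState() is the standard History API call):

```javascript
// Turn a paginated path like /the-section/the-info/27 into the hash form
// /the-section/the-info#27, per the masking idea described above.
function maskPaginatedPath(path) {
  const m = path.match(/^(.*)\/(\d+)$/);
  return m ? m[1] + '#' + m[2] : path;
}

// Browser-only: rewrite the address bar without reloading the page.
if (typeof history !== 'undefined' && history.pushState) {
  history.pushState(null, '', maskPaginatedPath(location.pathname));
}
```

Visitors who bookmark or share the page then carry the #27 locator of the main URL rather than the paginated one.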
@JD_Toims: thanks for your input! When you say
|Paginate and let the SEs index the separate pages. |
do you mean feed bots the ?_escaped_fragment_= (fully assembled) version of the page, or rely entirely on their being able to follow the chain of pushState changes? I guess this is where my grasp of the subject starts slipping: how would Gbot receive the pushState change? Don't they have to actually execute JS code, as in "emulate" scrolling, for that?
Also, I cannot have URIs with hashes in them - jQuery Mobile will go nuts, since it creates hundreds of its own hashed URIs that it uses to handle events within the DOM. Did I pick the wrong framework for the situation?
I think I'd go with unique URLs in a way I was sure all SEs could follow to get to the content via <noscript><a rel=next href=/the-section/the-info/2>2</a><a href=/the-section/the-info/3>3</a><a href=/the-section/the-info/4>4</a></noscript>
There's more than just Google to code for and we know they can all follow a URL within a <noscript> tag, so I'd feed them all "bread and butter" to make sure they could crawl.
I'm not sure on jQuery Mobile, but the only way to keep the same URL for linking etc. and not send visitors and links to separate URLs is with a #. Everything else "counts" as a unique URL, even a query string, so if you're not able to use a # with jQuery Mobile, it might not be the best solution.
One work-around to jQuery Mobile and not being able to use #'s might be to just paginate the mobile version normally [separate URLs] and rel=canonical those to the main [non-mobile] fragmented URL.
Well, the idea was to feed fragmented pages to mobiles (in small chunks) and whole pages to bots. I'm going to have to investigate whether there's a JQM workaround that makes it work with pushState.
|One work-around to jQuery Mobile and not being able to use #'s might be to just paginate the mobile version normally [separate URLs] and rel=canonical those to the main [non-mobile] fragmented URL. |