|Missing <BODY> tag - will Google not index the page properly?|
| 2:42 pm on Dec 28, 2008 (gmt 0)|
If a standard HTML-based web page does not contain a <BODY> tag, a </BODY> tag, or both, will Google fail to crawl and index the page properly?
Are there any other meta tags and associated delimiters that Google frowns upon if they are missing?
Thanks in advance!
| 3:04 pm on Dec 28, 2008 (gmt 0)|
The body element is optional in HTML, so there is no technical reason why Google would have a problem with this. The closing tag is certainly unnecessary, but as you suspect, I would think the issue is more to do with Googlebot successfully distinguishing the end of the head section from the beginning of the body.
Assuming there is no confusion, in the sense that the page is valid HTML and that no head elements appear in the source code after any content, then you will almost certainly not have any problems. The lack of a clear delimiter does make it more important that there are no errors in your markup, however.
As for the question about other delimiters and required elements, the most frequent cause of errors is unclosed elements, which can make the parser skip over some of your content.
| 3:07 pm on Dec 28, 2008 (gmt 0)|
|If a standard HTML-based web page does not contain a <BODY> tag, a </BODY> tag, or both, will Google fail to crawl and index the page properly? |
That is an interesting question!
I'm going to set up a few pages just for my own peace of mind. I'm aware that you don't need all of the normal markup that we have become accustomed to using. I wonder how a document will perform if it has only content and NO <head>/<body> elements.
| 3:54 pm on Dec 28, 2008 (gmt 0)|
I should have stated that I have some old sites, where some pages are not indexed, and where the HTML was handcrafted by yours truly. There may be issues with these files that have been revealed by some of the public-domain HTML validators available on the net.
I was wondering if anyone had done any formalized testing in this arena, and whether Google implies certain tags if they are missing. I too will need to do some testing in this arena.
| 5:17 pm on Dec 28, 2008 (gmt 0)|
I would think that if the files pass validation (either at HTML 3.2 or at HTML 4.01 Transitional) then the bot will have nothing it needs to guess at or correct.
| 6:05 pm on Dec 28, 2008 (gmt 0)|
|I would think that if the files pass validation (at either HTML 3.2 or at HTML 4.01 Transitional) then the bot will have no issues to try to guess/correct. |
I'm setting up my test page at this very moment. It will not pass validation. I'd have to "undo" quite a bit to have a page that validated and matched the existing site. So, I'm going to deal with the 11 errors and 2 warnings that are present. We'll see how not having <head> and <body> elements changes things.
I'm not too certain I have things right. Honestly? I've never built a page without those elements.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<style type="text/css">@import url("file.css");</style>
Even with the 11 errors and 2 warnings, the page displays just fine in the browsers. The visitor would never know the difference, and the bot probably won't either. I've also got a slight advantage as I use SOC (Source Ordered Content) and can serve primary content first, which of course allows Google to index what it came for: the content of the page. The rest of the fluff is secondary. ;)
If the browser displays the page as it should, the bot is most likely going to get the same thing. All that stuff in the <head> is there to further refine the document's contents. I would think this experiment will help determine how SOC comes into play, since we don't have the ability to specify a <title> (see update) and description. We have to rely on the first thing the bot indexes, which in an SOC environment is going to be an <h1> followed by a summary of the page content (IPW). ;)
7.3 The HTML element
Start tag: optional, End tag: optional
7.4.1 The HEAD element
Start tag: optional, End tag: optional
7.5.1 The BODY element
Start tag: optional, End tag: optional
Update: I was able to add a <title> element.
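For anyone who wants to see the principle rather than wait on the index: here's a rough sketch using Python's html.parser, with a hypothetical minimal page. Even with no <html>, <head>, or <body> tags at all, the title and heading remain perfectly extractable, which is roughly all a bot needs.

```python
from html.parser import HTMLParser

# Hypothetical minimal page: no <html>, <head>, or <body> tags anywhere.
MINIMAL_PAGE = """<!DOCTYPE html>
<title>Minimal test page</title>
<h1>Primary content heading</h1>
<p>First paragraph of the page.</p>"""

class SimpleIndexer(HTMLParser):
    """Crude sketch of what a bot extracts: the title and the first h1."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.title = ''
        self.h1 = ''

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current == 'title':
            self.title += data
        elif self.current == 'h1':
            self.h1 += data

indexer = SimpleIndexer()
indexer.feed(MINIMAL_PAGE)
print(indexer.title)  # Minimal test page
print(indexer.h1)     # Primary content heading
```

Note that html.parser is only a tokenizer - it doesn't build a DOM or imply the missing elements - but that is arguably closer to what an indexer does anyway.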
| 5:34 pm on Dec 29, 2008 (gmt 0)|
2008-12-29 Update: My test results are in and the page was indexed just fine. In fact, it holds top positions for its targeted keyword phrases without <html></html>, <head></head>, and <body></body> elements.
As expected, <title> was indexed along with the <h1> and first <p> showing as snippet.
Yes, I can see results in less than 24 hours sometimes.
| 6:01 pm on Dec 29, 2008 (gmt 0)|
One underlying issue here is how well do Google's error recovery routines work - and when do they fail. I know from experience that browser error recovery varies from browser to browser. Google's error recovery is in still another category, partly because their end goal is not to render the page visually (although they do some checking along those lines) but to analyze it for search relevance.
I once worked with a page that displayed fine in the major browsers, but Google's index was missing just some of the text - with some relatively uncommon terms in it. On investigating, I discovered a missing angle bracket [ > ] for a tag just before the unfindable text. I fixed that and the phrase became findable within a few days.
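That failure mode is easy to reproduce with any tolerant HTML tokenizer. A sketch with Python's html.parser (the tag and phrasing are invented for illustration): once the closing bracket is missing, everything up to the next > is consumed as part of the start tag, so the text never reaches the data handler.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text an indexer would see."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

good = TextExtractor()
good.feed('<p>alpha</p><b>rare phrase</b><p>omega</p>')

bad = TextExtractor()
# Same page, but the <b> start tag is missing its closing bracket:
# "rare phrase</b" is swallowed as attribute soup inside the tag.
bad.feed('<p>alpha</p><b rare phrase</b><p>omega</p>')

print(good.chunks)  # ['alpha', 'rare phrase', 'omega']
print(bad.chunks)   # ['alpha', 'omega'] - the phrase is unfindable
```

The parser recovers and carries on with "omega", which matches tedster's experience: only the text adjacent to the broken tag went missing, not the whole page.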
That experience is now a few years back, and I'll bet that Google's error recovery has continued to improve. After all, they want to find content. Even if they don't end up including everything in the final ranking decision, they still want to be the ones making that choice.
And in this case, it's clear that the page could be easily indexed - so that question is now answered - thanks P1R.
| 1:58 am on Dec 31, 2008 (gmt 0)|
When talking about delimiters for the sections of a document, we can only make vague guesses as to how Googlebot handles the markup. I've certainly had "minimalist" pages indexed with no problems, and I doubt that there are any elements which can be considered essential for a document to be indexed by Googlebot. The HTML specifications require implied HTML, HEAD, and BODY elements when none are present (meaning the parser has to add them to its DOM tree); however, as Google only has to extract data and not actually render the page, it may well not function in the same way that a graphical browser would.
The challenge is not really to see whether valid minimalist pages will be parsed, but to try to determine how Googlebot uses the body element with pages containing confusing markup. For example:
<h1>Will this be parsed?</h1>
<meta name="keywords" content="test">
<p>Or will the document start here?
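As a rough model of that ambiguity: under the spec, the implied body begins at the first element that cannot live in the head. A sketch in Python (the function and element list are simplified; real tree construction is messier):

```python
# Elements allowed inside <head>; anything else ends the implied head.
HEAD_ONLY = {'title', 'base', 'link', 'meta', 'style', 'script', 'noscript'}

def implied_body_split(tags):
    """Split a flat tag sequence into (head, body) when the page has
    no explicit <head>/<body> delimiters."""
    for i, tag in enumerate(tags):
        if tag not in HEAD_ONLY:
            return tags[:i], tags[i:]
    return tags, []

# The example above: an <h1>, then a late <meta>, then a <p>.
head, body = implied_body_split(['h1', 'meta', 'p'])
print(head)  # [] - the h1 has already forced the head closed
print(body)  # ['h1', 'meta', 'p'] - the keywords meta lands in the body
```

Under HTML5 tree construction a late <meta> is still parsed, but it stays inside the body in the DOM; whether Google still treats it as page metadata at that point is anyone's guess, which is exactly the question.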
If you think this kind of markup is unlikely, you can find plenty of examples where poorly-implemented server-side includes pull a complete HTML document, rather than a fragment, into a page - and so you get multiple body elements within the same page.
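A quick way to spot that condition on your own pages is to count <body> start tags - a sketch with Python's html.parser (the page below is a made-up example of botched include output):

```python
from html.parser import HTMLParser

class BodyCounter(HTMLParser):
    """Count <body> start tags; more than one suggests a botched include."""
    def __init__(self):
        super().__init__()
        self.bodies = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'body':
            self.bodies += 1

# Hypothetical output of a server-side include that pasted in a
# complete HTML document instead of a fragment:
page = """<html><head><title>Outer page</title></head><body>
<html><head><title>Included header</title></head><body><p>Nav</p></body></html>
<p>Main content</p>
</body></html>"""

counter = BodyCounter()
counter.feed(page)
print(counter.bodies)  # 2 - the include brought its own <body>
```

Validators will flag this too, of course, but on a large site a quick script over the rendered pages finds the broken includes faster.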
Googlebot has to be very liberal in what it accepts, due to the nature of the pages it has to digest, in much the same way that browsers handle extremely broken documents. However, some errors will undoubtedly make the parser skip over zones of content, as tedster mentioned above.
See this example from HTML5 developer (and Google employee) Ian Hickson: Tag Soup: How UAs handle <x> <y> </x> </y> [ln.hixie.ch] to get an idea of how user agents such as browsers and Googlebot work when handling invalid markup.
| 3:02 am on Dec 31, 2008 (gmt 0)|
I doubt Google will worry much - but some browsers may decide to fail to display the pages as you would wish.
It's much more a browser issue than an SE issue.
I suspect that a missing <title> tag and/or meta description would have a much greater influence on the SERPs.
Depending on your site, geolocation tags, character sets and 'expire' meta tags may also matter.
Test your site in Opera and all the major browsers before making final decisions.