Parsing HTML - code or just text content?

Do search engines index HTML code or just the textual content?

JamieBrown

1:00 pm on Jun 28, 2005 (gmt 0)

10+ Year Member



Hi,

I've done a WebmasterWorld search on this and couldn't find anything, so apologies if it's already been discussed! I'm a bit new here! :-)

I was wondering if anyone knows how the search spiders parse HTML code. The Google site says that a page looks to the Google spider very much like it does in Lynx, which basically just displays the page text and "ignores" the HTML. Other sites say it looks like "View Source", which, of course, shows the full HTML source. Which is true?

That is, ignoring the 100K limit, does having lots of HTML code on the page actually affect anything, or is the HTML automatically stripped out and only the text used?

Also, is there evidence to suggest that content near the top of the HTML is given a higher 'importance' than content further down (again ignoring the 100K limit)? I've heard that this is true, but haven't seen any authoritative comments on the matter.

If someone knows the answer, or can point me to a similar thread, I'd be very grateful!

Thanks!

James.

pageoneresults

2:04 pm on Jun 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello JamieBrown, Welcome to WebmasterWorld!

Excellent topic you've chosen to discuss. Let's see if we can get some definitive answers for you from our membership.

Here's a good topic to start with from our Board Administrator, Brett Tabke, back in June of 2001, which is still applicable today. The concepts are the same; the technology has improved.

How Search Engines Work [webmasterworld.com]

JamieBrown

8:14 am on Jun 29, 2005 (gmt 0)

10+ Year Member



Hi PageOne!

Thanks for your reply and the link! :-) I guess that I should start by getting my terminology right.

So, does the indexer see a page like Lynx or like "View Source"? As an example to clarify my dilemma, if you look at eBay.com in Lynx, and then look at the source of eBay.com, they are radically different. If the Google indexer sees it as in Lynx, then there is good content right at the top of the page, but if the indexer sees it as in View Source, then the things at the top are scripts, tables and image maps.
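To make the difference between the two views concrete, here is a rough Python sketch using only the standard library's html.parser. It reports the source line on which the first visible text sits, which in a text-only view would simply be the first line. The class name FirstTextFinder, the helper first_visible_text and the sample page are all made up for illustration, and nothing here is meant to describe how any real search engine weighs position.

```python
from html.parser import HTMLParser


class FirstTextFinder(HTMLParser):
    """Hypothetical helper: remember where the first visible text appears."""

    def __init__(self):
        super().__init__()
        self.first_text = None
        self.first_line = None
        self._hidden = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if self.first_text is None and not self._hidden and data.strip():
            # getpos() points at the start of this chunk, so skip past any
            # leading newlines to reach the line that holds the text itself.
            leading = data[: len(data) - len(data.lstrip())]
            self.first_line = self.getpos()[0] + leading.count("\n")
            self.first_text = data.strip()


def first_visible_text(html):
    """Return (source line number, text) for the first visible text."""
    finder = FirstTextFinder()
    finder.feed(html)
    finder.close()
    return finder.first_line, finder.first_text


# Made-up page: script and layout table first, real content further down.
sample = """<html>
<head>
<script>
var tracking = "not shown to the reader";
</script>
</head>
<body>
<table><tr><td>
<h1>Blue widgets for sale</h1>
</td></tr></table>
</body>
</html>"""

line, text = first_visible_text(sample)
print(f"First visible text {text!r} is on source line {line}")
```

In the sample, the heading is the very first thing a text-only browser would show, yet it only turns up on line 9 of the source.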

Thanks! :-)

James

JamieBrown

10:29 am on Jun 30, 2005 (gmt 0)

10+ Year Member



Hi all,

I actually think I've answered my own first question - I've discovered the wonderful world of spider emulators. So spiders do see things more like Lynx than View Source! If any other noobs like me would like a go, just put "spider emulator" into Google and you'll get loads. :-)
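For anyone curious what such a tool roughly does, below is a minimal Python sketch of the idea, again with just the standard library's html.parser. It is only an approximation of a text-only view, not how any particular emulator or search engine actually works. The names TextOnlyView and text_only and the sample page are invented for the example.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Hypothetical helper: keep page text, drop tags and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs and collapse internal whitespace.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def text_only(html):
    """Return the text content of an HTML string as one line."""
    parser = TextOnlyView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page with a script, a stylesheet and a layout table.
sample = """<html><head><title>Widgets</title>
<script>var x = "ignored";</script>
<style>body { color: red; }</style></head>
<body><table><tr><td><h1>Blue widgets</h1></td></tr></table>
<p>Cheap &amp; cheerful.</p></body></html>"""

print(text_only(sample))
```

The script, stylesheet and table markup all drop out, and only the title, heading and paragraph text remain.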

But it would still be nice to know if there is any evidence to suggest that text higher up the page is regarded as more relevant by the search engine than text lower down the page. And if so, by how much?

Any pointers?! :-)

Cheers,

Jamie.