Forum Library, Charter, Moderators: brotherhood of lan & mack

New To Web Development Forum

This 36 message thread spans 2 pages.
How Do Search Engine Robots Work?
Questions and Answers
pageoneresults




msg:3206923
 4:02 pm on Jan 3, 2007 (gmt 0)

10:06 pm on June 11, 2001
How search engines work. A primer.
[webmasterworld.com...]

Search engines consist of five discrete software components:

  1. Spider : a robotic, browser-like program that downloads webpages.
  2. Crawler : a wandering spider that automatically follows links found on pages.
  3. Indexer : a blender-like program that dissects webpages that are downloaded by spiders.
  4. The Database : a warehouse of the pages downloaded and processed.
  5. Search Engine Results Engine : digs search results out of the database
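
To make the indexer, database and results engine pieces a little more concrete, here is a minimal sketch. It assumes the spider has already downloaded the pages; the URLs, page text and function names are all hypothetical, and a real engine does far more than this.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Results engine: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

# Hypothetical pages, standing in for what a spider has already downloaded.
pages = {
    "http://example.com/a": "Spiders download web pages",
    "http://example.com/b": "Crawlers follow links found on pages",
}
index = build_index(pages)   # the "database" here is just a dict in memory
print(search(index, "pages"))         # ['http://example.com/a', 'http://example.com/b']
print(search(index, "follow links"))  # ['http://example.com/b']
```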

Some questions to ponder...

  1. Do robots accept cookies?
  2. What happens if my site forces a cookie?
  3. Do robots execute JavaScript functions?
  4. Could I be doing something technically that is stopping a robot from indexing my site?
  5. How do robots interpret my page?
  6. In what order do robots index my page? What is the very first step that a robot takes?

Those are some general questions that I'm sure most that are New To Web Development might have.

What questions do you have in regards to robots?

And, who has the answers to the above? ;)

As an added bonus, I finally confirmed who coined the term SERP (Search Engine Results Page). It was Brett_Tabke as confirmed in the above topic. ;)

 

jatar_k




msg:3208676
 11:27 pm on Jan 4, 2007 (gmt 0)

  1. Do robots accept cookies?

    normally, no

  2. What happens if my site forces a cookie?

    well, if the page requires that cookie for the robot to see the content then the robot sees nothing

  3. Do robots execute JavaScript functions?

    normally, no

  4. Could I be doing something technically that is stopping a robot from indexing my site?

    most definitely, I have seen tons of foolishness over the years with clients, friends or just surfing the web

  5. How do robots interpret my page?

    I assume you mean that they just read the source as opposed to "viewing" it (sorry question is a tiny bit vague)

  6. In what order do robots index my page? What is the very first step that a robot takes?

    in the order they find them, not sure on this question

    robots who behave always request a robots.txt first
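
A minimal sketch of that first step, using Python's standard urllib.robotparser. The robots.txt content, the "ExampleBot" user agent and the example.com URLs are made up for illustration; a real robot would download the file from the site before requesting anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from
# http://example.com/robots.txt before requesting any other URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="ExampleBot"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(allowed("http://example.com/index.html"))           # True
print(allowed("http://example.com/private/report.html"))  # False
```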

I know you weren't asking me but I figure some people might be a little nervous to make a mistake. I, on the other hand, am not afraid of being a fool. ;)

pageoneresults




msg:3208714
 12:02 am on Jan 5, 2007 (gmt 0)

We're on the right track!

Do robots accept cookies?
normally, no

What happens if a robot hits a site that sends it into a 302 loop because it doesn't accept cookies. Is there a predefined limit of how many loops it will go through and what the final outcome is?

I, on the other hand, am not afraid of being a fool. ;)

lol! Me neither. That's why I started the topic. And, since there was no response for quite some time, I was feeling "quite the fool". ;)

I have some questions in my mind that need answering and I don't have someone sitting next to me who can answer each one with authority.

jatar_k




msg:3208942
 4:53 am on Jan 5, 2007 (gmt 0)

>> What happens if a robot hits a site that sends it into a 302 loop because it doesn't accept cookies

the good news or the bad news

I have seen servers go down because of things like this though not exactly

I have seen sites go down from problems like these

I would hope that a bot from a larger SE would have a timeout, I would hope all of them would, but I am guessing that you wouldn't get properly indexed regardless of how the bot dealt with it
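
Here is a rough sketch of how such a timeout might look, written as a cap on redirects. The limit of 5 and the fake_site function are assumptions for illustration only; I don't know what limits the real bots use. The fake site redirects to itself until the client sends a cookie back, which is the loop described above.

```python
MAX_REDIRECTS = 5  # assumed limit; real crawlers do not publish theirs

def fake_site(url, cookies):
    """Hypothetical server: redirects to itself until a cookie is sent back."""
    if "session" not in cookies:
        return 302, {"Location": url, "Set-Cookie": "session=abc"}, ""
    return 200, {}, "<html>content</html>"

def fetch(url, accept_cookies, max_redirects=MAX_REDIRECTS):
    """Follow redirects up to a cap; return (status, body, redirects followed)."""
    cookies = {}
    for hops in range(max_redirects + 1):
        status, headers, body = fake_site(url, cookies)
        if status != 302:
            return status, body, hops
        if accept_cookies and "Set-Cookie" in headers:
            name, _, value = headers["Set-Cookie"].partition("=")
            cookies[name] = value
        url = headers["Location"]
    return None, "", max_redirects  # gave up: there is nothing to index

print(fetch("http://example.com/", accept_cookies=False))  # (None, '', 5)
print(fetch("http://example.com/", accept_cookies=True))   # (200, '<html>content</html>', 1)
```

Either way the cookie-less bot ends up with no content, which matches the guess that the page would not get properly indexed.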

jdMorgan




msg:3209738
 8:27 pm on Jan 5, 2007 (gmt 0)

Spider : a robotic, browser-like program that downloads webpages.
Crawler : a wandering spider that automatically follows links found on pages.

I'd tend to label the first item simply a page viewer or downloader, and describe spider and crawler as synonymous.

The analogy is that a spider traverses the strands (links) of the Web, and a crawler follows the strands/links as well; both download content and add the links they find to their list of further pages to be fetched. These terms are used interchangeably by search engines.

"Robots" is the generic term for any automated program that fetches pages (or perhaps just server headers) for any reason, with crawlers/spiders as a subcategory of robots.

Jim

mack




msg:3211969
 10:45 am on Jan 8, 2007 (gmt 0)

Do robots accept cookies?

Generally not, but it does depend on the purpose of the spider. Most search engine spiders would not accept cookies because the content the search engine would read may be very different to the end user, who does not have the cookie. The aim of the spider is to deliver default cookie free content.

What happens if my site forces a cookie?

I think this may depend on how you force the cookie. In either instance the bot may simply not programmatically be able to handle a cookie.

Do robots execute JavaScript functions?

Again this may differ from bot to bot. In many cases the JS would be a vital part of the overall page and therefore may be indexed as part of it. As search engines get smarter the inclusion of JS based elements and page sections is becoming more and more common. There are still problems associated with the inclusion of JS and in some instances this may not be desirable.

Could I be doing something technically that is stopping a robot from indexing my site?

You could and I guess you should. There are always sections of a site that you don't want to show up in serps. Blocking can be done using the standard robots.txt file or even your .htaccess file.

The problem with using only robots.txt is that certain pages may still appear in results even though the page has not been indexed. Generally if this happens the results will show your page title based on anchor text, and will have no description.

How do robots interpret my page?

Are you asking for the holy grail? lol

I think this varies a lot from bot to bot, but most search engines will try to read the page as much like a human as possible. The actual page code is not as important as many believe, but it does need to validate enough for the spider to be able to read the page.

The robot will follow a link to your page. Right away it has anchor text. This may provide the bot with clues about your page's possible content.

As the bot lands on your page it will read your header section: the title, and possibly the description and keywords, although both of these elements are very much deprecated. It will then read your body section and work out the exact content of your page.

It will then use various techniques to work out the important aspects of the page. For example, bold text and <h> tags mark an area of text as more important.
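
As a rough illustration of that last point, the sketch below counts words and gives extra weight to text inside the title, heading and bold tags. The weights are invented for the example; no search engine publishes its own, and the real techniques are certainly more involved.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights, chosen only for illustration.
WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "b": 2, "strong": 2}

class WeightedText(HTMLParser):
    """Count words, giving extra weight to text inside emphasised tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Use the heaviest enclosing tag, or 1 for ordinary body text.
        weight = max((WEIGHTS[t] for t in self.open_tags), default=1)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[word] += weight

page = ("<html><head><title>Widgets</title></head>"
        "<body><h1>Blue widgets</h1><p>We sell <b>cheap</b> widgets.</p></body></html>")
parser = WeightedText()
parser.feed(page)
parser.close()
print(parser.scores.most_common(3))  # [('widgets', 10), ('blue', 4), ('cheap', 2)]
```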

Every bot works differently but there are rules most follow.

Mack.

pageoneresults




msg:3212132
 3:05 pm on Jan 8, 2007 (gmt 0)

jatar, jd and mack, I'm glad you guys found this topic!

In regards to cookies, I'm finding sites that are serving a 302 loop because of cookies. Yes, they are serving that to the bot too.

Let's talk about how robots interpret your page for a bit. If I follow Brett's historical topic, you have three different types of robots, a spider, crawler and indexer.

First the Spider comes around and requests the URI. It reads server header information and other on page <head></head> information. Then the Crawler follows all the links within that domain (those that are found and allowed). Then the Indexer reads the html while making heads and tails of it.

Is that the process with today's technically savvy robots?

jatar_k




msg:3212157
 3:37 pm on Jan 8, 2007 (gmt 0)

I'm with Jim, I think crawler and spider are synonymous anymore

I would even think that the 3 distinctions are different aspects/behaviours of the same thing, the SE robot

as we see historically different bots from Google sharing the workload and multi tasking I think robots have become more multi use and less specialized.

pageoneresults




msg:3212158
 3:41 pm on Jan 8, 2007 (gmt 0)

So, we have a multi-tasking bot. Actually, we have a network of bots each indexing specific content. Okay, that answers one question.

Once the multi-tasking bot requests a URI, the first thing it does is request the server headers, correct?

From there it requests the <head></head> of the document, correct?

And then from there it traverses the html/xhtml stripping out all markup and ending up with one big chunk of text, correct?
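
For what that last step could look like, here is a small sketch using Python's standard html.parser. It only shows the idea of stripping markup down to one chunk of text; skipping script and style content is my own assumption, and nothing here is confirmed as what any search engine actually does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

def strip_markup(html):
    """Return the page text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

print(strip_markup("<h1>Hello</h1><script>var x = 1;</script><p>plain <em>text</em></p>"))
# Hello plain text
```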

Easy_Coder




msg:3212254
 5:25 pm on Jan 8, 2007 (gmt 0)

How do robots interpret my page?

I would think that the robot's job is to just grab your html and stuff it somewhere back on a server where the crawling originated from and then move on to the next item/url in its work list.

It's the Indexing mechanism that will do the interpretation of your HTML.

mack




msg:3212612
 10:33 pm on Jan 8, 2007 (gmt 0)

The bot generally downloads the entire page to a server at the SE. What the SE generally ends up with is a local copy of the indexable web.

It is locally on the SE server clusters that the real number crunching begins. Links, anchor, content analysis etc are all taken into account at this stage.

One thing I think we can also take into account is the search engine spider update schedule. Most search engines use the Last-Modified header to determine if a page has been updated. It makes little or no sense to re-index pages that have not changed.

Mack.

coopster




msg:3212716
 12:06 am on Jan 9, 2007 (gmt 0)

Once the multi-tasking bot requests a URI, the first thing it does is request the server headers, correct?

I don't know if it still holds true, but I believe the steps are actually ...

  1. Look up DNS (possibly from local cache)
  2. Connect to host
  3. Send request
  4. Receive Response

As mentioned, it may only be a HEAD method that the robot requested. Therefore the only information received would be the headers. Otherwise yes, the header information is always passed first followed by the message-body.

So, the robot either requests just the headers, or it may request the headers and message-body.
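
The difference between the two is easy to see with a small test. This sketch starts a throwaway server on 127.0.0.1 (so there is no DNS lookup step here) and sends it one HEAD and one GET using Python's standard http.client. The handler and the page body are made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>Hello, robot</body></html>"

class Handler(BaseHTTPRequestHandler):
    """Tiny local test server so the example needs no outside network."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()        # headers only, no message-body

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)      # headers first, then the message-body

    def log_message(self, *args):   # keep the console quiet
        pass

def request(method, port):
    """Connect, send one request, and return (status, headers, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, "/")
    response = conn.getresponse()
    result = (response.status, dict(response.getheaders()), response.read())
    conn.close()
    return result

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

head = request("HEAD", port)
get = request("GET", port)
server.shutdown()
server.server_close()

print(head[0], len(head[2]))  # 200 0
print(get[0], len(get[2]))    # 200 38
```

Both responses carry the same headers; only the GET comes back with a body.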

pleeker




msg:3212739
 12:26 am on Jan 9, 2007 (gmt 0)

The robot will follow a link to your page. Right away it has anchor text.

Really? So if I have an outbound link, the bot leaves my page and starts crawling the linked-to page? Yikes. When does it come back?

I always assumed it collects links and adds them to a list for future processing.

One other question I have:

How would a bot handle a page with a LOT of links? Is there really such thing as too many links on a page for a bot to handle?

jatar_k




msg:3212951
 6:33 am on Jan 9, 2007 (gmt 0)

>> the bot leaves my page and starts crawling the linked-to page

I wouldn't say it is quite that instantaneous. It would store those links and add them to its list, or once that page is stored the indexer would add a tick that you linked to that page and weight it accordingly.
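
Something like the sketch below, where links found on a page go onto a list (a queue here) and are fetched later in the order they were discovered. The three-page "web" is a hypothetical dict standing in for real downloads, and breadth-first order is my assumption; real crawlers schedule their lists in more complicated ways.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical three-page web, standing in for real HTTP fetches.
WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": "no links here",
}

def crawl(seed):
    """Visit pages in the order their URLs were discovered (breadth-first)."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(WEB.get(url, ""))
        for href in extractor.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:        # queue each URL only once
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("http://a.example/"))
# ['http://a.example/', 'http://a.example/about', 'http://b.example/']
```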

>> Is there really such thing as too many links

I would think that goes more into the realm of the algo, as opposed to what robots do

shri




msg:3212988
 7:44 am on Jan 9, 2007 (gmt 0)

Would highly recommend downloading and letting an open source crawler like Nutch do a few crawls to understand how a basic search engine crawler works.

pleeker




msg:3212990
 7:45 am on Jan 9, 2007 (gmt 0)

I would think that goes more into the realm of the algo, as opposed to what robots do

To a degree, but I'm looking at a shopping site (major retailer) that has more than 300 links on the home page, and the main nav is buried (due to CSS) down near the bottom of the code. So my concern is if the bot will bother with so many links.

(The closest thing to a site map has about 500 links on it.)

Does a bot care if there are that many links? Will it just give up at some point?

IanKelley




msg:3213044
 9:15 am on Jan 9, 2007 (gmt 0)

500 links is nothing. I wouldn't worry. A lot of dynamically generated sites have far more internal links than that. And many directories have far more external.

The big SE spiders undoubtedly have certain criteria they use to determine whether or not to stop adding links to their crawl list from a given site but I'd be willing to bet none of them have a specific numerical limit.

mack




msg:3213238
 12:42 pm on Jan 9, 2007 (gmt 0)

if I have an outbound link, the bot leaves my page and starts crawling the linked-to page? Yikes. When does it come back?

We need to remember that spiders run in threads, there is not simply one instance of the bot, the bot will follow links, but will also finish the pages it is currently working on.

If we think in terms of the larger search engines there may be thousands of threads / spider processes running at any one time.
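
A small sketch of the idea, using a thread pool from Python's standard library. The fetch function only sleeps to imitate a download, and the URLs and the pool size of 4 are made up; the large engines obviously run at a very different scale.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a network download; sleeps instead of doing real I/O."""
    time.sleep(0.05)
    return url, len(url)

urls = [f"http://example.com/page{i}" for i in range(8)]

# Four worker threads share the list of URLs; each one finishes the page
# it is working on before picking up the next.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))   # results keep the input order

print(len(results))  # 8
```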

Mack.

pleeker




msg:3213668
 6:25 pm on Jan 9, 2007 (gmt 0)

Thanks IanKelley and mack.

pageoneresults




msg:3213801
 7:54 pm on Jan 9, 2007 (gmt 0)

500 links is nothing. I wouldn't worry.

I might. 500 links is quite a load for a single page. I usually keep them to an absolute minimum to not dilute the page too much.

Google has something to say about more than 100 links per page...

Offer a site map to your users with links that point to the important parts of your site. If the site map is larger than 100 or so links, you may want to break the site map into separate pages.

and...

Keep the links on a given page to a reasonable number (fewer than 100).

[google.com...]

pageoneresults




msg:3213844
 8:49 pm on Jan 9, 2007 (gmt 0)

Most search engines use the Last-Modified header to determine if a page has been updated. It makes little or no sense to re-index pages that have not changed.

Okay, what if the server does not support the 304 Not Modified response? I'll assume Googlebot will then reindex that page? Does it have any sort of "compare" functionality? I mean, would it compare the new page to the old page and determine changes and use that if 304 was not supported?

10.3.5 304 Not Modified
If the client has performed a conditional GET request and access is allowed, but the document has not been modified, the server SHOULD respond with this status code. The 304 response MUST NOT contain a message-body, and thus is always terminated by the first empty line after the header fields.

If Googlebot doesn't do the compare, and a 304 is not supported, that means a page that has not changed since the last indexing gets indexed again.
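
To be clear about what I mean by a "compare", here is one way it could be done: keep a digest of each page from the last crawl and compare it with the digest of the new download. This is purely a sketch of the idea. I have no confirmation that Googlebot or any other bot works this way, and the names and the in-memory store are hypothetical.

```python
import hashlib

def fingerprint(body):
    """Return a SHA-256 digest of the page body (bytes)."""
    return hashlib.sha256(body).hexdigest()

stored = {}  # url -> digest from the previous crawl (hypothetical store)

def has_changed(url, body):
    """Compare the new body with the stored digest, then update the store."""
    digest = fingerprint(body)
    changed = stored.get(url) != digest
    stored[url] = digest
    return changed

print(has_changed("http://example.com/", b"<p>hello</p>"))   # True (first visit)
print(has_changed("http://example.com/", b"<p>hello</p>"))   # False (same bytes)
print(has_changed("http://example.com/", b"<p>hello!</p>"))  # True (edited page)
```

Note that the page still has to be downloaded in full before it can be compared, so this saves indexing work but not bandwidth.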

Wouldn't it be to my advantage to make sure that my server supports the 304 Not Modified? :)

[google.com...]

304 (Not modified)
The requested page hasn't been modified since the last request. When the server returns this response, it doesn't return the contents of the page.

You should configure your server to return this response (called the If-Modified-Since HTTP header) when a page hasn't changed since the last time the requestor asked for it. This saves you bandwidth and overhead because your server can tell Googlebot that a page hasn't changed since the last time it was crawled.

And to harness those bots? I really only want them to index my freshest and most relevant content. Don't I? What if that bot is programmed to retrieve only so much information?

pageoneresults




msg:3214046
 12:02 am on Jan 10, 2007 (gmt 0)

I'd like to add that there is an additional header involved here prior to the 304.

14.25 If-Modified-Since
[w3.org...]

Google specifically refers to this header in their webmaster guidelines and suggests that your server be configured to support it.

coopster




msg:3214129
 1:25 am on Jan 10, 2007 (gmt 0)

If the server does not support it, you can still program around it and send your own header.
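
A minimal sketch of programming around it, assuming your script knows when the page last changed. The function name and the dates are hypothetical; the logic just compares the If-Modified-Since value sent by the client with the page's modification time and picks between 304 and 200.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a page last changed at last_modified (UTC)."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None                      # unparseable date: ignore it
        if since is not None:
            if since.tzinfo is None:          # no usable zone given: assume UTC
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers           # no message-body is sent
    return 200, headers

changed_at = datetime(2007, 1, 5, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_response("Tue, 09 Jan 2007 20:49:00 GMT", changed_at)[0])  # 304
print(conditional_response("Mon, 01 Jan 2007 00:00:00 GMT", changed_at)[0])  # 200
print(conditional_response(None, changed_at)[0])                             # 200
```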

There is also a must-read document titled "The Anatomy of a Search Engine" from a duo at Stanford that is of utmost interest to anybody involved in SEO, SEM and any other SE acronym you can think of. I'm certain any and all of us in this thread have heard of and read the document, but for future readers I would highly recommend the read.

BTW, P1R -- thanks for starting the thread. The discussion is great and much appreciated. Reminds me of evenings at PubCon ;)

Enviromed




msg:3214302
 5:45 am on Jan 10, 2007 (gmt 0)

Do robots crawl PDFs?

IanKelley




msg:3214307
 5:56 am on Jan 10, 2007 (gmt 0)

Yes, some crawlers download PDFs and some indexers parse them.

adsoft13




msg:3214447
 9:39 am on Jan 10, 2007 (gmt 0)

We have made a test and the result shows:
MSN and Google bots do execute JavaScript.
They don't understand all commands, but they are pretty good at the most common commands.

skadamo




msg:3215393
 11:21 pm on Jan 10, 2007 (gmt 0)

Can robots submit forms?

Obviously they can't fill out fields but can they click a button that POSTs to a server?

pageoneresults




msg:3215400
 11:30 pm on Jan 10, 2007 (gmt 0)

Can robots submit forms?

Great question!

I'd also like to add...

Can and do robots read what is between the <form></form> elements?

IanKelley




msg:3215420
 11:51 pm on Jan 10, 2007 (gmt 0)

There are lots of robots that both fill out and submit forms. As far as I know none of them work for a major search engine though :-)

Virtually every crawler reads inside of <form> elements.

Because the <form> action is a url I imagine some crawlers do follow the link.

jatar_k




msg:3216104
 3:42 pm on Jan 11, 2007 (gmt 0)

robots don't actually click but they can read the action, get all the field names and then send a post/get to the form action
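
Roughly like the sketch below, which reads the first form on a page, collects the named inputs and works out the request a robot could send. The class name, the sample form and the example.com URL are made up, and it only builds the request; it doesn't send anything.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class FormReader(HTMLParser):
    """Record the first form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self.in_form = False
        self.done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.done:
            self.in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True

def build_request(page_url, html):
    """Return (method, url, body) that a robot could send for the form."""
    reader = FormReader()
    reader.feed(html)
    target = urljoin(page_url, reader.action or "")
    data = urlencode(reader.fields)
    if reader.method == "post":
        return "POST", target, data
    return "GET", f"{target}?{data}" if data else target, None

html = ('<form action="/search" method="post">'
        '<input name="q" value="robots"><input type="submit" value="Go"></form>')
print(build_request("http://example.com/page", html))
# ('POST', 'http://example.com/search', 'q=robots')
```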

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved