| This 36 message thread spans 2 pages: < < 36 ( 1  ) || |
|How Do Search Engine Robots Work?|
Questions and Answers
10:06 pm on June 11, 2001
How search engines work. A primer.
|Search engines consist of five discrete software components: |
- Spider : a robotic, browser-like program that downloads webpages.
- Crawler : a wandering spider that automatically follows links found on pages.
- Indexer : a blender-like program that dissects webpages that are downloaded by spiders.
- The Database : a warehouse of the pages downloaded and processed.
- Search Engine Results Engine : digs search results out of the database
Some questions to ponder...
- Do robots accept cookies?
- What happens if my site forces a cookie?
- Could I be doing something technically that is stopping a robot from indexing my site?
- How do robots interpret my page?
- In what order do robots index my page? What is the very first step that a robot takes?
Those are some general questions that I'm sure most that are New To Web Development might have.
What questions do you have in regards to robots?
And, who has the answers to the above? ;)
As an added bonus, I finally confirmed who coined the term SERP (Search Engine Results Page). It was Brett_Tabke as confirmed in the above topic. ;)
Robots (Googlebot included) definitely do action forms as I've seen pages indexed via the checkout and basket pages of shopping carts that use forms i.e.
I recommend blocking your basket, checkout, search, etc. pages in robots.txt as standard.
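For anyone who wants to test that kind of blocking before going live, here is a minimal sketch using Python's standard `urllib.robotparser`. The paths (`/basket/`, `/checkout/`, `/search/`) and the `www.example.com` host are assumptions for illustration only; substitute whatever your cart software actually uses. The helper `is_allowed` is a hypothetical name, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths are assumptions, not a standard.
ROBOTS_TXT = """\
User-agent: *
Disallow: /basket/
Disallow: /checkout/
Disallow: /search/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    paths = ["/basket/", "/checkout/step1", "/search/?q=widgets",
             "/products/widget.html"]
    for path in paths:
        url = "http://www.example.com" + path
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked"
        print(path, verdict)
```

Keep in mind this only shows what a well-behaved robot should do with the file. It says nothing about crawlers that ignore robots.txt.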
I'm a newbie here, so greetings all!
Many thanks for the great info on robots.txt and the shenanigans some crawlers (et al) get up to. It's good to know I'm not alone, so thanks again.
[edited by: Marlbro at 11:32 am (utc) on Jan. 22, 2007]
Welcome to WebmasterWorld Marlbro
Okay, I have some more questions. Here comes the noob in me...
So we have these "robots" that are threaded and perform various functions while indexing a document. Once that information is retrieved, it is stored as a local copy with the search engine. That local copy is then crunched using the SE's algorithm. Pretty simple process, eh?
Now, at what point do Semantics come into play? And how the hell is html/xhtml interpreted? I mean, if I spend all this time ensuring that my pages are structured semantically, utilize valid html/xhtml/css, where does all that come into play during the indexing routines?
What exactly does a bot do with raw text that has been stripped of its html? You see, there is this process that I'm still not 100% sure of. Based on my interpretation of what a robot is doing, there is a point where it strips all html and is left with raw text. How does it interpret anything from the raw text if it is void of its semantic containing elements?
Forgive me if I seem a little dense in this subject. Sometimes I just need the right "layman's" explanation and it all clicks. And there are times when it just doesn't click and I go through life with all these unanswered questions. :)
|Now, at what point do Semantics come into play? |
As a general rule, they don't. Which is what is wrong with Internet search today.
Most of today's search engines don't give a hoot about semantics. They care about keywords. I argue that they are actually helping to dis-educate our youth - who are being taught to talk in "keywords" instead of complete sentences.
Now, we may not be talking about the same thing: I really have to object to the incorrect usage of "semantics" to describe HTML markup that classifies content according to usage. This is NOT "semantics". Semantics is the study of meaning. HTML markup can't tell you the meaning of words, except in the very broadest terms, and really only peripheral to HTML markup's main job of dealing with visual presentation - e.g. "this is a list of something".
Today's web search engines generally don't know or care about the meanings of words. Some may be starting to pay attention to the meaning of HTML markup, especially as regards the "semantic web". I hate that term. I suppose the misnomer has been applied to stress the advantage of tagging, say, addresses. So, now we know that an address is an address, or a name is a name. And maybe the search engine can do something with knowing that an address is an address.
But knowing the subject and object of a sentence - let alone its meaning? Fergetaboutit! This is where search has to go, and for some reason steadfastly refuses to do so...
|That local copy is then crunched using the SE's algorithm. Pretty simple process eh? |
Now, at what point do Semantics come into play?
It's often more like this:
1) Spider downloads pages and parses for links. The pages are sent to...
2) Another app stores them.
3) A third app parses the stored pages and extracts key words and phrases for storage in a database that will be accessed by...
4) A fourth app that queries the database based on keyword searches.
It's of course even more complex than that because there are so many other factors these days. Steps three and four branch out into a lot of sub-processes, and step four often retrieves bits of the data stored in step 2.
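The four steps above can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not how any real engine is built: the "web" is a made-up two-page dictionary (`FAKE_WEB`) so it runs without network access, "keywords" are simply every lowercase word on the page, and all function names are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# A made-up two-page "web" so the sketch runs offline.
FAKE_WEB = {
    "/index.html": '<h1>Widgets</h1><p>Blue widgets for sale.</p>'
                   '<a href="/about.html">About</a>',
    "/about.html": "<h1>About</h1><p>We sell widgets.</p>",
}


class LinkAndTextParser(HTMLParser):
    """Collects href values and text chunks from one page."""

    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

    def handle_data(self, data):
        self.text.append(data)


def parse_page(html):
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text)


def crawl(start):
    """Steps 1 and 2: fetch pages, follow links, store raw copies."""
    store, queue = {}, [start]
    while queue:
        url = queue.pop(0)
        if url in store or url not in FAKE_WEB:
            continue
        store[url] = FAKE_WEB[url]        # step 2: keep a local copy
        links, _ = parse_page(store[url])  # step 1: parse for links
        queue.extend(links)
    return store


def build_index(store):
    """Step 3: build an inverted index mapping word -> set of URLs."""
    index = defaultdict(set)
    for url, html in store.items():
        _, text = parse_page(html)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Step 4: return the pages that contain every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(crawl("/index.html"))
    print(search(index, "blue widgets"))
```

Note that there is no ranking here at all. The sub-processes mentioned above (link analysis, phrase matching, and so on) are exactly what this sketch leaves out.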
Some day (maybe not so far off) computers and storage will be fast enough to do away with steps 2 and 3 altogether.
Semantics, as I understand the term, happens in steps three and four, where additional information is stored/interpreted about the context that each keyphrase is found in. This could be the surrounding paragraph, the theme of the page, the presence of certain phrases with a high statistical probability of classifying nearby text in a certain topic, etc...
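As a rough illustration of storing context alongside each keyword, here is a small sketch that records the enclosing HTML element for every word instead of throwing the markup away. Treating "context" as just the innermost open tag is my own simplifying assumption, and `ContextParser` and `words_with_context` are hypothetical names. Whether any real engine works this way is not something I can confirm.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ContextParser(HTMLParser):
    """Records (word, enclosing_tag) pairs instead of bare words."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.postings = []   # (word, tag) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        context = self.stack[-1] if self.stack else "(none)"
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.postings.append((word, context))


def words_with_context(html):
    parser = ContextParser()
    parser.feed(html)
    parser.close()
    return parser.postings


if __name__ == "__main__":
    page = "<h1>Blue Widgets</h1><p>We sell <b>widgets</b>.</p>"
    for word, tag in words_with_context(page):
        print(word, tag)
```

With something like this, an indexer could weight "widgets" in an `h1` differently from "widgets" in a `p`, which is one way the markup can survive the "strip to raw text" step.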
Any search engine that's trying to do this sort of thing is changing its algorithms so frequently that I don't think it's really something you can white hat optimize for at this point.
In fact, like everything else in search engine algorithms, if it ever does become possible to optimize for it then it will quickly lose most of its value.
Right now I'd say it's safe to assume that if your content is legitimate and makes sense then semantic ranking will benefit you, whereas if your content is disjointed (i.e. scraped) or mixes too many topics in a small amount of text then it might hurt you.