
New To Web Development Forum

    
how do spiders work?
noob needs a general idea
twocats




msg:963802
 4:51 pm on Dec 20, 2002 (gmt 0)

OK, I'm overloaded with info and not getting a clear picture from what I'm reading. What I would like to know, in general terms, is how a spider or bot works, for example Google's.
Excuse the words I use, I'm sure they are wrong, but here goes.
For example, does Googlebot go on its first trip to a new site, look for index.html, start reading it and see where it can go from there? Or does it "take a snapshot" of the root directory, read the file names in the directory, and then read each file? How does it know where to go, and what to read first?

The reason I'm asking is that I plan to use PHP/MySQL to create pages. My idea is to have a basic, text-only HTML page sitting in the directory that a bot can read. I will have an index.php that loads these basic pages into a template, stripping out whatever is not needed for display to a viewer. The basic page will also be used to provide a "text only" version and will contain the keywords and other meta info.
Thanks
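
A rough sketch of the plan described above, written in Python rather than PHP purely for illustration. The template text, the placeholder names `$title` and `$content`, the `render` function and the sample page are all assumptions made for this sketch, and the regular expressions are a crude stand-in for real HTML parsing.

```python
import re
from string import Template

# Hypothetical template; $title and $content are placeholder names chosen for this sketch.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><div id=\"nav\">site navigation</div>"
    "<div id=\"main\">$content</div></body></html>"
)

def render(basic_html):
    """Pull the title and body out of a basic page and drop them into the template."""
    title = re.search(r"<title>(.*?)</title>", basic_html, re.S | re.I)
    body = re.search(r"<body[^>]*>(.*?)</body>", basic_html, re.S | re.I)
    return PAGE_TEMPLATE.substitute(
        title=title.group(1).strip() if title else "",
        # Fall back to the whole page if no <body> element is found.
        content=body.group(1).strip() if body else basic_html,
    )

# Hypothetical basic, text-only page of the kind the plan describes.
basic_page = (
    "<html><head><title>Widgets</title>"
    "<meta name=\"keywords\" content=\"widgets\"></head>"
    "<body><h1>Widgets</h1><p>Plain text about widgets.</p></body></html>"
)

styled = render(basic_page)
print(styled)
```

In this sketch the basic page stays on disk untouched, so it can still be served as the "text only" version, while the templated output is what a visitor sees.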

 

bcc1234




msg:963803
 6:19 pm on Dec 20, 2002 (gmt 0)

Remote clients cannot get a listing of your directory, unless you allow it.
Besides, many sites do not even have separate files located in directories at all.

The only way for spiders to find your pages is to follow the links. Most spiders get the home page and then move from there.
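
To make the "start at the home page and follow the links" idea concrete, here is a minimal sketch in Python. It is not how any particular search engine is implemented. The site is a made-up in-memory map of pages to the pages they link to, and `SITE_LINKS` and `crawl` are names invented for this example.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
# Nothing links to /orphan.html, so the crawl below cannot discover it.
SITE_LINKS = {
    "/": ["/about.html", "/products.html"],
    "/about.html": ["/"],
    "/products.html": ["/products/widget.html", "/about.html"],
    "/products/widget.html": [],
    "/orphan.html": ["/"],
}

def crawl(start, links):
    """Breadth-first walk from start, following links; returns pages in visit order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("/", SITE_LINKS)
print(visited)
# ['/', '/about.html', '/products.html', '/products/widget.html']
```

Note that the crawler never asks the server for a directory listing. It only knows about a page once some page it has already fetched links to it.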

gsx




msg:963804
 9:36 pm on Dec 20, 2002 (gmt 0)

A spider reads in the index file (it could be .html but not necessarily - it's whatever the user will see when they type the domain name). It then simply follows the <a href...> tags to find other pages. Orphan pages (pages that nothing links to) are not usually indexed.

After this (and sometimes as a separate step), the search engine runs the pages through a spam penalty/ban filter to weed out people who are obviously trying to cheat... but that's another story :)
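
The link-following step mentioned above (read a page, pull out the targets of its <a href...> tags) can be sketched with Python's standard library. This is only an illustration. The `LinkExtractor` class, the `extract_links` helper, the example.com URL and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page's own URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs the spider can fetch.
                self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content, for illustration only.
page = (
    '<p><a href="about.html">About</a> '
    '<a href="/products/">Products</a> '
    '<a name="top">no href here</a></p>'
)
links = extract_links("http://www.example.com/index.html", page)
print(links)
# ['http://www.example.com/about.html', 'http://www.example.com/products/']
```

A real spider would then fetch each of those URLs and repeat the process, which is why a page with no inbound links never gets found this way.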


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved