OK, I'm overloading on info and not getting a clear picture of what I'm reading. What I'd like to know, in general terms, is how a spider or bot works, like Google's. Excuse the words I use, I'm sure they're wrong, but here goes. For example, does Googlebot go on its first trip to a new site, look for index.html, start reading it, and see where it can go from there? Or does it "take a snapshot" of the root directory, start reading the file names in the directory, and then read each file? How does it know where to go? What does it read first?
The reason I'm asking is that I plan to use PHP/MySQL to create pages. My idea is to have a basic, text-only HTML page sitting in the directory that a bot can read. I'll have an index.php that loads these basic pages into a template, stripping out what isn't needed before displaying it to a viewer. The basic page will also serve as a "text only" version and will contain the keywords and other meta info. Thanks
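Something like this is what I have in mind for index.php - just a rough sketch, and the basic/ folder name and the {CONTENT} placeholder are only examples, nothing final:

<?php
// index.php - rough sketch: load one of the plain "basic" HTML pages
// into a template for normal viewers. The basic/ folder and {CONTENT}
// marker are made-up names for this example.

$page = isset($_GET['page']) ? $_GET['page'] : 'home';
$file = dirname(__FILE__) . '/basic/' . basename($page) . '.html'; // basename() blocks ../ tricks

if (!is_file($file)) {
    header('HTTP/1.0 404 Not Found');
    exit('Page not found');
}

$content = file_get_contents($file);

// Keep only what's inside <body>; the template supplies the rest
// (header, navigation, styling).
if (preg_match('~<body[^>]*>(.*)</body>~is', $content, $m)) {
    $content = $m[1];
}

$template = file_get_contents(dirname(__FILE__) . '/template.html');
echo str_replace('{CONTENT}', $content, $template);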
A spider reads in the index file (it could be .html, but not necessarily - it's whatever the user sees when they type the domain name). It then simply follows the <a href...> tags to find other pages. Orphan pages (pages no other page links to) are not usually indexed.
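In other words it's basically a queue: fetch a page, collect its links, fetch those, and so on. A rough sketch of that loop in PHP (the start URL is made up, and a real spider also obeys robots.txt, rate-limits itself, handles relative links properly, etc.):

<?php
// Toy spider sketch: start at the index page and follow <a href> links,
// breadth-first, staying on one site.

$start = 'http://www.example.com/';
$host  = parse_url($start, PHP_URL_HOST);
$queue = array($start);
$seen  = array($start => true);

while (!empty($queue)) {
    $url  = array_shift($queue);        // next page to fetch and "index"
    $html = @file_get_contents($url);   // needs allow_url_fopen enabled
    if ($html === false) continue;

    echo "Indexed: $url\n";

    // Pull out every <a href="..."> on the page.
    preg_match_all('~<a\s[^>]*href=["\']([^"\'#]+)~i', $html, $m);
    foreach ($m[1] as $link) {
        // Very crude: resolve relative links against the site root.
        if (strpos($link, 'http') !== 0) {
            $link = 'http://' . $host . '/' . ltrim($link, '/');
        }
        // Stay on the same site and skip anything already queued.
        if (parse_url($link, PHP_URL_HOST) === $host && empty($seen[$link])) {
            $seen[$link] = true;
            $queue[]     = $link;
        }
    }
}

Notice there's nothing in there about reading the directory - the spider only ever sees what the web server sends it, so a page nothing links to never gets found.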
Following this (and sometimes separately), they run it through a spam penalty/ban filter to weed out people who are obviously trying to cheat... but that's another story :)