Let me use as my example the source code for a typical webpage that is about 20k in size (not including any graphics) - with full meta tags, a little JavaScript in the <head>, all the necessary HTML tags, explanatory <body> text, etc. So if a website has 100 pages like this, then 100 x 20k = 2,000k, or roughly 2 MB, just for the documents (and again, not including any graphics files).
Now if a search engine spider crawls those 100 pages, what would be the approximate size of the index database that it would generate? Of course this will vary from site to site, but is there a ballpark figure? Is it about 20% of the original directory size? Or is it closer to 2%?
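To put rough numbers on the two guesses above, here is a minimal sketch of the arithmetic. The 2% and 20% ratios are only the guesses from the question, not measured values, and the function name `estimate_index_size_kb` is made up for illustration.

```python
def estimate_index_size_kb(pages, avg_page_kb, index_ratio):
    """Return (raw_kb, index_kb) for a site.

    index_ratio is an ASSUMED fraction of the raw document size
    (e.g. 0.02 or 0.20); real ratios depend on the indexing script.
    """
    raw_kb = pages * avg_page_kb
    return raw_kb, raw_kb * index_ratio


# The example from the post: 100 pages of about 20k each.
for ratio in (0.02, 0.20):
    raw_kb, index_kb = estimate_index_size_kb(100, 20, ratio)
    print(f"ratio {ratio:.0%}: raw {raw_kb} KB, index about {index_kb:.0f} KB")
```

For the 100-page example this works out to about 40k at 2% and about 400k at 20%, so the two guesses differ by a factor of ten and the real answer depends heavily on the script used.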
Just wondering....
Many people use the likes of Fluid Dynamics SE for their own personal sites, and the disk space "saving" those scripts achieve is nowhere near what the likes of Google manage.
Mack - yes, I am asking because I have spent much of this past week looking into search engine scripts, and in the process have learned a lot about the impact of indexing large sites on shared server CPUs, the limitations of some scripts versus the power of others, etc. Some of these scripts are open source, such as juggernautsearch; others are not, such as TurboSeek.
As a result of reading the documentation for a dozen or so such scripts, I am coming to the conclusion that I may want to run the indexing from my home Windows machine, over a DSL connection in the middle of the night, perhaps once a week. I would then transfer that database to the server via FTP so it is available to my visitors for search queries.
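The upload step of that plan could look something like the sketch below, using Python's standard `ftplib`. It is only an illustration under assumptions: the host, login, and file names are placeholders, the function name `upload_index` and its `ftp_factory` hook are made up here, and how the index file gets built depends entirely on the search script in use.

```python
from ftplib import FTP
from pathlib import Path


def upload_index(local_path, host, user, password, remote_name=None,
                 ftp_factory=FTP):
    """Upload a locally built index file to the web server over FTP.

    ftp_factory is a hypothetical hook (defaults to ftplib.FTP) so the
    function can be tried without a real server.
    """
    local_file = Path(local_path)
    remote_name = remote_name or local_file.name
    with ftp_factory(host) as ftp:
        ftp.login(user=user, passwd=password)
        # Binary mode, so the database file is not altered in transit.
        with local_file.open("rb") as fh:
            ftp.storbinary(f"STOR {remote_name}", fh)
    return remote_name


# Placeholder values only; substitute the real server details:
# upload_index("search_index.db", "ftp.example.com", "username", "password")
```

On a Windows machine, a script like this could be run weekly by Task Scheduler after the nightly indexing job finishes.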
So I was wondering what kind of size I would be looking at for, say, 30,000 HTML pages. Given Brett's informative explanation, I am now hopeful it is actually not too much....