SE technical question

What is the size of an indexed document?

         

Reno

5:04 pm on Oct 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since many of you on this forum understand the actual mechanism of search engines, I'd like to ask a technical question.

Let me use as my example the source code for a typical webpage that is about 20k in size (not including any graphics) - with full meta tags, a little javascript in the <head>, all the necessary html tags, explanatory <body> text, etc. So if a website has 100 pages like this, then 100 x 20k = 2 MB just for the documents (and again, not including any graphics files).

Now if a search engine spider crawls those 100 pages, what would be the approximate size of the index database that it would generate? Of course this will vary from site to site, but is there a ballpark figure? Is it about 20% of the original directory size? Or is it closer to 2%?

Just wondering....
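For anyone who wants to play with the numbers in the question, here is a minimal sketch of the arithmetic. It assumes "20k" means 20,000 bytes, and the 2% and 20% ratios are only the two guesses from the question, not measured values; the function names are made up for illustration.

```python
KB = 1000          # assumption: decimal kilobytes, so 100 x 20k comes out at exactly 2 MB
MB = 1000 * KB


def corpus_bytes(pages: int, avg_page_kb: float) -> float:
    """Total size of the raw HTML documents, in bytes."""
    return pages * avg_page_kb * KB


def index_bytes(pages: int, avg_page_kb: float, ratio: float) -> float:
    """Index size if the index is `ratio` times the raw document size."""
    return corpus_bytes(pages, avg_page_kb) * ratio


if __name__ == "__main__":
    raw = corpus_bytes(100, 20)
    print(f"raw documents: {raw / MB:.2f} MB")
    # The two ratios below are the guesses from the question, nothing more.
    for ratio in (0.02, 0.20):
        size = index_bytes(100, 20, ratio)
        print(f"index at {ratio:.0%} of raw size: {size / KB:.0f} KB")
```

For the 100-page example that works out to roughly 40 KB at 2% and 400 KB at 20%.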

mack

5:47 pm on Oct 6, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Are you talking about a web search engine such as Google or Alltheweb, or are you wondering what size of database you will need if you install a site search on your own server?

brotherhood of LAN

6:00 pm on Oct 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check out Brett's post [webmasterworld.com], which gives good background on this sort of thing.

Many people use something like Fluid Dynamics SE for their own personal sites, but it does not save disk space nearly as well as the likes of Google manage to.

Reno

6:43 pm on Oct 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks BOL - Brett's explanation is hands down the best I have ever read. I saved it as a text document to my hard drive. I'm in awe of that level of understanding!

Mack - yes, I am asking because I have spent a lot of time this past week looking into search engine scripts, and in the process have learned a lot about the impact of indexing large sites on shared server CPUs, the limitations of some scripts versus the power of others, etc. Some of these scripts are open source, such as juggernautsearch; others, such as TurboSeek, are not.

As a result of reading the documentation for a dozen or so such scripts, I am coming to the conclusion that I may want to schedule the indexing from my home Windows machine, using a DSL connection in the middle of the night. I might want to do this once a week for example. Then, transfer that database to the server via FTP so it is available to my visitors for search queries.
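The transfer step could be sketched roughly like this in Python, using the standard library's ftplib. This is only an illustration under assumptions: the host, login, and file names are placeholders, the index is assumed to be a single file, and nothing here is specific to any of the scripts mentioned. The weekly scheduling itself would be left to something like the Windows task scheduler.

```python
from ftplib import FTP
from pathlib import Path


def temp_name(remote_name: str) -> str:
    """Name used on the server while the upload is still in progress."""
    return remote_name + ".part"


def upload_index(host: str, user: str, password: str,
                 local_path: str, remote_name: str) -> None:
    """Upload a freshly built index file, then rename it into place."""
    local = Path(local_path)
    with FTP(host) as ftp:
        ftp.login(user, password)
        # Upload under a temporary name so visitors never search
        # against a half-transferred file.
        with local.open("rb") as fh:
            ftp.storbinary(f"STOR {temp_name(remote_name)}", fh)
        # Swap the new index in. Whether a rename replaces an existing
        # file depends on the FTP server, so treat this as an assumption.
        ftp.rename(temp_name(remote_name), remote_name)


# Hypothetical usage (placeholder host and credentials):
# upload_index("ftp.example.com", "user", "secret",
#              "search_index.db", "search_index.db")
```

Uploading under a temporary name and renaming at the end is a design choice to keep the live index usable during a long transfer over DSL.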

So, I was wondering what kind of size I would be looking at, for say 30,000 html pages. Given Brett's informative explanation, I am now hopeful it is actually not too much....

mack

2:53 pm on Oct 7, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I use FDSE and have an index of about 17,000 documents. I can't give you an average page size because the pages really are totally different. FDSE only uses extracts from the pages plus the meta tags, but my database for those 17,000 pages is currently 45 MB! To be safe, in your case you are looking at 100 MB of disk space.
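As a rough sanity check on that 100 MB figure, here is a minimal sketch that scales the numbers above linearly to 30,000 pages. It assumes the index grows in proportion to the page count and that the other site's pages are similar to these, neither of which is guaranteed; the function names are made up for illustration.

```python
def per_doc_bytes(index_mb: float, docs: int) -> float:
    """Average index bytes per document (decimal megabytes assumed)."""
    return index_mb * 1_000_000 / docs


def projected_index_mb(pages: int, index_mb: float, docs: int) -> float:
    """Linear extrapolation of index size to a different page count."""
    return pages * index_mb / docs


if __name__ == "__main__":
    # Figures from the post above: 45 MB of index for about 17,000 documents.
    print(f"per document: {per_doc_bytes(45, 17_000):.0f} bytes")
    print(f"30,000 pages: {projected_index_mb(30_000, 45, 17_000):.1f} MB")
```

That comes to roughly 2.6 KB of index per document and a little under 80 MB for 30,000 pages, so 100 MB leaves some headroom.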