Building a Web Directory - Webmaster General forum at WebmasterWorld - WebmasterWorld

Forum Moderators: phranque

Message Too Old, No Replies

Building a Web Directory

Scouring for tips

brotherhood of LAN

11:05 am on Aug 31, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I am hoping to build a semi-large web directory about a particular web subject. This subject returns about 7 million results in Google, anyway, the topic is fairly large. So - I hope to keep speed and file spaces at their best!

I aim to do this through mySQL and PHP for the moment. I'm not really concerned about the PHP, more concerned about the database structure and the possible variables that can be squeezed into a table without bloating the total amount of data.

I was thinking along the lines of having a table containing something along the lines of ID,URL,Title,Description, which seems fairly simple!

Since mySQL is OK with wildcard matches, concatenations etc, I was looking to find a fast/compact way of using a single ID to represent a few meaningful variables that will help find/sort results. Sorta like the way every byte is accounted for at thy Google. I just want to make a decent directory but keep the overflow in code and content storage to a minimum.

At the moment, I will be adding/deleting the links myself, though hope to implement a spider that will work out the variables for me....and wait for me to assign a category :)

Anyway, before I start listing 101 vars that I could hopefully use in a "lean and mean" environment, are there are rules of thumbs and messiah-tips that should be known before diving into making a directory.

The dir I have just now is fairly ham and egg - so I wan't to be looking at something that will keep my negative thoughts about it at bay for at least a year!

Advice on building directories would be great - specifically in keeping things SMALL in the db.

I'd post an example of what I already had in mind, but the SQL and layout keeps flashing past my original thoughts and cause me to take another step back ;)

The biggest prob I have deciding at the mo is how the directory structure will be setup and the relationships between them.

Anyone have a divine heads up on directory building? :)

Grumpus

12:39 pm on Aug 31, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

For the structure, I'd create a table "CATS" with the following fields:

catID (Autonumber, Unique)
level (1,0) If 1, it is a Top Level Cat. 0 if it'll have a parent.
Name (Name of the Cat)
Description (Blob / Describe it)

Then I'd create another table called "STRUCTURE"

ID (Auto, Unique)
catID (Not Unique, Relates to catID of CATS)
parentCat (Holds the CatID from CATS that is the parent).

Then, if I'm in category "My Cat" (which is catID 42), you poll "STRUCTURE" for all of the records that have 42 in the parentCat field and you'll get a list of all the subtopics. Poll it where catID = 42 and you'll get all of its parent cats.

You'll also want two tables for sites, as they might fit into multiple cats. Create one that has the names, URL's descriptions, but have a second one that has all of the cats it belongs in. This way, you're not duplicating data.

The problem here is with sorting as it'll be pretty server intensive, so you might want a field in the "STRUCTURE" table that now consists of only numbers that contains the first 3 or 4 letters of the Cat name (or Site name in that table) so you can do an ORDER BY on that without having to call the CATS or SITES table to grab the name. 3 or 4 letters won't be a perfect sort in some cases, but it'll be close enough.

TIMSTAMP fields are also useful for sorting by what's new. All of the records in my database have 2 - one for when it was added, and one for when it was last updated. That way, I can see what really is NEW and what's just had some information changed recently. While you're at it, get a USER ID field in every table, too. That way, if you ever open it up so that people can submit their own site, you can poll a user database and see who added it. (You'd then need a "Publish" (1,0) field to show whether you approved the listing or not).

Good luck. Keep it generic and functional and you can sell the APP when you're done, too.

G.

jatar_k

3:16 am on Sep 3, 2002 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

BOL,

this has been on my mind since you posted it. I need a little more info before I commit to an answer.

How many sites does "semi-large" signify?
Will each sub cat have only one parent cat or many?
What do you mean by this "implement a spider that will work out the variables for me"?
How many top level cats will there be?

I assume this is the project where you referenced the structure for that sql query question.

You might have to give me a day or two again to think about it.;)

brotherhood of LAN

5:22 am on Sep 3, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

How many sites does "semi-large" signify?

The DMOZ category for the site subject says there are 18000 sites. Since their cat structure is intermingled.....its probably about 30000 sites that fall on topic.

Will each sub cat have only one parent cat or many?

Hmm hard to say, its partly why I stir up the thread :)

Since the first posting I've spidered the entire category in question inside DMOZ using FDSE (everything from the 2n tier cat downwards - 3500 categories), putting an ID and title for each page as the table. (Will be looking more closely at that DMOZ attribution stuff....)

Ideally I'll be converting the uniform titles into word ids and basing some sort of relevance on category deepness to chip is a part of an "algo". The whole dmoz structure might be totally re-arranged or dropped after they're put into some sort of algo (mentioned below)

What do you mean by this "implement a spider that will work out the variables for me"?

Bit of a vague statement I guess :) .... With the above sorta thing in action using the already "seeded" categories, hopefully when I assign a category to a website part of its "score" for a particular word will be worked out simply by adding the "id" for a category. Something along the lines of a website being "themed" to a particular category or segment of categories. Hopefully some sort of relevance can be made from a spider that grabs the necessaries of DMOZ/listed sites and only a category will need to be assigned to give the listing some sort of proportion in a search.

How many top level cats will there be?

There are some sites that fall in the same as my site, and they'll be on the home page of the directory, that will lead into around 50 subcategories. The subject is on the second tier of both the yahoo and DMOZ directories so that cat structure can get pretty deep.....up to about 8 or 9 levels deep. There are around 4000 categories in total, though they could maybe get a little more specific with some sort of automation (just by adding a word ID hopefully)

>>>>I assume this is the project where you referenced the structure for that sql query question.

No- but it's the same site :) That Q was because I'm moving pages that are otherwise ASP to a Linux server where I hope to use PHP.

In regards to the directory, I'm thinking along the lines that word id's can pan out all along the database. Also, I think the directory will have to be "updated" monthly and not live. I will be spidering from either here (or somewhere with a faster connection) and keep costs down and avoid any sort of stress on a paid/website-on-it server :)

If I introduce you to what I hope to do with the DMOZ data, then hopefully you'll get a good idea of the sorta direction I want this to head....

id � title
31 � 1
30 � 2
35 � 2 3 4
33 � 2 5
32 � 2 6
34 � 2 7 9 8
51 � 2 7 9 8 10 11
44 � 2 7 9 8 13

I have changed each word in the title of each cat to a number. This number will ideally be the smallest variable to pass along to an algo that can work out relevancy.

Also, if say, a single keyword was searched on, more emphasis will be placed on the word id's closest to the left of the column (i.e. most generic in the DMOZ directory). There could be many variables here I suppose, like matching of URL to category, website category to title etc, importance of word in title order, search phrase order compared to DMOZ type order and website matches order. etc etc...all depending on the way the db is setup and how much space things take...

The way that FDSE spiders things I noticed it wouldn't be that hard to work out some link popularity. So up to now, I can see the site SQL table having something like this.

siteid � URL � dmozcat � sitetitle � site heading � text � links � descriptionsnippet
1 � site.com � 1 35 63 � 35 63 76 � 35 63 76 � 1 2 3 etc � site ids � etc
2 � site2.org � 42 45 99 � 42 45 55.. � etc � 1 2 3 etc � site ids � etc

Hopefully, the word ids can get some continuity into a decent algo to sort all the sites out. Hopefully link pop can be counted using the site ids in the first column by matching them up with themselves in the links. I was also wondering if the words appearing in URL's could also be substitued by numbers, though my guess this is not a good idea.

Basically, the core element of it all will revolve round words having short ID's. I think there is a life long worth of optimisation that will need to happen there though :) Hopefully the idea of not repeating data or making mistakes in diff columns by using a number in one place and an algo converting words to numbers then many problems will be sorted, and won't have to be "worked" by me.....

I also have a years worth of searches for the site in question, both in-site and directory queries. This info may be useful, perhaps to stagger particular search phrases....I know they will be of some use anyway! :)

The idea of having the word id and number list may be used across the site. It already has a dictionary, and the mentioning of numbers across various areas of the site might help get a better theme and tie in to the directory.

The idea of association of keywords with DMOZ categories, plurals, similar searches, themes etc is the main emphasis, while getting the db structure and math for the algo fast and working will also have to be thought out. I think the working out of link pop etc won't be a massive task....after all, the category only takes up about 1% of the web, and if getting half the category listed would be a good job done. As I say I will most likely be doing the spidering here and transferring to the site at various points......

I think/hope the working out of on page scores and link counts won't be too much work. Google gives 8 million results for the single keyword for this subject....so say an average of 10 links per page....thats 80 million sums. The working out on the page stuff for those pages....8 million x each row.....well...that might take longer :)

I'm also thinking like issues of preventing the spider from indexing off topic content, and making it steer clear of definite no-no content....don't want to be studying po-rn for a-n-other topic :)

Basically I'm open to all thoughts and suggestions here. I think it can be done reasonably cheap and without too much tearing out of hair.....just a matter of getting the right blueprint to start with...

Hopefully that spurs a few ideas jatar ;)

brotherhood of LAN

11:12 pm on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

After pondering this and that, it breaks down to one question.

Every category is an id (left column) and a title. Should I be converting the words in the titles column into numbers? i.e. smaller on scale.

id � title
1 � 1
2 � 2
3 � 1 2

=

id � title
1 � widget
2 � widgets
3 � widget widgets

Or should I be leaving the words as they are in text? The idea is to have a seperate table of "word_ids" to save space and repitition of words.

Hopefully a directory can also be a decent search engine too....idea being that any indexed content is going to be converted to the same small numbers mentioned above.

Any thoughts?

The idea of having more than one number in a particular field is something I'm having problems overcoming for the forseeable directory using the numbers approach...

brotherhood of LAN

11:48 pm on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Disregard that last post....I see I should read into normalization more closely!

This directory business is quite a thing...

ukgimp

2:32 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Is the following a good idea for assigning a subject to a particular separate record.

�SUBJECT�
SubjectID
SubjectDescription
SubjectDirectoryLevel

Containing a logical data hierarchy like below.

TOP (1 #SubjectDirectoryLevel)
Mathematics(2)
Engineering(2)
Computing(2)
Algebra(3)
Geometry(3)
Statistics(3)
Sub Alegbra 1 (4)
Sub Algebra 2 (4)

�RELATIONSHIPS�
RelationshipID
ParentID
ChildID

With the parent/child relationships using the auto_increment Id�s from �SUBJECT�.

Is this approach going to work in practice. I would like to have web admin so that any registered user can add a record and assign a subject from drop down listings or by traversing down to the directory and adding from there.

I have added the subject directory level to aid the browsing performance.

Will this method be slow to search and display and administer. Like BoL I am trying to get my head round this and want to make things as scaleable as possible. If this is the same solution that Grumpus spoke of, appologies.

<ADDED>What about if a subject can fall into more than one cat, how much can that slow things down</ADDED>

Cheers

PS This is a for a non commercial application that I am trying to build.

aspdaddy

11:07 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I think you could do the structure in one table -

tblCats(cat_id, cat_name, parent_cat_id)
parent_cat_id allows nulls for top level cats.

to get subcats for "My cat"

SELECT cat_name
FROM tblcats
WHERE cat_parent_id = (SELECT cat_id from tblCats where cat_name='My Cat')

UKG, What is subject?

jatar_k

11:20 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

The numbers approach is what stumps me for this scenario. I keep imagining a million db calls just to sort out a small string of text.

Have you figured anything out BOL.

I seem to have come up with a large pile of scrap paper looking at possible structures. The relationship between the words, their numerical representations and the relevancy is where I keep falling down. I can only seem to have too many tables and too many calls.

I think I am having the same mental block you referred to in your last post. So what happened?

brotherhood of LAN

11:55 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Jatar, no much progress, and the problem of "database calls" is the thing that lurked around as a potential problem......

I wanted to break everything down to their smallest possible state without compromising either speed or storage. This is the problem...otherwise it would just be text where required.

But then some things (like the DMOZ directory titles) consisted of more than one word i.e. the "flora and fauna" section. This leads on to the problem of having more than one entry for a particular field.....

So I suppose that each category level could be broken down into a number i.e. "flora and fauna" = 1......and an ID leading to the parent category to round it off.

In the document "Anatomy of a Search Engine", that you have no doubt read, halfway down they mention that part of the Google system is to convert words into word id's for storage. From a storage point of view this makes sense, and the little that I've read about Huffman coding and the like, the most repetitive words or phrases can acquire the smallest sized primary key for that word/phrase.

This is the line of thought I've been taking...

-The 11 Mb's of data I got from DMOZ was whittled down to more than half the size by converting words into numbers.
-The words used in the DMOZ cat structure basically cover the broad base of keywords that would be potentially be used in a search of the directory
-The words will also appear frequently in relevant sites and webpages that are related to the subject of the directory. Lack of these keywords can flag the site so that it is not indexed as off topic material
-word id's can be associated with title,meta description,anchor text etc and applied a score depending on how these keywords are used. i.e. flora and fauna is a large section of the categories and would have a different weight than a niche category where the "word id" is only mentioned once (flora and fauna spans over a 1000 categories)
-etc etc for ranking, considering link pop etc etc

It seems the only way to have something on this scale is to use some form of encoding, which may or may not sacrifice speed on an actual search...I'm not really sure.

You mention that real-time resources may get sucked up by all the calls to convert numbers into words, i.e. an unnecessary extra layer of processing that would not be needed. I was thinking that hopefully there would be ways to counteract this problem (though not as dramatic as googles servers and their cached results) ;)

Likewise mentioned above, I ponder the idea of doing that sort of processing offline away from the server, and the finished articles can be uploaded/updated periodically.

The use of uniform word id's seems to be the way to go......I wish I could explain more clearly what I'm looking for here and be more sure that it's attainable but I keep coming across new things as I go along...

Feedback would be good for something to bounce off ;) ......I know there will be a "preferred" way for such projects.

ukgimp

11:16 am on Sep 19, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I came across this, which is a tutorial on how to set up phpHoo which uses recursive queries to find the root to the top of the directory structure.

Looks promising, now all I have to do is fit the idea into what I am trying to do.

[webreference.com...]

Cheers

<ADDED>
aspdaddy

Subject is mearly the classification I am trying to assign to a particular record.

</ADDED>