Forum Moderators: phranque
I aim to do this through mySQL and PHP for the moment. I'm not really concerned about the PHP, more concerned about the database structure and the possible variables that can be squeezed into a table without bloating the total amount of data.
I was thinking along the lines of having a table containing something along the lines of ID,URL,Title,Description, which seems fairly simple!
Since mySQL is OK with wildcard matches, concatenations etc, I was looking to find a fast/compact way of using a single ID to represent a few meaningful variables that will help find/sort results. Sorta like the way every byte is accounted for at thy Google. I just want to make a decent directory but keep the overflow in code and content storage to a minimum.
At the moment, I will be adding/deleting the links myself, though hope to implement a spider that will work out the variables for me....and wait for me to assign a category :)
Anyway, before I start listing 101 vars that I could hopefully use in a "lean and mean" environment, are there are rules of thumbs and messiah-tips that should be known before diving into making a directory.
The dir I have just now is fairly ham and egg - so I wan't to be looking at something that will keep my negative thoughts about it at bay for at least a year!
Advice on building directories would be great - specifically in keeping things SMALL in the db.
I'd post an example of what I already had in mind, but the SQL and layout keeps flashing past my original thoughts and cause me to take another step back ;)
The biggest prob I have deciding at the mo is how the directory structure will be setup and the relationships between them.
Anyone have a divine heads up on directory building? :)
catID (Autonumber, Unique)
level (1,0) If 1, it is a Top Level Cat. 0 if it'll have a parent.
Name (Name of the Cat)
Description (Blob / Describe it)
Then I'd create another table called "STRUCTURE"
ID (Auto, Unique)
catID (Not Unique, Relates to catID of CATS)
parentCat (Holds the CatID from CATS that is the parent).
Then, if I'm in category "My Cat" (which is catID 42), you poll "STRUCTURE" for all of the records that have 42 in the parentCat field and you'll get a list of all the subtopics. Poll it where catID = 42 and you'll get all of its parent cats.
You'll also want two tables for sites, as they might fit into multiple cats. Create one that has the names, URL's descriptions, but have a second one that has all of the cats it belongs in. This way, you're not duplicating data.
The problem here is with sorting as it'll be pretty server intensive, so you might want a field in the "STRUCTURE" table that now consists of only numbers that contains the first 3 or 4 letters of the Cat name (or Site name in that table) so you can do an ORDER BY on that without having to call the CATS or SITES table to grab the name. 3 or 4 letters won't be a perfect sort in some cases, but it'll be close enough.
TIMSTAMP fields are also useful for sorting by what's new. All of the records in my database have 2 - one for when it was added, and one for when it was last updated. That way, I can see what really is NEW and what's just had some information changed recently. While you're at it, get a USER ID field in every table, too. That way, if you ever open it up so that people can submit their own site, you can poll a user database and see who added it. (You'd then need a "Publish" (1,0) field to show whether you approved the listing or not).
Good luck. Keep it generic and functional and you can sell the APP when you're done, too.
G.
this has been on my mind since you posted it. I need a little more info before I commit to an answer.
How many sites does "semi-large" signify?
Will each sub cat have only one parent cat or many?
What do you mean by this "implement a spider that will work out the variables for me"?
How many top level cats will there be?
I assume this is the project where you referenced the structure for that sql query question.
You might have to give me a day or two again to think about it.;)
How many sites does "semi-large" signify?
Will each sub cat have only one parent cat or many?
Since the first posting I've spidered the entire category in question inside DMOZ using FDSE (everything from the 2n tier cat downwards - 3500 categories), putting an ID and title for each page as the table. (Will be looking more closely at that DMOZ attribution stuff....)
Ideally I'll be converting the uniform titles into word ids and basing some sort of relevance on category deepness to chip is a part of an "algo". The whole dmoz structure might be totally re-arranged or dropped after they're put into some sort of algo (mentioned below)
What do you mean by this "implement a spider that will work out the variables for me"?
How many top level cats will there be?
>>>>I assume this is the project where you referenced the structure for that sql query question.
No- but it's the same site :) That Q was because I'm moving pages that are otherwise ASP to a Linux server where I hope to use PHP.
In regards to the directory, I'm thinking along the lines that word id's can pan out all along the database. Also, I think the directory will have to be "updated" monthly and not live. I will be spidering from either here (or somewhere with a faster connection) and keep costs down and avoid any sort of stress on a paid/website-on-it server :)
If I introduce you to what I hope to do with the DMOZ data, then hopefully you'll get a good idea of the sorta direction I want this to head....
id ¦ title
31 ¦ 1
30 ¦ 2
35 ¦ 2 3 4
33 ¦ 2 5
32 ¦ 2 6
34 ¦ 2 7 9 8
51 ¦ 2 7 9 8 10 11
44 ¦ 2 7 9 8 13
I have changed each word in the title of each cat to a number. This number will ideally be the smallest variable to pass along to an algo that can work out relevancy.
Also, if say, a single keyword was searched on, more emphasis will be placed on the word id's closest to the left of the column (i.e. most generic in the DMOZ directory). There could be many variables here I suppose, like matching of URL to category, website category to title etc, importance of word in title order, search phrase order compared to DMOZ type order and website matches order. etc etc...all depending on the way the db is setup and how much space things take...
The way that FDSE spiders things I noticed it wouldn't be that hard to work out some link popularity. So up to now, I can see the site SQL table having something like this.
siteid ¦ URL ¦ dmozcat ¦ sitetitle ¦ site heading ¦ text ¦ links ¦ descriptionsnippet
1 ¦ site.com ¦ 1 35 63 ¦ 35 63 76 ¦ 35 63 76 ¦ 1 2 3 etc ¦ site ids ¦ etc
2 ¦ site2.org ¦ 42 45 99 ¦ 42 45 55.. ¦ etc ¦ 1 2 3 etc ¦ site ids ¦ etc
Hopefully, the word ids can get some continuity into a decent algo to sort all the sites out. Hopefully link pop can be counted using the site ids in the first column by matching them up with themselves in the links. I was also wondering if the words appearing in URL's could also be substitued by numbers, though my guess this is not a good idea.
Basically, the core element of it all will revolve round words having short ID's. I think there is a life long worth of optimisation that will need to happen there though :) Hopefully the idea of not repeating data or making mistakes in diff columns by using a number in one place and an algo converting words to numbers then many problems will be sorted, and won't have to be "worked" by me.....
I also have a years worth of searches for the site in question, both in-site and directory queries. This info may be useful, perhaps to stagger particular search phrases....I know they will be of some use anyway! :)
The idea of having the word id and number list may be used across the site. It already has a dictionary, and the mentioning of numbers across various areas of the site might help get a better theme and tie in to the directory.
The idea of association of keywords with DMOZ categories, plurals, similar searches, themes etc is the main emphasis, while getting the db structure and math for the algo fast and working will also have to be thought out. I think the working out of link pop etc won't be a massive task....after all, the category only takes up about 1% of the web, and if getting half the category listed would be a good job done. As I say I will most likely be doing the spidering here and transferring to the site at various points......
I think/hope the working out of on page scores and link counts won't be too much work. Google gives 8 million results for the single keyword for this subject....so say an average of 10 links per page....thats 80 million sums. The working out on the page stuff for those pages....8 million x each row.....well...that might take longer :)
I'm also thinking like issues of preventing the spider from indexing off topic content, and making it steer clear of definite no-no content....don't want to be studying po-rn for a-n-other topic :)
Basically I'm open to all thoughts and suggestions here. I think it can be done reasonably cheap and without too much tearing out of hair.....just a matter of getting the right blueprint to start with...
Hopefully that spurs a few ideas jatar ;)
Every category is an id (left column) and a title. Should I be converting the words in the titles column into numbers? i.e. smaller on scale.
id ¦ title
1 ¦ 1
2 ¦ 2
3 ¦ 1 2
=
id ¦ title
1 ¦ widget
2 ¦ widgets
3 ¦ widget widgets
Or should I be leaving the words as they are in text? The idea is to have a seperate table of "word_ids" to save space and repitition of words.
Hopefully a directory can also be a decent search engine too....idea being that any indexed content is going to be converted to the same small numbers mentioned above.
Any thoughts?
The idea of having more than one number in a particular field is something I'm having problems overcoming for the forseeable directory using the numbers approach...
“SUBJECT”
SubjectID
SubjectDescription
SubjectDirectoryLevel
Containing a logical data hierarchy like below.
TOP (1 #SubjectDirectoryLevel)
Mathematics(2)
Engineering(2)
Computing(2)
Algebra(3)
Geometry(3)
Statistics(3)
Sub Alegbra 1 (4)
Sub Algebra 2 (4)
“RELATIONSHIPS”
RelationshipID
ParentID
ChildID
With the parent/child relationships using the auto_increment Id’s from “SUBJECT”.
Is this approach going to work in practice. I would like to have web admin so that any registered user can add a record and assign a subject from drop down listings or by traversing down to the directory and adding from there.
I have added the subject directory level to aid the browsing performance.
Will this method be slow to search and display and administer. Like BoL I am trying to get my head round this and want to make things as scaleable as possible. If this is the same solution that Grumpus spoke of, appologies.
<ADDED>What about if a subject can fall into more than one cat, how much can that slow things down</ADDED>
Cheers
PS This is a for a non commercial application that I am trying to build.
Have you figured anything out BOL.
I seem to have come up with a large pile of scrap paper looking at possible structures. The relationship between the words, their numerical representations and the relevancy is where I keep falling down. I can only seem to have too many tables and too many calls.
I think I am having the same mental block you referred to in your last post. So what happened?
I wanted to break everything down to their smallest possible state without compromising either speed or storage. This is the problem...otherwise it would just be text where required.
But then some things (like the DMOZ directory titles) consisted of more than one word i.e. the "flora and fauna" section. This leads on to the problem of having more than one entry for a particular field.....
So I suppose that each category level could be broken down into a number i.e. "flora and fauna" = 1......and an ID leading to the parent category to round it off.
In the document "Anatomy of a Search Engine", that you have no doubt read, halfway down they mention that part of the Google system is to convert words into word id's for storage. From a storage point of view this makes sense, and the little that I've read about Huffman coding and the like, the most repetitive words or phrases can acquire the smallest sized primary key for that word/phrase.
This is the line of thought I've been taking...
-The 11 Mb's of data I got from DMOZ was whittled down to more than half the size by converting words into numbers.
-The words used in the DMOZ cat structure basically cover the broad base of keywords that would be potentially be used in a search of the directory
-The words will also appear frequently in relevant sites and webpages that are related to the subject of the directory. Lack of these keywords can flag the site so that it is not indexed as off topic material
-word id's can be associated with title,meta description,anchor text etc and applied a score depending on how these keywords are used. i.e. flora and fauna is a large section of the categories and would have a different weight than a niche category where the "word id" is only mentioned once (flora and fauna spans over a 1000 categories)
-etc etc for ranking, considering link pop etc etc
It seems the only way to have something on this scale is to use some form of encoding, which may or may not sacrifice speed on an actual search...I'm not really sure.
You mention that real-time resources may get sucked up by all the calls to convert numbers into words, i.e. an unnecessary extra layer of processing that would not be needed. I was thinking that hopefully there would be ways to counteract this problem (though not as dramatic as googles servers and their cached results) ;)
Likewise mentioned above, I ponder the idea of doing that sort of processing offline away from the server, and the finished articles can be uploaded/updated periodically.
The use of uniform word id's seems to be the way to go......I wish I could explain more clearly what I'm looking for here and be more sure that it's attainable but I keep coming across new things as I go along...
Feedback would be good for something to bounce off ;) ......I know there will be a "preferred" way for such projects.
Looks promising, now all I have to do is fit the idea into what I am trying to do.
[webreference.com...]
Cheers
<ADDED>
aspdaddy
Subject is mearly the classification I am trying to assign to a particular record.
</ADDED>