Now, I've got the crawling side pretty much worked out. Has anyone seen any decent language-independent algorithms for searching / ranking the crawled pages?
(BTW I'm thinking more Inktomi here than Hilltop ;) )
[www-2.cs.cmu.edu...]
Thanks for moving my research up a few weeks ;)
I've looked and looked, SK. There really isn't anything great out there. I wrote several of my own [joefarmer.com]. I'm slowly collecting notes for another run up to the summit some time next year. Building a good SE is a lot like mountain climbing.
The big difference for me this time is that the routines have become so large and complex that I need to break them down further for easier maintenance.
I'm going to go with discrete subroutines for EVERYTHING. Almost all of the first tier major utils and subroutines will be 100% stand alone. I've had it with interdependencies that are so hard to maintain from program to program.
A) URL manager. Will be responsible for managing all inbound (fresh) urls. Its only job is to manage the spider's INBOX. It's a critical task that I've overlooked in the past by throwing it in with the spider. Not so, grasshopper, it needs to stand alone.
B) Crawler. What I call a crawler and what other people call a crawler are probably different. This is nothing but a spider manager that is the intelligent glue between the url manager and the spider.
C) Spider. Brain dead. Get a url from the crawler and fetch it.
D) Indexer. Generally run by a cron job. Takes pages fetched by spider/crawler and rips them apart.
E) PreDB Ranker and Key file generator. Runs through the indexed pages and looks for anomalies in the indexer. I run a keyword density analyzer over the various page parts here and pre-select the top keywords for an "index match" field. It also allows me to later build keyword<->url KEY indexes for fast search lookup (just like the big boys).
F) Database Insert Manager. Handles the actual stuffing of indexed pages into the db. Needs to be stand alone too. A subset of related routines can handle the purging of expired pages and manage how you update the db.
G) Database search retrieval. This is all dependent on the database used - so I want it stand alone. Most small se's combine this with the search algo - I want to break that bondage.
H) Da algo. Dials and knobs everywhere. If you see more than 10 lines of code that don't have an If-Then somewhere in them - you need to rewrite the code properly. There is nothing like being able to tweak up and down every little aspect of the algo.
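The keyword<->url KEY index from step E can be sketched roughly like this. It is written in Python rather than Perl, and the function names and the input shape (a dict of url to pre-selected keywords) are assumptions for illustration, not anything from the setup described above.

```python
from collections import defaultdict

def build_key_index(page_keywords):
    """Turn {url: [keywords]} into {keyword: sorted list of urls}."""
    index = defaultdict(set)
    for url, keywords in page_keywords.items():
        for kw in keywords:
            index[kw.lower()].add(url)
    return {kw: sorted(urls) for kw, urls in index.items()}

def lookup_all(index, terms):
    """Return the urls that carry every one of the given terms."""
    sets = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical pre-selected keywords per page.
pages = {
    "http://a.example/": ["widgets", "blue"],
    "http://b.example/": ["widgets", "red"],
}
index = build_key_index(pages)
print(lookup_all(index, ["widgets", "blue"]))
```

The lookup never touches the page text, which is the point of building the key file ahead of time.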
I'm going to lay out the search match routines this way:
- What variables on the page can I control? Make a list of them and reserve a variable for each and every one. (Use good variable names like: TitleDensity, TitleLength. It may seem trivial now, but after 100+ variables you'll wish you'd done it.)
The current list of page entities I look at:
Headers: server, cache status, expires, last modified.
Title: length, density, duplicate words, keyword location.
Metas: same as title. I also do things like filter out anything over 7-8 words.
Headings: look for pure "<Hx>" factors.
Bold/Italic: not much use any more, but still important.
Page Density: a good kwda is a must.
Links: I go the whole nine yards and break them down into kw's. I also do a validity check for cgi stuff.
PAS Errors (page author stupid): I dupe check for the major page entities, and fall throughs from the indexer to see if it was a simple problem that caused it.
Other SE's: it's always nice when you can have an independent comparison ;-)
Finer control: P-tags. Look for full bodied sentences and give them extra weight.
Table tags - how deep is the content?
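For the "Page Density" entry, a bare-bones keyword density analyzer might look like the sketch below. It is in Python rather than Perl, and the stop-word list, the tokenizing regex and the function name are all assumptions; a real kwda would be run per page part (title, metas, headings, body) and tuned a lot further.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def keyword_density(text, top_n=5):
    """Return the top_n (word, density) pairs, density = count / words kept."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    if not words:
        return []
    total = len(words)
    return [(word, count / total)
            for word, count in Counter(words).most_common(top_n)]

print(keyword_density("Blue widgets, cheap widgets and the best widgets", top_n=2))
```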
- Even for those page entities you don't envision being able to control - make a variable anyway. Search engines beg to be tweaked and expanded. Build the easy updating in from the word go.
For instance, instead of saying:
if $TitleLength >80 then $rank = 1
say:
if $TitleLength >$maxtitlelength then $rank=$MaxTitleLengthBoost.
If you go that route of variables everywhere, you can build a killer search control panel and tweak the algo any time any place without a fuss.
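Here is one way the "variables everywhere" idea could look, sketched in Python rather than Perl. The knob names and values are made up for illustration; the point is only that the threshold and the boost live in one table a control panel could edit, not in the code.

```python
# Hypothetical knobs; in practice these might be loaded from a config file
# or a control-panel form instead of being hard-coded here.
KNOBS = {
    "max_title_length": 80,
    "max_title_length_boost": 1,
}

def title_length_score(title, knobs=KNOBS):
    """Return the boost for a title longer than the configured limit."""
    if len(title) > knobs["max_title_length"]:
        return knobs["max_title_length_boost"]
    return 0

# Tweaking the algo is now a data change, not a code change.
stricter = dict(KNOBS, max_title_length=40, max_title_length_boost=3)
print(title_length_score("x" * 50), title_length_score("x" * 50, stricter))
```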
- Try to make all the search checks variable. You need to be able to step through the checks in any order. Don't worry that it will take a great deal of code - it will. However, most search match and ranking routines will be if/then/action based. That type of code is going to execute pretty fast. The system drain and slow down will come when it is time to sort the ranked results and pump them out of the server.
How to do that is up to you. For me and Perl, I am going with subroutine names held in variables and an OnGoSub-style dispatch routine. That way, the routine names can go into an array and be called in any order. I also like it because it lets a search routine abort the search (if TitleLength>200 == $spammers! hehe)
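A rough sketch of that dispatch setup, in Python rather than Perl: check names go in a list, a table maps each name to its routine, and any check can abort the whole page. The check names, the knob names and the AbortSearch exception are assumptions for illustration only.

```python
class AbortSearch(Exception):
    """Raised by a check to drop a page from the results entirely."""

def check_title_length(page, knobs):
    # Abort outright on absurdly long titles (the spammer case).
    if len(page["title"]) > knobs["spam_title_length"]:
        raise AbortSearch("title too long")
    return 0

def check_title_match(page, knobs):
    # Boost when the query appears in the title.
    if page["query"] in page["title"].lower():
        return knobs["title_match_boost"]
    return 0

# Dispatch table: check name -> routine.
CHECKS = {
    "title_length": check_title_length,
    "title_match": check_title_match,
}

def score_page(page, order, knobs):
    """Run the named checks in the given order; None means the page was aborted."""
    score = 0
    try:
        for name in order:
            score += CHECKS[name](page, knobs)
    except AbortSearch:
        return None
    return score

knobs = {"spam_title_length": 200, "title_match_boost": 5}
page = {"title": "Blue Widgets", "query": "widgets"}
print(score_page(page, ["title_length", "title_match"], knobs))
```

Reordering the checks is just a matter of reordering the list passed as `order`.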
The last thing I'll say is log files, log files, log files. Track everything with multiple levels of debug information available. Nothing could save you more time later than a good status log.
Didn't really work - it doesn't scale at all well if there's a large number of URLs for the same server.
What I ended up doing was having a MySQL table of URLs, parsing out the domain, and having a separate table with a 'do not revisit until' field for the domains. Then, when the URL manager is sorting out a list of URLs for the spider to fetch, it can check if it's okay to revisit the domain yet.
If so, insert the URL into the 'to fetch' list, update the timestamp to a future time, and move on to the next URL.
This seems to work okay - well behaved spider, and I can run in parallel to best use the bandwidth available. (Also, this approach has the possibility of a Fresh! feature and artificial boosts if the domains are treated separately)
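The two-table scheme above can be sketched as follows. This uses Python with an in-memory SQLite database as a stand-in for MySQL, and the table names, column names and the 30-second delay are assumptions, not the actual schema.

```python
import sqlite3
from urllib.parse import urlparse

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, domain TEXT)")
    # not_before plays the role of the 'do not revisit until' field.
    db.execute("CREATE TABLE domains (domain TEXT PRIMARY KEY, not_before REAL)")
    return db

def add_url(db, url):
    domain = urlparse(url).netloc.lower()
    db.execute("INSERT OR IGNORE INTO urls VALUES (?, ?)", (url, domain))
    db.execute("INSERT OR IGNORE INTO domains VALUES (?, 0)", (domain,))

def next_batch(db, now, delay=30.0):
    """Pick URLs whose domain may be revisited, pushing each domain's time forward."""
    batch = []
    rows = db.execute("SELECT url, domain FROM urls ORDER BY rowid").fetchall()
    for url, domain in rows:
        (not_before,) = db.execute(
            "SELECT not_before FROM domains WHERE domain = ?", (domain,)
        ).fetchone()
        if not_before <= now:
            batch.append(url)
            db.execute("UPDATE domains SET not_before = ? WHERE domain = ?",
                       (now + delay, domain))
            db.execute("DELETE FROM urls WHERE url = ?", (url,))
    return batch

db = make_db()
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    add_url(db, u)
print(next_batch(db, now=100.0))
```

Because the timestamp is bumped as soon as a URL is picked, a second URL on the same domain is skipped within the same pass, so each batch holds at most one URL per domain.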
>>What variables on the page can I can control? Make a list of them and reserve a variable for each and every one<<
Hmm, looks like the parser is going to be the real hog in all this - not to worry, this part can be done offline.
So, if I understand what you're saying:
- Do a density analysis to find the basic keywords
- Grab as much information as possible on the density, distribution etc of each kw and store it
- The actual search retrieves the info for pages that have the kw associated with it, rank not an issue
- The algo then uses this info according to whatever criteria you've set, and returns a ranked set of URLs
Right?
As much "pre live search" indexing of the page that you can do - the better. Some things are easy to pre-score (title, metas, headings, url) or even run time match. The tough one is the actual page content. It's where the real meat of the algo is at.
Ideally, you'll have enough server power to do lots of runtime matching. Most boxes don't have that luxury - so pre-scoring is critical.
>density, distribution etc of each kw and store it
If you can do that - it's great. The problem in the run time match algo will be getting speed out of it. It takes a great deal of processing to check the title - score, check the meta - score, etc. The match algo itself can be quite speedy. It falls down when you start checking deeper into the page entities one-by-one.
A cheap ploy? Run everything together in one simple bulk store of the entire page in the db (title, metas, url). Then your match routine only has to check that entire page in one gulp, one time. You can still separate out the various parts either in another db field, or with some embedded character that will get ignored by the match routine.
My older se's used a cheap trick: if say you want the title to have 3 times as much weight as the page - just store it 3 times. When the search algo comes along looking for "word X", it already has three times the weight. Obviously leads to bigger pages, but a very slick trick for faster matching routines that use regexs. (it's what I do at jf to get 1/2 second search responses over 20k pages and match categories to boot - oh ya, it's flat files!).
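The store-it-three-times trick, sketched in Python rather than Perl. The function names and the whole-word regex are assumptions; the idea is only that weighting happens once at index time, so the match routine is a single count over one string.

```python
import re

def build_blob(title, body, title_weight=3):
    """Store the title title_weight times ahead of the body in one string."""
    return " ".join([title] * title_weight + [body]).lower()

def match_score(blob, word):
    """Count whole-word occurrences of word in the stored blob."""
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return len(re.findall(pattern, blob))

blob = build_blob("Blue Widgets", "We sell widgets and gadgets")
print(match_score(blob, "widgets"), match_score(blob, "gadgets"))
```

A word that appears once in the title and once in the body scores 4 here (three title copies plus the body hit), against 1 for a body-only word.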
The pre-score is the route I'm going for the next version though. I want to get up to 100k pages and only sql can handle that size - I can't get the speed out of mysql I want though. The problem with mysql is there is so much post processing you have to do. They search for "foo", and foo matches 10k docs, that's 10k docs you'll have to sort out with perl.
I can do the pre-score mostly based on 5 years of keyword searches. I know what people mostly search for since I've got a db of 5 million searches performed. I can prescore the top 2k searches by looking for matches. Then it is only a question of dialing up or down the individual page entities. It also lends itself nicely to precaching of the actual serps.
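Roughly, picking the top searches out of a query log and precaching their results could look like this. It is a Python sketch under stated assumptions: the log is a plain list of query strings, and `run_search` is a hypothetical stand-in for whatever the real ranking routine is.

```python
from collections import Counter

def top_queries(search_log, n):
    """Return the n most frequent queries, normalized to trimmed lower case."""
    counts = Counter(q.strip().lower() for q in search_log)
    return [query for query, _ in counts.most_common(n)]

def build_serp_cache(search_log, n, run_search):
    """Run the (expensive) search once per top query and keep the results."""
    return {query: run_search(query) for query in top_queries(search_log, n)}

def fake_search(query):
    # Hypothetical placeholder for the real match and rank routine.
    return ["http://example.com/" + query]

log = ["Foo", "bar", "foo ", "baz", "bar", "foo"]
cache = build_serp_cache(log, 2, fake_search)
print(sorted(cache))
```

At query time the cache is checked first, and only a miss falls through to the live match routine.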
I've a database of around 300k emails, and searches on exact matches are lightning. As soon as I start doing partial matches or regex searches, it becomes painfully slow compared to flat files.
I think the trade-off is going to have to be the 'bloat' of the database caused by pre-scoring more or less everything, against the speed of exact searching.
Or maybe I'll look into an alternative DB package - did you have anything in mind?
> 5 years of keyword searches
Hehe, this is actually one of my ulterior motives for doing the whole thing ;)
Have you thought about tossing chunks of the db into memory (RAM) if you're running over it more than once? ..then get another chunk...
!! use tr/// for counting matches in long strings(such as body text)
!! pre-compile your regex where possible using the 'o' modifier
!! pre-build your top 20% terms (or more) into static files(flat) of site IDs
*** don't think 1 big DB...segments!!
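The "static flat files of site IDs" tip might be sketched like this, in Python rather than Perl. The one-file-per-term layout, the `.ids` extension and the function names are assumptions, and the sketch assumes terms are already safe to use as file names.

```python
import os
import tempfile

def write_term_files(postings, directory):
    """Write one flat file per term, one site ID per line."""
    for term, ids in postings.items():
        path = os.path.join(directory, term + ".ids")
        with open(path, "w") as fh:
            fh.write("\n".join(str(i) for i in sorted(ids)))

def lookup(term, directory):
    """Read the site IDs for a term; an unknown term gives an empty list."""
    path = os.path.join(directory, term + ".ids")
    if not os.path.exists(path):
        return []
    with open(path) as fh:
        return [int(line) for line in fh if line.strip()]

# Hypothetical postings for two popular terms.
with tempfile.TemporaryDirectory() as tmp:
    write_term_files({"widgets": {12, 3, 7}, "gadgets": {5}}, tmp)
    print(lookup("widgets", tmp), lookup("nothing", tmp))
```

Each term file is its own small segment, so a popular query is answered by reading one short file instead of scanning one big DB.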