I'm sort of proficient with PHP, but I would have no idea where to even start coding such an enormous project.
Does anyone know how I would go about finding out:
i) Where to find commercially available SE systems?
ii) How to start researching how to program a SE?
I've spent a good while looking about on the net, but can't seem to find anything conclusive. First prize would be to buy a ready-made script; if I can't buy a complete system, I'm prepared to learn how to program it myself. Although I think it would be a fascinating project to code one from scratch...
I understand that bandwidth requirements etc are all enormous for this sort of venture, but I'm just mulling all this over in my mind at the moment; practicalities can come later. Besides, I envision this being a fairly limited niche SE.
Maybe I'm being totally naive in expecting this sort of thing to be available, but here's hoping...
Regards,
R.
Having looked through the AspSeek site again, I see that it was written in C++. Mmm, seems like I'm going to have to whip out the old university textbooks again!
So from what I can gather so far: all the data is crawled, gathered and entered into the database by a program written in C++, Java or another compiled language (as opposed to a scripting language like PHP or ASP). The user then just accesses that data from the database via a page written in a scripting language.
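To make that concrete, here's a minimal sketch of the scripting-language half - just a PHP page querying a database that some other process has already populated. The table and column names are hypothetical, and it assumes MySQL's built-in full-text indexing:

<?php
// Minimal search page sketch. Assumes an offline crawler/indexer has
// already filled a MySQL table `pages` (url, title, body) with a
// FULLTEXT index on (title, body) - all hypothetical names.
$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');

$q = isset($_GET['q']) ? $_GET['q'] : '';
if ($q !== '') {
    // The front-end page only queries; the heavy lifting (crawling,
    // parsing, indexing) happened offline in another process.
    $stmt = $pdo->prepare(
        'SELECT url, title, MATCH(title, body) AGAINST (?) AS score
           FROM pages
          WHERE MATCH(title, body) AGAINST (?)
          ORDER BY score DESC
          LIMIT 20'
    );
    $stmt->execute(array($q, $q));
    foreach ($stmt as $row) {
        printf('<p><a href="%s">%s</a></p>' . "\n",
               htmlspecialchars($row['url']),
               htmlspecialchars($row['title']));
    }
}
?>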
I would also have a look at hyperseek as a base - it seems pretty solid and has a lot of functionality built in. Depending on what you want to spend, you can get the basic functionality as written, or you can get the source code with full access to the algo for any customization you might need.
Justin
But it is written in PHP, which means it should be pretty easy to customise. Easier than C++ (for me anyhow).
When buying pre-packaged solutions like this, do you think that it's worth springing the extra cash to get the source code? Call me paranoid, but I like having everything under my control. I don't like using 'little black boxes' that I can't see into!
The big search engines, I'm sure, have significant separation between the various components of the system, and there's no reason why you shouldn't follow a similar practice on your smaller scale - it will stand you in much better stead for the future.
Consider that all the majors have the ability to serve cached pages. This points to a very simple front-end process - the one we know as "Googlebot", "msnbot" and friends. Their job, I'm sure, is simply to retrieve pages and store them in their entirety in a central cache, and nothing else.
The indexing process then comes along, picks pages up from the cache and does its work, alongside a URL extractor - probably running independently of the indexer - that decides what should be crawled next.
This is where you can start being clever - deciding what to crawl next and sending instructions back to the retrieval process.
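To sketch that in PHP - the directory names, file names and the index_document() helper below are all made up for illustration:

<?php
// One indexing pass: pick up raw pages from the cache the fetcher
// filled, index their text, and extract links to feed back to the
// retrieval process. Every name here is an assumption.

function index_document($id, $text) {
    // Stub: a real indexer would tokenise $text and update an
    // inverted index (database table, on-disk structure, etc.).
}

$frontier = fopen('next-urls.txt', 'a');

foreach (glob('cache/*.html') as $file) {
    $html = file_get_contents($file);

    // Indexing step: strip the markup, hand over the plain text.
    index_document($file, strip_tags($html));

    // URL extraction step: pull out absolute href targets and queue
    // them. A real extractor would also resolve relative URLs,
    // de-duplicate against what has already been crawled, and apply
    // whatever clever prioritisation you settle on.
    if (preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $m)) {
        foreach (array_unique($m[1]) as $url) {
            fwrite($frontier, $url . "\n");
        }
    }
}

fclose($frontier);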
If I were starting on a search engine project now I think my approach would be to shell out to wget as the crawling agent - you can configure the user-agent to be whatever you want, and it can read the list of URLs to fetch from a file (created by your indexing process), and store pages on your local filesystem, ready to be picked up by the indexing process.
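Shelling out from PHP might look something like this - the file names and the user-agent string are just placeholders:

<?php
// One crawl cycle: hand wget the URL list the indexing process wrote,
// and let it drop the fetched pages into the local cache directory.
$cmd = sprintf(
    'wget --input-file=%s --directory-prefix=%s --user-agent=%s '
    . '--wait=1 --timeout=30 --quiet',
    escapeshellarg('next-urls.txt'),
    escapeshellarg('cache'),
    escapeshellarg('MyNicheBot/0.1')
);
exec($cmd, $output, $status);
if ($status !== 0) {
    error_log("wget exited with status $status");
}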
Good luck!
I think you should approach search engine development in a systems fashion, rather than trying to code it all into one monster PHP script that crawls, stores, indexes etc. all in one process.
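For example, the fetch and index stages could simply be separate scheduled jobs. Hypothetical crontab entries, assuming the wget wrapper and the indexer each live in their own small script:

# Fetch a batch every 15 minutes; index whatever has landed once an hour.
# Script paths are placeholders.
*/15 * * * *  php /home/search/fetch.php
0 * * * *     php /home/search/index.php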
I agree with dmorison, and that is why, if I were to use hyperseek, I would buy the source - the first thing I would do is break it into pieces and have a starting point for each (or most) of my processes. (I believe you actually get the source code with all packages, but in the lower version(s) it is encrypted, so there is nothing you can do with it.)
IMO it is absolutely worth the extra, not only for the starting points, but because I would get to see inside the heads of the people who wrote it - what they are doing and how they are doing it - and with those two points covered, I could actually start to determine the why. (That can be a huge benefit when building a parallel system.) Actually, I would probably spend the first week dissecting it and then go from there.
Justin
The big search engines, I'm sure, have significant separation between the various components of the system,
It was fascinating at the last conference to see how narrow the scope of the Google engineers is. One team does nothing but work on the part relating to resolving canonical URLs. That's it. If your question wasn't about canonical URLs and HTTP return codes, go to another table.
Unfortunately my current plans have been scuppered by a competitor launching the EXACT sort of thing that I've been planning. And they're big, they've even got a large PR company working for them. Back to the drawing board...