Forum Moderators: coopster


Starting a new search engine

Any advice available?


zulu_dude

12:47 pm on Sep 2, 2005 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm wanting to start a new search engine. Before you fall off your chair laughing, I don't quite mean that I'm wanting to compete with the likes of G and Y! This is intended to be along the same sort of lines, but limited to a specific niche.

I'm sort-of proficient with PHP, but I would have no idea where to even start coding such an enormous project.

Does anyone know how I would go about finding out:
i) Where to find commercially available SE systems?
ii) How to start researching how to program a SE?

I've spent a good while looking about on the net, but can't seem to find anything conclusive. If I can't buy the complete system, I'm prepared to learn how to program it myself. First prize would be to buy a ready-made script. Although I think it would be a fascinating project to code one from scratch...

I understand that bandwidth requirements etc are all enormous for this sort of venture, but I'm just mulling all this over in my mind at the moment; practicalities can come later. Besides, I envision this being a fairly limited niche SE.

Maybe I'm being totally naive in expecting this sort of thing to be available, but here's hoping...

Romeo

1:02 pm on Sep 2, 2005 (gmt 0)

10+ Year Member



Before starting on your own, for initial research you may look at some of the projects (aspseek, nutch, grub) discussed here:
[webmasterworld.com...]
"Concept: an Open Source Search Engine -- Anyone care to theorize about how it could play out?"

Regards,
R.

zulu_dude

1:55 pm on Sep 2, 2005 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks Romeo... that's the exact thread that started me off thinking about all of this.

Having looked through the AspSeek site again, I see that it was written in C++. Mmm, seems like I'm going to have to whip out the old university textbooks again!

So from what I can gather so far, all the data is crawled, gathered and entered into the DB with C++ or Java or another programming language (as opposed to scripting languages like PHP or ASP). Then the data is just accessed by the user from the database via a scripting-language page.

HeadBut

9:49 pm on Sep 2, 2005 (gmt 0)

10+ Year Member



I don't see why you couldn't do it with PHP and MySQL.
PHP could go to a web page, parse it and enter your info into MySQL, like a spider or crawler. I've done that for my site/s.
PHP could be your search tool also (with MySQL).

I think it's very do-able!
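The fetch-parse-store loop described above can be sketched roughly like this. It's a Python sketch with an in-memory SQLite table standing in for the MySQL table the thread discusses (the idea carries over directly); the URL, page content and table name are all made up for illustration:

```python
import sqlite3
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Minimal parser: pulls the <title> and visible text out of a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

def index_page(db, url, html):
    """Parse one fetched page and store it in the index table."""
    p = PageParser()
    p.feed(html)
    db.execute("INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
               (url, p.title, " ".join(p.text)))

def search(db, term):
    """Naive substring search over the stored pages."""
    rows = db.execute("SELECT url, title FROM pages WHERE body LIKE ?",
                      ("%" + term + "%",))
    return list(rows)

# Demo with a hard-coded page instead of a live HTTP fetch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT, title TEXT, body TEXT)")
index_page(db, "http://example.com/",
           "<html><title>Example</title>"
           "<body>Niche search content</body></html>")
print(search(db, "niche"))
```

A real spider would fetch the HTML over HTTP first, and would need politeness delays and duplicate-URL checks on top of this, but the parse-then-store core is no more than the above.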

jd01

11:19 pm on Sep 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is possible with php, there are even some tutorials for creating a crawler and parsing the information. (I don't remember where they are or I would post them - I think one was at codewalkers.com)

I would also have a look at hyperseek for the base - seems pretty solid and has a lot of functionality built in. Depending on what you want to spend, you can get the basic function as written, or you can get the source code with full access to the algo for any customization you might need.

Justin

zulu_dude

9:15 am on Sep 3, 2005 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for the tips... went and checked out hyperseek and that seems to be along the lines of what I'm looking for: a solid base that I can build on and customise as I wish. Although it does seem to rely heavily on meta results, which is a bit of a negative (maybe). And its spider isn't a crawler, which is definitely a negative.

But it is written in PHP, which means it should be pretty easy to customise. Easier than C++ (for me anyhow).

When buying pre-packaged solutions like this, do you think that it's worth springing the extra cash to get the source code? Call me paranoid, but I like having everything under my control. I don't like using 'little black boxes' that I can't see into!

dmorison

10:23 am on Sep 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think you should approach search engine development in a systems fashion, rather than trying to code it all into one monster PHP script that crawls, stores, indexes etc. all in one process.

The big search engines, I'm sure, have significant separation between the various components of the system, and there's no reason why you shouldn't follow a similar practice on your smaller scale - it will stand you in much better stead for the future.

Consider that all the majors have the ability to provide cached pages. This points to a very simple front-end process - the one we know as "Googlebot", "msnbot" and friends. Their job, I'm sure, is simply to retrieve pages and store them in their entirety into the central cache, and nothing else.

The indexing process will then come along, pick up pages from the cache and do its work, along with a URL extractor - probably running independently from the indexer - that then decides what should be crawled next.

This is where you can start being clever - deciding what to crawl next and sending instructions back to the retrieval process.
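The URL-extractor step described above can be sketched like so. This is a Python sketch under assumed simplifications - the cache is just a dict of HTML strings keyed by URL, and the returned "frontier" is the list handed back to the retrieval process; all names and URLs are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags in a cached page."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they came from.
                    self.links.append(urljoin(self.base, value))

def next_crawl_list(cached_pages, seen):
    """cached_pages: {url: html}. Returns URLs not yet fetched."""
    frontier = []
    for url, html in cached_pages.items():
        ex = LinkExtractor(url)
        ex.feed(html)
        for link in ex.links:
            if link not in seen and link not in frontier:
                frontier.append(link)
    return frontier

cache = {"http://example.com/":
         '<a href="/a">A</a> <a href="http://other.org/">B</a>'}
frontier = next_crawl_list(cache, seen={"http://example.com/"})
print(frontier)
```

The "being clever" part would then be ranking or filtering this frontier (stay on-topic for the niche, cap per-host fetches, etc.) before writing it out for the retrieval process.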

If I were starting on a search engine project now I think my approach would be to shell out to wget as the crawling agent - you can configure the user-agent to be whatever you want, and it can read the list of URLs to fetch from a file (created by your indexing process), and store pages on your local filesystem, ready to be picked up by the indexing process.
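That wget-based retrieval step might look something like the following sketch. The long-form flags used (--user-agent, --input-file, --directory-prefix, --wait) are real wget options; the file names and user-agent string are invented for illustration:

```python
# Assemble the wget invocation the retrieval process would run.
url_list = "next_crawl.txt"   # written out by the indexing process
cache_dir = "page_cache"      # picked up later by the indexer

cmd = [
    "wget",
    "--user-agent=MyNicheBot/0.1",       # identify your crawler
    "--input-file=" + url_list,          # read URLs to fetch from a file
    "--directory-prefix=" + cache_dir,   # store fetched pages locally
    "--wait=1",                          # be polite between requests
]
print(" ".join(cmd))
# An actual run would be: subprocess.run(cmd, check=True)
```

Keeping retrieval as a separate shelled-out process like this means the crawler can be swapped or rate-limited without touching the indexer at all.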

Good luck!

jd01

5:51 pm on Sep 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think you should approach search engine development in a systems fashion, rather than trying to code it all into one monster PHP script that crawls, stores, indexes etc. all in one process.

I agree with dmorison and that is why if I were to use hyperseek, I would buy the source - the first thing I would do is break it into pieces and have a starting point for each (or most) of my processes. (I believe you actually get the source code for all packages, but in the lower version(s) it is encrypted, so there is nothing you can do with it.)

IMO it is absolutely worth the extra, not only for the starting points, but because I would get to see inside the head of the people who wrote it: what they are doing and how they are doing it. With those two points covered, I could actually start to determine the why. (Can be a huge benefit when building a parallel system.) Actually, I would probably spend the first week dissecting it and then go from there.

Justin

ergophobe

9:17 pm on Sep 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




The big search engines, I'm sure, have significant separation between the various components of the system,

It was fascinating at the last conference to see how narrow the scope of the Google engineers is. One team does nothing but work on the part relating to resolving canonical URLs. That's it. If your question wasn't about canonical URLs and HTTP return codes, go to another table.

zulu_dude

10:09 am on Sep 5, 2005 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for all the input... SE design is definitely a fascinating field and something I want to learn more about.

Unfortunately my current plans have been scuppered by a competitor launching the EXACT sort of thing that I've been planning. And they're big, they've even got a large PR company working for them. Back to the drawing board...