Forum Moderators: coopster

Message Too Old, No Replies

Personal Search Engines

         

itwebxpert

6:51 pm on Oct 28, 2004 (gmt 0)

10+ Year Member



Hi all,

I would like to create a personal search engine for my website that can search for a keyword across every page within my site.

I would appreciate it if anyone could advise me or provide useful links to sort this out.

Regards

jatar_k

6:55 pm on Oct 28, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



try
[htdig.org...]
or
[phpdig.net...]
or even
[google.com...]

Code Sentinel

7:47 am on Oct 29, 2004 (gmt 0)

10+ Year Member



phpdig was easy to set up, but I found it incredibly slow. It works, and works well, but it will take a LOT of time if you have many pages (thousands). One of my sites took a week to spider, and that was even with the sleep function removed; it also pegged my home machine at 100% CPU once the database started getting large (partly a consequence of removing the sleep function).

I couldn't decipher htdig, since I am not familiar with command-line Linux/Unix tools, nor do I have shell access at my host.

The Google search will require that your site be in Google, I believe; if half of your site isn't spidered, it may not be the best solution.

There were a few others I spotted but never got around to trying: iSearch (not the toolbar) is PHP and free for non-commercial/non-profit sites; also PHP-Pheonix, Interspire FastFind, and TSEP (The Search Engine Project). I only list these because they seem to be among the few that actually include a spider component. Others only do filesystem indexing, or just handle the search/display part based on your own database (which may not hold all the pages of the site).

If you run your whole site off a database with a common table/row structure, then you can just use the database's search features.
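To illustrate that approach, here is a minimal sketch of a keyword search against a site database, using Python's built-in sqlite3 as a stand-in (the `pages` table and its columns are hypothetical; a production MySQL setup would more likely use a full-text index than `LIKE`):

```python
import sqlite3

# In-memory demo database standing in for a real site database.
# The `pages` table and its column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("/widgets", "Widgets on sale", "Discount widgets, all colours."),
        ("/gadgets", "Gadget catalogue", "Gadgets for every budget."),
    ],
)

def search(keyword):
    # LIKE with wildcards gives simple substring matching; fine for a
    # small site, but full-text indexes scale much better.
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT url FROM pages WHERE title LIKE ? OR body LIKE ?",
        (pattern, pattern),
    )
    return [r[0] for r in rows]

print(search("widget"))
```

The same shape works against MySQL with its `MATCH ... AGAINST` full-text syntax once the content columns are indexed.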

If you want to cook your own, then roll up your sleeves :) There are a few ways to actually download the pages (cURL, raw HTTP methods, PHP's file functions with URL wrappers), but you will also need error handling, cookie/session handling, regular expressions for parsing data such as titles, content, and new links, storing and flagging links so you don't visit them more than once unless specified (reindex), handling relative links and malformed hrefs, exclusions (robots.txt, meta directives like nofollow and noindex), and probably more I don't recall offhand.
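As a rough illustration of those pieces (fetching, link extraction, visited-flagging, staying on one host, resolving relative links), here is a toy crawler sketch in Python. The HTML parsing is deliberately naive, and a real spider would also honor robots.txt and throttle its requests; the fake-site dict is purely for demonstration:

```python
import re
from urllib.parse import urljoin, urlparse

# Crude href extractor; a real spider would use a proper HTML parser
# to cope with malformed markup.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)', re.IGNORECASE)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl limited to the start URL's host.

    `fetch` is any callable mapping a URL to its HTML, so it could be
    urllib, a cURL binding, or (as below) a dict lookup for testing.
    """
    site = urlparse(start_url).netloc
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue                      # flagged: never visit twice
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue                      # minimal per-page error handling
        pages[url] = html
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)     # resolves relative links
            if urlparse(link).netloc == site:
                queue.append(link)        # stay on our own site
    return pages

# Tiny fake site standing in for real HTTP fetches.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://other.com/">out</a>',
    "http://example.com/a": '<a href="/">home</a>',
}
result = crawl("http://example.com/", FAKE_SITE.__getitem__)
```

Swapping the dict lookup for a real HTTP fetch function is the only change needed to point this at a live site.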

You could probably skip some of these if you want brute-force spidering without regard to your bandwidth allocation.

Then you need a way to search the data efficiently, and then display it nicely. I haven't researched that far yet; I'm somewhat overwhelmed by the spider part: how to do it efficiently so the system doing the spidering and indexing isn't pegged at 100% CPU usage for a week :P I'm sure many shared hosts won't like the CPU usage either.
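On the search-and-display half, the usual trick is to build an inverted index at spidering time so queries don't rescan every page. A minimal sketch (the word splitting here is naive; real indexers also strip markup, stem words, and rank results):

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages containing every query word (AND search)."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Example corpus such as a spider might have collected.
pages = {
    "/widgets": "Discount widgets on sale",
    "/gadgets": "Gadgets on sale today",
}
index = build_index(pages)
```

Because lookups touch only the sets for the query words, search cost stays roughly proportional to result size rather than corpus size, which is what keeps query-time CPU usage sane even when indexing was expensive.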

charlier

8:11 am on Oct 29, 2004 (gmt 0)

10+ Year Member



I use swish-e on my sites.

"SWISH-E is a fast, powerful, flexible, free, and easy to use system for indexing collections of Web pages or other files. See the article How to Index Anything by Josh Rabinowitz in the Linux Journal for more information."

It's very fast for indexing records from a database. It also has a couple of helper scripts for spidering and displaying results.

[swish-e.org...]

itwebxpert

8:12 am on Oct 29, 2004 (gmt 0)

10+ Year Member



Hi,

Thank you to all of you who contributed to this topic. Since I am not a web expert yet, I would like to create a search engine for my website that displays information about products on sale.

I have spent a bit of time with [htdig.org...], but it seems to be complicated.

Please advise me on the simplest way to create a web search engine.

I highly appreciate any advice.

Cheers

ncw164x

8:49 am on Oct 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could try this
[focalmedia.net...]

It indexes all your site's pages into MySQL, and then you can search via keywords.

I use the same search software, but it is integrated into the directory software by the same company.