Forum Moderators: IanTurner & engine


fast new uk search - that has its own results

         

penfold25

3:16 am on Jun 16, 2004 (gmt 0)

10+ Year Member



Just stumbled across a UK engine called [ukwizz.com...] that seems to be very new. The results are not too bad and the SERPs seem to come up pretty fast. Anyone else heard of it?
I'm sure it could take a small percentage of the UK market, maybe...

[edited by: Brett_Tabke at 8:12 pm (utc) on June 16, 2004]
[edit reason] [webmasterworld.com...] [/edit]

sidyadav

4:17 am on Jun 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I quite like Mooter, and I'd like to see whether UKWizz can compete - cluster wise.

No, you may have misunderstood me - I wasn't talking about that kind of clustering (which is keyword clustering).

I was talking about the simple domain clustering (max 1 result per domain) which a lot of search engines have (Google, Gigablast etc).

Also, the points you stated above - you've just repeated them for the second or third time (post count..?)

Sid

jmccormac

4:22 am on Jun 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Then I did some digging, and it looks like you can fully index about 300,000 web pages for about 20 bucks a month, including software, spidering and hosting.
At a guess, a spider should be able to get through approximately 3.1 million pages in a single month of continuous spidering. With more hardware and bandwidth that could increase, but it would involve a distributed database with distributed spidering.
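To put that 3.1 million pages/month figure in perspective, here is the back-of-envelope arithmetic (the average page size is my own illustrative assumption, not from the post):

```python
# Rough crawl-rate arithmetic behind a 3.1M pages/month spider.
SECONDS_PER_MONTH = 30 * 24 * 3600   # about 2.59 million seconds
pages_per_month = 3_100_000

pages_per_second = pages_per_month / SECONDS_PER_MONTH
print(f"{pages_per_second:.2f} pages/sec")   # 1.20 pages/sec

# Assuming ~25 KB of HTML per page, the sustained bandwidth needed:
avg_page_kb = 25
kbps = pages_per_second * avg_page_kb * 8    # kilobits per second
print(f"~{kbps:.0f} kbit/s sustained")       # ~239 kbit/s sustained
```

So a single continuous spider at this rate is only fetching a page or so per second; the ceiling is more about post-processing than raw bandwidth, which matches jmcc's point below about index processing slowing down.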

The problem with spidering is that when your search index starts getting large, processing the spidered pages gets slower. This means that breaking the database down into a number of smaller databases is often a better idea, especially if the search engine can combine results from a number of different search databases.
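The "combine results from a number of different search databases" step jmcc describes can be sketched as a merge of per-shard ranked lists. Everything here (the scores, URLs and function) is hypothetical, not how Aspseek or UKwizz actually does it:

```python
import heapq

# Hypothetical shard results: each smaller database returns
# (score, url) pairs already sorted by descending score.
shard_a = [(0.92, "http://example.co.uk/a"), (0.40, "http://example.co.uk/b")]
shard_b = [(0.81, "http://sample.org.uk/x"), (0.33, "http://sample.org.uk/y")]
shard_c = [(0.77, "http://demo.ac.uk/p")]

def merge_shards(*shards, limit=10):
    """Merge per-shard ranked lists into one result page."""
    # heapq.merge streams the shards without loading everything at once
    merged = heapq.merge(*shards, key=lambda hit: -hit[0])
    return [url for _, url in list(merged)[:limit]]

print(merge_shards(shard_a, shard_b, shard_c, limit=3))
# ['http://example.co.uk/a', 'http://sample.org.uk/x', 'http://demo.ac.uk/p']
```

The merge is cheap because each shard has already done the expensive ranking work on its own smaller index.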

The Dmoz float would provide approximately 166K URLs (rough figure - not checked for duplicates), but not all of them would be active or even usable. Going deeper on some of these URLs may not be such a good idea. The UKwizz index is more comprehensive than a simple Dmoz float, and it is that difference that may give UKwizz an advantage. From a hardware point of view, I am not sure that caching the indexed webpages is a good idea; it may be better to just have more indexed pages than a snapshot of each page as well as the results. However, caching does encourage site stickiness because the user is held on the SE a little longer.

In any SE operation, the biggest problem is the quality of the search index that the user sees. The index has to be cleaned of the dead sites, the "coming soon" sites, the holding pages and the wrongly categorised sites. Apart from monitoring the spiders, this is the toughest part of running an SE. You have to be able to make a decision to delete a site that may have hundreds of pages (typically because it is just a Dmoz clone or a PPC/affiliate swamp with no original content) without hesitation or remorse. SEs live and die by their indices, and if UKwizz can provide a linkswamp-free and spam-free index of UK websites that is better than bigger, competing SEs, then it will have a fighting chance. But monetizing the results will be crucial.
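The very first pass of that index cleaning - knocking obvious duplicates out of a raw URL dump like the Dmoz float mentioned above - might look something like this minimal sketch (real cleaning would also have to probe for dead hosts, holding pages and clones):

```python
from urllib.parse import urlsplit

def dedupe_urls(urls):
    """Drop obvious duplicates from a raw URL dump before spidering.

    Naive sketch: hostnames are case-insensitive per the DNS spec, but
    treating paths case-insensitively and ignoring query strings are
    simplifying assumptions made here for brevity.
    """
    seen, keep = set(), []
    for url in urls:
        parts = urlsplit(url.strip().lower())
        key = (parts.hostname, parts.path.rstrip("/") or "/")
        if parts.hostname and key not in seen:
            seen.add(key)
            keep.append(url)
    return keep

dump = ["http://Example.co.uk/", "http://example.co.uk", "http://other.uk/a/"]
print(dedupe_urls(dump))
# ['http://Example.co.uk/', 'http://other.uk/a/']
```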

Regards...jmcc

[edited by: jmccormac at 4:57 am (utc) on June 24, 2004]

jmccormac

4:54 am on Jun 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I was talking about the simple domain clustering (max 1 result per domain) which a lot of search engines have (Google, Gigablast etc).
That's the 'Group By Site' function, sidyadav. I think it is an option in how Aspseek presents the results.

It is one of the most useful aspects of any search engine because it reduces the number of options for the user and increases the usability of the search engine. I just checked on UKwizz for a well known UK city and it seems to be working well.
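For anyone curious how "max 1 result per domain" works mechanically, here is a minimal sketch of the idea (the URLs are made up, and Aspseek's actual implementation will differ):

```python
from urllib.parse import urlsplit

def cluster_by_site(ranked_urls):
    """Keep only the highest-ranked result per host ('Group By Site')."""
    seen_hosts = set()
    clustered = []
    for url in ranked_urls:          # assumed already in rank order
        host = urlsplit(url).hostname
        if host not in seen_hosts:   # first hit from this host wins
            seen_hosts.add(host)
            clustered.append(url)
    return clustered

results = [
    "http://www.example.co.uk/london",
    "http://www.example.co.uk/london/hotels",   # same host, dropped
    "http://city.sample.org.uk/guide",
]
print(cluster_by_site(results))
# ['http://www.example.co.uk/london', 'http://city.sample.org.uk/guide']
```

A real engine would keep the suppressed hits around so it can offer the "more results from this site" link mentioned later in the thread.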

Regards...jmcc

exmoorbeast

10:35 am on Jun 24, 2004 (gmt 0)

10+ Year Member



Great post JMCC, very useful.

How complex would the ranking algo be on something like UKwizz in comparison to Google, or maybe even something like Inktomi? Are we talking chalk and cheese here?

I hear there are some large engines out there that are just completely unable to deal with certain sites, and I wonder why this is. I'll read your post again with great interest when I have some time.

sidyadav

10:54 am on Jun 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> That's the 'Group By Site' function sidyadav.

Well, there may be different names for it. I got "Site Clustering" from the Gigablast Advanced Search [gigablast.com] page, as that's the only search engine I know which seemed to mention it (a lot of others use it though).

Sid

christopher

12:11 pm on Jun 24, 2004 (gmt 0)



"I hear there are some large engines out there that are just completely unable to deal with certain sites, and I wonder why this is? Will read your post again when I have some time with great interest"
----------------------------

Lots of sites use Content Management Systems, and a lot of engines can't access the information, mostly because the site owner doesn't want their admin sections becoming public knowledge.

So usernames and passwords are set up, and the site is designed to disallow spiders access. But the engines seem to penalise these sites.

There is another possibility, however:

If you penalise lots of sites in the free listings, it forces them to buy paid services.

People will seek alternatives. There is always someone else who is willing to help the customer.

This is what lots of sites need - to get started. I see all this technology out there, either copied or slightly altered (to avoid breaking copyright laws), but it's the same.

Same doesn't give value. New ideas do.

Advertisers want value - to enable them to compete with the big boys. Give it to them.

jmccormac

1:54 am on Jun 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How complex would the ranking algo be on something like UKwizz in comparison to Google, or maybe even something like Inktomi. Are we talking chalk and cheese here?
The Aspseek software that UKWizz is using is Open Source, exmoorbeast. It is possible to analyse its algorithm, which is more than can be said for Google's. :)

However, Google's and Inktomi's relevancy and interconnectedness aspects are probably more advanced than those of most Open Source search engines. This is really Google's unique selling point. Even so, Google, for all its turnip fields of PhDs, cannot beat a properly refined, country level search engine. This is the USP that UKWizz has to build upon.

The fundamental difference between a country level search engine and a Google type macro search engine is that the country level SE has a known, finite, area to cover. It is a micro SE to Google's macro.

The macro search engine is in effect trying to do what the country level SE is doing but it is doing it backwards. It is spidering everything and then hoping that it can apply relevance to the resulting mass of data. The country level SE has already taken the step of reducing that potential mass of data by planning what to spider.

I hear there are some large engines out there that are just completely unable to deal with certain sites, and I wonder why this is?
At a guess, the answer is down to the sheer automation of the process. The Googles and the Inktomis do not have the level of oversight that you can achieve with a country level SE. Dupe sites, cloaked sites and linkswamps can be a lot easier to identify, and by pre-indexing you can build a blocklist of these sites.

Some of the spam that has been in operation recently is designed to mimic the profile of a directory with static URLs. This can be the hardest to find. While some of these sites are legitimate sites that have made their dynamic webpages more spiderable, the problems are due to the linkswamps that are trying to game Google. These have some characteristics that both help and hinder SEs.

The danger for any SE is that PPC links will be integrated with legitimate content. It is all too easy to ban any site containing a URL with a known PPC/affiliate link. However, that is a drastic solution, and if the mainstream search engines were to do this, it would cause massive problems and inevitable retaliation. Google and Inktomi/Overture have too much to lose by knocking each other out. They may hit one or two of the small guys, or assign a penalty to any site containing such PPC links, but they keep away from the big players because it would cost too much.

Some of the smarter operators of these big sites hide the affiliate stuff in Javascript. (Adsense is also in embedded Javascript.) Most spiders ignore Javascript and as a result, these sites are rarely identified as being linkswamps. The only thing that the spider indexes is the apparently legitimate content. By periodically changing the keyword targeted content, it is possible for a linkswamp to give all the signals of being a legitimate human created site. And because Google, Inktomi and all the rest are highly automated processes, there is no real intervention unless a user spots the site and complains. These sites have certain characteristics that can give them away but a country level SE would have a better chance at detecting them because of its index being smaller and because of its operator(s) being more highly motivated. And it is easier to identify particular hosts and operators.
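jmcc's point about Javascript-hidden affiliate links suggests a simple detection heuristic: compare how often affiliate-style URL patterns appear inside script blocks versus in the visible markup a dumb spider would index. Everything below (the patterns, the page) is invented for illustration, not a real blocklist:

```python
import re

# Toy affiliate-URL patterns; a real SE would maintain a curated list.
AFFILIATE_PATTERNS = re.compile(r"(affid=|partner=|clickref=)", re.I)

def hidden_affiliate_links(html):
    """Count affiliate-style URLs inside vs. outside <script> blocks."""
    scripts = " ".join(re.findall(r"<script[^>]*>(.*?)</script>", html,
                                  re.I | re.S))
    visible = re.sub(r"<script[^>]*>.*?</script>", " ", html,
                     flags=re.I | re.S)
    in_scripts = len(AFFILIATE_PATTERNS.findall(scripts))
    in_visible = len(AFFILIATE_PATTERNS.findall(visible))
    return in_scripts, in_visible

page = '<p>Guide to widgets</p><script>loc="http://x.test/?affid=99"</script>'
print(hidden_affiliate_links(page))
# (1, 0) -> the affiliate link exists only where a JS-blind spider never looks
```

A high script-only count is exactly the "apparently legitimate content" mismatch described above: the spider's view and the browser's view of the page diverge.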

With any system, people are going to find weaknesses. Google's PageRank system is a classic example of something that works well on paper and in practice, but is wide open to abuse. I think that at the higher levels of some of the bigger SEs there is an awareness of this problem, and that is why the whole Semantic Web thing is being pushed forward.

Regards...jmcc

jmccormac

2:02 am on Jun 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, there may be different names for it. I got "Site Clustering" from the Gigablast Advanced Search page, as that's the only search engine I know which seemed to mention it (a lot of others use it though).
Yep, sidyadav. :) Too many names for the same thing, though most of them will have "more results from this site" as the indicator for it. It seems to be on by default on most search engines.

Regards...jmcc

jmccormac

10:56 am on Jun 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So usernames and passwords are set up, and web designed to disallow spiders access. But the engines seem to penalize these sites.
On a PageRank basis, these sites may appear, but against a fully spiderable site they don't have as good a chance. It is not a conspiracy.

Regards...jmcc

exmoorbeast

6:51 pm on Jun 25, 2004 (gmt 0)

10+ Year Member



jmcc

Thanks for taking the time to educate us....

I'd hit you back with something here, but I'm time limited at the moment... also I'm gonna share this with some of the really techie guys I know, as they can probably get more out of it than me. PM me if you ever want to build an index, cos I reckon I'd love to support it!

To me this is the best post in the UK forum this year, once again thank you very much indeed. You've restored my faith in coming to this part of WebmasterWorld.

christopher

8:16 pm on Jun 25, 2004 (gmt 0)



Yep - UK & Ireland SE is one of the best forums on WebmasterWorld.

Probably because there is more happening with search engines than directories, I guess. But directories can be a lot of fun to run and read about.

It's nice to see the small guys actually beat the Majors by doing something they can't or aren't willing to.

It's very satisfying.
