| 7:05 pm on Oct 9, 2012 (gmt 0)|
|jmcc, a formidable task but one that you're familiar with. Why not try a city, like London, to start off with? The UK is a big country and makes the task bigger; London would be a good 'proof of concept'.|
London might be a good test for such a search engine.
|The first issue I see is knowing where to spider and the redundancy of discarding 'non-UK' sites. I guess you would want to start off with a seed list of sites, follow the links a couple of levels deep, and you could have a good proportion of related sites.|
The seed list of sites is approximately 8 million. This will fall significantly with the application of some web survey classification algorithms (the ones I use for classifying web usage in TLDs). This kind of approach differs from the conventional search engine approach of spidering everything and then trying to clean the index.
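The thread doesn't describe how those classification algorithms work, but the idea of pruning a seed list before crawling can be sketched with simple homepage heuristics. The marker strings and labels below are illustrative assumptions, not the actual algorithms:

```python
# Hypothetical sketch: classify seed-list entries before indexing, rather
# than crawling everything and cleaning the index afterwards. The marker
# phrases are illustrative only, not the real classification rules.

PARKED_MARKERS = ("domain is for sale", "buy this domain", "parked free")
HOLDING_MARKERS = ("coming soon", "under construction", "site is being built")

def classify_homepage(html: str) -> str:
    """Return a coarse label for a fetched homepage."""
    text = html.lower()
    if any(m in text for m in PARKED_MARKERS):
        return "ppc-parked"
    if any(m in text for m in HOLDING_MARKERS):
        return "holding-page"
    if not text.strip():
        return "empty"
    return "active"

def filter_seedlist(pages: dict[str, str]) -> list[str]:
    """Keep only domains whose homepage looks like a real site."""
    return [domain for domain, html in pages.items()
            if classify_homepage(html) == "active"]
```

Run over the 8 million seed domains, a filter like this is what would make the list "fall significantly" before any full crawl starts.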
The UK webspace is also more complex than some in that it has a few transnational hosters covering a number of countries. This causes an adjacent-markets effect where countries selling into the UK market will have registered a .uk domain name.
The redundancy issue (clone websites) is one with which Google has problems. Once the clone websites get into the index, it becomes very difficult to solve, but the web usage algorithms actually solve it before indexing. As the UK is a mature market in web terms, a high percentage of UK websites will be hosted in the UK.
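One common way to catch clones before they reach an index is near-duplicate detection over text shingles; the thread doesn't say this is the method used, so treat the following as an illustrative sketch, with the 0.9 similarity threshold being an assumption:

```python
# Illustrative sketch: detect clone websites before indexing by comparing
# Jaccard similarity of word-trigram shingles. The threshold is a guess.

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Break text into overlapping word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Set overlap ratio: 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_clone(page: str, indexed_pages: list[str],
             threshold: float = 0.9) -> bool:
    """True if the page is a near-duplicate of anything already indexed."""
    sig = shingles(page)
    return any(jaccard(sig, shingles(p)) >= threshold for p in indexed_pages)
```

Rejecting a candidate at this stage is far cheaper than trying to untangle clones after they are already ranking in the index, which is the difficulty described above.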
| 7:13 pm on Oct 9, 2012 (gmt 0)|
|The directory model has failed so many times it's a little bit tragic, though. DMOZ - spammed to hell (they never reacted to the market changing around them and the model didn't scale). Yahoo - died a slow death (again, they never reacted to market changes). Any link directory - was always spam. Yellow Pages, etc. - slow to embrace technology, slow to adapt their model. But I do think a directory model of some sort could work.|
This is part of the approach that I was looking at. As it would be deeply linked to the search engine, a site getting knocked out of the index may well have its directory entry knocked out too. Though Bing has been improving over the last year or so (or Google has been declining), the last thing I would want to do is adopt its approach to local search. The vortal approach to niche search is also a possibility. It is one of the things that I've tested out on a small scale (approximately 2K sites).
| 7:22 pm on Oct 9, 2012 (gmt 0)|
|Your initial crawl could just be bone-headed in terms of TLD; that would give a good base, a decent document collection.|
It looks like the most obvious approach, but it is also a very dangerous one when it comes to a country-level search engine, because spam, out-of-area sites, holding pages and PPC parked sites make their way into the index. While the UK is ccTLD positive (more local ccTLD domains registered than com/net/org/biz/info etc.), there are still a significant number of non-.uk websites that would be excluded in a TLD-restricted crawl.
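A "bone-headed" TLD filter of the kind suggested above is a one-liner over the hostname. This sketch keeps only .uk hosts for the first pass; the non-.uk UK sites discussed above would need a second signal (hosting location, on-page address, etc.) that it deliberately omits:

```python
# A deliberately simple first-pass filter: keep only .uk hostnames.
# This is the "bone-headed" TLD approach, so it excludes UK sites on
# .com etc. by design.
from urllib.parse import urlparse

def is_uk_tld(url: str) -> bool:
    """True if the URL's hostname falls under the .uk ccTLD."""
    host = urlparse(url).hostname or ""
    return host == "uk" or host.endswith(".uk")

def tld_filter(urls: list[str]) -> list[str]:
    """Reduce a crawl frontier to .uk hosts only."""
    return [u for u in urls if is_uk_tld(u)]
```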
| 8:12 pm on Oct 9, 2012 (gmt 0)|
Right, I should have used ccTLD in my post, specifically .uk for this discussion, and that doesn't negate the need for a high-quality seed list. If you cut a crawler loose on a junky seed list, you'll only crawl junk.
You'd also need some type of domain depth setting. Initially keep it low and only harvest a lowish number of links per document, so that you don't end up in the weeds. Once you get going, it's pretty easy to spot "trusted" sites and easier yet to spot spam. Once you get a grip on "trusted", you move those to a different crawl and go deep. Keeping a crawler out of the ditch is a bit of a trick, but it can be done with some practice.
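The depth and link-budget idea above can be sketched as a breadth-first walk that stops at a shallow depth and takes only the first few links per document; the `max_depth` and `links_per_page` values are illustrative assumptions, and the link graph stands in for live fetching:

```python
# Sketch of a shallow first-pass crawl: low depth, small per-document
# link budget, so the frontier stays out of the weeds. Trusted sites
# would later be moved to a separate, deeper crawl.
from collections import deque

def shallow_crawl(graph: dict[str, list[str]], seeds: list[str],
                  max_depth: int = 2, links_per_page: int = 3) -> set[str]:
    """Breadth-first walk over a pre-fetched link graph."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        url, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # depth budget spent; don't expand further
        for link in graph.get(url, [])[:links_per_page]:
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen
```

Raising `max_depth` only for domains already marked "trusted" is one way to implement the two-tier crawl described above.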
| 8:35 pm on Oct 9, 2012 (gmt 0)|
Well, the seed list is going to be relatively high quality after it goes through the web usage algorithms, which identify spam/junk and clones before they get to the main index.
| 9:03 pm on Oct 9, 2012 (gmt 0)|
I read somewhere that HubPages, not a search engine by most interpretations, raised a couple of million dollars to get going.
To me, a decent search engine would be a race between the money running out and the site becoming self-sustaining; I don't believe that the tech itself is insurmountable.