@Seoskunk you glorious b&&&& :D Thanks i'm checking that out now. Nice.
That's the future (not necessarily that actual project), but that's exactly what we're missing. An engine for the people, by the people, controlled by the people. It's here, right now, immature, useless (results), and unknown (except to Seoskunk) - but it will grow.
This is exactly how the future 'rolls', won't do us much good now - but exciting to watch.
Ixquick and StartPage were both mentioned, but they are one and the same. Startpage is the US 'version' of Ixquick that they put up partly to separate the US and Euro (NL) divisions. They do no crawling, relying on digesting results from Google, Bing and others. They sell advertising similar to AdWords for income. They do offer tools that let you use Google's tools anonymously. Duckduckgo uses Yandex crawlers.
Just like the internet itself, search needs to be decentralized -- so no one person/entity can control it.
As I mentioned here (in December 2009),
and more detailed here (in 2011):
.. and several other threads over the years about a "distributed search engine".
@not2easy .. so with regard to DDG, building and maintaining a crawler might be key to providing a certain uniqueness to the index .. I could see it running in conjunction with the Yandex crawler for a period of time before DDG can become fully independent in its results ..
|Duckduckgo uses Yandex crawlers. |
:: lightbulb ::
That explains why Yandex keeps crawling pages that it can't index! (For those who have never checked it out: Yandex's wmt includes a breakdown of why various pages aren't indexed. One category is "unsupported language".) Yup, there they are on DuckDuckGo.
DuckDuckGo also has an interestingly different way of handling queries in languages it doesn't know. I tried two unrelated languages in two unrelated scripts and saw the same technique at play each time. Not an approach I'd recommend, but definitely an alternative to g###'s exact-match-or-nothing approach.
How feasible would it be to create our own search engine and what things would we need to do it?
I use AAfter Search - a kind of ugly looking SERP, however, all the important links (wikipedia/local yellow pages/real-time) related to the search queries are at the top of the results that are useful to me.
Update created this post for those interested in a new P2P search engine....... [webmasterworld.com...]
|brotherhood of LAN|
From engine's original post on 19th June
|This may just be a temporary boost: DuckDuckGo is reporting a 2 million to 3 million [twitter.com] jump in traffic over only a few days since the Prism story broke. |
The figure at the time was a bit higher but has just surpassed 3.5 million.
These are the questions you have to ask:
|How feasible would it be to create our own search engine and what things would we need to do it? |
1. What do you want to index?
2. What will be the size of the working index?
3. How often will the index be updated?
4. How do you intend to keep the index clean?
5. What levels of expertise are available?
6. What kind of hardware is available?
7. What kind of funding is available?
8. How long is the period between startup and launch?
9. How will it make money?
In reply to:
|How feasible would it be to create our own search engine and what things would we need to do it? |
A large-scale crawl can actually be done on a relatively small number of machines. For example, IRLBot crawled well over a billion pages using just a single machine in 2009. If you look at how the ClueWeb 2012 dataset was collected, it again used only a small number of machines -- fewer than were used for ClueWeb 2009. You could also forgo crawling altogether and use public datasets like commoncrawl.org or Internet Archive's datasets. These datasets are large enough that you don't want to just download them. With ClueWeb you can pay for hard drives, but you also need to sign a use agreement.
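Whether you crawl yourself or supplement a public dataset, any crawler has to respect per-host politeness rules before fetching anything. A minimal sketch using only Python's stdlib `urllib.robotparser` (the robots.txt text, the bot name, and the URLs here are all invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a crawler would fetch it from
# http://example.com/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Filter a frontier of candidate URLs down to the fetchable ones.
frontier = [
    "http://example.com/index.html",
    "http://example.com/private/secret.html",
]
fetchable = [url for url in frontier if parser.can_fetch("MyBot", url)]
print(fetchable)            # only the non-/private/ URL survives
print(parser.crawl_delay("MyBot"))  # 10 -- seconds to wait between fetches
```

At web scale the same check just sits in front of the fetch queue, with one cached parser per host.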
There are of course open source crawlers/indexers out there that work at web scale such as nutch/lucene/solr, or my own search engine yioop.
The hard part is not getting the pages with a small number of machines -- the hard part is being able to process the pages to any significant degree. This is slowly changing: the algorithms to approximate what the big boys are doing are getting better, and of course all hardware is getting faster. As an example of something that is helped by a lot of machines, consider a conjunctive query. A typical web index has list structures of the form (word, list of documents that the word appeared in). Assume these lists are sorted by some global importance measure like PageRank. For a web crawl, the list associated with a word can be on the order of millions to billions of entries. On a single-word query you can just return roughly the first x members of the list and be done with it, so the lookup time is proportional to the number of results you want.
For a two-word query where each word is relatively common but the two words don't often occur in the same document (for example, earthquake soccer), the time to find matching documents can be proportional to the length of the shorter list, which might still be millions or billions of entries. On a small number of machines this makes such queries super slow to answer. Splitting the index (the lists for each word) across many machines means each machine doesn't have as long a list to search for intersections.
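To make the cost argument above concrete, here is a toy version of both query types. The index contents are invented; real posting lists would hold millions of entries, sorted by a global rank rather than these plain ascending ids:

```python
def single_word(postings, word, k):
    """Single-word query: slice the first k entries off the
    rank-ordered posting list -- cost proportional to k."""
    return postings[word][:k]

def conjunctive(postings, w1, w2):
    """Two-word query: merge-intersect two sorted posting lists.
    The loop walks both lists, so the cost is at least the length
    of the shorter list, even when very few documents match."""
    a, b = postings[w1], postings[w2]
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical index: doc ids stand in for a PageRank-style order.
postings = {
    "earthquake": [2, 5, 8, 11, 19, 23],
    "soccer":     [3, 5, 9, 23, 40],
}
print(single_word(postings, "soccer", 2))             # [3, 5]
print(conjunctive(postings, "earthquake", "soccer"))  # [5, 23]
```

Sharding the index across machines shrinks each `a` and `b`, which is exactly why a big cluster answers conjunctive queries so much faster than one box.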
Using a distributed search engine might solve this problem, but that has its own headaches.
With a small number of machines, you were largely out of luck until recently. Now you can use a hybrid inverted index/suffix tree, where the suffix trees are done as in:
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran: Inverted indexes for phrases and strings. SIGIR 2011. 555-564
to get something like a conjunctive query with a fallback to exact string match but in time proportional to the number of results rather than list sizes.
My own feeling is that from a technical perspective it will become increasingly possible for individuals to have their own web scale crawls of the web in the future.
|brotherhood of LAN|
Nice post cpollett, and I hope you're right that hardware will allow for a wider market of people to try and build competing engines that produce quality results in good time.
It sounds like you've worked a lot on the problem on the sheer size of data and what structures can be used.
What are your plans for your search engine?
RE: seed data, blekko has an Amazon public dataset available, listing unique domains and unique URLs alongside some ranking information. The domain list totals 170 million, so the URL list must be pretty huge. Worth knowing if you're looking for a seed list to start with. I remember these kind of threads mentioning using DMOZ as a seed list, which is tiny in comparison.
I think that a lot of web-scale attempts at building search engines are theorised by people who don't really understand the fabric of the web when it comes to the sheer number of domain names. I would include some of the Google people in that, because I consider that they don't understand the web and are merely applying sticking plasters to a rather large, overfilled string bag in an attempt to shore up an algorithm with more holes than a lump of activated carbon.
Not all domain names have working websites. Depending on the TLD, most will be a combination of holding pages (or soft 404s, as some people call them) and PPC parking. Then there are brand-protection registrations, which may redirect via 301s and 302s to the primary site. Then there are clones and mutants (sites that are not 30xed to the main site and are either complete copies, or so slightly different that they appear to be different sites to ordinary search engines). Then there are the compromised websites. Fairly quickly, you get down to a core dataset which may be around 9 to 23% of that TLD. And then the real work starts in sorting the active from the abandoned websites.
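The clone filtering described above largely comes down to collapsing duplicates before ranking ever starts. A crude stdlib sketch (all domains and page bodies are invented) that hashes fetched content to fold exact clones into one canonical site:

```python
import hashlib

# Hypothetical crawl results: domain -> page body.  Two of the
# "different" domains serve byte-identical content.
pages = {
    "widgets-example.com":    "<html>real widget shop</html>",
    "widgets-example.net":    "<html>real widget shop</html>",  # clone
    "widgets-parked.example": "<html>buy this domain!</html>",  # parking
}

seen = {}       # content hash -> first domain that served it
canonical = []  # domains kept in the core dataset
for domain, body in pages.items():
    digest = hashlib.sha256(body.encode()).hexdigest()
    if digest in seen:
        continue  # exact clone of a site we already kept; drop it
    seen[digest] = domain
    canonical.append(domain)

print(canonical)  # the .net clone folds into the .com original
```

Exact hashing only catches byte-identical clones; the "mutants" (slightly edited copies) need fuzzier signatures such as shingling or simhash, which is where the real work sits.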
The technology is easy. The algorithms are tougher, but Google can be beaten because it is now in its infinite-monkeys mode (the Shakespeare thing) of spidering the web and hoping that its blind crawling will give all that steaming lump of data some relevance. The real genius is in knowing what not to spider.
And the real kicker is that the cargo-cult SEOs and their FUD buddies at Google are destroying the link structure of the web by scaring webmasters into not linking to other sites. That puts any new SE on very shaky ground, because it will miss a major section of the web. New websites rarely link to others. Index-page outbound links are becoming rarer.
On the page at https://www.ixquick.com/eng/prism-program-revealed.html you can read,
"Our company is based in The Netherlands, Europe. US jurisdiction does not apply to us, at least not directly."
not2easy had written,
"Ixquick and StartPage were both mentioned, but they are one and the same. Startpage is the US 'version' of Ixquick that they put up partly to separate the US and Euro (NL) divisions."
Last time I checked - that is to say, when I checked today, ;-) - ixquick.com and startpage.com were in the same IP address block 188.8.131.52/24, routed via a San Jose, California-based server belonging to Dollar Phone Corp / Supernet.
How is a server located in the US outside of US jurisdiction?
|brotherhood of LAN|
Welcome to the forums yaimapitu
Their claim is a tall order, considering that their visitors may be from anywhere in the world & their data would cross a number of jurisdictions/countries.
For those who are interested in such things, I just noticed that Gigablast has gone open source after being acquired by Yippy. You can check out their source code:
at the bottom of the page it has links to documentation which looked interesting but I have only slightly skimmed it.
That's a lot of code and a lot of reading. It is an interesting move by Gigablast though.
|brotherhood of LAN|
Very interesting, thanks for sharing.