How to spider one country?

Forum Moderators: bakedjake

... just one language?

MattVrecar

7:01 pm on Oct 16, 2004 (gmt 0)

I am writing a spider. Do you know how it would be possible to spider only sites written in a certain language? For example, take Poland: I want to save only sites that are written in Polish. I've learned that IP-to-country lookups won't help because many servers are located outside Poland. Checking meta tags for language settings won't help either, because some sites don't put Polish info there. So how would a spider know that a website is written in Polish?

jmccormac

7:55 pm on Oct 16, 2004 (gmt 0)

It would be difficult. You would have to parse each page for common Polish words. Just based on com/net/org/biz/info domains on Polish webservers, you would have approximately 55K domains. The .pl cctld is bigger and the stats are at [dns.pl...] It may be possible to find more com/net/org/biz/info domains with a dictionary search of each gtld based on Polish language terms.
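A minimal sketch of what parsing a page for common Polish words could look like. It scores a page by the share of its words that appear on a common-word list, so one stray match is not enough to accept a page. The word list, the threshold and the function names are all assumptions made up for illustration; a real list would hold a few hundred frequent words.

```python
import re

# Illustrative sample only. Words that are also common in English
# (such as "to" or "do") are left out on purpose.
POLISH_COMMON_WORDS = {
    "w", "z", "na", "nie", "się", "jest", "że", "jak", "ale",
    "dla", "od", "przez", "oraz", "są", "czy", "tylko", "lub",
}

# Runs of Unicode letters, so words with Polish diacritics stay whole.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text: str) -> list:
    """Split text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def polish_score(text: str) -> float:
    """Fraction of the words in `text` that are on the common-word list."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in POLISH_COMMON_WORDS)
    return hits / len(words)


def looks_polish(text: str, threshold: float = 0.15, min_words: int = 8) -> bool:
    """Accept a page only if it is long enough and scores above the threshold."""
    return len(tokenize(text)) >= min_words and polish_score(text) >= threshold
```

The threshold and minimum length would need tuning against real pages; very short pages are rejected here because a ratio over a handful of words means little.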

The first part of designing a country search engine is to decide what to spider and build a list of domains/websites to be checked. This is more difficult than spidering the websites. At a guess, since there are 256122 .pl domains, you may have about 250K Polish domains in com/net/org/biz/info. That would be about 200K or so Polish owned domains hosted outside of Poland. However the breakdown on active domains for com/net/org/biz/info is about 70%. Only 50% or less may have active websites with content. And you always have the directory sites to give a starting list. I think Dmoz provides a Polish section, which is in RDF format and is easily parsed.
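The estimates quoted above can be laid out as a quick calculation. This is only a sketch of the arithmetic: every figure is a guess taken from the post, and applying the 50% figure to the total rather than to the active domains is an assumption, since the post does not say which is meant.

```python
# Figures quoted in the post; all are estimates, not measurements.
pl_cctld_domains = 256_122        # .pl domains
gtld_polish_total = 250_000       # guess: Polish domains in com/net/org/biz/info
gtld_hosted_in_poland = 55_000    # gTLD domains on Polish webservers

# Polish owned gTLD domains hosted outside Poland ("about 200K or so").
hosted_abroad = gtld_polish_total - gtld_hosted_in_poland

ACTIVE_PERCENT = 70       # share of gTLD domains that are active
WITH_SITE_PERCENT = 50    # upper bound; ASSUMED here to apply to the total

# Integer arithmetic keeps the results exact.
active_domains = gtld_polish_total * ACTIVE_PERCENT // 100
sites_with_content = gtld_polish_total * WITH_SITE_PERCENT // 100

print(hosted_abroad, active_domains, sites_with_content)
```

Under these assumptions the list to check is a few hundred thousand domains, of which perhaps half are worth spidering.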

My main work is on the Irish websites/domains and it is easy to build a country search engine when the country is small. :)

Regards...jmcc

MattVrecar

8:11 pm on Oct 16, 2004 (gmt 0)

Even if your country is small, I think you have the same problem. So you suggest finding common Polish words (Poland was just an example, btw). OK, that seems reasonable but it is still not a perfect solution. Some sites might use one Polish word from my list and I will put them in the DB. Other sites might not use any of the words on my list and I will miss them. Besides that, there is another problem: Polish (or Irish) sites probably represent less than 3% of all websites. It would be a waste of resources and time to search the whole internet just to save the Polish sites :( This is really a challenging task...

Any ideas other than a dictionary of the most popular words?

jmccormac

9:17 am on Oct 17, 2004 (gmt 0)

Even if your country is small, I think you have the same problem. So you suggest finding common Polish words (Poland was just an example, btw). OK, that seems reasonable but it is still not a perfect solution.

The problem is that there is no perfect solution when it comes to country search engines even when you have all the domain data.

Some sites might use one Polish word from my list and I will put them in the DB. Other sites might not use any of the words on my list and I will miss them. Besides that, there is another problem: Polish (or Irish) sites probably represent less than 3% of all websites. It would be a waste of resources and time to search the whole internet just to save the Polish sites :( This is really a challenging task.

You've got to think differently. :) We do not have the resources of Google or MSN. So we have to select what we spider before we spider it. Google/MSN/Yahoo use the "infinite monkeys at an infinite number of typewriters" [1] approach to spidering. They spider everything and hope that their algorithms will be able to extract the right data. The approach for a country search engine is to pre-select the websites to be spidered based on the sites being relevant to that country. It is a completely different way of looking at the problem. Spidering the entire web is a waste of resources. It would be a case of spidering something like 25 million (a guess) websites to extract something like 500K sites. (The number of Irish sites would probably be less than 250K.)

Any ideas other than a dictionary of the most popular words?

Break the search down into gtlds/.pl. The first and most accessible data will be com/net/org/biz/info.

Start with the main lists of easily identified Polish gtld domains (language keywords can also be used to filter domain names), verify that they have webservers and then pre-index the top page from each of those sites. On that dataset, you can then apply filters to determine active sites. This filter is applied to the web data first. When you have the active sites dataset, you start looking for Polish keywords in the title/metadata and then on the body text. Each filter is applied in steps so that the process becomes like a sieve or net.
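The sieve described above could be sketched roughly as follows. Each filter is a small function and the candidates pass through them in order, with the count recorded after every step. The `Candidate` fields, the keyword lists and the function names are hypothetical, and the webserver check and the pre-indexing of the top page are assumed to have been done already.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    domain: str
    has_webserver: bool = False  # result of an earlier HTTP check (not done here)
    title: str = ""              # title/metadata from the pre-indexed top page
    body: str = ""               # body text from the pre-indexed top page


# Hypothetical keyword lists, purely for illustration.
DOMAIN_KEYWORDS = re.compile(r"polska|polski|sklep|warszawa|krakow")
TEXT_KEYWORDS = re.compile(r"\b(jest|oraz|się|nie|dla)\b", re.IGNORECASE)


def domain_filter(c: Candidate) -> bool:
    """Keep domains whose name contains a language keyword."""
    return bool(DOMAIN_KEYWORDS.search(c.domain.lower()))


def active_filter(c: Candidate) -> bool:
    """Keep domains that answered on the web and returned some content."""
    return c.has_webserver and bool(c.title or c.body)


def text_filter(c: Candidate) -> bool:
    """Look for Polish keywords in the title first, then in the body text."""
    return bool(TEXT_KEYWORDS.search(c.title) or TEXT_KEYWORDS.search(c.body))


SIEVE = [
    ("domain keywords", domain_filter),
    ("active site", active_filter),
    ("Polish text", text_filter),
]


def run_sieve(candidates, steps=SIEVE):
    """Apply each filter in turn; return the survivors and a count per step."""
    remaining = list(candidates)
    counts = [("input", len(remaining))]
    for name, keep in steps:
        remaining = [c for c in remaining if keep(c)]
        counts.append((name, len(remaining)))
    return remaining, counts
```

Putting the cheap filters first means the expensive ones (fetching and parsing body text) only run on what is left.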

There are other sources for websites relating to Poland - the Dmoz.org Polish section and Polish directories. They can often provide information on sites that you have missed with the initial spidering.

I should really write a FAQ or a book on all these techniques. :)

Regards...jmcc

[1] The theory that an infinite number of monkeys at an infinite number of typewriters given an infinite time period will eventually produce the entire works of William Shakespeare. :)

Dave_A

5:33 am on Oct 21, 2004 (gmt 0)

From my angle, I only spider websites from New Zealand, so I limit spidering to .co.nz or other country-specific domain names. The problem comes when people use .org, .net and .com domain names; then it's a case of visiting the website manually before I send in the spider.
It would be good to limit the spider to a single country, but as yet I don't know a way of writing a PHP spidering script that would be able to do that.
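The rule described here (crawl country-code domains automatically, hold gTLD domains for a manual look) can be sketched in a few lines. The sketch is in Python rather than PHP, but the logic is a plain suffix test that ports directly. The suffix lists, the three labels and the function name are assumptions for illustration.

```python
from urllib.parse import urlsplit

CCTLD_SUFFIXES = (".nz",)                                  # covers .co.nz, .org.nz, ...
GTLD_SUFFIXES = (".com", ".net", ".org", ".biz", ".info")


def classify_url(url: str) -> str:
    """Return 'crawl', 'review' or 'skip' based only on the host's suffix."""
    # URLs without a scheme have no hostname here and fall through to 'skip'.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if host.endswith(CCTLD_SUFFIXES):
        return "crawl"      # country-code domain: send in the spider
    if host.endswith(GTLD_SUFFIXES):
        return "review"     # gTLD: queue for a manual visit first
    return "skip"           # some other country's TLD
```

For example, `classify_url("http://www.example.co.nz/page")` gives `"crawl"` while a `.com` URL lands in the review queue.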

jmccormac

10:24 am on Oct 21, 2004 (gmt 0)

From my angle, I only spider websites from New Zealand, so I limit spidering to .co.nz or other country-specific domain names. The problem comes when people use .org, .net and .com domain names; then it's a case of visiting the website manually before I send in the spider.

Handling com/net/org/biz/info is difficult for any country where a strong cctld exists. Approximately 34557 possible New Zealand owned gtld domains exist (based on stats for 15-Oct-2004 and limited to .nz nameservers and nameservers on identified New Zealand IP ranges). So that would be a lot of manual visits. The other problem is that the New Zealand market may have the same problem that Ireland has: a lot of the potential New Zealand sites may be hosted in other countries. At a guess, Australia would be the main hoster country with the US being secondary.

It would be good to limit the spider to a single country, but as yet I don't know a way of writing a PHP spidering script that would be able to do that.

If there is a high level of cctld restriction (cctld domain holders have to have a strong connection to the country), then the cctld is a good start because the probability of the cctld domain website being related to the country is high. When it gets to the gtld domains, it is a question of pre-selection rather than blind crawling (as Google/Yahoo/MSN do).

Once you've built up a core list of country specific websites, you can use this data to extract links to other probable country specific websites. However this has to be done with a database so that you can easily delete duplicates and run the extracted links against a block list of known bad or known unrelated links.
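A small sketch of the link-extraction step, assuming an in-memory SQLite database stands in for the real one. Links are pulled from a page, reduced to hostnames, inserted with the primary key removing duplicates, and then checked against a block list. The table names and helper functions are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hosts(base_url: str, html: str) -> list:
    """Return the hostnames linked from a page (duplicates included)."""
    parser = LinkParser()
    parser.feed(html)
    hosts = []
    for href in parser.hrefs:
        # Resolve relative links; mailto: and similar have no hostname.
        host = urlsplit(urljoin(base_url, href)).hostname
        if host:
            hosts.append(host.lower())
    return hosts


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE candidates (host TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE blocklist (host TEXT PRIMARY KEY)")
    return db


def add_candidates(db: sqlite3.Connection, hosts) -> None:
    # INSERT OR IGNORE lets the primary key drop duplicates.
    db.executemany(
        "INSERT OR IGNORE INTO candidates (host) VALUES (?)",
        [(h,) for h in hosts],
    )


def unblocked_candidates(db: sqlite3.Connection) -> list:
    """Candidate hosts that are not on the block list, in sorted order."""
    rows = db.execute(
        "SELECT host FROM candidates "
        "WHERE host NOT IN (SELECT host FROM blocklist) ORDER BY host"
    )
    return [row[0] for row in rows]
```

The survivors are only probable country-specific sites; they would still go through whatever relevance checks the core list went through.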

PHP has good regexp support and a lot of the work will be regexp based initially. The key is building the spider list beforehand. This is essentially what I've been doing for my main country SE/directory site. However the same techniques can be used on any country or gtld. I've already experimented with the UK but a lack of time/resources means that I cannot implement a UK SE for a while.

Regards...jmcc

Maxime

2:42 pm on Oct 23, 2004 (gmt 0)

Take a look at [maxime.net.ru...]

Fischerlaender

8:48 am on Oct 24, 2004 (gmt 0)

The principle of locality (or locality of reference [all-science-fair-projects.com]) in computer science tells us that the "likelihood of referencing a resource is higher if a resource near it was just referenced".

In terms of spidering a country's web this means that it's very likely that a Polish page is linking to other Polish pages.

So my strategy is to start with a large sample of country-specific pages which I take from Dmoz. My spider follows all the links from within this sample and fetches the linked pages, on which a simple language check is done. If the fetched page isn't in the desired language I discard it; otherwise the page is incorporated in my index and all its links get spidered.
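The strategy above amounts to a breadth-first crawl that only expands pages passing the language check. A rough sketch follows, with the fetcher and the language test passed in as functions so that no network access is assumed. The names are hypothetical, and unlike the description this version also runs the language check on the seed pages.

```python
from collections import deque


def focused_crawl(seeds, fetch, in_language, max_pages=1000):
    """Breadth-first crawl that follows links only from pages in the language.

    fetch(url) returns (text, links) or None on failure;
    in_language(text) returns True if the page should be kept.
    """
    seeds = list(seeds)
    queue = deque(seeds)
    seen = set(seeds)
    index = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        text, links = page
        if not in_language(text):
            continue  # discard the page and do not follow its links
        index.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Because a rejected page's links are never queued, the crawl tends to stay inside the cluster of pages in the desired language, which is the locality effect being relied on.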

Although I haven't done any quantitative analysis of this approach the results tell me that it works fairly well. (I'd guess that about 90% of all pages my spider is downloading are in the desired language.)

jmccormac

8:37 pm on Oct 24, 2004 (gmt 0)

In terms of spidering a country's web this means that it's very likely that a Polish page is linking to other Polish pages.

True. However brochureware websites (business sites of 5 or so pages that have no interactivity and are not updated continually) do not tend to link heavily.

So my strategy is to start with a large sample of country-specific pages which I take from Dmoz. My spider follows all the links from within this sample and fetches the linked pages, on which a simple language check is done. If the fetched page isn't in the desired language I discard it; otherwise the page is incorporated in my index and all its links get spidered.

It works as a starting position but it suffers from the GIGO problem (Garbage in - garbage out). It depends on the quality of the Dmoz index and the quality of the editors.

I wonder if there is a business in providing a country-level search index for each country.

Regards...jmcc

Larryhat

4:12 am on Oct 25, 2004 (gmt 0)

Hello all:

Most anybody with a website up wants it to be seen and visited. How about attracting webmasters with a free site submission form on your site? Specifically name the country/language and its variations and variant spellings so the submission page gets found. Use all the keywords (submit, suggest, site, page, directory, inclusion ..)

Personally, I would go by language as much as country, since so many Polish/Irish etc. live abroad.

Best -Larry

jmccormac

4:29 am on Oct 25, 2004 (gmt 0)

Most anybody with a website up wants it to be seen and visited. How about attracting webmasters with a free site submission form on your site? Specifically name the country/language and its variations and variant spellings so the submission page gets found. Use all the keywords (submit, suggest, site, page, directory, inclusion ..)

The problem is that user submission is not sufficient on its own to keep a search engine viable. All of the search engines that I've seen set up in Ireland over the past few years have based their initial dataset on Dmoz and user submissions. None of them are still around. I think that one of them lasted for 14 months before giving up. The same pattern probably exists for all country level search engines.

Now that bandwidth is somewhat cheaper, the latest idea is to crawl the links outwards from the initial dataset. The links outwards may be to other sites in the same dataset and as a result, this wastes more resources than it gathers in new sites. The alternative is to identify other high link count, authority sites like directories and integrate them into the dataset.

Without a proper website acquisition strategy, any country level search engine relying only on user submissions is going to be in trouble. In order to survive against Google/MSN/Yahoo, a country level search engine has to be better. The problem with relying on language for identifying Irish owned sites is that like the Americans, Australians, New Zealanders, Canadians and British, we speak English. :) I guess I should develop an accent parser.

Regards...jmcc

Larryhat

5:12 am on Oct 25, 2004 (gmt 0)

Hi JM: Points all well taken. How about this rather tedious method: Since Ireland is rather small, do a search for cities, towns and counties in Ireland. Take the first list of place-names from a map index maybe. Google/Yahoo each place and see what pops up.

I just Googled for "county Donegal" (exact phrase) and G returned 57,400 listings. They won't show you all of them of course, but the ones that show will have their own outgoing links.

I don't see any 'magic bullet' here, just long hard work. Anybody else? - LH

Dave_A

5:37 am on Oct 25, 2004 (gmt 0)

An Accent parser, G'day mate Dave from Downunder here?
That would be a neat idea but where would you start scripting such a thing?
The small countries have an edge on the major search engines like Google, because unless you do a specific search you get links to sites all over the place, and in countries like NZ and Ireland we have the bonus of a degree of customer loyalty and nationalism.
I set up a NZ based search engine five months ago (www.linknz.co.nz) and so far we are getting around 90,000 searches a week, which isn't bad considering the population of New Zealand is around four million.
Being that hosting is quite cheap, a countrywide search engine soon becomes quite a good investment and quite easy to set up. The degree of usage per head of population is quite high.
Google is good but who wants Web sites from all over the place.
I have set up a free indexing web service for all Kiwi based web sites and we have a small database of around 65,000 web sites indexed so far.
It seems small enough to remain almost a local service which everyone likes.
So if at some stage I were to ask for a dollar to index a web site, many people wouldn't refuse it, and we can offer a degree of speed that many other larger search engines can't match.
People from the other end of my country can call me on the phone or email me and say "Can you index my site?" and it's simply a case of "There you go Bro, it's done." GRIN!
Considering the fact that it's hardly worth google coming down this far south.
Let the AMELICAN's have the world's biggest (GRIN!) but in Kiwi land we would be the first to say that size isn't everything... Grin.

All the best
Dave A

jmccormac

2:08 pm on Oct 25, 2004 (gmt 0)

Hi JM: Points all well taken. How about this rather tedious method: Since Ireland is rather small, do a search for cities, towns and counties in Ireland. Take the first list of place-names from a map index maybe. Google/Yahoo each place and see what pops up.

Hard way of doing it Larry. Most of the .com/.net variants of Irish town and county names are squatted by that iCeltic mob in the USA. A lot of others tend to be mail only domains. I run the stats on each hoster in com/net/org/biz/info/ie tlds every week based on zonefiles so it would be easy to do such a thing. The other complicating factor is that a lot of Irish town/county names are also used in the USA.

I don't see any 'magic bullet' here, just long hard work.

The magic is in the regexp incantations but it is long hard work. I've been doing it for about five years now. :)

Regards...jmcc

jmccormac

2:32 pm on Oct 25, 2004 (gmt 0)

An Accent parser, G'day mate Dave from Downunder here?
That would be a neat idea but where would you start scripting such a thing?

It sounds a bit strange Dave but each country has unique terms and phrases. Some of it would involve heavy parsing of the body text in a page but a higher payoff could be in parsing the domain name/urls for these keywords.
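As a hedged illustration of parsing domain names and URLs for country-specific keywords, the sketch below pulls matching terms out of the hostname and path. The term list is invented for the example, and a match is only a hint: place names are reused in other countries and short terms can turn up inside unrelated words.

```python
import re
from urllib.parse import urlsplit

# Hypothetical term list; a real one would be built from local knowledge.
IRISH_TERMS = ("ireland", "irish", "eire", "dublin", "galway", "donegal")
TERM_RE = re.compile("|".join(map(re.escape, IRISH_TERMS)))


def url_terms(url: str) -> list:
    """Return the country terms found in the URL's hostname or path, sorted."""
    parts = urlsplit(url)
    haystack = ((parts.hostname or "") + parts.path).lower()
    return sorted(set(TERM_RE.findall(haystack)))
```

A URL such as `http://www.donegal-hotels.com/Dublin/offers` yields both `donegal` and `dublin`; the number of distinct terms could then feed into a score rather than act as a yes/no test.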

The small countries have an edge on the major search engines like Google, because unless you do a specific search you get links to sites all over the place, and in countries like NZ and Ireland we have the bonus of a degree of customer loyalty and nationalism.

Yes. However Google's advertising budget and brand recognition are hard things to beat unless you've got a good domain name. Unfortunately for me, Google decided to open an office in Dublin so it too could claim to be an Irish search engine.

Google is good but who wants Web sites from all over the place.
I have set up a free indexing web service for all Kiwi based web sites and we have a small database of around 65,000 web sites indexed so far.
It seems small enough to remain almost a local service which everyone likes.

This is the key selling point of a country level search engine in a nutshell. It offers links to websites that are only relevant to the country.

I don't think Google can come close to providing such a service. As for Microsoft - its "researchers" are so preoccupied with this semantic web concept and its effect on localisation that they cannot see the simple solutions to the problems of local search.

Regards...jmcc