

technology choice for HTML parsing/querying


grang

10:29 pm on Nov 14, 2005 (gmt 0)

10+ Year Member



Hi all,

I am trying to build something similar to a simple meta search engine (but without using the search APIs - instead querying the sites directly and parsing/reformatting the HTML).

Since I am a newbie to web development, I need some help/guidance in deciding the following.

1. Should the HTML querying/parsing/reformatting be done on the client side or the server side?

(I think that if it is done on the client side, the server need not do all the querying and parsing.)

2. Which language/platform should I use? Since I am new to web development I am not sure which one to choose (PHP, Perl, JavaScript, ASP.NET C#, VBScript, etc.), or which is easiest to learn.

I want this to be extensible, and the results should display asynchronously so the user need not wait for all the sites to respond.

3. I might be wrong, but since multiple sites are involved, is it better to parse the HTML directly, or to convert all the HTML to XML and then take the content you need and ignore the rest? (Which is easier/more extensible?)

It would be of great help if you could provide me some guidance.

thanks for your time.

Iguana

11:13 pm on Nov 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One problem you have is that nearly all web pages are not valid XML - so that option is out unless you are reading RSS/Atom feeds.
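For example, a rough PHP sketch (the URL is a placeholder) shows why: loading ordinary HTML as XML will almost always fail, whereas a well-formed RSS feed parses cleanly.

    // Hypothetical URL, for illustration only.
    $html = file_get_contents('http://example.com/');

    // simplexml_load_string() returns false (with warnings,
    // suppressed here by @) when the markup is not well-formed
    // XML - the usual case for real-world HTML.
    $doc = @simplexml_load_string($html);
    if ($doc === false) {
        echo "Not well-formed XML - treat it as tag soup instead.\n";
    }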

Next, your server cannot control the client because of browser security (so if you had the client call up an invisible iframe with another site's page in it, you wouldn't be able to get at the inner document from your main window - the same-origin policy forbids it).

Your only realistic option is to fetch the pages on the server side. Then you need to parse the HTML source (not necessarily what is rendered in the browser), remembering that the 'rules' of HTML may not be followed - e.g. end tags may be missing. So you will need some clever programming.
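Fetching itself is the easy part. A minimal PHP sketch (assuming allow_url_fopen is enabled; the cURL extension is the more robust alternative, and the URL is made up):

    // Fetch the remote page from your server.
    $url  = 'http://example.com/search?q=widgets';
    $html = file_get_contents($url);
    if ($html === false) {
        die("Could not fetch $url");
    }

    // $html now holds the raw source - not the browser-rendered
    // page - so expect unclosed tags, odd quoting, and other
    // rule-breaking before you start parsing.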

As far as technology is concerned, you can easily get pages with ASP, VB.Net, Perl, or PHP. Perl is particularly elegant at string manipulation for parsing (if you understand Regular Expressions). PHP is available as standard in most cheap Linux hosting packages.

Asynchronous display is fine because your page can pump out its response line by line if you like (in ASP, turn off page buffering).
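In PHP the equivalent is to flush after each set of results. A rough sketch - the engine list and format_results() are made-up names:

    // Made-up list of engines to query.
    $engines = array(
        'EngineA' => 'http://example.com/search?q=widgets',
        'EngineB' => 'http://example.org/find?query=widgets',
    );

    foreach ($engines as $name => $url) {
        $html = file_get_contents($url);
        // format_results() is a hypothetical function that parses
        // $html and returns an HTML fragment for display.
        echo format_results($name, $html);
        if (ob_get_level() > 0) {
            ob_flush();       // empty PHP's output buffer first
        }
        flush();              // then push the output to the client
    }

Note that some web servers and proxies buffer the response anyway, so the user may still see it arrive in chunks.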

The big question is: how are you going to identify the part of the page you are interested in? Every site is different, and some sites have hand-coded HTML so that every page is slightly different. Go to a few sites and save the HTML of a page to your hard disk. Look at them and figure out in your head how you would get the relevant info out of them. Clues are that there may be a <div class="something"> or an <a href="somedirectory/*"> wrapping the info you want. Remember you need to filter out all the navigation and adverts.
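For instance, if a site does wrap each result in <div class="something">, a regex approach in PHP might look like this (assuming $html holds the fetched source as in the sketch above; fragile by design, since nested divs or a site redesign will break it):

    // Grab the contents of every <div class="something">...</div>.
    preg_match_all('#<div class="something">(.*?)</div>#si', $html, $matches);

    foreach ($matches[1] as $snippet) {
        // $snippet still contains inner markup (links, images,
        // adverts) that you will need to filter or strip.
        echo strip_tags($snippet) . "\n";
    }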

Next, how can you encode the rules you need to parse each site's pages into a data store? E.g. site 1 has its data in div.class=something, site 2 in a.href=somedirectory/* (but perhaps with duplicated links - one from an image and one from a text link). In my experience, extracting the info often requires you to find a specific part of the page, such as a link, and then get the text of the outer table row. Not an insignificant problem when the HTML is complicated and/or incorrect.
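One hedged way to encode those per-site rules, with made-up site names and patterns, and assuming a single regex per site is enough (it often isn't):

    // Hypothetical rule table: one extraction pattern per site.
    // In practice this might live in a database, so a rule can be
    // fixed when a site redesigns without touching code.
    $rules = array(
        'site1.example' => '#<div class="something">(.*?)</div>#si',
        'site2.example' => '#<a href="(somedirectory/[^"]+)"#si',
    );

    function extract_results($site, $html, $rules) {
        preg_match_all($rules[$site], $html, $matches);
        // array_unique() drops duplicates, e.g. when an image link
        // and a text link point at the same URL.
        return array_unique($matches[1]);
    }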

So you have a real task on your hands, even before the page designs or site navigation change underneath you. You will begin to understand why Google settles for a pretty crude 'snippet' on its own results pages. Good luck!

Iguana

11:14 pm on Nov 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, I should also have said: Welcome to WebmasterWorld!