extracting text from a web page

kadnan

6:20 am on Mar 21, 2008 (gmt 0)

Hi again.

I wonder whether it's possible to extract the text of a web page. Suppose I have the following markup:

<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.

Can I fetch just the text within the tags? Keep in mind that the tags are not predefined.

Somebody suggested using Lynx on Linux. Is there a JavaScript-based solution?

Fotiman

2:26 pm on Mar 21, 2008 (gmt 0)

I'm not sure what you're trying to accomplish here. Do you want to get all of the text on the page without any of the HTML elements? In other words, if your page contained this:

<body>
<h1>Webmaster World</h1>
<p>
Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! <strong>Now get going!</strong>
</p>
</body>

Then you want to get a string like this:

Webmaster World Hello and welcome to Webmaster World! Now get going!

Is that what you're trying to accomplish?

kadnan

3:36 pm on Mar 21, 2008 (gmt 0)

exactly

Fotiman

5:01 pm on Mar 21, 2008 (gmt 0)


/** 
 * Get all text nodes at or beneath node n, concatenate the text 
 * and return it as a single string. 
 * @param {Node} n The node whose text should be collected
 * @return {string} The concatenated string
 */ 
function getText(n) { 
  var s = []; 
  function getStrings(n, s) { 
    var m; 
    if (n.nodeType == 3) { // TEXT_NODE 
      s.push(n.data); 
    } 
    else if (n.nodeType == 1) { // ELEMENT_NODE 
      for (m = n.firstChild; null != m; m = m.nextSibling) { 
        getStrings(m, s); 
      } 
    } 
  } 
  getStrings(n, s); 
  var result = s.join(" "); 
  return result; 
} 
alert(getText(document.body));

Note that in this example I concatenate all of the text nodes, putting a space between them. If, however, you have inline elements like this:

Webmaster<span>World</span>

then you'll end up with an extra space in the retrieved text ("Webmaster World" instead of "WebmasterWorld").
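One possible way around the extra space, as a rough sketch rather than a drop-in fix: join the text nodes with no separator and normalize the whitespace afterwards. The name getTextCompact is made up for this example, and it assumes the markup already has whitespace between block-level elements (which is usually the case when the source is indented). If it doesn't, adjacent blocks will run together.

```javascript
// Sketch: same traversal as getText, but text nodes are joined without
// a separator and the whitespace is normalized at the end.
// getTextCompact is a hypothetical name used only for this example.
function getTextCompact(n) {
  var parts = [];
  (function walk(node) {
    if (node.nodeType == 3) { // TEXT_NODE
      parts.push(node.data);
    } else if (node.nodeType == 1) { // ELEMENT_NODE
      for (var m = node.firstChild; m != null; m = m.nextSibling) {
        walk(m);
      }
    }
  })(n);
  // Collapse runs of whitespace (spaces, tabs, newlines) into one space,
  // then trim the ends.
  return parts.join("").replace(/\s+/g, " ").replace(/^\s+|\s+$/g, "");
}
```

In a browser you would call it the same way as before, for example alert(getTextCompact(document.body)).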

Hope this is enough to get you started.

fside

5:10 pm on Mar 21, 2008 (gmt 0)

> exactly <

I think that's a pretty standard regex string.

Mystring.replace(/<[^>]+>|&[^;]+;/g,'').replace(/ {2,}/g,' ')

The first replace strips tags, HTML comments, and entities (entities are deleted, not decoded). The second reduces two or more consecutive spaces to a single space.
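Wrapped up as a function, the two-step replace might look like the sketch below. It assumes the alternation in the first pattern is a plain | and that the input is an HTML string rather than a DOM node. The name stripTags is made up for the example. Some caveats to be aware of: a comment that contains a > is only partly removed, and a bare ampersand in the text (as in "AT&T") makes the entity pattern swallow everything up to the next semicolon.

```javascript
// Sketch: strip tags, comments and entities from an HTML string.
// stripTags is a hypothetical name used only for this example.
function stripTags(html) {
  return html
    // Remove anything that looks like <...> or &...;
    .replace(/<[^>]+>|&[^;]+;/g, "")
    // Reduce two or more consecutive spaces to one.
    .replace(/ {2,}/g, " ");
}
```

For example, stripTags('Webmaster<span>World</span>') gives "WebmasterWorld", since nothing is inserted where the tags were.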