extracting text from a web page

   
6:20 am on Mar 21, 2008 (gmt 0)

10+ Year Member



Hi again.

I wonder whether it's possible to extract the text of a web page. Suppose I have the following markup:

<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.

Can I fetch just the text inside the tags? Do keep in mind that the tags are not predefined.

Somebody suggested using Lynx on Linux. I wonder whether there is a JS-based solution.

2:26 pm on Mar 21, 2008 (gmt 0)

WebmasterWorld Senior Member fotiman is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



I'm not sure what you're trying to accomplish here. Are you wanting to get all of the text on the page without any of the HTML elements? In other words, if your page contained this:

<body>
<h1>Webmaster World</h1>
<p>
Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! <strong>Now get going!</strong>
</p>
</body>

Then you want to get a string like this:

Webmaster World Hello and welcome to Webmaster World! Now get going!

Is that what you're trying to accomplish?

3:36 pm on Mar 21, 2008 (gmt 0)

10+ Year Member



exactly

5:01 pm on Mar 21, 2008 (gmt 0)

WebmasterWorld Senior Member fotiman is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month




/**
 * Get all text nodes at or beneath node n, concatenate the text
 * and return it as a single string.
 * @param {DOMNode} n The node to get the text at
 * @return The concatenated string
 */
function getText(n) {
    var s = [];
    function getStrings(n, s) {
        var m;
        if (n.nodeType == 3) { // TEXT_NODE
            s.push(n.data);
        }
        else if (n.nodeType == 1) { // ELEMENT_NODE
            for (m = n.firstChild; null != m; m = m.nextSibling) {
                getStrings(m, s);
            }
        }
    }
    getStrings(n, s);
    var result = s.join(" ");
    return result;
}
alert(getText(document.body));

Note, in this example I concatenate all of the text nodes putting a space between them. If, however, you have inline elements like this:

Webmaster<span>World</span>

Then you'll end up with an extra space in the retrieved text ("Webmaster World" instead of "WebmasterWorld").
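If that extra space matters, one possible variation (a rough sketch, not tested across browsers) is to join the text nodes with no separator and then collapse whatever whitespace was already in the markup. The name getTextNoGaps is just made up for illustration:

```javascript
// Sketch: same tree walk as above, but text nodes are joined without a
// separator. getTextNoGaps is a hypothetical name, not a DOM API.
function getTextNoGaps(n) {
    var parts = [];
    (function walk(node) {
        var child;
        if (node.nodeType === 3) {          // TEXT_NODE
            parts.push(node.data);
        } else if (node.nodeType === 1) {   // ELEMENT_NODE
            for (child = node.firstChild; child != null; child = child.nextSibling) {
                walk(child);
            }
        }
    })(n);
    // Join with no separator so inline elements do not introduce a gap,
    // then collapse whitespace runs and trim one leading/trailing space.
    return parts.join("").replace(/\s+/g, " ").replace(/^ | $/g, "");
}
```

The trade-off is that adjacent block elements with no whitespace between them in the source, such as <h1>A</h1><p>B</p>, would run together as "AB", so which version is better depends on your markup.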

Hope this is enough to get you started.

5:10 pm on Mar 21, 2008 (gmt 0)

5+ Year Member



> exactly <

I think that's a pretty standard regex string.

Mystring.replace(/<[^>]+>|&[^;]+;/g,'').replace(/ {2,}/g,' ')

The first replace strips tags, HTML comments (as long as they contain no > character), and entities. The second reduces two or more consecutive spaces to one space.
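To try the regex idea on the markup from the first post, here is a small sketch. It assumes the tag pattern and the entity pattern are meant as alternatives (joined with |), and stripTags is a hypothetical helper name:

```javascript
// Sketch of the regex approach; stripTags is a made-up helper name.
function stripTags(html) {
    return html
        .replace(/<[^>]+>|&[^;]+;/g, "")  // remove tags, simple comments, entities
        .replace(/ {2,}/g, " ");          // collapse runs of spaces
}

var sample = "<p>This is my text which is <a href=\"#\">anchored</a> " +
    "as well as bold but <div> it's not so good</div>.";
console.log(stripTags(sample));
// prints: This is my text which is anchored as well as bold but it's not so good.
```

Bear in mind that this removes entities rather than decoding them ("Fish &amp; chips" becomes "Fish chips"), and a stray & followed later by a ; will swallow everything in between. Attribute values containing > and the contents of script or style elements will also trip it up, so for anything beyond simple fragments walking the DOM is probably safer.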

 
