extracting text from a web page

     
6:20 am on Mar 21, 2008 (gmt 0)

Full Member

10+ Year Member

joined:Dec 11, 2002
posts: 213
votes: 0


Hi again.

Is it possible to extract the text of a web page? Suppose I have the following markup:

<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.

Can I fetch just the text within the tags? Keep in mind that the tags are not predefined.

Somebody suggested using Lynx on Linux. Is there a JS-based solution?

2:26 pm on Mar 21, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member fotiman is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 17, 2005
posts:4966
votes: 10


I'm not sure what you're trying to accomplish here. Are you wanting to get all of the text on the page without any of the HTML elements? In other words, if your page contained this:

<body>
<h1>Webmaster World</h1>
<p>
Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! <strong>Now get going!</strong>
</p>
</body>

Then you want to get a string like this:

Webmaster World Hello and welcome to Webmaster World! Now get going!

Is that what you're trying to accomplish?

3:36 pm on Mar 21, 2008 (gmt 0)

Full Member

10+ Year Member

joined:Dec 11, 2002
posts: 213
votes: 0


exactly

5:01 pm on Mar 21, 2008 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member fotiman is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 17, 2005
posts:4966
votes: 10



/**
 * Get all text nodes at or beneath node n, concatenate the text
 * and return it as a single string.
 * @param {Node} n The node to get the text from
 * @return {string} The concatenated string
 */
function getText(n) {
    var s = [];
    // Recursively collect the data of every text node under n into s
    function getStrings(n, s) {
        var m;
        if (n.nodeType === 3) { // TEXT_NODE
            s.push(n.data);
        }
        else if (n.nodeType === 1) { // ELEMENT_NODE
            for (m = n.firstChild; m !== null; m = m.nextSibling) {
                getStrings(m, s);
            }
        }
    }
    getStrings(n, s);
    return s.join(" ");
}
alert(getText(document.body));

Note, in this example I concatenate all of the text nodes putting a space between them. If, however, you have inline elements like this:

Webmaster<span>World</span>

Then you'll end up with an extra space in the text that is retrieved ("Webmaster World" instead of "WebmasterWorld").
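
One way around the extra space, as a rough sketch rather than a definitive fix: join the text nodes with an empty string and collapse whitespace afterwards. The names getTextCompact, text and element are made up for this illustration, and plain objects stand in for DOM nodes so the snippet also runs outside a browser.

```javascript
// Plain objects stand in for DOM nodes so the sketch runs outside a browser.
// text() and element() are hypothetical helpers, not DOM APIs.
function text(data) {
  return { nodeType: 3, data: data, firstChild: null, nextSibling: null };
}

function element(children) {
  var node = { nodeType: 1, firstChild: null, nextSibling: null };
  for (var i = 0; i < children.length; i++) {
    if (i === 0) {
      node.firstChild = children[i];
    } else {
      children[i - 1].nextSibling = children[i];
    }
  }
  return node;
}

// Same traversal as getText, but joins with "" and then collapses
// runs of whitespace, so inline elements do not introduce spaces.
function getTextCompact(n) {
  var parts = [];
  (function walk(node) {
    if (node.nodeType === 3) {          // TEXT_NODE
      parts.push(node.data);
    } else if (node.nodeType === 1) {   // ELEMENT_NODE
      for (var m = node.firstChild; m !== null; m = m.nextSibling) {
        walk(m);
      }
    }
  })(n);
  return parts.join("").replace(/\s+/g, " ").replace(/^ | $/g, "");
}

// Models: <p>Webmaster<span>World</span> forum</p>
var sample = element([text("Webmaster"), element([text("World")]), text(" forum")]);
console.log(getTextCompact(sample)); // "WebmasterWorld forum"
```

The trade-off is that adjacent block elements with no whitespace between them in the source will run together, so which join character works better depends on the markup you are dealing with.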

Hope this is enough to get you started.

5:10 pm on Mar 21, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 14, 2008
posts:144
votes: 0


> exactly <

I think that's a pretty standard regex string.

Mystring.replace(/<[^>]+>|&[^;]+;/g,'').replace(/ {2,}/g,' ')

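The first replace strips tags, simple HTML comments, and entities (entities are removed, not decoded). The second reduces two or more consecutive spaces to one space. A comment or attribute value that contains ">" will not be stripped cleanly.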
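
For what it's worth, here is a small sketch of that two-step replace applied to the markup from the first post. It assumes the tag pattern and the entity pattern are meant as alternatives (joined with |); stripMarkup is just a name picked for the example.

```javascript
// stripMarkup is a hypothetical name. The two patterns are joined with |
// so that a tag or an entity can each match on its own.
function stripMarkup(html) {
  return html
    .replace(/<[^>]+>|&[^;]+;/g, "") // drop tags and entities
    .replace(/ {2,}/g, " ");         // collapse runs of spaces
}

var sample = `<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.`;
console.log(stripMarkup(sample));
// This is my text which is anchored as well as bold but it's not so good.
```

Because entities are dropped rather than decoded, "a &amp; b" comes back as "a b". The entity pattern can also over-match when a bare & is followed by a ; later in the text, so treat this as a quick approximation and not a real HTML parser.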