Home / Forums Index / Code, Content, and Presentation / JavaScript and AJAX

JavaScript and AJAX Forum

    
extracting text from a web page
kadnan




msg:3607093
 6:20 am on Mar 21, 2008 (gmt 0)

Hi again.

Is it possible to extract the text of a web page? Suppose I have the following markup:

<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.

Can I fetch just the text within the tags? Keep in mind that the tags are not predefined.

Somebody suggested using Lynx on Linux. Is there a JavaScript-based solution?

 

Fotiman




msg:3607297
 2:26 pm on Mar 21, 2008 (gmt 0)

I'm not sure what you're trying to accomplish here. Do you want to get all of the text on the page without any of the HTML elements? In other words, if your page contained this:

<body>
<h1>Webmaster World</h1>
<p>
Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! <strong>Now get going!</strong>
</p>
</body>

Then you want to get a string like this:

Webmaster World Hello and welcome to Webmaster World! Now get going!

Is that what you're trying to accomplish?

kadnan




msg:3607357
 3:36 pm on Mar 21, 2008 (gmt 0)

exactly

Fotiman




msg:3607409
 5:01 pm on Mar 21, 2008 (gmt 0)


/**
 * Get all text nodes at or beneath node n, concatenate the text
 * and return it as a single string.
 * @param {DOMNode} n The node to get the text at
 * @return The concatenated string
 */
function getText(n) {
    var s = [];
    function getStrings(n, s) {
        var m;
        if (n.nodeType == 3) { // TEXT_NODE
            s.push(n.data);
        }
        else if (n.nodeType == 1) { // ELEMENT_NODE
            for (m = n.firstChild; null != m; m = m.nextSibling) {
                getStrings(m, s);
            }
        }
    }
    getStrings(n, s);
    var result = s.join(" ");
    return result;
}
alert(getText(document.body));

Note that this example joins all of the text nodes with a space between them. If you have inline elements like this:

Webmaster<span>World</span>

then the retrieved text will contain an extra space: "Webmaster World" instead of "WebmasterWorld".
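One possible way around the extra space is to join the text nodes with no separator and then collapse runs of whitespace. This is only a sketch: the name getTextCompact is made up for illustration, and it assumes the whitespace already in the page source is enough to keep words apart.

```javascript
// Hypothetical variant of getText: join text nodes without a separator,
// then collapse any run of whitespace into a single space.
function getTextCompact(n) {
    var parts = [];
    (function walk(node) {
        if (node.nodeType === 3) { // TEXT_NODE
            parts.push(node.data);
        } else if (node.nodeType === 1) { // ELEMENT_NODE
            for (var m = node.firstChild; m; m = m.nextSibling) {
                walk(m);
            }
        }
    })(n);
    return parts.join("")
        .replace(/\s+/g, " ")     // collapse whitespace runs (spaces, tabs, newlines)
        .replace(/^ | $/g, "");   // drop a leading or trailing space
}
```

The trade-off goes the other way here: adjacent block elements with no whitespace between them in the source, such as <h1>A</h1><p>B</p>, will come out as "AB".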

Hope this is enough to get you started.

fside




msg:3607417
 5:10 pm on Mar 21, 2008 (gmt 0)

> exactly <

I think that's a pretty standard regex string.

Mystring.replace(/<[^>]+>|&[^;]+;/g,'').replace(/ {2,}/g,' ')

The first replace strips tags, HTML comments (as long as they contain no ">"), and entities. The second reduces two or more consecutive spaces to one space.
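For anyone who wants to try the regex approach, here is a rough sketch wrapped in a function. It assumes the tag pattern and the entity pattern are meant as alternatives (joined with "|"), and the name stripMarkup is just a placeholder. The sample string is the paragraph from the example earlier in the thread.

```javascript
// Hypothetical helper: strip tags and entity references from an HTML string.
function stripMarkup(html) {
    return html
        .replace(/<[^>]+>|&[^;]+;/g, "") // remove tags and entity references
        .replace(/ {2,}/g, " ");         // collapse runs of spaces into one
}

var sample = 'Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! ' +
    '<strong>Now get going!</strong>';
var stripped = stripMarkup(sample);
// stripped is "Hello and welcome to Webmaster World! Now get going!"
```

Keep in mind that this removes entities instead of decoding them (so "&amp;" simply disappears), leaves newlines and any script or style contents in place, and can be confused by a ">" inside an attribute value or a stray "&" in the text.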

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved