JavaScript and AJAX Forum

extracting text from a web page

 6:20 am on Mar 21, 2008 (gmt 0)

Hi again.

I wonder whether it's possible to extract the text of a web page. Suppose I have the following markup:

<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.

Can I fetch just the text within the tags? Keep in mind that the tags are not predefined.

Somebody suggested using Lynx on Linux. Is there a JS-based solution?



 2:26 pm on Mar 21, 2008 (gmt 0)

I'm not sure what you're trying to accomplish here. Do you want to get all of the text on the page without any of the HTML elements? In other words, if your page contained this:

<h1>Webmaster World</h1>
Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! <strong>Now get going!</strong>

Then you want to get a string like this:

Webmaster World Hello and welcome to Webmaster World! Now get going!

Is that what you're trying to accomplish?


 3:36 pm on Mar 21, 2008 (gmt 0)



 5:01 pm on Mar 21, 2008 (gmt 0)

/**
 * Get all text nodes at or beneath node n, concatenate the text
 * and return it as a single string.
 * @param {DOMNode} n The node to get the text at
 * @return The concatenated string
 */
function getText(n) {
    var s = [];

    // Recursively collect the data of every text node under n into s.
    function getStrings(n, s) {
        var m;
        if (n.nodeType == 3) { // TEXT_NODE
            s.push(n.data);
        }
        else if (n.nodeType == 1) { // ELEMENT_NODE
            for (m = n.firstChild; null != m; m = m.nextSibling) {
                getStrings(m, s);
            }
        }
    }

    getStrings(n, s);
    var result = s.join(" ");
    return result;
}

Note, in this example I concatenate all of the text nodes putting a space between them. If, however, you have inline elements like this:


Then you'll end up with an extra space in the text that is retrieved.
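Here is a rough sketch of that extra-space effect. The markup `<p>Web<b>master</b>World</p>` is a made-up example, not the one from the post, and the `text` and `element` helpers are hypothetical stand-ins for DOM nodes so the snippet can run outside a browser. `collectText` repeats the same traversal but takes the separator as a parameter, which is an assumption added for illustration.

```javascript
// Hypothetical helpers that build minimal stand-ins for DOM nodes.
// Only the properties the traversal reads are modeled.
function text(data) {
    return { nodeType: 3, data: data, firstChild: null, nextSibling: null };
}

function element() {
    var children = Array.prototype.slice.call(arguments);
    // Link the children together the way the DOM does.
    for (var i = 0; i < children.length - 1; i++) {
        children[i].nextSibling = children[i + 1];
    }
    return { nodeType: 1, firstChild: children[0] || null, nextSibling: null };
}

// Same recursive walk over text and element nodes, with a configurable separator.
function collectText(n, separator) {
    var s = [];
    (function walk(node) {
        if (node.nodeType == 3) {            // TEXT_NODE
            s.push(node.data);
        } else if (node.nodeType == 1) {     // ELEMENT_NODE
            for (var m = node.firstChild; m != null; m = m.nextSibling) {
                walk(m);
            }
        }
    })(n);
    return s.join(separator);
}

// Models <p>Web<b>master</b>World</p> (made-up markup).
var p = element(text("Web"), element(text("master")), text("World"));

console.log(collectText(p, " ")); // Web master World
console.log(collectText(p, ""));  // WebmasterWorld
```

Joining with an empty string avoids the stray spaces inside a word, but it also glues together text from neighbouring block elements, so which separator is right depends on the markup you expect.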

Hope this is enough to get you started.


 5:10 pm on Mar 21, 2008 (gmt 0)

> exactly <

I think that's a pretty standard regex string.

Mystring.replace(/<[^>]+>|&[^;]+;/g,'').replace(/ {2,}/g,' ')

The first replace strips tags, HTML comments, and entities (a comment is caught by the tag pattern as long as it contains no > character). The second reduces two or more consecutive spaces to one space.
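As a minimal sketch of the regex approach, assuming the tag pattern and the entity pattern are meant as alternatives joined with `|`: the function name `stripMarkup` is hypothetical, and the sample string is the markup from the earlier reply with a single space standing in for the line break.

```javascript
// Hypothetical helper: strips markup with two regex passes.
function stripMarkup(html) {
    return html
        .replace(/<[^>]+>|&[^;]+;/g, "") // drop tags, simple comments, and entities
        .replace(/ {2,}/g, " ");         // collapse runs of spaces into one
}

var sample = '<h1>Webmaster World</h1> Hello and welcome to ' +
    '<a href="webmasterworld.com">Webmaster World</a>! ' +
    '<strong>Now get going!</strong>';

console.log(stripMarkup(sample));
// Webmaster World Hello and welcome to Webmaster World! Now get going!
```

Note that entities are removed rather than decoded, so `&amp;` simply disappears, and a bare `&` followed later by a `;` would take the text in between with it. For arbitrary pages the DOM traversal is the safer route.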

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved