I wonder whether it's possible to extract the text of a web page. Suppose I have the following markup:
<p>This is my text which is <a href="#">anchored</a> as well as bold but <div> it's not so good</div>.
Can I fetch just the text within the tags? Keep in mind that the tags are not predefined.
Somebody suggested using Lynx on Linux. Is there a JavaScript-based solution?
Say you have markup like this:
<body>
<h1>Webmaster World</h1>
<p>
Hello and welcome to <a href="webmasterworld.com">Webmaster World</a>! <strong>Now get going!</strong>
</p>
</body>
Then you want to get a string like this:
Webmaster World Hello and welcome to Webmaster World! Now get going!
Is that what you're trying to accomplish?
/**
 * Get all text nodes at or beneath node n, concatenate the text
 * and return it as a single string.
 * @param {DOMNode} n The node to get the text at
 * @return The concatenated string
 */
function getText(n) {
    var s = [];
    function getStrings(n, s) {
        var m;
        if (n.nodeType == 3) { // TEXT_NODE
            s.push(n.data);
        }
        else if (n.nodeType == 1) { // ELEMENT_NODE
            for (m = n.firstChild; null != m; m = m.nextSibling) {
                getStrings(m, s);
            }
        }
    }
    getStrings(n, s);
    var result = s.join(" ");
    return result;
}
alert(getText(document.body));
Note that in this example I concatenate all of the text nodes with a space between them. If, however, you have inline elements like this:
Webmaster<span>World</span>
Then you'll end up with an extra space in the text that is retrieved ("Webmaster World" instead of "WebmasterWorld").
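One possible way around that, as a rough sketch rather than a definitive fix: join the text nodes with no separator and then collapse runs of whitespace into single spaces. The name getTextCompact is made up for this example. It assumes the page source already has whitespace (spaces or line breaks) between block-level elements; if two blocks sit directly against each other in the source, their text will run together.

```javascript
// Sketch only: getTextCompact is a hypothetical name, not a built-in.
// Assumes the markup has whitespace between block-level elements.
function getTextCompact(n) {
    var s = [];
    function getStrings(n) {
        var m;
        if (n.nodeType === 3) { // TEXT_NODE
            s.push(n.data);
        }
        else if (n.nodeType === 1) { // ELEMENT_NODE
            for (m = n.firstChild; m != null; m = m.nextSibling) {
                getStrings(m);
            }
        }
    }
    getStrings(n);
    // No separator, so inline elements do not split words apart.
    // Then collapse whitespace runs and trim both ends.
    return s.join("").replace(/\s+/g, " ").trim();
}
```

In a browser you would call it the same way as before, e.g. getTextCompact(document.body). Current browsers also expose a textContent property on nodes that returns the concatenated text without a separator, though on document.body it includes the contents of script and style elements too.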
Hope this is enough to get you started.