How to get content of html page using DOM

How to get source of html page using DOM to save as separate file

kokilakr

4:57 am on Jun 1, 2010 (gmt 0)

Hello,

Can you please tell me how to get the content/source of a web page (.html), preserving its current state, using the standard JavaScript/DOM interface?

I need to save the web page source to another .html file while preserving its current state.

Thanks,

astupidname

6:54 am on Jun 1, 2010 (gmt 0)

There are several issues with doing so (why would you want to?):

1. There is no good, reliable cross-browser method of accessing the page's doctype declaration from the client side. It is best retrieved on the server side by parsing the raw file for its doctype declaration with php.

2. document.getElementsByTagName('html')[0].innerHTML will contain the contents of the html element in its "live" state. One problem I see already is that IE does not seem to include quotes around element attributes such as id in elements which were added via innerHTML (I have not checked elements added via createElement, or many other attributes, but that is now irrelevant because the problem exists). As for the html element itself, you may want to check its attributes collection and include them; parsing them from the raw file on the server side should also be fine, since it is highly unlikely any client-side code would change the html element's attributes.

3. So, in theory, a person could capture the html element's innerHTML, send it to the server side via ajax, and have server-side code parse the raw page for its doctype and possibly the html element's attributes (and the opening xml declaration, if the page uses one with an xhtml doctype). But as stated in 2 about the quotes, there may be other issues too, and you may still wind up with malformed html in the end. Who knows, you may have javascript which changes the stylesheet objects, which could leave the changed styles out of the innerHTML. At any rate, once you get the info to the server side, it's a piece of cake to save a file with php, though again I will state it may not be an exact "carbon copy".

4. Usually things of this nature are (pardon) folly: they may not have been properly thought out and are better handled another way, IMO. What would you want to save the page's "live" state for? Is it really necessary to capture any/all changes that may have been made to the page via javascript?
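For illustration, the capture described in points 1 to 3 could be sketched roughly as below. This is only a sketch under assumptions: it presumes the browser exposes document.doctype with name, publicId and systemId properties (which is exactly what point 1 says cannot be relied on everywhere), and the helper names doctypeToString, openTag and snapshotDocument are made up for this example. It does not address the IE quoting problem or changed stylesheet objects.

```javascript
// Hypothetical helper: rebuild a doctype declaration from a DocumentType-like
// object. Returns '' when no doctype is available.
function doctypeToString(doctype) {
  if (!doctype) return '';
  var s = '<!DOCTYPE ' + doctype.name;
  if (doctype.publicId) {
    s += ' PUBLIC "' + doctype.publicId + '"';
    if (doctype.systemId) s += ' "' + doctype.systemId + '"';
  } else if (doctype.systemId) {
    s += ' SYSTEM "' + doctype.systemId + '"';
  }
  return s + '>';
}

// Hypothetical helper: rebuild the opening tag of an element from its
// attributes collection, always quoting the attribute values.
function openTag(el) {
  var s = '<' + el.tagName.toLowerCase();
  for (var i = 0; i < el.attributes.length; i++) {
    var attr = el.attributes[i];
    var value = String(attr.value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
    s += ' ' + attr.name + '="' + value + '"';
  }
  return s + '>';
}

// Hypothetical helper: doctype + <html ...> + live innerHTML + </html>.
function snapshotDocument(doc) {
  var root = doc.documentElement;
  var doctype = doctypeToString(doc.doctype);
  return (doctype ? doctype + '\n' : '') +
    openTag(root) + root.innerHTML +
    '</' + root.tagName.toLowerCase() + '>';
}
```

In a browser this would be called as snapshotDocument(document), and the resulting string is what would then be sent off to be saved.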

p.s. welcome to webmasterworld!

astupidname

7:02 am on Jun 1, 2010 (gmt 0)

5. LOL, if the user is using Firebug, there will be an extra div added in to your page: <div firebugversion="1.5.3" style="display: none;" id="_firebugConsole"></div>
not to mention greasemonkey issues too...

kokilakr

7:16 am on Jun 1, 2010 (gmt 0)

Hello,

Thanks for the response.

1. I need to do something similar to "Save as complete web page", while downloading all images and .js and resources included in the webpage and save the HTML page with modified contents. And I need current state of javascript to be maintained in my saved .html file.

2. Here I have a problem saving the .html file.

As you suggested, I have tried to get the HTML content as follows:

a.

document.childNodes[i].outerHTML;

This is not part of the DOM standard. It maintains the JavaScript state, but entities like &nbsp; and &amp; are also displayed.

b.

document.getElementsByTagName('html')[0].innerHTML;

This also maintains state, but entities like &nbsp; and &amp; are also displayed.

3. So basically I need some way to get the HTML source using the standard DOM interface, one that maintains state and also shows entities as proper characters.

Thanks in advance

[edited by: kokilakr at 7:36 am (utc) on Jun 1, 2010]

kokilakr

7:31 am on Jun 1, 2010 (gmt 0)

Hello,

some more information,

1. I cannot use php. Javascript is the only option.
2. I need this to work for QT Webkit.

Thanks,

astupidname

7:34 am on Jun 1, 2010 (gmt 0)

How much javascript is on the page, and how many things are potentially being changed (depending on user action)? This is one of the big issues. Odds are you may be better off reading javascript variable states (keeping a javascript variable accessible for each potential change to the page), ignoring the page's actual contents on the browser side, sending the states to php on the server side, and having php echo the page contents with the mimicked javascript variable states. Of course, if there is page content which varies from the initial php output of the page, that will need to be captured as well...

astupidname

7:35 am on Jun 1, 2010 (gmt 0)

Oh, well I'd have to say I think you may be SOL then..., but I don't know much about QT Webkit

kokilakr

7:43 am on Jun 1, 2010 (gmt 0)

"Oh, well I'd have to say I think you're SOL then..."

What do you mean by SOL? Sadly Outta Luck?

How can I go forward?

astupidname

7:49 am on Jun 1, 2010 (gmt 0)

Not "Sadly" but that's very close, think doggy doo-doo
I don't know, those are all my suggestions, maybe someone else will pipe up with an idea or two.

whoisgregg

9:13 pm on Jun 1, 2010 (gmt 0)

If the only real problem with the output is html entities, then there are javascript html entity decoder scripts out there.
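As a rough idea of what such a decoder script does, here is a minimal sketch. The function name decodeEntities and the tiny NAMED_ENTITIES table are invented for this example; a real script would need the full list of named entities.

```javascript
// Illustrative subset only; the real HTML entity list is far longer.
var NAMED_ENTITIES = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0'
};

// Hypothetical helper: replace numeric and (a few) named entities with the
// characters they stand for. Unknown entities are left untouched.
function decodeEntities(text) {
  var pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return text.replace(pattern, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Out-of-range code points are left as they were.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

One caution: running this over a whole innerHTML string would also turn &lt; and &amp; inside text into literal < and &, which can change what the saved markup means, so it is probably only safe for entities like &nbsp; or on text you have already pulled out of the markup.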

john_k

1:23 pm on Jun 3, 2010 (gmt 0)

document.body.innerHTML will give you the current state of all HTML between the <body> and </body> tags. That includes anything the user has typed or clicked and anything that was done to alter the HTML via javascript.

Fotiman

1:36 pm on Jun 3, 2010 (gmt 0)

Note, I seem to recall a few years ago having a problem with password field values not being included when doing something similar to this, so if you use the innerHTML approach, make sure you test it well.
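The likely reason is that innerHTML serializes attributes, not the live properties a user changes by typing or clicking. One possible workaround, sketched below under that assumption, is to copy the current properties back into attributes before reading innerHTML. The function name syncFormState is made up for this example, and it only handles input and textarea elements (select options would need similar treatment).

```javascript
// Hypothetical helper: copy live form state into the markup so that a later
// innerHTML read includes it. Modifies the elements under `root` in place.
function syncFormState(root) {
  var inputs = root.getElementsByTagName('input');
  for (var i = 0; i < inputs.length; i++) {
    var el = inputs[i];
    var type = (el.type || 'text').toLowerCase();
    if (type === 'checkbox' || type === 'radio') {
      // Mirror the checked property into the checked attribute.
      if (el.checked) {
        el.setAttribute('checked', 'checked');
      } else {
        el.removeAttribute('checked');
      }
    } else if (type !== 'file') {
      // Mirror the typed value (including passwords) into the value attribute.
      el.setAttribute('value', el.value);
    }
  }
  var areas = root.getElementsByTagName('textarea');
  for (var j = 0; j < areas.length; j++) {
    // A textarea's serialized content is its child text, not its value.
    areas[j].textContent = areas[j].value;
  }
}
```

Usage would be syncFormState(document) followed by reading document.body.innerHTML. Keep in mind this writes password values into the saved file in plain text, so test it and decide whether that is acceptable.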