parse a html page and generate a report like how many links,images,words d

         

asacool

5:23 am on Apr 2, 2009 (gmt 0)

10+ Year Member



I'm stumped with PHP and regular expressions again.

I need to parse an HTML page and generate a report of how many links, images, and words the page holds.

How can I do this?

I also need to read the link paths (hrefs) and image paths and replace them with a predefined path (maybe a PHP variable). Is there any way I could achieve this?

coopster

7:58 pm on Apr 7, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Yes, but it is exactly as you specified. You need to read the HTML page into memory as a variable and parse it with pattern matching, for which regular expressions are ideal.
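
For the path replacement, here is a rough sketch rather than a finished solution. It assumes the page is already in a string (in practice probably loaded with file_get_contents()), that attribute values are quoted, and that "replace with a predefined path" means swapping the directory while keeping the file name. The names rewritePaths and $newBase are made up for illustration.

```php
<?php
// Hypothetical helper: point every quoted href/src at a predefined base path.
// Assumption: attribute values are wrapped in single or double quotes.
function rewritePaths(string $html, string $newBase): string
{
    // Group 1: attribute name, group 2: the quote character, group 3: old value.
    $pattern = '/(?<![\w-])(href|src)\s*=\s*(["\'])(.*?)\2/i';

    return preg_replace_callback($pattern, function (array $m) use ($newBase) {
        // Keep only the file name of the old path; any query string is dropped.
        $path = parse_url($m[3], PHP_URL_PATH);
        $file = basename((string) $path);
        return $m[1] . '=' . $m[2] . $newBase . $file . $m[2];
    }, $html);
}

// In practice: $html = file_get_contents('page.html');
$html = '<a href="/old/about.html">About</a> <img src=\'pics/logo.png\' alt="">';
echo rewritePaths($html, '/new/path/'), "\n";
```

Values without a path, such as href="#", end up as just the base path here, so a real script would probably want to skip those.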

rocknbil

8:50 pm on Apr 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



any way i could achieve this?

A complex task is not really complex if you break it down into parts and address each part individually. Work from general to specific.

I would do this like so:

- Create regexps that extract all the HTML/XHTML tags from the page. At this point, I'm not doing any specific identifying: if it starts with a < and ends with a >, it "matches."

- Once that is working, remove the HTML from the page, storing the page sans HTML into a second variable or array, based on the work done above.

- Once that is working, you should now have text only; split it on newlines or spaces and count it. This part is more or less done (complicated only slightly by <head> content, but you could return to this later as I'm about to do with HTML . . . )

- Now return to the HTML chunk in step 1. Start building regexps to sort out the elements you want to count. Start with links, then images . . . etc.

- Once you have regexps that work, revise step 1 so it gets all the HTML and splits/counts in the same step. (Part of this, of course, is identifying and ignoring any closing tags.)
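
The steps above could be sketched roughly like this in PHP. The function name pageReport and the sample markup are made up for illustration, and the simple <[^>]+> pattern assumes there are no > characters inside attribute values or comments.

```php
<?php
// Hypothetical helper: count links, images, and words in an HTML string.
function pageReport(string $html): array
{
    // Step 1: grab everything that starts with < and ends with >.
    preg_match_all('/<[^>]+>/', $html, $m);
    $tags = $m[0];

    // Step 2: remove the tags, leaving text only (the space keeps words apart).
    $text = preg_replace('/<[^>]+>/', ' ', $html);

    // Step 3: split the text on whitespace and count the pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Step 4: sort through the tags; closing tags such as </a> never match.
    $report = ['links' => 0, 'images' => 0, 'words' => count($words)];
    foreach ($tags as $tag) {
        if (preg_match('/^<a\b[^>]*\bhref\s*=/i', $tag)) {
            $report['links']++;
        } elseif (preg_match('/^<img\b/i', $tag)) {
            $report['images']++;
        }
    }
    return $report;
}

// In practice the input might come from file_get_contents('page.html').
$html = '<html><body><p>Hello <a href="a.html">big</a> world</p>'
      . '<img src="x.png" alt=""><a href="b.html">more</a></body></html>';
print_r(pageReport($html));
```

This keeps the tag extraction and the counting as separate passes, which makes each regexp easy to check on its own before merging them.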

Rinse and repeat for any fine-tuning you want to do for the word splits.
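
One possible round of that fine-tuning, again only a sketch: drop <head>, <script>, and <style> blocks before stripping tags, decode entities, and count runs of letters or digits as words. The helper name countVisibleWords and this definition of a "word" are assumptions, so adjust to taste.

```php
<?php
// Hypothetical helper: count visible words, ignoring <head>, <script>, <style>.
function countVisibleWords(string $html): int
{
    // Drop whole blocks whose content is not shown as page text.
    $html = preg_replace('#<(head|script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);

    // Strip the remaining tags, then turn entities such as &amp; into characters.
    $text = preg_replace('/<[^>]+>/', ' ', $html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // A word is a run of letters/digits, optionally joined by ' or - inside.
    $pattern = '/[\p{L}\p{N}]+(?:[\'-][\p{L}\p{N}]+)*/u';
    return (int) preg_match_all($pattern, $text);
}

$html = '<html><head><title>Ignore me</title></head>'
      . '<body><p>Tom &amp; Jerry\'s well-known show</p>'
      . '<script>var x = 1;</script></body></html>';
echo countVisibleWords($html), "\n";
```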