Forum Moderators: coopster
I need to parse an HTML page and generate a report of how many links, images, and words the page holds.
How can I do this?
I also need to read the link paths (hrefs) and image paths and replace them with a predefined path (maybe a PHP variable). Is there any way I could achieve this?
A complex task is not really complex if you break it down into parts and address each part individually. Work from general to specific.
I would do this like so:
- Create regexps that extract all the HTML/XHTML from the page. At this point, I'm not identifying anything specific: if it starts with a < and ends with a >, it "matches."
- Once that is working, remove the HTML from the page, storing the page sans HTML into a second variable or array, based on the work done above.
- Once that is working, you should now have text only; split it on newlines or spaces and count it. This part is more or less done (complicated only slightly by <head> content, but you could return to this later as I'm about to do with HTML . . . )
- Now return to the HTML chunk in step 1. Start building regexps to sort out the elements you want to count. Start with links, then images . . . etc.
- Once you have regexps that work, revise step 1 so it gets all the HTML and splits/counts in the same step. (Part of this, of course, is identifying and ignoring any closing tags.)
Rinse and repeat for any fine-tuning you want to do for the word splits.
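The steps above can be sketched roughly like this. The function name `page_report` is made up for illustration, and the regexps are deliberately naive: they handle simple, well-formed markup only (comments, CDATA, and `>` inside attribute values will trip them up, which is why a real parser such as DOMDocument is usually the better tool).

```php
<?php
// A minimal sketch of steps 1-4: grab the tags, count links/images among
// them, strip the tags, then split and count the remaining words.
function page_report($html) {
    // Step 1: everything that looks like a tag -- from "<" to ">".
    preg_match_all('/<[^>]+>/', $html, $tags);

    // Step 4: count opening <a> and <img> tags, ignoring closing tags.
    $links  = count(preg_grep('/^<a\b/i',   $tags[0]));
    $images = count(preg_grep('/^<img\b/i', $tags[0]));

    // Step 2: remove the HTML, keeping the page text in its own variable.
    $text = preg_replace('/<[^>]+>/', ' ', $html);

    // Step 3: split the text on whitespace and count the pieces.
    $words = count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));

    return array('links' => $links, 'images' => $images, 'words' => $words);
}

$r = page_report('<p>Hello <a href="/x">world</a> <img src="a.png"></p>');
// $r is array('links' => 1, 'images' => 1, 'words' => 2)
```

Note that this counts `<head>` content (title, script text) as words too; filtering that out first is the fine-tuning step mentioned above.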
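For the second part of the question (rewriting the link and image paths), one rough approach is a callback replacement over `href="..."` and `src="..."` attributes. The `$base_path` variable and `rewrite_paths` function are hypothetical names, and this sketch assumes double-quoted attribute values; adapt as needed.

```php
<?php
// Prepend a predefined base path (e.g. a PHP variable) to every href/src.
function rewrite_paths($html, $base) {
    return preg_replace_callback(
        '/\b(href|src)\s*=\s*"([^"]*)"/i',
        function ($m) use ($base) {
            // Drop any leading slash so the value joins cleanly onto $base.
            return $m[1] . '="' . $base . ltrim($m[2], '/') . '"';
        },
        $html
    );
}

$base_path = 'http://example.com/mirror/';   // the predefined path
echo rewrite_paths('<a href="/page.html">x</a>', $base_path);
// <a href="http://example.com/mirror/page.html">x</a>
```

The same caveat applies: this is string surgery, not parsing, so it will also rewrite anything that merely looks like an href/src attribute (for example inside a comment).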