Scraping for fun and profit

Fixing Invalid HTML?

vabtz

6:51 pm on Jun 15, 2005 (gmt 0)

Hey there.

I wrote a scraper to pull data from one of my affiliate merchants because they won't give me a proper feed.

I originally planned on using an XML parser to pull the data I needed to make my own feed. Turns out their HTML is hosed and invalid. It's hideous, in fact.

Does anyone have any tools or advice for turning invalid HTML into valid HTML?

coopersita

10:02 pm on Jun 15, 2005 (gmt 0)

Have you tried HTML Tidy?

[w3.org...]

There is even an online version:

[infohound.net...]
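
For anyone who wants to run Tidy from a script rather than the online version, here is a minimal sketch. It assumes the `tidy` command-line tool is installed and on the PATH, and it uses Python purely for illustration; the function names `build_tidy_command` and `tidy_html` are made up for this example.

```python
import shutil
import subprocess


def build_tidy_command(executable="tidy"):
    """Build a command line asking Tidy for XHTML output even on bad input."""
    return [
        executable,
        "-quiet",                  # suppress the summary banner
        "-utf8",                   # treat input and output as UTF-8
        "-asxhtml",                # emit well-formed XHTML
        "--force-output", "yes",   # write output even when errors are found
        "--show-warnings", "no",   # keep stderr short
    ]


def tidy_html(raw_html):
    """Pipe raw_html through the tidy executable and return the cleaned markup."""
    cmd = build_tidy_command()
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(
        cmd, input=raw_html, capture_output=True, text=True, encoding="utf-8"
    )
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors). With
    # --force-output it may still produce usable markup after errors,
    # so only give up when nothing came back at all.
    if not result.stdout:
        raise RuntimeError("tidy produced no output: " + result.stderr.strip())
    return result.stdout
```

The cleaned XHTML can then go to an ordinary XML parser, though badly broken pages may still need some manual prep first.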

willybfriendly

10:23 pm on Jun 15, 2005 (gmt 0)

Does the code look any better if you strip out everything you don't need? I have a similar situation and was able to clean it up with judicious use of strip_tags and preg_replace. The problem is that it is so specific that I have to watch their pages - if anything changes, mine will break :(

WBF
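
A rough sketch of the strip-and-regex approach described above, written in Python as a stand-in for PHP's strip_tags and preg_replace. The sample markup, the price pattern, and the function names are all hypothetical; a real scraper would be tuned to the merchant's actual pages.

```python
import re
from html import unescape


def strip_to_text(raw_html):
    """Reduce messy HTML to plain text using regexes (crude and site-specific)."""
    # Drop script and style blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Remove every remaining tag, whether or not it was ever closed.
    text = re.sub(r"<[^>]*>", " ", text)
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical pattern: a dollar sign followed by a price.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")


def extract_prices(raw_html):
    """Pull every dollar amount out of the stripped text."""
    return [float(p) for p in PRICE_RE.findall(strip_to_text(raw_html))]


# Invalid markup: unclosed <td> and <b>, unquoted attribute, missing </tr>.
sample = (
    "<table><tr><td><b>Widget<td><font color=red>$19.99</font></tr>"
    "<tr><td>Gadget &amp; Case<td>$5</table>"
)
print(strip_to_text(sample))
print(extract_prices(sample))
```

Because the tags are thrown away rather than parsed, it does not matter that the markup is invalid. The trade-off is the one already mentioned: the patterns are tied to the current page layout.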

vabtz

10:46 pm on Jun 15, 2005 (gmt 0)

coopersita:
Great suggestion! I'll look into it. I did use Tidy via Firefox, but I didn't make the connection. I think the markup may be too broken for Tidy as well, but I can probably prep it some before sending it to Tidy.

willybfriendly:
Yeah. That's what I did as well. I also wrote a few functions that check whether the format has changed. But it's a stopgap solution, really.

Maybe I should get them to fire their webmaster lol
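
The format checks mentioned above could look something like this minimal Python sketch. The markers, the row threshold, and the `check_format` name are assumptions for illustration, not the poster's actual functions.

```python
import re

# Hypothetical markers the scraper relies on. Real ones would be taken
# from the merchant's actual pages.
REQUIRED_MARKERS = [
    re.compile(r'<table[^>]*class="?products"?', re.I),
    re.compile(r"\$\s*\d+\.\d{2}"),
]
MIN_ROWS = 1


def check_format(raw_html):
    """Return a list of problems; an empty list means the layout looks as expected."""
    problems = []
    for marker in REQUIRED_MARKERS:
        if not marker.search(raw_html):
            problems.append("missing marker: " + marker.pattern)
    # Count opening <tr> tags as a rough measure of how many rows are present.
    rows = len(re.findall(r"<tr\b", raw_html, re.I))
    if rows < MIN_ROWS:
        problems.append("expected at least %d rows, found %d" % (MIN_ROWS, rows))
    return problems


good = "<table class=products><tr><td>Widget<td>$19.99</table>"
bad = '<div id="items"><p>Widget 19.99</p></div>'
print(check_format(good))
print(check_format(bad))
```

Running a check like this before each scrape and refusing to update the feed when it reports problems turns a silent break into a loud one, though it is still a stopgap.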