any tips on getting crap HTML pages into DB?

inherited a poorly coded site...

Craig_F

4:04 pm on May 12, 2005 (gmt 0)

I have about 250 pages of HTML, all from the same site, but they are written in old-school HTML (font tags and all that) and spread across about 4 different designs, so even the HTML that is there isn't consistent.

Any tips on how to get just the body content out so I can get it into a DB? I have tried a program that strips the HTML, and it does that well, but it also removes the pictures, links, etc., and I can't have that.

This must be a fairly common task, so where do you guys start?

kazecoder

4:39 pm on May 12, 2005 (gmt 0)

I would think you could use PHP to open each file, parse out everything between the <body> and </body> tags, and then use a SQL INSERT statement to put it in the database.
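
For anyone following along, here is a minimal sketch of that idea. It is written in Python rather than PHP, and it uses SQLite purely for illustration; the table name `pages`, its columns, and the helper names `extract_body` and `store_page` are all made up for this example, not anything from the original site. It keeps the markup inside the body (images, links and all) and only drops what is outside it.

```python
import re
import sqlite3

# Matches everything between <body ...> and </body>, case-insensitively and
# across line breaks. Old tag-soup pages often write <BODY BGCOLOR=...>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html):
    """Return the markup inside <body>, or the whole input if no body tag is found."""
    match = BODY_RE.search(html)
    return match.group(1).strip() if match else html.strip()


def store_page(conn, filename, html):
    """Insert one page's body markup; the parameterized query handles quoting."""
    conn.execute(
        "INSERT INTO pages (filename, body) VALUES (?, ?)",
        (filename, extract_body(html)),
    )


if __name__ == "__main__":
    # Hypothetical in-memory database and schema, for demonstration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (filename TEXT PRIMARY KEY, body TEXT)")
    sample = (
        '<HTML><BODY BGCOLOR="#FFFFFF"><p>Hi <a href="x.html">link</a></p>\n'
        '<img src="a.gif"></BODY></HTML>'
    )
    store_page(conn, "index.html", sample)
    conn.commit()
    print(conn.execute("SELECT body FROM pages").fetchone()[0])
```

With 250 files you would loop over the directory and call `store_page` once per file. A regular expression is a blunt tool for HTML, so spot-check a few pages from each of the 4 designs before trusting the result.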

Longhaired Genius

4:41 pm on May 12, 2005 (gmt 0)

I use HTML-Kit for Windows. It includes HTML-Tidy, which can show and fix errors and strip font tags automatically. Then, if necessary, I use its find/replace with simple regular expressions.
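
As a rough illustration of the "simple regular expressions" step, here is a small Python sketch that removes font tags while leaving the text, links and images between them untouched. This is not how HTML-Kit or HTML-Tidy does it internally; the function name `strip_font_tags` is invented for the example.

```python
import re

# Matches opening <font ...> and closing </font> tags only,
# not the content that sits between them.
FONT_TAG_RE = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def strip_font_tags(html):
    """Remove font tags but keep everything they wrapped."""
    return FONT_TAG_RE.sub("", html)


if __name__ == "__main__":
    sample = '<FONT face="Arial" size=2>Hello <a href="x.html">link</a></FONT>'
    print(strip_font_tags(sample))
```

The same pattern works in most editors' find/replace if they support regular expressions: search for the expression and replace with nothing.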

Craig_F

5:37 pm on May 12, 2005 (gmt 0)

I don't know regular expressions or PHP, but simple regular expressions shouldn't be too hard to learn. Thanks for the ideas. I guess I could even get a macro to do some of this...