How to strip out everything in a web page but the content?

I think this is a regex question

MrSpeed

3:01 pm on May 2, 2003 (gmt 0)

I have a large web site that I would like to convert to a database-driven site. What I am trying to do is extract the content and stuff it into a database so the site can be presented with a single template.

The content of each page luckily has comment tags to mark the start of the content and pretty consistent markup to denote the end of the content.

<!--Start Content-->
blah blah blah
</td>
</tr>
</table>
<table width="740">.....

I have tried to use the global find/replace tool and regular expressions in HomeSite to strip everything in the file before the <!--Start Content--> comment, but I get errors. I also tried other text editors, but there the regular expression only matches a single line and does not reach back to the beginning of the file.

I have tried
.*<!--Start Content-->
^.*<!--Start Content-->
^^.*<!--Start Content-->

The site is on IIS without Perl, so I am looking for a desktop-type solution. I have HomeSite, Dreamweaver, and a few different text editors.

Any ideas please?

RonPK

4:22 pm on May 2, 2003 (gmt 0)

HomeSite should do it. However, you need to escape the hyphens in your regexp pattern, like this:

^.*<!\-\-Start Content\-\->
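For anyone who would rather script the extraction than fight an editor's regex engine, here is a minimal Python sketch of the same idea. It assumes the start marker is exactly the comment shown above and that the end of the content is the closing-table markup from the original post; the names `extract_content`, `START_MARKER` and `END_PATTERN` are made up for illustration. The one-line-only matching described earlier is typical of engines where `.` does not match newlines, which is why the sketch passes `re.DOTALL`.

```python
import re

# Assumed markers, taken from the posts above; adjust to the real pages.
START_MARKER = "<!--Start Content-->"
END_PATTERN = r'</td>\s*</tr>\s*</table>\s*<table width="740">'

# re.DOTALL lets "." cross line breaks, so the match can span the whole file.
# The non-greedy "(.*?)" stops at the first occurrence of the end markup.
CONTENT_RE = re.compile(
    re.escape(START_MARKER) + r"(.*?)" + END_PATTERN,
    re.DOTALL | re.IGNORECASE,
)


def extract_content(html):
    """Return the text between the start comment and the end markup, or None."""
    match = CONTENT_RE.search(html)
    return match.group(1).strip() if match else None


page = """<html><body>
<table><tr><td>
<!--Start Content-->
blah blah blah
</td>
</tr>
</table>
<table width="740"><tr><td>footer</td></tr></table>
</body></html>"""

print(extract_content(page))  # prints: blah blah blah
```

`re.escape` takes care of the hyphens and any other special characters in the marker, so there is no need to escape them by hand.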

rayvd

4:53 pm on May 6, 2003 (gmt 0)

If you have access to a Linux/Un*x box...

% lynx -dump <url>

% links -dump <url>

links is a little nicer IMHO. This may or may not do what you wanted, but it does a nice job of stripping the tags, and would be worth trying before you go and write a Perl script to do it or something :)

MrSpeed

5:22 pm on May 6, 2003 (gmt 0)

Well, since I'm a little better at ASP than Unix languages, I wrote a script to complete the task. It used a little regexp and some brute-force replace functions. I've got close to 600 pages in an Access database now, which will be great going forward.
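The ASP script itself was not posted, so the following is only a rough sketch of the same kind of batch job in Python, not a reconstruction of it. It swaps in SQLite for Access, and the table name `pages`, the function `load_pages` and the `latin-1` encoding are all assumptions made for the example.

```python
import re
import sqlite3
import tempfile
from pathlib import Path

# Same assumed markers as in the original post; "." must span lines (re.DOTALL).
CONTENT_RE = re.compile(
    r'<!--Start Content-->(.*?)</td>\s*</tr>\s*</table>\s*<table width="740">',
    re.DOTALL | re.IGNORECASE,
)


def load_pages(site_dir, conn):
    """Extract the content of every .htm/.html file under site_dir into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, content TEXT)"
    )
    loaded = 0
    for html_file in sorted(Path(site_dir).rglob("*.htm*")):
        if not html_file.is_file():
            continue
        text = html_file.read_text(encoding="latin-1")  # assumed encoding
        match = CONTENT_RE.search(text)
        if match is None:
            continue  # no markers found: leave this page for manual review
        rel_path = html_file.relative_to(site_dir).as_posix()
        conn.execute(
            "INSERT OR REPLACE INTO pages (path, content) VALUES (?, ?)",
            (rel_path, match.group(1).strip()),
        )
        loaded += 1
    conn.commit()
    return loaded


# Demo on a throwaway directory with one matching page.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "about.html").write_text(
        '<table><tr><td><!--Start Content-->\nAbout us\n</td>\n</tr>\n</table>\n'
        '<table width="740">',
        encoding="latin-1",
    )
    demo_conn = sqlite3.connect(":memory:")
    print(load_pages(demo_dir, demo_conn))
    print(demo_conn.execute("SELECT path, content FROM pages").fetchall())
```

Pages without the markers are skipped rather than guessed at, so comparing the returned count with the number of files shows how many still need attention.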

grahamstewart

12:18 am on May 7, 2003 (gmt 0)

If you can use PHP, then the strip_tags() function does this for you.
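If PHP is not available either, a rough Python analogue can be put together with the standard library's `html.parser`. This is a sketch, not a port of PHP's function: the class `TagStripper` and the Python `strip_tags` helper are illustrative names, and dropping the contents of script and style blocks is a choice made for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text nodes, ignoring tags and script/style contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML fragment."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_tags("<p>Hello <b>world</b> &amp; friends</p><script>var x = 1;</script>"))
# prints: Hello world & friends
```

Note that this removes all markup, including any formatting inside the content area, so it suits plain-text extraction better than moving HTML content into a template.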

Chuma

12:26 am on May 7, 2003 (gmt 0)

NoteTab allows you to strip HTML from a file (you can preserve URLs if you wish).

I use it quite regularly when I want to read information from a website but don't want to download all the other elements of the page as well.

Thanks.