How to strip out everything in a web page but the content?

I think this is a regex question

         

MrSpeed

3:01 pm on May 2, 2003 (gmt 0)


I have a large web site that I would like to convert to a database-driven site. What I am trying to do is extract the content and stuff it into a database so the site can be presented with a single template.

The content of each page luckily has comment tags to mark the start of the content and pretty consistent markup to denote the end of the content.

<!--Start Content-->
blah blah blah
</td>
</tr>
</table>
<table width="740">.....

I have tried to use the global find/replace tool and regular expressions in HomeSite to strip everything in the file before the <!--Start Content--> marker, but I get errors. I tried other text editors, but there the regular expression only matches within a single line and does not reach back to the beginning of the file.

I have tried
.*<!--Start Content-->
^.*<!--Start Content-->
^^.*<!--Start Content-->

The site is on IIS without Perl, so I am looking for a desktop-type solution. I have HomeSite, Dreamweaver, and a few different text editors.

Any ideas please?

RonPK

4:22 pm on May 2, 2003 (gmt 0)


HomeSite should do it. However, you need to escape the hyphens in your regexp pattern, like this:

^.*<!\-\-Start Content\-\->
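Another possible reason the patterns above stop at one line: in many regex flavors the dot does not match a newline unless a "dot matches all" (single-line) mode is switched on. Whether HomeSite or a given editor offers such a mode is not confirmed here. The following is only a sketch in Python (not HomeSite) that shows the effect, using the marker text from the original post; the function name `strip_before_marker` and the sample string are made up for illustration.

```python
import re

START_MARKER = "<!--Start Content-->"

def strip_before_marker(html: str) -> str:
    """Remove everything up to and including the first start marker."""
    # re.DOTALL lets '.' match newlines, so '.*?' can span many lines.
    # re.escape() quotes the marker so no character is read as regex syntax.
    pattern = re.compile(r"\A.*?" + re.escape(START_MARKER), re.DOTALL)
    return pattern.sub("", html, count=1)

# Hypothetical sample page, not taken from the real site.
sample = "<html>\n<body>\n<table>\n<!--Start Content-->\nblah blah blah\n</td>"

# Without DOTALL the dot stops at the first newline, so nothing matches.
print(re.search(r"^.*<!--Start Content-->", sample))

# With DOTALL the whole head of the file is removed.
print(strip_before_marker(sample))
```

If the marker is missing, the text is returned unchanged, which makes it easy to spot pages that need manual attention.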

rayvd

4:53 pm on May 6, 2003 (gmt 0)


If you have access to a Linux/Un*x box...

% lynx -dump <url>

or

% links -dump <url>

links is a little nicer IMHO. This may or may not do what you wanted, but it does a nice job of stripping the tags, and would be worth trying before you go and write a Perl script to do it or something :)

MrSpeed

5:22 pm on May 6, 2003 (gmt 0)


Well, since I'm a little better at ASP than Unix languages, I wrote a script to complete the task. It used a little regexp and some brute-force replace functions. I've got close to 600 pages in an Access database now, which will be great going forward.
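The ASP script itself was not posted, so the following is not that script. It is a rough Python sketch of the same idea: pull out whatever sits between the start comment and the closing table markup described in the first post, then store it in a database. SQLite stands in for Access here, and the names `extract_content`, `store_pages`, the `pages` table, and the sample path are all assumptions made for illustration.

```python
import re
import sqlite3

# Start marker and end markup as shown in the first post; whitespace between
# the closing tags is allowed to vary.
CONTENT_RE = re.compile(
    r'<!--Start Content-->(.*?)</td>\s*</tr>\s*</table>\s*<table width="740">',
    re.DOTALL | re.IGNORECASE,
)

def extract_content(html):
    """Return the text between the markers, or None if they are not found."""
    match = CONTENT_RE.search(html)
    return match.group(1).strip() if match else None

def store_pages(conn, pages):
    """Store extracted content keyed by page path; return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, content TEXT)"
    )
    stored = 0
    for path, html in pages.items():
        content = extract_content(html)
        if content is None:
            continue  # markers missing: leave this page for manual review
        conn.execute(
            "INSERT OR REPLACE INTO pages (path, content) VALUES (?, ?)",
            (path, content),
        )
        stored += 1
    conn.commit()
    return stored

# Hypothetical page in the layout described in the first post.
sample_page = (
    "<html><body><table><tr><td>\n<!--Start Content-->\nblah blah blah\n"
    '</td>\n</tr>\n</table>\n<table width="740"><tr><td>footer</td></tr></table>'
)
conn = sqlite3.connect(":memory:")
count = store_pages(conn, {"/about.html": sample_page, "/odd.html": "<p>no markers</p>"})
```

In practice the `pages` dictionary would be filled by reading the site's files from disk, and pages that return None would be checked by hand.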

grahamstewart

12:18 am on May 7, 2003 (gmt 0)


If you can use PHP, then the strip_tags() function does this for you.
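For anyone without PHP, a loosely similar effect can be sketched with Python's standard `html.parser` module. This is not a port of PHP's strip_tags() and its behavior differs in details; the class name `TextExtractor` and the function name `strip_tags_sketch` are invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, dropping tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and comments never reach here.
        self.parts.append(data)

def strip_tags_sketch(html):
    """Return the text content of an HTML fragment with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

print(strip_tags_sketch("<p>Hello <b>world</b></p><!-- hidden -->"))
```

Note that this removes all markup, which is more than the original question asked for, and the contents of script and style elements are still passed through as text.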

Chuma

12:26 am on May 7, 2003 (gmt 0)


NoteTab allows you to strip HTML from a file (you can preserve URLs if you wish).

I use it quite regularly when I want to read information from a website but don't want to download all the other elements of the page as well.

Thanks.