column selection regex

Three to one, but not that one

Josefu

7:35 pm on Feb 21, 2006 (gmt 0)

I would much appreciate any help on this as I am literally tugging my hair out on this one.

I have a website full of three-columned tables that I would like to 'strip' of their outer columns, leaving just the centre one (to be later transformed into a variable for insertion into a php template). But I can't seem to come up with a working regex that will select only tables with three columns. There are sometimes several tables in a page, and everything I have come up with selects 'across' tables. I've tried negative and positive lookahead, everything.

The biggest problem with this sort of search/replace is the line breaks and unpredictable characters that have to be captured between <td></td> tags - everything I come up with will select 'across' <tr> and <table> tags. Can anyone help? Thanks in advance.

coopster

7:44 pm on Feb 21, 2006 (gmt 0)

Clarification?

<table> 
<tr><td>R1C1</td><td>Row1Column2</td><td>R1C3</td></tr> 
<tr><td>R2C1</td><td>Row2Column2</td><td>R2C3</td></tr> 
<tr><td>R3C1</td><td>Row3Column2</td><td>R3C3</td></tr> 
</table> 
<table> 
<tr><td>R1C1</td><td>R1C2</td><td>R1C3</td><td>R1C4</td></tr> 
<tr><td>R2C1</td><td>R2C2</td><td>R2C3</td><td>R2C4</td></tr> 
<tr><td>R3C1</td><td>R3C2</td><td>R3C3</td><td>R3C4</td></tr> 
</table>

You are trying to grab the bolded portions (presumably the Row1Column2, Row2Column2 and Row3Column2 cells of the first table)?

Josefu

8:09 pm on Feb 21, 2006 (gmt 0)

Thank you for your rapid reply!

Actually, it's simpler:

<table>
<tr><td>T1C1</td><td>Table1Column2</td><td>T1C3</td></tr>
</table>
<table>
<tr><td>T1C1</td><td>Table2Column2</td><td>T1C3</td></tr>
</table>
<table>
<tr><td>T1C1</td><td>Table3Column2</td><td>T1C3</td></tr>
</table>

...and the central column's content is varied and full of other 'tagged' input (text, images and comments) and carriage returns.

perl_diver

8:12 pm on Feb 21, 2006 (gmt 0)

Not sure if you are asking for a Perl solution or just a regexp in general. This type of thing is best done with an HTML parsing module:

[search.cpan.org...]

might be the ticket.

Or you can try and use a regexp but you may have no hair left at all before you figure out a reliable way to parse html using a regexp.
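As a rough illustration of the parser-based route (not the module linked above, whose address is truncated here), this is a minimal sketch using Python's standard html.parser. The names MiddleColumnExtractor and middle_columns are made up for this example, and it assumes tables are not nested inside cells, which the thread does not confirm.

```python
from html.parser import HTMLParser


class MiddleColumnExtractor(HTMLParser):
    """Record the raw inner HTML of every <td>, grouped by row and table.

    Hypothetical helper; assumes tables are never nested inside a cell.
    """

    def __init__(self):
        # convert_charrefs=False so entities reach handle_entityref/charref
        # and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.tables = []   # one list of rows per table; a row is a list of cells
        self._row = None   # cells of the current row, None outside <tr>
        self._cell = None  # fragments of the current cell, None outside <td>

    def _keep(self, text):
        if self._cell is not None:
            self._cell.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        else:
            self._keep(self.get_starttag_text())  # e.g. <img ...>, <b>

    def handle_startendtag(self, tag, attrs):
        self._keep(self.get_starttag_text())      # self-closing tags like <br />

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None
        else:
            self._keep(f"</{tag}>")

    def handle_data(self, data):
        self._keep(data)

    def handle_entityref(self, name):
        self._keep(f"&{name};")

    def handle_charref(self, name):
        self._keep(f"&#{name};")

    def handle_comment(self, data):
        self._keep(f"<!--{data}-->")


def middle_columns(html):
    """Return, per three-column table, the inner HTML of each middle cell."""
    parser = MiddleColumnExtractor()
    parser.feed(html)
    parser.close()
    return [
        [row[1] for row in rows]
        for rows in parser.tables
        if rows and all(len(row) == 3 for row in rows)
    ]
```

Calling middle_columns(page_source) gives one list per three-column table; tables with any other column count are skipped. Tag names come back lowercased on closing tags, so the output is equivalent markup rather than a byte-for-byte copy.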

Josefu

8:45 pm on Feb 21, 2006 (gmt 0)

Thanks to you too, sir. Well, let me put it this way - if there were a bbedit board on Webmasterworld, I'd be in it : ) I've been doing my best to learn 'Perl flavour' Regex (for other uses of course) so I guess that's what brought me here.

Extracting the data from the table could be a solution for sure - but I do have a few hairs left.

All this is with the goal of preparing the content for insertion into a new php skin - so I wanted the centre column to retain its formatting (img, style tags) but as a unique table - which eventually would be stripped as well. The problem for now is 'frame and isolate'. Already cleaning the 1995-era html was a chore : P

Josefu

9:39 pm on Feb 21, 2006 (gmt 0)

Whoa, nelly - I think I found it. I was looking way past the problem in wanting to parse the html... as html. What I did is strip all the carriage returns, and re-insert them before every <table>, <tr>, <td> and </tr> tag. Then I strip out all the extra whitespace... and then I have a bunch of lines, each bracketed with enclosing tags. From then on it's just a game of 'count the lines', as I can fully exploit the ^ and $ anchors. Putting it all back together is as simple as giving the stripped page a run through Html Tidy.
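The post gives no actual patterns, so the following is only a sketch of the idea as described, written in Python rather than BBEdit's grep. The names one_tag_per_line and keep_middle_cell are invented for the example. It assumes no nested tables and no whitespace-sensitive content such as <pre> inside the cells, since all whitespace runs are collapsed.

```python
import re

# Break the line before every <table>, <tr>, <td> and </tr> tag.
BREAK_BEFORE = re.compile(r"\s*(<(?:table|tr|td)\b[^>]*>|</tr\s*>)", re.IGNORECASE)

# After normalising, a three-cell row is exactly: a <tr> line, three <td> lines,
# then a line starting with </tr>. "." never crosses a newline here, so each
# cell stays confined to its own line.
THREE_CELL_ROW = re.compile(
    r"^(<tr\b[^>]*>)\n"
    r"<td\b.*\n"
    r"(<td\b.*)\n"
    r"<td\b.*\n"
    r"(</tr\s*>)",
    re.IGNORECASE | re.MULTILINE,
)


def one_tag_per_line(html):
    """Remove line breaks, collapse whitespace, put each table tag on its own line."""
    flat = re.sub(r"\s+", " ", html)
    return BREAK_BEFORE.sub(r"\n\1", flat).strip()


def keep_middle_cell(html):
    """Drop the outer cells of every three-cell row; other rows are left alone."""
    return THREE_CELL_ROW.sub(r"\1\n\2\n\3", one_tag_per_line(html))
```

A row with four cells has a fourth <td> line where the pattern expects </tr>, so it simply does not match and stays as it was.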

If someone does solve this stickler through pure regex (without the above), hats off - every hat I own!

Thanks for all your input - I'll be checking back : )
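Nobody in the thread posted a pure-regex answer. One possible attempt, offered only as a sketch, is to forbid the cell content from ever starting a table, tr or td tag by putting a negative lookahead in front of each character. The pattern below uses Python's re syntax; the names INNER, CELL, ROW and strip_outer_cells are hypothetical, and nested tables are assumed not to occur.

```python
import re

# Any run of characters in which no <table>, <tr> or <td> tag (opening or
# closing) begins. This is what stops a match from running 'across' cells,
# rows or tables, even with line breaks inside the cell.
INNER = r"(?:(?!</?(?:table|tr|td)\b).)*"
CELL = rf"<td\b[^>]*>{INNER}</td\s*>"

# A row holding exactly three cells; only the middle cell is captured.
ROW = re.compile(
    rf"(<tr\b[^>]*>)\s*{CELL}\s*({CELL})\s*{CELL}\s*(</tr\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def strip_outer_cells(html):
    """Replace every three-cell row with a row holding only its middle cell."""
    return ROW.sub(r"\1\2\3", html)
```

Because a match must start at a <tr> and reach its </tr> after exactly three cells, rows with any other cell count are left untouched. The per-character lookahead can be slow on very large pages.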

perl_diver

10:51 pm on Feb 21, 2006 (gmt 0)

very good, hope it works OK.

Josefu

10:36 am on Feb 22, 2006 (gmt 0)

Thank you - I'm almost done actually : )