Tags Inside an <a> Tag

         

brotherhood of LAN

4:58 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello all,

I've been writing some regexes to strip pages of their various HTML tags. One of the first and foremost things to grab from a page is its links, and I'm having problems matching the links with a regex because tags are nested inside them.

For instance, the <font> tags used on the links at the top of this page prevent me from making a regex match to the links in the top navbar.....

I was wondering if anyone had a list of tags that (in valid HTML) can nest inside an <a> tag.

I'm not too savvy on the "laws" of HTML, so if there is a name for elements that are allowable in an <a> tag that would help, or a list would be just as helpful. The <i>,<b>,<font> come to mind, but I'm sure there's many more.

Any pointers would be great .... :)

pageoneresults

6:00 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting question BRO. I've always placed elements outside of the <a href> with the exception of certain <span> elements. I could not find anything specific to which elements can be nested within the <a href></a> but I did find something at MSN where they state this in regards to form controls (in a Mobile Web Application)...

When nesting tags, the hyperlink tag (anchor tag: <a>) does not recognize nested tags. For example, nesting the <b> or <i> tag as literal text inside the <a> tag will not render a link as bold or italic. The control completely ignores all tags inside of the <a> tag.

I look at other sites that are still using <font> tags and I see them outside of the <a href>, not inside (nested). I wonder if Brett does this for a specific reason. ;)

[edited by: pageoneresults at 6:08 pm (utc) on June 9, 2003]

korkus2000

6:02 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>The control completely ignores all tags inside of the <a> tag.

I hope that doesn't include the img tag. ;)

pageoneresults

6:07 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Nah, that comment from MS came from information relative to creating Mobile Web Applications.

Birdman

6:14 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why not strip out everything between the <a> and </a>, then do another regex to strip any nested tags and compile your own list as you find them?

Just a thought.
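A rough sketch of that two-pass idea, in Python rather than the PHP the original poster is using. The function name and the sample markup are made up for illustration, and the patterns are deliberately loose, so treat it as a starting point rather than a finished tool:

```python
import re
from collections import Counter

# Pass 1: grab everything between an opening <a ...> and its closing </a>.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)
# Pass 2: pick out the names of opening tags found inside that content.
OPEN_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def nested_tag_census(html):
    """Count which tags appear nested inside anchors in the given HTML."""
    counts = Counter()
    for inner in ANCHOR_RE.findall(html):
        for name in OPEN_TAG_RE.findall(inner):
            counts[name.lower()] += 1
    return counts


sample = (
    '<a href="/x"><font size="2"><b>Home</b></font></a> '
    '<A HREF="/y"><img src="y.gif"></A>'
)
print(nested_tag_census(sample))
```

Run over a pile of spidered pages, the counter becomes the home-grown list of tags seen inside anchors in the wild.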

fixed bad grammar

[edited by: Birdman at 6:22 pm (utc) on June 9, 2003]

davemarks

6:15 pm on Jun 9, 2003 (gmt 0)

10+ Year Member



Placing tags inside the <a href=""></a> tag allows you to override any style applied to the link, for example by CSS or by anything specified in the <body> tag.

brotherhood of LAN

7:05 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey Birdman,

I'd lay the regex down but I'll save that for the PHP forum ;) The regex is meant to find the anchor text inside a link as well as the link itself. I hadn't thought of finding nested tags "on the fly" as you mention. At the mo I'm stripping out these tags before looking for links with the regex. Hopefully it's not wishful thinking that I won't have to add more code to look for tags within <a> tags specifically.

pageone, "nested" was the word I was looking for, cheers ;) Any tags that can be "nested" inside an anchor is what I'd need to know....

So is HTML lax enough to have pretty much anything in there? I'm flicking through an HTML book and have a good idea of what might be legal and what not, but if there was a hard and fast rule it would make it a tad easier :)

g1smd

10:11 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmmm. I have never given this a thought, about what tags are valid inside <a> ... </a>. I have always done what felt right, and then checked it at the code validator.

You have me wondering if <a><table> ... </table></a> is actually valid, or any other such crazy stuff. It shouldn't be, as you are dropping a block-level element inside an inline element.

TGecho

11:30 pm on Jun 9, 2003 (gmt 0)

10+ Year Member



"You have me wondering if <a><table> ... </table></a> is actually valid, or any other such crazy stuff. It shouldn't be, as you are dropping a block-level element inside an inline element."

What if you do a display:block on it?

pageoneresults

11:39 pm on Jun 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



g1smd, I know you were kidding. Just to confirm, you cannot nest a table within a hyperlink. At least not according to the HTML 4.01 Transitional DTD...

Line 20, column 104: document type does not allow element "TABLE" here; missing one of "APPLET", "OBJECT", "MAP", "IFRAME", "BUTTON" start-tag

g1smd

12:13 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pageone: I'm shocked that you even tried it! ;-))

pageoneresults

12:16 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey, I'll try anything once! You never know until you try it. Some interesting effects when wrapping a table inside an <a href>. ;)

Birdman

12:28 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I was about to test every HTML tag inside the anchor, then I came to my senses ;)

ShawnR

12:33 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi

The question of which tags can be legally nested between an <a> tag and its closing tag is interesting, but you don't need to worry about that if all you want to do is strip out all tags.

Just strip out:

#</?[^>]+>#

It won't 'parse' it to test for matching end tags; but if you just want to strip them out, and you're not interested in parsing it, this should do fine.
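For anyone who wants to try the pattern outside PHP, here is a minimal Python sketch of the same strip-everything approach. The function name is just illustrative, and the pattern still has the limitation mentioned above: it does no parsing, so a ">" inside an attribute value or a comment will trip it up.

```python
import re

# Same pattern as above, minus the PHP "#" delimiters.
TAG_RE = re.compile(r"</?[^>]+>")


def strip_all_tags(html):
    """Remove anything that looks like a tag, keeping only the text."""
    return TAG_RE.sub("", html)


print(strip_all_tags('<a href="/"><font color="red"><b>Home</b></font></a>'))
```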

Shawn

brotherhood of LAN

1:09 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, here's the regex. By the way, I sort of hoped that <table> tags would not go inside an <a href>. It might sound like a daft question, what fits inside an <a> tag, but I guess it depends on what information you're taking from a page ;)

Here's the regex:

"'</?a\s*([^>]*)?href\s*=[\"\']?([^\'\">]*)[\"\']?[^>]*>([^<]*)</\s*a>'ims"

Basically a link is matched alongside its anchor text. The anchor text is matched from the first ">" that closes the opening <a> tag up to the "<" that starts </a>. If a tag is nested in there the regex won't match. Some of the nested tags might matter for weighting and shouldn't be removed, while others could be removed (depending on which tags are allowed in there).
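One possible way around the nesting problem is to match the anchor content lazily up to the closing </a> instead of stopping at the first "<", and only then strip the nested tags from the captured text. This is a sketch in Python, not the PHP pattern above; the names are illustrative and it assumes anchors are not nested inside each other:

```python
import re

# Capture the href value and everything up to the closing </a>,
# whether or not other tags sit in between.
LINK_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*["']?([^'">\s]*)["']?[^>]*>(.*?)</a\s*>""",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return (href, anchor_text) pairs with nested tags stripped from the text."""
    links = []
    for href, inner in LINK_RE.findall(html):
        text = TAG_RE.sub("", inner)
        # Collapse runs of whitespace left behind by the removed tags.
        links.append((href, " ".join(text.split())))
    return links


navbar = (
    '<a href="/home"><font size="2"><b>Home</b></font></a> | '
    "<A HREF='/faq.htm'>FAQ</A>"
)
print(extract_links(navbar))
```

If the nested tags are needed later for weighting, the captured inner HTML can be kept alongside the stripped text instead of being thrown away.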

It's for an SE. I'd like to apply scores to various tags, with <b>, <i>, <em> and the like perhaps being a way to add weight to parts of a document.
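To make the weighting idea concrete, here is a small Python sketch built on the standard library's html.parser. The weights, the choice to multiply them when tags are nested, and all the names are assumptions made up for the example; nothing here is how any real search engine scores pages.

```python
from html.parser import HTMLParser

# Hypothetical weights; any tag not listed counts as 1.0.
TAG_WEIGHTS = {"b": 2.0, "strong": 2.0, "i": 1.5, "em": 1.5}


class WeightedWords(HTMLParser):
    """Accumulate a score per word, boosted by the emphasis tags around it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # weighted tags currently open
        self.scores = {}

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag of that name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        weight = 1.0
        for tag in self.open_tags:
            weight *= TAG_WEIGHTS[tag]
        for word in data.lower().split():
            self.scores[word] = self.scores.get(word, 0.0) + weight


def score_words(html):
    parser = WeightedWords()
    parser.feed(html)
    parser.close()
    return parser.scores


print(score_words('<a href="/"><b>cheap</b> <i>blue <b>widgets</b></i></a> cheap'))
```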

If there were a list of tags that can legally be nested inside an <a> tag it would make my job easier. The links are matched with the regex before anything else is done with the page, such as turning everything to lowercase, so that case-sensitive URLs and the like won't be a problem. It's not so much a problem of what can fit inside <a> tags as of which order I can quickly parse a page in.

I'll probably end up ignoring most HTML tags in the weighting. Still, I'd have thought there was a hard and fast (and easy) rule about what's legal and what's not....I guess not ;)

g1smd

1:37 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Only inline (not block level) tags can legally be between <a> and </a>; and nothing that normally resides in the <head> part can be there either.

However, are you only dealing with validated code, or real world tag soup?

pageoneresults

1:45 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just took a trip back using the Wayback Machine and the W3C's site back then had all presentational markup outside of the <a href>. The only nested elements within <a href> were <img src> tags which we all know are legal.

So, is there a correct way to nest elements when it comes to anchors? It doesn't matter to me as I utilize CSS. And, I've come to find out that certain span elements do not work outside the <a href> so they have to be nested.

Hey BRO, sorry to steer this thing all over the road.
I promise to stop smoking that stuff, really I do. ;)

pageoneresults

2:24 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Text elements that you may find nested within <a href> (anchor) tags...

Phrase Markup:
em, strong, dfn, code, samp, kbd, var, cite

Font Markup:
tt, i, b, u, strike, big, small, sub, sup

Special Elements:
a (although an anchor cannot itself contain another anchor), img, applet, font, basefont, br, script, map

Form Field Elements:
input, select, textarea

A - Hyperlinks [htmlhelp.com]
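If it helps, the list can be turned into a quick check. This Python sketch takes the element names exactly as listed, leaves out "a" because the HTML DTDs do not allow an anchor inside another anchor, and reports any tag in a piece of anchor content that is not on the list. The names are illustrative and the tag pattern is a loose one, not a validator:

```python
import re

PHRASE = {"em", "strong", "dfn", "code", "samp", "kbd", "var", "cite"}
FONT = {"tt", "i", "b", "u", "strike", "big", "small", "sub", "sup"}
SPECIAL = {"a", "img", "applet", "font", "basefont", "br", "script", "map"}
FORM = {"input", "select", "textarea"}

# Everything on the list except a nested anchor.
ALLOWED_IN_ANCHOR = (PHRASE | FONT | SPECIAL | FORM) - {"a"}

OPEN_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def unexpected_tags(inner_html):
    """Return the sorted tag names inside an anchor that are not on the list."""
    names = {name.lower() for name in OPEN_TAG_RE.findall(inner_html)}
    return sorted(names - ALLOWED_IN_ANCHOR)


print(unexpected_tags('<font size="2"><b>Home</b></font>'))
print(unexpected_tags("<table><tr><td><b>x</b></td></tr></table>"))
```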

Arrrggghhh! This has been driving me crazy all day. I've tried just about every search query combination that I can think of that would return a list of tags that can be nested within the anchor tag. The only definitive information I can find is on the <img src> which is usually inside the <a href> anyway.

Bottom line, it looks like any inline element can be inside that anchor tag. No block-level elements: CSS can change how an element is displayed, but it doesn't change what validates, so relying on it is not recommended.

It appears that placing inline markup inside or outside of the anchor tag is acceptable as long as the nesting order is correct.

I've always designed under the assumption that I want that anchor as close to the anchor text as possible without any markup in between. Therefore I've always placed style outside of the anchor unless it was something that required it to be inside of the anchor (few instances).

"However, are you only dealing with validated code, or real world tag soup?"

Thanks to various WYSIWYG editors out there, we have the tag soup. It is all relative to how the user applies elements to the text. If they specify <strong><em> first and then apply a hyperlink, the <strong><em> elements end up outside of the anchor.

If they specify the anchor first and then apply <strong><em>, the elements end up inside of the anchor. Which is correct?

ShawnR

9:24 am on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"...It's for an SE. I'd like to apply scores to various tags..."

Oh, so you do need to parse it, not just strip out the content. If this is to be written in Perl, have you thought about using the HTML::Parser module from CPAN instead of reinventing the wheel?
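To show what the parser route looks like, here is a sketch using Python's standard html.parser in place of the Perl module, since it follows the same event-driven style. The class and function names are made up for the example. It collects each link's href, its anchor text, and the tags nested inside it, which is the information a scoring pass would need:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor_text, nested_tags) for every link in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link being built, if we are inside one

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._current = {"href": href, "text": [], "tags": []}
        elif self._current is not None:
            self._current["tags"].append(tag)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            link = self._current
            text = "".join(link["text"]).strip()
            self.links.append((link["href"], text, link["tags"]))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<p><a href="/a"><b>Bold</b> link</a> and '
    '<a href="/b"><img src="x.gif" alt="">Pic</a></p>'
)
print(collect_links(page))
```

The parser lowercases tag names itself, so the original case of the URLs in the href values is left alone.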

Shawn

brotherhood of LAN

1:59 pm on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>reinvent the wheel

Hey, don't say that, it was fun making it ;) It's not Perl either, I am clueless with Perl. It's easier for me to write 50 or 60 lines of code than to try to fathom someone else's. This <a> thing is a minor problem :). Editing my code for unusual HTML makes it easier.... P.S. strip_tags() would do the same thing; I'm just doing a preg_replace and leaving the first character of the tag and a flag.

>thy tag soup

I guess it's inevitable: if I spider anything from the web, soup will be involved. Thanks for some of that legwork guys, I wouldn't have known where to start.

Looking at pageone's list, I wouldn't mind stripping most of them out. I'll have to remember these terms - "inline" and "nested"......next time someone has some HTML that doesn't comply with my regex I'll send them an e-mail ;)

ShawnR

2:23 pm on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"...reinvent the wheel..."
Sorry, it was just an expression. I didn't mean it to come across as negative. ;) Glad you're having fun. And doing it yourself also improves your level of understanding...

brotherhood of LAN

2:27 pm on Jun 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No worries, I've stickied you the regex; if you think it needs tweaking, feel free :)