I've been writing some regexes to strip pages of their various HTML tags. One of the first things to grab from a page is its links, and I'm having problems matching links with a regex because of the tags nested inside them.
For instance, the <font> tags used on the links at the top of this page prevent my regex from matching the links in the top navbar.
I was wondering if anyone had a list of tags that (in valid HTML) can nest inside an <a> tag.
I'm not too savvy on the "laws" of HTML, so if there is a name for elements that are allowable in an <a> tag that would help, or a list would be just as helpful. The <i>, <b> and <font> tags come to mind, but I'm sure there are many more.
Any pointers would be great .... :)
When nesting tags, the hyperlink tag (anchor tag: <a>) does not recognize nested tags. For example, nesting the <b> or <i> tag as literal text inside the <a> tag will not render a link as bold or italic. The control completely ignores all tags inside of the <a> tag.
I look at other sites that are still using <font> tags and I see them outside of the <a href>, not inside (nested). I wonder if Brett does this for a specific reason. ;)
[edited by: pageoneresults at 6:08 pm (utc) on June 9, 2003]
I'd lay the regex down, but I'll save that for the PHP forum ;) The regex wants to find any anchor text inside a tag as well as the link. I hadn't thought of finding nested tags "on the fly" as you mention. At the moment I'm stripping out these tags before looking for links with the regex. Hopefully it's not wishful thinking that I wouldn't have to add more code to look for tags within <a> tags especially.
pageone, "nested" was the word I was looking for, cheers ;) Any tags that can be "nested" inside an anchor is what I'd need to know....
So is HTML lax enough to have pretty much anything in there? I'm flicking through an HTML book and have a good idea of what might be legal and what not, but if there was a hard and fast rule it would make it a tad easier :)
You have me wondering if <a><table> ... </table></a> is actually valid, or any other such crazy stuff. It shouldn't be, as you are dropping a block-level element inside an inline element.
The question of which tags can be legally nested between an <a> tag and its closing tag is interesting, but you don't need to worry about that if all you want to do is strip out all tags.
Just strip out:
#</?[^>]+>#
It won't 'parse' it to test for matching end tags; but if you just want to strip them out, and you're not interested in parsing it, this should do fine.
Shawn
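For anyone following along outside PHP, here is a minimal Python sketch of the same strip-everything idea. The pattern is the one above without the PHP "#" delimiters; the function name strip_all_tags and the sample string are made up for illustration. It assumes no ">" appears inside an attribute value, which real pages don't always honour.

```python
import re

# Same pattern as in the post above, minus the PHP '#' delimiters.
TAG_RE = re.compile(r"</?[^>]+>")

def strip_all_tags(html: str) -> str:
    """Remove anything that looks like a tag. No check for matching end tags."""
    return TAG_RE.sub("", html)

# Hypothetical sample: a link with <font> and <b> nested inside it.
sample = '<p>Visit <a href="/forum"><font size="2"><b>the forum</b></font></a> today.</p>'
print(strip_all_tags(sample))  # Visit the forum today.
```

Note that the link's href is thrown away along with everything else, so this only helps if the links have already been collected.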
Here's the regex:
"'</?a\s*([^>]*)?href\s*=[\"\']?([^\'\">]*)[\"\']?[^>]*>([^<]*)</\s*a>'ims"
Basically, a link is matched alongside its anchor text. The anchor text is matched from the first ">" of the <a> tag until it meets the "<" that starts </a>. If a tag is nested in there, the regex won't match. There might be tags in there that matter for weighting and shouldn't be removed, while others could be removed (depending on what tags are allowed in there).
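One possible way around the nested-tag problem, sketched in Python rather than PHP: instead of requiring the anchor text to contain no "<" at all, capture everything up to the closing </a> with a non-greedy match, then strip the inner tags afterwards. This is a loosened variant of the pattern above, not the original, and the names LINK_RE and extract_links are invented for the example. It still assumes reasonably well-formed anchors.

```python
import re

# Non-greedy (.*?) lets nested tags such as <font> or <b> sit inside the anchor.
LINK_RE = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*["']?([^'">\s]*)["']?[^>]*>(.*?)</\s*a\s*>""",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"</?[^>]+>")

def extract_links(html):
    """Return (href, anchor_text) pairs, with nested tags removed from the text."""
    links = []
    for href, inner in LINK_RE.findall(html):
        links.append((href, TAG_RE.sub("", inner).strip()))
    return links

# Hypothetical navbar snippet with tags nested inside the first link.
navbar = ('<a href="/home"><font size="2"><b>Home</b></font></a> | '
          "<a href='/search'>Search</a>")
print(extract_links(navbar))  # [('/home', 'Home'), ('/search', 'Search')]
```

If the nested tags matter for weighting, the raw inner markup could be kept alongside the stripped text instead of being discarded.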
It's for a search engine. I'd like to apply scores to various tags, with <b>, <i>, <em> and the like perhaps being a way to add weight to parts of a document.
If there were a list of tags that can legally be nested inside an <a> tag, it would make my job easier. The links are matched with the regex before anything else is done with the page, such as turning everything to lowercase, so that case-sensitive URLs and the like won't be a problem. It's not so much a problem of what can fit inside <a> tags as of the order in which I can quickly parse a page.
I'll probably end up ignoring most HTML tags in the weighting. Still, I'd have thought there was a hard and fast (and easy) rule for what's legal and what's not... I guess not ;)
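To make the weighting idea concrete, here is a rough Python sketch using the standard library's html.parser. The weights in TAG_WEIGHTS are pure invention (the thread never gives actual scores), and the class name WeightedText is hypothetical. Each word gets the product of the weights of the weighted tags currently open around it, and tags not in the table are simply ignored.

```python
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration.
TAG_WEIGHTS = {"b": 3, "strong": 3, "i": 2, "em": 2}

class WeightedText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # weighted tags currently open
        self.scored = []     # (word, weight) pairs in document order

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <B> and <b> are treated alike.
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        weight = 1
        for tag in self.open_tags:
            weight *= TAG_WEIGHTS[tag]
        for word in data.split():
            self.scored.append((word.lower(), weight))

parser = WeightedText()
parser.feed('Plain <a href="/x"><B>Bold Link</B></a> text')
parser.close()
print(parser.scored)  # [('plain', 1), ('bold', 3), ('link', 3), ('text', 1)]
```

Only the words are lowercased here; attribute values such as URLs are left alone, which sidesteps the case-sensitivity worry.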
So, is there a correct way to nest elements when it comes to anchors? It doesn't matter to me as I utilize CSS. And, I've come to find out that certain span elements do not work outside the <a href> so they have to be nested.
Hey BRO, sorry to steer this thing all over the road.
I promise to stop smoking that stuff, really I do. ;)
Phrase Markup:
em, strong, dfn, code, samp, kbd, var, cite
Font Markup:
tt, i, b, u, strike, big, small, sub, sup
Special Elements:
a, img, applet, font, basefont, br, script, map
Form Field Elements:
input, select, textarea
A - Hyperlinks [htmlhelp.com]
Arrrggghhh! This has been driving me crazy all day. I've tried just about every search query combination that I can think of that would return a list of tags that can be nested within the anchor tag. The only definitive information I can find is on the <img src> which is usually inside the <a href> anyway.
Bottom line, it looks like any inline element can be inside that anchor tag. No block-level elements. You can override how an element displays using CSS, but that changes the rendering rather than what is valid markup, and it is not recommended.
It appears that placing inline markup inside or outside of the anchor tag is acceptable as long as the nesting order is correct.
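As a rough illustration of the "inline only" rule, the list above can be turned into a quick check. This Python sketch uses the standard html.parser module; the names INLINE_IN_ANCHOR, AnchorNestingChecker and unexpected_tags_in_anchors are made up. I've left "a" out of the set because an anchor cannot contain another anchor. It is a sanity check against that one list, not a validator.

```python
from html.parser import HTMLParser

# Inline elements from the list above, minus "a" itself.
INLINE_IN_ANCHOR = {
    "em", "strong", "dfn", "code", "samp", "kbd", "var", "cite",
    "tt", "i", "b", "u", "strike", "big", "small", "sub", "sup",
    "img", "applet", "font", "basefont", "br", "script", "map",
    "input", "select", "textarea",
}

class AnchorNestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.unexpected = []  # tags seen inside <a> that are not in the list

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.in_anchor:
                self.unexpected.append("a")  # nested anchor
            self.in_anchor = True
        elif self.in_anchor and tag not in INLINE_IN_ANCHOR:
            self.unexpected.append(tag)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

def unexpected_tags_in_anchors(html):
    """Return the tags found inside anchors that the list does not allow."""
    checker = AnchorNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.unexpected

print(unexpected_tags_in_anchors('<a href="/x"><font><b>ok</b></font></a>'))  # []
print(unexpected_tags_in_anchors('<a href="/x"><table><tr><td>no</td></tr></table></a>'))
# ['table', 'tr', 'td']
```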
I've always designed under the assumption that I want that anchor as close to the anchor text as possible without any markup in between. Therefore I've always placed style outside of the anchor unless of course it was something that required it be inside of the anchor (few instances).
However, are you only dealing with validated code, or real world tag soup?
Thanks to various WYSIWYG editors out there, we have the tag soup. It is all relative to how the user applies elements to the text. If they specify <strong><em> first and then apply a hyperlink, the <strong><em> elements end up outside of the anchor.
If they specify the anchor first and then apply <strong><em>, the elements end up inside of the anchor. Which is correct?
Hey, don't say that, it was fun making it ;) It's not Perl either; I am clueless with Perl. It's easier for me to write 50/60 lines of code than try to fathom someone else's. This <a> thing is a minor problem :). Editing my code for unusual HTML makes it easier.... p.s. strip_tags() would do the same thing; I'm just doing a preg_replace and leaving the first character of the tag and a flag.
>thy tag soup
I guess it's inevitable if I spider anything from the web, soup will be involved. Thanks for some of that legwork, guys, I wouldn't have known where to start.
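For the soup case, a tolerant parser may cope better than a single regex. Below is a small Python sketch built on the standard html.parser module; the class name LinkCollector and the sample markup are hypothetical. It assumes that a new <a> implicitly ends an unclosed one, which is a guess at what the author meant, not a rule from any spec.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, anchor_text) pairs
        self._href = None  # href of the anchor currently being read
        self._text = []

    def _finish(self):
        if self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, text))
        self._href, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._finish()  # assumption: a new <a> ends an unclosed one
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._finish()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def close(self):
        super().close()
        self._finish()  # pick up an anchor left open at end of input

# Made-up soup: upper-case tags, an unclosed anchor, misnested </a></i>.
soup = '<A HREF="/One"><FONT><B>First</B></FONT><a href=/two>Second <i>link</a></i>'
collector = LinkCollector()
collector.feed(soup)
collector.close()
print(collector.links)  # [('/One', 'First'), ('/two', 'Second link')]
```

Tag and attribute names come back lowercased while attribute values keep their case, so case-sensitive URLs survive without a separate lowercasing pass.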
Looking at pageone's list, I wouldn't mind stripping most of them out. I'll have to remember these terms, "inline" and "nested"... next time someone has some HTML that doesn't comply with my regex I'll send them an e-mail ;)