Forum Moderators: incrediBILL

PLEASE HELP>>> Understanding anchor link syntax

a href link syntax

9:35 am on Oct 22, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Oct 20, 2009
votes: 0

I've been working on a crawl script that pulls all the links on a page. I can successfully obtain each link value with my script, but to format the results properly I need to better understand anchor (href) syntax.

What I've concluded so far:
1. http or https -> direct link to another page (GOOD)
2. mailto:, javascript:, # -> browser interactivity (IGNORE)
3. / -> site base (GOOD)
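For illustration only, the three rules above could be sketched in Python roughly like this. The function name classify_href and the bucket labels are made up for this example, and the "other" bucket is an assumption covering everything the list does not mention (relative paths, protocol-relative links, other schemes).

```python
from urllib.parse import urlsplit

def classify_href(href):
    """Sort an href into the three buckets from the list, plus a catch-all.

    The labels are hypothetical; they are not part of any standard.
    """
    href = href.strip()
    scheme = urlsplit(href).scheme.lower()
    if scheme in ("http", "https"):
        return "absolute"      # rule 1: direct link to another page
    if scheme in ("mailto", "javascript") or href.startswith("#"):
        return "ignore"        # rule 2: browser interactivity
    if href.startswith("/") and not href.startswith("//"):
        return "site-root"     # rule 3: path relative to the site base
    return "other"             # everything the three rules do not cover

for sample in ["https://example.com/a", "mailto:me@example.com", "#top",
               "/about", "page2.html", "//cdn.example.com/x.js"]:
    print(sample, "->", classify_href(sample))
```

Anything that lands in "other" still has to be resolved against the URL of the page it was found on before it can be crawled.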

the issues arise when my crawler is on a page link:

and link result is:

I think a refresher on how browsers resolve links to reach their targets would help me a lot. PLEASE, ANYONE: I've been working on this for 2 days now. I can't find anything through Google on anchor or link syntax and structure.
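As a rough sketch of what a browser does with an href: it resolves the value against the URL of the page the link was found on. In Python, urljoin from the standard library performs this kind of resolution. The example.com URLs and the resolve wrapper below are placeholders invented for this example, not the URLs from the post, and the sketch assumes the page has no base element overriding the base URL.

```python
from urllib.parse import urljoin

def resolve(page_url, href):
    """Resolve href against the URL of the page it appeared on."""
    return urljoin(page_url, href.strip())

# Hypothetical page the crawler is currently on.
page = "http://example.com/dir/page.html"

for href in ["other.html", "../up.html", "/root.html",
             "?q=1", "#frag", "//cdn.example.com/x.js"]:
    print(href.ljust(24), "->", resolve(page, href))

# other.html               -> http://example.com/dir/other.html
# ../up.html               -> http://example.com/up.html
# /root.html               -> http://example.com/root.html
# ?q=1                     -> http://example.com/dir/page.html?q=1
# #frag                    -> http://example.com/dir/page.html#frag
# //cdn.example.com/x.js   -> http://cdn.example.com/x.js
```

An href that is already absolute comes back unchanged, so every link can go through the same call.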

1:12 pm on Oct 22, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member swa66 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 7, 2003
votes: 0

If you want the real knowledge of how a URL needs to be parsed:

Your 1. is wrong in the sense that an absolute link must start with http:// or https://, not just http or https.

There are far more possibilities, and a parser that goes into a bot needs to understand them all.

Hence, fall back on the BNF syntax in the RFCs.
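To give a feel for what the RFC grammar yields, here is a minimal sketch that splits a reference into its five generic components using the regular expression from Appendix B of RFC 3986. Choosing that particular RFC is an assumption, since the post does not say which ones are meant, and the helper name split_uri is made up for this example. It only splits; it does not validate or resolve anything.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(ref):
    """Return (scheme, authority, path, query, fragment).

    Components that are absent come back as None; the path may be "".
    """
    m = URI_RE.match(ref)
    return m.group(2, 4, 5, 7, 9)

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
# ('http', 'www.ics.uci.edu', '/pub/ietf/uri/', None, 'Related')
print(split_uri("page.html?x=1"))
# (None, None, 'page.html', 'x=1', None)
```

A reference with no scheme and no authority is a relative reference, which is the case a bot has to resolve against the page URL.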