HTML Forum

    
PLEASE HELP>>> Understanding anchor link syntax
a href link syntax
miketheman




msg:4011414
 9:35 am on Oct 22, 2009 (gmt 0)

I've been working on a crawl script that pulls all the links on a page. I can successfully obtain the link value with my script, but to format the results properly I need to better understand anchor (href) syntax.

What I've concluded so far:
1. http or https -> direct link to another page (GOOD)
2. mailto:, javascript:, #, -> browser interactivity (IGNORE)
3. / -> site base (GOOD)
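
The three rules above could be sketched roughly like this in Python. This is only an illustration: the function name classify_href and the category labels are made up for the example, and the extra cases (scheme-relative, other schemes, plain relative) are assumptions about what a crawler would also run into.

```python
from urllib.parse import urlsplit

# Schemes the crawler should skip (rule 2 in the list above).
IGNORED_SCHEMES = {"mailto", "javascript"}


def classify_href(href: str) -> str:
    """Return a rough category for an href value (hypothetical labels)."""
    href = href.strip()
    # Empty values and pure fragments point back at the current page.
    if not href or href.startswith("#"):
        return "ignore"
    scheme = urlsplit(href).scheme.lower()
    if scheme in ("http", "https"):
        return "absolute"          # rule 1: direct link to another page
    if scheme in IGNORED_SCHEMES:
        return "ignore"            # rule 2: browser interactivity
    if scheme:
        return "other-scheme"      # e.g. ftp:, tel: (not covered by the list)
    if href.startswith("//"):
        return "scheme-relative"   # inherits http/https from the current page
    if href.startswith("/"):
        return "root-relative"     # rule 3: starts at the site base
    return "relative"              # e.g. search.php?some=value


for value in ["https://www.example.com/", "mailto:me@example.com",
              "#top", "/", "search.php?some=value"]:
    print(value, "->", classify_href(value))
```

The last category, "relative", is the one the rules do not cover yet.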

The issues arise when my crawler is on a page such as:
http://www.example.com/folder/thisone.html

and the link it finds is:
search.php?some=value

I think a refresher on how browsers resolve links to reach their target would help me a lot. PLEASE, ANYONE, I've been working on this for 2 days now. I can't find anything through Google on anchor or link syntax and structure.

 

swa66




msg:4011521
 1:12 pm on Oct 22, 2009 (gmt 0)

If you want the real knowledge of how a URL needs to be parsed:
ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt
ftp://ftp.rfc-editor.org/in-notes/rfc1738.txt

Your point 1 is wrong in the sense that the value must start with http:// or https://, not just http or https.

There are far more possibilities than the three you list, and a parser that goes into a bot needs to understand all of them.

So fall back on the BNF syntax in the RFCs.
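
As an illustration of what the RFC's reference resolution produces for the example in the question, here is a minimal sketch in Python. It assumes the crawler can lean on the standard library's urllib.parse.urljoin instead of a hand-written parser, and that the page has no base element overriding its own URL. The helper name resolve_link is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href against the URL of the page it was found on."""
    return urljoin(page_url, href)


page = "http://www.example.com/folder/thisone.html"

# A relative link replaces the last path segment of the page URL.
print(resolve_link(page, "search.php?some=value"))
# http://www.example.com/folder/search.php?some=value

# A leading slash resolves from the site root.
print(resolve_link(page, "/"))
# http://www.example.com/

# Absolute links are returned unchanged.
print(resolve_link(page, "https://other.example.org/a"))
# https://other.example.org/a
```

So the relative link from the question ends up in the same folder as the page it was found on.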

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved