What I've concluded so far:
1. http or https -> direct link to another page (GOOD)
2. mailto:, javascript:, # -> browser interactivity (IGNORE)
3. / -> relative to the site base (GOOD)
The issue arises when my crawler is on a page such as:
http://www.example.com/folder/thisone.html
and the link it finds is:
search.php?some=value
I think a refresher on how browsers resolve links to reach their targets would help me a lot. Please, anyone: I've been working on this for 2 days now and can't find anything through Google on anchor or link syntax and structure.
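Your 1. is imprecise: to count as an absolute link it must start with http:// or https://, not merely with the letters http or https. A link to a file named http-notes.html, for example, is still a relative link.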
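There are far more possibilities than the three you list, and a parser that goes into a bot needs to understand all of them.
Hence, fall back on the BNF syntax in the RFCs. RFC 3986 defines the URI grammar and also describes how a relative reference is resolved against the URL of the page it appears on.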
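For the specific case in the question, here is a rough sketch of how a crawler might turn each href into an absolute URL. It is written in Python only for illustration, since the thread does not say what language the crawler uses, and it assumes the page URL was meant to be http://www.example.com/folder/thisone.html. The function name resolve_link and the SKIP_SCHEMES list are made up for this example; urljoin, urlsplit and urldefrag are the standard library's urllib.parse functions, which follow the RFC resolution rules.

```python
from urllib.parse import urljoin, urlsplit, urldefrag

# Schemes the crawler should not follow (an assumed list, extend as needed).
SKIP_SCHEMES = {"mailto", "javascript", "tel", "data"}


def resolve_link(page_url, href):
    """Return the absolute URL for an href found on page_url, or None to skip it."""
    href = href.strip()

    # Empty links and pure fragments ("#top") point back at the same page.
    if not href or href.startswith("#"):
        return None

    # Links such as mailto: or javascript: are browser actions, not pages.
    if urlsplit(href).scheme.lower() in SKIP_SCHEMES:
        return None

    # Resolve the reference against the page it was found on, then drop
    # any trailing "#fragment" so the same page is not crawled twice.
    absolute, _fragment = urldefrag(urljoin(page_url, href))

    # Keep only web pages; anything else (ftp:, file:, ...) is ignored here.
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    return absolute


page = "http://www.example.com/folder/thisone.html"
for href in ["search.php?some=value", "/search.php", "../other.html",
             "//cdn.example.com/x.js", "mailto:someone@example.com", "#top"]:
    print(href, "->", resolve_link(page, href))
```

The key point is that search.php?some=value has no scheme and no leading slash, so it is resolved relative to the folder of the current page and becomes http://www.example.com/folder/search.php?some=value. A leading / replaces the whole path, ../ climbs one folder, and a leading // keeps only the scheme of the current page.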