Structure of a URI

Elements of it

         

brotherhood of LAN

6:00 pm on Nov 7, 2002 (gmt 0)




I'm trying to use some regex to get the links, title, etc. off a page, but a bit of a stumbling block is the regex for the actual URL, and keeping any potential problems far, far away from the equation :)

Problems range from relative URLs, capitalization, the use (or lack) of anchor text, badly formed HTML, other attributes in the <a> tag, and hard line breaks or excessive spaces in the HTML... any of which can stop the regex from matching a URL.
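The relative-URL problem in particular can be handled separately from the matching. A minimal Java sketch, assuming the page's own URL is known; the class and method names are hypothetical and `java.net.URI` does the resolving:

```java
import java.net.URI;

// Hypothetical helper, not from the thread: turns an href into an absolute URL.
class LinkResolver {

    // Returns the absolute form of href, or null if href is not a valid URI reference.
    static String resolve(String pageUrl, String href) {
        try {
            URI base = URI.create(pageUrl);
            // trim() drops stray spaces or line breaks around the attribute value
            return base.resolve(href.trim()).normalize().toString();
        } catch (IllegalArgumentException e) {
            return null; // badly formed link: skip it rather than guess
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/dir/page.html";
        System.out.println(resolve(page, "../img/a.html"));
        System.out.println(resolve(page, "other.html"));
        System.out.println(resolve(page, " /top.html\n"));
        System.out.println(resolve(page, "bad link.html"));
    }
}
```

Links that fail to parse come back as null here, which is one way to keep garbage out without tightening the regex itself.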

Are there any official docs out there about the structure of a URI?

This is sort of a 50/50 question about URI standards and about regex. I guess if I can make sure that all valid URLs are matched, and nothing else is, then the whole thing will work 100% :)

lorax

6:10 pm on Nov 7, 2002 (gmt 0)




Hey BOL,
I assume you've been here: [php.net...]

and/or here: [php.net...]

Of course - these are for URLs not URIs.

brotherhood of LAN

6:24 pm on Nov 7, 2002 (gmt 0)




hey lorax,

Cheers for that, another point for the PHP crew! I'll look at them to see what parts of the URL they use for parsing.

The script will be grabbing pages off the web, i.e. each page is the string, and I'll be preg_matching all the links in it.

The regex I have just now is not working 100%, but I don't want to be leaving it too "loose" so that it might pick up garbage along the way.
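To make the loose-versus-strict trade-off concrete, here is a rough Java sketch of a link regex that tolerates mixed case, extra attributes, line breaks inside the tag, and all three quoting styles. The pattern and the class name are illustrative assumptions, not the regex from the post. It is still on the loose side: it will also pick up links inside comments or scripts, and attributes such as data-href.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the pattern is an illustration, not a complete solution.
class HrefRegex {

    // <a, any other attributes, then href = "..." or '...' or an unquoted value
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // exactly one of the three groups holds the value, depending on quoting
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                    break;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<A\n  class=x HREF = \"a.html\">one</a> "
                    + "<a href='b.html'>two</a> <a href=c.html>three</a> <abbr title=x>";
        System.out.println(extract(html));
    }
}
```

The negated character class `[^>]` matches line breaks as well, so no separate step to strip them is needed for this pattern.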

andreasfriedrich

7:13 pm on Nov 7, 2002 (gmt 0)




Are there any official docs out there about the structure of a URI?

RFC2396 - Uniform Resource Identifiers (URI): Generic Syntax [faqs.org]

B. Parsing a URI Reference with a Regular Expression
[...]
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
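As a rough illustration of how that expression is meant to be used, here is a small Java sketch that applies it and reads out the capture groups the RFC assigns to each component (2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment). The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the RFC 2396 Appendix B expression.
class UriParts {

    private static final Pattern URI_REF = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}; absent parts are null.
    static String[] split(String ref) {
        Matcher m = URI_REF.matcher(ref);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a single-line URI reference: " + ref);
        }
        return new String[] { m.group(2), m.group(4), m.group(5), m.group(7), m.group(9) };
    }

    public static void main(String[] args) {
        String[] p = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println("scheme=" + p[0] + " authority=" + p[1] + " path=" + p[2]
            + " query=" + p[3] + " fragment=" + p[4]);
    }
}
```

Note that this expression only splits a reference into parts. It accepts practically any single-line string, so it will not filter out garbage by itself; any validation has to happen on the components afterwards.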

andreasfriedrich

8:02 pm on Nov 7, 2002 (gmt 0)




Trying to parse an HTML document using simple regular expressions is bound to fail on anything but the simplest HTML. I would suggest that you use an HTML parser. Either use Google to find one [google.de] or write one yourself. It's a rather easy but boring task.
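For comparison with the regex approach, here is a minimal sketch of link extraction with a ready-made parser. It assumes a standard JDK, which ships a lenient HTML 3.2 parser in the javax.swing.text.html packages; the wrapper class is hypothetical and this is just one possible choice of parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical wrapper: collects the href of every <a> start tag the parser reports.
class ParserLinks {

    static List<String> extract(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // third argument: ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Intro <a href=\"a.html\">one</a>"
                    + "<A HREF='b.html'>two</A></body></html>";
        System.out.println(extract(html));
    }
}
```

Tag case, quoting style and unclosed elements become the parser's problem, so the calling code only ever sees the tag and its attributes.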

You could use Gisle Aas' HTML::Parser as an example, although you might need to use an older version, since I think the newer one is implemented in C. That being the case, it should be fairly easy to use that to interface with PHP. Should be just some minor changes in the source code.

Here is an HTML parser I wrote a couple of years ago in Java.

package af.HTML; 

import java.net.*;
import java.io.*;
import java.util.Vector;
import java.util.StringTokenizer;
import java.util.Stack;
import java.util.Hashtable;
import af.HTML.HTMLKonvertierung;

public class HTMLParser implements HTMLKonstanten {

public static interface ElementGefunden {
public void elementGefunden(HTMLElement el) throws IOException ;
public void elementGefunden(URL url);
}

Reader r;
StreamTokenizer st;
URL url;
ElementGefunden eg;
HTMLElement aktEl = null;
HTMLElement ordElter = null;
HTMLElement basisElter = null;
HTMLElement imgElter = null;
HTMLTags TAGS;

public HTMLParser(String start_url, ElementGefunden eg) {
try {
url = new URL(start_url);
r = new BufferedReader(new InputStreamReader(url.openStream()));
st = new StreamTokenizer(r);
} catch (MalformedURLException e1) {
System.out.println("MalformedURLException: " + e1);
} catch (IOException e2) {
System.out.println("IOException: " + e2);
}

st.wordChars('!', '~');
st.ordinaryChar('=');
st.ordinaryChar('>');
st.ordinaryChar('<');
st.ordinaryChar('(');
st.ordinaryChar(')');
st.ordinaryChars('\u0000', '\u0020');
st.slashSlashComments(false);
st.slashStarComments(false);
this.eg = eg;

TAGS = new HTMLTags();
}

public void start() throws IOException {

String innerText;
boolean tag_OK;

while (st.ttype!= st.TT_EOF) {
if (aktEl!= null && basisElter!= null) {
if (aktEl.TagFlag() == TITLE) {
basisElter.setTitle(parseTitle());
}
}
innerText = parseText(st);
if (innerText.length()!= 0) {
if (aktEl!= null) {
aktEl.addInnerText(innerText);
}
}
tag_OK = parseTag(st);
if (!tag_OK && innerText.length() == 0) { break; } // compare content, not references
}
r.close();
}

private String parseTitle() throws IOException {
try {
BufferedReader in = new BufferedReader(
new InputStreamReader(basisElter.Url().openStream()));
StringBuffer sb = new StringBuffer();
String line, kline;
int o;
char c;
line = in.readLine();
AUSSEN:
while (true) {
if (line.length() == 0) { break; }
o = line.indexOf('<');
if (o!= -1) {
line = line.substring(o+1);
if (line.startsWith("title") || line.startsWith("TITLE")) {
line = line.substring(5);
if (line.startsWith(">")) {
line = line.substring(1);
while (true) {
line += in.readLine();
kline = line.toLowerCase();
o = kline.indexOf("</title>");
if (o!= -1) { return line.substring(0, o); }
else { continue; }
}
}
} else { continue; }
} else { line = ""; line += in.readLine();}
}
} catch (IOException e) { System.out.println(e); }
return "";
}

private String parseText(StreamTokenizer st) throws IOException {

StringBuffer text = new StringBuffer();
Character sz = null;

PARSE: while(true) {
st.nextToken();

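switch (st.ttype) {
// case labels must be compile-time constants, so use the class, not the instance
case StreamTokenizer.TT_EOF: {
break PARSE;
}
case StreamTokenizer.TT_EOL: {
break;
}
case StreamTokenizer.TT_WORD: {
text.append(HTMLKonvertierung.fromHTML(st.sval));
break;
}
case StreamTokenizer.TT_NUMBER: {
// no break: falls through to default
}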
default: {
if (st.ttype == '<') { // || st.ttype == '>') {
st.pushBack();
break PARSE;
} else {
text.append((char)st.ttype);
}
}
}
}
return text.toString();
}

private boolean parseTag(StreamTokenizer st) throws IOException {
Hashtable attribute = new Hashtable();
String text = null;
Integer tagFlag = new Integer(-1);
int le, hp;
String l;
st.nextToken();

if(st.ttype==st.TT_EOF) {
return false;
} else if(st.ttype!= '<') {
st.pushBack();
return false;
}

st.nextToken();
if(st.ttype!=st.TT_WORD) {
st.pushBack();
attribute = zumTagendeSpringen(st);
return true;
} else {
String tn = st.sval.toLowerCase();
tagFlag = (Integer)TAGS.TAGS.get(tn.toUpperCase());

if (tn.equals("!--")) { zumKommentarendeSpringen(st); return true; }

attribute = zumTagendeSpringen(st);

if (tagFlag == null && !tn.startsWith("/") ||
(tn.startsWith("/") && TAGS.TAGS.get(tn.substring(1).toUpperCase()) == null)) {
return true;
}

if (tn.startsWith("/")/* && (aktEl!= null)*/) {
tn = tn.substring(1);
tagFlag = (Integer)TAGS.TAGS.get(tn.toUpperCase());
elementSchließen(tagFlag.intValue());
return true;
}

if (tagFlag.intValue() == SCRIPT) { zumScriptendeSpringen(st); return true; }

if (isOptET(tagFlag.intValue()) && basisElter!= null) {
switch (tagFlag.intValue()) {
case COLGROUP: { break; }
case DD: { elementSchließen(DD); elementSchließen(DT); break; }
case DT: { elementSchließen(DD); elementSchließen(DT); break; }
case LI: { elementSchließen(LI); break; }
case OPTION: { elementSchließen(OPTION); break; }
case P: { elementSchließen(P); break; }
case TBODY: { elementSchließen(THEAD); break; }
case TD: { elementSchließen(TD); break; }
case TFOOT: { elementSchließen(TBODY); break; }
case TH: { elementSchließen(TH); break; }
case THEAD: { break; }
case TR: { elementSchließen(TR); elementSchließen(COLGROUP); break; }
}
}

if (isVerbET(tagFlag.intValue())/* || tagFlag.intValue() == BODY*/) {
if (tagFlag.intValue() == IMG) {
if (ordElter.Fertig()) { imgElter = bubble(ordElter); }
else { imgElter = ordElter; }
imgElter.addKind(new HTMLElement(tagFlag, attribute, imgElter, true));
}
return true;
}
if (basisElter == null) {
basisElter = new HTMLElement(tagFlag, attribute, url);
ordElter = basisElter;
} else {
if (ordElter.Fertig()) { ordElter = bubble(ordElter); }
aktEl = new HTMLElement(tagFlag, attribute, ordElter);
ordElter.addKind(aktEl);
ordElter = aktEl;
if (tagFlag.intValue() == A && attribute.containsKey("href")) {
l = (String)attribute.get("href");
if (l.toLowerCase().startsWith("mailto:")) { return true; }
if (l.toLowerCase().startsWith("news:")) { return true; }
if (l.toLowerCase().startsWith("javascript:")) { return true; }
if (l.startsWith("#")) { return true; }
try { if (l.substring(l.lastIndexOf(".")).indexOf("htm") == -1) {
return true; } }
catch (StringIndexOutOfBoundsException e) {}
if ((hp = l.lastIndexOf("#"))!= -1) { l = l.substring(0, hp); }
try { eg.elementGefunden(new URL(url, l)); }
catch (MalformedURLException e) {
System.out.println(e.getLocalizedMessage());
}
}
}
return true;
}
}

private Hashtable zumTagendeSpringen(StreamTokenizer st) throws IOException {
Hashtable ht = new Hashtable();
String s = null;
String w = null;

while (true) {
st.nextToken();
if (st.ttype == st.TT_EOF) { ht = null; break; } else
if (st.ttype == '>') { break; } else
if (st.ttype == st.TT_WORD) {
s = st.sval;
for (int i = 0; i < ATTR.length; i++) {
if (s.equals(ATTR[i])) { // compare content, not references
ht.put(s, null);
continue;
}
}
st.nextToken();
if (st.ttype == '=') {
st.nextToken();
if (st.ttype == st.TT_NUMBER) {
ht.put(s, anfzEntfernen(new Double(st.nval).toString()));
} else if (st.ttype == st.TT_WORD) {
ht.put(s, anfzEntfernen(st.sval));
}
} else {
st.pushBack();
continue;
}
}
}
return ht;
}

private void zumKommentarendeSpringen(StreamTokenizer st) throws IOException {
while (true) {
st.nextToken();
if (st.ttype == st.TT_EOF) { break; } else
if (st.ttype == '-') {
st.nextToken();
if (st.ttype == '-') {
st.nextToken();
if (st.ttype == '>') { break; }
}
}
}
}

private void zumScriptendeSpringen(StreamTokenizer st) throws IOException {
while (true) {
st.nextToken();
if (st.ttype == st.TT_EOF) { break; } else
if (st.ttype == '<') {
st.nextToken();
if (st.ttype == st.TT_WORD) {
if (st.sval.toUpperCase().equals("/SCRIPT")) {
Hashtable x = zumTagendeSpringen(st);
break;
}
}
}
}
}

private String parseSZ(String s) throws IOException {
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.ordinaryChar('&');
st.ordinaryChar(';');
st.ordinaryChar('#');
StringBuffer sbuf = new StringBuffer();
while (st.ttype!= st.TT_EOF) {
if (st.ttype == st.TT_WORD) { sbuf.append(st.sval); } else
if (st.ttype == '&') {
st.nextToken();
if (st.ttype == st.TT_WORD) {
if (st.sval.equals("shy")) { st.nextToken(); continue; }
sbuf.append(name2hex(st.sval));
} else
if (st.ttype == '#') {
st.nextToken();
sbuf.append(unicode2hex((int)st.nval));
}
} else {
sbuf.append((char)st.ttype);
}
st.nextToken();
}
return sbuf.toString();
}
private char name2hex(String name) {
for (int i = 0; i < HTML_SZ.length; i++) {
if (name.equals(HTML_SZ[i])) { return SZ[i]; }
}
return '\u0000';
}
private char unicode2hex(int nummer) {
for (int i = 0; i < UNICODE_SZ.length; i++) {
if (nummer == UNICODE_SZ[i]) { return SZ[i]; }
}
return '\u0000';
}

private HTMLElement bubble(int tagFlag, HTMLElement el) {
if (el.TagFlag() == tagFlag && el.Fertig() == false) { return el; }
else {
if (el.Elter()!= null) {
switch (tagFlag) {
case DD: {
if (el.TagFlag() == DL && el.Fertig() == false) {
return null;
}
}
case DT: {
if (el.TagFlag() == DL && el.Fertig() == false) {
return null;
}
}
case LI: {
if ((el.TagFlag() == OL || el.TagFlag() == UL) &&
el.Fertig() == false) {
return null;
}
}
case TD: { // TH and TR have their own cases below
if (el.TagFlag() == TABLE && el.Fertig() == false) {
return null;
}
}
case TH: {
if (el.TagFlag() == TABLE && el.Fertig() == false) {
return null;
}
}
case TR: {
if (el.TagFlag() == TABLE && el.Fertig() == false) {
return null;
}
}
}
el = bubble(tagFlag, el.Elter());
}
else { return null; }
}
return el;
}
private HTMLElement bubble(HTMLElement el) {
if (!el.Fertig()) { return el; }
else {
if (el.Elter()!= null) { el = bubble(el.Elter()); }
else { return null; }
}
return el;
}

private void elementSchließen(int tagFlag) throws IOException {
if (tagFlag == aktEl.TagFlag()) {
aktEl.setFertig(true);
} else {
HTMLElement temp = bubble(tagFlag, aktEl);
if (temp!= null) {
temp.setFertig(true);
aktEl = temp;
if (temp.Elter() == null) { elementeMelden(); }
}
}
}

private void elementeMelden() throws IOException {
eg.elementGefunden(basisElter);
aktEl = null;
basisElter = null;
}

private String anfzEntfernen(String str) {
if (str.startsWith("\"") ||
str.startsWith("'")) {
str = str.substring(1);
}
if (str.endsWith("\"") ||
str.endsWith("'")) {
str = str.substring(0, str.length()-1);
}
return str;
}

private boolean isOptET(int tagFlag) {
for (int i = 0; i < OPT_ET.length; i++) {
if (OPT_ET[i] == tagFlag) { return true; }
}
return false;
}

private boolean isVerbET(int tagFlag) {
for (int i = 0; i < VERB_ET.length; i++) {
if (VERB_ET[i] == tagFlag) { return true; }
}
return false;
}

private boolean isBlock(int tagFlag) {
for (int i = 0; i < BLOCK.length; i++) {
if (BLOCK[i] == tagFlag) { return true; }
}
return false;
}

private boolean isEinfAttr(String attribut) {
for (int i = 0; i < ATTR.length; i++) {
if (ATTR[i].equals(attribut)) { return true; }
}
return false;
}

public static class Test {

public static void main(String args[]) throws IOException {
HTMLParser p = new HTMLParser(args[0],
new HTMLParser.ElementGefunden() {
public void elementGefunden(HTMLElement el) {
System.out.println("Text: " + el.InnerText());
}
public void elementGefunden(URL url) {
System.out.println("Link: " + url.toString());
}
});
p.start();
System.out.println("Ende");
}
}
}

brotherhood of LAN

9:25 pm on Nov 7, 2002 (gmt 0)




Wow, I'll bookmark that. I've just managed to search for a parser, and I'll test that out for now ;)

I was using curl to grab a page, split it into the header, head and body, and regex the title, description, headings, paragraphs etc. from each section... count chars, occurrences etc. before inserting them into a table.

After removing line breaks and gaps between tags the code seems fairly uniform after testing it over 10 pages.
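That clean-up step might look roughly like the Java sketch below. The class name is hypothetical and the two replacements are a guess at what "removing line breaks and gaps between tags" amounts to, not the actual script.

```java
// Hypothetical helper: flattens whitespace so that tag-level regexes see uniform input.
class HtmlSquash {

    static String squash(String html) {
        return html
            .replaceAll("\\s+", " ")     // line breaks, tabs and space runs become one space
            .replaceAll(">\\s+<", "><")  // drop the gap between adjacent tags
            .trim();
    }

    public static void main(String[] args) {
        System.out.println(squash("<p>\n  a  b\n</p>\n\n<p>c</p>\n"));
    }
}
```

One caveat with this kind of flattening: it also changes content where whitespace matters, such as inside <pre>, and it removes the space between adjacent inline elements.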

Parser....regex/curl.....hmm :)

andreas, if possible could you sticky me a good (free) HTML parser you've tried, so I can maybe see where I'm going wrong?

Thanks to both of you for the help.

andreasfriedrich

10:07 pm on Nov 7, 2002 (gmt 0)




Sorry Richard, but I have never used an HTML parser for PHP and don't really know one. I looked at PEAR but there is none available. Now if all your pages were XHTML compliant then you could use PHP's expat interface, but that would probably be way too strict considering the state of most pages out there.
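To show what "too strict" means in practice, here is a small Java sketch using the JDK's SAX parser as a stand-in for an XML parser such as expat. This is an analogy under that assumption, not PHP's expat API, and the class name is made up. Well-formed XHTML yields its links; ordinary tag soup is rejected outright.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: link extraction that only works on well-formed XHTML.
class StrictLinks {

    // Throws IllegalArgumentException if the input is not well-formed XML.
    static List<String> extract(String xhtml) {
        final List<String> links = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("a") && atts.getValue("href") != null) {
                    links.add(atts.getValue("href"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><a href=\"a.html\">x</a></body></html>"));
        try {
            // unquoted attribute and unclosed <a>: common on real pages, fatal for XML
            extract("<html><body><a href=a.html>x</body></html>");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```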

I always use Perl and HTML::Parser [search.cpan.org] for those tasks.

Andreas