how to extract urls from a txt file?

         

indiandomain

6:57 am on May 21, 2004 (gmt 0)

10+ Year Member



i have a 1 GB txt file with lots of urls in it.
the data looks like this:
<ExternalPage about="http://www3.example.com/PHILLIPSHOTGLASS/GlassPage.html">
<d:Title>John phillips Blown glass</d:Title>
<d:Description>A small display of glass by John Phillips</d:Description>
</ExternalPage>
<d:Title>Computers</d:Title>
<link r:resource="http://www.example.ie/FME/"/>
<link r:resource="http://pages.example.com/computers/pnyhlen/Timeline.html"/>
</Topic>

is there any way to extract all the urls from this file?
i've tried the xargs and grep commands but they don't work.

anyone with a solution please.

regards
id

<use example.com in code>

[edited by: tedster at 11:07 am (utc) on May 21, 2004]

Easy_Coder

11:17 am on May 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could you post a little more? This looks like a pretty good challenge, but the data looks incomplete. Can you provide a complete Topic tag?

incywincy

11:32 am on May 21, 2004 (gmt 0)

10+ Year Member



this regular expression for a uri comes from the w3c website; i coded it in tcl. i'm sure you could modify it to pull out all of your links.

regexp {^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?} $uri x dontcare1 scheme dontcare2 authority path dontcare3 query dontcare4 fragment
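for anyone not using tcl, here is a rough python sketch of the same idea. it reuses the pattern above unchanged; the function name split_uri and the dictionary keys are just illustrative names, not anything from the w3c text.

```python
import re

# Same pattern as the Tcl line above; group numbers follow the Tcl variable order.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Split one already-isolated URI into its five components."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    # Groups 2, 4, 5, 7 and 9 are the useful parts; the rest are the "dontcare" wrappers.
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://www.example.ie/FME/")
print(parts["authority"])  # www.example.ie
```

note that the pattern only takes apart a string that is already a uri. it matches any input at all, so it cannot find links inside a line by itself, and a value with no scheme and no leading // (such as example.com/computers/) comes back with no authority, only a path.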

indiandomain

12:16 am on May 26, 2004 (gmt 0)

10+ Year Member



ok, let me explain further.

i have a txt file that contains some data which looks like
<ExternalPage about="example.com/PHILLIPSHOTGLASS/GlassPage.html">
<d:Title>John phillips Blown glass</d:Title>
<d:Description>A small display of glass by John Phillips</d:Description>
</ExternalPage>
<d:Title>Computers</d:Title>
<link r:resource="http://www.example.ie/FME/"/>
<link r:resource="example.com/computers/pnyhlen/Timeline.html"/>
</Topic>

i want a script which extracts only the domains from this file and saves them in a txt file.

i was given this unix command but it doesn't work.

grep 'http://' t.txt | sed 's/.*\(http:.*\)\".*/\1/' | perl -MURI -e 'while(<>) { $url = URI->new($_); print $url->authority,"\n"; }'

can anyone help?
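one possible approach, as a minimal python sketch rather than a tested solution for the full file: read the file line by line, pull the values out of the about="..." and r:resource="..." attributes shown in the sample, and keep only the host part. the function names and the treatment of values without http:// as bare host plus path are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Matches the two attributes shown in the sample: about="..." and r:resource="...".
ATTR_RE = re.compile(r'(?:about|r:resource)="([^"]+)"')

def domains_in_line(line):
    """Return the host part of every URL-valued attribute found in one line."""
    hosts = []
    for value in ATTR_RE.findall(line):
        # Assumption: a value without a scheme (example.com/path) is host + path.
        if "://" not in value:
            value = "//" + value
        try:
            host = urlsplit(value).hostname
        except ValueError:
            continue  # skip malformed values instead of stopping the whole run
        if host:
            hosts.append(host)
    return hosts

def extract_domains(in_path, out_path):
    """Stream the input so a 1 GB file never has to fit in memory at once."""
    seen = set()  # each domain is written only the first time it appears
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            for host in domains_in_line(line):
                if host not in seen:
                    seen.add(host)
                    dst.write(host + "\n")
```

calling extract_domains("t.txt", "domains.txt") would write one domain per line, where t.txt is the input file from the command above and domains.txt is a made-up output name. the set of seen domains is the only thing held in memory; if duplicates are acceptable it can be dropped.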