Parsing links from webpage

Anyone have code?

         

aspdaddy

4:58 pm on Apr 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Has anyone done this with ASP? How do you determine the start and end positions of the actual URLs?

Thanks.

txbakers

6:23 pm on Apr 2, 2003 (gmt 0)

I think the server variable is called "SCRIPT_NAME"; it returns the path of the current page, and "SERVER_NAME" gives you the host if you need the full URL.

From there you can parse it six ways to Sunday if you need to.

The QueryString is everything past the "?".

You can split the URL at the "/" to get the individual path segments, etc.
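
As a rough sketch of the splitting described above, something like the following might do. SplitUrl and the sample URL are made-up names for illustration only; in an ASP page the input could be built from Request.ServerVariables("SCRIPT_NAME") and Request.ServerVariables("QUERY_STRING").

```vbscript
' Hypothetical helper, not from the thread: splits a URL string into
' its path segments (return value) and its query string (strQuery).
Function SplitUrl(strUrl, ByRef strQuery)
    Dim intPos, strPath

    ' Everything past the first "?" is the query string
    intPos = InStr(strUrl, "?")
    If intPos > 0 Then
        strPath = Left(strUrl, intPos - 1)
        strQuery = Mid(strUrl, intPos + 1)
    Else
        strPath = strUrl
        strQuery = ""
    End If

    ' Split the remaining path at each "/"
    SplitUrl = Split(strPath, "/")
End Function

Dim arrParts, strQueryPart
arrParts = SplitUrl("/forum/asp/thread.asp?id=42&page=2", strQueryPart)
' arrParts now holds "", "forum", "asp", "thread.asp"
' strQueryPart now holds "id=42&page=2"
```

Note that a path starting with "/" gives an empty first element, so skip index 0 if you only want the folder and file names.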

aspdaddy

7:23 pm on Apr 2, 2003 (gmt 0)

I meant the URLs in the page content, sorry I didn't explain better :).

I just can't seem to debug the parsing code so that it works with all links; it grabs either too little or too much. I just wondered if anyone had done this before in ASP and has an algorithm.

txbakers

7:29 pm on Apr 2, 2003 (gmt 0)

Ah, that is different. You can do that with plain old JavaScript.

I have a routine that checks the querystring for content, and if there is content, appends another variable to it, or else redirects.

It's the same concept though, just client side.

Is that what you are looking for? Maybe you can post an example.

aspdaddy

8:16 pm on Apr 2, 2003 (gmt 0)

We must be on different wavelengths tonight :)

The code might not make much sense as it's mostly function calls, but can you see what I'm trying to do here? I want to pull out all the internal links.

[quote]

blnFound = true
strFind = "href="
intCurrent = 1
intStart = 1
intEnd = 1

' get the next page
strFileContents = getPage(strURL)

while (blnFound = true)
    ' find start of a link
    intStart = instr(intCurrent, strFileContents, strFind, vbTextCompare)

    if (intStart = 0) then
        ' add this page to front of visited list
        visited.AddHead(strURL)

        ' exit loop
        blnFound = false
    else
        ' find first char of url
        intStart = intStart + len(strFind) + 1

        ' find last char of url - finds next ',",> and space and uses lowest value
        intEnd = findEnd(intStart, strFileContents)

        ' parse out the link and convert to lower case
        strLink = lcase(Mid(strFileContents, intStart, intEnd - intStart))

        ' trim spaces and remove any quotes
        strLink = Tidy(strLink)

        ' check that its not a dependent image/script etc.
        if (isDocument(strLink)) then

            ' check its an internal link
            if (isInternalLink(strLink)) then

                ' convert to full uri syntax
                strLink = getAbsoluteURI(strLink)
[/quote]

The problem is that different sites mark up links differently: quotes or no quotes around the URLs, titles before or after the href, etc. I just can't seem to get it to work on ALL links.

I also need it to work for iframes and JavaScript links, but one thing at a time.

There must be an easier way than this?
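
One possibly easier route, offered only as a minimal sketch: let the VBScript RegExp object (VBScript 5.5 or later) find the attribute value instead of tracking start and end positions by hand. ExtractHrefs and the sample markup below are hypothetical and not part of the code above. The pattern assumes three cases: a value in double quotes, a value in single quotes, or an unquoted value that ends at whitespace or ">".

```vbscript
' Hypothetical helper, not from the original post: returns an array of the
' href values found in strHtml, whether quoted with ", with ', or unquoted.
Function ExtractHrefs(strHtml)
    Dim objRe, objMatches, objMatch, arrLinks(), i, strUrl

    Set objRe = New RegExp
    objRe.IgnoreCase = True   ' matches href, HREF, Href, ...
    objRe.Global = True       ' find every match, not just the first
    ' Three alternatives: "double quoted", 'single quoted', or unquoted
    objRe.Pattern = "\bhref\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s""'>]+))"

    Set objMatches = objRe.Execute(strHtml)
    If objMatches.Count = 0 Then
        ExtractHrefs = Array()
        Exit Function
    End If

    ReDim arrLinks(objMatches.Count - 1)
    i = 0
    For Each objMatch In objMatches
        ' Only one of the three capture groups is filled for each match
        strUrl = objMatch.SubMatches(0) & objMatch.SubMatches(1) & objMatch.SubMatches(2)
        arrLinks(i) = Trim(strUrl)
        i = i + 1
    Next
    ExtractHrefs = arrLinks
End Function

Dim arrFound
arrFound = ExtractHrefs("<a title='x' href=""page1.asp"">a</a> " & _
                        "<A HREF='/dir/p2.htm'>b</A> " & _
                        "<a href=p3.asp class=nav>c</a>")
' arrFound now holds page1.asp, /dir/p2.htm and p3.asp
```

Because the match starts at the attribute name, a title before or after the href should not matter. Swapping "href" for "src" in the pattern would probably cover iframes the same way. Links built inside JavaScript are a different problem and are not handled by this sketch.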