
Perl Server Side CGI Scripting Forum

    
modify href and src tags
anoopam9

5+ Year Member



 
Msg#: 3794549 posted 10:18 am on Nov 26, 2008 (gmt 0)

I downloaded the HTML source code of a website.

Now I want a regex to modify every single link (<a href=...>) in it, so that it looks like
<a href="www.mysite.com/mining.cgi?http://www.link.com"

Image links should stay the same if they are already complete URLs. If one is relative, like img src="bg.jpg", the full site URL should be prepended to it, like src = [somesite.com...]

I need the regex.

 

janharders

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3794549 posted 9:34 pm on Dec 8, 2008 (gmt 0)

I'd use the /e modifier so the replacement part is evaluated as code, e.g.

$string =~ s/href="([^"]+)"/work_on_link($1)/egis;
$string =~ s/src="([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url = 'http://www.example.com/images/' . $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.example.com/mining.cgi?' . $url;
}
return 'href="' . $url . '"';
}

You might want to check whether regexps are really up to the job here (i.e. that there are not too many special cases to handle without messing things up, such as <link href="">). Otherwise you'd have to use a tag parser and iterate through the document tree.
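
One way to sidestep the <link href=""> case while still using a regex is to match only inside <a ...> start tags. This is a minimal sketch, not a full solution: it assumes double-quoted attribute values, and rewrite_anchor_hrefs and its callback argument are hypothetical names that do not appear in the code above.

```perl
use strict;
use warnings;

# Rewrite href values only inside <a ...> start tags, leaving
# <link href="..."> and other tags untouched.
# Assumption: attribute values are double-quoted.
sub rewrite_anchor_hrefs {
    my ($html, $callback) = @_;
    # $1 = everything up to the opening quote, $2 = the URL, $3 = closing quote
    $html =~ s{(<a\s+(?:[^>]*?\s)?href=")([^"]+)(")}{$1 . $callback->($2) . $3}egi;
    return $html;
}

my $html = qq{<link href="style.css">\n<a class="nav" href="http://www.link.com/">x</a>\n};
print rewrite_anchor_hrefs(
    $html,
    sub { 'http://www.example.com/mining.cgi?' . $_[0] },
);
```

The callback receives the bare URL and returns the replacement URL, so the same sub can hold whatever wrapping rule you need.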

anoopam9

5+ Year Member



 
Msg#: 3794549 posted 4:04 am on Dec 12, 2008 (gmt 0)

Thanks for the help, dude... it is working!
But what if the link has no double quotes around it, like
href=http://....;

instead of

href="http://......";

Also, the URL can be like

href=text.html
href="text.html" or like
href="/text.html" or like
href="../text.html" or like
href="./text.html";

The same goes for the src attribute!

I have only come across these cases so far, but there may be several other forms in which the href is written.

I assume there is some simple regular expression that deals with every href case.

Here is the code I have. In this code
$url1 = $FORM{'URL'};
example: [yahoo.com...]

$html =~ s/href="([^"]+)"/work_on_link($1)/egis;
$html =~ s/src\s*=\s*"([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url =~ s/^\\//;
$url = $url1 .'/'. $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^http?://!)
{
$url = 'http://www.someurl.cgi?' . $url;
}
elsif($url =~ m!^(/)?://!)
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
else
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
return 'href="' . $url . '"';
}

janharders

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3794549 posted 4:54 pm on Dec 14, 2008 (gmt 0)

In that case you should use something like the URI module to build absolute links based on the URL the HTML came from. The code below assumes the links are relative to http://www.example.com/adirectory/ (specified in $baseurl). See if that helps, and ask if something is unclear.

#!/usr/bin/perl -w
use strict;
use URI;

my $string = join("", <DATA>);
print $string;
my $baseurl = 'http://www.example.com/adirectory/';
$string =~ s/(href|src)=(?:"|'|)([^">]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $string;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

__DATA__

<a href="http://www.example.com">this is a link</a>
<a href="/mydir/">this is a link</a>
<a href="./file.htm">this is a link</a>
<a href="../anotherfile.htm">this is a link</a>

<a href=http://www.example.com>this is a link</a>
<a href=/mydir/>this is a link</a>
<a href=./file.htm>this is a link</a>
<a href=../anotherfile.htm>this is a link</a>

<img src="http://www.example.com">
<img src="/mydir/">
<img src="./file.gif">
<img src="../anotherfile.gif">

<img src=http://www.example.com>
<img src=/mydir/>
<img src=./file.gif>
<img src=../anotherfile.gif>
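
To see what URI does with each link form in the test data, the resolution step can be run on its own. This is a small sketch, assuming the same base URL as above; new_abs is the URI constructor that resolves a possibly relative reference against a base.

```perl
use strict;
use warnings;
use URI;

my $baseurl = 'http://www.example.com/adirectory/';

# The four link forms from the test data above
for my $link ('http://www.example.com', '/mydir/', './file.htm', '../anotherfile.htm') {
    my $abs = URI->new_abs($link, $baseurl);   # resolve against the base
    print "$link => $abs\n";
}
```

With that base, /mydir/ resolves to http://www.example.com/mydir/, ./file.htm to http://www.example.com/adirectory/file.htm, and ../anotherfile.htm to http://www.example.com/anotherfile.htm. A link that is already absolute comes back unchanged.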


anoopam9

5+ Year Member



 
Msg#: 3794549 posted 9:46 am on Dec 29, 2008 (gmt 0)

Thanks for the help.. :)

Here is what I am doing: I download the page source using

system ( "/usr/bin/lynx -source '$QUERY' > link.html");

then I open the HTML file and use a regex to change it before displaying the webpage.
open ("fh", "<link.html");
sysread("fh", my $html, 100000);
close("fh");
.......
.......
......
print($html);

How can I plug in the code you provided? (I am really not very familiar with Perl.)

janharders

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3794549 posted 4:13 pm on Jan 1, 2009 (gmt 0)


#!/usr/bin/perl -w
use strict;
use URI;
my $QUERY = 'http://www.example.com/';

system ( "/usr/bin/lynx -source '$QUERY' > link.html");
open(my $fh, '<', 'link.html') or die "Cannot open link.html: $!";
my $html = do { local $/; <$fh> };   # slurp the whole file, no 100000-byte cap
close($fh);

my $baseurl = $QUERY;
$html =~ s/(href|src)=(?:"|'|)([^">]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $html;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

That should get you started. Happy new year.

anoopam9

5+ Year Member



 
Msg#: 3794549 posted 3:56 am on Feb 5, 2009 (gmt 0)

Thanks dude! This code worked perfectly except for one or two small issues.
Some links are like href="#basics", and it does not display javascript.

I don't know why it behaves like that.

krugs

5+ Year Member



 
Msg#: 3794549 posted 9:38 pm on Feb 5, 2009 (gmt 0)

You don't want to change links like this:

href="#basics"

those are internal page links

What do you mean by javascript won't display?
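
One way to leave such links alone is to check the URL before rewriting it. This is a rough sketch, assuming the rewriting sub receives the raw attribute value. should_rewrite is a hypothetical helper, and skipping javascript: and mailto: URLs as well is an assumption about what else a proxy would want to pass through.

```perl
use strict;
use warnings;

# Return false for URLs that should stay exactly as they are in the page.
# Assumption: fragment-only links plus javascript: and mailto: URLs are skipped.
sub should_rewrite {
    my ($url) = @_;
    return 0 if $url =~ /^\s*#/;                        # in-page anchor such as #basics
    return 0 if $url =~ /^\s*(?:javascript|mailto):/i;  # not fetchable through a proxy
    return 1;
}

for my $url ('#basics', 'home.html', 'mailto:someone@example.com') {
    printf "%-30s %s\n", $url, should_rewrite($url) ? 'rewrite' : 'leave as is';
}
```

Inside work_on_link, the original attribute could then be returned unchanged whenever should_rewrite($url) is false. The browser resolves a bare #fragment against the page it is already showing, which is the proxied page.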

anoopam9

5+ Year Member



 
Msg#: 3794549 posted 4:48 am on Feb 12, 2009 (gmt 0)

The thing is, I change every link on the page so that it goes through my proxy,
like [yahoo.com...] to
[myproxy.com...]

if there is a href like href="home.html"
it changes it to
[myproxy.com...]
but if there is a href like href="#home"
it changes it to
[myproxy.com...]

and from there on I start getting problems with the links.
I have to modify the hrefs that contain # as well in order to make my proxy work correctly.

It's not actually the javascript, it's the Java applet. But that's nothing to worry about; if it doesn't work, that will be fine. I do need the proxy to work for normal HTML links.

Thanks for the help dude! This forum has really helped me a lot!

krugs

5+ Year Member



 
Msg#: 3794549 posted 7:05 am on Feb 12, 2009 (gmt 0)

OK, but I'm sorry, I simply do not understand what you are trying to do. Maybe someone else will.
