
Perl Server Side CGI Scripting Forum

    
modify href and src tags
anoopam9




msg:3794551
 10:18 am on Nov 26, 2008 (gmt 0)

I downloaded the HTML source code of a website.

Now I want a regex to modify every single link (<a href) in it, so that it looks like
<a href="www.mysite.com/mining.cgi?http://www.link.com"

Image links should remain the same if they are complete URLs, but if one is like img src="bg.jpg" then the whole site URL should be prepended to it, like src = [somesite.com...]

I need the regex.

 

janharders




msg:3802893
 9:34 pm on Dec 8, 2008 (gmt 0)

I'd go for eval in the replacement part (the /e modifier), e.g.

$string =~ s/href="([^"]+)"/work_on_link($1)/egis;
$string =~ s/src="([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url = 'http://www.example.com/images/' . $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.example.com/mining.cgi?' . $url;
}
return 'href="' . $url . '"';
}

You might want to check whether you can use regexps easily (e.g. that there aren't too many cases you have to check so as not to mess things up, <link href=""> etc.); otherwise you'd have to use a tag parser and iterate through the document tree.
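
To illustrate the <link href=""> caveat: a minimal sketch that only touches href inside <a> tags, so stylesheet links are left alone. The proxy prefix is the one from the question, and the helper name rewrite_anchor_hrefs is made up for this example. It is still a regex, so it assumes double-quoted attributes and will not cope with every page; a real tag parser is the safer route.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed proxy prefix, taken from the example in the question.
my $proxy = 'http://www.example.com/mining.cgi?';

# Hypothetical helper: rewrite href only inside <a ...> tags.
# <link href="..."> and other tags are left untouched.
sub rewrite_anchor_hrefs {
    my ($html) = @_;
    $html =~ s{(<a\b[^>]*?\shref=")([^"]+)(")}{
        # copy the captures first, the inner match below resets them
        my ($pre, $url, $post) = ($1, $2, $3);
        $url = $proxy . $url if $url =~ m!^https?://!i;
        $pre . $url . $post;
    }egi;
    return $html;
}

my $sample = <<'HTML';
<link href="http://www.example.com/style.css" rel="stylesheet">
<a class="x" href="http://www.link.com/">a link</a>
HTML

print rewrite_anchor_hrefs($sample);
```

Only the second line of the sample should come out changed.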

anoopam9




msg:3805776
 4:04 am on Dec 12, 2008 (gmt 0)

Thanks for the help dude.... it is working!
But what if the link is like
href=http://....;

instead of

href="http://......";

that is, if it does not have double quotes around it?

Also, the URL can be like

href=text.html
href="text.html" or like
href="/text.html" or like
href="../text.html" or like
href="./text.html";

The same goes for the src attribute too!

I have only come across these cases, but there may be several other forms in which the href is written.

I'm sure there must be some simple regular expression which deals with every href case.

I have the code here. In this code
$url1 = $FORM{'URL'};
example: [yahoo.com...]

$html =~ s/href="([^"]+)"/work_on_link($1)/egis;
$html =~ s/src\s*=\s*"([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url =~ s!^/!!; # strip a leading slash before joining with the site URL
$url = $url1 .'/'. $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!) # absolute link (http or https)
{
$url = 'http://www.someurl.cgi?' . $url;
}
elsif($url =~ m!^/!) # link relative to the site root
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
else # any other relative link
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
return 'href="' . $url . '"';
}
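
As a rough illustration of the quoting cases listed in this post, here is a sketch of one pattern that accepts double-quoted, single-quoted and unquoted attribute values. The helper name extract_urls is hypothetical, and the pattern is an assumption about what "every href case" needs to cover; it does not claim to handle all HTML found in the wild.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the value of every href/src attribute,
# whatever the quoting style.
sub extract_urls {
    my ($html) = @_;
    my @urls;
    while ($html =~ m{
        \b(?:href|src) \s* = \s*
        (?: "([^"]*)"      # double-quoted value
          | '([^']*)'      # single-quoted value
          | ([^\s"'>]+)    # unquoted value, ends at whitespace or the tag end
        )
    }gix) {
        # exactly one of the three groups is defined per match
        push @urls, grep { defined } ($1, $2, $3);
    }
    return @urls;
}

print "$_\n" for extract_urls(q{<a href=text.html> <a href='/text.html'> <img src = "../bg.jpg">});
```

The same three alternatives could be used in the substitution instead of a single capture group, at the cost of checking which group matched.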

janharders




msg:3807222
 4:54 pm on Dec 14, 2008 (gmt 0)

In that case you should use something like URI to build absolute links based on the URL the HTML came from. The code below assumes the links are relative to http://www.example.com/adirectory/ (specified in $baseurl). See if that helps, and ask if something is unclear.

#!/usr/bin/perl -w
use strict;
use URI;

my $string = join("", <DATA>);
print $string;
my $baseurl = 'http://www.example.com/adirectory/';
# optional quote, then the value up to the next quote, whitespace or >
$string =~ s/(href|src)=(?:"|'|)([^"'>\s]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $string;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

__DATA__

<a href="http://www.example.com">this is a link</a>
<a href="/mydir/">this is a link</a>
<a href="./file.htm">this is a link</a>
<a href="../anotherfile.htm">this is a link</a>

<a href=http://www.example.com>this is a link</a>
<a href=/mydir/>this is a link</a>
<a href=./file.htm>this is a link</a>
<a href=../anotherfile.htm>this is a link</a>

<img src="http://www.example.com">
<img src="/mydir/">
<img src="./file.gif">
<img src="../anotherfile.gif">

<img src=http://www.example.com>
<img src=/mydir/>
<img src=./file.gif>
<img src=../anotherfile.gif>
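
For reference, a small sketch of what the URI resolution step does on its own with the four link forms from the sample data. It assumes the URI module from CPAN is installed and uses its new_abs constructor, which does the same job as calling new and then abs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Base URL the page is assumed to have been fetched from.
my $base = 'http://www.example.com/adirectory/';

for my $rel ('http://www.example.com', '/mydir/', './file.htm', '../anotherfile.htm') {
    # resolve $rel against $base, like URI->new($rel)->abs($base)
    my $abs = URI->new_abs($rel, $base);
    print "$rel => $abs\n";
}
```

The root-relative link should resolve to http://www.example.com/mydir/, ./file.htm to a file inside /adirectory/, and ../anotherfile.htm to a file one level above it.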


anoopam9




msg:3815457
 9:46 am on Dec 29, 2008 (gmt 0)

Thanks for the help.. :)

Here is what I am doing: I am downloading the page source using

system ( "/usr/bin/lynx -source '$QUERY' > link.html");

then I open the HTML file and use a regex to change and display the webpage.
open ("fh", "<link.html");
sysread("fh", my $html, 100000);
close("fh");
.......
.......
......
print($html);

How can I implement the code you provided? (I am really not very familiar with Perl.)

janharders




msg:3817365
 4:13 pm on Jan 1, 2009 (gmt 0)


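#!/usr/bin/perl -w
use strict;
use URI;
my $QUERY = 'http://www.example.com/';

# fetch the page source with lynx, as in your snippet
system ( "/usr/bin/lynx -source '$QUERY' > link.html");

# read the whole file, not just the first 100000 bytes
open (my $fh, '<', 'link.html') or die "cannot open link.html: $!";
my $html = do { local $/; <$fh> };
close($fh);

my $baseurl = $QUERY;
# optional quote, then the value up to the next quote, whitespace or >
$html =~ s/(href|src)=(?:"|'|)([^"'>\s]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $html;

sub work_on_link {
my $context = shift; # "href" or "src"
my $url = shift;
print $url . "\n"; # debug output, remove once it works
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1); # absolute URL, resolved against $baseurl
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}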

That should get you started. Happy new year.

anoopam9




msg:3842699
 3:56 am on Feb 5, 2009 (gmt 0)

Thanks dude! This code worked perfectly except for one or two small issues.
Some links are like href="#basics", and it does not display javascript.

I don't know why it is doing that.

krugs




msg:3843329
 9:38 pm on Feb 5, 2009 (gmt 0)

You don't want to change links like this:

href="#basics"

Those are internal page links.

What do you mean by javascript won't display?
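
One way to leave such links alone is a small check before rewriting. This is only a sketch: the helper name leave_untouched is hypothetical, and skipping javascript: and mailto: links as well is an assumption that goes beyond the #basics case discussed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the link should be passed through unchanged.
sub leave_untouched {
    my ($url) = @_;
    return 1 if $url =~ /^#/;                        # anchor within the same page
    return 1 if $url =~ /^(?:javascript|mailto):/i;  # assumed: nothing to fetch via a proxy
    return 0;
}

for my $url ('#basics', 'home.html', 'mailto:someone@example.com') {
    printf "%-28s %s\n", $url, leave_untouched($url) ? 'keep as is' : 'rewrite';
}
```

In a function like work_on_link above, the check could come first and return the attribute with its original value when it is true.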

anoopam9




msg:3847705
 4:48 am on Feb 12, 2009 (gmt 0)

The thing here is that I change every link on the page to go through my proxy,
like [yahoo.com...] to
[myproxy.com...]

If there is an href like href="home.html"
it changes it to
[myproxy.com...]
but if there is an href like href="#home"
it changes it to
[myproxy.com...]

and from there on I start getting problems with the links.
I need to modify hrefs with # as well in order to make my proxy work correctly.

It's not actually the javascript, it's the Java applet. But that's nothing to worry about. If it doesn't work, that's fine, but I need to make the proxy work for normal HTML links.

Thanks for the help dude! This forum has really helped me a lot!

krugs




msg:3847752
 7:05 am on Feb 12, 2009 (gmt 0)

OK, but I'm sorry, I simply do not understand what you are trying to do. Maybe someone else will.
