modify href and src tags

     
10:18 am on Nov 26, 2008 (gmt 0)

New User

5+ Year Member

joined:Nov 26, 2008
posts: 5
votes: 0


I downloaded the HTML source of a website.

Now I want a regex that modifies every single link (<a href=...>) in it,

so that it looks like
<a href="www.mysite.com/mining.cgi?http://www.link.com"

Image links should stay the same if they are already complete URLs. If one is relative, like img src="bg.jpg", the full site address should be prepended to it, like src = [somesite.com...]

I need the regex.

9:34 pm on Dec 8, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


I'd use the /e modifier so that the replacement part is evaluated as code, e.g.

$string =~ s/href="([^"]+)"/work_on_link($1)/egis;
$string =~ s/src="([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url = 'http://www.example.com/images/' . $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.example.com/mining.cgi?' . $url;
}
return 'href="' . $url . '"';
}

You might want to check whether you can use regexps easily here (i.e. that there aren't too many cases you have to check to avoid messing things up, such as <link href=""> etc.). Otherwise you'd have to use a tag parser and iterate through the document tree.
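
One way to sidestep the <link href=""> case while staying with regexps is to match the opening <a ...> tag together with its href, so other tags are never touched. The following is only a minimal sketch under some assumptions: the names rewrite_anchor_hrefs and wrap_link are made up for illustration, the proxy prefix is the example.com placeholder from the code above, and only double-quoted href values are handled.

```perl
use strict;
use warnings;

# Hypothetical proxy prefix, borrowed from the example above.
my $proxy = 'http://www.example.com/mining.cgi?';

# Rewrite href values only when they sit inside an <a ...> tag.
# Capture 1 is everything from "<a" up to and including href=",
# capture 2 is the URL itself.
sub rewrite_anchor_hrefs {
    my ($html) = @_;
    $html =~ s{(<a\b[^>]*?\shref=")([^"]+)"}{$1 . wrap_link($2) . '"'}egis;
    return $html;
}

# Same rule as work_on_link above: only absolute http(s) URLs get wrapped.
sub wrap_link {
    my ($url) = @_;
    return $url =~ m!^https?://!i ? $proxy . $url : $url;
}

my $sample = <<'HTML';
<link href="http://www.example.com/style.css" rel="stylesheet">
<a class="nav" href="http://www.link.com/">a link</a>
HTML

print rewrite_anchor_hrefs($sample);
```

The <link> line comes out unchanged because the pattern has to start at "<a" and cannot cross a ">" before it reaches href. A real tag parser is still the more robust choice for messy HTML.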

4:04 am on Dec 12, 2008 (gmt 0)

New User

5+ Year Member

joined:Nov 26, 2008
posts: 5
votes: 0


Thanks for the help, dude... it is working!
But what if the link is like
href=http://....;

instead of

href="http://......";

that is, what if it does not have double quotes around it?

Also if url is like

href=text.html
href="text.html" or like
href="/text.html" or like
href="../text.html" or like
href="./text.html";

The same goes for the src attribute too!

Those are the only cases I have run into so far, but there may be several other forms in which an href is written.

I assume there is some simple regular expression that deals with every href case.

Here is the code I have.
In this code:
$url1 = $FORM{'URL'};
example : [yahoo.com...]

$html =~ s/href="([^"]+)"/work_on_link($1)/egis;
$html =~ s/src\s*=\s*"([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url =~ s/^\\//;
$url = $url1 .'/'. $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.someurl.cgi?' . $url;
}
elsif($url =~ m!^(/)?://!)
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
else
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
return 'href="' . $url . '"';
}

4:54 pm on Dec 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


In that case you should use something like the URI module to build absolute links based on the URL the HTML came from. The code below assumes the links are relative to http://www.example.com/adirectory/ (specified in $baseurl). See if that helps, and ask if something is unclear.

#!/usr/bin/perl -w
use strict;
use URI;

my $string = join("", <DATA>);
print $string;
my $baseurl = 'http://www.example.com/adirectory/';
$string =~ s/(href|src)=(?:"|'|)([^">]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $string;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

__DATA__

<a href="http://www.example.com">this is a link</a>
<a href="/mydir/">this is a link</a>
<a href="./file.htm">this is a link</a>
<a href="../anotherfile.htm">this is a link</a>

<a href=http://www.example.com>this is a link</a>
<a href=/mydir/>this is a link</a>
<a href=./file.htm>this is a link</a>
<a href=../anotherfile.htm>this is a link</a>

<img src="http://www.example.com">
<img src="/mydir/">
<img src="./file.gif">
<img src="../anotherfile.gif">

<img src=http://www.example.com>
<img src=/mydir/>
<img src=./file.gif>
<img src=../anotherfile.gif>
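
For the quoted and unquoted forms in the test data above, it may help to give each quoting style its own capture group, so the value never picks up a stray quote or a following attribute. This is only a sketch: find_urls is a made-up helper name, and the pattern is an illustration rather than a complete HTML attribute grammar (it would also match attributes such as data-src, for instance).

```perl
use strict;
use warnings;

# Match href/src with a double-quoted, single-quoted, or unquoted value.
# Exactly one of captures 2, 3 and 4 is defined per match.
my $attr_re = qr{
    \b(href|src) \s* = \s*
    (?: "([^"]*)"        # capture 2: double-quoted value
      | '([^']*)'        # capture 3: single-quoted value
      | ([^\s"'>]+)      # capture 4: unquoted value, stops at whitespace
    )
}xi;

# Hypothetical helper: returns [attribute, value] pairs in document order.
sub find_urls {
    my ($html) = @_;
    my @found;
    while ($html =~ /$attr_re/g) {
        my $value = defined $2 ? $2 : defined $3 ? $3 : $4;
        push @found, [lc $1, $value];
    }
    return @found;
}

my $sample = q{<a href=/mydir/ class=x><img src='a.gif'><a href="b c.htm">};
for my $pair (find_urls($sample)) {
    print "$pair->[0] => $pair->[1]\n";
}
```

The same three-way alternation could be dropped into the s///e substitution, with the callback receiving whichever capture is defined.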

9:46 am on Dec 29, 2008 (gmt 0)

New User

5+ Year Member

joined:Nov 26, 2008
posts:5
votes: 0


Thanks for the help.. :)

Here is what I am doing: I download the page source using

system ( "/usr/bin/lynx -source '$QUERY' > link.html");

then I open the HTML file and use the regex to change it and display the web page.
open ("fh", "<link.html");
sysread("fh", my $html, 100000);
close("fh");
.......
.......
......
print($html);

How can I implement the code you provided? (I am really not very familiar with Perl.)

4:13 pm on Jan 1, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0



#!/usr/bin/perl -w
use strict;
use URI;
my $QUERY = 'http://www.example.com/';

system ( "/usr/bin/lynx -source '$QUERY' > link.html");
open(my $fh, '<', 'link.html') or die "Cannot open link.html: $!";
my $html = do { local $/; <$fh> };   # read the whole file, not just the first 100000 bytes
close($fh);

my $baseurl = $QUERY;
$html =~ s/(href|src)=(?:"|'|)([^">]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $html;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

That should get you started. Happy new year.
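
One caveat about the system() line: the URL is pasted into a shell command, so a URL containing a single quote would break the quoting. A possible alternative, sketched here under assumptions, is to read lynx's output through a list-form piped open, which does not go through the shell and also avoids the temporary file. The helper name read_command_output is made up, and the lynx path and -source option are simply the ones already used in this thread.

```perl
use strict;
use warnings;

# Run a command without going through the shell and return all of its output.
# The list form of a piped open hands each argument to the program as-is,
# so quotes or spaces in a URL cannot change the command line.
sub read_command_output {
    my @cmd = @_;
    open(my $pipe, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;                      # slurp mode: read everything in one go
    my $output = <$pipe>;
    close($pipe);
    return defined $output ? $output : '';
}

# Assumed usage, with the same lynx path as in the thread:
# my $html = read_command_output('/usr/bin/lynx', '-source', $QUERY);
```

The result can then be fed to the same s///e substitution as above.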

3:56 am on Feb 5, 2009 (gmt 0)

New User

5+ Year Member

joined:Nov 26, 2008
posts: 5
votes: 0


Thanks, dude! This code worked perfectly except for one or two small issues.
Some links are like href="#basics", and it does not display JavaScript.

I don't know why it is doing that.

9:38 pm on Feb 5, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


You don't want to change links like this:

href="#basics"

Those are internal page links.

What do you mean by javascript won't display?
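
To leave such same-page links alone, the callback can hand the attribute back unchanged whenever the value starts with "#". A minimal sketch follows, with some assumptions: should_rewrite is a made-up helper, skipping javascript: and mailto: values is my own addition rather than something asked for in this thread, the proxy prefix is the example.com placeholder, and resolving relative URLs with URI is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical proxy prefix, as in the earlier examples.
my $proxy = 'http://www.example.com/rewrite.cgi?';

# Decide whether a link value should be routed through the proxy.
sub should_rewrite {
    my ($url) = @_;
    return 0 if $url =~ /^#/;                        # same-page anchor
    return 0 if $url =~ /^(?:javascript|mailto):/i;  # assumption: not pages to fetch
    return 1;
}

# Callback for s///e: returns the attribute, rewritten or untouched.
sub work_on_link {
    my ($context, $url) = @_;
    return qq{$context="$url"} unless should_rewrite($url);
    return qq{$context="$proxy$url"};
}

my $html = '<a href="#basics">top</a> <a href="http://www.example.com/a.htm">a</a>';
$html =~ s/(href|src)="([^"]+)"/work_on_link($1, $2)/egi;
print "$html\n";
```

In the full script the same check would sit at the top of work_on_link, before the URI objects are built.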

4:48 am on Feb 12, 2009 (gmt 0)

New User

5+ Year Member

joined:Nov 26, 2008
posts: 5
votes: 0


The thing is, I change every link on the page so that it goes through my proxy,
like [yahoo.com...] to
[myproxy.com...]

If there is an href like href="home.html",
it changes it to
[myproxy.com...]
but if there is an href like href="#home",
it changes it to
[myproxy.com...]

and from there on I start getting problems with the links.
I need to modify the hrefs with # as well in order to make my proxy work correctly.

It's not actually the JavaScript, it's the Java applet. But that's nothing to worry about: it is fine if it doesn't work, but I need to make the proxy work for normal HTML links.

Thanks for the help, dude! This forum has really helped me a lot!

7:05 am on Feb 12, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


OK, but I'm sorry, I simply do not understand what you are trying to do. Maybe someone else will.