
Forum Moderators: coopster & jatar k & phranque


modify href and src tags

     

anoopam9

10:18 am on Nov 26, 2008 (gmt 0)

5+ Year Member



I downloaded the HTML source of a website, and I want a regex that modifies every link (<a href) in it.

The result should look like
<a href="www.mysite.com/mining.cgi?http://www.link.com"

Image links should stay the same if they are already complete URLs. If one is relative, like img src="bg.jpg", the full site address should be prepended, like src = [somesite.com...]

I need the regex.

janharders

9:34 pm on Dec 8, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I'd use the /e modifier (eval) in the replacement part, e.g.

$string =~ s/href="([^"]+)"/work_on_link($1)/egis;
$string =~ s/src="([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url = 'http://www.example.com/images/' . $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.example.com/mining.cgi?' . $url;
}
return 'href="' . $url . '"';
}

You might want to check whether regexps are really workable here (e.g. that there aren't too many special cases you have to handle to avoid messing things up, such as <link href="">). Otherwise you'd have to use a tag parser and iterate through the document tree.
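To illustrate the <link href=""> caveat, here is a minimal sketch that anchors each substitution to its tag, so href is only touched inside <a> and src only inside <img>. The sample HTML and the helper names wrap_link and fix_image are made up for this example; the rewrite rules are the same as in work_on_link and work_on_image above. It is still a regex and still only handles double-quoted attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; only the <a> and <img> attributes should change.
my $html = <<'HTML';
<link href="style.css" rel="stylesheet">
<a href="http://www.link.com">a link</a>
<img src="bg.jpg">
HTML

# Match href only inside an <a ...> tag and src only inside an <img ...> tag.
# $1 is everything up to the opening quote, $2 the URL, $3 the closing quote.
$html =~ s{(<a\b[^>]*?\bhref=")([^"]+)(")}{$1 . wrap_link($2) . $3}egi;
$html =~ s{(<img\b[^>]*?\bsrc=")([^"]+)(")}{$1 . fix_image($2) . $3}egi;

print $html;

# Same rule as work_on_link: only absolute links go through the script.
sub wrap_link {
    my $url = shift;
    return $url =~ m!^https?://!i
        ? 'http://www.example.com/mining.cgi?' . $url
        : $url;
}

# Same rule as work_on_image: relative image paths get the site prefix.
sub fix_image {
    my $url = shift;
    return $url =~ m!^https?://!i
        ? $url
        : 'http://www.example.com/images/' . $url;
}
```

With this input the <link> line should come through untouched, while the <a> and <img> lines are rewritten.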

anoopam9

4:04 am on Dec 12, 2008 (gmt 0)

5+ Year Member



Thanks for the help, dude... it is working!
But what if the link is like
href=http://....;

instead of

href="http://......";

that is, if it does not have double quotes around it?

Also, what if the URL is like

href=text.html
href="text.html" or like
href="/text.html" or like
href="../text.html" or like
href="./text.html";

The same goes for the src attribute!

I have only run into these cases so far, but there may be several other forms in which the href is written.

I assume there is some simple regular expression that deals with every href case.

Here is my code.
In this code
$url1 = $FORM{'URL'};
example : [yahoo.com...]

$html =~ s/href="([^"]+)"/work_on_link($1)/egis;
$html =~ s/src\s*=\s*"([^"]+)"/work_on_image($1)/egis;

sub work_on_image {
my $url = shift;
if($url !~ m!^https?://!)
{
$url =~ s!^/!!;    # strip a leading slash so it is not doubled below
$url = $url1 .'/'. $url;
}
return 'src="' . $url . '"';
}

sub work_on_link {
my $url = shift;
if($url =~ m!^https?://!)
{
$url = 'http://www.someurl.cgi?' . $url;
}
elsif($url =~ m!^(/)?://!)
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
else
{
$url = 'http://www.someurl.cgi?' . $url1 . $url;
}
return 'href="' . $url . '"';
}

janharders

4:54 pm on Dec 14, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



In that case you should use something like the URI module to build absolute links based on the URL the HTML came from. The code below assumes the links are relative to http://www.example.com/adirectory/ (specified in $baseurl). See if that helps, and ask if something is unclear.

#!/usr/bin/perl -w
use strict;
use URI;

my $string = join("", <DATA>);
print $string;
my $baseurl = 'http://www.example.com/adirectory/';
$string =~ s/(href|src)=(?:"|'|)([^">]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $string;

sub work_on_link {
my $context = shift;
my $url = shift;
print $url . "\n";
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

__DATA__

<a href="http://www.example.com">this is a link</a>
<a href="/mydir/">this is a link</a>
<a href="./file.htm">this is a link</a>
<a href="../anotherfile.htm">this is a link</a>

<a href=http://www.example.com>this is a link</a>
<a href=/mydir/>this is a link</a>
<a href=./file.htm>this is a link</a>
<a href=../anotherfile.htm>this is a link</a>

<img src="http://www.example.com">
<img src="/mydir/">
<img src="./file.gif">
<img src="../anotherfile.gif">

<img src=http://www.example.com>
<img src=/mydir/>
<img src=./file.gif>
<img src=../anotherfile.gif>
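As a quick way to see what URI does with each form of link in the sample data, here is a small sketch that resolves the four variants against the same $baseurl and prints the results. It assumes the URI module is installed; the %resolved hash is just a name chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $baseurl = 'http://www.example.com/adirectory/';
my $base    = URI->new($baseurl);

# The four link forms from the sample data above.
my @links = ('http://www.example.com', '/mydir/', './file.htm', '../anotherfile.htm');

# Resolve each one against the base URL and remember the result.
my %resolved;
for my $link (@links) {
    $resolved{$link} = URI->new($link)->abs($base)->as_string;
    print "$link => $resolved{$link}\n";
}
```

Following the usual relative-reference rules, /mydir/ should resolve against the host root, ./file.htm inside /adirectory/, and ../anotherfile.htm one level up, while the already absolute link stays as it is.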

anoopam9

9:46 am on Dec 29, 2008 (gmt 0)

5+ Year Member



Thanks for the help.. :)

Here is what I am doing: I download the page source using

system ( "/usr/bin/lynx -source '$QUERY' > link.html");

Then I open the HTML file and use a regex to change and display the web page.
open ("fh", "<link.html");
sysread("fh", my $html, 100000);
close("fh");
.......
.......
......
print($html);

How can I implement the code you provided? (I am really not very familiar with Perl.)

janharders

4:13 pm on Jan 1, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member




#!/usr/bin/perl -w
use strict;
use URI;
my $QUERY = 'http://www.example.com/';

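system("/usr/bin/lynx -source '$QUERY' > link.html") == 0
    or die "lynx failed: $?";
# Lexical filehandle: a string like "fh" is not allowed under "use strict".
open(my $fh, '<', 'link.html') or die "Cannot open link.html: $!";
my $html = do { local $/; <$fh> };    # read the whole file, not just the first 100000 bytes
close($fh);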

my $baseurl = $QUERY;
$html =~ s/(href|src)=(?:"|'|)([^">]+)(?:"|'|)/work_on_link($1, $2, $baseurl)/egis;

print $html;

sub work_on_link {
my $context = shift;
my $url = shift;
print STDERR $url . "\n";    # debug output; kept out of the page that is printed
my $baseurl = shift;
my $u1 = URI->new($baseurl);
my $u2 = URI->new($url);
my $u3 = $u2->abs($u1);
return $context . '="http://www.example.com/rewrite.cgi?' . $u3 . '"';
}

That should get you started. Happy new year.

anoopam9

3:56 am on Feb 5, 2009 (gmt 0)

5+ Year Member



Thanks, dude! This code worked perfectly except for one or two small issues.
Some links are like href="#basics", and it does not display JavaScript.

I don't know why it does that.

krugs

9:38 pm on Feb 5, 2009 (gmt 0)

5+ Year Member



You don't want to change links like this:

href="#basics"

Those are internal page links (anchors within the same page).

What do you mean by javascript won't display?
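A minimal sketch of leaving such links alone: check the URL before rewriting and return the attribute unchanged when it is only a fragment. The helper name should_rewrite is made up for this example, and the substitution is a simplified stand-in for work_on_link (quoted attributes only, no URI resolution) so that the skip logic is easy to see. Skipping javascript: and mailto: links as well is an extra assumption, not something from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: should this link be sent through the rewrite script?
sub should_rewrite {
    my $url = shift;
    return 0 if $url =~ /^#/;                        # in-page anchor such as #basics
    return 0 if $url =~ /^(?:javascript|mailto):/i;  # assumption: nothing to fetch here
    return 1;
}

my $html = '<a href="#basics">Basics</a> <a href="home.html">Home</a>';

# Simplified stand-in for work_on_link: only the skip logic is shown.
$html =~ s{(href)=(["'])(.*?)\2}{
    my ($attr, $quote, $url) = ($1, $2, $3);
    should_rewrite($url)
        ? $attr . '="http://www.example.com/rewrite.cgi?' . $url . '"'
        : $attr . '=' . $quote . $url . $quote;
}egis;

print "$html\n";
```

In the full script the same check would go at the top of work_on_link, before the URI objects are built.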

anoopam9

4:48 am on Feb 12, 2009 (gmt 0)

5+ Year Member



The thing is, I change every link on the page so that it goes through my proxy,
like [yahoo.com...] to
[myproxy.com...]

If there is an href like href="home.html",
it changes it to
[myproxy.com...]
but if there is an href like href="#home",
it changes it to
[myproxy.com...]

and from there on I start getting problems with the links.
I have to modify the hrefs with # as well in order to make my proxy work correctly.

It's not actually JavaScript, it's a Java applet. But that's nothing to worry about. It will be fine if it doesn't work, but I need to make the proxy work for normal HTML links.
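If the # links do have to go through the proxy, one possible approach (a sketch only, not something confirmed in this thread) is to keep the fragment outside the part handed to the proxy and to percent-encode the target URL. The endpoint in $proxy and the sub name proxied are placeholders for this example, and the proxy script would have to decode its query string again before fetching.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical proxy endpoint, standing in for the real proxy address.
my $proxy = 'http://www.example.com/rewrite.cgi?';

# Build a proxied link from an already absolute URL.
sub proxied {
    my $abs = shift;
    # Split off the fragment: browsers never send it to the server.
    my ($target, $fragment) = split /#/, $abs, 2;
    # Percent-encode everything except unreserved characters, so a '?' or '&'
    # in the target cannot be confused with the proxy's own query string.
    (my $encoded = $target) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $proxy . $encoded . (defined $fragment ? "#$fragment" : '');
}

print proxied('http://www.example.com/home.html#home'), "\n";
```

The fragment then ends up on the proxied URL itself, so the browser can still jump to the anchor once the rewritten page has loaded.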

Thanks for the help, dude! This forum has really helped me a lot!

krugs

7:05 am on Feb 12, 2009 (gmt 0)

5+ Year Member



OK, but I'm sorry, I simply do not understand what you are trying to do. Maybe someone else will.
 
