
Php robots.txt parser script

Does that exist? If not, regex help needed.

     
3:35 am on June 6, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 18, 2002
posts:638
votes: 0


Are there any scripts that demonstrate how to parse a robots.txt file and find the URLs that are disallowed?

If not, is it possible to achieve this with regex?
Thanks.

4:45 pm on June 6, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 3, 2003
posts:1633
votes: 0


Did a quick Google search and didn't uncover any specific "here is some PHP code for parsing robots.txt".

However, search for "php search engine robots" and you'll find a few open source PHP search engine tools that claim to include robots.txt parsing.

Download the code and see how they do it...
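
For the regex half of the question, a rough sketch, assuming the raw file contents have been pulled into a string, could be as simple as this. It ignores which User-agent block each rule belongs to, so it's a starting point rather than a complete parser:

<?php
// rough sketch: collect every "Disallow:" path from a robots.txt body;
// note that User-agent grouping is deliberately ignored here
$robotstxt = implode("", file('http://www.example.com/robots.txt'));
preg_match_all('/^Disallow:\s*(\S+)/mi', $robotstxt, $matches);
print_r($matches[1]); // e.g. Array ( [0] => /cgi-bin/ [1] => /gfx/ ... )
?>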

10:49 pm on June 8, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 18, 2002
posts:638
votes: 0



No luck. They make it so complicated.

Something like this I would understand:

function is_allowed($url, $robotstxt){
    if allowed return 1 else return 0;
}

If I can write a working function, I'll post it.
If anybody has more info, please let me know.
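
To make the shape concrete, a minimal version might treat every Disallow rule as a simple path prefix, ignoring User-agent grouping and other edge cases:

function is_allowed($url, $robotstxt){
    // take the path part of the URL, e.g. "/QuickSand/index.htm"
    $parts = parse_url($url);
    $path = isset($parts['path']) ? $parts['path'] : "/";
    foreach (explode("\n", $robotstxt) as $line){
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)){
            // a Disallow rule blocks everything under that path prefix
            if (strpos($path, $m[1]) === 0) return 0;
        }
    }
    return 1; // no rule matched
}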

10:36 am on June 9, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 18, 2002
posts:638
votes: 0


I made the PHP code below; it returns the following array ($current_line) with disallowed paths from a robots.txt:

Array
(
    [0] => http://www.example.com/gfx/
    [1] => http://www.example.com/cgi-bin/
    [2] => http://www.example.com/QuickSand/
    [3] => http://www.example.com/pda/
    [4] => http://www.example.com/zForumFFFFFF/
)

If the current URL was: [example.com...]
how can I preg_match this URL against the disallowed string?

if (preg_match("/".$current_line[3]."/i", $current_url)){
    echo "Forbidden";
}

I've tried the preg_match above, but it returns an error message.
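
(Update: the error most likely comes from the slashes inside $current_line[3] colliding with the "/" delimiters of the pattern. Wrapping the string in preg_quote(), which escapes regex metacharacters including a delimiter passed as its second argument, should clear it up:)

if (preg_match("/".preg_quote($current_line[3], "/")."/i", $current_url)){
    echo "Forbidden";
}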


<?php
$current_url = "http://www.example.com/QuickSand/index.htm";
$robotsdomain = "http://www.example.com";
$my_user_agent = "User-agent: *"; // my user agent
$robots = file('http://www.example.com/robots.txt');
$current_line = array();
$count = 0;
for ($i = 0; $i < sizeof($robots); $i++){
    if (trim($robots[$i]) == $my_user_agent){ // rules for agent: *
        for ($checkrules = 1; $checkrules < 10; $checkrules++){
            if (!isset($robots[$i + $checkrules])) break;
            $rule = trim($robots[$i + $checkrules]);
            if ($rule != ""){
                // stop when the next User-agent block starts
                if (is_integer(strpos($rule, "User-agent"))) break;
                // strip a trailing comment, if any
                $pos = strpos($rule, "#");
                if (is_integer($pos)) $rule = trim(substr($rule, 0, $pos));
                // turn "Disallow: /path/" into "http://www.example.com/path/"
                $current_line[$count] = $robotsdomain.str_replace("Disallow: ", "", $rule);
                $count++;
            }
        }
    }
}
print_r($current_line);
echo $current_url;
?>

2:57 pm on June 9, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 18, 2002
posts:638
votes: 0


Sorry about this big thread. I've finished the code, so I'll post it. Maybe it's useful for others like me.

Usage: echo robots_allowed($url);

Returns 1 if allowed to crawl.
Returns 0 if not allowed to crawl.

(Of course, robots.txt should be saved once and then retrieved from the database instead of being fetched for every URL; a cached variant is sketched after the code below.)

function robots_allowed($url){
    $current_url = $url;
    // grab the host from the URL, e.g. "www.example.com"
    $xmp = explode("/", $current_url."/");
    $robotsdomain = trim("http://".$xmp[2]);
    // slashes are stripped everywhere so they cannot clash with the
    // "/" delimiter used in preg_match() below
    $stripped_robotsdomain = str_replace("/", "", $robotsdomain);
    $stripped_current_url = str_replace("/", "", $url);
    $my_user_agent = "User-agent: *"; // my user agent
    $robots = Read_Content($robotsdomain.'/robots.txt');
    $robots = explode("\n", $robots);
    $newdata = array();
    $num = 0;
    for ($i = 0; $i < sizeof($robots); $i++){
        if (trim($robots[$i]) == $my_user_agent){ // rules for agent: *
            for ($checkrules = 1; $checkrules < 10; $checkrules++){
                if (!isset($robots[$i + $checkrules])) break;
                $rule = trim($robots[$i + $checkrules]);
                if ($rule != ""){
                    // stop when the next User-agent block starts
                    if (is_integer(strpos($rule, "User-agent"))) break;
                    // strip a trailing comment, if any
                    $pos = strpos($rule, "#");
                    if (is_integer($pos)) $rule = trim(substr($rule, 0, $pos));
                    $disallow_line = str_replace("Disallow: ", "", $rule);
                    $disallow_line = str_replace("/", "", $disallow_line);
                    $newdata[$num] = $stripped_robotsdomain.$disallow_line;
                    $num++;
                }
            }
        }
    }
    $allowed = 1;
    for ($last = 0; $last < sizeof($newdata); $last++){
        if (trim($newdata[$last]) != ""){
            // preg_quote() escapes remaining metacharacters like "." and "?"
            if (preg_match("/".preg_quote(trim($newdata[$last]))."/i", $stripped_current_url)){
                $allowed = 0; // the URL matches a Disallow rule
            }
        }
    }
    return $allowed;
}

function Read_Content($url){ // open a url and return its contents
    $contents = "";
    $handle = @fopen($url, "r");
    if ($handle){
        $contents = fread($handle, 10000);
        fclose($handle);
    }
    return $contents;
}
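
For the caching mentioned above, a sketch along these lines could work; the /tmp cache path and the one-hour lifetime are illustrative assumptions, not part of the code above:

// hypothetical file-based cache around Read_Content()
function Read_Content_Cached($url){
    $cachefile = "/tmp/robots_".md5($url).".txt";
    // reuse the cached copy while it is less than an hour old
    if (file_exists($cachefile) && (time() - filemtime($cachefile)) < 3600){
        return Read_Content($cachefile); // fopen() reads local files too
    }
    $contents = Read_Content($url);
    if ($contents != ""){
        $handle = @fopen($cachefile, "w");
        if ($handle){
            fwrite($handle, $contents);
            fclose($handle);
        }
    }
    return $contents;
}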

 
