Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
PHP robots.txt parser script
Does that exist? If not, regex help needed.
ikbenhet1




msg:1526638
 3:35 am on Jun 6, 2003 (gmt 0)

Are there any scripts that demonstrate how to parse a robots.txt file and find the URLs that are disallowed?

If not, is it possible to achieve this with regex?
Thanks.

 

dmorison




msg:1526639
 4:45 pm on Jun 6, 2003 (gmt 0)

Did a quick Google search and didn't uncover any specific "here is some PHP code for parsing robots.txt".

However, search for "php search engine robots" and you'll find a few open source PHP search engine tools that claim to include robots.txt parsing.

Download the code and see how they do it...

ikbenhet1




msg:1526640
 10:49 pm on Jun 8, 2003 (gmt 0)


No luck. They make it so complicated.

Something like this I would understand:

function is_allowed($url, $robotstxt){
    // if allowed return 1, else return 0
}

If I can get a working function written, I'll post it.
If anybody has more info, please let me know.

ikbenhet1




msg:1526641
 10:36 am on Jun 9, 2003 (gmt 0)

I made the PHP code below; it returns the following array of disallowed paths from a robots.txt.

Array
(
    [0] => http://www.example.com/gfx/
    [1] => http://www.example.com/cgi-bin/
    [2] => http://www.example.com/QuickSand/
    [3] => http://www.example.com/pda/
    [4] => http://www.example.com/zForumFFFFFF/
)

If the current URL was: [example.com...]
how can I preg_match this URL against the disallowed string?

if (preg_match("/".$current_line[3]."/i", $current_url)) {
    echo "Forbidden";
}

I've tried the preg_match above, but it returns an error message.
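The error most likely comes from the unescaped "/" characters in the disallowed string colliding with the "/" pattern delimiters (the unescaped dots are a subtler problem, since "." matches any character). Wrapping the string in preg_quote(), with the delimiter as its second argument, handles both. A sketch, using hypothetical values matching the array above:

```php
// The disallowed prefix, as built in the array above (hypothetical values).
$current_line = array(3 => "http://www.example.com/QuickSand/");
$current_url  = "http://www.example.com/QuickSand/index.htm";

// preg_quote() escapes regex metacharacters, and with "/" passed as the
// second argument it also escapes the pattern delimiter.
$pattern = "/" . preg_quote($current_line[3], "/") . "/i";
if (preg_match($pattern, $current_url)) {
    echo "Forbidden";
}
```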


<?php
$current_url = "http://www.example.com/QuickSand/index.htm";
$robotsdomain = "http://www.example.com";
$my_user_agent = "User-agent: *"; // my user agent
$robots = file('http://www.example.com/robots.txt');
$current_line = array();
$count = 0;
for ($i = 0; $i < sizeof($robots); $i++) {
    if (trim($robots[$i]) == $my_user_agent) { // rules for agent: *
        for ($checkrules = 1; $checkrules < 10; $checkrules++) {
            if (!isset($robots[$i + $checkrules])) break;
            $rule = trim($robots[$i + $checkrules]);
            if ($rule == "") continue;
            // stop at the next User-agent block
            if (strpos($rule, "User-agent") === 0) break;
            // strip trailing comments
            $pos = strpos($rule, "#");
            if ($pos !== false) $rule = trim(substr($rule, 0, $pos));
            $current_line[$count] = str_replace("Disallow: ", "", $robotsdomain . $rule);
            $count++;
        }
    }
}
print_r($current_line);
echo $current_url;
?>

ikbenhet1




msg:1526642
 2:57 pm on Jun 9, 2003 (gmt 0)

Sorry about this big thread. I've finished the code, so I'll post it. Maybe it's useful for others like me.

Usage: echo robots_allowed($url);

Returns 1 if allowed to crawl.
Returns 0 if not allowed to crawl.

(Of course, robots.txt should be saved once and then retrieved from the database, not fetched for every URL.)

function robots_allowed($url) {
    $current_url = $url;
    $xmp = explode("/", $current_url . "/");
    $robotsdomain = trim("http://" . $xmp[2]);
    $stripped_robotsdomain = str_replace("/", "", $robotsdomain);
    $stripped_current_url = str_replace("/", "", $url);
    $my_user_agent = "User-agent: *"; // my user agent
    $robots = Read_Content($robotsdomain . '/robots.txt');
    $robots = explode("\n", $robots);
    $newdata = array();
    $num = 0;
    for ($i = 0; $i < sizeof($robots); $i++) {
        if (trim($robots[$i]) == $my_user_agent) { // rules for agent: *
            for ($checkrules = 1; $checkrules < 10; $checkrules++) {
                if (!isset($robots[$i + $checkrules])) break;
                $rule = trim($robots[$i + $checkrules]);
                if ($rule == "") continue;
                // stop at the next User-agent block
                if (strpos($rule, "User-agent") === 0) break;
                // strip trailing comments
                $pos = strpos($rule, "#");
                if ($pos !== false) $rule = trim(substr($rule, 0, $pos));
                $disallow_line = str_replace("Disallow: ", "", $rule);
                $disallow_line = str_replace("/", "", $disallow_line);
                $newdata[$num] = $stripped_robotsdomain . $disallow_line;
                $num++;
            }
        }
    }
    $allowed = 1;
    for ($last = 0; $last < sizeof($newdata); $last++) {
        if (trim($newdata[$last]) != "") {
            // preg_quote() escapes the "/" delimiter and other metacharacters
            if (preg_match("/" . preg_quote(trim($newdata[$last]), "/") . "/i", $stripped_current_url)) {
                $allowed = 0;
            }
        }
    }
    return $allowed;
}

function Read_Content($url) { // open a URL and return its content
    $contents = "";
    $handle = @fopen($url, "r");
    if ($handle) {
        $contents = fread($handle, 10000);
        fclose($handle);
    }
    return $contents;
}
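The is_allowed($url, $robotstxt) signature asked for at the top of the thread can also be sketched without any network access or regex at all. This is a sketch under stated assumptions, not the poster's code: it takes the robots.txt content as a string (so it can come from a database cache, per the note above), handles only the "User-agent: *" group, and does plain prefix matching on the URL path with strpos().

```php
// Minimal sketch of the is_allowed($url, $robotstxt) signature requested
// earlier in the thread. Assumptions: only the "User-agent: *" group is
// honoured, and Disallow lines are matched as plain path prefixes.
function is_allowed($url, $robotstxt) {
    // Reduce the URL to its path component, e.g. "/QuickSand/index.htm".
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === null || $path === false || $path === '') {
        $path = '/';
    }
    $in_star_group = false;
    foreach (explode("\n", $robotstxt) as $line) {
        // strip trailing comments, then surrounding whitespace
        $hash = strpos($line, '#');
        if ($hash !== false) {
            $line = substr($line, 0, $hash);
        }
        $line = trim($line);
        if ($line === '') continue;
        if (stripos($line, 'User-agent:') === 0) {
            $in_star_group = (trim(substr($line, 11)) === '*');
        } elseif ($in_star_group && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            // an empty Disallow line means "everything is allowed"
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return 0; // path starts with a disallowed prefix
            }
        }
    }
    return 1;
}
```

Prefix matching sidesteps the regex-escaping problem entirely; the trade-off is that wildcard patterns some crawlers later came to honour (e.g. Disallow: /*.pdf) are not handled.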

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved