You might disallow those URLs in your robots.txt file, but it would be best to do that only if there's a pattern you can use for the disallow rule, rather than listing all 2504 URLs one by one.
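For instance, if the broken URLs all share a common prefix, one pattern rule covers the whole batch. A hypothetical example, assuming the 404s all live under a /oldstuff/ directory:

```
User-agent: *
Disallow: /oldstuff/
```

One line instead of 2504 - but it only works if such a shared prefix actually exists.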
As tedster confirmed, this does not seem to affect indexing or ranking negatively, but it looks a bit odd and probably wastes a lot of bandwidth for both Google and ourselves.
These entries are said to vanish on their own after a while, but sometimes that takes a long time. Google offers a way to delete single URLs, but that interface is limited to 100 pages.
In the past I managed to get them deleted quite quickly by ftp-ing dummy pages with a "noindex" robots metatag (the entries vanish as soon as googlebot next comes along), but currently the folder structure of my URLs-not-found is so weird that I have avoided the effort. Nevertheless, it doesn't seem too complicated to write a little php script (with a text input field for the csv data offered by google) that generates an arbitrarily large bulk of such dummy pages using fwrite.
It is one of my high-priority projects to switch to absolute href notation in order to avoid this problem in the future. But I can't help stating that this "recommendation" is a great shame for the google engineers in the first place. (Though they are really doing a great job in general.)
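To illustrate why relative paths cause this (hypothetical paths, just for the example): a relative href resolves against the directory of the page it sits on, so the same markup points at different targets from different depths:

```html
<!-- on /index.html this resolves to /archive/page.html - fine -->
<!-- on /blog/2008/post.html it resolves to /blog/2008/archive/page.html - a 404 -->
<a href="archive/page.html">archive</a>

<!-- root-absolute notation resolves identically from every page -->
<a href="/archive/page.html">archive</a>
```

If a crawler picks up the relative form in the wrong context, it fabricates exactly the kind of not-found URLs discussed here.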
I'm not really sure, but I don't think this helps. It only causes googlebot NOT to have a look at all, but has no effect on the rotten database. I think I tried that two years ago, and it didn't work. But no guarantees - that was a long time ago.
I assume these urls did resolve at one time, and now they don't - is that correct?
That is correct Tedster, thanks. I was making up some pages and categories on my WPMU-powered site. I deleted them.
"I regularly experience a similar phenomenon, because I still did not manage to switch from relative to absolute paths on my website, and sometimes googlebot seems to get confused by this during the crawl process."
Thanks Oliver. That could tie in with another problem I have been having. The fella who does my site with me just mentioned this about a cache problem:
"Decided to have a look at in firefox and for some very strange reason firefox is using the file path I was using originally and not the blogs one which IE7 is using"
I don't know if all this is related. Maybe these pages are still being 'stored' somewhere? Also, on one of my subdomains none of the pages are being indexed. I will ask my colleague to use this webmasterworld account as he knows more, and thank you for all your help. You're the guys in the know and I really need to know.
As always in such cases, it comes absolutely without warranty, and it is written in my own very amateurish style, but it worked for me. Maybe some of the more sophisticated programmers in here might cross-check or improve it.
In my experience, the quickest way to get those ugly 404s out of your GWT is to create dummy pages with a noindex metatag. This is extremely difficult if - for whatever reason - a complicated directory structure prevents you from doing it by easy means. For instance, as I said, I still use many relative paths for internal links. If I make a mistake myself on any one page with several internal links on it, or if googlebot gets confused somewhere, I often get a couple hundred of such 404s in GWT.
The following script swallows your google .csv-file, which you can download from GWT. Please save it as "notfound.csv" or change the filename at the top of the script accordingly, and put both the script and the csv-file in your root directory. It then loops through the entries, checks whether a file already exists under the given entry, and if not, creates a dummy file with the content defined in the $text variable.
Please note: I parse each line of your csv-file by splitting on the commas. If your URLs contain any commas (mine never do, but I have seen sites where this was the case) the script might not work. I also can't tell what will happen if your URLs-not-found contain double slashes (//).
Maybe check it in your local XAMPP environment first, or reduce the csv-file to two or three lines before looping through the whole bunch.
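If commas in URLs ever become a concern, PHP's built-in csv parser is more robust than a plain split, because it honors quoting. A small sketch with a made-up line (str_getcsv needs PHP 5.3; on older installs fgetcsv on a file handle does the same job):

```php
<?php
// str_getcsv keeps a quoted field in one piece, even if it contains a comma,
// so a URL like the one below is not cut in half
$line = '"http://www.example.com/foo,bar.html",404 Not Found';
$fields = str_getcsv($line);
echo $fields[0];                 // http://www.example.com/foo,bar.html
echo "\n";
echo substr($fields[1], 0, 3);   // 404
?>
```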
<?php
$mydata = file('notfound.csv'); // csv exported from GWT
$text = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">';
$text .= '<html><head><meta name="robots" content="noindex,follow"></head>';
$text .= '<body></body></html>';
foreach ($mydata as $thisindex => $myline) { // loop through csv
    if ($thisindex > 0) { // ignore header line of csv-file
        $lar = explode(',', trim($myline)); // split line on the commas
        $myurl = $lar[0]; // extract URL
        $myerror = substr($lar[1], 0, 3); // extract error code
        if ($myerror == '404') { // act only on 404s, who knows
            $uar = parse_url($myurl);
            $mypath = substr($uar['path'], 1); // extract path + delete preceding slash
            if (!file_exists($mypath)) { // does the file exist meanwhile?
                $mydir = dirname($mypath); // directories involved
                if ($mydir != '.' && !file_exists($mydir)) {
                    mkdir($mydir, 0777, true); // create missing directories recursively
                }
                if ($handle = fopen($mypath, 'w')) { // finally write dummy file to disk
                    fwrite($handle, $text);
                    fclose($handle);
                    echo $myerror.' newly created: <a href="'.$mypath.'">'.$mypath.'</a><br>';
                }
            } else {
                echo '<b>file exists!</b>: <a href="'.$mypath.'">'.$mypath.'</a><br>';
            }
        }
    }
}
?>
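For reference, this is how parse_url dissects an entry, and why a double slash would carry straight over into the dummy file's location - a quick check with a made-up URL:

```php
<?php
// parse_url splits a URL into its components; the script above only uses 'path'
$uar = parse_url('http://www.example.com/some/dir//page.html');
echo $uar['path'], "\n";             // /some/dir//page.html
echo substr($uar['path'], 1), "\n";  // some/dir//page.html - the double slash survives
?>
```

So a double slash is not normalized away by parse_url; whether the file write then succeeds depends on your filesystem, which is why testing on a small csv first is a good idea.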