Forum Moderators: goodroi
Since working with Google and Microsoft to support a single submission format with Sitemaps, we have continued to discuss further enhancements to make it easy for webmasters to get their content to all search engines quickly. All search crawlers recognize robots.txt, so it seemed like a good idea to use that mechanism to allow webmasters to share their Sitemaps. You agreed, and encouraged us on our suggestion board to allow robots.txt discovery of Sitemaps. We took the idea to Google and Microsoft, and we are happy to announce today that you can now publish your Sitemap locations in a uniform way across all participating engines. To do this, simply add the following line to your robots.txt file:
Sitemap: http://www.example.tld/sitemap.xml
Credit to Ask.com, Google, Microsoft Live Search and Yahoo!
FYI [sitemaps.org...]
I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.
Also, Google Webmaster Tools gives an error on that syntax in the robots.txt file:
Parsing results
Sitemap: http://www.example.com/sitemap.xml
Syntax not understood
Otherwise it is as you say: heaven for scrapers.
But then again: too many requests (two is too many) with no JS and no images gets one warning, then a block until the next episode...
What would be the advantages or disadvantages of using this .xml feed?
For one, Yahoo, Google, MSN, Ask and IBM are all supporting this same method. For another, it is included in the robots.txt file, which all of them hit first, and so makes it easier for them to find your sitemap.
Here's some PHP for generating a Sitemaps.org-compatible sitemap.xml file. Please post any modifications you might make to it:
<?php
/*########################################################
# Generates a sitemap per specifications found at:
# http://www.sitemaps.org/protocol.html
# DOES NOT traverse directories
# 20070712 James Butler james at musicforhumans dot com
# Based on opendir() code by mike at mihalism dot com
# http://us.php.net/manual/en/function.readdir.php#72793
# Free for all: http://www.gnu.org/licenses/lgpl.html
#
# Usage:
# 1) Save this as file name: sitemap_gen.php
# 2) Change variables noted below for your site
# 3) Place this file in your site's root directory
# 4) Run from http://www.yourdomain.com/sitemap_gen.php
#
# <lastmod>    - OPTIONAL
#   YYYY-MM-DD
# <changefreq> - OPTIONAL
#   always | hourly | daily | weekly | monthly |
#   yearly | never
# <priority>   - OPTIONAL
#   0.0-1.0 [default 0.5]
#
# Add completed sitemap file to robots.txt:
# Sitemap: http://www.yourdomain.com/sitemap.xml
########################################################*/

######## CHANGE THESE FOR YOUR SITE #########
# IMPORTANT: Trailing slashes are REQUIRED!
$my_domain = "http://www.yourdomain.com/";
$root_path_to_site = "/root/path/to/site/";
$file_types_to_include = array('html', 'htm');
############## END CHANGES ##################

$xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$xml .= "  <url>\n";
$xml .= "    <loc>".$my_domain."</loc>\n";
$xml .= "    <priority>1.0</priority>\n";
$xml .= "  </url>\n";

## START modified mike at mihalism dot com code ##
// Return the lowercased extension of a file name
function file_type($file) {
    $path_chunks = explode("/", $file);
    $thefile = $path_chunks[count($path_chunks) - 1];
    $dotpos = strrpos($thefile, ".");
    return strtolower(substr($thefile, $dotpos + 1));
}

$file_count = 0;
$files = array();
$path = opendir($root_path_to_site);
while (false !== ($filename = readdir($path))) {
    $files[] = $filename;
}
closedir($path);
sort($files);

foreach ($files as $file) {
    $extension = file_type($file);
    if ($file != '.' && $file != '..' && array_search($extension, $file_types_to_include) !== false) {
        $file_count++;
## END modified mike at mihalism dot com code ##
        $xml .= "  <url>\n";
        $xml .= "    <loc>".$my_domain.$file."</loc>\n";
        // Use the full path so filemtime() works from any working directory
        $xml .= "    <lastmod>".date("Y-m-d", filemtime($root_path_to_site.$file))."</lastmod>\n";
        $xml .= "    <changefreq>monthly</changefreq>\n";
        $xml .= "    <priority>0.5</priority>\n";
        $xml .= "  </url>\n";
    }
}
$xml .= "</urlset>\n";

if ($file_count == 0) {
    echo "No files to add to the Sitemap\n";
}
else {
    $sitemap = @fopen("sitemap.xml", "w+");
    if ($sitemap && is_writable("sitemap.xml")) {
        fwrite($sitemap, $xml);
        fclose($sitemap);
        echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
        echo "Remove items you do not want included in the search engines.<br>\n";
        echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
        echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
    }
    else {
        // Try to create a writable file, then open it again
        exec("touch sitemap.xml");
        exec("chmod 666 sitemap.xml");
        $sitemap = @fopen("sitemap.xml", "w+");
        if ($sitemap && is_writable("sitemap.xml")) {
            fwrite($sitemap, $xml);
            fclose($sitemap);
            exec("chmod 644 sitemap.xml");
            echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
            echo "Remove items you do not want included in the search engines.<br>\n";
            echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
            echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
        }
        else {
            echo "File is not writable.<br>\n";
        }
    }
}
?>
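Once the script has run, it is worth a quick sanity check on the output. A sketch using SimpleXML (bundled with PHP 5), run here against an inline sample rather than the real file:

```php
<?php
// Parse a small sitemap and pull out its <loc> values.
$sample = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>http://www.example.com/</loc><priority>1.0</priority></url>'
        . '<url><loc>http://www.example.com/index.html</loc></url>'
        . '</urlset>';

$sm = simplexml_load_string($sample);
// The urlset lives in the Sitemaps namespace, so register it for XPath
$sm->registerXPathNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$locs = $sm->xpath('//s:loc');

echo count($locs) . "\n";     // 2
echo (string)$locs[0] . "\n"; // http://www.example.com/
?>
```

To check the real file, swap simplexml_load_string() for simplexml_load_file('sitemap.xml').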
Site scrapers can really take advantage of it, too...
However, I've written a small piece of PHP that checks who is requesting your sitemap, so you can detect whether the requester is a known search engine.
<snip>
<?php
function botIsAllowed($ip) {
    // Get the reverse DNS of the IP
    $host = strtolower(gethostbyaddr($ip));
    $botDomains = array(
        '.inktomisearch.com',
        '.googlebot.com',
        '.ask.com',
    );
    // Check whether the reverse DNS matches the whitelist
    foreach ($botDomains as $bot) {
        if (strpos(strrev($host), strrev($bot)) === 0) {
            // Forward-confirm: the host name must resolve back to the same IP
            $qip = gethostbyname($host);
            return ($qip == $ip);
        }
    }
    return false;
}

if (!botIsAllowed($_SERVER['REMOTE_ADDR'])) {
    echo "Banned!";
    exit;
}
?>
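The strrev/strpos line above is a suffix match: it accepts the host only when its reverse-DNS name ends with one of the whitelisted domains, and the leading dot in each entry blocks lookalikes. Isolated from DNS, the test looks like this (host_ends_with is my name for it, not part of the script above):

```php
<?php
// True when $host ends with $suffix -- the same trick as
// strpos(strrev($host), strrev($bot)) === 0 in the script above.
function host_ends_with($host, $suffix) {
    return strpos(strrev(strtolower($host)), strrev(strtolower($suffix))) === 0;
}

var_dump(host_ends_with('crawl-66-249-66-1.googlebot.com', '.googlebot.com')); // bool(true)
var_dump(host_ends_with('evilgooglebot.com', '.googlebot.com'));               // bool(false)
?>
```

The forward lookup (gethostbyname) in the script is still essential: without it, anyone who controls their own reverse DNS could claim a googlebot.com host name.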
Am I correct for below robots.txt format for autodiscovery?
"User-agent: *
Disallow:
Sitemap: [mysite.com...]
Or do I have to add anything else there?
Thanks in advance.
Rahul D.
I would think you would put it like this:
"User-agent: *
Sitemap: [mysite.com...]
Disallow:
Having it below Disallow might be detrimental to your indexing...
The 'Sitemap' directive is independent of the 'user-agent' line, so it doesn't matter where you place it.
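Per the sitemaps.org protocol, the Sitemap line stands outside any User-agent group, so the following layout (with the line first, last, or anywhere in between) is read the same way; example.com is a placeholder:

```
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
```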
"User-agent: *
Disallow:
Sitemap: [mysite.com...]
Thanks in advance
Rahul D.
Thanks for your great script, StupidScript. I tried to modify it to index a PHP-based forum like this one, without success.
It generated OK, but it didn't pick up the threads.
Any ideas how to fix it? :)
Thanks, netchicken!
Are independent files stored somewhere for spiders to access? If so, run the script in that directory (modify $my_domain to match) then merge the generated file with the maps from other directories to create one master sitemap.
If there are no independent files for spiders to find, then the script will not be as useful to you.
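As an alternative to merging the per-directory files by hand, the sitemaps.org protocol also defines a sitemap index file that simply points at the individual sitemaps. A sketch (the file names are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-root.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-articles.xml</loc>
  </sitemap>
</sitemapindex>
```

The Sitemap line in robots.txt can then point at the index file instead of a single sitemap.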
NOTE: If the URLs you want to include in one of these sitemaps look like the ones on this forum (with query strings), BE SURE to escape the special characters in your URLs. For example, if this is the type of URL you want in your XML sitemap:
http://www.example.com/pages.cgi?page=1&section=14
then be sure to convert it to:
http://www.example.com/pages.cgi?page=1&amp;section=14
See these instructions [sitemaps.org] for all characters that need to be entity-escaped (and UTF-8 encoded) before including them in sitemap.xml.
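In PHP, htmlspecialchars() handles this escaping. A minimal sketch (sitemap_escape is my name, not part of the script above):

```php
<?php
// Entity-escape a URL for use inside <loc>...</loc>
function sitemap_escape($url) {
    return htmlspecialchars($url, ENT_QUOTES);
}

echo sitemap_escape('http://www.example.com/pages.cgi?page=1&section=14');
// http://www.example.com/pages.cgi?page=1&amp;section=14
?>
```

ENT_QUOTES also covers single and double quotes, which keeps the output safe for all five characters the protocol requires you to escape.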