Forum Moderators: goodroi
Since working with Google and Microsoft to support a single submission format with Sitemaps, we have continued to discuss further enhancements to make it easy for webmasters to get their content to all search engines quickly. All search crawlers recognize robots.txt, so it seemed like a good idea to use that mechanism to allow webmasters to share their Sitemaps. You agreed, and encouraged us on our suggestion board to allow robots.txt discovery of Sitemaps. We took the idea to Google and Microsoft, and we are happy to announce today that you can now publish your Sitemap locations in a uniform way across all participating engines. To do this, simply add the following line to your robots.txt file:
Sitemap: http://www.example.tld/sitemap.xml
Credit to Ask.com, Google, Microsoft Live Search and Yahoo!
FYI [sitemaps.org...]
I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.
Also, Google Webmaster Tools gives an error on that syntax in the robots.txt file:
Parsing results
Sitemap: http://www.example.com/sitemap.xml
Syntax not understood
Otherwise it is as you say: heaven for scrapers.
But then again: too many requests (two is too many) with no JS and no images gets one warning, then a block until the next episode...
What would be the advantages or disadvantages of using this .xml feed?
For one, Yahoo, Google, MSN, Ask and IBM are all supporting this same method. For another, it is included in the robots.txt file, which all of them hit first, and so makes it easier for them to find your sitemap.
Here's some PHP for generating a Sitemaps.org-compatible sitemap.xml file. Please post any modifications you might make to it:
<?php
/*########################################################
# Generates a sitemap per specifications found at:
# http://www.sitemaps.org/protocol.html
# DOES NOT traverse directories
# 20070712 James Butler james at musicforhumans dot com
# Based on opendir() code by mike at mihalism dot com
# http://us.php.net/manual/en/function.readdir.php#72793
# Free for all: http://www.gnu.org/licenses/lgpl.html
#
# Usage:
# 1) Save this as file name: sitemap_gen.php
# 2) Change variables noted below for your site
# 3) Place this file in your site's root directory
# 4) Run from http://www.yourdomain.com/sitemap_gen.php
#
# <lastmod>    - OPTIONAL
#   YYYY-MM-DD
# <changefreq> - OPTIONAL
#   always | hourly | daily | weekly | monthly |
#   yearly | never
# <priority>   - OPTIONAL
#   0.0-1.0 [default 0.5]
#
# Add completed sitemap file to robots.txt:
# Sitemap: http://www.yourdomain.com/sitemap.xml
########################################################*/

######## CHANGE THESE FOR YOUR SITE #########
# IMPORTANT: Trailing slashes are REQUIRED!
$my_domain = "http://www.yourdomain.com/";
$root_path_to_site = "/root/path/to/site/";
$file_types_to_include = array('html', 'htm');
############## END CHANGES ##################

$xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$xml .= "  <url>\n";
$xml .= "    <loc>".$my_domain."</loc>\n";
$xml .= "    <priority>1.0</priority>\n";
$xml .= "  </url>\n";

## START modified mike at mihalism dot com code ##
// Return the lowercased extension of a file name
function file_type($file) {
    $path_chunks = explode("/", $file);
    $thefile = $path_chunks[count($path_chunks) - 1];
    $dotpos = strrpos($thefile, ".");
    return strtolower(substr($thefile, $dotpos + 1));
}

$file_count = 0;
$files = array();
$path = opendir($root_path_to_site);
while (false !== ($filename = readdir($path))) {
    $files[] = $filename;
}
closedir($path);
sort($files);

foreach ($files as $file) {
    $extension = file_type($file);
    if ($file != '.' && $file != '..' && array_search($extension, $file_types_to_include) !== false) {
        $file_count++;
## END modified mike at mihalism dot com code ##
        $xml .= "  <url>\n";
        $xml .= "    <loc>".$my_domain.$file."</loc>\n";
        // Use the full path so filemtime() works from any working directory
        $xml .= "    <lastmod>".date("Y-m-d", filemtime($root_path_to_site.$file))."</lastmod>\n";
        $xml .= "    <changefreq>monthly</changefreq>\n";
        $xml .= "    <priority>0.5</priority>\n";
        $xml .= "  </url>\n";
    }
}
$xml .= "</urlset>\n";

if ($file_count == 0) {
    echo "No files to add to the Sitemap\n";
}
else {
    $sitemap = @fopen("sitemap.xml", "w+");
    if ($sitemap && is_writable("sitemap.xml")) {
        fwrite($sitemap, $xml);
        fclose($sitemap);
        echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
        echo "Remove items you do not want included in the search engines.<br>\n";
        echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
        echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
    }
    else {
        // Try to create a writable file, then open it again
        exec("touch sitemap.xml");
        exec("chmod 666 sitemap.xml");
        $sitemap = @fopen("sitemap.xml", "w+");
        if ($sitemap && is_writable("sitemap.xml")) {
            fwrite($sitemap, $xml);
            fclose($sitemap);
            exec("chmod 644 sitemap.xml");
            echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
            echo "Remove items you do not want included in the search engines.<br>\n";
            echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
            echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
        }
        else {
            echo "File is not writable.<br>\n";
        }
    }
}
?>
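Once the script has run, it is worth a quick sanity check on the output. A sketch using SimpleXML (bundled with PHP 5), run here against an inline sample rather than the real file:

```php
<?php
// Parse a small sitemap and pull out its <loc> values.
$sample = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>http://www.example.com/</loc><priority>1.0</priority></url>'
        . '<url><loc>http://www.example.com/index.html</loc></url>'
        . '</urlset>';

$sm = simplexml_load_string($sample);
// The urlset lives in the Sitemaps namespace, so register it for XPath
$sm->registerXPathNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$locs = $sm->xpath('//s:loc');

echo count($locs) . "\n";     // 2
echo (string)$locs[0] . "\n"; // http://www.example.com/
?>
```

To check the real file, swap simplexml_load_string() for simplexml_load_file('sitemap.xml').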
Site scrapers can really take advantage of it, too...
However, I've written a small piece of PHP that checks who is requesting your sitemap, so you can detect whether the requester is a known search engine.
<snip>
<?php
function botIsAllowed($ip) {
    // Get the reverse DNS of the IP
    $host = strtolower(gethostbyaddr($ip));
    $botDomains = array(
        '.inktomisearch.com',
        '.googlebot.com',
        '.ask.com',
    );
    // Check whether the reverse DNS matches the whitelist
    foreach ($botDomains as $bot) {
        if (strpos(strrev($host), strrev($bot)) === 0) {
            // Forward-confirm: the host name must resolve back to the same IP
            $qip = gethostbyname($host);
            return ($qip == $ip);
        }
    }
    return false;
}

if (!botIsAllowed($_SERVER['REMOTE_ADDR'])) {
    echo "Banned!";
    exit;
}
?>
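The strrev/strpos line above is a suffix match: it accepts the host only when its reverse-DNS name ends with one of the whitelisted domains, and the leading dot in each entry blocks lookalikes. Isolated from DNS, the test looks like this (host_ends_with is my name for it, not part of the script above):

```php
<?php
// True when $host ends with $suffix -- the same trick as
// strpos(strrev($host), strrev($bot)) === 0 in the script above.
function host_ends_with($host, $suffix) {
    return strpos(strrev(strtolower($host)), strrev(strtolower($suffix))) === 0;
}

var_dump(host_ends_with('crawl-66-249-66-1.googlebot.com', '.googlebot.com')); // bool(true)
var_dump(host_ends_with('evilgooglebot.com', '.googlebot.com'));               // bool(false)
?>
```

The forward lookup (gethostbyname) in the script is still essential: without it, anyone who controls their own reverse DNS could claim a googlebot.com host name.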
Am I correct for below robots.txt format for autodiscovery?
"User-agent: *
Disallow:
Sitemap: [mysite.com...]
Or do I have to add anything else there?
Thanks in advance.
Rahul D.
I would think you would put it like this:
"User-agent: *
Sitemap: [mysite.com...]
Disallow:
Having it below Disallow might be detrimental to your indexing...
The 'Sitemap' directive is independent of the 'user-agent' line, so it doesn't matter where you place it.
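Per the sitemaps.org protocol, the Sitemap line stands outside any User-agent group, so the following layout (with the line first, last, or anywhere in between) is read the same way; example.com is a placeholder:

```
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
```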
"User-agent: *
Disallow:
Sitemap: [mysite.com...]
Thanks in advance
Rahul D.
Thanks for your great script, StupidScript. I tried to modify it to index a PHP-based forum like this one, without success.
It generated OK, but it didn't pick up the threads.
Any ideas how to fix it? :)
Thanks, netchicken!
Are independent files stored somewhere for spiders to access? If so, run the script in that directory (modify $my_domain to match) then merge the generated file with the maps from other directories to create one master sitemap.
If there are no independent files for spiders to find, then the script will not be as useful to you.
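As an alternative to merging the per-directory files by hand, the sitemaps.org protocol also defines a sitemap index file that simply points at the individual sitemaps. A sketch (the file names are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-root.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-articles.xml</loc>
  </sitemap>
</sitemapindex>
```

The Sitemap line in robots.txt can then point at the index file instead of a single sitemap.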
NOTE: If the URLs you want to include in one of these sitemaps look like the ones on this forum (with query strings), BE SURE to escape the special characters in your URLs. For example, if this is the type of URL you want in your XML sitemap:
http://www.example.com/pages.cgi?page=1&section=14
then be sure to convert it to:
http://www.example.com/pages.cgi?page=1&amp;section=14
See these instructions [sitemaps.org] for all characters that need to be entity-escaped (and UTF-8 encoded) before including them in sitemap.xml.
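In PHP, htmlspecialchars() handles this escaping. A minimal sketch (sitemap_escape is my name, not part of the script above):

```php
<?php
// Entity-escape a URL for use inside <loc>...</loc>
function sitemap_escape($url) {
    return htmlspecialchars($url, ENT_QUOTES);
}

echo sitemap_escape('http://www.example.com/pages.cgi?page=1&section=14');
// http://www.example.com/pages.cgi?page=1&amp;section=14
?>
```

ENT_QUOTES also covers single and double quotes, which keeps the output safe for all five characters the protocol requires you to escape.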