
Forum Moderators: goodroi


Add Sitemap To robots.txt For Autodiscovery

Ask.com, Google, Microsoft Live Search and Yahoo!

     
8:59 am on Apr 12, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 2, 2005
posts:112
votes: 0


Since working with Google and Microsoft to support a single format for submission with Sitemaps, we have continued to discuss further enhancements to make it easy for webmasters to get their content to all search engines quickly.

All search crawlers recognize robots.txt, so it seemed like a good idea to use that mechanism to allow webmasters to share their Sitemaps. You agreed and encouraged us to allow robots.txt discovery of Sitemaps on our suggestion board. We took the idea to Google and Microsoft and are happy to announce today that you can now find your sitemaps in a uniform way across all participating engines. To do this, simply add the following line to your robots.txt file:

Sitemap: http://www.example.tld/sitemap.xml
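In context, a minimal robots.txt using the new line might look like this (the domain and disallowed path are just placeholders):

```
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.tld/sitemap.xml
```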


More [ysearchblog.com...]

[edited by: engine at 11:41 am (utc) on April 12, 2007]

11:19 am on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member sem4u is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 18, 2002
posts:3061
votes: 0


This is a good move forward by the big three search engines and for the sitemaps protocol.

I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.

11:40 am on Apr 12, 2007 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:22287
votes: 236


This is going to be so useful. A big thanks to the Search Engines for collaborating on this.

Credit to Ask.com, Google, Microsoft Live Search and Yahoo!

FYI [sitemaps.org...]

11:47 am on Apr 12, 2007 (gmt 0)

New User

5+ Year Member

joined:Apr 12, 2007
posts:1
votes: 0


Thanks for telling! It's simple to do.
12:24 pm on Apr 12, 2007 (gmt 0)

Full Member

5+ Year Member

joined:Dec 3, 2006
posts:257
votes: 0


I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.

[sitemaps.org...]
12:48 pm on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


Good move, makes it much easier to support one standard.
1:02 pm on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:July 26, 2006
posts:1619
votes: 0


That's great news, but it can also lead site scrapers to all of your content as well. That's my only fear.

Also .. Google Webmaster Tools gives an error on that syntax in the robots.txt file.

Parsing results

Sitemap: http://www.example.com/sitemap.xml

Syntax not understood

[edited by: Bewenched at 1:23 pm (utc) on April 12, 2007]

1:12 pm on Apr 12, 2007 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:22287
votes: 236


Site scrapers already ignore robots.txt, so no need to worry. Just focus on the search engines.
3:19 pm on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:July 26, 2006
posts:1619
votes: 0


Site scrapers already ignore robots.txt, so no need to worry. Just focus on the search engines

Yes, I know that they ignore robots.txt, but putting this in our robots.txt will give them a complete roadmap to follow.

3:25 pm on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1665
votes: 35


Bewenched, the sitemap.xml could be cloaked against IP ranges. A normal user shouldn't be making requests for, or looking for, your sitemap.xml file.
As long as SE's don't provide a CASHED Copy it's a cool feature.

Otherwise it is as you say: heaven for scrapers,

but then again: too many requests (2 is toooo many) with no JS and NO images: 1 warning, then block and wait till the next episode....

4:57 pm on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0



As long as SE's don't provide CASHED Copy it's a cool feature.

CASHed copy, I like it. I'd like a cash copy of my robots.txt please :)

But seriously, what's with xml all the time? Does anyone like xml?

[edited by: Clark at 4:58 pm (utc) on April 12, 2007]

5:17 pm on Apr 12, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1665
votes: 35


yeah, it shouldn't be CASHed or CACHed, right Clark? :)
7:45 pm on Apr 12, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 27, 2002
posts:114
votes: 0


What would be the advantage or disadvantage of using this .xml feed for Yahoo rather than submitting a txt map as discussed here:
[submit.search.yahoo.com...]

System

8:09 pm on Apr 12, 2007 (gmt 0)

The following message was cut out to new thread by goodroi. New thread at: robots_txt/3310606.htm [webmasterworld.com]
2:24 pm on April 13, 2007 (utc -5)
1:33 am on Apr 13, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 20, 2004
posts:1475
votes: 0


What would be the advantage or disadvantage of using this .xml feed

For one, Yahoo, Google, MSN, Ask and IBM are all supporting this same method. For another, it is included in the robots.txt file, which all of them hit first, and so makes it easier for them to find your sitemap.

Here's some PHP for generating a Sitemaps.org-compatible sitemap.xml file. Please post any modifications you might make to it:

<?php
/*########################################################
# Generates a sitemap per specifications found at:       #
# http://www.sitemaps.org/protocol.html                  #
# DOES NOT traverse directories                          #
# 20070712 James Butler james at musicforhumans dot com  #
# Based on opendir() code by mike at mihalism dot com    #
# http://us.php.net/manual/en/function.readdir.php#72793 #
# Free for all: http://www.gnu.org/licenses/lgpl.html    #
#                                                        #
# Usage:                                                 #
# 1) Save this as file name: sitemap_gen.php             #
# 2) Change variables noted below for your site          #
# 3) Place this file in your site's root directory       #
# 4) Run from http://www.yourdomain.com/sitemap_gen.php  #
#                                                        #
# <lastmod>   - OPTIONAL: YYYY-MM-DD                     #
# <changefreq>- OPTIONAL: always | hourly | daily |      #
#               weekly | monthly | yearly | never        #
# <priority>  - OPTIONAL: 0.0-1.0 [default 0.5]          #
#                                                        #
# Add completed sitemap file to robots.txt:              #
# Sitemap: http://www.yourdomain.com/sitemap.xml         #
########################################################*/

######## CHANGE THESE FOR YOUR SITE #########
# IMPORTANT: Trailing slashes are REQUIRED!
$my_domain = "http://www.yourdomain.com/";
$root_path_to_site = "/root/path/to/site/";
$file_types_to_include = array('html','htm');
############## END CHANGES ##################

$xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$xml .= " <url>\n";
$xml .= "  <loc>".$my_domain."</loc>\n";
$xml .= "  <priority>1.0</priority>\n";
$xml .= " </url>\n";

## START Modified mike at mihalism dot com Code ######
function file_type($file){
    $path_chunks = explode("/", $file);
    $thefile = $path_chunks[count($path_chunks) - 1];
    $dotpos = strrpos($thefile, ".");
    return strtolower(substr($thefile, $dotpos + 1));
}

$file_count = 0;
$files = array();
$path = opendir($root_path_to_site);
while (false !== ($filename = readdir($path))) {
    $files[] = $filename;
}
closedir($path);
sort($files);

foreach ($files as $file) {
    $extension = file_type($file);
    if ($file != '.' && $file != '..' && array_search($extension, $file_types_to_include) !== false) {
        $file_count++;
## END Modified mike at mihalism dot com Code ######
        $xml .= " <url>\n";
        $xml .= "  <loc>".$my_domain.$file."</loc>\n";
        // filemtime() needs the full path, not just the file name
        $xml .= "  <lastmod>".date("Y-m-d", filemtime($root_path_to_site.$file))."</lastmod>\n";
        $xml .= "  <changefreq>monthly</changefreq>\n";
        $xml .= "  <priority>0.5</priority>\n";
        $xml .= " </url>\n";
    }
}
$xml .= "</urlset>\n";

if ($file_count == 0) {
    echo "No files to add to the Sitemap\n";
}
else {
    $sitemap = fopen("sitemap.xml", "w+");
    if (is_writable("sitemap.xml")) {
        fwrite($sitemap, $xml);
        fclose($sitemap);
        echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
        echo "Remove items you do not want included in the search engines.<br>\n";
        echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
        echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
    }
    else {
        exec("touch sitemap.xml");
        exec("chmod 666 sitemap.xml");
        if (is_writable("sitemap.xml")) {
            // Re-open now that the file exists and is writable
            $sitemap = fopen("sitemap.xml", "w+");
            fwrite($sitemap, $xml);
            fclose($sitemap);
            exec("chmod 644 sitemap.xml");
            echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
            echo "Remove items you do not want included in the search engines.<br>\n";
            echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
            echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
        }
        else {
            echo "File is not writable.<br>\n";
        }
    }
}
?>
2:57 am on Apr 13, 2007 (gmt 0)

New User

5+ Year Member

joined:Jan 9, 2007
posts:1
votes: 0


Yesterday I blogged about this too.

Site scrapers can really take advantage of it too...
however,

I've written a small piece of PHP code that checks who is requesting your sitemap. You can detect whether the requester is a known search engine or not.

<snip>

<?php
function botIsAllowed($ip){
    // Get the reverse DNS of the IP.
    $host = strtolower(gethostbyaddr($ip));
    $botDomains = array('.inktomisearch.com',
                        '.googlebot.com',
                        '.ask.com',
                       );

    // Check whether the reverse DNS matches the whitelist.
    foreach ($botDomains as $bot) {
        if (strpos(strrev($host), strrev($bot)) === 0) {
            // Forward-confirm: the host must resolve back to the same IP.
            $qip = gethostbyname($host);
            return ($qip == $ip);
        }
    }
    return false;
}

if (!botIsAllowed($_SERVER['REMOTE_ADDR'])) {
    echo "Banned!";
    exit;
}
?>
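The suffix comparison in the snippet above (matching the reversed hostname against the reversed domain) can be illustrated in isolation — a sketch of the same idea with the DNS round-trips left out; the function name and sample hostnames are just for demonstration:

```php
<?php
// Returns true when $host ends with one of the trusted crawler domains.
// A reversed-prefix match is a suffix match on the original strings.
function hostMatchesBotDomain($host, $botDomains) {
    foreach ($botDomains as $bot) {
        if (strpos(strrev(strtolower($host)), strrev($bot)) === 0) {
            return true;
        }
    }
    return false;
}

$domains = array('.googlebot.com', '.ask.com');
var_dump(hostMatchesBotDomain('crawl-66-249-66-1.googlebot.com', $domains)); // bool(true)
var_dump(hostMatchesBotDomain('evil-googlebot.com.example.net', $domains));  // bool(false)
```

Note that a suffix check alone is not enough — a scraper can fake its reverse DNS — which is why the script above also resolves the hostname forward and compares it to the original IP.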

[edited by: engine at 7:52 am (utc) on April 13, 2007]
[edit reason] No urls, thanks. See TOS [webmasterworld.com] [/edit]

7:06 am on Apr 13, 2007 (gmt 0)

New User

10+ Year Member

joined:Oct 9, 2005
posts:32
votes: 0


What happens if you have multiple sitemaps?
7:19 am on Apr 13, 2007 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5805
votes: 64


Adding "Sitemap: http://www.example.tld/sitemap.xml" invalidates robots.txt in each of the 3 validators I tried, as well as Google Webmaster Tools' robots.txt analysis (as mentioned above by Bewenched.)

I think I'll wait until everyone catches up to the new standard.

3:28 pm on Apr 13, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 20, 2003
posts:1741
votes: 0


Does that mean that if you put that in your robots.txt, you don't have to manually submit your sitemaps to any SE again?

What are the most relevant advantages?

8:25 pm on Apr 13, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:June 13, 2004
posts:650
votes: 0


For those who want to use the feature at this stage, it seems like cloaking is a must here.
Provide the version with the sitemap to SEs that definitely support it and the old version to all the rest.
Well, unless the big G figures it out, gets confused and issues a penalty for doing this.
1:39 am on Apr 14, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 20, 2004
posts:1475
votes: 0


If robots.txt doesn't pass the validation test, will the SEs' bots just ignore the line, or stop spidering the site as usual?
5:15 am on Apr 14, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2005
posts:614
votes: 0


Thanks for your great script 'stupid script'.

I tried to mod it to index a php based forum like this without success.

I got it to generate OK but it didn't want to pick up the threads.

Any ideas how to fix it :)

11:19 am on Apr 14, 2007 (gmt 0)

New User

5+ Year Member

joined:May 13, 2006
posts:34
votes: 0


Hello friends,

Is the robots.txt format below correct for autodiscovery?

"User-agent: *
Disallow:

Sitemap: [mysite.com...]

Or do I have to add any other directives there?

Thanks in advance.
Rahul D.

8:20 pm on Apr 14, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2005
posts:614
votes: 0


I would think you would put it like this:

"User-agent: *
Sitemap: [mysite.com...]
Disallow:

Having it below Disallow might be detrimental to your indexing...

12:51 am on Apr 15, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 13, 2003
posts:125
votes: 0



I would think you would put it like this:
"User-agent: *
Sitemap: [mysite.com...]
Disallow:

Having it below disallow might be detrimental to your indexing...

The 'Sitemap' directive is independent of the 'user-agent' line, so it doesn't matter where you place it.
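That matches the protocol: Sitemap is a file-wide directive rather than part of a User-agent group, so (using a placeholder URL) a file like this is read the same way wherever the line sits:

```
User-agent: *
Disallow:

# The Sitemap line could equally appear above the group:
Sitemap: http://www.example.tld/sitemap.xml
```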

4:42 am on Apr 15, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:July 10, 2004
posts:70
votes: 0


Is it case sensitive? I.e., does it have to be Sitemap: or can it be sitemap: for us lazy webmasters?
2:32 pm on Apr 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Great idea but it needs improvement. Allow webmasters to place a number sign in front of the new sitemap directive so our robots.txt will validate:

# Sitemap: http://www.example.com/sitemap.xml

1:23 pm on Apr 16, 2007 (gmt 0)

New User

5+ Year Member

joined:May 13, 2006
posts:34
votes: 0


Do I have to put double quotes in between them?

"User-agent: *
Disallow:
Sitemap: [mysite.com...]

Thanks in advance
Rahul D.

9:17 pm on Apr 16, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 20, 2004
posts:1475
votes: 0


Thanks for your great script 'stupid script'.

I tried to mod it to index a php based forum like this without success.

I got it to generate OK but it didn't want to pick up the threads.

Any ideas how to fix it :)

Thanks, netchicken!

Are independent files stored somewhere for spiders to access? If so, run the script in that directory (modify $my_domain to match) then merge the generated file with the maps from other directories to create one master sitemap.

If there are no independent files for spiders to find, then the script will not be as useful to you.

NOTE: If the URLs that you want to include in one of these sitemaps look like the ones on this forum (with query strings) BE SURE to convert all of the entities in your URLs. I.e. if this is the type of URL you want in your XML sitemap:

http://www.example.com/pages.cgi?page=1&section=14

then be sure to convert it to:

http://www.example.com/pages.cgi?page=1&amp;section=14

See these instructions [sitemaps.org] for all entities that need to be converted to UTF-8 before including them in sitemap.xml.
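In PHP, that entity conversion can be handled with htmlspecialchars(), which covers the ampersand, angle brackets, and quote characters — a minimal sketch with an example URL:

```php
<?php
// Escape XML-special characters so the URL is safe inside a <loc> element.
$url = 'http://www.example.com/pages.cgi?page=1&section=14';
$escaped = htmlspecialchars($url, ENT_QUOTES);
echo $escaped . "\n"; // http://www.example.com/pages.cgi?page=1&amp;section=14
```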

[edited by: StupidScript at 9:18 pm (utc) on April 16, 2007]

9:17 pm on Apr 17, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member bwnbwn is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 25, 2005
posts:3492
votes: 3


I got an error in the Webmaster Tools area when Google downloaded my robots.txt with the sitemap added. Removed it till it is recognized.