
Add Sitemap To robots.txt For Autodiscovery

Ask.com, Google, Microsoft Live Search and Yahoo!


OldWolf

8:59 am on Apr 12, 2007 (gmt 0)

5+ Year Member



Since working with Google and Microsoft to support a single format for submission with Sitemaps, we have continued to discuss further enhancements to make it easy for webmasters to get their content to all search engines quickly.

All search crawlers recognize robots.txt, so it seemed like a good idea to use that mechanism to allow webmasters to share their Sitemaps. You agreed and encouraged us to allow robots.txt discovery of Sitemaps on our suggestion board. We took the idea to Google and Microsoft and are happy to announce today that you can now find your sitemaps in a uniform way across all participating engines. To do this, simply add the following line to your robots.txt file:

Sitemap: http://www.example.tld/sitemap.xml
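For reference, a complete robots.txt using the new directive might look like this (example.tld stands in for a real domain, and the Disallow rule is only illustrative):

User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.tld/sitemap.xml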


More [ysearchblog.com...]

[edited by: engine at 11:41 am (utc) on April 12, 2007]

sem4u

11:19 am on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member sem4u is a WebmasterWorld Top Contributor of All Time 10+ Year Member



This is a good move forward by the big three search engines and for the sitemaps protocol.

I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.

engine

11:40 am on Apr 12, 2007 (gmt 0)

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month



This is going to be so useful. A big thanks to the Search Engines for collaborating on this.

Credit to Ask.com, Google, Microsoft Live Search and Yahoo!

FYI [sitemaps.org...]

indianeyes

11:47 am on Apr 12, 2007 (gmt 0)

5+ Year Member



Thanks for letting us know! It's simple to do.

Achernar

12:24 pm on Apr 12, 2007 (gmt 0)

5+ Year Member



I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.

[sitemaps.org...]

BillyS

12:48 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Good move, makes it much easier to support one standard.

Bewenched

1:02 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



That's great news, but it could also lead site scrapers to all of your content. That's my only fear.

Also, Google Webmaster Tools gives an error on that syntax in the robots.txt file.

Parsing results

Sitemap: http://www.example.com/sitemap.xml

Syntax not understood

[edited by: Bewenched at 1:23 pm (utc) on April 12, 2007]

engine

1:12 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month



Site scrapers already ignore robots.txt, so no need to worry. Just focus on the search engines.

Bewenched

3:19 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Site scrapers already ignore robots.txt, so no need to worry. Just focus on the search engines

Yes, I know they ignore robots.txt, but putting this in our robots.txt will give them a complete roadmap to follow.

blend27

3:25 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Bewenched, the sitemap.xml could be cloaked against IP ranges. A normal user should not be making requests to your sitemap.xml file, or even looking for it.
As long as SE's don't provide CASHED Copy it's a cool feature.

Otherwise it is as you say: heaven for scrapers.

But then again: too many requests (2 is toooo many) with no JS and no images gets 1 warning, then a block, and wait till the next episode....
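A minimal sketch of that cloaking idea, assuming requests for sitemap.xml are rewritten to a PHP script (the file names are placeholders, and the single-domain check below is only an example; widen the whitelist to taste):

<?php
// hand the real sitemap only to verified crawlers, 404 everyone else
$ip   = $_SERVER['REMOTE_ADDR'];
$host = strtolower(gethostbyaddr($ip));

// reverse DNS must end in a known crawler domain, and the forward
// lookup of that host must resolve back to the same IP
$isCrawler = (substr($host, -14) === '.googlebot.com')
          && (gethostbyname($host) === $ip);

if ($isCrawler) {
    header('Content-Type: text/xml');
    readfile('sitemap.xml');
} else {
    header('HTTP/1.0 404 Not Found');
}
?>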

Clark

4:57 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




As long as SE's don't provide CASHED Copy it's a cool feature.

CASHed copy, I like it. I'd like a cash copy of my robots.txt please :)

But seriously, what's with xml all the time? Does anyone like xml?

[edited by: Clark at 4:58 pm (utc) on April 12, 2007]

blend27

5:17 pm on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



yeah, it shouldn't be CASHed or CACHed, right Clark? :)

latimer

7:45 pm on Apr 12, 2007 (gmt 0)

10+ Year Member



What would be the advantage or disadvantage of using this .xml feed for Yahoo rather than submitting a txt map as discussed here:
[submit.search.yahoo.com...]

System

8:09 pm on Apr 12, 2007 (gmt 0)

redhat



The following message was cut out to new thread by goodroi. New thread at: robots_txt/3310606.htm [webmasterworld.com]
2:24 pm on April 13, 2007 (utc -5)

StupidScript

1:33 am on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What would be the advantage or disadvantage of using this .xml feed

For one, Yahoo, Google, MSN and Ask are all supporting this same method. For another, it is included in the robots.txt file, which all of them request first, making it easier for them to find your sitemap.

Here's some PHP for generating a Sitemaps.org-compatible sitemap.xml file. Please post any modifications you might make to it:

<?php
/*
 * Generates a sitemap per the specification found at:
 *   http://www.sitemaps.org/protocol.html
 * DOES NOT traverse directories.
 * 20070712 James Butler james at musicforhumans dot com
 * Based on opendir() code by mike at mihalism dot com:
 *   http://us.php.net/manual/en/function.readdir.php#72793
 * Free for all: http://www.gnu.org/licenses/lgpl.html
 *
 * Usage:
 *   1) Save this as file name: sitemap_gen.php
 *   2) Change the variables noted below for your site
 *   3) Place this file in your site's root directory
 *   4) Run from http://www.yourdomain.com/sitemap_gen.php
 *
 * Optional elements written for each URL:
 *   <lastmod>    YYYY-MM-DD
 *   <changefreq> always | hourly | daily | weekly | monthly | yearly | never
 *   <priority>   0.0-1.0 [default 0.5]
 *
 * Add the completed sitemap file to robots.txt:
 *   Sitemap: http://www.yourdomain.com/sitemap.xml
 */

######## CHANGE THESE FOR YOUR SITE #########
# IMPORTANT: Trailing slashes are REQUIRED!
$my_domain = "http://www.yourdomain.com/";
$root_path_to_site = "/root/path/to/site/";
$file_types_to_include = array('html', 'htm');
############## END CHANGES ##################

# The root URL always goes in first, at full priority.
$xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$xml .= " <url>\n";
$xml .= "  <loc>".$my_domain."</loc>\n";
$xml .= "  <priority>1.0</priority>\n";
$xml .= " </url>\n";

## START modified mike at mihalism dot com code ######
# Returns the lower-cased extension of a file name.
function file_type($file){
    $path_chunks = explode("/", $file);
    $thefile = $path_chunks[count($path_chunks) - 1];
    $dotpos = strrpos($thefile, ".");
    return strtolower(substr($thefile, $dotpos + 1));
}

$file_count = 0;
$files = array();
$path = opendir($root_path_to_site);
while (false !== ($filename = readdir($path))) {
    $files[] = $filename;
}
closedir($path);
sort($files);

foreach ($files as $file) {
    $extension = file_type($file);
    if ($file != '.' && $file != '..' && array_search($extension, $file_types_to_include) !== false) {
        $file_count++;
## END modified mike at mihalism dot com code ######
        $xml .= " <url>\n";
        $xml .= "  <loc>".$my_domain.$file."</loc>\n";
        # filemtime() needs the full path, not just the bare file name
        $xml .= "  <lastmod>".date("Y-m-d", filemtime($root_path_to_site.$file))."</lastmod>\n";
        $xml .= "  <changefreq>monthly</changefreq>\n";
        $xml .= "  <priority>0.5</priority>\n";
        $xml .= " </url>\n";
    }
}
$xml .= "</urlset>\n";

if ($file_count == 0) {
    echo "No files to add to the Sitemap\n";
}
else {
    $sitemap = @fopen("sitemap.xml", "w+");
    if ($sitemap !== false) {
        fwrite($sitemap, $xml);
        fclose($sitemap);
        echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
        echo "Remove items you do not want included in the search engines.<br>\n";
        echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
        echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
    }
    else {
        # The file could not be opened; try creating it and loosening
        # its permissions, then open it again before writing.
        exec("touch sitemap.xml");
        exec("chmod 666 sitemap.xml");
        $sitemap = @fopen("sitemap.xml", "w+");
        if ($sitemap !== false) {
            fwrite($sitemap, $xml);
            fclose($sitemap);
            exec("chmod 644 sitemap.xml");
            echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
            echo "Remove items you do not want included in the search engines.<br>\n";
            echo "Modify &lt;changefreq&gt; and &lt;priority&gt; to taste.<br>\n";
            echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
        }
        else {
            echo "File is not writable.<br>\n";
        }
    }
}
?>

argiope

2:57 am on Apr 13, 2007 (gmt 0)

5+ Year Member



I blogged about this yesterday, too.

Site scrapers can really take advantage of it too...
However,

I've written a small piece of PHP code with which you can check who is requesting your sitemap. You can detect whether the requester is a known search engine or not.

<snip>

<?php
function botIsAllowed($ip){
    // get the reverse DNS of the IP
    $host = strtolower(gethostbyaddr($ip));
    $botDomains = array(
        '.inktomisearch.com',
        '.googlebot.com',
        '.ask.com',
    );

    // check whether the reverse DNS ends with a whitelisted domain
    foreach ($botDomains as $bot){
        if (strpos(strrev($host), strrev($bot)) === 0){
            // confirm with a forward lookup, so a spoofed reverse DNS fails
            $qip = gethostbyname($host);
            return ($qip == $ip);
        }
    }
    return false;
}

if (!botIsAllowed($_SERVER['REMOTE_ADDR'])){
    echo "Banned!";
    exit;
}
?>

[edited by: engine at 7:52 am (utc) on April 13, 2007]
[edit reason] No urls, thanks. See TOS [webmasterworld.com] [/edit]

Shetty

7:06 am on Apr 13, 2007 (gmt 0)

5+ Year Member



What happens if you have multiple sitemaps?

keyplyr

7:19 am on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Adding "Sitemap: http://www.example.tld/sitemap.xml" invalidates robots.txt in each of the 3 validators I tried, as well as Google Webmaster Tools' robots.txt analysis (as mentioned above by Bewenched.)

I think I'll wait until everyone catches up to the new standard.

silverbytes

3:28 pm on Apr 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does that mean that if you put that in your robots.txt, you don't have to manually submit your sitemaps to any SE again?

What are the most relevant advantages?

activeco

8:25 pm on Apr 13, 2007 (gmt 0)

10+ Year Member



For those who want to use the feature at this stage, it seems like cloaking is a must here.
Serve the version with the Sitemap line to the SE's that definitely support it, and the old version to all the rest.
Well, unless the big G figures it out, gets confused, and issues a penalty for doing this.
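A rough sketch of that approach, assuming requests for robots.txt are rewritten to a PHP script (the file names here are placeholders, and botIsAllowed() stands in for a reverse-DNS check like the one posted above):

<?php
// serve robots.txt dynamically: everyone gets the normal rules,
// but only verified crawlers see the Sitemap line
require 'bot_check.php'; // hypothetical file containing botIsAllowed()

header('Content-Type: text/plain');
echo "User-agent: *\n";
echo "Disallow:\n";

// only advertise the sitemap to engines known to support the directive
if (botIsAllowed($_SERVER['REMOTE_ADDR'])) {
    echo "\nSitemap: http://www.example.com/sitemap.xml\n";
}
?>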

StupidScript

1:39 am on Apr 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If robots.txt doesn't pass the validation test, will the SEs' bots ignore it, or not spider the site as usual?

netchicken1

5:15 am on Apr 14, 2007 (gmt 0)

10+ Year Member



Thanks for your great script, 'stupid script'.

I tried to mod it to index a PHP-based forum like this one, without success.

I got it to generate OK, but it didn't want to pick up the threads.

Any ideas how to fix it? :)

rahul_seo

11:19 am on Apr 14, 2007 (gmt 0)

5+ Year Member



Hello friends...

Is the robots.txt format below correct for autodiscovery?

"User-agent: *
Disallow:

Sitemap: [mysite.com...]

Or do I have to add anything else there?

Thanks in advance.
Rahul D.

netchicken1

8:20 pm on Apr 14, 2007 (gmt 0)

10+ Year Member



I would think you would put it like this:

"User-agent: *
Sitemap: [mysite.com...]
Disallow:

Having it below Disallow might be detrimental to your indexing...

seasalt

12:51 am on Apr 15, 2007 (gmt 0)

10+ Year Member




I would think you would put it like this:
"User-agent: *
Sitemap: [mysite.com...]
Disallow:

Having it below Disallow might be detrimental to your indexing...

The 'Sitemap' directive is independent of the 'user-agent' line, so it doesn't matter where you place it.
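For example, this file (example.com is a placeholder) works the same whether the Sitemap line sits above, inside, or below the User-agent group:

User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml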

beer234

4:42 am on Apr 15, 2007 (gmt 0)

10+ Year Member



Is it case sensitive? I.e., does it have to be Sitemap: or can it be sitemap:, for us lazy webmasters?

Key_Master

2:32 pm on Apr 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great idea but it needs improvement. Allow webmasters to place a number sign in front of the new sitemap directive so our robots.txt will validate:

# Sitemap: http://www.example.com/sitemap.xml

rahul_seo

1:23 pm on Apr 16, 2007 (gmt 0)

5+ Year Member



Do I have to put double quotes around them?

"User-agent: *
Disallow:
Sitemap: [mysite.com...]

Thanks in advance
Rahul D.

StupidScript

9:17 pm on Apr 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for your great script, 'stupid script'.

I tried to mod it to index a PHP-based forum like this one, without success.

I got it to generate OK, but it didn't want to pick up the threads.

Any ideas how to fix it? :)

Thanks, netchicken!

Are independent files stored somewhere for spiders to access? If so, run the script in that directory (modify $my_domain to match), then merge the generated file with the maps from other directories to create one master sitemap.

If there are no independent files for spiders to find, then the script will not be as useful to you.
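As a side note, the Sitemaps protocol also defines a sitemap index file for exactly this "one master sitemap" case; the index (file names below are placeholders) is then the single file you reference from robots.txt:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
  <loc>http://www.example.com/sitemap-pages.xml</loc>
 </sitemap>
 <sitemap>
  <loc>http://www.example.com/sitemap-forum.xml</loc>
 </sitemap>
</sitemapindex>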

NOTE: If the URLs that you want to include in one of these sitemaps look like the ones on this forum (with query strings), BE SURE to convert all of the special characters in your URLs to entities. I.e., if this is the type of URL you want in your XML sitemap:

http://www.example.com/pages.cgi?page=1&section=14

then be sure to convert it to:

http://www.example.com/pages.cgi?page=1&amp;section=14

See these instructions [sitemaps.org] for all of the characters that need to be escaped before they are included in sitemap.xml.
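In PHP, htmlspecialchars() handles that conversion; a minimal sketch ($url here is just an example value):

<?php
// escape &, <, >, " and ' so the URL is safe inside a <loc> element
$url = "http://www.example.com/pages.cgi?page=1&section=14";
$loc = htmlspecialchars($url, ENT_QUOTES);
// prints: <loc>http://www.example.com/pages.cgi?page=1&amp;section=14</loc>
echo "<loc>".$loc."</loc>\n";
?>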

[edited by: StupidScript at 9:18 pm (utc) on April 16, 2007]

bwnbwn

9:17 pm on Apr 17, 2007 (gmt 0)

WebmasterWorld Senior Member bwnbwn is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I got an error in the Webmaster Tools area when Google downloaded my robots.txt with the Sitemap line added. I've removed it until it's recognized.