sem4u

msg:3309261 | 11:19 am on Apr 12, 2007 (gmt 0) |
This is a good move forward by the big three search engines and for the sitemaps protocol. I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage.
|
engine

msg:3309290 | 11:40 am on Apr 12, 2007 (gmt 0) |
This is going to be so useful. A big thanks to the Search Engines for collaborating on this. Credit to Ask.com, Google, Microsoft Live Search and Yahoo! FYI [sitemaps.org...]
|
indianeyes

msg:3309296 | 11:47 am on Apr 12, 2007 (gmt 0) |
Thanks for sharing! It is simple to do.
|
Achernar

msg:3309323 | 12:24 pm on Apr 12, 2007 (gmt 0) |
| I was reading about it on a blog covering SES New York but I couldn't find any other mentions of it until I saw this thread on the homepage. |
| [sitemaps.org...]
|
BillyS

msg:3309345 | 12:48 pm on Apr 12, 2007 (gmt 0) |
Good move, makes it much easier to support one standard.
|
Bewenched

msg:3309353 | 1:02 pm on Apr 12, 2007 (gmt 0) |
That's great news, but it can also lead the site scrapers to all of your content as well. That's my only fear. Also, Google Webmaster Tools gives an error on that syntax for the robots.txt file:
Parsing results
Sitemap: http://www.example.com/sitemap.xml
Syntax not understood
[edited by: Bewenched at 1:23 pm (utc) on April 12, 2007]
|
engine

msg:3309368 | 1:12 pm on Apr 12, 2007 (gmt 0) |
Site scrapers already ignore robots.txt, so no need to worry. Just focus on the search engines.
|
Bewenched

msg:3309481 | 3:19 pm on Apr 12, 2007 (gmt 0) |
| Site scrapers already ignore robots.txt, so no need to worry. Just focus on the search engines |
| Yes I know that they ignore the sitemap, but putting this in our robots.txt will give them a complete roadmap to follow.
|
blend27

msg:3309482 | 3:25 pm on Apr 12, 2007 (gmt 0) |
Bewenched, the sitemap.xml could be cloaked against IP ranges. The normal user should not be requesting or looking for your sitemap.xml file. As long as SE's don't provide CASHED Copy it's a cool feature. Otherwise it is as you say: heaven for scrapers. But then again: too many requests (2 is toooo many) with no JS and no images: 1 warning, then block and wait till the next episode...
|
Clark

msg:3309551 | 4:57 pm on Apr 12, 2007 (gmt 0) |
| As long as SE's don't provide CASHED Copy it's a cool feature. |
| CASHed copy, I like it. I'd like a cash copy of my robots.txt please :) But seriously, what's with xml all the time? Does anyone like xml? [edited by: Clark at 4:58 pm (utc) on April 12, 2007]
|
blend27

msg:3309567 | 5:17 pm on Apr 12, 2007 (gmt 0) |
Yeah, it shouldn't be CASHed or CACHed, right Clark? :)
|
latimer

msg:3309700 | 7:45 pm on Apr 12, 2007 (gmt 0) |
What would be the advantage or disadvantage of using this .xml feed for Yahoo rather than submitting a txt map as discussed here: [submit.search.yahoo.com...]
|
System redhat

msg:3310608 | 8:09 pm on Apr 12, 2007 (gmt 0) |
The following message was cut out to new thread by goodroi. New thread at: robots_txt/3310606.htm [webmasterworld.com] 2:24 pm on April 13, 2007 (utc -5)
|
StupidScript

msg:3309920 | 1:33 am on Apr 13, 2007 (gmt 0) |
| What would be the advantage or disadvantage of using this .xml feed |
| For one, Yahoo, Google, MSN, Ask and IBM are all supporting this same method. For another, it is included in the robots.txt file, which all of them hit first, and so makes it easier for them to find your sitemap. Here's some PHP for generating a Sitemaps.org-compatible sitemap.xml file. Please post any modifications you might make to it:
<?php
/*########################################################
# Generates a sitemap per specifications found at:
#   http://www.sitemaps.org/protocol.html
# DOES NOT traverse directories
# 20070712 James Butler james at musicforhumans dot com
# Based on opendir() code by mike at mihalism dot com
#   http://us.php.net/manual/en/function.readdir.php#72793
# Free for all: http://www.gnu.org/licenses/lgpl.html
#
# Usage:
# 1) Save this as file name: sitemap_gen.php
# 2) Change variables noted below for your site
# 3) Place this file in your site's root directory
# 4) Run from http://www.yourdomain.com/sitemap_gen.php
#
# <lastmod>    - OPTIONAL: YYYY-MM-DD
# <changefreq> - OPTIONAL: always, hourly, daily, weekly,
#                monthly, yearly, never
# <priority>   - OPTIONAL: 0.0-1.0 [default 0.5]
#
# Add completed sitemap file to robots.txt:
#   Sitemap: http://www.yourdomain.com/sitemap.xml
########################################################*/

######## CHANGE THESE FOR YOUR SITE #########
# IMPORTANT: Trailing slashes are REQUIRED!
$my_domain = "http://www.yourdomain.com/";
$root_path_to_site = "/root/path/to/site/";
$file_types_to_include = array('html','htm');
############## END CHANGES ##################

$xml ="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
$xml.="<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
$xml.=" <url>\n";
$xml.="  <loc>".$my_domain."</loc>\n";
$xml.="  <priority>1.0</priority>\n";
$xml.=" </url>\n";

## START Modified mike at mihalism dot com Code ######
# Returns the lowercased extension, or "" when the name has none.
function file_type($file){
    return strtolower(pathinfo($file, PATHINFO_EXTENSION));
}

$file_count = 0;
$files = array();
$path = opendir($root_path_to_site);
while (false !== ($filename = readdir($path))) {
    $files[] = $filename;
}
closedir($path);
sort($files);

foreach ($files as $file) {
    $extension = file_type($file);
    if($file != '.' && $file != '..' && in_array($extension, $file_types_to_include)) {
        $file_count++;
### END Modified mike at mihalism dot com Code ######
        # URL-encode the file name, then entity-escape the whole URL for XML.
        $loc = htmlspecialchars($my_domain.rawurlencode($file));
        $xml.=" <url>\n";
        $xml.="  <loc>".$loc."</loc>\n";
        # Use the full path so filemtime() works from any working directory.
        $xml.="  <lastmod>".date("Y-m-d",filemtime($root_path_to_site.$file))."</lastmod>\n";
        $xml.="  <changefreq>monthly</changefreq>\n";
        $xml.="  <priority>0.5</priority>\n";
        $xml.=" </url>\n";
    }
}
$xml.="</urlset>\n";

if($file_count == 0){
    echo "No files to add to the Sitemap\n";
} else {
    # Try to open the file; if that fails, create it, loosen its permissions and retry.
    $loosened = false;
    $sitemap = @fopen("sitemap.xml","w");
    if (!$sitemap) {
        exec("touch sitemap.xml");
        exec("chmod 666 sitemap.xml");
        $sitemap = @fopen("sitemap.xml","w");
        $loosened = true;
    }
    if ($sitemap) {
        fwrite($sitemap,$xml);
        fclose($sitemap);
        # Restore tighter permissions if we had to loosen them.
        if ($loosened) { exec("chmod 644 sitemap.xml"); }
        echo "DONE! <a href='sitemap.xml'>View sitemap.xml</a><br>\n";
        echo "Remove items you do not want included in the search engines.<br>\n";
        echo "Modify < changefreq > and < priority > to taste.<br>\n";
        echo "Add 'Sitemap: ".$my_domain."sitemap.xml' to robots.txt.<br>\n";
    } else {
        echo "File is not writable.<br>\n";
    }
}
?>
|
argiope

msg:3309961 | 2:57 am on Apr 13, 2007 (gmt 0) |
I blogged about this yesterday too. Site scrapers can really take advantage of it... However, I've written a small piece of PHP code that lets you check who is requesting your sitemap. You can detect whether the requester is a known search engine or not. <snip>
<?php
function botIsAllowed($ip){
    // Get the reverse DNS name of the IP.
    $host = strtolower(gethostbyaddr($ip));
    $botDomains = array(
        '.inktomisearch.com',
        '.googlebot.com',
        '.ask.com',
    );
    // Check whether the reverse DNS name ends with a whitelisted domain.
    foreach($botDomains as $bot){
        if (strpos(strrev($host),strrev($bot))===0){
            // Forward-confirm: the host name must resolve back to the same IP.
            $qip = gethostbyname($host);
            return ($qip==$ip);
        }
    }
    return false;
}
if (!botIsAllowed($_SERVER['REMOTE_ADDR'])){
    echo "Banned!";
    exit;
}
?>
[edited by: engine at 7:52 am (utc) on April 13, 2007] [edit reason] No urls, thanks. See TOS [webmasterworld.com] [/edit]
|
Shetty

msg:3310047 | 7:06 am on Apr 13, 2007 (gmt 0) |
What happens if you have multiple sitemaps?
|
keyplyr

msg:3310055 | 7:19 am on Apr 13, 2007 (gmt 0) |
Adding "Sitemap: http://www.example.tld/sitemap.xml" invalidates robots.txt in each of the 3 validators I tried, as well as Google Webmaster Tools' robots.txt analysis (as mentioned above by Bewenched.) I think I'll wait until everyone catches up to the new standard.
|
silverbytes

msg:3310373 | 3:28 pm on Apr 13, 2007 (gmt 0) |
Does that mean that if you put that in your robots.txt you don't have to manually submit your sitemaps to any SE again? What are the most relevant advantages?
|
activeco

msg:3310665 | 8:25 pm on Apr 13, 2007 (gmt 0) |
For those who want to use the feature at this stage, it seems like cloaking is a must here. Provide the version with the sitemap to SEs that definitely support it and the old version to all the rest. Well, unless the big G figures it out, gets confused and issues a penalty for doing this.
|
StupidScript

msg:3310845 | 1:39 am on Apr 14, 2007 (gmt 0) |
If robots.txt doesn't pass the validation test, will the SEs bots ignore it, or not spider the site as usual?
|
netchicken1

msg:3310909 | 5:15 am on Apr 14, 2007 (gmt 0) |
Thanks for your great script 'stupid script'. I tried to mod it to index a php based forum like this without success. I got it to generate OK but it didn't want to pick up the threads. Any ideas how to fix it :)
|
rahul_seo

msg:3311039 | 11:19 am on Apr 14, 2007 (gmt 0) |
Hello friends. Is the robots.txt format below correct for autodiscovery?
"User-agent: *
Disallow:
Sitemap: [mysite.com...]
Or do I have to add any other code there? Thanks in advance. Rahul D.
|
netchicken1

msg:3311309 | 8:20 pm on Apr 14, 2007 (gmt 0) |
I would think you would put it like this:
"User-agent: *
Sitemap: [mysite.com...]
Disallow:
Having it below Disallow might be detrimental to your indexing...
|
seasalt

msg:3311428 | 12:51 am on Apr 15, 2007 (gmt 0) |
| I would think you would put it like this: "User-agent: * Sitemap: [mysite.com...] Disallow: Having it below Disallow might be detrimental to your indexing... |
| The 'Sitemap' directive is independent of the 'user-agent' line, so it doesn't matter where you place it.
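To illustrate the point about placement, here is a rough PHP sketch of how a lenient reader might pull Sitemap lines out of a robots.txt file without caring which User-agent group they sit next to. The function name extract_sitemap_urls() is made up for this example, and matching the field name case-insensitively is an assumption on my part, not something this thread or the search engines have confirmed. It uses type declarations, so it needs PHP 7 or later.

```php
<?php
// Hypothetical helper: collect every Sitemap URL from robots.txt text.
// The line's position relative to User-agent groups is ignored.
function extract_sitemap_urls(string $robotsTxt): array
{
    $urls = [];
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments (everything from '#'), then surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        // Assumption: the field name is matched case-insensitively.
        if (preg_match('/^sitemap\s*:\s*(\S+)/i', $line, $m)) {
            $urls[] = $m[1];
        }
    }
    return $urls;
}

$robots = "User-agent: *\nDisallow:\nSitemap: http://www.example.com/sitemap.xml\n";
print_r(extract_sitemap_urls($robots));
```

Because comments are stripped first, a commented-out line such as "# Sitemap: ..." is skipped by this sketch, and every uncommented Sitemap line is returned, wherever it appears in the file.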
|
beer234

msg:3311529 | 4:42 am on Apr 15, 2007 (gmt 0) |
Is it case sensitive? i.e., does it have to be Sitemap: or can it be sitemap: for us lazy webmasters?
|
Key_Master

msg:3311653 | 2:32 pm on Apr 15, 2007 (gmt 0) |
Great idea but it needs improvement. Allow webmasters to place a number sign in front of the new sitemap directive so our robots.txt will validate: # Sitemap: http://www.example.com/sitemap.xml
|
rahul_seo

msg:3312409 | 1:23 pm on Apr 16, 2007 (gmt 0) |
Do I have to put double quotes in between them?
"User-agent: *
Disallow:
Sitemap: [mysite.com...]
Thanks in advance. Rahul D.
|
StupidScript

msg:3312909 | 9:17 pm on Apr 16, 2007 (gmt 0) |
| Thanks for your great script 'stupid script'. I tried to mod it to index a php based forum like this without success. I got it to generate OK but it didn't want to pick up the threads. Any ideas how to fix it :) |
| Thanks, netchicken! Are independent files stored somewhere for spiders to access? If so, run the script in that directory (modify $my_domain to match), then merge the generated file with the maps from other directories to create one master sitemap. If there are no independent files for spiders to find, then the script will not be as useful to you. NOTE: If the URLs that you want to include in one of these sitemaps look like the ones on this forum (with query strings), BE SURE to entity-escape them. i.e. if this is the type of URL you want in your XML sitemap:
http://www.example.com/pages.cgi?page=1&section=14
then be sure to convert it to:
http://www.example.com/pages.cgi?page=1&amp;section=14
See these instructions [sitemaps.org] for all the characters that need to be entity-escaped before including them in sitemap.xml; the file itself must be UTF-8 encoded. [edited by: StupidScript at 9:18 pm (utc) on April 16, 2007]
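The entity conversion in the note above can be sketched in PHP roughly as follows. This is only a minimal example: sitemap_escape_url() is a name invented for illustration, and it assumes the URL has not been escaped already. It relies on the ENT_XML1 flag of htmlspecialchars(), which needs PHP 5.4 or later, and on type declarations, which need PHP 7.

```php
<?php
// Hypothetical helper: entity-escape a URL for use inside a <loc> element.
function sitemap_escape_url(string $url): string
{
    // ENT_QUOTES also escapes single and double quotes;
    // ENT_XML1 selects XML-style entities.
    return htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$url = 'http://www.example.com/pages.cgi?page=1&section=14';
echo '<loc>' . sitemap_escape_url($url) . "</loc>\n";
// prints: <loc>http://www.example.com/pages.cgi?page=1&amp;section=14</loc>
```

Note that htmlspecialchars() double-encodes by default, so passing in a URL that already contains &amp; would turn it into &amp;amp;. Escape each URL exactly once, at the point where it is written into the XML.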
|
bwnbwn

msg:3313935 | 9:17 pm on Apr 17, 2007 (gmt 0) |
I got an error in the webmasters area when Google downloaded my robots.txt with the sitemap added. Removed it till it is recognized.
|