You know, that does make some sense, Keniki. After all, the sitemap.xml file hands over a list of URLs directly to any scraper that wants to make use of it. And excessively scraped sites can struggle in the SERPs.
Sounds like a very good reason for cloaking to me.
On a site with no sitemap, requests for sitemap.xml show up as file-not-found errors, and I notice quite a few.
Presumably they are scrapers on the prowl.
A downside of sitemaps, I think.
I wondered that myself some time ago... so I removed that from our site. Especially since I started getting Google Alerts that our content was appearing on a very obscure MFA website... I'm talking very deep pages of our site scraped and used.
Shame really that we can't give that information to the legit search engines.
I recommend checking your logs for attempts to access your sitemap file, to see who is trying to get at that information, and block them.
I am suspicious too and cloaking is one answer but it may not solve everything...
I am suspicious of sitemap generators. It would be quite possible to offer a free sitemap generator that pinged a scraper every time it was used.
I would really like to see a tool in Google Webmaster Tools that generated an .xml sitemap for you. Since these files are only for search engine use, I see no reason the file name couldn't be randomly generated, with the previous sitemap file deleted each time.
I think the idea of including a sitemap reference in robots.txt should be abandoned. Instead, all sitemaps should be submitted via ping to every search engine that uses them, with a randomly generated file name each time a sitemap is created. That, I think, would stop the scrapers.
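The random-name-plus-ping idea above can be sketched in a few lines. This is a minimal illustration, not a tested workflow: the ping endpoint shown is the one Google documented for sitemap submission at the time, and `example.com` is a placeholder for your own domain.

```python
import secrets
from urllib.parse import urlencode

def random_sitemap_name() -> str:
    # Unguessable filename, e.g. "sitemap-9f86d081a4.xml",
    # so scrapers can't simply request /sitemap.xml.
    return f"sitemap-{secrets.token_hex(5)}.xml"

def build_ping_url(engine_ping_base: str, sitemap_url: str) -> str:
    # urlencode escapes the sitemap URL so it survives as a query parameter.
    return f"{engine_ping_base}?{urlencode({'sitemap': sitemap_url})}"

name = random_sitemap_name()
ping = build_ping_url("http://www.google.com/ping",
                      f"http://example.com/{name}")
```

You would regenerate the name, delete the old file, and re-ping each time the sitemap is rebuilt, so no stable URL ever exists for scrapers to learn.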
I'd like to see the filename as a random name, but that means you would have to pre-register that name with all the search engines.
In the meantime I'd like to hear from anyone with multiple sites who shows the file to everyone on some sites and uses .htaccess to allow only known bots to access it on others.
Does that make any difference?
I would really prefer to find an answer without cloaking, as I feel it will restrict the emergence of genuine new search engines, and I do feel we all have a responsibility to keep the internet open.
However, that dream seems to be well and truly over and shot to pieces at present. I am working with sites scraped to death and have seen clear identity theft as well. So, taking a huge slice of humble pie... WebmasterWorld, where was that Perl script you use for cloaking robots.txt again, and can it be applied to sitemaps?
I do also want a safe sitemap generator, since a free sitemap generator could just as easily send info to scraper sites without your knowledge. I would trust one from Google.
|I would really prefer to find an answer without cloaking as I feel it will restrict emergence of genuine new search engines and I do feel we all have a responsibility to make the internet open. |
Sitemaps.xml is a serious scraping vulnerability, which is one reason I don't use it: the sitemap.xml file is a clear path to crawl without hitting any spider traps, so it should be cloaked, no doubt about it. Any time you give scrapers a clear path to avoid honey pots and spider traps, they'll use it. With that said, the scrapers can simply scrape a search engine first using "site:mydomain.com" to get the equivalent of a sitemap and avoid your spider traps anyway.
That's why even robots.txt should be cloaked, because you give the scrapers a list of user agents that you allow to crawl. Assuming you don't also restrict user agents by IP range or reverse DNS, the scrapers just adopt the allowed UAs and slide right through your .htaccess files or other user-agent-blocking firewalls.
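The reverse DNS check mentioned above is usually done as forward-confirmed reverse DNS: look up the PTR record for the IP, check the hostname against the engines' published domains, then resolve that hostname back and confirm it returns the same IP. A minimal sketch, assuming the trusted suffixes listed here (the Yahoo and MSN hostnames are illustrative, not verified against their current documentation):

```python
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com",
                    ".search.msn.com", ".crawl.yahoo.net")

def hostname_is_trusted(hostname: str) -> bool:
    # Suffix match with a leading dot, so "fakegooglebot.com" fails.
    return hostname.rstrip(".").lower().endswith(TRUSTED_SUFFIXES)

def verify_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR record must name a
    trusted host, and that host must resolve back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return False
    if not hostname_is_trusted(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

A scraper can fake the Googlebot user agent, but it cannot make your PTR lookup on its IP resolve to googlebot.com, which is why this check is stronger than UA filtering alone.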
However, cloaking sitemap.xml doesn't technically stop anyone else from crawling your site; it just means they have to crawl the old-fashioned way. Simply check your log files every now and then to see what requested sitemap.xml and was denied, and let anything new that looks worthy crawl your site using the sitemap on its next pass.
incrediBILL is like an awesome poster with massive knowledge on things like this so I think we should all take some time to digest what he has said...
I am happy I found this thread. I was just noticing that some random sites were showing up as inbound links to deep pages in Google WM tools. The link URLs use almost exactly the same title as my post, and when I visit the sites the URLs are different (redirect?).
Forgive me if I am uneducated on the matter, but what's going on? I have never dealt with this type of activity. Will this affect rankings? How can a person address these issues?
Why not just name your file something like mysitemap.xml and submit that URL in Google's Webmaster Tools? It doesn't HAVE to be named sitemap.xml, though doing so helps with the auto-discovery by engines without direct submit tools (a meta tag could fix that, but then scrapers can read that too.)
I guess if you are going to get scraped you are going to get scraped regardless, one way or another, so what's worse? Making it hard for the search engines AND the scrapers, or making it easy for both?
Hi Keniki, of course that the scenario you describe is totally possible.
I would use a random name, not the standard naming for the sitemap file or sitemap index, like others have already suggested. Also, perhaps use Google Alerts to monitor references to your site in general.
Also, hopefully you already check your statistics regularly, so now you will also need to check who is accessing the sitemap, and if there are signs someone is abusing it, just IP-ban them through .htaccess or httpd.conf.
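Checking the logs for sitemap fetches can be automated. A rough sketch against Apache combined-format logs: it collects the IPs of anything that requested the sitemap without a whitelisted crawler string, and prints `Deny from` lines you could review before pasting into .htaccess. The regex and the UA whitelist are assumptions to adapt to your own log format, and remember that UA strings can be faked, so combine this with a reverse DNS check.

```python
import re

LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "GET ([^ "]+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)
WHITELIST_UA = ("Googlebot", "Slurp", "msnbot")

def suspicious_sitemap_hits(log_lines):
    """Return IPs that fetched the sitemap without a whitelisted UA."""
    hits = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, path, ua = m.groups()
        if path.endswith("sitemap.xml") and not any(w in ua for w in WHITELIST_UA):
            hits.add(ip)
    return hits

# Usage: for ip in suspicious_sitemap_hits(open("access.log")):
#            print(f"Deny from {ip}")
```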
Cloaking should work too if done well, like the robots file is done here at WebmasterWorld for example, but I have a cloaking phobia even when it's legitimate, so I would rather just look for the bad guys and ban them.
|Why not just name your file something like mysitemap.xml |
If the script you're using to generate sitemaps is suspect and sending info to scrapers and hijackers, changing the name or cloaking won't help. We need a safe sitemap generator first.
Edit: sorry but I didn't read the posts above which mine just repeats
Unless the bots can search for *.xml, meaning they find every xml page, surely all you need to do is change the name of your sitemap?
Mine originally had a date in it to remind me, but I changed that to a simple name.
That name is submitted to google as the sitemap name, you can change it every time you resubmit your sitemap.
|Also hopefully you already check your statistics regularly, so now you will also need to check who is accessing the sitemap and if there are signs someone is abusing it just ip ban them through .htaccess or httpd.conf. |
No, for me I am sick and tired of the .htaccess game. I want to get to the root of the problem.
I know I'm dense, but I don't see the problem here. Your content should be available to scrapers even if it's not in a site map.
|That's why even robots.txt should be cloaked |
The only real way to prohibit access to pages is to password-protect them.
|And excessively scraped sites can struggle in the SERPs. |
What does that mean?
Don't forget that robots.txt still overrules sitemap.xml. It is still possible to use traps such as blocking a page or directory using robots.txt but listing it in sitemap.xml.
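The trap described above works because a well-behaved crawler honors robots.txt, so anything that actually requests a Disallowed URL it found in the sitemap has outed itself. A small sketch of the rule check using Python's standard `urllib.robotparser`; the `/trap/` path and `example.com` are hypothetical:

```python
from urllib import robotparser

# Hypothetical robots.txt: /trap/ is disallowed for everyone...
robots_txt = """User-agent: *
Disallow: /trap/
"""
# ...but the trap URL is still listed in sitemap.xml, so only a
# crawler that ignores robots.txt will ever request it.
rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

assert not rp.can_fetch("AnyBot", "http://example.com/trap/page.html")
assert rp.can_fetch("AnyBot", "http://example.com/real-page.html")
```

On the server side you would then treat any hit on `/trap/` as grounds for an instant ban, since no compliant bot and no human following links should ever reach it.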
The other very important point is that only genuine SEs should be reading sitemap.xml - human visitors should never pull that URL.
As traps go, cross-referencing all requests for that file against approved search engines and then blocking all who aren't on the list from the entire site is fairly foolproof.
A better trap might even be to serve those who are unauthorised to read the sitemap.xml file a whole different site of URLs, i.e. changing .html to .htm and using .htaccess to rewrite all those incorrect URLs to a script which feeds them random rubbish interspersed with copyright abuse messages.
|... I don't see the problem here. Your content should be available to scrapers even if it's not in a site map. |
You're right, even the simplest of scrapers can follow every available link and get your whole site regardless of whether or not there's a sitemap.
I think what people are referring to here are scrapers that can be identified because of a common user agent or crawl method.
Although I'm not sure having or not having a site map makes a difference even there because if you're checking using .htaccess you're checking every request. They will be just as denied/allowed/redirected visiting somedeeppage.html as they would be visiting index.html.
And then of course there are scrapers which look exactly like a legitimate browser and are therefore effectively invisible to most sites.
|And excessively scraped sites can struggle in the SERPs. |
What does that mean?
When someone mirrors your content it's possible for your page/site to get hit with a duplicate content penalty.
Perhaps the danger with sitemaps.xml is not that the bots would use it to scrape existing pages, but instead that they would use it to quickly locate new content.
The robots could download your sitemap.xml today, adding it to their DB without following any of the links, return the following day, locate any new pages by comparing it to the old version, and then quickly scrape the new pages before the SEs even visit. So your content is first found by the SEs on the scraper's site.
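To see how little effort that diff takes a scraper, here is the whole attack in a few lines of Python: parse yesterday's and today's sitemap XML and take the set difference of the `<loc>` entries. The sitemap structure follows the sitemaps.org schema; the URLs are placeholders.

```python
import xml.etree.ElementTree as ET

def sitemap_urls(xml_text: str) -> set:
    # Tags carry the sitemaps.org namespace, so match on the local name.
    root = ET.fromstring(xml_text)
    return {el.text.strip() for el in root.iter() if el.tag.endswith("loc")}

def new_urls(old_xml: str, new_xml: str) -> set:
    # Everything in today's sitemap that wasn't in yesterday's copy.
    return sitemap_urls(new_xml) - sitemap_urls(old_xml)
```

That a fresh-content feed can be extracted this cheaply is exactly why the file should only ever be served to verified crawlers.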
So perhaps there is something to the assumption by many people on these forums who have said (to paraphrase), "After I submitted a sitemap to Google, I disappeared and my rankings went in the toilet." I made this assumption about a year ago and many thought I was nuts.
I've changed the name of my sitemap files and put them in another directory. Perhaps that will slow down the scrapers a bit.
Another possibility: many people name their regular sitemap file (not the xml file) something like "sitemap.html" ..... Perhaps that is getting scraped too? Might want to change that name also.
So what do you think about the new option to put the sitemap URL in robots.txt?
If I name my sitemap aa.xml but put the URL in robots.txt, scrapers can still easily GET it.
I guess the question is whether the advantages of an XML sitemap for Google, Yahoo, and MSN outweigh the disadvantage of helping scrapers. I'm not so sure now...
This thread must be a joke. Nothing can stop your content from being scraped while it is public.
The fact is your sitemap.xml might make it a bit easier for them, if you have one, but if you remove it, they will scrape you the old way!
I was skeptical of the XML feeds as well but as a test, I just happened to be launching a new market area with a few new subdomains. I used sitemaps on some and not on others and included in the robots.txt per the new protocol on some and submitted to Google Webmaster Tools on others. The sites that I included in the robots.txt and also submitted to Google's Webmaster Tools have double the number of pages included in Google's index so far.
I wouldn't advocate for cloaking a sitemap. Actually, I think preventing scrapers altogether will lead to "unnatural linking" patterns and can even get you penalized by G.
I took SEOmoz's advice and created a sitemap, let it get discovered once by Google, and then removed it from my web server.
>> Unless the bots can search for *.xml meaning they find every xml page, surely all you need to do is change the name of your sitemap? <<
If the online tool that you use to generate the sitemap is keeping a copy of your sitemap and using it for "other purposes", it is already too late.
I am puzzled by all the references to scrap and scrapping in this thread.
The correct words are scrape and scraping.
|This thread must be a joke. Nothing can stop your content from being scraped while it is public. |
Never say NOTHING can stop scraping just because nothing you use will stop it.
You can stop most, if not all, scraping if you have the proper tools running on your server. Sure, scrapers can still snag 1 or 2 pages from my site, as it takes at least 2 pages for my automated tools to detect non-human behavior characteristics. However, after 2-3 pages the door is slammed in their face for most garden variety scrapers, spammers looking for form pages, or email harvesters.
For instance, if something other than Google, Yahoo or another whitelisted 'bot actually requests files like robots.txt or sitemap.xml, they are instantly blocked from crawling any additional pages, as no human (except nosy ones) ever requests such files.
If they are downloading .HTML pages and aren't loading other required page components, such as images, CSS, .js or anything else that a browser requires to display my web pages, they are also instantly blocked. Of course there are a few rare bots that do all this, and they tend to do other stupid things themselves which takes a couple of additional page loads to detect and stop.
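The missing-page-components heuristic above can be prototyped as a simple per-IP counter: clients that rack up several page loads without ever fetching a stylesheet, script, or image get flagged. This is a sketch of the general idea only, not incrediBILL's actual tooling; the threshold of 3 and the extension lists are assumptions echoing the "2-3 pages" figure in the post.

```python
from collections import defaultdict

PAGE_EXT = (".html", ".htm", "/")
ASSET_EXT = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

class BotDetector:
    """Flag clients that request pages without the components a
    real browser would also load."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.pages = defaultdict(int)   # HTML requests per IP
        self.assets = defaultdict(int)  # CSS/JS/image requests per IP

    def request(self, ip: str, path: str) -> bool:
        """Record one hit; return True if the client should be blocked."""
        if path.endswith(ASSET_EXT):
            self.assets[ip] += 1
        elif path.endswith(PAGE_EXT):
            self.pages[ip] += 1
        return self.pages[ip] >= self.threshold and self.assets[ip] == 0
```

A real deployment would also expire counters over time and whitelist verified search engine crawlers, which legitimately skip CSS and images.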
I could go on, but it's completely doable. You should've come to the PubCon session last year about stopping bad bots, we covered it in much detail.
|I wouldn't advocate for cloaking a sitemap. Actually, I think preventing scrapers altogether will lead to "unnatural linking" patterns and can even get you penalized by G. |
OK, that's just the silliest defense of scraping that I've ever heard.
You assume that scrapers actually give you links, most do not. They steal your data, blend it into pages designed to steal your long-tails in G, and compete with you for your own traffic.
However, there is the rare scraper that actually provides links but even those are often discounted as spam sites these days, so you get no link value there either.
I think you meant:
"Never say NOTHING can stop AUTOMATED scraping..."
Many, many scrapers are still doing copy-paste!
I see it in my log with reference from:
"c:\Document and Settings\Some Guy\mywebsite.htm"
So, how do you stop copy-paste? I'm intrigued.