Forum Moderators: Robert Charlton & goodroi
Unless the bots can search for *.xml, meaning they find every XML page, surely all you need to do is change the name of your sitemap?
BINGO!
The problem is most people use default file names.
Just like they install forum and blog software and don't change the comments page or obfuscate the HTML of the comment form, so the spammer software locates the page and starts spamming.
That's the real solution, NEVER get lazy and use defaults.
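The advice above boils down to never keeping the default sitemap.xml filename. A minimal sketch of one way to do that, assuming Python (the function name and format are my own, not anything from this thread):

```python
import secrets

def random_sitemap_name() -> str:
    """Generate a hard-to-guess sitemap filename instead of the default sitemap.xml."""
    token = secrets.token_hex(8)  # 16 hex characters of cryptographic randomness
    return f"sitemap-{token}.xml"

name = random_sitemap_name()
print(name)  # e.g. sitemap-9f2c1a0b7e4d6c3a.xml
```

You would upload the file under this name and submit that exact URL to Google; nothing on your own pages ever links to it.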
I could go on, but it's completely doable. You should've come to the PubCon session last year about stopping bad bots, we covered it in much detail.
but it's completely doable
No doubt of that, at least when it comes to simple, stupid, scrapers.
However scrape blocking methods are equally effective with or without a sitemap.xml file. A sitemap will not allow a scraper to grab more of your pages before being blocked than it otherwise would.
Bottom line: this thread is an urban legend in the making.
Or is it already too late? Oh well, it's a harmless enough myth :-)
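The point above is that the blocking itself, not the sitemap's name, is what stops a scraper. A common method is per-IP rate limiting; here is a hypothetical sketch (class name and thresholds are illustrative assumptions, not anything described in this thread):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Refuse an IP that requests more than `limit` pages within `window` seconds."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen outside the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: serve a 403 or a block page
        q.append(now)
        return True
```

A scraper hammering the site trips the limit after `limit` pages regardless of whether it discovered the URLs via a sitemap or by crawling links.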
Well, it all depends on what data you feed them as far as a title tag goes :). I've been experimenting with one of the scraper groups, just playing mind games, and it seems to have some positive effect for a short period of time. What I do see now is that they have adapted a bit and in most cases only grab pages that contain more than just one block of text.
But these issues could be resolved very simply by denying profit to the companies who sponsor or encourage this type of behavior. You know how they are.
As far as sitemaps and robots.txt go:
Both should be cloaked these days. A strong trap is a must. Datacenters should be blocked as well (close to 1,000 and counting). Anything outside the area where you do business, if you run e-commerce, should first be greeted with a welcome page / special-deals page where the contents are loaded all at once, wrapping the entire page in one DIV; this way the number of pages a scraper can grab is limited to two or three.
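The datacenter-blocking idea above can be sketched with a CIDR blocklist check. This is a minimal illustration; the ranges below are RFC 5737 documentation addresses standing in for real datacenter ranges, which you would have to collect yourself:

```python
import ipaddress

# Hypothetical blocklist: CIDR ranges for datacenters you never
# expect real shoppers to browse from.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # placeholder range (RFC 5737)
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder range (RFC 5737)
]

def is_datacenter(ip: str) -> bool:
    """Return True if the visitor's IP falls in any blocked datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)
```

A request that matches would get the block page (or the one-DIV welcome page) instead of the real content.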
We don't do it to discourage our visitors from browsing our sites and researching various topics; we do it because we are forced to.
And as far as 'c:\Document and Settings\Some Guy\mywebsite.htm' goes, you do know the IP it originated from, right? If so, your dynamically generated image should say: 'I know what you did last summer'
1. You make a sitemap, call it mysitemap1.xml
2. You submit it to google sitemaps
3. Google indexes it.
Now unless you link it from your site, there is no way for any program to find it. Even if a scraper did a search of *.xml, only Google has seen the sitemap.
It only has one link, and that's on Google Sitemaps.
If you don't even trust the software you use to create your sitemap not to pass the name on, then just change the name AFTER you have made the sitemap and BEFORE you submit it. Otherwise, like me, buy a decent program that does a good indexing job.
If you feel ultra-paranoid, you can submit it and then, once it's been read, change the name.
I have used sitemaps for over a year now, and googling for the names of my sitemaps has brought up nothing. They are not indexed in Google at all.
I see a mountain emerging from the molehill.
Too late. The online tool that made your sitemap, already sucked your data off your site while it was making the sitemap, and has already assembled the scraped site and published it before you had even uploaded your new sitemap file to your own webserver. Google is already on the copy site indexing it...
Sounds like that might be it. I've often wondered why these sites are offering "free" online sitemap generators.
< link removed >
Every result on this search query is a scraper site linking to my website optimisation page, and it is like that on all of my other pages as well. Could these scraper sites be hurting my rankings? I am not very good at web design or anything technical really, and I wouldn't know where to start with cloaking my sitemap or robots file. I do link to my XML sitemap and my RSS feed from my homepage; does this make it even worse? Any advice would be much appreciated.
<Sorry, no search results links. See Forum Charter [webmasterworld.com]>
[edited by: tedster at 7:15 am (utc) on May 8, 2007]