
Are Scrapers Exploiting Your sitemap.xml File?

     
4:52 am on May 6, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


Many people seem to be posting that after adding sitemaps they started suffering content problems. Could sitemap.xml be being abused? Are new content titles and meta tags being scraped before the sitemap is even submitted to Google by sitemap generators?
6:29 am on May 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


You know, that does make some sense, Keniki. After all, the sitemap.xml file hands over a list of URLs directly to any scraper that wants to make use of it. And excessively scraped sites can struggle in the SERPs.

Sounds like a very good reason for cloaking to me.
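A rough sketch of what that could look like in .htaccess (assuming Apache with mod_rewrite; the user-agent whitelist is only illustrative, and since UAs can be spoofed you'd still want to verify the real bots by reverse DNS):

    # Only let whitelisted crawlers fetch the sitemap; everyone else gets a 403
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
    RewriteRule ^sitemap\.xml$ - [F,L]

Anything not on the whitelist gets a Forbidden response instead of your full URL list.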

7:21 am on May 6, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 23, 2003
posts: 128
votes: 0


On a site with no sitemap, requests for sitemap.xml show up in the logs as file-not-found errors, and I notice quite a few.
I presume they are scrappers on the prowl.
A downside of sitemaps, I think.
4:11 pm on May 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:July 26, 2006
posts:1619
votes: 0


I wondered about that myself some time ago... so I removed it from our site. Especially since I started getting Google Alerts that our content was appearing on a very obscure MFA website. I'm talking very deep pages of our site scraped and used.

It's a shame, really, that we can't give that information to the legit search engines.

I recommend checking your logs for attempts to access your sitemap file, to see who is after that information, and blocking them.

10:22 pm on May 6, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


I am suspicious too and cloaking is one answer but it may not solve everything...

I am suspicious of sitemap generators. It would be quite possible to offer a free sitemap generator that pinged a scraper every time it was used.

I would really like to see a tool in Google Webmaster Tools that let you generate an XML sitemap. Since these files are only for search engine use, I see no reason the filename could not be randomly generated, and the tool could also delete the previous sitemap file.

I think the idea of including a sitemap reference in robots.txt should be abandoned; instead, all sitemaps should be submitted via ping to the search engines that use them, with a randomly generated filename each time a sitemap is created. That, I think, would stop the scrapers.

10:26 pm on May 6, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I'd like to see the filename as a random name but that means you would have to pre-register that name with all the search engines.


In the meantime, I'd like to hear from anyone with multiple sites who shows the file to everyone on some sites and uses .htaccess to allow only known bots to access it on others.

Does that make any difference?

11:10 pm on May 6, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


I would really prefer to find an answer without cloaking, as I feel it would restrict the emergence of genuine new search engines, and I do feel we all have a responsibility to keep the internet open.

However, that dream seems to be well and truly over and shot to pieces at present. I am working with sites scraped to death and have seen clear identity theft as well. So, taking a huge slice of humble pie... WebmasterWorld, where was that Perl script you use for cloaking robots.txt again, and can it be applied to sitemaps?

I also want a safe sitemap generator, since a free sitemap generator could just as easily send info to scraper sites without your knowledge. I would trust one from Google.

12:48 am on May 7, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


I would really prefer to find an answer without cloaking, as I feel it would restrict the emergence of genuine new search engines, and I do feel we all have a responsibility to keep the internet open.

Sitemap.xml is a serious scraping vulnerability, which is one reason I don't use it: the sitemap.xml file is a clear path to crawl a site without hitting any spider traps, so it should be cloaked, no doubt about it. Any time you give scrapers a clear path to avoid honey pots and spider traps, they'll use it. With that said, scrapers can simply scrape a search engine first using "site:mydomain.com" to get the equivalent of a sitemap and avoid your spider traps anyway.

That's why even robots.txt should be cloaked, because you give the scrapers a list of user agents that you allow to crawl. Assuming you don't also restrict user agents by IP range or reverse DNS, the scrapers just adopt the allowed UAs and slide right through your .htaccess files or other user-agent-blocking firewalls.
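As a sketch of that kind of cloaking (not necessarily how anyone here actually does it, and robots-public.txt is a made-up filename), you can rewrite the request to a stripped-down file for anything that isn't a whitelisted crawler:

    # Serve a minimal robots file to non-whitelisted clients
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
    RewriteRule ^robots\.txt$ /robots-public.txt [L]

where robots-public.txt reveals nothing about which user agents you actually allow.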

However, cloaking sitemap.xml doesn't technically stop anyone else from crawling your site; it just means they have to crawl the old-fashioned way. Simply check your log files every now and then to see what requested sitemap.xml and was denied, and let anything new that looks worthy crawl your site on the next pass using the sitemaps.

1:13 am on May 7, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


incrediBILL is an awesome poster with massive knowledge of things like this, so I think we should all take some time to digest what he has said...
1:56 am on May 7, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2005
posts:63
votes: 0


I am happy I found this thread. I was just noticing that some random sites were showing up as inbound links to deep pages in Google Webmaster Tools. The linking URLs use almost exactly the same title as my post, and when I visit the sites the URLs are different (a redirect?).

Forgive me if I am uneducated on the matter, but what's going on? I have never dealt with this type of activity. Will this affect rankings? How can a person address these issues?

1:59 am on May 7, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 24, 2003
posts:111
votes: 0


Why not just name your file something like mysitemap.xml and submit that URL in Google's Webmaster Tools? It doesn't HAVE to be named sitemap.xml, though doing so helps with the auto-discovery by engines without direct submit tools (a meta tag could fix that, but then scrapers can read that too.)

I guess if you are going to get scraped you are going to get scraped regardless, one way or another, so what's worse? Making it hard for the search engines AND the scrapers, or making it easy for both?

2:14 am on May 7, 2007 (gmt 0)

Full Member

10+ Year Member

joined:Mar 23, 2001
posts:244
votes: 1


Hi Keniki, of course the scenario you describe is totally possible.

I would use a random name, not the standard naming for the sitemap file or sitemap index, like others have already suggested. Also, perhaps use Google Alerts to monitor references to your site in general.
Hopefully you already check your statistics regularly; now you will also need to check who is accessing the sitemap, and if there are signs someone is abusing it, just IP-ban them through .htaccess or httpd.conf.
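For example (the address is just a placeholder from the documentation range), banning an offending IP in .htaccess on Apache 2.x is as simple as:

    # Block a scraper IP identified from the access logs
    Order allow,deny
    Allow from all
    Deny from 192.0.2.15

The same Deny lines work inside a <Directory> block in httpd.conf if you prefer to keep them out of .htaccess.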

Cloaking should work too if done well, like the robots.txt file here at WebmasterWorld for example, but I have a cloaking phobia even when it's legitimate, so I would rather just look for the bad guys and ban them.

2:17 am on May 7, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


Why not just name your file something like mysitemap.xml

If the script you're using to generate sitemaps is suspect and sending info to scrapers and hijackers, changing the name or cloaking won't help. We need a safe sitemap generator first.

2:40 am on May 7, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2005
posts:614
votes: 0


Edit: sorry, but I didn't read the posts above, which mine just repeats.

Unless the bots can search for *.xml, meaning they find every XML page, surely all you need to do is change the name of your sitemap?

Mine originally had a date in it to remind me, but I changed that to a simple name.

That name is submitted to Google as the sitemap name; you can change it every time you resubmit your sitemap.

2:45 am on May 7, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


Hopefully you already check your statistics regularly; now you will also need to check who is accessing the sitemap, and if there are signs someone is abusing it, just IP-ban them through .htaccess or httpd.conf.

Not for me; I am sick and tired of the .htaccess game. I want to get to the root of the problem.

3:03 am on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 31, 2003
posts:1316
votes: 0


I know I'm dense, but I don't see the problem here. Your content is still available to scrapers even if it's not in a sitemap.

That's why even robots.txt should be cloaked

The only real way to prohibit access to pages is to password-protect them.

And excessively scraped sites can struggle in the SERPs.

What does that mean?
4:03 am on May 7, 2007 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 1, 2003
posts:4847
votes: 0


Don't forget that robots.txt still overrules sitemap.xml. It is still possible to use traps such as blocking a page or directory using robots.txt but listing it in sitemap.xml.

The other very important point is that only genuine SEs should be reading sitemap.xml - human visitors should never pull that URL.

As traps go, cross-referencing all requests for that file against approved search engines and then blocking all who aren't on the list from the entire site is fairly foolproof.
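A minimal sketch of that cross-reference trap (the /trap/ path and example.com are made up for illustration):

    # robots.txt -- legitimate crawlers are told to stay out of the trap
    User-agent: *
    Disallow: /trap/

    <!-- sitemap.xml excerpt -- the same URL is listed, so only sitemap readers
         that ignore robots.txt will ever fetch it -->
    <url>
      <loc>http://www.example.com/trap/bait.html</loc>
    </url>

Anything that requests /trap/bait.html has read your sitemap and ignored robots.txt, which makes it a safe candidate for a site-wide ban.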

A better trap might even be to serve those who are unauthorised to read the sitemap.xml file a whole different set of URLs, e.g. changing .html to .htm and using .htaccess to rewrite all those incorrect URLs to a script which feeds them random rubbish interspersed with copyright-abuse messages.

5:16 am on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 4, 2001
posts: 1262
votes: 12


... I don't see the problem here. Your content is still available to scrapers even if it's not in a sitemap.

You're right, even the simplest of scrapers can follow every available link and get your whole site regardless of whether or not there's a sitemap.

I think what people are referring to here are scrapers that can be identified because of a common user agent or crawl method.

Although I'm not sure having or not having a sitemap makes a difference even there, because if you're checking using .htaccess you're checking every request. They will be just as denied/allowed/redirected visiting somedeeppage.html as they would be visiting index.html.
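For example, a user-agent rule like this (the agent strings are just placeholders for whatever shows up in your own logs) is applied to every single request, deep page and homepage alike:

    # Tag known scraper user agents and refuse them everywhere
    SetEnvIfNoCase User-Agent "EmailSiphon|WebCopier|SomeScraperBot" bad_bot
    Order allow,deny
    Allow from all
    Deny from env=bad_bot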

And then of course there are scrapers which look exactly like a legitimate browser and are therefore effectively invisible to most sites.

And excessively scraped sites can struggle in the SERPs.

What does that mean?

When someone mirrors your content it's possible for your page/site to get hit with a duplicate content penalty.

7:34 am on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 7, 2003
posts:788
votes: 0


Perhaps the danger with sitemap.xml is not that the bots would use it to scrape existing pages, but that they would use it to quickly locate new content.
The robots could download your sitemap.xml today, add it to their DB without following any of the links, return the following day, locate any new pages by comparing it to the old version, and then quickly scrape the new pages before the SEs even visit. So your content is first found by the SEs on the scrapper's site.
2:01 pm on May 7, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 22, 2005
posts:104
votes: 0


So perhaps there is something to the assumption by many people on these forums who have said (to paraphrase), "After I submitted a sitemap to Google I disappeared and my rankings went in the toilet." I made this assumption about a year ago, and many thought I was nuts.

I've changed the name of my sitemap files and put them in another directory. Perhaps that will slow down the scrapers a bit.

Another possibility: many people name their regular sitemap file (not the XML file) something like "sitemap.html"... Perhaps that is getting scrapped too? You might want to change that name also.

2:38 pm on May 7, 2007 (gmt 0)

Junior Member

5+ Year Member

joined:Nov 4, 2006
posts:128
votes: 0


So what do you think about the new option to put the sitemap URL in robots.txt?

If I name my sitemap aa.xml but put its URL in robots.txt (see the excerpt below), scrapers can easily GET
1) robots.txt
2) aa.xml
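The autodiscovery line that gives the game away is just a single line in robots.txt (example.com as a placeholder):

    Sitemap: http://www.example.com/aa.xml

so a randomly named file buys you nothing once it is referenced there.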

I used to guess that the advantages of an XML sitemap for Google, Yahoo, and MSN outweighed the scraper disadvantages. I'm not so sure now...

Regards.

3:44 pm on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 31, 2005
posts: 1651
votes: 0


This thread must be a joke. Nothing can stop your content from being scraped while it is public.
Your sitemap.xml might make it a bit easier for them, if you have one, but if you remove it, they will scrap you the old way!
3:53 pm on May 7, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:June 24, 2005
posts:59
votes: 0


I was skeptical of the XML feeds as well, but as a test, I just happened to be launching a new market area with a few new subdomains. I used sitemaps on some and not on others, and referenced them in robots.txt per the new protocol on some and submitted to Google Webmaster Tools on others. The sites where I referenced the sitemap in robots.txt and also submitted it to Google's Webmaster Tools have double the number of pages included in Google's index so far.
4:13 pm on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 12, 2004
posts:1355
votes: 0


I wouldn't advocate for cloaking a sitemap. Actually, I think preventing scrapers altogether will lead to "unnatural linking" patterns and can even get you penalized by G.
4:21 pm on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2003
posts:1184
votes: 0


I took SEOmoz's advice and created a sitemap, let it get discovered once by Google, and then removed it from my web server.
4:31 pm on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


>> Unless the bots can search for *.xml, meaning they find every XML page, surely all you need to do is change the name of your sitemap? <<

If the online tool that you use to generate the sitemap is keeping a copy of your sitemap and using it for "other purposes", it is already too late.

4:34 pm on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I am puzzled by all the references to scrap and scrapping in this thread.

The correct words are scrape and scraping.

4:58 pm on May 7, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


This thread must be a joke. Nothing can stop your content from being scraped while it is public.

Never say NOTHING can stop scraping just because nothing you use will stop it.

You can stop most, if not all, scraping if you have the proper tools running on your server. Sure, scrapers can still snag 1 or 2 pages from my site, as it takes at least 2 pages for my automated tools to detect non-human behavior characteristics. However, after 2-3 pages the door is slammed in their face for most garden variety scrapers, spammers looking for form pages, or email harvesters.

For instance, if something other than Google, Yahoo, or another whitelisted bot actually requests files like robots.txt or sitemap.xml, it is instantly blocked from crawling any additional pages, as no human (except nosy ones) ever requests such files.

If they are downloading .html pages and aren't loading other required page components, such as images, CSS, .js, or anything else that a browser needs to display my web pages, they are also instantly blocked. Of course there are a few rare bots that do all this, and they tend to do other stupid things that take a couple of additional page loads to detect and stop.

I could go on, but it's completely doable. You should've come to the PubCon session last year about stopping bad bots, we covered it in much detail.

5:04 pm on May 7, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


I wouldn't advocate for cloaking a sitemap. Actually, I think preventing scrapers altogether will lead to "unnatural linking" patterns and can even get you penalized by G.

OK, that's just the silliest advocacy of scraping that I've ever heard.

You assume that scrapers actually give you links; most do not. They steal your data, blend it into pages designed to steal your long-tail rankings in G, and compete with you for your own traffic.

However, there is the rare scraper that actually provides links but even those are often discounted as spam sites these days, so you get no link value there either.

5:04 pm on May 7, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 31, 2005
posts: 1651
votes: 0


I think you meant:
"Never say NOTHING can stop AUTOMATED scraping..."

Many, many scrapers are still doing copy-paste!
I see it in my logs with a referrer of:
"c:\Document and Settings\Some Guy\mywebsite.htm"

So, how do you stop copy-paste? I'm intrigued.
