Forum Moderators: Robert Charlton & goodroi


Are Scrapers Exploiting Your sitemap.xml File?


Keniki

4:52 am on May 6, 2007 (gmt 0)



Many people seem to be posting that after adding sitemaps they are suffering problems with scraped content. Could sitemap.xml be being abused? Are new content titles and meta tags scraped by sitemap generators before the sitemap is even submitted to Google?

incrediBILL

5:08 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Unless the bots can search for *.xml, meaning they find every xml page, surely all you need to do is change the name of your sitemap?

BINGO!

The problem is most people use default file names.

Just like when people install forum and blog software and don't change the comments page or obfuscate the HTML of the comment form: the spammer software locates the page and starts spamming.

That's the real solution, NEVER get lazy and use defaults.
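The rename-and-submit idea above can be sketched in a few lines. This is only an illustration, not anyone's actual setup: the function names are made up, and the ping endpoint shown is the one Google documented for sitemap submission in this era.

```python
import secrets
from urllib.parse import quote

def unguessable_sitemap_name():
    # e.g. "sitemap-9f86d081884c7d65.xml" -- defeats scrapers that only
    # probe for default filenames like /sitemap.xml
    return f"sitemap-{secrets.token_hex(8)}.xml"

def google_ping_url(site_root, sitemap_name):
    # Google's sitemap ping endpoint; the sitemap URL is percent-encoded
    # into the query string
    sitemap_url = f"{site_root}/{sitemap_name}"
    return "http://www.google.com/ping?sitemap=" + quote(sitemap_url, safe="")

name = unguessable_sitemap_name()
print(name)
print(google_ping_url("http://www.example.com", name))
```

Upload the file under the random name, ping Google with that URL, and never link to it from your own pages; the name then exists only in your server config and Google's index of submissions.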

callivert

5:27 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I could go on, but it's completely doable. You should've come to the PubCon session last year about stopping bad bots, we covered it in much detail.

IncrediBILL, as I didn't go to PubCon last year, and won't be going this year :( , is there a resource you could point me to on this topic? I'd like to set up some of these anti-scraping tools on my server.

IanKelley

6:54 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but it's completely doable

No doubt of that, at least when it comes to simple, stupid scrapers.

However scrape blocking methods are equally effective with or without a sitemap.xml file. A sitemap will not allow a scraper to grab more of your pages before being blocked than it otherwise would.

Bottom line: this thread is an urban legend in the making.

Or is it already too late? Oh well, it's a harmless enough myth :-)

blend27

7:17 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



--- those are often discounted as spam sites these days

Well, it all depends on what data you feed them as far as the title tag goes :). I've been experimenting with one of the scraper groups, just playing mind games, and it seems to have some positive effect for a short period of time. What I do see now is that they have adapted a bit and in most cases only grab the pages that contain more than just one block of text.

But these issues could be resolved by a very simple "deny to profit" from the companies who sponsor or encourage this type of behavior. You know how they are.

As far as sitemaps and robots.txt go:

Both should be cloaked these days. A strong trap is a must. Datacenters should be blocked as well (close to 1000 and counting). If you run e-commerce, anything outside the area where you do business should first be greeted with a welcome page/special deals page whose contents are loaded all at once, wrapping the entire page in one DIV, so the number of pages that can be scraped is down to two or three.
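"Cloaking" the sitemap here means serving it only to crawlers you can verify. A minimal sketch of the standard Googlebot check (reverse DNS, then forward-confirm) follows; the function name is my own, and the DNS lookups are injectable parameters so the logic can be exercised without network access:

```python
import socket

def is_verified_googlebot(ip,
                          reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                          forward=socket.gethostbyname):
    """Reverse-DNS the client IP, check the hostname belongs to Google,
    then forward-confirm the hostname resolves back to the same IP.
    A spoofed User-Agent fails the reverse check; a spoofed PTR record
    fails the forward check, since it won't resolve to the client's IP."""
    try:
        host = reverse(ip)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return forward(host) == ip
    except OSError:
        return False
```

In the handler for your sitemap URL you would return the real file when this check passes and a 404 (or a decoy) otherwise; the same gate works for robots.txt.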

We don't do it to discourage our visitors from browsing our sites and researching various topics; we do it because we're forced to.

And as far as 'c:\Document and Settings\Some Guy\mywebsite.htm' goes, you do know the IP it originated from, right? If so, your dynamically generated image should say: 'I know what you did last summer'.

netchicken1

8:33 pm on May 7, 2007 (gmt 0)

10+ Year Member



Even after reading all these threads I fail to see what the fuss is about.

1. You make a sitemap, call it mysitemap1.xml
2. You submit it to google sitemaps
3. Google indexes it.

Now unless you link to it from your site, there is no way for any program to find it. Even if a scraper did a search for *.xml, only Google has seen the sitemap.

It only has one link, and that's on Google Sitemaps.

If you don't trust the software you use to create your sitemap not to pass the name on, then just change the name AFTER you have made the sitemap and BEFORE you submit it. Otherwise, like me, buy a decent program that does a good indexing job.

If you feel ultra paranoid, you can submit it and then, once it's been read, change the name.

I have used sitemaps for over a year now; googling for the names of my sitemaps has brought up nothing. They are not indexed in Google at all.

I see a mountain emerging from the molehill.

g1smd

10:47 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> If you don't even trust the software that you use to create your sitemap not to pass the name on. Then just change the name AFTER you have made the sitemap and BEFORE you submit it. Otherwise like, me, buy a decent program that does a good indexing job. <<

Too late. The online tool that made your sitemap already sucked your data off your site while it was making the sitemap, and had already assembled the scraped site and published it before you had even uploaded your new sitemap file to your own webserver. Google is already on the copy site indexing it...

IanKelley

11:09 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Fortunately the poster was referring specifically to client side software and so he did not, in fact, use an online tool ;-)

tiori

11:16 pm on May 7, 2007 (gmt 0)

10+ Year Member



Too late. The online tool that made your sitemap already sucked your data off your site while it was making the sitemap, and had already assembled the scraped site and published it before you had even uploaded your new sitemap file to your own webserver. Google is already on the copy site indexing it...

Sounds like that might be it. I've often wondered why these sites are offering "free" online sitemap generators.

g1smd

11:22 pm on May 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I always considered it part of the same scam as the free "e-postcards"...

"Type your email address in, and then the email address of where you want the free e-postcard sent."

Nice easy way to harvest valid email addresses.

Eavesy

7:09 am on May 8, 2007 (gmt 0)

10+ Year Member



Hi guys, I am new here. I found this thread via SERT and I am worried about the number of spam/scraper sites linking to my pages; every page of my site has about 20 (that I am aware of) linking to it. Here is a classic example:

< link removed >

Every result on this search query is a scraper site linking to my website optimisation page, and it is like that for all of my other pages as well. Could these scraper sites be hurting my rankings? I am not very good at web design or anything technical, really, and I wouldn't know where to start with cloaking my sitemap or robots file. I do link to my XML sitemap and my RSS feed from my homepage; does this make it even worse? Any advice would be much appreciated.

<Sorry, no search results links
See Forum Charter [webmasterworld.com]>

[edited by: tedster at 7:15 am (utc) on May 8, 2007]

shoffy

5:12 pm on May 9, 2007 (gmt 0)

10+ Year Member



I've had this experience myself, but haven't let it bother me. I think search engines should be smart enough to recognize scraping, mostly via the time the content appeared (original article in May; scraped copy found in June).

Might we be making too much of this?

matrix_neo

6:03 pm on May 9, 2007 (gmt 0)

10+ Year Member



I feel glad that I have never used sitemaps. My ignorance at its best ;)

ronmojohny

11:53 pm on May 12, 2007 (gmt 0)

10+ Year Member



How about screwing with them by using the <base href="http://www.yoursite.com"> tag in the head of your pages to get some free backlinks?

fabricator

12:07 pm on May 14, 2007 (gmt 0)

10+ Year Member



Put your site name/URL, or just your own name, into the text at random points and let the dumb scrapers copy it, warts and all.
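That watermarking trick is easy to automate. A minimal sketch, with a made-up function name and parameters; the seeded RNG just makes a given page's markers reproducible:

```python
import random

def watermark(text, mark, rate=0.05, seed=None):
    """Insert `mark` (e.g. your site name or URL) after roughly `rate`
    of the word gaps in `text`. A scraper that copies the page wholesale
    copies the markers too, warts and all."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < rate:
            out.append(mark)
    return " ".join(out)
```

Searching for the marker string later turns up the copies; the original words all survive in order, so legitimate readers lose nothing but a little polish.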