Forum Moderators: goodroi
My site has about 100,000 content pages. Google is only indexing about 75% of those pages. Got me to wondering, can a scraper come along and take that other 25%, get it indexed, and become the original source?
Just because G doesn't index 25% of my pages, does that imply they are not aware of those pages?
Seems like a potentially big problem for sites that are not indexed well and get scraped, duplicated, or copied a lot.
Previously I had not been using any of the search engines' sitemap tools, but I decided to go ahead and submit a sitemap to them for just this reason... So they at least have some way to know what content is mine...
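For anyone who hasn't put one together before, a minimal sitemap following the sitemaps.org protocol looks like this (the domain, paths, and dates below are placeholders):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per content page -->
  <url>
    <loc>https://www.example.com/articles/page-1.html</loc>
    <lastmod>2008-06-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/articles/page-2.html</loc>
    <lastmod>2008-06-20</lastmod>
  </url>
</urlset>
```

One caveat for a 100,000-page site: the protocol caps each sitemap file at 50,000 URLs, so you'd need to split the URLs across multiple files and tie them together with a sitemap index file.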
Not sure if this is an obvious topic, just thought I would throw it out there...
There are some common reasons for non-indexed content. Each carries a different level of risk that a scraper gets discovered as the original source first.
Too soon - Search engines are fast but they still need a few days (or weeks, depending on the site) to crawl & index content. If this is the case I would not worry too much. Search engines are usually much faster at discovering your content than scrapers are. A sitemap might help but is generally not necessary. What normally does the most to speed up indexing is boosting your link popularity.
Unreachable - The content is hidden behind forms, is blocked by robots.txt, or has no links pointing to it. This carries a high risk of a scraping nightmare. You should definitely use a sitemap to expose this content to search engines (not to mention removing the roadblocks on your site).
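As a concrete example of the robots.txt roadblock: a rule like the one below tells all compliant crawlers to stay out of an entire directory, so nothing under it ever gets crawled from your site (the path is hypothetical):

```
User-agent: *
Disallow: /articles/
```

If your un-indexed pages live under a path like that, narrowing or removing the Disallow line matters more than any sitemap, since search engines generally won't index what robots.txt tells them not to fetch.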
Low Value - Search engines generally do not want to index blank, duplicate, or other pages with very low value to users. If your page is not indexed, that does not mean it was not visited/crawled by the search engines. It does not matter how many sitemaps you submit; search engines will not index pages they deem to be of low value. The best fix is to increase the amount of unique text on each page, and boosting link popularity wouldn't hurt either.
In general, I have found most concerns about scrapers are blown out of proportion. I don't like scrapers and I try to stop them. But I worry much more about my competition.
Even if a scraper steals all of my content they will not outrank me. That is because I work hard to ensure my content is easily crawlable, most pages have significant & unique text, and my link popularity will blow away scrapers. This simple recipe ensures that my pages rank high and the search engines filter out the scrapers.
Occasionally a scraper site will slip through and rank in the SERPs. That is when I pull out the DMCA requests and start emailing hosting companies.