Forum Moderators: Robert Charlton & goodroi
[webmasterworld.com...]
I'm sure many people, including myself, are very, very pleased about this, as it stops scumbag sites from stealing our content.
However, it also appears that some non-scraper sites have been included in this purge (including my own). My site has been active for 5 years and is based on unique content.
Has anyone else been affected by this, and does Google intend to refine the algorithm to stop valid, unique-content sites from falling victim?
However, not having one when googlebot requests it each time is a good reason for a delay in re-inclusion.
SEO1, may I ask what you base this on? I've never heard or seen anything that indicates this. Did I miss something?
My experience is that robots.txt is more often the cause of a site not being included. Not having one lessens the likelihood of a crawler ignoring your web site, since this would indicate all crawlers are allowed.
It is found at the link below:
[google.com...]
I clipped a few of the important issues from the page below:
# Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler. Visit [robotstxt.org...] to learn how to instruct robots when they visit your site.
# If your company buys a content management system, make sure that the system can export your content so that search engine spiders can crawl your site.
# Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.
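As an aside, the simplest robots.txt that keeps you on the safe side is one that allows everything. The file below is just an illustrative minimal example; any directories you actually want blocked would go in additional Disallow lines:

```
# robots.txt - placed in the site root, this allows every crawler
# to fetch the whole site (an empty Disallow means "block nothing")
User-agent: *
Disallow:
```

Serving this file also means googlebot gets a 200 instead of a 404 when it asks for /robots.txt.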
Hope this helps.
SEO1, have you seen a high number of sites without a robots.txt file that have been banned on July 28th? I didn't think so.
This thread has deteriorated into a lot of speculation about issues that have shown no real relationship to the July 28th massacre.
I've been watching about 30 sites that are going through this ordeal. Some from WW members and some not. The issues that they seem to have in common have very little to do with what is being discussed in this thread.
The fact is that there are well known "signals of quality" factors that are being ignored while we are stuck here talking about missing robots.txt files and shared IP's.
Didn't know this was your personal thread.
So why did you ask where I found the information on the robots text file?
If you didn't really care, you should have just gone straight into your rant.
By the way, most threads seem to come full circle if you read them from the beginning...
but again, since it's your private thread, I'm out.
Have fun solving whatever problems you have to deal with...
Fortunately I don't have to deal with those hassles
Peace
I believe my site might have been banned from Google, and I would like to ask you to please consider my site for reinclusion into the Google index. I was using some software-generated pages that might have been the reason. I have removed all of these pages and discontinued using this program. I have reviewed your webmaster guidelines at [google.com...]. I apologize if I violated your guidelines, and I promise not to let it happen again.
I've been watching about 30 sites that are going through this ordeal. Some from WW members and some not. The issues that they seem to have in common have very little to do with what is being discussed in this thread.
Seeing as you've been watching 30 sites and have noticed issues they have in common that may have caused them to be banned, why don't you offer them here so we can comment on them?
On the one hand you say there's no death penalty, but in the next breath you say sites are banned. Banned is pretty much a death penalty, huh? No more traffic, no more bot, no more PageRank: that's one lifeless Google site.
And it's not because of a missing robots.txt file.
Seeing as you've been watching 30 sites and have noticed issues they have in common that may have caused them to be banned, why don't you offer them here so we can comment on them?
Glad you asked!
First off, the only sites I am looking at are dedicated search engine/directory web sites. These are not content or ecom sites with directories added on, these sites function only as search/directories. Each one also has a unique database of listings, not ODP.
There are various things that some have in common, but the three biggest factors that I see are: #1, these sites have a lot of HTML errors when run through an HTML validator; #2, each one has its domain registered for less than 18 months. #3 is so crazy I'm not going to mention it yet, but with my example, it is a 100% match.
I understand both of these issues are common, and there are a lot of sites like this that haven't been banned. I believe that these may be the "signals of quality" that have been alluded to in some of the discussions here at WW. They would probably be just part of the puzzle and not the whole puzzle, but it seems to make sense that a scraper site would have poorly coded HTML, since a scraper site is likely made of pages copied and pasted into other pages, which often causes errors.
My guess is that when you add these factors to others such as a high percentage of outbound links, dupe content, etc., even some legit sites slip over the threshold and are banned.
Does this make sense?
I'm trying to get a thread together to compile a list of known or suspected signals of quality. Will keep you posted...
Again, your new term is not one that is recognized, and adding in more junk only adds to more junk...
Secondly, it seems you need reading or comprehension lessons... my reference to robots.txt was meant for andem & sunflower regarding "re-inclusion" of their websites into the index.
Nowhere did I say it is a reason for being banned.
In closing, jr, in the past 3 years I've helped 10 sites banned from Google get back into the index, the latest a 100,000-page pharmacy site based in Canada... so I do think I have some knowledge of what I am talking about.
Datsguy
Check the server... it sounds like a good candidate for having been blacklisted.
There are nearly 30 different servers, and each has other sites which are not banned... I don't think this is the problem.
I don't really think that anything before July applies to this ban; this is different from anything that's happened in the last 3 years.
I'm sorry that this thread seems to have been hijacked by some who would rather give out blanket advice without taking the time to understand the issues that necessitated the start of this thread in the first place.
There must be a better place to discuss this issue.
What if googlebot is continuing to spider a site that fell out of the Google index? Does that mean the site might not be banned? I always assumed if a site is banned, it doesn't get spidered, but maybe that's not the case.
Can you tell what googlebot did while it was there? Search engines can get caught up in image maps and other spider traps and go nowhere.
I am very interested, as your #1 and #2 sound logical.
moftary, check out [webmasterworld.com...]
Can you tell what googlebot did while it was there? Search engines can get caught up in image maps and other spider traps and go nowhere.
Loren,
googlebot has been mostly hitting my homepage, index.html. My theory is that it might be attempting to respider a large number of doorway pages I recently deleted from my site. I also use a custom 404 redirect in .htaccess:
ErrorDocument 404 [.......]
that takes a visitor back to my homepage if they click through to a non-existent page, instead of the usual "404 Not Found" page. I am assuming googlebot might be deindexing the missing pages.
[webmasterworld.com...]
ErrorDocument 404 [mydomain.com...]
AddType text/x-server-parsed-html .html .htm
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mydomain\.com$ [NC]
RewriteRule ^(.*)$ [mydomain.com...] [R=301,L]
I assumed the frequent respidering of my homepage was due to googlebot attempting to respider missing pages from Google's cache but being redirected to my homepage, since I have a custom 404 redirect back to my homepage in my .htaccess file. Maybe I should disable my custom 404 redirect, so that my webhost's default 404 page takes over until googlebot is finished spidering everything.
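One way to test this theory is to look at what googlebot actually requested. Assuming a combined-format Apache access log (the path below is an example; adjust it for your host), a quick pipeline tallies the status codes and URLs it saw:

```shell
# Count Googlebot requests by status code and URL.
# In combined log format, $9 is the HTTP status and $7 the requested path.
# The log path is an assumption; substitute your own.
log="${1:-/var/log/apache2/access.log}"
if [ -f "$log" ]; then
    grep -i "googlebot" "$log" | awk '{print $9, $7}' | sort | uniq -c | sort -rn
fi
```

A pile of redirect statuses against pages you deleted would confirm that googlebot is being bounced to the homepage rather than seeing real 404s.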
I haven't been by the Apache forum in a few days, so I don't know if this has been answered there, but thought it would be good for all to know.
The correct error document location should be a server relative path. In this case:
ErrorDocument 404 /
If not, a 302 will be served instead of a proper 404 'page not found' status, meaning *all* pages on your site that do not exist are reported as temporarily moved to your index page... this could cause some serious issues, like duplicating your home page every time a page is not found.
Justin
From the Apache Documentation:
Note that when you specify an ErrorDocument that points to a remote URL (ie. anything with a method such as "http" in front of it), Apache will send a redirect to the client to tell it where to find the document, even if the document ends up being on the same server. This has several implications, the most important being that the client will not receive the original error status code, but instead will receive a redirect status code.
The default status code for a redirect is a 302.
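Putting the two posts together, the difference looks like this in .htaccess (the /notfound.html filename is just an example):

```
# Server-relative path: Apache serves the error page itself,
# and the client still receives the real 404 status
ErrorDocument 404 /notfound.html

# Full URL: Apache instead sends the client a 302 redirect,
# so the 404 status is lost
# ErrorDocument 404 http://www.example.com/notfound.html
```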
My site was a casualty of the July 28 tweak. Tonight it's back in the index. I had sent several re-inclusion requests, which I described in the locked thread, and it looks like they worked.
girish,
What's the link to the closed thread again? I looked back and was unable to locate it. I too am about to reply to an auto-reply from help@google.com about my reinclusion request.
I'm not sure if I should repeat what I said in my original message I sent via Google's form page, or just let them know my site is missing from their index. Should I again admit to guilt for using doorway pages and apologize etc.. or leave that stuff out since I already mentioned that in my original message?