Forum Moderators: Robert Charlton & goodroi
[webmasterworld.com...]
I'm sure many people, including myself, are very, very pleased about this, as it stops scumbag sites from stealing our content.
However, it also appears that some non-scraper sites have been included in this purge (including my own). My site has been active for 5 years and is based on unique content.
Has anyone else been affected by this, and does Google intend to refine the algorithm to stop valid, unique-content sites from falling victim?
However, not having one when googlebot requests it each time is a good reason for a delay in re-inclusion.
SEO1, may I ask what you base this on? I've never heard or seen anything that indicates this. Did I miss something?
My experience is that robots.txt is more often the cause of a site not being included. Not having one lessens the likelihood of a crawler ignoring your web site, since this would indicate all crawlers are allowed.
It is found at the link below:
[google.com...]
I clipped a few of the important issues from the page below:
# Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler. Visit [robotstxt.org...] to learn how to instruct robots when they visit your site.
# If your company buys a content management system, make sure that the system can export your content so that search engine spiders can crawl your site.
# Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.
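As an aside, the simplest robots.txt that keeps you on the safe side is one that allows everything. The file below is just an illustrative minimal example; any directories you actually want blocked would go in additional Disallow lines:

```
# robots.txt - placed in the site root, this allows every crawler
# to fetch the whole site (an empty Disallow means "block nothing")
User-agent: *
Disallow:
```

Serving this file also means googlebot gets a 200 instead of a 404 when it asks for /robots.txt.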
Hope this helps.
SEO1, have you seen a high number of sites without a robots.txt file that have been banned on July 28th? I didn't think so.
This thread has deteriorated into a lot of speculation about issues that have shown no real relationship to the July 28th massacre.
I've been watching about 30 sites that are going through this ordeal. Some from WW members and some not. The issues that they seem to have in common have very little to do with what is being discussed in this thread.
The fact is that there are well known "signals of quality" factors that are being ignored while we are stuck here talking about missing robots.txt files and shared IP's.
Didn't know this was your personal thread.
So why did you ask where I found the information on the robots text file?
If you didn't really care, you should have just gone straight into your rant.
By the way, most threads seem to come full circle if you read them from the beginning...
but again, since it's your private thread, I'm out.
Have fun solving whatever problems you have to deal with...
Fortunately I don't have to deal with those hassles
Peace
I believe my site might have been banned from Google, and I would like to ask you to please consider my site for reinclusion into the Google index. I was using some software-generated pages that might have been the reason. I have removed all of these pages and discontinued using this program. I have reviewed your webmaster guidelines at [google.com...]. I apologize if I violated your guidelines, and I promise not to let it happen again.
I've been watching about 30 sites that are going through this ordeal. Some from WW members and some not. The issues that they seem to have in common have very little to do with what is being discussed in this thread.
Seeing as you've been watching 30 sites and have noticed issues they have in common that may have caused them to be banned, why don't you offer them here so we can comment on them?
On the one hand you say there's no death penalty, but in the next breath you say sites are banned. Banned is pretty much a death penalty, huh? No more traffic, no more bot, no more PageRank: that's one lifeless Google site.
And it's not because of a missing robots.txt file.
Seeing as you've been watching 30 sites and have noticed issues they have in common that may have caused them to be banned, why don't you offer them here so we can comment on them?
Glad you asked!
First off, the only sites I am looking at are dedicated search engine/directory web sites. These are not content or ecom sites with directories added on, these sites function only as search/directories. Each one also has a unique database of listings, not ODP.
There are various things that some have in common, but the three biggest factors that I see are: #1, these sites have a lot of HTML errors when run through an HTML validator; #2, each one has its domain registered for less than 18 months. #3 is so crazy I'm not going to mention it yet, but with my example, it is a 100% match.
I understand both of these issues are common, and there are a lot of sites like this that haven't been banned. I believe that these may be the "signals of quality" that have been alluded to in some of the discussions here at WW. They would probably be just part of the puzzle and not the whole puzzle, but it seems to make sense that a scraper site would have poorly coded HTML, since a scraper site is likely made of pages copied and pasted into other pages, which often causes errors.
My guess is that when you add these factors to others such as a high percentage of outbound links, dupe content, etc., even some legit sites slip over the threshold and are banned.
Does this make sense?
I'm trying to get a thread together to compile a list of known or suspected signals of quality. Will keep you posted...
Again, your new term is not one that is recognized, and adding in more junk only adds to more junk...
Secondly, it seems you need reading or comprehension lessons... my reference to robots.txt was meant for andem & sunflower regarding "re-inclusion" of their websites into the index.
Nowhere did I say it is a reason for being banned.
In closing, jr, in the past 3 years I've helped 10 sites banned from Google get back into the index, the latest a 100,000-page pharmacy site based in Canada... so I do think I have some knowledge of what I am talking about.
Datsguy
Check the server... it sounds like a good candidate for having been blacklisted.
There are nearly 30 different servers, and each has other sites which are not banned... I don't think this is the problem.
I don't really think that anything before July applies to this ban; this is different from anything that's happened in the last 3 years.
I'm sorry that this thread seems to have been hijacked by some who would rather give out blanket advice without taking the time to understand the issues that necessitated the start of this thread in the first place.
There must be a better place to discuss this issue.
What if googlebot is continuing to spider a site that fell out of the Google index? Does that mean the site might not be banned? I always assumed if a site is banned, it doesn't get spidered, but maybe that's not the case.
Can you tell what googlebot did while it was there? Search engines can get caught up in image maps and other spider traps and go nowhere.
I am very interested, as your #1 and #2 sound logical.
moftary, check out [webmasterworld.com...]
Can you tell what googlebot did while it was there? Search engines can get caught up in image maps and other spider traps and go nowhere.
Loren,
googlebot has been mostly hitting my homepage, index.html. My theory is that it might be attempting to respider a large number of doorway pages I recently deleted from my site. I also use a custom 404 redirect in .htaccess:
ErrorDocument 404 [.......]
that takes a visitor back to my homepage if they click through to a non-existent page, instead of the usual "404 Not Found" page. I am assuming googlebot might be deindexing the missing pages.
[webmasterworld.com...]
ErrorDocument 404 [mydomain.com...]
AddType text/x-server-parsed-html .html .htm
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mydomain\.com$ [NC]
RewriteRule ^(.*)$ [mydomain.com...] [R=301,L]
I assumed the frequent respidering of my homepage was due to googlebot attempting to respider missing pages from Google's cache but being redirected to my homepage, since I have a custom 404 redirect back to my homepage in my .htaccess file. Maybe I should disable my custom 404 redirect, so that my webhost's default 404 page takes over until googlebot is finished spidering everything.
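One way to test this theory is to look at what googlebot actually requested. Assuming a combined-format Apache access log (the path below is an example; adjust it for your host), a quick pipeline tallies the status codes and URLs it saw:

```shell
# Count Googlebot requests by status code and URL.
# In combined log format, $9 is the HTTP status and $7 the requested path.
# The log path is an assumption; substitute your own.
log="${1:-/var/log/apache2/access.log}"
if [ -f "$log" ]; then
    grep -i "googlebot" "$log" | awk '{print $9, $7}' | sort | uniq -c | sort -rn
fi
```

A pile of redirect statuses against pages you deleted would confirm that googlebot is being bounced to the homepage rather than seeing real 404s.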
I haven't been by the Apache forum in a few days, so I don't know if this has been answered there, but thought it would be good for all to know.
The correct error document location should be a server relative path. In this case:
ErrorDocument 404 /
If not, a 302 will be served instead of a proper 404 'page not found' status, meaning *all* pages on your site that do not exist are reported as temporarily moved to your index page... this could cause some serious issues, like duplicating your home page every time a page is not found.
Justin
From the Apache Documentation:
Note that when you specify an ErrorDocument that points to a remote URL (ie. anything with a method such as "http" in front of it), Apache will send a redirect to the client to tell it where to find the document, even if the document ends up being on the same server. This has several implications, the most important being that the client will not receive the original error status code, but instead will receive a redirect status code.
The default status code for a redirect is a 302.
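Putting the two posts together, the difference looks like this in .htaccess (the /notfound.html filename is just an example):

```
# Server-relative path: Apache serves the error page itself,
# and the client still receives the real 404 status
ErrorDocument 404 /notfound.html

# Full URL: Apache instead sends the client a 302 redirect,
# so the 404 status is lost
# ErrorDocument 404 http://www.example.com/notfound.html
```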
My site was a casualty of the July 28 tweak. Tonight it's back in the index. I had sent several re-inclusion requests, which I described in the locked thread, and it looks like they worked.
girish,
What's the link to the closed thread again? I looked back and was unable to locate it. I too am about to reply to an auto-reply from help@google.com about my reinclusion request.
I'm not sure if I should repeat what I said in my original message I sent via Google's form page, or just let them know my site is missing from their index. Should I again admit to guilt for using doorway pages and apologize etc.. or leave that stuff out since I already mentioned that in my original message?