|URL Removal Tool Disaster!|
If you disallow Googlebot with robots.txt, how soon after will it return?
Over the past couple of weeks I've unearthed a vast amount of duplicate content of the non-www variety, and also case variations across directory and file names.
I've been dealing with it using the URL removal tool on individual URLs or, where possible, using robots.txt.
Since we're talking about 14,000 non-www pages, I decided to submit a non-www robots.txt to deal with this problem (in the early hours of this morning). I've done this on numerous previous occasions and found it to be very effective.
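For illustration, the kind of disallow rule involved might look like this (a sketch; the directory name is a placeholder). The catch is that on a typical single-site setup, http://example.com/robots.txt and http://www.example.com/robots.txt resolve to the same physical file:

```
# robots.txt at the domain root -- served identically whether the
# request arrives on the www or the non-www hostname
User-agent: Googlebot
Disallow: /duplicate-dir/
```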
As of this writing the directories slated for removal are still pending, but my worst nightmare appears to be unfolding. It appears that Googlebot must have picked up the robots.txt through the www domain during the 20 or 30 seconds while it was modified for the non-www domain.
At this point I can't locate any of the pages from those directories in Google, although the total number of pages indexed for the site still shows the same as before I submitted the robots.txt.
If Googlebot picked up the robots.txt would it apply any disallowed directories it found globally to my pages in its index, or only to the individual page it was attempting to retrieve?
I'm hoping this change in Google's index is due to it removing the non www pages, but if it isn't, what are my best options for getting the www pages back into the index?
If you disallow Googlebot with robots.txt, how soon after can you persuade it to re-spider?
If what you think happened actually is what happened, recent reports have been in the area of six months for a return to normal.
If that's the case then I may have to recreate the deleted pages in a new folder.
A terrible waste of off-site backlinks though.
If you do so, you'll have to delete the original files from your server. Otherwise Google might consider them as duplicate content when the six months are over.
I disallowed Googlebot from a directory earlier this year. I was afraid of a duplicate content filter in that part of my site, and it appeared it might be affecting the rest of the site.
I removed the biggest part of what may have been tripping a dup filter, then changed robots.txt to allow Googlebot back in. GBot was disallowed for a total of 2-3 weeks, and was back spidering that directory almost immediately.
The pages in that directory were showing as supplemental, but are now listed normally. It certainly didn't take 6 months in my case. Perhaps things are different now, but I wouldn't toss good back links just yet.
Make your corrections, then see what happens before you do anything major.
Why are you posting after the disaster is done, instead of reading related posts on this forum? This is a serious issue and I've discussed it before, as I committed a similar mistake a year ago.
If you submit a non-www URL to the URL Console, it removes BOTH the non-www and www versions!
So it doesn't matter what Googlebot picked from robots.txt - the mere fact you tried to deal with www/non-www duplicates with URL Console means you removed these pages from the index for six months.
If your requests are still pending, there is still time to recover by changing robots.txt but after that there is no way out.
The only way to deal with www/non-www duplicates is 301, even if recently Google didn't support it correctly. URL Console is a dangerous tool and should be used wisely!
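For reference, the usual non-www-to-www 301 on Apache looks something like this (a sketch assuming mod_rewrite is available; example.com stands in for the real domain):

```apache
# .htaccess sketch: permanently redirect every non-www request
# to its www equivalent with a 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]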
I've had 301's running on this site for months to no effect.
Google only seems to understand instructions in the robots.txt and robots meta tag.
You're quite right Wizard, this is a serious issue, and while it's useful to have your info on WebmasterWorld, the most important place for this kind of information is on Google itself, right where the problem lies.
Google should provide additional information relating to the use of robots.txt and if it applies across both www and non-www.
We need a safe alternative for eliminating non-www content. I've tried using sitemap.xml in two versions, for www and non-www, in conjunction with robots meta tags (all my pages deliver this meta tag if requested as non-www). It seems to work, albeit slowly. If the sitemap had a status tag where you could specifically tell Google a page was to be completely removed from their index, that would be ideal.
Given the extent of this problem, I think that any company with even the slightest shred of integrity would provide adequate guidance on the use of their systems and tools.
It is not the responsibility of webmaster forums to provide best second guess guidance on how to deal with what are very serious matters which may affect your livelihood.
I can understand Google's desire to eliminate duplicate content, but the reality is that they are damaging more genuine businesses than they are eliminating scraper sites.
As a result, Google's SERPs contain fewer quality sites, making Google the biggest scraper site of them all!
|2 versions for www and non-www in conjunction with robots meta tags (all my pages deliver this if non-www) |
I understand you change meta robots dynamically, and give 'index' for www but 'noindex' for non-www? Bad idea, because your competitor can use the URL Console against you, submitting all your pages in the non-www version and thus removing your site from Google.
I also noticed problems with 301s recently, but you have to be patient. One of my sites has such duplicates despite the 301s, but Google marked the duplicates as 'Supplemental' so they don't hurt the rankings of the correct URLs. I keep waiting, and hope Google will do something about 301s sooner or later; it's a better solution for me than accidentally removing a four-year-old, DMOZ-listed site ;)
A year ago I lost my private hobby homepage for six months, so it wasn't a great loss for me, but I learned to be cautious with the URL Console, even though I keep using it quite often, especially when 301s work wrongly.
As for my advice versus Google's obligations: this forum is a place for sharing our experiences and advice, while Google is a private company not bound to care about webmasters.
In ideal world, Google would care about webmasters, for example giving complete information we need, and webmasters would care about Google, optimising their pages to make Googlebot work easier.
In the real world, webmasters either don't care about Google and build pages with scripts, frames, Flash and plain improper HTML, so that Googlebot often simply cannot crawl them, or use tricks to fool Google into giving them top positions.
We, webmasters, aren't cooperative for Google, so why expect cooperation from them?
Yes, perhaps you are perfectly white hat, and so am I. But still, Google's specialists have so much trouble with black hats and lame webmasters making uncrawlable sites that they can hardly spare time to think about us. So all we have is this forum, and advice from GoogleGuy and senior members.
Anyway, perhaps you should write to Google. I did when my site was removed because of this URL Console behaviour, and after 90 days I became nervous about why it was not reappearing. After a few emails, each followed by an automated response, they finally changed the information in the URL Console from 90 days to 6 months. Maybe you'll persuade them to state in the URL Console description that it removes both www and non-www, no matter which version has been submitted.
Wizard, are you saying that removing a non-www page using the robots meta tag will also remove the www page and vice versa?
If this is the case, I'm in deep trouble.
I've attempted to resurrect content in my one remaining directory by using the URL removal tool to delete the non-www pages. So far this appears to have worked well in that the pages are gone, but from what you're saying I could lose the www pages too.
If Google treats www and non-www as separate pages in its index, surely the pages should respond to individual robots meta tags, as should pages with different case, different query_strings and my personal favourite: the addition of a trailing slash plus your filename (or another in the same directory) on the end of the correct filename.
What's the best email address to contact the Duplicate Content Factory on?
If you chased off an innocent caller from your doorstep with a broom how long do you think it would be before he / she returned?
|Wizard, are you saying that removing a non-www page using the robots meta tag will also remove the www page and vice versa? |
Yes. When I removed my site a year ago, I did exactly that - I configured the site to decide whether to serve meta robots 'index' or 'noindex' depending on the presence of the string 'www' in the http_host server variable. I submitted the duplicate URL version I intended to remove, and in the URL Console requests list I saw only that URL version; however, both www and non-www were out of the index for six months. After six months, the correct version returned (I had set up a proper 301 in the meantime, so the wrong version didn't return).
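For concreteness, the host-dependent meta-robots setup described above (which, per this thread, can backfire badly) can be sketched like this. The function name and the CGI-style HTTP_HOST lookup are illustrative, not the poster's actual code:

```python
# Sketch: serve 'index' for www.* hosts and 'noindex' otherwise.
# WARNING (per this thread): a page served with noindex on the non-www
# host can be submitted to the URL Console by anyone, and Google may
# then drop BOTH the www and non-www versions for six months.
import os

def robots_meta_tag(host: str) -> str:
    """Return a robots meta tag based on the request's Host header."""
    if host.lower().startswith("www."):
        return '<meta name="robots" content="index,follow">'
    return '<meta name="robots" content="noindex,follow">'

# In a CGI environment the Host header is typically exposed
# as the HTTP_HOST environment variable:
tag = robots_meta_tag(os.environ.get("HTTP_HOST", "www.example.com"))
```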
|If Google treats www and non-www as separate pages in its index, surely the pages should respond to individual robots meta tags, as should pages with different case, different query_strings and my personal favourite: the addition of a trailing slash plus your filename (or another in the same directory) on the end of the correct filename. |
Yeah, it's logical, and I thought exactly that way. I was wrong.
I cannot attest that they haven't corrected it recently, as we keep discussing this problem here and I've tried to point the issue out to GoogleGuy, but I experienced the behaviour I described above, and it's quite easy to verify: make a test page indexed under both www and non-www, remove one version, then check whether both disappear.
What I'm sure of is that the last time I checked, it worked this way. I didn't use anything specific in my e-mail to Google, but a few months later GoogleGuy advised writing 'URL Canonicalization' in the subject, as far as I remember. Check this thread:
[webmasterworld.com...] for details.
The URL removal tool doesn't remove anything. It only hides things, bringing them back after six months. So it is no way to deal with anything.
You have to use a 301. That is the only thing that works. However, you can't 301 a Supplemental result. You have to bring a page back to life before you 301 it, and then leave the 301 in place permanently. If you ever take it off, the extinct pages probably will return.
Wizard, you're quite right.
I just looked over some of the non-www stuff that is showing 'complete' in Google's URL removal tool, and sure enough, the www page is gone too.
Needless to say I've suppressed the robots meta tag and reinstated the 301 redirects.
Fortunately, most of the non-www pages were cross domain indexed pages, so the effect on what is left of my site should be insignificant (he says crossing his fingers).
If a 301 is successful, does the page doing the redirecting disappear from Google's index or does it simply lurk in the dim, dark depths (about where I am right now!) of the index without incurring a duplicate content penalty?
Does Google interpret 301's on the fly as it encounters them or does it need to have a serious lay down before it processes a job lot?
Thanks for the advice guys.
I'm looking for somebody who can make a tool to put big URL lists (over 100,000!) into the Google removal tool!
Please don't enquire why. I'm simply looking for somebody who can code this.
I hope for an answer (PN). Thanks!
This is where something similar to the sitemap submission would be extremely useful.
I've submitted so many URLs over the past couple of weeks I think I have RSI from copying and pasting!
Seriously though: Google needs to beef up the instructions on the use of the URL removal tool. I'm now in a position where I doubt my sites will generate enough income to keep me going for the next 6 months, because I've inadvertently removed www pages when I thought I was removing only non-www.
It's clear from this thread that I'm not the only person to assume, incorrectly, that if Google treats www and non-www URLs as 2 different pages, then each page can have its own robots meta tag to control removal.
In the URL removal tool's 'Google Information for Webmasters', Google refers to 'The Robot Exclusion Standard' but they do not state that they comply with it!
In the Robots Exclusion Standard document is a table giving 4 examples of legal and individual locations of robots.txt. These are:

Site URL -> URL for robots.txt
http://www.w3.org/ -> http://www.w3.org/robots.txt
http://www.w3.org:80/ -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/ -> http://www.w3.org:1234/robots.txt
http://w3.org/ -> http://w3.org/robots.txt
You will notice that the last example shows a non-www robots.txt separate from the www version!
In the 'Google Information for Webmasters', they state that 'Each port must have its own robots.txt file' and then go on to say 'if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols'.
No mention is made of the non-www situation. In view of the fact that they remove both www and non-www pages as the result of a robots.txt placed at the root of either, or as a result of submitting a page with a robots meta tag, I feel that this is a very serious omission that requires immediate clarification.
Perhaps I'm misinterpreting the Robots Exclusion Standard, but it appears to me that Google is at odds with this standard when it comes to non-www URLs, and I would even go so far as to say that Google is at odds with its own index! How can they justify showing www and non-www content from the same site as separate pages (and applying a duplicate content penalty) whilst treating them as the same page for the purpose of robots?
I realise that the www and non-www robots.txt are the same file, but Google needs to inform webmasters in substantially more detail of the effects that using the URL removal tool will have on their sites. Referring webmasters to the Robots Exclusion Standard implies that they support that standard, which it appears they do not fully, and this is therefore misleading.
I would like to see the following added to the URL Removal Tool 'Google Information for Webmasters':
1. The effect of www robots.txt on non-www content and vice versa
2. The effect of robots meta tag in a www page on the corresponding non-www page and vice versa
3. Guidance on the safe removal of non-www content, the timespan involved if 301s are the only option, and the introduction of some method of bulk non-www page removal
If you've had pages unexpectedly zapped by the URL removal tool then let's hear about it!
We had a problem where Google thought non-www was our homepage and was showing it as such for our www site despite us never promoting the non-www site and having no links to it.
We thus used the URL removal tool to get rid of the non-www page quickly, and only afterwards read that it also removes the www page as well. We are now left with a very genuine, useful site with no homepage in Google.
We've submitted a reinclusion request (ref #352549097 if anyone from Google is reading this) to ask that the www homepage be re-included as it was never intended this was to be taken out and the url removal tool pages never indicated it would be.
Help! This is not right or fair. Us genuine webmasters are being penalised by a bug that should not exist.
|The URL removal tool doesn't remove anything. It only hides things, bringing them back after six months. |
I agree; my site was still crawled and its links were followed during the six months it was removed.
|I look for somebody who can make a tool too put big URL lists (over 100000!)into the Google removal tool! |
No doubt you've found that your competitor returns 404 or noindex for one URL version and you're going to kick them out of Google :)) I'm skilled enough to write such a tool, but I'm not sure I want to hurt someone's site in Google!
johnrooke, I've experienced this too.
I have a couple of business directories (built from individual submissions) that forward to the same domain.
I serve up the correct homepage based upon cgi.server_name. No problems for years. In the last few days I have found the files that are served up as index pages appearing in Google's index under the URL of the folder where they actually reside. With no backlinks, they have magically inherited the PageRank of the homepage!
After years of operating these directories in this manner, it looks like I may have to move them to separate IPs.
I find myself wasting more time trying to appease Google simply to stay in its index, rather than improving my sites and content to satisfy visitors!
So how is Google improving search? Burying decent sites and diverting webmasters from the task of developing their websites - Oh Yeah BIG Improvement!
Google's still indexing the directories that were dropped. What's the point?
For 'Robot Exclusion Standard' Google obviously reads 'Index Exclusion Standard'.
Google is at odds with this standard. Google refers to 'stopping indexing' and being 'excluded from the index', while the standard uses the phrase 'exclude robots from a server'.