|then submit the URL of the robots.txt file to their URL console |
You mean the URL submission console?
Not the URL removal console?
Reid - I'm pretty sure g1 means this tool at Google:
USE EXTREME CAUTION with this - it's very powerful and *if you accidentally disallow* your entire site or directories it will delete them from Google's index in 1-2 days.
We used this successfully to remove pages listed under site:oursite.com that seemed to be created by our own 302 redirects to affiliate sites. To do this we excluded our CGI directory in robots.txt and then submitted the revised robots.txt to Google via the tool. We also deleted some entire domains we did not want indexed. But use this cautiously.
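The CGI exclusion described above might look something like this in robots.txt (the directory name here is an assumption - substitute your actual CGI path):

```
User-agent: *
Disallow: /cgi-bin/
```

Submitting the revised robots.txt via the tool then asks Google to drop everything under that path from the index, so triple-check the path before you submit.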
If only one could get reincluded as easily we'd have a happier thread here.
[edited by: joeduck at 5:46 am (utc) on April 23, 2005]
|To do this we excluded our CGI directory in robots.txt and then submitted the revised robots.txt |
so I use the removal tool to remove robots.txt because google has it cached and this will force googlebot to refetch it?
Or are you talking about submitting robots.txt for inclusion into google?
To get this thread back on topic - I've been following it since the beginning of the MEGA thread about google 302's started by Japanese.
By the way, Japanese wasn't 'kicked out' of WebmasterWorld for posting something wrong, like someone said earlier; he got mad and left because Brett snipped one of his posts (TOS).
If you read the 700+ post thread, Brett did tell him he is still welcome here, but he hasn't come back.
Anyway, we were using the URL removal tool in this fashion:
1. Block the hijacked page (your page) from googlebot in robots.txt (or with a META on the page).
2. Submit the hijacker's URL into the removal tool (it will show a 404, because googlebot is blocked from seeing the page the 302 points at). The 302 URL gets removed from google's index.
3. Remove the robots.txt entry (or the META) so that googlebot will be able to continue crawling as it should.
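For step 1, the two blocking options might look like this (the page path is invented for illustration). Either a robots.txt entry:

```
User-agent: Googlebot
Disallow: /widgets/blue.html
```

or a standard robots META tag in that page's head:

```html
<meta name="robots" content="noindex">
```

Per the procedure above, either one is enough to satisfy the removal tool's check in step 2 - just remember step 3, or your own page stays blocked.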
This would remove the 302 badguy URL for 90 days before googlebot finds the 302 on badguys site and the whole cycle starts over.
Google extends the 90 days to 6 months - this is a good thing because it gives them lots of time before these URL's come back to haunt.
Some guys went beyond this good advice and tried to 'clean up' their SERP's by working on their own site; this is when the thread got out of hand.
They ended up removing their entire site because they submitted the non-www version of their own site.
And a few other mistakes were made - one guy submitted the hijacker's URL and his own site went crashing down.
So to some the 6 months is really BAD news.
Googleguy shows up and helps some of these guys that messed it up get reincluded. But the 302 thing......
It seems that we did do the right thing - most of us did - and google seems to have fixed something to do with 302's.
There are a few threads going about 301 problems since that fix. I even had some old, long-gone 301's re-appear. Some people have had old, long-gone DOMAINS reappear due to the 301 thing - no casualties yet, but a lot of nervous webmasters.
So I assume the 302 thing is fixed but I want to nuke those old 301's with the removal tool. Along with a couple of my own stray cgi url's floating around in the SERP's.
[edited by: Reid at 6:01 am (utc) on April 23, 2005]
Reid - I think no to both your questions. They have a good description at the tool, but my understanding is that the tool simply gets a bot to revisit your robots.txt and follow the instructions. For example: you have pages that are appearing in Google that you _don't want to appear_. They are located in a directory called www.yoursite/cgi/
You could put a disallow in your robots.txt and then submit the robots.txt via the tool. Those pages will be deleted in 24-48 hours.
In our case it appeared that robots.txt was ignored during an update, leaving us with many odd pages, but then followed when we did the exclusion via robots.txt.
Again I urge caution to anybody using this. A simple typo in your robots.txt could blast your site out of the indexes entirely.
joeduck - so I guess it's a good idea to wait and see if googlebot is handling your new robots.txt OK before you blast it with the removal tool, because if googlebot can't get past that robots.txt file, you end up blasting everything.
Afaik, the googlebot that does the removals is a different googlebot from the one that does the indexing.
The one that does the indexing will probably see an URL somewhere on the internet and include that URL in the giant link pool, which means that it becomes part of the index. Later, another googlebot comes by and tries to spider that URL, but as it finds out that it's "robots.txt'ed" it gives up so the indexed URL becomes an "URL only" listing.
As always, the spider adds links, it doesn't remove them. So, to get the links that the spider bot found removed you have to ask the removal bot.
If your "robots.txt" file already includes the links that you want to remove you don't need to change it in any way. Just submit it to the removal bot (URL Console) and that one will go through all of your links and remove those that are in conflict with your "robots.txt" file.
Actually this is a pretty smart thing Google has, and it usually works like a charm. I often wish that the other SE's had an URL Console like it, as it's often close to impossible to get URL's removed from, say, Yahoo (several months of 404 or 410 seems to be needed).
I don't understand why the removal tool causes a page to be removed for six months. The removal tool will not work at all if it finds the page. Following that logic, shouldn't the page be re-indexed as soon as Googlebot comes across it again?
URLs with spaces are indexed with "%20" but the removal tool doesn't recognize these URLs, presumably because it cannot find the pages.
I find that a lot of dead pages have a "%20", probably partly due to the reason mentioned above. I suppose hijackers and other ne'er-do-wells can exploit this phenomenon to make their URLs harder to remove. Can the removal tool be tweaked to accept a URL with a %20?
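A small sketch of the escaping mismatch being described, using Python's standard urllib (the path here is made up):

```python
from urllib.parse import quote, unquote

# A URL path containing literal spaces.
path = "/old files/page one.html"

# Google indexes the escaped form, with each space as "%20".
escaped = quote(path)
print(escaped)  # /old%20files/page%20one.html

# The two forms point at the same resource but are not equal as strings,
# which is presumably why a tool matching URLs literally can miss one form.
print(escaped == path)            # False
print(unquote(escaped) == path)   # True
```

So a tool that compares the submitted URL character-for-character against what it can fetch would treat the raw and escaped forms as two different URLs.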
Thanks for a nice summary there...
Not sure if I understand the question.
Your use of it seems to be a very clever way to delete hijacker pages, but different from ours which was more conventional and based on g1smd's experiences (thanks g1smd!).
Claus' explanation jibes with some of our experiences - basically, you have two independent things here: spidering, which adds pages in somewhat unpredictable ways, and robots exclusion, which very quickly deletes Google-indexed pages according to the robots.txt exclusion protocol.
>> so I use the removal tool to remove robots.txt because google has it cached and this will force googlebot to refetch it? <<
No, not remove the robots.txt file! Simply spider the robots.txt file and remove from the index all the folders and pages mentioned in that file.
>> Or are you talking about submitting robots.txt for inclusion into google? <<
>> I don't understand why the removal tool causes a page to be removed for six months. The removal tool will not work at all if it finds the page. Following that logic, shouldn't the page be re-indexed as soon as Googlebot comes across it again? <<
Think carefully about this. Every URL that Google has ever seen in a link is recorded in a database. Some of those URLs show in the public index. Many do not. Pages that are spam, pages that no longer exist, malformed links, and stuff that has "noindex" assigned to it are still recorded as URLs along with their status. It has to be so, otherwise every time a bot came across your link again it would simply re-add it to the public index. So, such URLs are in a "secret store" and should not be visible in the public index. In order that every page of the web isn't re-checked every 30 seconds, the data has an expiry date on it, for re-checking. If you remove pages using the URL console, then they are flagged to not be included in public results for 6 months. After that time the robots.txt file is looked at again. If the URL is still in robots.txt then I assume it stays out of the public index; otherwise the URL itself is spidered for possible reinclusion.
By mistake I removed my site through the removal tool - I left a / with no text after it - but the next day, when I saw that my whole site was gone from google, I changed my robots.txt again.
Here the last few days I see some of my site again in the SERPs, still only 5-10% like before, so now I'm not sure if it is just google's old SERPs or my site really is back again. Today googlebot came by 14 times; I don't think it would do that if the site was removed completely with the removal tool.
|Today googlebot came by 14 times; I don't think it would do that if the site was removed completely with the removal tool. |
it sounds like your site is getting crawled again.
you could check your log files and see where it goes.
Maybe you are right, because if it had the 6-month removal-tool flag, googlebot would only have visited the main page and robots.txt, I think.
I have never checked where the robot is going - I'm not even sure how to do that; I use Urchin.
Dear Mr. GG -
At your and Google support's suggestion we have used 301 redirects to fix 302 and canonical problems.
However, our pages with the 301s still show up even after the new ones have been spidered - ie the index now shows even more unintentional duplicate content than before (20k+ pages across 3 geographic sections of our site).
Any suggestions how to avoid duplicate penalties here?
We don't want to kill the OLD 301 redirected pages because they have links and PR we want to pass to the new pages.
joe - I recently requested, through the support form, that the Google team email me so that I may fully report concerns similar to those you have mentioned about 301/302 redirects. I tried to raise the concerns that other webmasters have been reporting and that we have been experiencing too, showing them examples from our own site. I don't know if they will reply with any explanation, but the things I have been seeing have been kind of strange - such as external redirects being indexed under the redirecting URL but with the content that contains the URL. That by itself could create 2000+ pages of duplicate content. There are more examples, but I won't go into them. I believe that both 302 and 301 redirects were affected by the "fix" that they attempted. Now, it may just be that it is fixed and several months of crawling are needed to clean the index out. Or the fix may need further tweaking, or a complete overhaul on their end. We just don't know.
>> We don't want to kill the OLD 301 redirected pages because they have links and PR we want to pass to the new pages. <<
If your server is correctly set up then if anything attempts to access yourdomain.com it will be redirected to www.yourdomain.com with a HTTP status of "301 Moved". The content at the old location will not actually be accessed at all.
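The setup described here could be sketched in .htaccess roughly like this, assuming Apache with mod_rewrite enabled (the domain name is a placeholder):

```apache
RewriteEngine On
# Any request for the bare domain gets a single 301 to the www host.
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
```

With something like this in place, content at the old hostname is never actually served - the client (or bot) just receives the "301 Moved" status and the new location.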
If Google is crawling only the new URLs then it will not have seen the redirect. You need to make a list of all the URLs in the index that should not be there and make a page of links pointing to those URLs. Put that big list on another site. Google will find the list, crawl the URLs, and see the redirects. Allow at least a few weeks for the old URLs to drop out of the index after that.
joe - "However, our pages with the 301s still show up even after the new ones have been spidered - ie the index now shows even more unintentional duplicate content than before (20k+ pages across 3 geographic sections of our site)."
Take a look at some of the old 301s caches if it shows. Check the date.
From what we have seen, google has spidered 301's and, instead of just showing the URL, has been showing full titles, descriptions, and cached content of the new pages they redirect to. This cached content is of our newest design, which the old 301'd URL has never served at all. Under a 301 this shouldn't happen: it is saying page A is now gone and B is the new location. Get rid of A, show B.
We have seen a bunch of non-www's fully listed (cached pages) which have always been under a 301 redirect. They have ALWAYS returned a 301.
Normally(1), a 301 redirect will take at least one full month to propagate to the SERP's. If they have changed the way they handle 301's, I do not believe that this time frame is now shorter than before.
So, a 301 takes time. This, of course, does not explain wrong listings for URL's that have always been 301's.
1) That is, at least until a few months ago. I haven't done any 301's in a few months.
These 301 redirects have been up for 2 1/2 - some even 3 - years. I believe that the recent updates caused some sort of problem with them both. This does not mean it is not fixed; it may mean a simple crawl can sort this thing out. But if there are penalties from this, then getting the penalty lifted and even a complete crawl would be a mess on its own.
|I have never checked where the robot is going, Im not even sure how to do that, I use Urchin. |
What you need to do is see if you can download the raw log files from the server.
Pick a date that you know googlebot showed up and download the logfile for that day.
Unzip it and open it in wordpad.
Take a look here for info about how to read it
you can just do a quick scan and see what files googlebot is requesting.
If it is continually requesting the same files, you have a problem; but if it is requesting a lot of different files, then you can break open the champagne.
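The quick scan described above can also be done in a few lines of Python instead of by eye - the log lines here are made-up samples in the common Apache format:

```python
# Sketch: pull out just the request paths that Googlebot asked for.
# These log lines are invented samples; read yours from the raw logfile instead.
sample_log = """\
66.249.66.1 - - [23/Apr/2005:06:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
66.249.66.1 - - [23/Apr/2005:06:01:05 +0000] "GET /widgets/blue.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
10.0.0.5 - - [23/Apr/2005:06:02:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"
"""

googlebot_paths = [
    line.split('"')[1].split()[1]   # the request line is the first quoted field
    for line in sample_log.splitlines()
    if "Googlebot" in line
]
print(googlebot_paths)  # ['/robots.txt', '/widgets/blue.html']
```

Lots of different paths in the output is the good sign; the same path over and over is the warning sign.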
When my site was 1 month old (about 7 months ago)
I consolidated a bunch of pages and deleted some URL's.
I was getting sporadic 404's (6-8 per month), so I added an .htaccess which 301'ed these deprecated URL's to the closest match (the page each was consolidated into). This got rid of the 404's, and I figured these requests would just peter out since the pages no longer exist.
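Those .htaccess lines were presumably something like this (Apache's mod_alias syntax; the page names are invented):

```apache
# 301 each deprecated URL to the page it was consolidated into.
Redirect 301 /old-widgets.html /widgets.html
Redirect 301 /old-gadgets.html /widgets.html
```

One line per retired URL, each pointing at its closest surviving match.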
With the latest update these pages with the original cache were showing up in site:mysite.
I just commented out the .htaccess lines regarding these pages and nuked them with the removal tool.
Waiting to see what happens with fingers crossed.
Upon reflecting I think if I could do it over I would put a disallow in robots.txt for these URL's and submit the robots.txt instead.
Thanks to all for the helpful insights. It appears that our 301's showing up along with the redirected new pages is probably because we have a new URL as well, so our new URL was spidered but the redirects at the old URL are still unvisited by the spider.
I'm still worried this could bother Google since the index reflects a lot of dupes.
g1 sounds good except that we've got so many redirects (tens of thousands at server level) that I'd be worried the spider would see that ocean of new links as spammy.
I think we'll sit tight and hope the 301s shake out in the coming update.
Well, my nukes of those 301's were denied because I had re-installed the .htaccess 301 redirects for them.
So this time I inserted a disallow in robots.txt for these non-existent files and submitted robots.txt - we'll see what happens. I left the .htaccess 301's in place this time, since robots.txt will disallow googlebot from requesting them in the first place.
crossed fingers again.
It seems to me that apart from trying to remove a URL from another site, the robots.txt method is the best way to go with this removal tool.
Whatever is disallowed in robots.txt will be removed from the index - even if it does not exist.
It even found a few images related to these files that no longer exist and included them in the removal request.
The 302 thing happened to me over a year ago... I tried posting about it here and it didn't make it past the editors, so I never posted in here again. I think there are too many forums in here about how to rank and not enough about ethics, or how to make a buck without screwing someone over.
After filing all the DMCA stuff and writing all the long letters explaining exactly what these SEOs are knowingly doing, the only thing I've seen is that site bounce around in nowhere land, without titles and descriptions, or buried under pages of crap. So a friend told me to come and look at this thread, and I filed my little reinclusion request, and this morning woke up to a new listing for my site... one that has my title and description but, again, isn't associated with my URL.
I'm just so burnt on thieves, I could spit. You can't even sue them because they aren't actually breaking any copyright laws... unfair business practices, maybe, but not worth pursuing in court. As far as I can tell, Google can't seem to figure out how to properly index a 302, so I would suggest they simply stop trying to. If someone moves their site, that's their problem. If googlebot is so smart it should find the new site on its own and index it. Period. But if they can't do it without completely screwing up a site that has never even moved servers or IPs, I would suggest they just don't try.
I started redirecting spurious www URLs to the default non-www versions at the end of March. I then went out of town, but since I started writing down the numbers on April 16, the number of spurious www versions has varied but has been declining slowly since April 21:
The number of my non-www pages in the index, which was inflated by about 300%, has been going down except for a small blip on April 22:
It would appear that--for my site, at least--Google is crunching its data (albeit slowly) and removing obsolete or duplicate URLs from its index in response to the redirect in my .htaccess file.
|and this morning woke up to a new listing for my site... one that has my title and description but, again, isn't associated with my URL. |
Is this the only thing you get when you search site:yoursite?
Is the cache a picture of your page but with the wrong URL?
Is it an old outdated cache of your page?
If so this is classic 302 hijack and can be fixed by you.
>> As far as I can tell, Google can't seem to figure out how to properly
>> index a 302 so I would suggest, they simply stop trying to.
Exactly. Why bother, if it does nothing but decrease quality and frustrate webmasters anyway? Personally I've suggested many things - "treat them as a straight link", "treat cross-domain 302's differently", etc. - but totally ignoring them so that they don't even count as a vote would also be fine with me.
I will welcome most things except thinking that they are pages, as that is about the last thing they are. Imho, Google should not index stuff that is not pages.
I have had my 301 redirect in place for one site for a month or so, and the crawling seems to be better than it was recently.
Traffic does not seem to be improving.
Looks like a slow process.
Any luck with your site traffic-wise?
|Any luck with your site traffic-wise? |
My rankings are back to normal or near-normal for most of the keyphrases that I track, but Google referrals are still way down, so it would appear that I'm still doing poorly for many "inside pages" that used to generate little traffic individually but fairly high traffic in the aggregate.
To put it another way, my Google referrals dropped about 75% on March 23. Now they're down about 70%. And my Yahoo-to-Google ratio is up to maybe 2.4:1 after being at a low of 2:1 a few weeks ago. That isn't a dramatic improvement, but at least things seem to be moving in a positive direction.
A little more good news while we are at it. We are finally getting a good googlebot crawl and with that a bunch of pages have returned to the index. Let's hope they stick.
EFV - again similar.
My homepage is positioned well, and when I do searches I seem to be positioned OK for the words I look my site up on.
It definitely is the 100s of keyword combos that I could never think of that my site seems to be struggling on.