| 2:40 am on Jun 11, 2012 (gmt 0)|
It's one of those mistakes that is much too easy to make (especially for the non-SEO) but quite disastrous in its effect. One client of mine, truly a household name, made this error, and even for them it took about two weeks to get back to near-normal Google traffic.
| 3:11 am on Jun 11, 2012 (gmt 0)|
Here's what I've done in the past. I'm not sure whether each individual step works on its own, but the combination got pretty much everything recrawled, and rankings and traffic restored, within a few days (about 30K pages).
- Test robots.txt again from the Webmaster Central tool. This refreshes what Google knows as your latest robots.txt.
- Submit the sitemaps again into Google and other search engines.
- Increase the crawl speed.
Like I said, these three were done and worked for me in Feb. Not sure if one individual step is enough.
And needless to say, this all depends on how well your site usually gets crawled.
One more thing... our individual pages do not respond to If-Modified-Since requests, so I'm not sure whether that is also a factor in getting recrawled / ranked.
| 3:31 am on Jun 11, 2012 (gmt 0)|
If you have been doing this long enough, it can happen to anyone. Your rankings will be back to normal in a couple of weeks. Try not to be too hard on the guy ;)
| 4:37 am on Jun 11, 2012 (gmt 0)|
In addition to what shri wrote, I always put a sitemap line in the robots.txt file: simply add a Sitemap: line pointing to your XML sitemap as the first line of the file.
refer to - [webmasterworld.com...]
| 10:19 am on Jun 11, 2012 (gmt 0)|
@Zivush, how does that sitemap line help override the Disallow rule? Is it just because adding it makes you manually inspect the robots.txt file so you catch the problem quickly?
| 10:31 am on Jun 11, 2012 (gmt 0)|
Make sure that when you fix this you don't simply delete the robots.txt file so that it returns 404 Not Found. If you do that, Googlebot often just assumes it should honor the last robots.txt it found, and your site will still not be crawled.
Instead, remove the "Disallow: /" from it, leaving a robots.txt file that explicitly allows crawling:
User-agent: *
| 10:44 am on Jun 11, 2012 (gmt 0)|
|Instead remove the "Disallow /" from it leaving a robots.txt file that explicitly allows crawling: |
The correct syntax is:
User-agent: *
Disallow:
with at least one blank line after the Disallow: directive.
Fix the robots.txt file. Wait 48 hours. Go to WMT and then use the "Fetch as Googlebot" function to retrieve your root page (www.example.com/) and then click on the "submit page and all linked pages" option.
I can't count the number of times I have accidentally uploaded the wrong robots.txt file to a site. However, the error has almost always been corrected within a few minutes. Even so, there have been a couple of occasions where Google had already grabbed the file seconds before the corrections were applied. In those cases it took 24 hours for Google to revisit and get the right version of the file. It's a shame there isn't a WMT button that says "I've messed up my robots.txt file; please discard the last version and grab the corrected one as soon as possible". If an incorrect file is corrected within 24 hours there appears to be no damage done.
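One way to guard against uploading the wrong file is to check it locally before it goes live. The sketch below is only an illustration using Python's standard urllib.robotparser module. The helper name blocked_paths, the sample paths, and the choice of "Googlebot" as the agent are assumptions made for the example, and Python's parser will not necessarily match Googlebot's interpretation of every rule.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_text, paths, agent="Googlebot"):
    """Return the subset of `paths` that `robots_text` forbids for `agent`.

    Hypothetical helper for a pre-upload check; not part of any Google tool.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# A file that blocks the whole site, as in the mistake discussed here.
wrong_file = "User-agent: *\nDisallow: /\n"
# The allow-everything file described in this thread.
fixed_file = "User-agent: *\nDisallow:\n\n"

# Placeholder paths; substitute pages that must stay crawlable on your site.
important = ["/", "/products/widget.html"]

print(blocked_paths(wrong_file, important))  # every sample path is blocked
print(blocked_paths(fixed_file, important))  # nothing is blocked
```

If the returned list is not empty for the file you are about to upload, that is a reason to stop and look at it before it reaches the server.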
| 12:32 pm on Jun 11, 2012 (gmt 0)|
I've done this. I bet most of us have.
| 3:52 am on Jun 13, 2012 (gmt 0)|
deadsea, I did exactly what you said I should not do.
What is the robots.txt file that should be there?
| 4:41 am on Jun 13, 2012 (gmt 0)|
Looks like a misunderstanding. deadsea said that you SHOULD have a robots.txt file that says:
User-agent: *
but he forgot a line. g1smd then gave the correct syntax in his follow-up post. That syntax allows everything to be crawled. Then, if you have other needs, you can develop from there.
| 5:23 am on Jun 13, 2012 (gmt 0)|
Google remembers the URLs it has already indexed and has a lot of data about them, so when you fix the robots.txt file Google will restore rankings and indexing more quickly than it would with a fresh crawl.
I've made a similar mistake (a robots.txt file from the wrong site was uploaded and, of course, it blocked entire sections of the receiving site). I caught the mistake within one day and was fully restored within eight days.
| 5:33 am on Jun 13, 2012 (gmt 0)|
Tedster, what is the line? It should be:
Is that correct?
| 5:41 am on Jun 13, 2012 (gmt 0)|
whatson - Note that g1smd has it correct, including the extra blank line after the Disallow directive.
I'd follow his instructions precisely.
| 5:44 am on Jun 13, 2012 (gmt 0)|
The correct syntax is:
User-agent: *
Disallow:
with at least one blank line after the Disallow: directive.
| 7:02 am on Jun 13, 2012 (gmt 0)|
Also, after the additional blank line after the Disallow: directive, add in the URL for the XML sitemap file for your website.
Test this with Webmaster Tools once it's live and check that it is working correctly. Also resubmit the sitemap file and the site itself, just to be sure.
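As a rough local check of that layout (allow-everything record, blank line, then the sitemap line), here is a small sketch using Python's standard urllib.robotparser. The sitemap URL is a placeholder, not one given in this thread, and site_maps() needs Python 3.8 or later. It only shows what Python's parser reads from the file, so the Webmaster Tools test is still worth running.

```python
from urllib.robotparser import RobotFileParser

# Allow-everything record, a blank line, then the sitemap location.
# The sitemap URL is a placeholder, not one given in this thread.
robots_text = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_text.splitlines())

print(parser.site_maps())                  # sitemap URLs found in the file
print(parser.can_fetch("Googlebot", "/"))  # whether the root page may be crawled
```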
| 7:03 am on Jun 13, 2012 (gmt 0)|
Don't use "Allow" except in sections meant for specific, named robots that have explicitly said they know this word. The universally understood* word is "Disallow".
There are some awfully stupid robots out there ;) so don't get fancy.
* Understood. Not necessarily obeyed.
| 9:49 am on Jun 13, 2012 (gmt 0)|
Now I am confused. You realize I want Google to crawl my site, right? Why would I have Disallow in there?
| 9:54 am on Jun 13, 2012 (gmt 0)|
Disallow: /
means disallow anything that begins with a slash (i.e. disallow everything).
Disallow:
means disallow nothing (i.e. allow everything).
As this is the "Robots Exclusion Protocol" everything hinges on this being a disallow list.
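To see the prefix matching in action, here is a minimal sketch using Python's standard urllib.robotparser. The helper name allowed and the /private/ and /public/ paths are made up for illustration, and this shows how Python's parser reads the rules, which may differ in edge cases from any particular crawler.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule_line, path):
    """Parse a one-record robots.txt built from `rule_line` and test `path`.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule_line, ""])
    return parser.can_fetch("*", path)


print(allowed("Disallow: /", "/index.html"))         # blocked: path starts with "/"
print(allowed("Disallow:", "/index.html"))           # allowed: nothing is disallowed
print(allowed("Disallow: /private/", "/private/a"))  # blocked: prefix matches
print(allowed("Disallow: /private/", "/public/a"))   # allowed: prefix does not match
```

The empty Disallow: value is what turns the record into "keep out of nothing", which is why the allow-everything file still contains the word Disallow.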
| 11:17 am on Jun 13, 2012 (gmt 0)|
If you think about the robots.txt protocol from the point of view of programming a bot, the "Disallow" standard makes sense. You wouldn't usually want a potentially monster list of every URL that you were allowed to visit, just a few "keep out" notices.
Even though both Bing and Google say they now support a few extensions to the standard syntax, the actual current standard is explained here: [robotstxt.org...]
...and here is Google's Help page: [support.google.com...] If you start blocking some URLs or URL patterns, the details Google provides can become important for getting the exact results that you intended.