On 30 May I disallowed a directory that contained only projects created by my site's users. I also set all users' folders to return a 404 error when accessed.
The next day GWT reported a drop in tracked pages from about 800 to about 50.
After 10 days I started noticing a moderate traffic/ranking drop, and it has been getting worse since then.
What was responsible for the drop? The robots.txt policy or the 404 responses?
9:02 pm on Jun 7, 2013 (gmt 0)
Addressing only the mechanical aspects of the question:
Hunch: You've got an inappropriate use of belt-plus-suspenders. Or belt-plus-braces, depending on dialectal preference. The googlebot is going haywire because it isn't allowed into a directory, so it doesn't know that the subdirectories inside that directory don't exist any more.
Also, I set all user's folders to return a 404 error when accessed.
Normally you don't have to set a 404 explicitly when you're dealing with physical files. Did you really delete the directories? If so, a 410 will make the googlebot go away faster-- but only if you let it ask for the pages. You can remove the whole directory in gwt at the same time.
If this area was formerly visited by humans, make sure you've got a nice custom 410 page. Or at least use your existing 404 page. The Apache default 410 page is scary.
If the files are still there but you've changed them to restricted access, that should yield a 401 without any more work from your end.
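For the deleted-directories case, the 410 can be set at the server level. A minimal sketch, assuming Apache with mod_alias and a hypothetical directory name of /projects/ (substitute the real path):

```apache
# Hypothetical path -- adjust /projects/ to the real user-content directory.
# Requires mod_alias; "gone" answers every request under it with 410.
RedirectMatch gone ^/projects/

# Caveat: this only helps if the directory is NOT disallowed in
# robots.txt -- the googlebot has to be allowed to request the URLs
# and receive the 410 before it will drop them from the index.
```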
5:20 am on Jun 8, 2013 (gmt 0)
My bad, I didn't explain it clearly. What I really did was:
I disallowed crawling using a robots.txt policy AND set .htaccess to return a 403 error when the directories were accessed, using (Options -Indexes).
PS: The directories still exist, but they were never visited by humans.
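For reference, the setup described above amounts to something like the following (the /projects/ path is a hypothetical placeholder):

```apache
# robots.txt at the site root -- blocks crawling of the directory:
#   User-agent: *
#   Disallow: /projects/

# .htaccess in /projects/ -- with auto-indexing switched off, a request
# for the bare directory URL (no index file) gets a 403 from Apache:
Options -Indexes
```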
9:57 am on Jun 8, 2013 (gmt 0)
Gotcha. Not the whole content of the directories, just their indexes. And when your fingers typed 404 in the first post, your brain really meant to say 403.
Did all those subdirectories formerly have automatic index files, so any passing robot could see what's there? By switching off the auto-indexing, you've prevented google and other robots from discovering any new pages in the directories-- unless they learn about them by other means-- but you haven't stopped them from requesting the pages they already know about.
I kinda think it would be safer to slap a global no-index label on the directory. If it isn't practical to add meta tags to all the existing files, Option B is to make a supplementary little htaccess file and put it in your target directory. You may already have one there if that's how you turned off auto-indexing for the directory. Add a line that says
Header set X-Robots-Tag "noindex"
and it will cover everything in the directory.
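Putting that together, the supplementary .htaccess in the target directory might look like this (the `<IfModule>` guard is only a precaution in case mod_headers isn't loaded):

```apache
# .htaccess in the directory you want de-indexed
Options -Indexes

# Requires mod_headers; attaches "X-Robots-Tag: noindex" to every
# response served from this directory and its subdirectories.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex"
</IfModule>
```

As with the 410 approach, the header can only take effect if crawlers are allowed to fetch the pages, so it works against a robots.txt Disallow rather than with it.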
:: looking vaguely around for someone who knows the answer to the SEO aspect of the question, on which subject I am clueless ::
1:07 pm on Jun 8, 2013 (gmt 0)
Blocking 750 of 800 indexed pages could drastically affect your site's traffic and ranking. That's 94% of your pages.
If the web pages in those directories you've blocked have a significant number of inbound links, the link juice coming into the blocked pages can no longer be passed around to the other pages on your site. Blocked pages cannot accumulate PageRank/link juice, and since crawlers can't crawl those pages to find their outbound links, the internal pages linked to from the blocked pages are no longer getting passed PageRank/link juice either.
Also, if a lot of those blocked pages were actually ranking for various keyword phrases, by blocking them from being crawled you will have killed your rankings for those phrases (some less relevant, lower-ranking page may now be ranking, but much lower) and thus the traffic will also have diminished.
Just a guess...
4:12 am on Jun 10, 2013 (gmt 0)
Thank you lucy24 and ZydoSEO for the responses.
What if I 301 redirected all of these directories to the homepage, so as not to lose the link juice?
6:24 am on Jun 10, 2013 (gmt 0)
What if I 301 redirected all of these directories to the homepage
There may exist situations where a mass redirect of all requests to the home page is the best solution. I can't remember ever personally hearing of one.
The people at the other end of those links aren't linking to your home page, or to your site generically. They are-- or were --linking to a specific page.
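To illustrate the difference (all paths hypothetical): a blanket rule collapses every deep link into one target, while per-page rules preserve what each link was actually pointing at.

```apache
# Blanket redirect -- every old URL lands on the homepage,
# discarding the specific page each inbound link pointed to:
RedirectMatch 301 ^/projects/ /

# Per-page redirect -- each old URL goes to its closest
# surviving equivalent (hypothetical paths):
Redirect 301 /projects/alice/widget.html /gallery/widget.html
```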
1:55 pm on Jun 10, 2013 (gmt 0)
were never visited by humans
1. Disallow: (done)
2. Remove URLs from index and cache in WMT (*linked from)
3. Remove /dir/ from index
Options -Indexes is root config for my domains
No humans ever visited, so nothing should change visitor-wise.
* Be sure to remove any internal links. For any inbound links to pages I use:
This code I use initially to get broken-link issues cleared up. Have to wait to see what effect it has down the road. In my case peekyou and others have broken links to the site; could send them to the site map instead, or...
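The code referred to above isn't shown in the thread; a plausible sketch of such a rule, assuming mod_rewrite and a hypothetical /users/ path, would serve 410 for the removed pages rather than a soft 404:

```apache
# Hypothetical reconstruction -- the poster's actual rule is not shown.
# Requires mod_rewrite; the [G] flag answers requests for the removed
# user directories with 410 Gone instead of 404.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteRule ^users/ - [G,L]
</IfModule>
```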