
Google SEO News and Discussion Forum

    
Massive drop in Google rankings due to spidering issues
yosmc

10+ Year Member



 
Msg#: 4557758 posted 12:13 pm on Mar 23, 2013 (gmt 0)

I run a special-interest widget directory that is older than Google itself. It is useful to human visitors (in fact it's even been featured in offline media), it's been ranking well since the old Google days, and has never suffered any major drops from any of Google's updates or algo changes.

Over five years ago, I changed my site's back end. The change was invisible to visitors, but internally I installed a link management system to help me manage and update my directory more efficiently. The script defaulted to a /links directory, so I tweaked the code to output to my traditional directory structure (along the lines of blue_widgets.php, red_widgets.php, etc.).

Middle of last year, Google suddenly started sending me warning mails via my Webmaster Tools account, telling me about "possible outages" and that "Googlebot can't access the site". I looked into those, and noticed that Google had found (and decided to spider) the default /links directory, even though it isn't actually linked anywhere on my site. Thinking that Google had no business crawling that directory in the first place, I simply ignored those messages.

Obviously, I shouldn't have: Over the course of 5 months and accompanied by a total of 30 warning messages, Google eventually started slamming my site, moving it down from page one to page 80 and beyond.

When I finally figured out the devastating scope of what was going on, I blocked the "links" directory via robots.txt and filed a removal request for all the "pages" that Google took from the /links directory and which it shouldn't have crawled in the first place.

As a result, the number of my indexed pages dropped from "several thousand" to "several dozen". This is really how it should be - like I said, this is a speciality directory, and there are only so many links in it. And: this is really how it always was, at least before Googlebot started snooping around in a directory it was never invited to.

"Index Status" in my Google Webmaster account, however, now looks like I have essentially taken my site down - if you don't know the full story, it seems like I removed 98% of my site, and put it into hibernation mode. And that seems to be exactly what Googles algos are concluding: even though my site is perfectly healthy, Google is still treating it like a half-dead zombie.

Can anyone recommend a path out of this nightmare, or should I just go and shoot myself? :p

 

goodroi

WebmasterWorld Administrator; Top Contributor of All Time; 10+ Year Member; Top Contributor of the Month



 
Msg#: 4557758 posted 1:07 pm on Mar 24, 2013 (gmt 0)

...I simply ignored those messages. Obviously, I shouldn't have: Over the course of 5 months and accompanied by a total of 30 warning messages, Google eventually started slamming my site...


I don't mean to sound harsh, but it sounds like you have already shot yourself. I think you can take actions to recover. When you told Google to remove thousands of pages, did you check to see if any of those pages had inbound links? You may want to review those pages and get some of them reindexed.

phranque

WebmasterWorld Administrator; Top Contributor of All Time; 10+ Year Member; Top Contributor of the Month



 
Msg#: 4557758 posted 1:39 pm on Mar 24, 2013 (gmt 0)

instead of excluding those urls from being crawled you might want to consider redirecting non-canonical requests to the canonical urls using a 301 status code.
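
in .htaccess that might look something like this (a minimal sketch only - the url mapping below is a hypothetical illustration, since the real pattern depends on how your script names its pages):

RewriteEngine on
# hypothetical example: a duplicate page under /links answers with a 301 to its canonical url
RewriteRule ^links/blue_widgets\.php$ /blue_widgets.php [R=301,L]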

TheOptimizationIdiot



 
Msg#: 4557758 posted 3:27 pm on Mar 24, 2013 (gmt 0)

I'd let them know it's gone.

RewriteEngine on
# return a 410 Gone for any URL path beginning with "links"
RewriteRule ^links - [G]

If your script needs access to the /links directory internally, it gets a bit more complicated, but the above will serve a 410 Gone - "these pages have been intentionally removed" - to visitors, including search engines. Since the directory was never intended to be visited, I'd just let Google know it's gone now.

g1smd

WebmasterWorld Senior Member; Top Contributor of All Time; 10+ Year Member



 
Msg#: 4557758 posted 4:31 pm on Mar 24, 2013 (gmt 0)

I don't know how this would play out today, but several years ago a 20-page site using a popular wiki package had Google spidering and indexing thousands of old page revisions and thousands of "report" pages, due to the lack of a robots.txt file or other processes to keep them out.

Some swift redirecting ensued, some of it only for search engine requests (not often I do that); it was later replaced by general robots.txt exclusions and some other htaccess magic. After a wait of some 6 months or more, things were finally somewhat back to normal.

Traffic was rarely an issue as several large sites regularly brought most of the referrals. Google was a minor source of traffic for the most part.

phranque

WebmasterWorld Administrator; Top Contributor of All Time; 10+ Year Member; Top Contributor of the Month



 
Msg#: 4557758 posted 7:09 pm on Mar 24, 2013 (gmt 0)

if google has discovered links from relevant and authoritative content to your /links/ urls, I would avoid telling google those are Gone or excluding googlebot from crawling them; instead, 301 redirect them to the canonical urls.

TheOptimizationIdiot



 
Msg#: 4557758 posted 1:56 am on Mar 25, 2013 (gmt 0)

I personally wouldn't stress over a couple of links to a few pages. Links are overrated from what I've seen lately anyway. If the only reason to redirect is "link weight accumulation" and not actual traffic, I'd personally just 410 everything, because it's the right status code for the situation; a 301 that doesn't help actual traffic (visitors) reach the page is simply an attempt to manipulate.

People can say what they want, but if a link to a page does not generate actual traffic and the page needs to be removed for some reason, the only reason to redirect is to manipulate (increase) rankings. People can try to "sell it" any way they like, but that's all it is at the core, and personally I would not do it.

The only time I redirect any more is if it will help actual visitors; other than that, if I need to remove a page for some reason, even if there's something similar, it's 410 Gone.

lucy24

WebmasterWorld Senior Member; Top Contributor of All Time; Top Contributor of the Month



 
Msg#: 4557758 posted 2:52 am on Mar 25, 2013 (gmt 0)

RewriteEngine on
RewriteRule ^links - [G]

If your script needs access to the /links directory internally, it gets a bit more complicated

... but only to the tune of one RewriteCond or even just a [NS] flag. Not a problem.
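
For example, a sketch (same rule as above; [NS] is the "nosubreq" flag):

RewriteEngine on
# [NS] keeps the 410 from firing on Apache's own internal subrequests;
# alternatively, put a RewriteCond (e.g. on %{IS_SUBREQ} or %{REMOTE_ADDR}) in front of the plain [G] rule
RewriteRule ^links - [G,NS]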

phranque

WebmasterWorld Administrator; Top Contributor of All Time; 10+ Year Member; Top Contributor of the Month



 
Msg#: 4557758 posted 7:53 am on Mar 25, 2013 (gmt 0)

The only time I redirect any more is if it will help actual visitors; other than that, if I need to remove a page for some reason, even if there's something similar, it's 410 Gone.


how many actual visitors do you turn away with a 410 before changing that response to a 301?

my assumption is that a link from a relevant and authoritative page will eventually result in actual traffic.

TheOptimizationIdiot



 
Msg#: 4557758 posted 4:55 pm on Mar 25, 2013 (gmt 0)

Likely none, but I thought a custom 410 Gone page for any actual visitors who may somehow land there in the future was a no-brainer. Serving a 410 Gone doesn't mean you have to serve a blank page or can't provide relevant resources should someone happen to land on a page that's been removed.
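
Something like this alongside the rewrite rule would do it (a sketch - /gone.html is a placeholder for a friendly "this page was removed" page that points people at the live directory, and it has to live outside the blocked /links path):

# serve a helpful page body along with the 410 status
ErrorDocument 410 /gone.html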

yosmc

10+ Year Member



 
Msg#: 4557758 posted 5:57 pm on Mar 25, 2013 (gmt 0)

Thank you for all the replies - genuinely appreciated.

Just to reiterate - my site ranked well for a decade with only a couple of dozen pages indexed. So I thought it wasn't really relevant whether some of those thousands of pages had inbound links, because they were not needed to begin with. Wrong? My conclusion now is that Google is still upset either because it saw all those spidering errors over the course of five months, or because (from its view) 98% of the site disappeared from one day to the next.

Concerning the various redirects and status codes - well, as it is now, Google isn't going to see those, because the Googlebot is shut out via robots.txt. So the general line of recommendation is to actually let the Googlebot in again, just to let it know that the stuff to spider is gone? To be honest I don't fully understand how deliberately placing a robots.txt and deliberately removing pages from Google's index doesn't convey the same message, namely that the removal was intentional.

(...oh and yes, I do need access to /links to actually manage my directory.)

Is the general consensus also that such a situation will eventually work itself out, or not unless I take the recommended actions?

TheOptimizationIdiot



 
Msg#: 4557758 posted 3:31 pm on Mar 26, 2013 (gmt 0)

So the general line of recommendation is to actually let the Googlebot in again, just to let it know that the stuff to spider is gone?

Yes

To be honest I don't fully understand how deliberately placing a robots.txt and deliberately removing pages from Google's index doesn't convey the same message, namely that the removal was intentional.

Because when you block their access they don't know if anything has been removed or not, because they can't access it. When you let them access the URL and find a "removed" notice, then they know.

It's like this: if you looked at my site one day, found a page, came back the next, and got a message saying "you don't have permission to visit this page", would you know whether I had removed the page or whether it was still there and I just didn't want you to see it any more? There would be no way for you to tell.

A robots.txt block is totally different from letting them spider the URLs and providing a status code that tells them the current status of the information associated with each URL (200 OK, 301 Moved Permanently, 302 Found / 307 Temporary Redirect, 403 Forbidden, 404 Not Found -- could be temporary or permanent, 410 Gone -- purposely removed permanently, etc.).

If they can't get to a URL (blocked by robots.txt), they can't know the current status of that URL and the associated information; letting them spider means you can tell them the status of each.
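
As a rough sketch of "telling them the status of each" in .htaccess (the second rule's filenames are purely hypothetical):

RewriteEngine on
# intentionally removed section: answer 410 Gone
RewriteRule ^links - [G]
# a page that genuinely moved: answer 301 to its canonical location
RewriteRule ^old_widgets\.php$ /blue_widgets.php [R=301,L]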

shadowdaddy



 
Msg#: 4557758 posted 12:55 pm on Apr 9, 2013 (gmt 0)

So I've put this very question to one of my agencies - is a robots.txt-level "disallow" sufficient as a method of removing links from the link profile? Their response has been very different from:

Because when you block their access they don't know if anything has been removed or not, because they can't access it.


Agency has suggested that if the content of a page is not cached thanks to a robots.txt disallow, the content essentially doesn't exist anymore and therefore the links don't either. But the more I read the less convinced I am by this argument.

TheOptimizationIdiot



 
Msg#: 4557758 posted 3:44 pm on Apr 9, 2013 (gmt 0)

...therefore the links don't either.

If the links don't exist, why does Google state in their support documentation that they may use the anchor text from those links to index a URL for a site when they can't index the page itself due to a robots.txt block? (If you search for "Google indexing pages blocked by robots.txt" - no quotes - in your favorite search engine, you'll likely find many cases where it's reported.)

Emphasis Added
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

[support.google.com...]

Sometimes they'll even index a URL when all they have is a reference to it and no data for it. In my experience, robots.txt is not the way to remove a page or content from the web. There's a status code [410] for purposely removed pages, and that's what I've used and would use again to remove a page.

Emphasis Added
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.

The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.

[w3.org...]

shadowdaddy



 
Msg#: 4557758 posted 4:33 pm on Apr 9, 2013 (gmt 0)

I've always read that as saying you could robots.txt block a URL on your own site but if there is an external link pointing to it the URL may still be indexed.

I may have been unclear, as I was talking about links from the blocked page, not to it - perhaps I should have said "therefore the links on that blocked page no longer exist either"

(also wary this may be diverting the thread a little, apologies if so...)

TheOptimizationIdiot



 
Msg#: 4557758 posted 4:43 pm on Apr 9, 2013 (gmt 0)

Ah, could be, but the only way to know for sure is to ask Google (and all the other search engines) exactly how they deal with it. So if you don't need the pages, or the links on/to those pages, why depend on an external system that may change at some point in the future to do what you think it should, when you can just tell them you intentionally removed the page(s) and that they should remove all related references?

Another way to look at it is:
robots.txt block = "Hey, don't go in here any more." (What Google and each individual engine does with the info they already have is up to them in this case.)

410 Gone = "We removed this, so don't talk about it, don't send people to it, don't use the information from it, don't index it. It's Permanently Removed; Gone."

Which is more accurate for your situation?
That's the one I'd use personally.

fathom

WebmasterWorld Senior Member; Top Contributor of All Time; 10+ Year Member



 
Msg#: 4557758 posted 10:45 pm on Apr 9, 2013 (gmt 0)

Just to reiterate - my site ranked well for a decade with only a couple of dozen pages indexed.


I would take a step back and re-think your assumptions, as correlation does not always imply causation.

The vast majority of websites that pre-date Google and have been PANDAized or PENGUINized make the same claims, but of course pre-dating Google is not likely part of Google's algorithm.

Additionally, something you did five years ago would likely have impacted you five years ago.

Your observations are interesting but you may have locked onto the wrong **** pile.

That said, unblocking Googlebot can be done in as little as 5 business days [support.google.com...]

incrediBILL

WebmasterWorld Administrator; Top Contributor of All Time; 5+ Year Member; Top Contributor of the Month



 
Msg#: 4557758 posted 9:27 am on Apr 10, 2013 (gmt 0)

Middle of last year, Google suddenly started sending me warning mails via my Webmaster Tools account, telling me about "possible outages" and that "Googlebot can't access the site". I looked into those, and noticed that Google had found...


The two incidents were probably unrelated, because many people were getting those messages last year due to hackers running a big botnet attack against various DNS servers around the globe, most likely Google's, which caused "outages": Google was unable to crawl sites because DNS wasn't being resolved properly. You may want to check that DNS graph, if they still have it. Mine showed DNS failures to my server that simply didn't exist, as my servers and DNS were up 100%; otherwise my multi-peer alarm monitoring service would also have complained about service outages, which didn't occur.

What people don't realize is that if someone on your circuit at a hosting company is under attack, then Google probably can't access your server until that DDoS is mitigated. I can give you a bunch of other possible reasons as well, which are also likely culprits, but a DDoS is the most common reason, IMO, for that message.

If you're merely returning 404, 410 or other results for missing pages, then they'll be filed under the proper HTTP result code for the crawler; if you aren't giving the right result codes, you may want to consider fixing the software to handle such requests properly.


I've had similar accidents and Google seems to resolve the issues pretty fast once you've gotten everything back to normal and eliminated all the problems.

Good luck with that.

TheBigK



 
Msg#: 4557758 posted 12:22 pm on Apr 11, 2013 (gmt 0)

Last year I got a ton of 'Page Not Found' errors, and I ended up spending a full three months fixing them. Google continued to tell me that those errors wouldn't affect my site's rankings in any way. In my case it was only the Googlebot discovering those errors (because they were created by a JavaScript-related issue), and the end users never noticed them.

Now, what I was experiencing was exactly the opposite of what Google had advised - Google said the errors wouldn't affect my site, yet I had graphs showing that the rise in error count was directly related to the crawl rate and traffic going down.

I'd always take what Google says with a pinch of salt.

roycerus

5+ Year Member



 
Msg#: 4557758 posted 6:12 pm on Apr 15, 2013 (gmt 0)

Something like this happened to us: we got millions of pages indexed in error. I was advised to remove the robots.txt block and redirect those requests [301 permanent] to a 410 Gone.

The pages are still getting removed after 4 months, and we are starting to see a slight upward trend. I think there has to be a PR update to reshuffle the PR juice so the pages on the domain get their valid share [for us].

yosmc

10+ Year Member



 
Msg#: 4557758 posted 3:48 pm on Apr 29, 2013 (gmt 0)

Just for reference - I'm afraid that none of the proposed remedies has brought any success so far.

abk717



 
Msg#: 4557758 posted 12:01 am on Aug 29, 2013 (gmt 0)

It's been a few months since you last wrote. Were you able to get the referenced directory back into the index?

yosmc

10+ Year Member



 
Msg#: 4557758 posted 11:11 am on Aug 29, 2013 (gmt 0)

A couple of weeks ago my site started to reappear in the SERPs, but since last week it's been slammed again.

At this point I believe it's due to some zoo animal anyway (Penguin or Piranha or whatever).

simonlondon



 
Msg#: 4557758 posted 1:40 pm on Aug 29, 2013 (gmt 0)

Agency has suggested that if the content of a page is not cached thanks to a robots.txt disallow, the content essentially doesn't exist anymore and therefore the links don't either.

This depends on whether the page/content is already indexed/cached when you put the disallow line in. Robots.txt is a crawl instruction tool, not an indexation instruction tool. If the page is already indexed, then simply adding that line will not remove it from the index, and the content still exists and the links do as well.
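
If the goal is de-indexing rather than just stopping the crawl, the instruction has to be something the crawler can actually fetch. One general-purpose option (not something discussed above, just a sketch assuming Apache 2.4 with mod_headers) is an X-Robots-Tag noindex header on the blocked section instead of the robots.txt disallow:

# let the /links urls be crawled, but tell engines not to index them
<If "%{REQUEST_URI} =~ m#^/links#">
    Header set X-Robots-Tag "noindex"
</If>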

aakk9999

WebmasterWorld Administrator; 5+ Year Member



 
Msg#: 4557758 posted 2:25 pm on Aug 29, 2013 (gmt 0)

If the page is already indexed, then simply adding that line will not remove it from the index, and the content still exists and the links do as well.


There are no rules for this. Sometimes this is how it happens, sometimes not. I would guess it depends on other external factors (perhaps on links pointing to the page, etc.).

Otherwise, disallowing the site in robots.txt would have no effect and the site would continue to rank, rather than being dropped from the index like a stone (a very recent experience).

Further, if the above were the standard behaviour, it would be heaven for spammers - just create a page, let it be indexed and ranked, then disallow it in robots.txt, put spammy content on it instead, and watch it rank for the old content.

I think this is closer to what happens: blocking a previously indexed page via robots.txt may or may not result in that page remaining in the index, and it may or may not rank equally well after being blocked (or it may drop like a stone).

simonlondon



 
Msg#: 4557758 posted 3:32 pm on Sep 2, 2013 (gmt 0)

There are no rules for this. Sometimes this is how it happens, sometimes not. I would guess it depends on other external factors (perhaps on links pointing to the page, etc.).

Otherwise, disallowing the site in robots.txt would have no effect and the site would continue to rank, rather than being dropped from the index like a stone (a very recent experience).

Further, if the above were the standard behaviour, it would be heaven for spammers - just create a page, let it be indexed and ranked, then disallow it in robots.txt, put spammy content on it instead, and watch it rank for the old content.

I think this is closer to what happens: blocking a previously indexed page via robots.txt may or may not result in that page remaining in the index, and it may or may not rank equally well after being blocked (or it may drop like a stone).


This is very interesting, and I have never actually thought about it from this angle. I suppose you are quite right; once a page is disallowed, Google has no idea what's on it, and essentially the page could even be cloaking.

There must be a way for Google to include some kind of metric, such as the length of time from the cache/last-crawl date to the current date: as that length increases, the page gradually becomes less able to rank. I don't know; I haven't done any tests on this.

But thanks for the comment, because with this theory in mind I'm surely not going to say the same thing again.

abk717



 
Msg#: 4557758 posted 3:57 am on Sep 9, 2013 (gmt 0)

Did you receive any messages from GWT indicating any issues G found or actions they wanted you to take?
