Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Catastrophic script mistake

         

humandesigner

3:44 am on Apr 27, 2010 (gmt 0)

10+ Year Member



Hello,

A mistake inadvertently crept into one of my CMS scripts, and it ended up generating the following piece of code on 70% of my pages :(

<link rel="canonical" href="http://www.example.com/">

... which told Google to index the home page instead of the others.

I noticed the mistake only when it was already too late, and now 70% of my pages have been deindexed.
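
For what it's worth, the failure mode described above usually looks something like this in template code. This is a minimal sketch with made-up names, not humandesigner's actual CMS:

```python
# Hypothetical sketch of the bug described above (not the actual CMS code).
SITE_ROOT = "http://www.example.com"

def canonical_tag_buggy(page_path: str) -> str:
    # The mistake: the template hard-codes the site root, so every
    # page declares the homepage as its canonical URL.
    return f'<link rel="canonical" href="{SITE_ROOT}/">'

def canonical_tag_fixed(page_path: str) -> str:
    # The fix: each page emits a self-referencing canonical tag
    # built from its own path.
    return f'<link rel="canonical" href="{SITE_ROOT}{page_path}">'

print(canonical_tag_buggy("/page-A/"))  # homepage canonical on every page
print(canonical_tag_fixed("/page-A/"))  # self-referencing canonical
```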

Of course, I've fixed the script and resubmitted my sitemap in GWT, but I would like to know:

1. if there are other nasty repercussions that I should expect
2. if there is anything I can do to speed up the recovery
3. how long I will have to wait for recovery

Also, there is one weird thing I've noticed:

If I do
site:www.example.com
on Google, a lot of outdated pages get listed in the results, although I properly set up 301 redirects in my .htaccess about three weeks ago.
Why is that?

Thanks for any useful piece of information.

[edited by: tedster at 3:51 am (utc) on Apr 27, 2010]
[edit reason] switch to example.com - it cannot be owned [/edit]

tedster

4:18 am on Apr 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Last things first - even at the best of times, it can take Google much longer than three weeks to process 301s and drop the old URLs from their index. And this is not the best of times at Google, as they seem to be in the late stages of rolling out their new Caffeine infrastructure. Lots of processes have been slower than we are used to. As the man on TV said, "Patience, grasshopper."

And so that bad news now spills over into your canonical URL error. I've heard that canonical tag errors can take even longer for Google to sort out than 301s do. I haven't been personally involved with any, so I can only report what others have said. Given the current sluggishness at Google, I'd be prepared for a long haul to recovery.

My advice would be to leave it alone - once you've verified that everything is correct, don't do anything further that would present Google with a moving target in your URLs.

I'm a bit troubled by your report, though. My understanding was that Google takes the canonical tag as only a recommendation - when the content of the suggested URL is not a very close match, they say they will just ignore it. But it sounds like your experience was quite a bit different.

TheMadScientist

6:19 am on Apr 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, tedster's last paragraph...
So much for it being a suggestion rather than a directive.

Be glad you didn't literally point it at example.com, since it seems it's being taken as more than a suggestion, if that's the only thing you did to get the content de-indexed.

Hopefully they'll get you back in quickly.

It's when they do things like this that people just don't want to believe a word they say, because unless you have a completely duplicate site, I don't see how a 'suggestion tag' could get it 70% of the way deindexed...

suratmedia

6:38 am on Apr 27, 2010 (gmt 0)

10+ Year Member



It won't hurt too much. If Google finds non-duplicate copy, or a 404/302/301/410 trap, it applies one step of recursion.

@70% of my pages have been unindexed
====================================
You are not alone - every site has been losing 50%+ of its URLs from the site:example.com query over the last two weeks, and you might be blaming the canonical tag for that.

[edited by: tedster at 2:37 pm (utc) on Apr 27, 2010]
[edit reason] switch to example.com - it cannot be owned [/edit]

g1smd

8:10 am on Apr 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check what your WMT reports have to say. I'll believe those just a little bit more than the site: operator at this time.

Look especially for crawl rates to increase over the next week or two - a sign that they have accessed the updated versions of many of your pages.

tedster

2:40 pm on Apr 27, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You are not alone, every site is losing 50%+ URLs from site:example.com query


That's a very good point. The loss of indexed pages might not have been CAUSED by the script error. One of those "post hoc ergo propter hoc [skepdic.com]" logical traps.

humandesigner

9:36 pm on Apr 27, 2010 (gmt 0)

10+ Year Member



Thanks everyone for your replies.

I've just logged in to my WMT account and I'm close to having a heart attack. In the sitemap section, the indexed-pages column now shows ZERO pages! What the hell is going on?

But at the same time, some pages still show up if I do allinurl:www.mydomain.com/some/page/
The same goes for site:www.mydomain.com

Maybe what I see on the results page and what I see in WMT come from two different datacenters?

And I still don't understand why outdated pages continue to show in site:www.mydomain.com

Thanks again!

humandesigner

10:10 pm on Apr 27, 2010 (gmt 0)

10+ Year Member



Tedster, my URLs (mostly outdated ones) are displayed in site:www.example.com
It's in allinurl:www.example.com/some/page/ that they have disappeared. The ones that have disappeared are the ones that were affected by the canonical script error.
So, I don't think I fell into a logical trap.

Thanks again :)

devil_dog

10:45 pm on Apr 27, 2010 (gmt 0)

10+ Year Member



Almost all of my sitemaps in Webmaster Tools are showing me a zero. Maybe some sort of glitch?

Unrelated: I'm also seeing heavy crawl activity at the moment.

artek

12:07 am on Apr 28, 2010 (gmt 0)

10+ Year Member



humandesigner, how about this simple remedy: use canonical tags to correct the mistake, starting from the top:

home page
<link rel="canonical"href="http://www.example.com/">

second level pages
<link rel="canonical"href="http://www.example.com/page-A/">
<link rel="canonical"href="http://www.example.com/page-B/">
<link rel="canonical"href="http://www.example.com/page-C/">

third level pages
<link rel="canonical"href="http://www.example.com/page-A/page-1/">
<link rel="canonical"href="http://www.example.com/page-A/page-2/">
<link rel="canonical"href="http://www.example.com/page-A/page-3/">

... and so on, as deep down into the website structure as you want to correct.

It will guide search engines to the right pages, correct your past canonical mistake(s), and lay out a solid roadmap of your website structure for the future.

You can also go to your SE WMTs afterwards and use Fetch as Bot on the exact URLs of the fixed pages to speed up the correction process.

suratmedia

2:36 am on Apr 28, 2010 (gmt 0)

10+ Year Member



Don't worry - same here: Indexed URLs = 0, while site: shows 2,390 URLs.

PS: mobile (WAP) sitemaps are not affected!

humandesigner

3:32 am on Apr 28, 2010 (gmt 0)

10+ Year Member



devil_dog and suratmedia: thanks a lot! I was really in shock this morning, but I feel better now.

artek: in your suggestion, are the second and third level pages obsolete pages, or do you mean that each page would link to itself via canonical?

Also, I don't know if it's pure coincidence, but this morning I decided to remove all the 301 redirects from my htaccess, and now all my pages finally show up in allinurl when they didn't before.

I thought 301 redirects were great for deindexing old pages and redirecting visitors at the same time, but it looks like Google doesn't like them too much. What do you think?

Thanks again to you all :)

humandesigner

3:44 am on Apr 28, 2010 (gmt 0)

10+ Year Member



wow ... the way Google works is giving me headaches ... nothing logical at all ... insane:

I just checked one of my pages with allinurl and it said "Your search - ... - did not match any documents."

Then I took a short piece of text from that same page, searched for it on Google, and my page showed up in the results.

Is Google out of its mind, or is it me who's failing to understand something?

tedster

3:48 am on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



301 redirects are great for sending users and bots to the current canonical URL for the content that was requested - or to the nearest equivalent page. But if there is no nearest equivalent - that is, if the requested content is just gone - then let it be a 404.

I've seen some major ranking problems develop after 301 redirecting too many non-equivalent pages to some generic page, like Home or a major directory index page.
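
tedster's rule of thumb can be sketched in .htaccess terms, assuming Apache's mod_alias; the paths here are placeholders, not anyone's real setup:

```apache
# Old page with a genuine equivalent: 301 to the specific new URL,
# not to the home page or a generic directory index.
Redirect permanent /old-page/ http://www.example.com/new-equivalent-page/

# Content that is simply gone: say so (410), or just let it 404.
Redirect gone /discontinued-page/
```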

tangor

4:09 am on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't like third parties messing with the web, as in Google coming up with another "tool" such as canonical tags. I didn't see a reason for it the first time around, and it appears there's good reason not to jump on the bandwagon if playing their game can destroy a site. Correct what was done by accident or inadvertently, then do NOTHING for a few weeks. Give Google time to figure out that the change is stable. They will, eventually, come back and index... eventually...

Personally, I code for the WEB, and web standards, not G... I keep saying "There are no short cuts" and back in the day that mantra would have been TANSTAAFL.

tedster

7:28 am on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



code for the WEB, and web standards, not G

I agree - and if you actually set your sites up to be technically perfect by web standards, then most canonical problems can't happen... except for proxy servers, and bogus query strings, and...

In this case, I think the canonical tag is a helpful extension to web and server standards - and it's helpful to all search engines, not just G. But Google should implement it exactly as they've stated.

As I've pondered this thread over the past few days, I've come to think that the canonical tagging error was not the cause of the indexed-pages number dropping so radically. That's happening right now to many sites that don't use the canonical tag at all. I think it's a temporary bug at Google.

TheMadScientist

7:38 am on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I keep saying "There are no short cuts" and back in the day that mantra would have been TANSTAAFL.

OMG IDK #*$! TANSTAAFL is...

But the reason I use (and like) the canonical link element isn't really to canonicalize the pages on my own site. It's to make sure it gets included on 'lazy scraper' pages, just to keep them on their toes, because it's one of the few things we can put on a page that says where the original source is. I know a 'pro' scraper will probably have things set up to strip any references to the original source easily, but I like the little 'security blanket' feeling it gives me: if a 'lazy' or 'rookie' scraper doesn't remove it, there's less harm done.

1script

4:01 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



but I like the little 'security blanket' feeling it gives me knowing if the 'lazy' or 'rookie' scraper doesn't remove it there's less harm done

I was just going to say that it might be a false sense of security, because they don't consider cross-domain rel=canonical - but upon closer review, it turns out they do as of Dec. 2009 (reference here [googlewebmastercentral.blogspot.com]). I'm glad I took it upon myself to argue with you, TMS, and learned something new in the process :)

There is still one catch, though: G* insists the pages have to be similar (although not identical - the degree of similarity is kinda fuzzy), so it has to be a really, really lazy scraper. You'd think that just replacing your ads with his would make something like a 5-10% change to the HTML code of a page.

I think a <base> tag used in conjunction with rel=canonical should provide a little more security yet, but to make better use of it you have to deliberately screw up your own links and image tags and make them all relative. Not sure it's worth the trouble.

TheMadScientist

4:20 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@1script: Actually, it wasn't too big a switch for me, because I almost always use Server Relative* URLs with a base href for 'added security' when linking. I switched to Canonical** URLs for a while to make sure all links pointed to my site if it was scraped, but I figured it was as easy for a scraper to find and replace one URL on every page as fifteen, and it was so much extra code on some sites that I switched back to server-relative with a base href. If someone is going to scrape and change, IMO they're going to scrape and change; if not, the base href is sufficient.

The base href also helps make sure there are no errors in internal linking, since I usually run totally extensionless websites: with the base in place, even if the opening / is omitted from a URL, the browser still makes the correct request for the file.

To eliminate confusion for those who may not be used to different link type references:
* Server Relative: /the-path/to/the-file.ext
** Canonical: http://www.example.com/the-path/to/the-file.ext

BTW: Glad you found the reference and didn't make me go find it again, because it's one of those things I 'read and remember' but what page it was on or where I found the link to it would have been work for me to find again. LOL.

1script

4:47 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To eliminate confusion for those who may not be used to different link type references:
* Server Relative: /the-path/to/the-file.ext
** Canonical: http://www.example.com/the-path/to/the-file.ext
The list has to be changed, or we are going to confuse ourselves thoroughly by adding the <base href> tag to the equation. Your "Server Relative" is more commonly known as "Absolute", so here is the list I propose:


* Relative: the-file.ext
** Absolute: /the-path/to/the-file.ext
*** Canonical: http://www.example.com/the-path/to/the-file.ext
**** Hybrid: the-path/to/the-file.ext (no leading forward slash)


So, I was saying it's the Hybrid version that's needed to make use of the <base href> tag in case your page ends up on a scraper site: if a visitor clicks such a hybrid link, his browser takes him to your site.
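
The resolution behavior being described can be checked with Python's urljoin, which follows the same RFC 3986 rules a browser applies to a <base href>. The URLs below are illustrative placeholders:

```python
from urllib.parse import urljoin  # follows the RFC 3986 rules browsers use

ORIGINAL_BASE = "http://www.example.com/"                  # a <base href> the scraper left in
SCRAPER_PAGE = "http://scraper.example/copied/page.html"   # the scraped copy's own URL

# With the original <base href> intact, hybrid and server-relative links
# both resolve back to the original host:
print(urljoin(ORIGINAL_BASE, "the-path/to/the-file.ext"))   # hybrid
print(urljoin(ORIGINAL_BASE, "/the-path/to/the-file.ext"))  # server-relative

# Without it, the same links resolve against the scraper's own URL:
print(urljoin(SCRAPER_PAGE, "the-path/to/the-file.ext"))
print(urljoin(SCRAPER_PAGE, "/the-path/to/the-file.ext"))
```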

Now, as a crazy idea: has anyone tried to JS-obfuscate the <base> tag to deal with all but the most educated/determined scrapers?

I also always use Absolute (Server Relative) links and the thought of deliberately making them Hybrid, which would not work without the <base href> tag, makes me more than a little uneasy.

TheMadScientist

5:07 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I also always use Absolute (Server Relative) links and the thought of deliberately making them Hybrid, which would not work without the <base href> tag, makes me more than a little uneasy.

I wouldn't worry about changing them from what you have: a base href actually works well with anything except a canonical URL, so personally I would leave them as they are and think seriously about implementing your idea of JS obfuscation. Then there's nothing for a scraper to find and replace in the source; unless they test, scratch their head, and switch everything to canonical URLs for their own site, the base href will stay in place and the links will point to your site... I like the idea, personally.

g1smd

10:39 pm on Apr 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Definitions:
* Relative: the-file.ext
or
../../the-file.ext
or
the-path/to/the-file.ext
** Server-Relative: /the-path/to/the-file.ext
*** Absolute: http://www.example.com/the-path/to/the-file.ext
(the 'correct' one among duplicates is also known as the 'canonical' URL)