
Anatomy of an HTML to CMS Upgrade; How Google is responding to the new site...


samwest

1:11 pm on Aug 20, 2014 (gmt 0)

Original site started in 2000 and did awesome for ten years, until May Day 2010; since then it's been a slow, bumpy ride downhill. I understand many of the reasons: mobile devices, the proliferation of freebie versions of my material on ads sites, the "evil" G wanting everyone's traffic, the economy, etc., etc.

I've been working on a blog for a few years and put it in a /m directory.
Google hardly noticed it there. I always wanted to upgrade the site to a CMS, but with the old HTML pages and the blog side by side, the blog never got indexed. Google just didn't like to see the two together.

About three weeks ago, I scrapped the 70 pages of old HTML and moved all those pages to WordPress. I moved everything to the root and have a permalink program catching all possible 301s, and I cleaned up all the broken links. Three weeks later, G has indexed about 350 of 3037 pages and posts, and 468 of 675 images. Each day they index a few more.
Now it's all under one roof.
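For anyone curious what "catching all possible 301s" looks like under the hood, it comes down to mod_rewrite rules like these in .htaccess (just a sketch; the plugin generates the real map, and the page names here are made up):

    # 301 the old static pages to their new WordPress permalinks
    RewriteEngine On

    # Explicit one-to-one moves for pages whose slugs changed
    RewriteRule ^widgets\.html$ /widgets/ [R=301,L]

    # Catch-all: old "something.html" becomes the "/something/" permalink
    RewriteRule ^([A-Za-z0-9-]+)\.html$ /$1/ [R=301,L]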

site:mydomain.com returns 2940 results... all appear to be valid URLs.

I also ran a rank check on about 100 of my top key phrases. 95% rank on page 1, most in position 1, 2 or 3.

So, here I am, looking at the Google Analytics Real-Time live display, and there's a big goose egg on the screen. ZERO!

Every once in a while I get a visit... maybe two, but never more. I'm on a dedicated server with plenty of horsepower. The site loads quickly and gets an A/B rating on GTmetrix.

If I type my top key phrases into G, I get my site, so where are all the visitors? Am I in some kind of strange sandbox?

samwest

5:57 pm on Aug 22, 2014 (gmt 0)

@aakk9999 - yup, I've tried it and that works. I also did the same with Piwik, but I usually block my own visits in both.

I think I may have found a redirection solution that is working.

What about redirecting all unknowns to the home page? Is that OK to do?

I am seeing requests for valid pages, but also a lot of garbage requests from stuff that was deleted years ago when this CMS was being used as a blog. I'm sending all those requests to the home page. Would the table of contents or site map be better?


netmeg

6:00 pm on Aug 22, 2014 (gmt 0)

@aakk9999 - yup, I've tried it and that works. I also did the same with Piwik, but I usually block my own visits in both.


I always have one raw stats profile with NO filters, just in case.

What about redirecting all unknowns to the home page?


I wouldn't. Let them 404.

Planet13

7:34 pm on Aug 22, 2014 (gmt 0)

I urge EVERYONE to try out your URL redirects using Internet Explorer 8.

That browser chokes on the slightest imperfections in your code.

I had a redirect that Fetch as Googlebot and the various HTTP header checkers all said was fine.

I thought I was good to go. Then, when I checked my redirects in IE8, I realized I was in deep tabouli...

aakk9999

8:41 pm on Aug 22, 2014 (gmt 0)

I'm sending all those requests to the home page.

This is a bad idea and Google does not like it:

Not found errors (404)
https://support.google.com/webmasters/answer/2409439?hl=en

Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Such pages are called soft 404s, and can be confusing to both users and search engines.


Ideally, if the page was there and now it is not, the server should return a user-friendly "Not found" page with status 410. If the page was never there in the first place, return the user-friendly "Not found" page with status 404.
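On Apache, a minimal .htaccess sketch of that logic (the paths are just examples; substitute your own):

    # Pages that used to exist and are permanently removed: 410 Gone
    Redirect gone /old-article.html
    Redirect gone /retired-section/

    # Friendly "Not found" pages for both cases
    # (anything that never existed falls through to the normal 404)
    ErrorDocument 410 /not-found.html
    ErrorDocument 404 /not-found.html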

samwest

10:01 pm on Aug 22, 2014 (gmt 0)

@aakk - whoa! How can I control the response to thousands of junk queries mixed in with the good ones?

What about sending them to a custom 404 page with suggested alternatives?

UPDATE: I found a solution! The "410 for WordPress" plugin works perfectly. This should help with all these ridiculous outdated page requests. Thanks again, aakk!

samwest

1:59 am on Aug 23, 2014 (gmt 0)

One more update:

I've definitely ID'd my problem.

The CMS resided in several old directories along the way. I never thought those had been indexed, but they were.

If I do a site:mysite.com/blog or site:mysite.com/m (two development directories I used), I find hundreds of identical little pages out there.
The CMS is now installed in the root, so there's yet ANOTHER copy.
It's not duplication, it's triplication. Wow!

So, the next thing is what to do...
So far I've found a nice 410 plugin that lets me 410 the entire /blog and /m directories. It's working great! Every request for anything in those ghost directories comes back as a 410. Aakk, you're a genius!
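(For anyone who'd rather skip the plugin, I believe the same effect is two lines of mod_alias in the root .htaccess; a sketch using my directory names:)

    # Answer 410 Gone for everything under the old development directories
    RedirectMatch gone ^/blog(/.*)?$
    RedirectMatch gone ^/m(/.*)?$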

Now, do I also go so far as to submit a removal request for those two directories, or do I just cool my heels and wait for the 410s to do their work?

The low traffic now makes a lot more sense. I moved the site into a bad neighborhood and now need to clean house.

JD_Toims

5:06 am on Aug 23, 2014 (gmt 0)

People worry too much about "duplication", imo. We've been told (and know) that Google groups duplicate URLs together and then picks the one it thinks is best as the canonical [to show in the results] for any given query, so personally, I'd just wait for Google to pick up the 410s.



Basically, what happens when duplicates are discovered is that Google "groups URLs together" and then assigns all ranking signals to the one it thinks is best for the specific search conducted. A site could have www.example.com and m.example.com, and if Google realizes [in whatever way] that www is "desktop" and m is "mobile", it switches which result is shown. So onsite duplication isn't as big a deal as people make it out to be: if there are 10 URLs on the same site with the same content, they're grouped together, the algo assigns all the weight to one of them as the canonical, and that one shows in the corresponding place in the results. It might not be the site owner's preferred URL, but the grouping and the weight assignment happen all the same...
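One thing a site owner *can* do is declare the preferred URL explicitly. In WordPress an SEO plugin will usually put a <link rel="canonical"> tag in the head for you; on Apache the same hint can go out as an HTTP header. Rough sketch, mod_headers assumed and the file/URL names made up:

    # Point Google at the URL we consider canonical for this resource
    <Files "widgets.html">
        Header set Link "<https://www.example.com/widgets/>; rel=\"canonical\""
    </Files>

It's a hint, not a directive, but it usually settles which of the grouped URLs gets shown.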

There is a bunch of FUD about duplicates, but the reality is: if there is a national site that "originated the content" and a local site that duplicated it, Google in all its "infinite wisdom" may show the local site rather than the originator, depending on the query and previous queries.

Onsite duplication doesn't really change anything except which URL shows in the results, and offsite duplication isn't really something any of us can control, except via a DMCA notice, because Google seems to care less about where the content originated than about whether the algo decides a searcher would rather see the original or a duplicate presented as local to the query.

It's a *huge* flaw/unfairness, imo, because Google completely disrespects the creator of the content when the algo favors a later "copy" of a page, found as a query-local duplicate, over the originating site. But I don't make Google's rules, so there's not much that can be done about it that I know of, except a DMCA...



TL;DR

Don't worry as much about onsite duplication as those who need to justify their existence/paycheck say to: duplicate-content URLs all get grouped together, and one of them is given all the weight of the incoming ranking signals as the canonical. Worry more about offsite, query-local duplication, because that's the type that can really wreck you, since Google's algo doesn't much care where the content originated and will show query-local results rather than the content originator.

samwest

10:45 am on Aug 23, 2014 (gmt 0)

@JD_Toims - thanks for the excellent explanation of dupes. I feel somewhat better, but also worse because I figured it was these dupes that were affecting my anemic traffic. Either way, great info & thanks!

I guess my biggest concern is the number of 404s generated by this bloat. I was reading about how G handles 404 and 410 requests. The difference is subtle, but the 410 "generates an immediate error" rather than waiting. When I see the word "error", I start assuming there's a running error count somewhere that could work against me. G paranoia again.

netmeg

12:34 pm on Aug 23, 2014 (gmt 0)

I frequently run into situations where a dev has allowed the dev version of a site or directory to get indexed while working on it. Basically, I remove it in GWT and then block it off so it can't be re-indexed within the 90 days. Works for me, but it's up to you.
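The "block it off" part can be as simple as an .htaccess dropped into the dev directory. A sketch, assuming Apache with mod_headers (the htpasswd path is a placeholder):

    # Tell crawlers not to index anything served from this directory
    Header set X-Robots-Tag "noindex, nofollow"

    # Or lock it down completely with basic auth
    # AuthType Basic
    # AuthName "Development"
    # AuthUserFile /path/to/.htpasswd
    # Require valid-user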

samwest

5:27 pm on Aug 23, 2014 (gmt 0)

@netmeg - thanks - I just sent in the removal requests. One was already gone and the other required a few more questions. All good, thanks!

samwest

10:48 am on Aug 24, 2014 (gmt 0)

Update from the big crawl on the 19th... indexed pages went from 373 of 3353 to 1618 of 3353 overnight. Now, if any of that could translate into higher traffic, it would be great; otherwise it's just useless fluff. So far traffic has taken only a very minor uptick. Patience!

samwest

12:15 pm on Aug 25, 2014 (gmt 0)

The site traffic appears to be clamped... I've had the same visitor count, +/- 2%, over the last 5 days. Crawling has stopped and they are only halfway done.

aakk9999

3:07 pm on Aug 26, 2014 (gmt 0)

You have to wait. You added the 410 Gone responses only 4 days ago. Patience is required now, for at least a few weeks, along with monitoring for any other technical glitches.

samwest

12:03 am on Aug 27, 2014 (gmt 0)

@aakk - I'm actually on to another project right now, but keeping an eye on this one. Thanks for your help! And netmeg too! (and everyone else, for that matter)

samwest

4:20 pm on Sep 7, 2014 (gmt 0)

Just checking in: traffic has been capped off since the last report. Here's the odd part... my listings haven't changed, except into positive territory, I'm running AdWords again, and I now have 1500 pages indexed rather than the old 150. 10x the content, but 1/10th the traffic. Interesting.

Still handling all the 301s and 410-ing the dupes that were in the development directories. Had one big crawl, but everything is sliding sideways or downhill, especially search queries.

aakk9999

9:48 pm on Sep 9, 2014 (gmt 0)

I am wondering whether the change in information architecture has contributed to your traffic drop.

You say you have 1500 pages where you used to have 150. Your opening post says you had 70 pages plus a blog in the /m directory, and that Google hardly paid attention to it. By this I presume you mean rarely crawled and mostly not indexed?

The phrase "all under one roof" confuses me a bit. If the blog was in the /m folder, it was already "under one roof" as far as Google was concerned, since it resided on the same domain. The technology you use to produce the HTML returned to Google/the browser is irrelevant.

Before your changes - have you created a spreadsheet with URLs from the "old" site that Google indexed and done some analysis?

Why do you think Google did not "pay attention" to the /m folder? Was it blocked by robots.txt? Or were the pages in fact indexed, but you could only see them in the SERPs if you searched for an exact sentence from such a page, or used the inurl: operator with a narrowed-down URL pattern?
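For example, queries something like these (with your own domain substituted) will show what Google has indexed under that folder:

    site:example.com/m
    site:example.com inurl:m

If pages show up there, they were indexed all along, just not surfacing for normal searches.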