Ensure you set up proper representation of your canonical URLs. Do not make an HTTP request in code to populate this field. The tag should be populated from a static entry in a database, or from code that deterministically reconstructs the real URL.
If www.mywebsite.com/aaa/bbb/ccc is the real page, and www.mywebsite.com/aaa/bbb/ccc-1 pulls the same information onto the page, then the canonical tag of www.mywebsite.com/aaa/bbb/ccc-1 should be www.mywebsite.com/aaa/bbb/ccc
Of course, any duplicate variations should be noindexed regardless, and a developer should look into why your URLs are resolving in this way in the first place.
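The advice above, populating the canonical from a stored record rather than an HTTP request, might look something like this minimal sketch (the function, mapping, and domain are invented for illustration):

```python
# Minimal sketch: the canonical URL comes from a stored mapping (or a
# deterministic rule), never from fetching the page over HTTP.
# Function name, mapping, and domain are hypothetical.

def canonical_tag(requested_path, canonical_paths,
                  base="https://www.mywebsite.com"):
    """Return a <link rel="canonical"> tag for the requested path.

    canonical_paths maps any known duplicate path to its canonical
    path, e.g. {"/aaa/bbb/ccc-1": "/aaa/bbb/ccc"}. A path not in the
    map is its own canonical.
    """
    path = canonical_paths.get(requested_path, requested_path)
    return f'<link rel="canonical" href="{base}{path}">'

dupes = {"/aaa/bbb/ccc-1": "/aaa/bbb/ccc"}
print(canonical_tag("/aaa/bbb/ccc-1", dupes))
# -> <link rel="canonical" href="https://www.mywebsite.com/aaa/bbb/ccc">
```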
From personal experience I have rectified this problem a few times.
Learn HTTP status codes and how and when to use them.
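For reference, here are the codes that come up most often in SEO work, with rough rules of thumb (the one-liners are my shorthand, not gospel; the numbers come straight from Python's standard library):

```python
# The status codes that come up most in SEO work, with rough rules of
# thumb. The numeric values are verifiable via the standard library.
from http import HTTPStatus

SEO_CODES = {
    HTTPStatus.OK:                    "page exists; index it",
    HTTPStatus.MOVED_PERMANENTLY:     "moved for good; signals pass to the new URL",
    HTTPStatus.FOUND:                 "temporary move; the old URL stays canonical",
    HTTPStatus.NOT_FOUND:             "not here (maybe ever, maybe just for now)",
    HTTPStatus.GONE:                  "removed on purpose; dropped faster than a 404",
    HTTPStatus.INTERNAL_SERVER_ERROR: "server trouble; persistent 500s get pages dropped",
}

for code, rule in SEO_CODES.items():
    print(int(code), rule)
```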
Familiarize yourself with robots.txt and robots meta tags; learn when to use which and where (and avoid mistakes like using both!)
Remember sometimes what you keep OUT of Google is more important than what you put in. Things like search results pages and thin, low content or duplicate content pages that won't bring you quality traffic anyway.
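One trap worth spelling out here: robots.txt and the robots meta tag interact badly when combined, because a URL blocked in robots.txt is never fetched, so a meta noindex on that page is never seen. A small sketch using Python's standard-library robots.txt parser (example.com URLs are placeholders):

```python
# Why "use one or the other" matters: a URL disallowed in robots.txt is
# never crawled, so the crawler can never read a noindex tag on it.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /search/
""".splitlines())

# Crawlers are told to skip /search/ entirely...
assert not rp.can_fetch("*", "https://www.example.com/search/widgets")

# ...but a thin page you want de-indexed must stay fetchable, so the
# robot can actually see <meta name="robots" content="noindex"> on it.
assert rp.can_fetch("*", "https://www.example.com/thin-page.html")
```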
Machine learning, graph analysis, expert systems, natural language processing. How they work and actual implementation. Assume Google is a few years ahead of any Ph.D. level paper you come across.
And what netmeg said. Sharp fella.
Yes netmeg speaks words of wisdom.
Thanks. I'm also not a fella.
Oh boy. I just realized I did not read closely enough. I called netmeg "nutmeg". LOL Sorry mam!
@netmeg Whoops, my apologies.
If you've got a six-word text string that figures prominently in your site's configuration, make sure you spell it right. ;)
Back to the topic at hand:
SITE ARCHITECTURE MATTERS
A year or so back, someone posted about a hacker attack that affected only the mobile version of the site. The hacker seems to have reasoned-- correctly-- that the People In Charge would generally look at their work in the biggest possible format. So if you constrain your misbehavior to the less glamorous versions of a site, you can sneak under the radar for a long time.
Doesn't only apply to security breaches. If something visually disastrous happens at smaller sizes ("Whoops! That table doesn't really work when each cell is only two ems wide, does it?"), the user is not going to stick around and try to make things right. That was your job.
Speaking of hacks, I've found a few websites whose DNS cache was poisoned, allowing the hacker to divert a percentage of their search traffic! It's not as common as someone inserting parasite content/links or an iframe hack - but it really creates a mystery when it does happen. "How can my stats disagree this much?"
The fix? Check your DNS settings and fix the errors you find. There are many good tests available online.
Reference thread: DNS Cache Poisoning [webmasterworld.com]
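A crude first check for the problem described above: resolve your hostname with the local resolver and compare the answers against the addresses you actually publish. This is only a sketch; the hostname and IP in the usage comment are placeholders, and it only exercises the resolver *you* use, so repeat the test from other networks and public resolvers too.

```python
# Sketch: compare what the local resolver returns for a host against a
# known-good list of published addresses.
import socket

def resolved_ips(host):
    """All distinct addresses the local resolver returns for a host."""
    infos = socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# Safe local demo:
print(resolved_ips("localhost"))

# Usage against your own site (hostname and address are placeholders):
#   unexpected = set(resolved_ips("www.example.com")) - {"203.0.113.10"}
#   if unexpected:
#       print("DNS answers you did not publish:", unexpected)
```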
1) Make sure your robots.txt is in plain text format. Saving it as UTF-8 can leave Google unable to read it, and in that case Google acts as if the robots.txt is not there.
2) Make sure robots.txt returns 200 OK. Returning HTTP 500 for it may result in your site not being crawled and being de-indexed.
3) Avoid using the parameter "lang" for the language. If you must use a parameter, use lng or something else. If the & before the "lang" parameter is not escaped as &amp;, many browsers and scrapers understand &lang as the entity for a left angle bracket <. Even if you correctly write &amp;lang= in your markup, scrapers that scrape SERPs copy the URL without encoding, and then Google picks up duplicate URLs that may look like <=en or similar.
4) Be careful with relative paths; in fact, do not use them. An infinite URL space can easily be created by incorrect handling of relative paths, generating thousands and thousands of duplicate pages.
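Point 4 is easy to demonstrate with the standard library's relative-URL resolution. If an author writes a relative link that was meant to be absolute, and the server happens to answer at the deeper paths too (catch-all routing, a common misconfiguration), a crawler can walk down forever:

```python
# Point 4 in action: a relative href resolves against the current URL,
# so one wrong link plus catch-all routing yields an infinite URL space.
from urllib.parse import urljoin

link = "aaa/page.html"   # the author meant "/aaa/page.html"

step1 = urljoin("https://example.com/aaa/page.html", link)
step2 = urljoin(step1, link)
print(step1)   # one directory deeper than intended
print(step2)   # deeper still, and so on without end
```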
|Make sure your robots.txt is in plain text format. Having it in UTF-8 |
Format and file encoding have nothing to do with each other. Odd to see this here. It's a pervasive error on my e-books forum.
So long as none of your filenames or directories use non-ASCII characters, the encoding is immaterial in any case.
robots.txt can return either 200 or 404 (meaning you haven't got one). Anything else, and the well-behaved robot will go away sulking.
|Avoid using parameter "lang" for the language. |
To be clear: you're talking about URL parameters, right? Not <lang="something"> declarations. I know this one well; it plays havoc with my log-wrangling in exactly the way you describe. Another parameter to avoid is "ni". Can't remember who uses it, or what for-- only that it turns into a mess.
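For anyone who wants to see the mechanics: these are HTML named character references. "&lang;" decodes to a left angle bracket character, and "reg" sits on HTML's legacy list, so it decodes even without a trailing semicolon. A quick check with Python's html module:

```python
# How "lang" and "reg" get mangled: they collide with HTML named
# character references when a scraper runs URLs through entity decoding.
import html

# "&lang;" is the entity for U+27E8, a left angle bracket:
assert html.unescape("page?x=1&lang;=en") == "page?x=1\u27e8=en"

# "reg" is on HTML's legacy list, so it decodes WITHOUT a semicolon:
assert html.unescape("page?x=1&reg=EU") == "page?x=1\u00ae=EU"

# Strict HTML5 decoding leaves "&lang" (no semicolon) alone, but
# sloppier parsers decode it anyway -- hence the advice to avoid it:
assert html.unescape("page?x=1&lang=en") == "page?x=1&lang=en"
```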
|Make sure your robots.txt is in plain text format. Having it in UTF-8 |
|Format and file encoding have nothing to do with each other. |
Well, if you open your robots.txt in Textpad, save it as UTF-8, and upload it to the server, Google ignores it.
Unfortunately, some months ago I had first-hand experience of this - pages that were supposed not to be crawled were crawled.
Only when I saved it as PC ANSI did it start to "work" and stop Google.
|To be clear: you're talking about URL parameters, right? Not <lang="something"> declarations. |
Yes, I was talking about lang= parameter in URL, sorry this was not clear enough!
Did not know about the ni parameter (thanks!), but there is also "reg" (sometimes used as a region parameter), which turns into the registered trademark sign. I am sure there are others!
But "lang" is very common, hence I mentioned it.
:: peering into crystal ball ::
Betcha Textpad added the dreaded BOM, and it's this that played havoc with your robots.txt file. Poke around in the preferences and you should find an option for saving UTF-8 files without the BOM. Once it is gone, there will be no difference in file content.
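For anyone who wants to verify this on their own files: the UTF-8 BOM is the three bytes EF BB BF at the very start of the file, and it is trivial to detect and strip. A sketch:

```python
# Detect and strip a UTF-8 byte-order mark (EF BB BF) from raw bytes,
# e.g. a robots.txt file read from disk in binary mode.
import codecs

def strip_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if present; the rest is untouched."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

with_bom = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /private/\n"
clean = strip_bom(with_bom)

# As noted above: once the BOM is gone, an ASCII-only robots.txt is
# byte-for-byte identical whether it was saved as "UTF-8" or "ANSI".
assert clean == b"User-agent: *\nDisallow: /private/\n"
```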
Always check your redirects in incognito mode, as that is closer to what Google will crawl. I have found that jsession-ids sometimes get added in incognito mode, along with other issues, all of which should be fixed immediately.
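Part of this check can be automated. The sketch below follows a redirect chain without sending any cookies (urllib sends none by default, much like a fresh incognito window) and flags session IDs that leak into the URLs. The target URL in the usage comment is a placeholder; point it at your own redirecting pages.

```python
# Sketch: follow redirects cookie-free (urllib's default, similar to an
# incognito window) and flag session IDs that appear along the way.
import urllib.request

def has_session_id(url):
    """True if a URL carries a tell-tale session parameter."""
    low = url.lower()
    return "jsessionid" in low or "phpsessid" in low

def redirect_chain(url, timeout=10):
    """Return every URL visited while following redirects, final one last."""
    chain = [url]

    class Recorder(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            chain.append(newurl)
            return super().redirect_request(req, fp, code, msg,
                                            headers, newurl)

    urllib.request.build_opener(Recorder).open(url, timeout=timeout)
    return chain

# Usage against your own site (URL is a placeholder):
#   for hop in redirect_chain("https://www.example.com/old-page"):
#       if has_session_id(hop):
#           print("session ID leaked into redirect:", hop)
```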
I would say the biggest piece is - Don't forget the basics. It's easy to get caught up in catch phrases like content marketing, but every piece of client / company owned information needs precise, technical SEO.
It's so easy to get caught up in a myriad of channels and lose oversight of the basics.
I have now made it a rule to check whether a website ranks first in Google for its own content, by searching for a string of words (within quotes) from the homepage and a couple of inner pages every month. I have seen many examples of sites suffering because one or more authority sites carry their content.
Make sure your efforts can scale. I work on a website with thousands of pages, and for a long time the rule of thumb was to change individual titles and such (which were very spammy-looking before my time), but it has grown to be a very tedious task. So, now we've moved toward a bit of standardization for some elements, like title tags.
It is a lot easier to work with than the purely per-page approach we used before, though we can still change individual titles if some other keyword makes more sense.
However, this also means you need to know your niche. Is there some money keyword that is generally used by searchers looking for sites in the niche? If so, that can be used to scale some efforts.
And the other part of that is making sure you can actually do it. Invest in a good CMS with the ability, or find a plugin to do it on WordPress, etc.
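The standardize-with-override approach described above can be as simple as a template plus an optional hand-written title. The template, field names, and example titles below are invented for illustration:

```python
# Sketch of scalable title tags: a site-wide template by default, with a
# per-page override when a human has written something better.
# Template pattern and field names are hypothetical.

TITLE_TEMPLATE = "{product} - {category} | MySite"

def page_title(page):
    """Use the hand-written title if one exists, else the template."""
    if page.get("custom_title"):
        return page["custom_title"]
    return TITLE_TEMPLATE.format(product=page["product"],
                                 category=page["category"])

standard = page_title({"product": "Blue Widget", "category": "Widgets"})
custom = page_title({"product": "Blue Widget", "category": "Widgets",
                     "custom_title": "Buy Blue Widgets Online | MySite"})
print(standard)   # templated
print(custom)     # the override wins
```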