

Can <meta name="origin"> be used instead of canonicals?

     
8:43 am on Sep 2, 2019 (gmt 0)

New User

joined:Sept 2, 2019
posts:5
votes: 0


So my CTO is running a duplicate of our www. site on the api. subdomain. I am not 100% clear on the reasons for this - but he is doing it. We apparently can't have different robots.txt files on the www and api subdomains, as the files must all be identical.

The whole of api.sitename.com has now been indexed which just should not be.

To mitigate the problem, his solution was to add

<meta name="origin" content="https://www.sitename.com/webpage"> to pages such as api.sitename.com/webpage

I have not heard of Google respecting the origin meta tag, treating it as a faux canonical tag.
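
For reference, the canonical element that Google actually documents (which these pages don't currently carry) would look like this in the <head> of the api copy:

<link rel="canonical" href="https://www.sitename.com/webpage">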

Does anyone have experience with this?
4:16 pm on Sept 2, 2019 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12368
votes: 403


I've never heard of the origin meta tag either.

The whole of api.sitename.com has now been indexed which just should not be.

I've encountered IT departments that keep exact dupes of a site for development, and I've seen a lot of grief created by this kind of situation. If you block the subdomain with a password, there's essentially no way to communicate with the site from the outside - which also means it can't be crawled or indexed.
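
For what it's worth, the mechanics of that password block are usually just HTTP basic auth across the whole duplicate host. A minimal sketch, assuming an Apache server (the .htpasswd path is hypothetical):

# hypothetical vhost snippet for the duplicate subdomain
<Location "/">
    AuthType Basic
    AuthName "Restricted"
    AuthUserFile "/etc/apache2/.htpasswd"
    Require valid-user
</Location>

Googlebot then gets a 401 on every URL, and a page that can't be fetched can't stay indexed.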

4:29 pm on Sept 2, 2019 (gmt 0)

New User

joined:Sept 2, 2019
posts:5
votes: 0


The problem is... I can't see any way that what you are saying could be impossible.

And yet, my CTO is looking me in the eye... and saying... that's how it is.

So it really just comes down to the meta origin tag either working or not. Technically? I guess it should. It's described here [doc.ohreally.nl...] though I don't know the real value of this site.

My issue is... does Google respect it or not?
5:01 pm on Sept 2, 2019 (gmt 0)

Junior Member

5+ Year Member Top Contributors Of The Month

joined:Jan 22, 2011
posts:115
votes: 6


These are the robots meta tag values Google understands:

index - Allow the page to be indexed.
follow - Follow any links on the page as part of crawling.
noindex - Prevent the page from being indexed.
nofollow - Don't follow links from this page as part of crawling.
nosnippet - Don't show a text snippet or video preview in the search results. For video, a static image will be shown instead, if possible. Example: <meta name="robots" content="nosnippet">
noarchive - Don't show a Cached link for a page in search results.
unavailable_after:[date] - Lets you specify the exact time and date at which crawling and indexing of this page should stop.
noimageindex - Don't show the page as the referring page for an image in Google Image search results.
none - Equivalent to noindex, nofollow.
all - [Default] Equivalent to index, follow.

Source: [support.google.com...]
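
So if the goal is to keep api.sitename.com out of the index, the documented route - purely illustrative markup, assuming the api pages' templates can carry their own head section - would be:

<!-- in the <head> of every page served from api.sitename.com -->
<meta name="robots" content="noindex, nofollow">

rather than an origin tag Google has never documented.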
7:15 pm on Sept 2, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15818
votes: 854


We apparently can't have a different robots.txt in the www and api subdomain as the files all must be identical.
Say what now? You not only can, you have to have a different robots.txt for each valid hostname, including subdomains, because they will be separately requested.

It is possible that www.example.com/robots.txt and api.example.com/robots.txt both serve the same physical file, whether because of the site's directory structure or some behind-the-scenes rewriting. But it's nonsense to say they "must be" identical.

Is this your CTO's way of saying "I don't know how to do it so I'll tell the boss it can't be done"?
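
To illustrate the behind-the-scenes rewriting direction - a minimal sketch assuming Apache, with robots-api.txt as a made-up filename - a couple of lines can hand the api host its own file:

# hypothetical .htaccess lines: give api.example.com a separate robots.txt
RewriteEngine On
RewriteCond %{HTTP_HOST} ^api\. [NC]
RewriteRule ^robots\.txt$ /robots-api.txt [L]

where robots-api.txt simply shuts crawlers out:

User-agent: *
Disallow: /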
3:59 am on Sept 3, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10334
votes: 1061


If it is not here, it probably isn't valid: [w3schools.com...]

You can put a robots.txt anywhere ... that does not mean the bots will respect it one way or the other! And I suspect that G has its own "understanding" of robots.txt, and that's the one that needs to be addressed.
5:04 am on Sept 3, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4470
votes: 332


It almost looks like someone has confused an old referrer-tracking technique with something else. About five years ago, when more sites were starting to 301 to https, Moz.com's blog suggested using
<meta name="referrer" content="origin">
to enable tracking of "https --> http" traffic sources/referrers. That is the only place I know of where "origin" was ever discussed in a meta tag. It was not an effective method, because it passed only the 'origin' domain, with no page or full URL. It was not related to canonicals in any way. I vaguely recalled it and had to look it up: [moz.com...]
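
To illustrate what that tag actually did: a browser honoring it sends only the bare origin as the referrer, so a click from, say, https://www.example.com/some/page arrives as just

Referer: https://www.example.com/

The receiving site could see the domain after an https-to-http hop, but never the page.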

It is not a Google thing; they are clear about what they read and use.
7:03 am on Sept 3, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10334
votes: 1061


It is not a Google thing; they are clear about what they read and use.


Exactly!

OP ... might provide a link to this discussion to the powers that be? :)

Or not!
7:34 am on Sept 3, 2019 (gmt 0)

Junior Member from DK 

Top Contributors Of The Month

joined:Oct 24, 2018
posts: 47
votes: 4


I looked at the source provided; the origin tag in this context deals with the origin of intellectual work: "... Indicate sources that were used to create an original work; a list of sources which is readable to the end user (e.g. footnotes or a separate page) should be used for that ...". The example used is the ISBN of a book. I don't see any way in which G would respect (or even know how to deal with) this in the situation described.

Alternative suggestion to the otherwise excellent suggestions already provided: an X-Robots-Tag: noindex header for the api version.
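
A minimal sketch, assuming Apache with mod_headers in front of the api host (adjust for whatever actually serves it) - one line in the vhost covers every response without touching any page template:

# hypothetical: in the api.sitename.com vhost
Header set X-Robots-Tag "noindex, nofollow"

It works for non-HTML responses too, which a meta tag can't reach.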

My guess is that you're in-house, like myself?
7:44 am on Sept 3, 2019 (gmt 0)

New User

joined:Sept 2, 2019
posts:5
votes: 0


OK. I can't stress enough what I am dealing with, in terms of a locked-down codebase, frameworks, and derision at suggesting anything that wasn't the CTO's idea. It's really not worth my time telling this guy there's no way something is impossible... when I well know... it is possible.

a. The robots.txt is pulled from a database field. The database is the same for the main domain and the subdomain, but the output could still be made conditional on the hostname it is being served under (see the sketch after this list).

b. Google says its list of meta tags it can read is not exhaustive. I guess he is relying on this.

c. Yes, I also thought at first it was a confusion with, or mangling of, the referrer meta tag, but it seems somewhat legit as a meta tag... I just don't see Google valuing it.
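
To sketch what I mean in (a) - our stack isn't necessarily anything like this, so treat it as a hypothetical Flask-style handler, and load_robots_from_db() is a made-up stand-in for the real database field lookup:

from flask import Flask, Response, request

app = Flask(__name__)

BLOCK_ALL = "User-agent: *\nDisallow: /\n"

def load_robots_from_db():
    # stand-in for however the robots.txt database field is actually fetched
    return "User-agent: *\nDisallow: /private/\n"

@app.route("/robots.txt")
def robots():
    body = load_robots_from_db()
    # same database for both hosts, different output on the api subdomain
    if request.host.startswith("api."):
        body = BLOCK_ALL
    return Response(body, mimetype="text/plain")

Same database, one branch on the Host header - which is why "the files must all be identical" doesn't hold up.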

The problem is, I have no way to prove it won't work, and a difficult time showing harm... that is... till the harm is done.

Of course the solution is "follow best practices, take no chances, do what Google says" ... but all you guys are living in rational town, and I am in crazy world.
4:36 am on Sept 4, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10334
votes: 1061


@richinberl ... as an employee you do as directed, but MEANWHILE, document these directions to CYA when the fit hits the shan!
7:29 am on Sept 5, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:900
votes: 4


As far as the Google index is concerned, you may just want to verify the subdomain in GSC and have it removed from the index using the URL removal tool. That will give you relief for 3 months. "Just a double protection" can be the justification to your CTO.
8:02 am on Sept 5, 2019 (gmt 0)

New User

joined:Sept 2, 2019
posts: 5
votes: 0


Yes indeed, McMohan, I am already all over that solution, even prepared to do it permanently.
8:42 am on Sept 5, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 5, 2002
posts:900
votes: 4


Unless you disallow it in robots.txt or use a noindex tag, the removal can't be permanent. You may have to keep using the removal tool every time it reappears in the index.
8:46 am on Sept 5, 2019 (gmt 0)

New User

joined:Sept 2, 2019
posts: 5
votes: 0


What I meant: every 3 months I remove the URLs. I do this as a permanent task in my calendar of stupid things I do.
10:31 pm on Sept 8, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10334
votes: 1061


chuckles ... for all the tech noise that AI can read your mind ... they can't read commonsense "disavow this krap 4ever and don't bother me!"

Pretty sure there's a human engineered routine that prevents the machine from actually getting that directive correct.

</sarcasm>
 
