
Google SEO News and Discussion Forum

    
via this intermediate figment of the imagination
lucy24
msg:4399906 - 11:14 pm on Dec 20, 2011 (gmt 0)

I've been poring over posts that talk about the "via this intermediate link" line in gwt, trying to make sense of something that showed up in my own list recently. So far I haven't found any better explanation than "gwt has gone bonkers".

It's going to be a little tricky to give the necessary information without giving the wrong information, so bear with me.

Deep in the bowels of my site I've got a group of pages that are best described as an unauthorized mirror. My own personal Wayback Machine. The directory isn't included in the original site's robots.txt, but each individual page has the same "noindex, nofollow" meta tag. So the real pages are listed in neither google nor the (real) Wayback Machine.*

Recently a new batch of "pages that link to your site" showed up in gwt linking to a specific one of these mirrored pages. Close study suggests they are all the same page, and the owner needs to spend some time with the URL parameters section of gwt-- but I won't complain, because one of them was a "printer friendly" version that didn't require login. It's a plausible link-- except that all of them are listed as "via this intermediate link".

This is where I start getting suspicious about google's compos mentisness, because the "intermediate link" is, you guessed it, a page on the original mirrored site.

This site, too, has shown up in gwt: a cluster of different pages, all ostensibly linking to my page, again "via this intermediate link". Of course they do nothing of the sort; what they actually link to is their own original version of the page. Which g### doesn't know about. (I checked with a site: search.)

Everyone follow that?

According to google, a page on www.example.org links to a page on www.example.gov which in turn redirects to me.

Oh, yes, those not-really-linking pages. I don't have mirrors of them. (They're boring.) So my links lead to the real thing. Like all external links in this group of pages, they are flagged as "nofollow". The explanation is hidden somewhere in this detail, but trying to figure it out is giving me a headache :(


* This may be premature. Looking up my own site turns up nothing more recent than February, though they've been crawling regularly and have archived versions dating back to 2007. I know there is a wide range of opinions on the Wayback Machine. I, personally, like the idea. "Hm, is that really what I thought **** *** ** *** ****** meant back in August 2010?"

 

aristotle
msg:4400169 - 4:00 pm on Dec 21, 2011 (gmt 0)

I tried to follow everything but am still confused. For one thing, I'm not sure how the real Wayback Machine figures into it. Maybe it would help if you could give a basic overview of the relationships between the different sites and directories.

rlange
msg:4400264 - 8:29 pm on Dec 21, 2011 (gmt 0)

I'm not sure if I'm following that, either. Let me see...

  • You administer Site A.
  • Within Site A, you've mirrored pages from Site B. The original pages on Site B are noindex'ed and nofollow'ed, but otherwise publicly accessible.
  • Google Webmaster Tools began showing new links from Site C to one of these mirrored pages on Site A.
  • Google Webmaster Tools is reporting that these links from Site C are "via this intermediate link" and that "intermediate link" is actually from Site B.
  • Google Webmaster Tools is also reporting links from Site B to Site A.

Is that about it?

If so, that does seem to indicate that Site C is actually linking to a page on Site B and Site B is redirecting to the mirrored pages on Site A (your site). At least that's what my limited experience with "via this intermediate link" suggests.

This would be easy enough to test by visiting the "intermediate link" on Site B and seeing if it redirects you to Site A.

However, I think there may be an assumption in the second point that, just because the pages are noindex'ed and nofollow'ed, Google is completely unaware of those pages. That's probably not the case.

Warning: Full-blown fantasizing ahead. My starting assumptions:

  • Google is aware of the contents of pages, even if they're noindex'ed.
  • Site C is actually linking to the page on Site B.

I wonder if Google is doing some sort of automatic canonicalization. That is, I wonder if Google is detecting the duplication and assuming that your mirrored versions of the pages are the original. Because of this, Google Webmaster Tools sees the links on Site C to Site B and reports them as links to Site A via Site B, even if that's not reality.

That's a lot of assumptions on my part so don't take it as truth. Hopefully, if I've correctly laid out the scenario you're describing, someone more knowledgeable can offer a better explanation.

--
Ryan

Donna
msg:4400273 - 9:35 pm on Dec 21, 2011 (gmt 0)

To me it looks like an old data set being brought back. At the least, do you see a lot of pages that were removed a while back now re-showing as missing?

lucy24
msg:4400305 - 10:57 pm on Dec 21, 2011 (gmt 0)

rlange, yes, you've got it. Memo to self: try to write in a way that is intelligible to more than 1 out of 3 educated readers. Site C links to site B, and site B most emphatically does not redirect to me :)

I wonder if Google is detecting the duplication and assuming that your mirrored versions of the pages are the original. Because of this, Google Webmaster Tools sees the links on Site C to Site B and reports them as links to Site A via Site B, even if that's not reality.

That was my line of thought too. To a human visitor it is absolutely, unambiguously clear that I'm duplicating someone else's pages. To some human visitors it may look like a vicious parody, but a quick visit to the "real" site will put that idea to rest.

Final detail. The "real" pages could have disappeared any time after last May, because they're targeted towards a specific event. As of a few seconds ago, they're still present, with two changes:

Their human-followable link from the parent site is gone. That happened quite a while ago.

If you request the subdomain name alone, you now get a 403. But if you request any one page by its full name, including query string, it's still there. This is a very recent change, and I'm wondering if that's what led to google's terminal confusion. I'm also wondering how the ### you do that-- but I don't speak IIS, so I'd better not think about it too much.

Oh, by the way. I threw in the Wayback Machine as an analogy. At some time in the future it may be important to know that this group of pages once existed. And thanks to the Wayback Machine's huge time delay, there's no way of knowing whether they have a record of the pages. If they obey "noindex", they don't.

lucy24
msg:4411327 - 11:38 pm on Jan 26, 2012 (gmt 0)

Follow-up to the above:

It took a while, but gwt has now got things sorted out-- with no help from me-- and is no longer crediting me with nonexistent links. Whew.

Incidentally, that "unauthorized mirror" has recently been visited by users in at least two different offices of the relevant government. But I have yet to get an irate e-mail ordering me to take it down right this second Or Else. ::snrk::

I also found the (real) Wayback Machine's fine print. They really do run on a two-year delay, give or take.

n00b1
msg:4468671 - 10:30 am on Jun 23, 2012 (gmt 0)

Sorry to bring up this old thread but I am seeing a similar thing on WMT. To cut a long story short I have moved from the www to non-www version of my website. I did this because of a large number of 'inorganic' links pointing to my website that, despite my best efforts, I just couldn't get removed.

I have seen a number of links listed on WMT with 'via this intermediate link'. The thing is, the www version of my site (where the links point) is a dead end as it simply doesn't exist any more. I don't have any preferred domain set in WMT and there is no redirection between the pages/sites.

I don't know if these links are actually counting as 'votes' for the non-www version of the site or whether they are being listed (perhaps temporarily) so that I can see they exist and request that they are pointed to the new location if I feel it would be valuable. I can only see what I would deem to be 'good' and 'organic' links actually being listed in this fashion at this point but I don't know if the picture is complete.

tedster
msg:4468677 - 11:39 am on Jun 23, 2012 (gmt 0)

I have moved from the www to non-www version of my website

Does this mean that every request for a "www" URL gets a 404 or 410 status response - in the http header, not just on screen?

n00b1
msg:4468678 - 12:04 pm on Jun 23, 2012 (gmt 0)

The DNS has been configured to disable the 'www' hostname. If you type in an address with 'www', or click on a link containing it, you end up at the 'not found' page your ISP serves. So technically no HTTP header is returned by my site.

I did actually want to return a 404, as if nothing else this has proved frustrating for users. The thing is, I couldn't work out how to do it (any ideas?). Current rankings suggest that the 'www' links aren't counting towards the non-www, and all of my pages seem to rank differently to how they did, but I don't want Google to start combining these.

g1smd
msg:4468679 - 12:21 pm on Jun 23, 2012 (gmt 0)

If you type in an address with 'www' or click on a link containing this you get redirected through to the 'not found' response as set up by your ISP.

Is this "DNS Error" or "Request Timeout"?

Either way you're losing traffic.

Returning 404 is easy. Point the DNS to some place on your server where there are no files. Requests for stuff that is 'not found' will return 404 status and the 'not found' error message.
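
A minimal sketch of that setup, assuming an Apache-style configuration (every name and path below is hypothetical, not from the thread):

# Hypothetical vhost: point the www hostname at a directory holding
# nothing but the custom error page, so every request returns 404.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/empty-www
    # /var/www/empty-www contains only 404.html
    ErrorDocument 404 /404.html
</VirtualHost>

Any request except /404.html itself then answers with a 404 status in the HTTP header, which is what tedster was asking about above.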

n00b1
msg:4468682 - 12:41 pm on Jun 23, 2012 (gmt 0)

It's a DNS error, as the 'www' DNS records don't exist. I will see if I can sort out a 404, as that was always the preferred option-- as you said, I am losing visitors! I did try redirecting the www to a page that doesn't exist on the site, but I'm fairly sure that didn't work for some reason.

n00b1
msg:4468685 - 1:06 pm on Jun 23, 2012 (gmt 0)

Right. I have done as you suggested but the primary header response is now a 301 redirect (to the page that itself returns a 404). How exactly will Google treat this? Does the initial response (301) matter or is it the final destination (being 404) that matters?

levo
msg:4468687 - 1:21 pm on Jun 23, 2012 (gmt 0)

Does the initial response (301) matter or is it the final destination (being 404) that matters?


I've recently battled with this problem as well, and couldn't find any reference or answer. If it is a problem, the only solution I can think of is to redirect via a PHP script that checks the destination's headers and only redirects if it exists.
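
A rough sketch of the script levo describes, assuming PHP (the destination host and the filename are hypothetical, not something levo posted):

<?php
// check-then-redirect.php -- hypothetical sketch: only pass the 301
// along if the destination actually answers 200; otherwise 404 here.
$dest = 'http://example.com' . $_SERVER['REQUEST_URI'];
$headers = @get_headers($dest);  // first status line, e.g. "HTTP/1.1 200 OK"
if ($headers && strpos($headers[0], '200') !== false) {
    header('Location: ' . $dest, true, 301);
} else {
    header('HTTP/1.1 404 Not Found');
}
exit;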

n00b1
msg:4468691 - 1:42 pm on Jun 23, 2012 (gmt 0)

Hmm... This brings me back to square one then. Any more insights into this?

g1smd
msg:4468692 - 1:47 pm on Jun 23, 2012 (gmt 0)

Where are you redirecting to?

Resolve the DNS to a folder that has no files.

Define an ErrorDocument for the 404 status.

Upload the file you want to show when the status is 404.

You're done.
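
In .htaccess terms those steps boil down to a single line (the filename is hypothetical), placed in the otherwise empty folder the www DNS points at, with /errors/404.html uploaded as the page to show:

ErrorDocument 404 /errors/404.html

No redirect is involved: the requested URL itself returns the 404 status, with that file as the visible error page.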

n00b1
msg:4468693 - 1:59 pm on Jun 23, 2012 (gmt 0)

Having put the 'www' DNS entries back up, I was trying to redirect using .htaccess. This was a 301 redirect to a location that doesn't exist. The problem is that the first step of this is the 301 itself, which is what the server sends first...

I seem to have cleaned this up so that it returns a 404 (still using .htaccess). I used this code:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.mydomain\.com [NC]
RewriteRule ^(.*)$ http://mydomain.com/nonexistant [L,R=404,NC]

And that seems to push the user over to a location that doesn't exist whilst returning a 404 right from the start. Interesting.

Edit: That seems to bypass my custom 404 page. Great. And I can't resolve to any folder using DNS, as it only lets me point at an entire domain or subdomain. I am also using WordPress, if that makes any difference.

n00b1
msg:4468707 - 3:35 pm on Jun 23, 2012 (gmt 0)

Could I do anything with subdomains? Like setting the 'www.domain.com' DNS to resolve to a subdomain (like 'no.domain.com')? I tried doing this with an empty folder containing only the 404 error document, as you suggested, but visiting 'no.domain.com' gives a LiteSpeed server 'Index of /' page with that document as the only listing.

g1smd
msg:4468712 - 3:56 pm on Jun 23, 2012 (gmt 0)

Why do you think you need to 'redirect'? A redirect tells the browser to ask for a different URL. You don't want that. You want the URL that was requested to directly return a 404 status code in the HTTP header.

If you point the DNS at an empty folder then any request for www.example.com/<anything> will return a 404 response automatically.

You could also return '410 Gone' using:
RewriteCond %{HTTP_HOST} ^www\.
RewriteRule .* - [G]
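
As a quick check, requesting any www URL (with curl -I, for instance) should then show "410 Gone" in the status line and no Location header, confirming that no redirect is involved.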

n00b1
msg:4468719 - 4:12 pm on Jun 23, 2012 (gmt 0)

Like I said, I can't point the DNS at a folder, only at the domain itself or a subdomain.

OK, so the subdomain could be an empty folder, but the problem is I need a 404 to be returned when just 'www.domain.com' is entered on its own as well. That is where the redirects came into play.

I quite like the idea of returning a 410 Gone, although I can't really help out my visitors with that.

lucy24
msg:4468783 - 10:11 pm on Jun 23, 2012 (gmt 0)

I need a 404 to be returned when just 'www.domain.com' is entered on its own as well.

The .* in g1's example will take care of that. It means "the request is for something or nothing". Yes, there is a difference between a null request and no request ;)

If someone types or clicks "www.example.com" their own browser will append a trailing slash, so it reaches the server as "www.example.com/". (Only with bare domain names! Other trailing-slash redirects happen on site.)

RewriteRules never see the "www.example.com/" part, so what's left over for the pattern to match is, by definition, .* (here, an empty string). Since you are serving up a [G] you don't need the capturing parentheses. You only need the RewriteCond to check %{HTTP_HOST}.

If these requests are only coming in for pages or directories-- that is, nobody asks for pictures etc with the wrong domain name-- you can constrain the rule a little further by writing it so it only applies to trailing / or .html.

Caution! If you do this, you need to write a second Condition, either making an exception for "my410.html" or limiting the test to %{THE_REQUEST}, because you will otherwise get into an infinite loop. Something analogous happened to me not long ago with a custom 403 page.
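
Putting those two cautions together, the constrained version might look something like this (the error-page name is hypothetical):

ErrorDocument 410 /my410.html

RewriteCond %{HTTP_HOST} ^www\.
# Exception so the internal request for the error page doesn't loop
RewriteCond %{REQUEST_URI} !^/my410\.html$
# ^$ catches the bare hostname; the rest catches directories and .html pages
RewriteRule ^$|/$|\.html$ - [G]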

Is there really such a thing as [R=404]?

If you decide to go the 410 route, make sure you have a nice 410 page-- or simply use your custom 404 page for both. The built-in 410 page is scary.

n00b1
msg:4468831 - 7:39 am on Jun 24, 2012 (gmt 0)

Thanks Lucy. This is why I liked the 410 idea (it applies to everything I need it to). The problem is I don't know how to serve up a custom error page. I have created a 410.shtml through my server's 'custom error page' utility, but all codes forced from .htaccess go to the default and scary LiteSpeed one. I will see if I can do something manually and not use the utility.

g1smd
msg:4468839 - 8:00 am on Jun 24, 2012 (gmt 0)

In htaccess:

ErrorDocument 410 /errors/error410.php

n00b1
msg:4468847 - 9:27 am on Jun 24, 2012 (gmt 0)

Thanks for replying again.

I did try that, but it doesn't work. I know how to specify a custom 410 page in .htaccess, but for some reason forcing a 410 through .htaccess (as per the code you gave above) serves the default page regardless. I'm stumped.

lucy24
msg:4468859 - 11:33 am on Jun 24, 2012 (gmt 0)

Hm. Does the code work for your other custom error pages?

Some hosts have an alternative system that doesn't require htaccess at all. There's a list of specific names, like Missing for 404 and Forbidden for 403, that the server is told to look for. If the list includes something for 410, you might try that. (OK, so it probably doesn't include it; I just checked and my own host only has four built-ins.)

levo
msg:4468867 - 12:30 pm on Jun 24, 2012 (gmt 0)

ErrorDocument 410 /error_pages/410.html
...
RewriteCond %{REQUEST_URI} !^/error_pages/410.html$
RewriteRule ^/ - [G,L]

You have to exclude the error page.

g1smd
msg:4468933 - 4:09 pm on Jun 24, 2012 (gmt 0)

The rule pattern ^/ won't match any requests in .htaccess context.

Escape the literal period in the other RegEx pattern.

[G] implies [L], so [L] is not needed.
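
For anyone reading later: levo's snippet with those three corrections applied (and the host check from the earlier rule folded back in) would presumably read:

ErrorDocument 410 /error_pages/410.html

RewriteCond %{HTTP_HOST} ^www\.
RewriteCond %{REQUEST_URI} !^/error_pages/410\.html$
# .* matches where ^/ cannot; [G] already implies [L]
RewriteRule .* - [G]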
