Forum Moderators: phranque


How do I block a referral redirect from another domain

getting PR stolen by an archiving site

RickW

6:09 pm on Nov 1, 2004 (gmt 0)

10+ Year Member



There is a personal search engine/archiving site that is using a redirect to one of my pages. At first it just knocked me down a few results in Google, but now I'm not showing up in the results at all and my PageRank has been zapped. I'm being penalized by Google because my page is considered the duplicate.

Google for <my keywords>, and you'll find a result on the 2nd page <with my title>. But it's not my page; instead it's <the archive site's forwarding page>.

I tried using .htaccess to block by domain, but then I realized that will only work if <the archive domain> is the visitor's host, which won't be the case. So then I was looking at some of the HTTP_REFERER code examples, but they were all about blocking hotlinking.

I decided that I don't want to block all referrals from <the archive site>. I was digging around and found that there is another page that is linking to me the conventional way... so I just want to block referrals from <the forwarding page of the archive site>. That way I can regain my PR and page listing, while maintaining a link from them, since they have high PR.

Any easy way I can accomplish this?

[edited by: jdMorgan at 7:06 pm (utc) on Nov. 1, 2004]
[edit reason] Removed specifics per Terms of Service. [/edit]

jdMorgan

7:27 pm on Nov 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



RickW,

Welcome to WebmasterWorld!

Unfortunately, there is no easy solution to this problem. Your problem is not the referrals from the archiving site; it's that Googlebot follows the redirect from the forwarding page on the archive site. And Googlebot does not provide a referrer when it fetches your page, so HTTP_REFERER will be blank.

Therefore, if you block by user-agent=Googlebot only, your page will show up in their search results as a URL-only listing, and it will be the forwarding page's URL. If you block referrals from that forwarding page, then only legitimate visitors who provide a referrer and who come through that page will be blocked; again, this won't affect Googlebot, because it doesn't include a referrer with its requests.

If you block blank referrers, you will have a lot of trouble with legitimate users who access the 'net from behind corporate or ISP caching proxies, or who use the "privacy" settings --sometimes the default settings-- of products such as Norton Internet Security.
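For reference, matching a blank referrer is a one-liner in mod_rewrite, which is exactly what makes it tempting. A minimal sketch, with a placeholder page name, and all of the caveats above apply:

RewriteEngine on
# An empty HTTP_REFERER matches Googlebot, proxy users, and
# privacy-software users alike; you cannot tell them apart here.
RewriteCond %{HTTP_REFERER} ^$
RewriteRule ^my-hijacked-page\.html$ - [F]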

Various things have been tried, and the only things that seem to work so far are:

  • Ask the archive site to delete the forwarding link.
  • Ask the archive site to use a more sophisticated exit-tracking method, such as Google's JavaScript tracking image technique.
  • Failing that, submit a DMCA copyright violation report to Google to have the wrong-URL page removed from Google's index.

All these options have been thoroughly discussed recently in this thread [webmasterworld.com] about redirection- and meta-refresh-based page hijacking in the WebmasterWorld Google forum.

The only protection we currently have is that this URL-replacement problem only occurs when the redirecting page has higher PR than the "victim" page.

Hopefully, Google will fix this problem, but it's not an easy one to fix. In order to reverse the damage without affecting the functionality of many Web sites, Google will have to specifically recognize archive and directory sites and change the way it treats meta-refresh and 302 redirects from that class of site.

Jim

RickW

8:43 pm on Nov 1, 2004 (gmt 0)

10+ Year Member


So just to verify what you are saying: there is nothing I can put in the .htaccess file to block Google from reaching my page when it follows a redirect from another domain? My site logs show the page that is redirecting to me, so shouldn't there be a command in .htaccess that effectively blocks them, regardless of whether it's Googlebot or not?

jdMorgan

9:11 pm on Nov 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


> My site logs show the page that is redirecting to me, so shouldn't there be a command in .htaccess that effectively blocks them, regardless of whether it's Googlebot or not?

Your site logs will not show a Googlebot access with a referrer, any referrer. So you could block the visitors (people) coming from that page (if their browser provides a referrer), but you can't block Googlebot, because it won't provide one. And the Googlebot access is what is causing the problem.
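You can see this for yourself if your server logs the referrer field. With Apache's standard "combined" log format (set in the main server config), the referrer is the next-to-last field, and it is logged as "-" when the client sends none. The entry below is hypothetical, but every Googlebot fetch will look like it:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog logs/access_log combined

# Example (hypothetical) Googlebot entry; note the "-" referrer field:
# 66.249.x.x - - [01/Nov/2004:12:00:00 +0000] "GET /article.php/my-article-id HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"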

That is why this is such a problem, and why the thread I cited above went on for five pages. You can probably save yourself some time by reviewing it, to see all the things that have been tried already. Google will have to fix this problem unless all Webmasters, good and bad, stop using 302 redirects and meta-refreshes on exit pages.

Jim

RickW

12:26 am on Nov 2, 2004 (gmt 0)

10+ Year Member


> ...and why the thread I cited above went on for five pages. You can probably save yourself some time by reviewing it, to see all the things that have been tried already. Google will have to fix this problem unless all Webmasters, good and bad, stop using 302 redirects and meta-refreshes on exit pages.

I began to skim through that thread, but was quickly overwhelmed once people started linking away from the main thread to other threads.

I guess I will have to pop some popcorn, break out a six-pack, and spend a night reading through it. :)

Google should just set Googlebot to leave referrer information for the logs...

jdMorgan

12:34 am on Nov 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


> Google should just set Googlebot to leave referrer information for the logs...

That only works if there is one, and only one, link to your page. Otherwise, they'd have to re-fetch your page every time they found a link to it, in order to "give you a chance" to reject each and every incoming link referrer...

That's why spiders don't do this. They work from a database that may contain dozens to tens of thousands of link referrers to your one page. How would they know which one you won't like without trying all of them? :(

Jim

RickW

3:51 am on Nov 2, 2004 (gmt 0)

10+ Year Member


Okay, I found a solution that Google could implement. They already set up this proprietary meta tag:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">

Now they can create a new one:

<META NAME="PERMANENT" CONTENT="domain.com">

All of the search engines can adopt this new tag. It lets an engine know that if the URL's domain doesn't match this meta tag, then the URL is either temporary or invalid and should be dropped from the results.
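In the page head it would look something like this (hypothetical, obviously, since no engine supports it today; domain.com stands in for the page's real domain):

<head>
<title>My Article</title>
<!-- hypothetical tag: declares the only domain this page should be listed under -->
<META NAME="PERMANENT" CONTENT="domain.com">
</head>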

RickW

1:00 pm on Nov 5, 2004 (gmt 0)

10+ Year Member


I would like to point out my final solution.

First I tried denying the offending site by domain, but then realized that will only work if the visitor is using that domain as a host. I then tried blocking by referrer, but as explained above, that won't help with Googlebot. Here is what my .htaccess file looks like now:

# Deny requests from hosts that reverse-resolve to the offender's domain
<Limit GET PUT POST>
order allow,deny
allow from all
deny from .offendersite.net
</Limit>

RewriteEngine on
# Return 403-Forbidden to any request whose referrer is on the offender's site
RewriteCond %{HTTP_REFERER} ^http://(www\.)?offendersite\.net/ [NC]
RewriteRule .* - [F]

# Send the hijacked article's URL to Google instead
Redirect /article.php/my-article-id [google.com...]

I decided, heck with it. I temporarily removed the page that was being hijacked, and redirected its URL to Google. If Googlebot reads that redirect the same way it did for the offender's page, then it's not going to be happy with the target, and will hopefully penalize the site.

Fighting fire with fire...

jdMorgan

2:47 pm on Nov 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


> msg#4: ...but you can't block Googlebot because it won't provide a referrer.

Your code won't work as intended for Googlebot, because HTTP_REFERER will be blank. Therefore, the RewriteCond will fail, Googlebot won't get redirected, and it won't see the google.com URL. If you truly removed your pages, then all Googlebot will get is a 404 Not Found response.

If you block blank referrers (and you could), then you will block many visitors who access the Web through their ISP's or corporation's caching proxy, including virtually all AOL users. This may lead to lost business and a lot of time spent on "customer support." During the time that you leave your block in place, loyal repeat visitors may give up and delete your site from their "Bookmarks" and/or "Favorites" lists.

Many sites are suffering from this same problem, and I am very sympathetic to your predicament, but Google is going to have to use sophisticated means to thwart this hijacking. When done intentionally, it exploits a basic weakness in the HTTP 302 redirect response mechanism: the lack of any check for "authority" to request reassignment of content to a new URL. The authors of HTTP simply never imagined that a 302 might be used dishonestly. Your proposal in msg#5 addresses this weakness directly, and would be a good solution if it did not require millions (or billions) of Web pages to be updated in order to gain "protection." Google will prefer (and likely seek) a solution that does not require old pages --some very valuable-- to be updated.

Jim

jdMorgan

3:08 pm on Nov 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


After a second cup of coffee and a re-read of your code, I realize that I missed the Redirect directive at the end. That technique was discussed in the previous thread, but was reported to have failed.

I'll leave my post as-is, in case the Googlers at the 'plex might see the last paragraph, which stands on its own despite my error in interpreting your code.

Rather than removing your page, you might consider another approach: Drop the redirect and add a <meta name="robots" content="noindex,follow"> tag to that page instead. It has been my experience that Googlebot will drop a "noindex" page from the index, but will re-fetch it periodically to see if the robots directive has changed. Therefore, it's more likely that no long-term damage will be done to the page's "reputation." In the meantime, the content of that page remains available to your legitimate visitors.
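To be concrete, that means dropping the Redirect line from your .htaccess and putting this in the head of the page itself (the title is just a placeholder):

<head>
<title>My Article</title>
<!-- do not index this page, but do follow the links on it -->
<meta name="robots" content="noindex,follow">
</head>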

Jim

RickW

3:12 pm on Nov 5, 2004 (gmt 0)

10+ Year Member


My site is still relatively new and I am not making any money off it, so I'd rather cut off the toe before I lose the foot.

I'm going to put the page back up, but with a new ID, so it will be recrawled and put back in its proper place in the results.

Since the original page is no longer available, it doesn't matter that I'm messing it up for other visitors who come across the link... and I'm only blocking blank referrers who are finding my page through the offender's link.
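For anyone following along, the blank-referrer block scoped to just the retired URL looks something like this (the article path is a placeholder for my real one):

RewriteEngine on
# Only requests for the old, hijacked URL are affected;
# those arriving with no referrer at all get a 403.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{REQUEST_URI} ^/article\.php/my-article-id
RewriteRule .* - [F]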

RickW

3:26 pm on Nov 5, 2004 (gmt 0)

10+ Year Member


> After a second cup of coffee and a re-read of your code, I realize that I missed the Redirect directive at the end. That technique was discussed in the previous thread, but was reported to have failed.

> Rather than removing your page, you might consider another approach: Drop the redirect and add a <meta name="robots" content="noindex,follow"> tag to that page instead. It has been my experience that Googlebot will drop a "noindex" page from the index, but will re-fetch it periodically to see if the robots directive has changed.

I knew I was missing something. Be right back, got to get some coffee too...

The noindex is an interesting idea. Exactly how does the caching of pages have an impact on search results and PR anyway?