Forum Moderators: phranque
Google for <my keywords>, and you'll find a result on the 2nd page <with my title>. But it's not my page; instead, it's <the archive site's forwarding page>.
I tried using .htaccess to block by domain, but then I realized that will only work if <the archive domain> is the visitor's host, which won't be the case. So then I looked at some of the HTTP_REFERER code examples, but they were all about blocking hotlinking.
I decided that I don't want to block all referrals from <the archive site>. Digging around, I found that there is another page that links to me the conventional way... so I just want to block referrals from <the forwarding page of the archive site>. That way I can regain my PR and page listing while keeping a link from them, since they have a high PR.
Any easy way I can accomplish this?
[edited by: jdMorgan at 7:06 pm (utc) on Nov. 1, 2004]
[edit reason] Removed specifics per Terms of Service. [/edit]
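For reference, the kind of single-page referrer block being asked about would look roughly like the sketch below. The domain and path are made-up placeholders, and (as the replies explain) it only affects visitors whose browsers actually send a Referer header:

RewriteEngine on
# Forbid requests whose Referer header points at one specific forwarding page.
# "archive-example.com" and "/redirect/12345" are hypothetical placeholders.
RewriteCond %{HTTP_REFERER} ^http://(www\.)?archive-example\.com/redirect/12345 [NC]
RewriteRule .* - [F]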
Welcome to WebmasterWorld!
Unfortunately, there is no easy solution to this problem. Your problem is not the referrals from the archiving site; it's that Googlebot follows that redirect from the forwarding page on the archive site. But Googlebot does not provide a referrer when it fetches your page, so HTTP_REFERER will be blank.
Therefore, if you block by user-agent=Googlebot only, then your page will show up in their search results as a URL listing only, and it will be the forwarding page's URL. If you block referrals from that page, then only legitimate visitors who provide a referrer and who come through that forwarding page will be blocked, but again, this won't affect Googlebot, because it doesn't include a referer with its requests.
If you block blank referrers, you will have a lot of trouble with legitimate users who access the 'net from behind corporate or ISP caching proxies, or who use the "privacy" settings --sometimes the default settings-- of products such as Norton Internet Security.
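For concreteness, here is roughly what those two blocks would look like in .htaccess, with the drawbacks just described noted as comments. This is only a sketch using mod_rewrite:

RewriteEngine on

# Block by user-agent: Googlebot can no longer read the page, so it is reduced
# to a URL-only listing -- and under the forwarding page's URL.
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F]

# Block blank referrers: this also hits visitors behind caching proxies and
# users of "privacy" software that strips the Referer header.
RewriteCond %{HTTP_REFERER} ^$
RewriteRule .* - [F]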
Various things have been tried; the approaches that seem to work so far have all been thoroughly discussed recently in this thread [webmasterworld.com] about redirection- and meta-refresh-based page hijacking in the WebmasterWorld Google forum.
The only protection we currently have is that this URL-replacement problem only occurs when the redirecting page has higher PR than the "victim" page.
Hopefully, Google will fix this problem, but it's not an easy one to fix. In order to reverse this damage without affecting the functionality of many Web sites, Google will have to specifically recognize archive and directory sites and change the way that it treats meta-refresh and 302 redirects from that class of site.
Jim
Your site logs will not show a Googlebot access with a referrer, any referrer. So you could block the visitors (people) coming from that page (if their browser provides a referrer), but you can't block Googlebot because it won't provide a referrer. And the Googlebot access is what is causing the problem.
That is why this is such a problem, and why the thread I cited above went on for five pages. You can probably save yourself some time by reviewing it, to see all the things that have been tried already. Google will have to fix this problem unless all Webmasters, good and bad, stop using 302 redirects and meta-refreshes on exit pages.
Jim
I began to skim through that thread, but was overwhelmed quickly once people started linking away from the main thread to other threads.
I guess I will have to pop some popcorn and break out a six-pack and spend a night reading through it. :)
Google should just set googlebot to leave referer information for the logs...
That only works if there is one and only one link to your page. Otherwise, they'd have to re-fetch your page every time they found a link to it in order to "give you a chance" to reject each and every incoming link referrer...
That's why spiders don't do this. They work from a database that may contain dozens to tens of thousands of link referrers to your one page. How would they know which one you won't like without trying all of them? :(
Jim
META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"
Now they can creat a new one:
META NAME="PERMANENT" CONTENT="domain.com"
All of the search engines can adopt this new tag. It lets Google know that if the URL domain doesn't match up with this meta tag, then it is either temporary or invalid and should be dropped from their results.
First I tried denying the offending site by domain, but then realized that only works if the visitor's host is in that domain. I then tried blocking by referral, but as explained above, that may not help with Googlebot. Here is what my .htaccess file looks like now:
# Deny by remote host/domain -- this only matches when the visitor's host
# resolves to the offending domain, so it won't catch ordinary visitors.
<Limit GET PUT POST>
order allow,deny
allow from all
deny from .offendersite.net
</Limit>

# Forbid requests referred from the offending site (affects visitors only;
# Googlebot sends no referrer).
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^http://(www\.)?offendersite\.net/.*$
RewriteRule ^.* - [F]

# Temporarily send the hijacked article off-site.
Redirect /article.php/my-article-id [google.com...]
I decided heck with it. I temporarily removed my page that was being hijacked and redirected it to Google. If Googlebot reads that the same way it did for the offender's site, then it's not going to be happy with the target, and hopefully it will penalize the site.
Fighting fire with fire...
msg#4: ...but you can't block Googlebot because it won't provide a referrer.
If you block blank referrers (and you could), then you will block many visitors who access the Web through their ISP's or corporation's caching proxy, including virtually all AOL users. This may lead to lost business and a lot of time spent on "customer support." During the time that you leave your block in place, loyal repeat visitors may give up and delete your site from their "Bookmarks" and/or "Favorites" list.
Many sites are suffering from this same problem, and I am very sympathetic to your predicament, but Google is going to have to use sophisticated means to thwart this hijacking. When done intentionally, it is exploiting a basic weakness in the HTTP 302 redirect response mechanism: the lack of a check for "authority" to request reassignment of content to a new URL. The authors of HTTP simply never imagined that a 302 might be used dishonestly. Your proposal in msg#5 addresses this weakness directly, and would be a good solution if it did not require millions (or billions) of Web pages to be updated in order to gain "protection." Google will prefer (and likely seek) a solution that does not require old pages --some very valuable-- to be updated.
Jim
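To make the mechanism concrete: on the redirecting site, a plain temporary (302) redirect is all it takes, and nothing in HTTP checks whether that site has any "authority" over the destination URL. In Apache terms it can be as little as one directive (the path and domain below are hypothetical):

# A forwarding page on the archive/offending site might be nothing more than:
Redirect temp /redirect/12345 http://www.victim-example.com/article.php

When Googlebot follows such a redirect from a higher-PR page, it can end up listing the destination content under the forwarding page's URL, which is exactly the hijack described above.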
I'm going to put the page back up but with a new ID, so it will be recrawled and put back in its proper place in the results.
Since the original page is no longer available, it doesn't matter that I'm messing it up for other visitors who come across the link... and I'm only blocking blank referrers who are finding my page through the offender's link.
After a second cup of coffee and a re-read of your code, I realize that I missed the Redirect function at the end. That technique was discussed in the previous thread, but was reported to have failed. I'll leave my post as-is, in case the Googlers at the 'plex might see the last paragraph, which stands on its own despite my error in interpreting your code.
Rather than removing your page, you might consider trying another approach: Drop the redirect and add a <meta name="robots" content="noindex,follow"> to that page instead. It has been my experience that Googlebot will drop a "noindex" page from their index, but that they will re-fetch it periodically, to see if the robots directive has been changed. Therefore, it's more likely that no long-term damage will be done to the page's "reputation." In the meantime, the content of that page will remain available to your legitimate visitors.
Jim
I knew I was missing something. Be right back, got to get some coffee too...
The noindex is an interesting idea. Exactly how does the caching of the pages have an impact on search results and PR anyway?