Joeduck, you should ask for removal of these links from Google.
You will probably have to re-enter them in your robots.txt (*), but it will be easier if they have a generic component, as there's a limit to the size of your robots.txt file when you request removal from Google with the URL console.
Also, you might have to enter them one-by-one in the url-console, which will take some time.
(*) That is, if you can't make them return html with the meta tag <meta name="robots" content="noindex">
Thanks Claus - this makes sense, though I'm worried these are a symptom rather than the problem itself. The links all appear to be from our cgi and cf directories that are used to send people to our major affiliates. Hopefully we can use wildcards in the Google process, but I have not checked yet. We'd excluded these directories when the problem started; now we allow them.
I would be interested in seeing it
Email me the info so I can take a look at it.
I think if I can look at a few cases I may be able to see whether there is anything we can do to either stop it or identify it.
Yeah, they are symptoms, but the problem is not on your end, it's on Google's. In this case they see a URL (your redirect URL) and assume that this equals a document, even though it does not.
Once a URL is indexed it will not be removed by putting it in "robots.txt" - this will only keep the spider from revisiting the URL. In order to get it removed you must specifically request removal.
If you've got your redirect script in some folder, like, say:
... then you can just put "/redir/" (or "/redir/redir.php") in your "robots.txt", you don't need to put in every single redirect.
(i have removed such URLs from my own sites a few times, so i know the process)
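For reference, a minimal robots.txt along those lines might look like this (the /redir/ path here is just an illustration matching the example above; substitute your own script's folder):

```text
User-agent: *
Disallow: /redir/
```

A single Disallow line on the folder covers every redirect URL underneath it, which is what keeps the file small enough for the removal console.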
Excellent and thanks for any advice. Claus - do you think these bogus "link pages" replace other legitimate pages?
They can do, I suspect, if Google sees that they are duplicates of something else (which may well happen with the screwy way that they treat some sorts of redirects these days).
I'm with g1smd here: If SE's are allowed to follow those links you might be "hijacking" some of the target pages before you know it - i've got all mine robots.txt'ed for the same reason
(and the additional reason being that i like to have control over what is indexed - i especially don't want "internal" things or "errors" to be indexed. All i want in the index is my real pages and nothing more - one URI per page. For that reason i do remove all kinds of different stuff that should not be there whenever i see it. I like to keep things clean, as this helps me avoid "surprises" of many different kinds.)
Claus this could be a real problem for people running AdSense, because you're not allowed to exclude Googlebot from any part of your site.
If you run a robots.txt file you have to allow Google full access in the first line.
What about the rel="nofollow" (or was it rel="noindex") attribute that Google "invented" just a few months ago...
Can you use that? Would it work?
Reid why are you saying that? We've had several excluded directories and have run AdSense for some time. To Google's credit (but our frustration), our AdSense reps have been nice talking about this but unable to help with our problems because they are very separated from the search side of things.
RE: Nofollow - we've been discussing that and I favor placing them at most of our outbound links.
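For the record, the attribute Google announced is rel="nofollow" (not "noindex"), and it goes on individual links rather than in the page head. A hypothetical outbound affiliate link would look like:

```html
<!-- example.com stands in for the affiliate target; not a real URL -->
<a href="http://www.example.com/offer" rel="nofollow">Our affiliate</a>
```

It tells the engines not to pass credit through that particular link, which is why it is a candidate for outbound affiliate redirects like the ones discussed above.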
Non-riot thinking (points we all agree on?):
1. Search Engines cannot function properly if one webpage can influence the serps of another webpage.
2. Protocols must exist, and redirects serve a valid purpose.
3. It is commendable that Search Engines want to avoid duplicate content.
As I read all the discussion, everyone is barking at points 1 and 2, because of point 3.
If it wasn't for point 3 (duplicate content), who cares about redirects from other webpages, as long as your own page is still in serps on its own merit.
If bark we must, then I suggest we bark at Google for their method of eliminating duplicate content (define fair/unfair or show certain duplicates).
Some suggest the current G choice is based on PageRank, arguing that the redirect badguys have a higher PageRank and focus on webpages with lower PageRank, so that they get chosen in the duplicate content dilemma, and not the other guy. Maybe so.
The problem of criteria for eliminating duplicate content has always existed; maybe it is aggravated by the inventiveness of badguys using redirects, and maybe protocols need to be changed (point 2) to deal with that. But duplicate content is close to "similar" content, and before you know it we are talking about how serps ranking choices are made.
A level playing field is impossible....except maybe for napoleon.
We can only watch as Google copes.
|Reid why are you saying that? We've had several excluded directories and have run adsense for some time. To Google's credit (but our frustration) our adsense reps have been nice talking about this but unable to help with our problems because they are very separated from search side of things. |
I should not have said that. Last month I read in the AdSense guidelines 'do not use a robots.txt file', and then buried in the optimization tips there is a line:
If you have a robots.txt file, remove the file or add the following two lines to the top of the file:
This change will allow our bot to crawl the content of your site, so that we may provide you with the most relevant Google ads.
I wonder if media-partners-google* includes 'googlebot'?
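The two lines the guidelines refer to are not quoted above; as far as I recall, they were the standard Mediapartners allowance, something like:

```text
User-agent: Mediapartners-Google*
Disallow:
```

Mediapartners-Google is the AdSense ad crawler, a separate user-agent from Googlebot, so this entry on its own grants the ad bot access without saying anything about the search crawler.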
|Claus this could be a real problem for people running adsense because you're not allowed to exclude googlebot from any part of your site. If you run a robots.txt file you have to allow google full access in the first line. |
I have started a new thread on this topic in the AdSense forum, so let's continue that issue over there:
How much should Google be allowed to spider WRT AdSense? [webmasterworld.com]
|If it wasn't for point 3 (duplicate content), who cares about redirects from other webpages, as long as your own page is still in serps on its own merit. |
I think the problem is much deeper than that.
If your home page is the temporary location of the hijacking page, then the hijacking page takes the home page's place in the SERPs.
This could also happen aside from duplicate content issues, i.e. Googlebot sees that it has indexed the page AND the temporary location of the page. Aside from duplicate content, it says 'I have indexed the same page twice', so it removes the temp page (your home page) and leaves the hijack page which points at the temp.
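The mechanism above hinges on the HTTP status the hijacking link returns. A minimal sketch of the distinction, using a made-up raw response (example.com and the path are placeholders, not a real hijack URL): a "302 Found" plus a Location header says "the document is temporarily here", which is what can get the redirecting URL indexed in place of the target, whereas a 301 would credit the target page instead.

```python
# Extract the status code and Location header from a raw HTTP response,
# to see which kind of redirect a link answers with.
def parse_status_and_location(raw_headers: str):
    """Return (status code, Location value) from raw HTTP response headers."""
    lines = raw_headers.strip().splitlines()
    status = int(lines[0].split()[1])          # e.g. "HTTP/1.1 302 Found"
    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return status, location

# Made-up response of the kind a tracking/redirect script might send:
example = """HTTP/1.1 302 Found
Location: http://www.example.com/yourpage.html
Content-Length: 0"""

status, target = parse_status_and_location(example)
# status is 302 (temporary) and target is your page -- the risky pattern;
# a 301 status would signal a permanent move and credit the target URL.
```

Pointing a check like this at a suspect link (via a HEAD request that does not follow redirects) shows whether it is serving the 302-plus-Location pattern discussed in this thread.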
Having followed this thread from the start, I took action.
I found 302 redirects to 3 of my pages, low traffic 3rd level stuff,
but the jackers rated above me for test phrases of mine.
Here's what I did: I temporarily renamed files at the host forcing 404 errors.
I used the Google emergency removal tool and selected "remove all".
I changed the filenames back quickly before they might get spidered by anyone.
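The renaming works because the removal tool wants the URL to return a 404. An alternative sketch, assuming Apache with mod_alias and purely hypothetical paths, is to serve "410 Gone" for just those URLs while the request is pending, so the rest of the site is untouched:

```apache
# Hypothetical paths -- answer "410 Gone" for the affected URLs while
# the removal request runs, then delete these lines afterwards.
Redirect gone /old-page-1.html
Redirect gone /old-page-2.html
```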
The next day, I checked on my requests, and all three reqs were "DENIED"!
I have no idea why denied, but it seems to have done some good anyhow!
All 3 pages fell to the bottom dead end of the listings when
I ask for site:mysite.net. Dead last, all three, plus a 4th one I forgot about.
AND, all four jacked URLs are now marked 'supplemental result'.
None of my pages are so marked.
None of the jacks rate above me for my original
test snippets any more.
Comments anyone? Do I have this under control at least? - Larry
larryhatch - I had the same thing. When I removed one of these URLs I got a 'successful', but the url is not gone; it only fell into the 'omitted pages' in a site: search. It didn't change the sandbox situation though, but that may be another issue.
On another site I had a similar url from the same place appearing in site: this was before the URL removal tool became known to me.
I contacted the offending directory and asked them to remove my site. They removed it without a response. My site then came out of the sandbox and has been rising rapidly in google traffic. On this one I can't remove the url from Google though, because the go.cgi file is not returning a header for that particular id#.
On the previous site they are refusing to remove the site from their directory; they must have been hit hard by the 302 scare. They have banned my ip too.
On a separate thread, there is discussion of Google looking into
files that they ordinarily don't .. java stuff and the like.
I wonder if there's a chance they are looking for dodgy 302 redirects.
I can dream, can't I? - Larry
Personally I think this is the reason behind this whole strange activity going on at the plex since mid March. Everyone is reporting rollbacks and rolling 'old database' results.
I seriously think they are running tests, doing rollbacks, and changing algos, trying to fix the 302 thing.
I bet they know exactly what the problem is (but they'll never tell) but it is not so easy to get rid of.
I've seen things like this a thousand times in the semiconductor
industry. Somebody screws up royally due to an oversight.
They will never admit it, but quietly, they make sure the problem gets fixed.
If that's the case, I don't need mea-culpas .. just some reassurance
that the problem is indeed being addressed.
I can't help but feel that my "denied request" did have some beneficial effects.
One odd thing. Unlike so many pond scum scrapers, the particular one
I dealt with actually had good taste! He only scraped the very best
sites in my arcane field (UFOs) avoiding all the loonies and
amateurish junk. If he put all his black-hat efforts into some honest
research, he could actually be a positive influence in the field!
As it is, he has the most authoritative people in the field ready
to cut his head off; those who can figure out what he's doing that is.
Very ironic. - Larry
We also have not seen a real update since Allegra.
We also can see it's a big problem because we have NEVER seen so many pages in the serps as supplemental results.
[edited by: zeus at 1:08 pm (utc) on April 9, 2005]
Did anybody remove a hijacked page with the Google emergency removal tool more than 90 days ago?
After those threads
I wonder whether Google will visit the pages again after 90 days (or maybe more) and the problem will start all over again.
Are all the non-dmoz.org URLs, found in a site:www.dmoz.org search, an example of the problem that Google says doesn't exist?
g1smd : Exactly!
All these seem to be flagged as "Supplemental" which, i think, means that they will probably not show up in a regular search. So, DMOZ might not have problems because of these - it's when they turn up in regular searches that they can cause problems.
Keeeeripes! Thanks for the tip.
I Googled up site:www.dmoz.org .. what a menagerie!
This is scraper central.
If nothing else comes to the attention of G, this should. - Larry
[edited by: ciml at 9:39 am (utc) on April 11, 2005]
[edit reason] No specifics please. [/edit]
Ooops! I should have read Claus's post first.
Yup, all 'supplemental'.
That brings up another question. Can I assume that
supplemental results are penalized in some way? -Larry
The supplemental results rarely show in a search, and the flip side is that the non-supplemental duplicated page shows a lot further down in the serps.
A page might not be a Supplemental Result for all search queries that it is returned for.
I'm not 100% clear on the effect of a "supplemental" stamp, i have to admit that. Thinking about it, i do see these in results for regular queries sometimes (which is probably also what they're there for)
The snippet for a Supplemental Result is never updated. It comes from an ancient archive deep in the Googleplex. It can easily represent content last seen on the page 3 or 4 years ago.
For a different search query the same page might be returned in the results, but might be a normal result and with a more up to date snippet.
At no time is there a rule to say that the words in the snippet can still be found in the cached page or on the real live site.
Something surprising. [Thanks to the fellow who stickied me this tip]
site:www.dmoz.org brings up NOTHING AT ALL any more.
site:dmoz.org (no www.) yields 11.2 million pages, all real dmoz URLs
as far as I looked (several pages worth) and not a scraper in sight!
Did this thread embarrass somebody, or is it just coincidence?
I don't understand why the www should make any difference. -Larry
I can confirm that the "jacker" urls within a site view are no longer showing.
I had a sticky from a fellow member I was working with; he went looking for the 302s I had stickied to him that were showing up as being part of his site.
I also confirmed that the leeches attached to one of our sites also no longer show up in a site: search.
And a certain Drudge no longer has any attached to his site. In fact I looked at 15 sites that I knew had leeches, and they were all gone.
Now is the problem fixed?
I don't know; it could just be hidden.