Great, so in addition to all the threads complaining that Googlebot is ignoring robots.txt, it's also ignoring meta tags? Are any pages with the noarchive meta tag showing up as supplemental results with caches?
Looks like the only sure way to stop Google indexing a page is to serve Googlebot a 403 error for that page.
I just found a benefit to the noindex meta tag being ignored by Google.
I found a scraper directory with noindex tags on its "details" pages. Each details page copies a site's description, title and keywords six times, repeats them in the alt tags for transparent images, and lists multiple keywords in the URL as well (which means it can easily outrank any site in the directory). They do this to every site in the directory.
These scrapers will show up in the "intitle:yourdomain.com" search in Google even though the noindex tag is in place.
This site is from a foreign country and hosts itself, so contacting the host is useless. I reported the site to Google AdSense and Google Spam, and I have also been writing to all the web design sites listed in the directory asking them to do the same.
If I was a clever chap and had an office in Mountain View CA, I might get the Sitemaps people working on implementing some tools to identify technical problems that could be causing "crawling and indexing" issues. I'd then give some nice instructions that were NOT written by programmers but by someone who could channel Denzel (explain this to me like I'm a six year old) while doing it.
Some of the more cynical of us might notice that we had to sign up for a solution to a problem they created in the first place, but since they are the web, when Google ain't happy, ain't nobody happy.
|My original observation is that pages containing the <meta name="robots" content="noindex"> tag (for the last three years) are now showing in SERPs with a full title and snippet, and have a link to a cached page from 2005 July, and that the cached page clearly shows the meta noindex tag too. |
What with Google using a "crawl caching proxy," shared by various bots, including the Adwords bot and Googlebot, and the Adwords bot ignoring robots.txt, I've been anticipating that problems might happen.
While Google has talked publicly about the Adwords bot ignoring traditional robots.txt prohibitions, it hasn't mentioned anything about the robots meta tag. For a number of reasons, it would be nice to have an official word on how the Adwords bot regards this tag. I'm also wondering whether the Adwords bot ever goes beyond specified landing pages.
Additionally, with regard to g1smd's situation, and that of others whose <meta name="robots" content="noindex"> tagged pages are being indexed, I'm wondering if those affected might simultaneously be using robots.txt to disallow bots to these pages.
I'm also wondering whether anyone who'd disallowed the Adwords bot specifically has seen their blocked pages in the index.
|I'd then give some nice instructions that were NOT written by programmers but by someone who could channel Denzel (explain this to me like I'm a six year old) while doing it. |
<offtopic> graywolf - I've long felt that this should apply to all software documentation and interfaces... even pitched it to some companies... but the idea hasn't caught on.</offtopic>
noarchive example [184.108.40.206]
Don't know how long that link will last.
Has anyone with the Google or Yahoo toolbar been looking at your noindex pages?
One thing I've noticed is that with the toolbar it tries to index those pages. My site is half designed, with a robots.txt on it and no links to it, and it has been indexed by Google. It has had the robots.txt since the day I bought the domain.
|>>>That would be MS rather than G. <<< |
MS thinks they are all things pc.
You are wrong. MS does not dominate only the PC sphere. They also dominate the web/html. They have always enforced their own "web standards" or "extensions of standards" via their Internet Explorer dominance without any prior public discussion.
From [google.com]:
|Google automatically takes a "snapshot" of each page it crawls and archives it. |
Users can access the cached version by choosing the "Cached" link on the search results page.
Note: this tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.
Metatags only prevent displaying "Cached" links or content on SERPs (as long as the display flag checking program runs perfectly).
I can confirm that scraper spiders are running in massive numbers against Google's cached pages, ripping content that was not supposed to be rippable...
If Google is ignoring the noindex tag, we can still make it ignore pages through robots.txt.
Correct me if I am mistaken.
I hope those "secret" pictures don't appear in the index...
The noindex sites that are now indexed and cached by Google, that I have found, have the meta robots noindex tag on every page of the site and do NOT have a robots.txt file at all.
Traditionally, a robots.txt disallow instruction for a URL simply led to the page appearing as a URL-only entry in Google's index if they ever saw a link to the disallowed page. The meta noindex tag ensured that nothing appeared for the page at all. That is now completely broken.
For Yahoo, when they find the meta noindex tag, they do include the page as a URL-only entry, but often they also try to build an entry for the page, by using the anchor text of one of the links pointing to the page (from an external site) as the SERPs entry title!
I also have been finding those sites do not have their robots.txt file set up right. Different versions:
Where are you Google? This is too big of an issue to be quiet about.
Anyhow, while we have your attention, I would like to suggest a robots.txt protocol that utilizes server headers. For example, the following could be used to prevent Googlebot from indexing any jpg file on an Apache server.
<Files ~ "\.jpg$">
Header append Robots "noindex"
</Files>
This way, we can block specific file types on the server side and keep the main robots.txt file compliant for bots that choke on a lot of the proprietary directives or special characters that some bots use. Besides, it's a much more flexible solution than a standard robots.txt file or meta tag, and it would save bandwidth for both parties.
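For anyone wondering what the bot side of that header idea might look like, here is a minimal Python sketch. It assumes the "Robots" response header proposed above, which is only a suggestion in this thread and not something any search engine is known to honor; the function name is made up for illustration.

```python
def header_forbids_indexing(headers, header_name="Robots"):
    """Return True if a response carries the proposed header with 'noindex'.

    `headers` is a plain dict of response header names to values.
    The "Robots" header name is the hypothetical one from this thread.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == header_name.lower():
            tokens = {token.strip().lower() for token in value.split(",")}
            if "noindex" in tokens:
                return True
    return False


# A jpg served with the header would be skipped, a plain one would not.
blocked = header_forbids_indexing({"Content-Type": "image/jpeg", "Robots": "noindex"})
open_file = header_forbids_indexing({"Content-Type": "image/jpeg"})
```

The point is that the bot can decide from the headers alone, for example on a HEAD request, without downloading the file body.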
Wouldn't something like this do the trick?
Some bots choke on * in disallow directives. It's not supposed to be used in that fashion.
Indeed. But you referred to Google, so I replied with that in mind.
But, as you say, it won't work with Yahoo or MSN probably.
Well you get Google on board first and the other major search engines will follow.
Yeah, Google is indexing all my popup "enlarge image" pages, even though I told the robot not to.
Also, when I checked Google Sitemaps it said there was an error trying to crawl fusjkmgvhxkcx.html?
That is a non existent page!
The disallow notation with a * in the URL is for Googlebot only.
The bot asking for URLs that do not exist is trying to see what your real "404 error page" looks like. Many brain dead webmasters serve a 302 redirect for "page not found" and Googlebot gets confused by it. They now test for that condition.
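To illustrate the kind of check being described, here is a rough Python sketch. It is not Google's actual method, just the general idea: build a random path that almost certainly does not exist, request it, and see whether the server answers with a real "not found" status. Both function names are made up for this example.

```python
import random
import string


def random_probe_path(length=13):
    """Build a random path that is very unlikely to exist on the site."""
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return "/" + name + ".html"


def handles_missing_pages_correctly(status_code):
    """A missing page should answer 404 (or 410), not a 200 or a 302 redirect."""
    return status_code in (404, 410)


probe = random_probe_path()
# Request `probe` with the HTTP client of your choice, then pass the
# status code you got back to handles_missing_pages_correctly().
```

If your server fails this check for a made-up URL, that is the misconfiguration the bot is probing for.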
I have a directory set to be ignored via robots.txt, and I recently started to get 404s for links within this directory (which has since been deleted). It is still listed as disallowed in robots.txt, yet a quick search shows hundreds of supplemental results, all within this directory, cached from 10 June 2005.
After playing with the robots.txt analysis tool on the Google Sitemaps stats page, I came across the following.
In robots.txt I have the following line:
Disallow: /dirabc/
Now if I use "Test URLs against your robots.txt file" with http://www.site.com/dirabc it comes up with "Blocked by line 63: Disallow: /dirabc/"
But if I try it with http://site.com/dirabc then I get "Allowed Not in domain"
Guess it's time to go play with .htaccess to get the http://site.com URLs to go to www.site.com. Still, I don't think that the above should be happening.
[edited by: lawman at 11:04 pm (utc) on June 15, 2006]
If you Disallow: /dirabc/ then you have only disallowed http://site.com/dirabc/ and http://www.site.com/dirabc/ and their files, but you have NOT disallowed http://site.com/dirabc and http://www.site.com/dirabc which if accessed can still return a valid DirectoryIndex.
Disallow does not disallow folders or files, but URLs that start with the partial URL mentioned in the exclusion. The disallow is on a per-domain basis, with the robots.txt file being found in the same root folder of the domain as the files that it can exclude.
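The prefix matching described above can be checked with Python's standard urllib.robotparser module. This is only a sketch of the matching rule using the example paths from this thread; it says nothing about how Googlebot itself implements it, and the helper name `allowed` is made up.

```python
from urllib.robotparser import RobotFileParser

# Parse the rule directly instead of fetching a live robots.txt.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /dirabc/",
])


def allowed(url):
    """True if the parsed rules let a generic bot fetch this URL."""
    return rules.can_fetch("Googlebot", url)


# "/dirabc" does not start with "/dirabc/", so it is NOT disallowed.
no_slash = allowed("http://www.site.com/dirabc")
# Anything that starts with "/dirabc/" is disallowed.
with_slash = allowed("http://www.site.com/dirabc/")
inner_file = allowed("http://www.site.com/dirabc/page.html")
```

Changing the rule to "Disallow: /dirabc" (no trailing slash) would cover both forms, at the cost of also matching any other URL that happens to start with /dirabc.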
I recently set up a bunch of pages that are duplicate of content on my site but I wanted to have a separate page for each widget to help my visitors locate what they are looking for once they are on my site.
I put the tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
on each page thinking that would be enough to stop Google indexing it and penalising me for duplicate content.
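Before blaming the bot, it can be worth confirming that the tag is really in the HTML your server sends out. Here is a small Python sketch using the standard html.parser module that pulls the robots directives out of a page; the class and function names are made up for this example, and it only looks at meta tags named "robots".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names, but not values.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


found = robots_directives('<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">')
```

Feed it the source of a page as served, and check that "noindex" is in the returned set.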
Today I notice my site is disappearing intermittently from the Google listings for its most popular phrase. However, when I check its rankings at the Big Daddy Dance Tool site I see that my site is still ranking number 3 for each datacentre.
Would it be possible that Google is now penalising me for duplicate content? I don't have a robots.txt file. Should I have one, and will it help? Or should I just remove the duplicate content altogether? When I say duplicate, it's not really, but it's similar. Kind of like a summary.
Is anyone else noticing anything like this? Or is this intermittent dropping in and out business just normal and related to regular index movements?
My site is more than a year old, but only one page, the home page, had been crawled. That page was indexed and ranked at the top of the SERPs for a few keywords. Yesterday all my pages got crawled, and now I am nowhere in the SERPs. Can anybody suggest what the reason behind that might be?
I have also seen a website which is indexed by Google even though the site owner was using the noindex tag.
Googlebot is not following that tag nowadays.
I don't see it at my end.
Looks like the bad cache is disappearing, 220.127.116.11 is clean, any relationship to the bad data push?
Aside to those concerned about unauthorized caching: I don't know if Google is cleaning up its act re ignoring NOINDEX, but Alexa and MSN are making a mess of it now --
Amazon-owned Alexa shows MSN-crawled SERPs WITH code-forbidden Caches
NOARCHIVE and robots.txt instructions meaningless. Three strikes?
A month ago, Google thought it had 19,000 of my pages. That's waaaay more than makes sense. None of my pages had <robots> tags then but I added <robots follow noindex noarchive> to my biggest category of pages about May 23. Google stopped replacing those pages but continued showing results along with a CACHED copy from just before I made the change.
Over time, G* forgot about the month-old CACHED copies and started using copies from a year or more ago and called them SUPPLEMENTAL. Those obsolete pages seem to be (seem to be) going away a few at a time. Google now thinks it has under 9,000 of my pages. That's still too many but it keeps getting better.
My guess (just a guess) is that the 19,000 counted multiple archived copies of my pages from different dates. Now each crawl of a page with a <robots follow noindex noarchive> tag seems to kill off a few offending obsolete copies.
[NOTE: I am forcing one specific server to remove that as a variable.]
In the two hours since I started the note above, the number of my pages Google reported knowing dropped from "about 8640" to "about 297" which is about what it should be. Now if it will just STAY that way...
G* still delivers some annoying SUPPLEMENTAL RESULTS for some searches but that does seem to be getting better.