| 12:51 am on Nov 14, 2004 (gmt 0)|
vrtlw - Thanks!
I sure wish a mod would move these posts to a new, on-topic thread so more people could participate. The subject has drifted way off this thread's original topic... of course, it was my fault for that ;)
I'm sure all the mods and admins are busy at the conference... hehe
| 1:59 am on Nov 14, 2004 (gmt 0)|
This is one of those 'deafening silence' threads. All one-sided, and with good reason: who would defend this nonsense? One can only assume/hope that some meetings have already taken place at the Plex trying to figure out how to backtrack on this one, especially if this was public-relations/ego driven, as it would seem.
| 2:03 am on Nov 14, 2004 (gmt 0)|
they even have cgi files listed that I haven't had on the site since LAST CHRISTMAS.
| 2:41 am on Nov 14, 2004 (gmt 0)|
|they even have cgi files listed that I haven't had on the site since LAST CHRISTMAS. |
That matches up nicely with
From this thread
|Google has been checking at a high rate older 404 pages, from nearly a year ago. Interesting the rate it is going at and also that the age of the database they are using is pre Oct 2003. |
| 7:18 am on Nov 14, 2004 (gmt 0)|
Back to the original topic
"Is there an update going on?"
I'm currently seeing the biggest movement I've seen in a while with my sites. A sandboxed site that has been in the high teens consistently for an uncompetitive search has moved up to six. I am also seeing some movement with my mature sites. Haven't noticed anything across the board yet, though.
| 7:41 am on Nov 14, 2004 (gmt 0)|
I want to chime in here with a few points, and a warning:
First, my original comments cited by Robert_Charlton early in the robots.txt part of this thread mention Google's "additional results." Bear in mind that my original comments were posted quite a long time ago.
Second, just to clarify/reiterate: If you Disallow a page in robots.txt, then your <meta name="robots" content="noindex"> tag can have no effect on a working "reputable" robot, because such a robot won't fetch the page. If it does not fetch the page, it can't parse the meta-tag.
Third, the Standard for Robot Exclusion [robotstxt.org] says that robots.txt is intended to control robots' fetching of pages. It does not mention search engines or search engine listings at all.
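The second point can be sketched with Python's standard-library robots.txt parser (the rules and URLs here are hypothetical, just to illustrate the mechanism):

```python
from urllib import robotparser

# A working "reputable" robot consults robots.txt before fetching anything.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The robot never fetches the disallowed page, so it can never see any
# <meta name="robots" content="noindex"> tag inside it.
print(rp.can_fetch("Googlebot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/public.html"))        # True
```

So a Disallow makes the noindex meta-tag unreachable: the fetch is refused before the page contents are ever parsed.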
Google was proud of including URL-only listings for pages which they found through links but could not spider. Googleguy's comment was something to the effect that they were finding more of the "Deep Web" -- and the example he cited clearly showed that he meant indexing links to obscure-but-useful pages, and not that they intended to reveal people's cgi-bin files.
While I agree with some who have posted here that this URL-listing is a lousy idea, I disagree that it is malicious or simply an attempt to grow the index. As currently implemented, it is simply a mistake, and one with some rather bad consequences. I'd rather discuss those consequences in a thoughtful manner than clutter the thread with name-calling and accusations; that is not how to win a debate.
And now the warning: Since I first posted the information about AJ/Teoma joining Google in including link-based URL-only-listings, Yahoo (Slurp) has joined the club and has posted (and may still be posting) these linked-to-but-non-spiderable pages. But rather than a plain URL, Y! uses the link text from the link they found as the title of the listing.
If we are to continue down this road, it is time for Google, AJ, and Y! to review what they are doing, and then triple the size of their "Webmaster Help" pages to include all the details of how Webmasters can keep potentially-troublesome URLs out of their indexes.
| 8:35 am on Nov 14, 2004 (gmt 0)|
I don't normally get involved with the daily debate in SE forums. Yes, I skim thru the 'big ones' daily looking for a single clear concept from an authority on the subject. (here's the pay-off line) ....
But I was so impressed, nay, illuminated by the previous posting by JD (in "Is there an update going on?") that I was impelled to write and express my appreciation of your clear and lucid clarification of the situation.
Congrats - you moved me to make my first forum posting in some years, on any subject.
| 1:24 pm on Nov 14, 2004 (gmt 0)|
Ok, I will adhere to the rules - I only wish these were all in one thread ;)
I can only say I searched Teoma, Yahoo, and the new MSN, and I do not see the same problem at all.
Do you see the problem at Yahoo/Teoma/MSN preview where they are listing files from areas blocked by robots.txt?
|While I agree with some who have posted here that this URL-listing is a lousy idea, I disagree that it is malicious, or simply an attempt to grow the index. |
Could someone/Googleguy then please explain why, or at least give some logical reason, they are fetching URLs of files which the site owners have expressly gone out of their way to keep from being found by including that directory/folder in their robots.txt?
It makes absolutely no sense to me, but maybe I am missing something and people do want their shopping cart orders including credit card info, setups and config files for scripts/programs, personal info, and other files to be neatly inventoried by Google? Yes, in many cases those files should NOT be kept on the web at all, even behind an .htaccess lockdown - but in my opinion that does not excuse Google for what it is doing.
I apologize if I come off in the wrong way, but the privacy issues alone bother me.
So what is the reason for Google taking an inventory of files blocked by robots.txt?
1. They can't control their bot, or it has a major bug?
2. They feel they will be able to expose/weed out some tactics they don't like from a website?
3. Without these useless files their index would dwindle. The only reason I say this is that, on average, on many of the sites that I know and/or have control of, Google is listing as many as three times the pages the sites have or want viewable to the public. This is not meant to be a jab, but what would happen if they took out all these useless files that have been blocked via robots.txt? They have almost a million files/entries from dmoz alone in their index that are blocked via robots.txt - let's face it, their index would dwindle very fast...
| 1:29 pm on Nov 14, 2004 (gmt 0)|
|I see some pretty significant changes in the SERPs across a number of terms. Also note that the cache is showing pages I updated within the past week. |
I have noted more than the usual shifting of pages in and out of the SERPs for the past week or so, and of course we have seen threads about updated backlinks (FWIW).
Anyone else seeing changes?
This thread is now at 22 pages and very little is actually on-topic. Is anyone seeing an update?
| 2:04 pm on Nov 14, 2004 (gmt 0)|
Is anyone seeing an update?
No, only daily flux.
| 4:39 pm on Nov 14, 2004 (gmt 0)|
I think this is the main reason.
|They feel they will be able to expose/weed out some tactics they don't like from a website? |
Based on some testing in progress, I suspect the 302 redirect/meta redirect page jacking problem is being addressed. Perhaps uncloaking cloaked pages is another goal.
I've been seeing fresh tags on pages daily since 9/25 for SERPS that typically only show fresh tags every 2-3 days, and it continues today. I'm convinced we're seeing (and complaining about ;) ) a "work in progress," not a finished update.
| 7:45 pm on Nov 14, 2004 (gmt 0)|
|Based on some testing in progress, I suspect the 302 redirect/meta redirect page jacking problem is being addressed. |
Can I ask why you suspect this?
| 7:45 pm on Nov 14, 2004 (gmt 0)|
"but maybe I am missing something"
I confess I don't understand what you are upset about. These URLs are clutter, no doubt about that part, but everything displayed is on a publicly accessible page. A human could see any of these URLs just by going to the page. Google indexes what humans can see, unless told not to. If you don't want URLs listed that humans can see (and in fact could see in the cache of the page), then you need to go back one more level in your robots.txt Disallow, until you reach the point where none of the visible URLs are a problem. Disallow the page with the URLs on it, not merely the destination pages.
| 9:05 pm on Nov 14, 2004 (gmt 0)|
No, in many cases you couldn't see the URLs by going to the directory/folder.
Take for example the following:
Do a search on Google for allinurl:/cgi-bin/ site:searchengineworld.com and start on about page 6. There is no reason in the world those should be in there - that's a privacy issue to me. If you know some of the popular programs, scripts, carts, etc., you can find everything from orders with credit card info to system setups.
Just because something exists does not give Google or anyone else the right to show it when someone has intentionally blocked it. Jeez, I know what's under the clothes of a woman walking down the street; that doesn't give me the right to view what's there or show it to others ;)
| 9:20 pm on Nov 14, 2004 (gmt 0)|
I think you are missing the point.
Pages, albeit just the URL, should not appear on any SERPs page if disallowed.
The page the link is on can appear in the SERPs, but listing the actual URL as a separate listing is contrary to the Disallow instruction.
The webmaster wants the page he is linking to (and has 'disallowed' from the index) to be available only from specific pages, not Google SERPs. Whatever the reason, that is a fair request, and it is being ignored. Anybody else may list the page, but I suspect that would be equally annoying.
Humans can find the page via a site, and that is the point... the webmaster wants the page not to be listed by spiders but only found by humans following a controlled path.
Just because the URL is 'public' by being on other pages does not mean the URL should have its own listing. Disallow means disallow, not 'Oh, it says disallow, but I will list it anyway because a human could find it.'
| 9:22 pm on Nov 14, 2004 (gmt 0)|
|Jeez, I know what's under the clothes of a woman walking down the street; that doesn't give me the right to view what's there or show it to others |
I can't quite agree here - it all depends on how much was on display already. If said woman was not so good at concealing her own assets, then I am sure there would be a few comments as she made her way down the street.
Security of a website (or private parts) is the sole responsibility of the webmaster (or dresser) - no excuses.
| 9:36 pm on Nov 14, 2004 (gmt 0)|
|Security of a website (or private parts) is the sole responsibility of the webmaster (or dresser) - no excuses. |
One of the tools available for privacy/security is robots.txt. I'm sorry, but I can't help but laugh, and it's not personal. It's just that if Google ran a script to get inside password-protected areas, there are people who would still blame it on the webmaster in defense of Google.
I have one question for those who don't seem to understand:
If I find a security hole or private information on your site or anyone else's, even though I was going through areas of the site I was forbidden from entering - is it OK to post the site's private, confidential, or security-risk material publicly on my website?
That is what you have to ask yourself. Is it ethical? Is it legal?
| 9:45 pm on Nov 14, 2004 (gmt 0)|
"The page the link is on can appear in serps, but listing the actual url as a seperate listing is contrary to the disallow instruction."
No, it isn't, and I suspect that is the problem here. Suppose site.com has a complete robots.txt ban. Google will not crawl site.com.
But if siteb.com links to site.com, site.com will show as a URL only listing due to it being publicly viewable from siteb.com which has no robots.txt prohibition against indexing the information on its page.
Both allinurl:/cgi-bin/ site:searchengineworld.com and allinurl:/cgi-bin/ site:dmoz.org are good examples. All the information listed there is accessible to people, and thus everything is information Google can list. If something there leads to some webmaster stupidity, then that has nothing to do with anything. Googlebot functions as a person would. If you don't want a person with no password permissions to see stuff, don't have it in the HTML of pages people/Googlebot can see.
"even though I was going through areas of the site I was forbidden from going to"
The point you don't seem to understand is that the information listed is viewable from places where you have invited Googlebot - or at least not dis-invited it.
You can't come in my house unless I invite you. But you can know its address because I put it in the phone book.
((I'm not saying this is the best way to do things, only that Google is behaving as it has in the past, and that it conforms to the rules. I'd personally prefer they had different policies regarding this stuff though.))
| 9:53 pm on Nov 14, 2004 (gmt 0)|
steveb you have not answered my question:
If I find a security hole or private information on your site or anyone else's, even though I was going through areas of the site I was forbidden from entering - is it OK to post the site's private, confidential, or security-risk material publicly on my website?
That is what you have to ask yourself. Is it ethical? Is it legal?
If you believe it is ethical and legal for me to do so, I will be happy to take a look at any sites I can verify you own and publicly post links to any private, confidential, or security-risk material I find. Do we have a deal?
| 10:00 pm on Nov 14, 2004 (gmt 0)|
|You can't come in my house unless I invite you. But you can know its address because I put it in the phone book. |
You may invite someone into your house, but I don't think you want them going through your bedroom drawers.
| 10:48 pm on Nov 14, 2004 (gmt 0)|
>You can't come in my house unless I invite you. But you can know its address because I put it in the phone book
That's a good analogy, so let's run with it...
Google indexes the phonebook.com/index and there is a listing:
"If you go to Steveb's house he will hit you with a saucepan.
[link]Steveb's home address[/link] (this links to a page with the address, which is 'disallowed')
Someone searches for Steveb's address on Google and sees the disallowed page in the SERPs... they go to Steveb's house unaware of the danger.
That's why Google should respect 'disallowed' and not show a separate listing.
| 1:16 am on Nov 15, 2004 (gmt 0)|
MHes, I don't believe it's true that steveb hits people with saucepans.
| 1:31 am on Nov 15, 2004 (gmt 0)|
I just bought a Louisville Slugger for that.
The Contractor, you are asking a question that doesn't apply. "Even though I was going through areas of the site I was forbidden from going to" - Google does not do this. The actual question is whether it is ethical to show private information a webmaster has made public, and the answer is yes. Google is not indexing forbidden pages. If an allowed page links to a forbidden page, that information is available, and Google is doing nothing wrong with that. Again, I personally think there is a better way - not showing URLs of pages *never* indexed - but the choice Google makes is a logical and ethical one. They aren't doing anything wrong, but they could do something better.
| 1:36 am on Nov 15, 2004 (gmt 0)|
"You may invite someone into your house, but I don't think you want them going through your bedroom drawers."
Which again highlights how I think you are misunderstanding what they are doing. If you invite someone into your house, they can see that you *have* bedroom drawers. That's it. They do not go through the drawers.
If you don't want people to know you have drawers, don't let them in the house. If you don't want people to know your address by looking in the phone book, then don't list it there. Your protection is always one step back.
| 3:26 am on Nov 15, 2004 (gmt 0)|
Again, I personally think there is a better way, which is to not show URLs of pages *never* indexed, but the choice Google makes is a logical and ethical one.
Ethics aside for a minute, I'm just wondering by what reasoning you think this is a 'logical' thing for Google to do?
I think if Spock were here he would disagree ;-)
| 3:34 am on Nov 15, 2004 (gmt 0)|
|If you don't want people to know you have drawers, don't let them in the house. If you don't want people to know your address by looking in the phone book, then don't list it there. |
From the security point of view, I agree. From the privacy point of view, the house/drawers analogy to page/links doesn't feel right to me. I don't think you'd expect the phonebook to publish a map of the inside of your house.
|If you don't want a person, with no password permissions, to see stuff, don't have it in the html of pages people/Googlebot can see. |
There's no question that some of the material that's now exposed should be, at the least, password protected. But I don't see the point of Google indexing URLs of pages with noindex tags that reside in directories that are blocked.
Going way back to why I'd explored a similar situation in the first place...
...we had material completely open to the public, but we didn't want either the material or the links to it to show up in the index.
I can imagine that people have lots of reasons for keeping non-secure info accessible to the public but keeping the location out of the index. Is the current indexing situation going to make this impossible?
| 4:12 am on Nov 15, 2004 (gmt 0)|
PS... As to the question of whether there's an update going on, I can't really say, except that Friday I noticed some unusual movement in a site "sandboxed" because of domain name change.
It had, over the past four months, very slowly worked its way up to the fourth page. On Thursday or Friday, I noticed that it had dropped to about the sixth page. Today it jumped up to the second page and then dropped down to page three, an unusual amount of movement.
One spasm does not an update make. Other movement I'm seeing, among established sites I watch, appears relatively gradual and normal.
| 4:18 am on Nov 15, 2004 (gmt 0)|
To highlight the size of the mistake/glitch, view Msg#:102
|cache on 1 of our pages reads |
"as retrieved on 31 Dec 1969 23:59:59 GMT."
This really shows me the magnitude of the glitch. It will be fixed - it has to be, otherwise Google will die. I'm seeing more and more cached dates on my pages with the same.
Patience, people - these cgi files will disappear. Maybe Google should go back to the monthly dance; at least then the work behind the scenes wouldn't be live for us to see.
And to get back on course: no real SERP change, just the normal day-to-day flux.
| 4:18 am on Nov 15, 2004 (gmt 0)|
>> I don't see the point of Google indexing urls <<
But. They are NOT indexing it. They are not showing a title or a description for it. They haven't retrieved the contents of the file.
| 4:46 am on Nov 15, 2004 (gmt 0)|
Then, pray tell, where did they retrieve it from to render the text on your screen? The URL is indexed, i.e., it is part of their 9 billion pages. The URL is not crawled, and there is no data, fancy or otherwise, associated with it.
|But. They are NOT indexing it. |
This brings up an interesting question. Could we force a robots.txt-protected page to show up for an unusual query with enough anchor text? If not, does that say something about the way these pages are indexed? Or does it say something about the way Google returns all pages? An interesting experiment would be to take two pages, one protected by robots.txt and the other with no title and no content, then point links from the same page with the same anchor text at each page and see the result.
| 5:35 am on Nov 15, 2004 (gmt 0)|
"where did they retrieve it from to render the text on your screen"
From a page that they index. I think some folks are not understanding how these URLs exist.
I put this URL here; once this page is crawled, that URL will show for a site:dmoz.org search. This obviously has no connection to what the dmoz.org robots.txt says, since the URL is on WebmasterWorld. As it happens, that URL will never get indexed, so it won't get a full listing, but Google is doing nothing wrong by including that URL among all the URLs that show for a site:dmoz.org search, because it has seen it on a page that is indexed.
Some people may not like the logic, but for a site search I'd like to see pages Google knows exist, rather than URLs Google has merely seen in HTML. Pages that actually exist are what make up a "site".