
Google News Archive Forum

This 260 message thread spans 9 pages; this is page 7.
Is there an update going on?
willybfriendly
msg:127678 · 11:51 pm on Oct 30, 2004 (gmt 0)

I see some pretty significant changes in the SERPs across a number of terms. Also note that the cache is showing pages I updated within the past week.

I have noted more than the usual shifting of pages in and out of the SERPs for the past week or so, and of course we have seen threads about updated backlinks (FWIW).

Anyone else seeing changes?

WBF

 

jnmconsulting
msg:127858 · 7:36 pm on Nov 12, 2004 (gmt 0)

Not only is it wrong, but it is dangerous, for all of us, to have those types of directories inventoried in the index. A good search engine manipulator can actually use Google or others to find those directories and specific operating systems to perform malicious acts: send data to the CGI interface, find administration files and user logs, etc.

I view this as a larger problem than anything else.

There are hacker guides out there that show how to do this and the search parameters to use with Google.

The Contractor
msg:127859 · 7:50 pm on Nov 12, 2004 (gmt 0)

They don't "go" there. They just list links they see on pages they are allowed to go to.

Uhmmm... so are you telling me that I could make a bunch of links to pages like cgi-bin/googleshouldntbehere.cgi and, if the files didn't exist, it would still list the files/names? If not, then it would have to go there to see if the file existed - right? And it shouldn't if it's blocked... right?

The Contractor
msg:127860 · 7:58 pm on Nov 12, 2004 (gmt 0)

A good search engine manipulator can actually use Google or others to find those directories and specific operating systems to perform malicious acts.

I just had someone on the phone and we were doing searches like that; amazing what people leave open :(

My whole point is Google never did this in the past 3 years on the site - does it really have to include useless files to grow the index? It states the site has over 24,000 pages indexed when it really should be about 1/3 of that.

steveb
msg:127861 · 8:37 pm on Nov 12, 2004 (gmt 0)

If Google reads links on a page it can index, it will make URL only listings for those links if it can't (or doesn't) actually get to the pages.
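The behavior steveb describes can be sketched with a toy link collector using Python's standard `html.parser` (the page content and URLs here are invented for illustration): a crawler can record every `href` it sees on pages it is allowed to fetch, without ever requesting the target URLs themselves.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Records every href seen on a fetched page, without fetching the targets."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.found.append(value)

# A page the bot IS allowed to crawl, linking into a blocked directory.
page = '<a href="/cgi-bin/blocked.cgi">tool</a> <a href="/about.html">about</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.found)  # ['/cgi-bin/blocked.cgi', '/about.html']
```

Both URLs are now known to the crawler even though neither was requested, which is all a URL-only listing needs.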

The Contractor
msg:127862 · 8:55 pm on Nov 12, 2004 (gmt 0)

Well, I've been around a "little" while and had never experienced this before this new "larger" index. So, in theory, I could link to a lot of pages/sites that don't exist and Google would read those links as text... Interesting; I wonder if Google thought this through at all? Let alone, as someone mentioned, the security implications of caching file names that are blocked... a pretty stupid move, I'd say. I found dozens of sites I could take down in a second just doing some quick searches earlier.

g1smd
msg:127863 · 9:01 pm on Nov 12, 2004 (gmt 0)

I have a page that has had <meta name="robots" content="noindex, follow"> on it for over 2 years and the URL does appear as a URL-only listing in some Google results, and has always done so.

I always assumed that Google would keep a list of every link it had ever seen, as well as the status for that link (in the internal database). What surprises me is that those URLs can turn up in public results.

scoreman
msg:127864 · 10:03 pm on Nov 12, 2004 (gmt 0)

I noticed someone mentioned that the index was at 23 mil? Eh, take a look now

"Google's index nearly doubles to more than 8 billion pages."

mark1615
msg:127865 · 10:16 pm on Nov 12, 2004 (gmt 0)

Same here. They are totally ignoring the robots command. This is VERY dangerous. They have opened up who knows how many sites to hackers. As someone said earlier, with minimal knowledge and the info G is now showing, a malicious hacker could take down any number of sites. Thanks, G. And they're worried about SEO? Please. Try to get priorities in order - try not exposing everyone to destruction just so you can say you have indexed a bazillion pages - which is misleading anyway. The hypocrisy is truly staggering.

instinct
msg:127866 · 10:24 pm on Nov 12, 2004 (gmt 0)

I have a very bad feeling on this one. Google will get spanked for this, as they should.

It sounds like it was a way to artificially increase the 'indexed pages' count - which was an obvious PR move in order to counter msn's 'big' announcement.

Is G the one using shady tactics to compete with Microsoft?

Say it ain't so Google!

Philosopher
msg:127867 · 10:35 pm on Nov 12, 2004 (gmt 0)

The URL-only listings are really nothing new. They have been around for a year or two at least, and if memory serves, GG has commented on them as well. As has been said, they are simply URLs that G has found but not spidered, either because they haven't gotten around to it or because they are not allowed to.

I know one of my sites uses redirects to point to specific affiliate pages. This setup meant I had thousands of separate links running through the redirect script. I disallowed the script in my robots file but whenever I would check in G to see how many pages of the site were indexed, I would find all the redirect URLs listed as URL only. This has been this way for well over a year.

I would guess the only interesting thing is that these URL only listings are being seen more now. Is this because they are coming up more in natural searches, or because people are looking at what pages are indexed a bit more closely because of the size jump?

I would guess the latter.

coconutz
msg:127868 · 10:59 pm on Nov 12, 2004 (gmt 0)

>>It sounds like it was a way to artificially increase the 'indexed pages' count - which was an obvious PR move in order to counter msn's 'big' announcement.

No different than the same stunt they pulled on February 17, 2004 (announcing their index was now 6 billion "items"). [google.com...]

Just one day prior to Yahoo unveiling their own search engine and dumping Google. [docs.yahoo.com...]

WebFusion
msg:127869 · 11:52 pm on Nov 12, 2004 (gmt 0)

Yep, cheap PR stunt (I still wonder why the press hasn't stopped their love-fest and done some digging on these obvious PR moves).

I'm sure everyone remembers the one-upmanship that Google and the (then) AllTheWeb (FAST) were doing a few years back when it came to the "size" of their respective indexes. One of them would announce "our index has grown to XXX pages," and the other would come back and one-up them.

Is it any coincidence that Google's "new" bot was spidering so absolutely furiously over the last 4-6 weeks or so? In my opinion, the word at the 'Plex probably went something like this:

GooglePRDude: "Msn is about to start their PR push. What can we do right now to counter it? I Know!....let's just let our bot index every single thing it can find, as fast as possible, and we'll tell the world that we have the biggest index so we can steal some of their thunder."

GoogleEngineer: "Sure, we can do that, but it will bring in even more spam. We're having trouble controlling it now, if we bring in millions more spammy pages, we're going to see an even bigger relevance hit than we're already getting."

GooglePRDude: "So? That's what Adwords is for! Besides, we've got our stock options now, so who cares if we're not the MOST relevant, as long as we keep the strongest BRAND!"

europeforvisitors
msg:127870 · 2:42 am on Nov 13, 2004 (gmt 0)

Nice fantasy, but PR people don't have that kind of influence in real life.

Powdork
msg:127871 · 5:47 am on Nov 13, 2004 (gmt 0)

While checking to see how many pages I had indexed (today) I also decided to do a site:www.google.com [google.com] search. Quite a few of the listings are not from Google.

Robert Charlton
msg:127872 · 6:48 am on Nov 13, 2004 (gmt 0)

googleshouldntbehere.cgi

I experienced what I think you're discussing and posted the question on this thread back in 2003...

Problem with Googlebot and robots.txt?
Google indexing links to blocked urls even though it's not following them
[webmasterworld.com...]

Response from GoogleGuy to my post...
If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.

Part of my response to GoogleGuy...
...I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.

Google obviously didn't follow my advice.

Jim Morgan's post (referenced by his link) about what's behind this problem has gotten moved. Here's his reference (originally posted about something else) with the new url and part of his comments....

Question about simple robots.txt file
[webmasterworld.com...]

If Google finds a robots.txt Disallow for a page, it will remove the page's title and description from its search results. It will also no longer match search terms to the words on that page. So, the page essentially disappears from the Google search results pages. However, if Google finds a link to that page, it will still show that page in results when someone clicks on "More results from <this domain>".
I went around and around with this, trying to find a way to tell them "don't mention my contact forms pages at all, please", and here's what I ended up with:
For Google, don't Disallow the page in robots.txt, but place a <meta name="robots" content="noindex"> tag in the head section of the page itself.

You'll also need to do this for Ask Jeeves/Teoma; their handling of robots.txt is the same as Google's.
All the others seem to interpret a robots.txt Disallow as "don't mention this page at all."

The results I saw way back were in regular serps, not in "More results..."

Adding <meta name="robots" content="noindex"> to the page (in addition to using robots.txt) got rid of the problem. But, again, I don't think Google should be doing this, particularly if the primary motivation is to enhance page count. Who needs it?

Powdork
msg:127873 · 7:02 am on Nov 13, 2004 (gmt 0)

If you use it in conjunction with robots.txt, won't the robots.txt keep Googlebot from ever crawling the page, and thus keep it from finding the noindex,nofollow tag in the head?
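Powdork's point is exactly how the standard robots exclusion flow works. A minimal sketch using Python's stdlib `urllib.robotparser` (the rules and URLs are hypothetical): a polite crawler consults robots.txt before fetching anything, so a Disallowed page is never downloaded and any meta robots tag on it is never seen.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking the cgi-bin directory.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# A disallowed page is never downloaded, so a
# <meta name="robots" content="noindex"> tag on it is invisible to the bot.
for url in ("http://example.com/cgi-bin/form.cgi",
            "http://example.com/about.html"):
    if rules.can_fetch("Googlebot", url):
        print(url, "-> fetch, then honour any meta robots tag on the page")
    else:
        print(url, "-> skip fetch; any meta tag on the page goes unseen")
```

This is why combining the two directives is self-defeating: the robots.txt Disallow prevents the engine from ever reading the noindex tag that would suppress the listing.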

MHes
msg:127874 · 9:02 am on Nov 13, 2004 (gmt 0)

Google employee.... "People on the forum are moaning"
2nd google employee.... "Good, we must be getting rid of more spam then."

:)

The Contractor
msg:127875 · 12:46 pm on Nov 13, 2004 (gmt 0)

Adding <meta name="robots" content="noindex"> to the page (in addition to using robots.txt) got rid of the problem. But, again, I don't think Google should be doing this, particularly if the primary motivation is to enhance page count. Who needs it?

Bob, you know which site I am talking about - check for yourself and you will see that the meta tag is, and always has been, <meta name="robots" content="noindex,nofollow">, so that must not even work anymore. What's funny is that it has obeyed that (robots.txt and meta) on a couple of other sites I have worked on and has not listed their pages. Funny, because their robots.txt files are basically exact duplicates of my own.

Google employee.... "People on the forum are moaning"
2nd google employee.... "Good, we must be getting rid of more spam then."

Uhm... this is the opposite. These are program files on sites that have no business being in the index; they are useless in there. Think of it as the same as this search: allinurl:dmoz.org/editors/ site:dmoz.org

What possible purpose does it serve to have 114,000 pages that are meant to be hidden from the public showing up in Google? Why put this kind of trash in their index when there are sites that have been waiting many months to "really" be included? Completely ridiculous... wait till someone with a big company name and deep pockets approaches them for exposing sensitive information they did not want made public.

MHes
msg:127876 · 1:59 pm on Nov 13, 2004 (gmt 0)

The Contractor - Of course you are correct. In another thread Msndude said 'we honour robots.txt' which may have been a bit of a dig at google.

I think adding these pages to the index is a mistake. They cannot run the risk of lawsuits coming at them thick and fast. However, whoever values a presence in Google is probably not going to complain, unless they are very short-sighted.

The best way for them to reinflate their index is to reinclude all the previously banned sites and shove them at the bottom of the serps.

zeus
msg:127877 · 2:45 pm on Nov 13, 2004 (gmt 0)

I noticed today that I'm not down 70% in incoming visits but more. I also noticed that many more pages in the site: search are domain-only, which could be the reason. Google is really messing up the SERPs these days.

Another thing: my link count even changes, sometimes 460, the other day 461. I have never seen that before; it is always stable until new backlink counts.

Powdork
msg:127878 · 4:11 pm on Nov 13, 2004 (gmt 0)

Bob, you know which site I am talking about - check for yourself and you will see that the meta tag is and always has been <meta name="robots" content="noindex,nofollow"> so that must not even work anymore.
See message #196

The Contractor
msg:127879 · 4:29 pm on Nov 13, 2004 (gmt 0)

Powdork, I agree with you ;)
I notice all the Google cheerleaders are absent from this "problem" discussion... it seems no one, including Google, can defend this action. I'm not trying to be as rough on Google as it may sound, but they should face this "problem" and fix it. I have found some pretty sensitive information, along with many security holes, in websites.

Of course if you wanted to capitalize on this it would be very easy to send these websites, companies, government sites, and education facilities a polite email or phone call stating that you have discovered a security hole on their site and you specialize in online security... "We would be willing to fix it for $xx..." ;)

quotations
msg:127880 · 5:01 pm on Nov 13, 2004 (gmt 0)

Of course if you wanted to capitalize on this it would be very easy to send these websites, companies, government sites, and education facilities a polite email or phone call stating that you have discovered a security hole on their site and you specialize in online security... "We would be willing to fix it for $xx..." ;)

I believe that if you do a search you will find that several people who are sitting in prison sent very polite emails and phone calls just like that.

It is usually interpreted as a threat and very quietly prosecuted as blackmail.

You might want to try the search someplace other than google.com unless you really want to find directories of "people sitting in prison" and "people sitting in prison with Via8r!" and "people sitting in prison with h@ir los$" ;-)

The Contractor
msg:127881 · 5:07 pm on Nov 13, 2004 (gmt 0)

Uhmm... no, I wasn't talking about blackmail... Google is putting it out there for the world to see; it's not like you used a script to break into or hack their network/site or webserver. Myself, I would appreciate someone bringing it to my attention that I have publicly viewable security or sensitive documents/info showing.

markus007
msg:127882 · 5:40 pm on Nov 13, 2004 (gmt 0)

Myself, I would appreciate someone bringing it to my attention that I have publicly viewable security or sensitive documents/info showing.

At the end of the day you are responsible for your own security; if Googlebot can access those pages then so could any competitor if they wanted to snoop, with or without the pages listed in Google.

The Contractor
msg:127883 · 5:50 pm on Nov 13, 2004 (gmt 0)

At the end of the day you are responsible for your own security; if Googlebot can access those pages then so could any competitor if they wanted to snoop, with or without the pages listed in Google.

So if I find a security hole on your site, then I guess the only ethical thing to do is post it for the public to see... right? I think you are missing the point here. Google should not list pages/files from a folder/directory it is blocked from, period. Even on a site very closely related to webmasterworld.com there is a tool whose users, I believe, would rather not have everyone know which sites they are running through it - not a big deal, but to me it's a privacy issue in this case. Yes, the cgi-bin is blocked, and in my opinion those pages should not be listed in Google.

markus007
msg:127884 · 7:16 pm on Nov 13, 2004 (gmt 0)

I understand what you are saying, but even if Google didn't list the page you would still have the security hole; why not just fix the problem?

Robert Charlton
msg:127885 · 7:18 pm on Nov 13, 2004 (gmt 0)

Bob, you know which site I am talking about - check for yourself and you will see that the meta tag is and always has been <meta name="robots" content="noindex,nofollow"> so that must not even work anymore.

Yes, I see that the meta tag is there on all the pages in googleshouldntbehere.cgi, but the URLs do indeed show up in the index. You couldn't really password-protect the cgi directory without blocking access to the pages and thus impeding the functionality of the site. Fortunately, there's nothing critical in these pages... it's just messy for Google to have them indexed... but I gather you have seen sensitive info on other sites.

If you use it in conjunction with robots.txt won't the robots.txt keep Googlebot from ever crawling the page and thus keep it from finding the noindex, nofollow tag in the head.

Good question. More from Jim Morgan in his msg #12 of
[webmasterworld.com...] --

...you may ask, "Well then, what good is robots.txt, if these search engines treat Disallows this way? Why not just use the robots metatag and forget robots.txt?"

The answer is that using robots.txt saves bandwidth. If a page is Disallowed in robots.txt, Google and AJ/T will list the page URL (with no title or description) if they find a link to it, but they will not download the page. On the other hand, in order to see the on-page robots metatag, a search engine *must* download the page. So using a robots.txt Disallow for those engines which treat it as "don't mention it" can save you a lot of bandwidth if the pages are large or spidered often because the site has high PR or link popularity. As a result, I have many pages which are disallowed for all engines except Google and AJ/T, and also are tagged with a meta name="robots" content="noindex,nofollow" specifically for Google and AJ/T.
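JdMorgan's setup can be sketched as a pair of hypothetical config fragments (the paths and user-agent groups are invented for illustration, following his description):

```
# robots.txt -- disallow the directory for engines that treat Disallow
# as "don't mention the page at all", but leave Googlebot free to fetch
# the pages so it can see the on-page meta robots tag.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /contact/
```

and, in the head section of each page under /contact/:

```
<meta name="robots" content="noindex,nofollow">
```

The empty `Disallow:` in the Googlebot group allows it everything, while all other bots fall through to the `*` group; Googlebot then downloads each page and is suppressed by the meta tag instead.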

meta name="robots" content="noindex" isn't working in The Contractor's case. Also, I note that g1smd posted this in msg #186 above...

I have a page that has had <meta name="robots" content="noindex, follow"> on it for over 2 years and the URL does appear as a URL-only listing in some Google results, and has always done so.

I always assumed that Google would keep a list of every link it had ever seen, as well as the status for that link (in the internal database). What surprises me is that those URLs can turn up in public results.

I hope that this is just a bug in what's got to be a huge project for Google. If it's intentional, once again, I've got to say I think that Google ought to rethink why they're displaying these links publicly.

The Contractor
msg:127886 · 9:12 pm on Nov 13, 2004 (gmt 0)

I understand what you are saying, but even if Google didn't list the page you would still have the security hole; why not just fix the problem?

I don't have a security problem, and I really don't care that Google lists three times as many files for a site as are readable by the public... I'm simply stating that Google should not list files that are blocked by robots.txt... why is that hard to understand? Many of the files shown in Google could not be found otherwise, or at least not easily. Try to get an index or file list from someone's cgi-bin and see what happens…

By doing this, Google has opened security and privacy issues. I would think that as a bunch of geek programmers they would know the meaning of bloat. Right now they are stating they have increased their index and patting themselves on the back... when all they have done is bloat their index with worthless files that are not supposed to be viewed in public, and they have also exposed security and privacy issues.

Let's put it this way... if the new MSN search had these files in its index under the conditions that Google does (by not fully obeying robots.txt), people would be bashing the heck out of them... You would see it on every major online news site, about their disregard for privacy and security… Since it's Google, the cheerleaders stay quiet or try to defend it by blaming the site owners for having these files on their site/server… simply amazing.

Google will probably not fix this because it would end up with the smallest index of any of the major search engines… pretty sad when the day comes that they have to try the bigger is better routine… they started out as quality and have resorted to bloating their index with useless and/or private files.

This topic should have been in its own thread…

vrtlw
msg:127887 · 10:02 pm on Nov 13, 2004 (gmt 0)

This was first observed in March [webmasterworld.com] but the thread wasn't really picked up on.

added:- That is a great reference from JDMorgan - nice find.

The Contractor
msg:127888 · 12:51 am on Nov 14, 2004 (gmt 0)

vrtlw - Thanks!

I sure wish a mod would move these posts to a new, on-topic thread so the discussion could get more participation. The subject has gotten way off course from this thread's original topic... of course it was my fault for that ;)

I'm sure all the mods and admins are busy at the conference... hehe

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About
© Webmaster World 1996-2014 all rights reserved