Requests for:
[error] [client 64.68.82.31] File does not exist: /public_html/maillist/maillist_signin.asp
If this is Google, have they turned their technology loose on the web to create spam mailing lists to sell, in a quest for even greater profits?
Call me paranoid or what, but this is what's out there.
What happened is simple:
- someone tried those urls while using the google toolbar
- the toolbar sent the url back to google,
- google tossed the url in the spider inbox
- attempted to spider the pages
- end of story.
eg: this was reported on webmasterworld a year ago.
[webmasterworld.com...]
eg: [webmasterworld.com...]
Most of us have been using the toolbar to "submit" new pages for over a year. Works perfectly. Simply visit the url with the toolbar and along comes googlebot. (I'm talking hundreds of cases of this where the pages are not linked anywhere).
Simply visit the url with the toolbar and along comes googlebot
I think they are only following Webmasterworld admins and mods to make them paranoid. ;)
The thread that you cite from last year is a great "must read." G-Guy has said the same thing since then.
Most of us have been using the toolbar to "submit" new pages for over a year. Works perfectly. Simply visit the url with the toolbar and along comes googlebot. (I'm talking hundreds of cases of this where the pages are not linked anywhere).
I've seen the same thing myself and tried to say so in a thread. G-guy sort of denied it. A forum member challenged me and I couldn't understand why because I (basically) found the same thing that Lisa did. I didn't know who to believe. Now I do: Lisa.
But lately I notice Googleguy using a lot of terms like "should," "ought to," "eventually" and "maybe." I understand and grasp the constraints that he operates under, but sometimes I don't see the point of chiming in just to say "don't worry, be happy" (or words to that effect.) If he cannot (in fact) reveal the Google policy at work, why say anything at all? Sometimes I feel that stuff is going on that Googleguy disagrees with, but can't say so out loud.
I have a properly constructed, non-commercial site with adequate backlinks that is five months old, and ONE page is listed in Google. For about 56 hours we had "fresh" tags and a current "cache." We still have no PR and no "links," and now we're back to a cached image circa the first week in June.
"Fresh deep bot?' Maybe if you're fighting for the top listing of selling Viagra online....
I'll say it again. :) I don't think our privacy policy prevents Google from doing this, because we are allowed to use anonymous user data to improve our search, but installing the toolbar didn't make googlebot crawl your page. See [google.com...] for some of the typical ways that urls leak. Other ways include people guessing urls, network/DNS setups, etc.
My teacher called this the "post hoc" logical fallacy ("It rained after I washed my car, so washing my car must cause it to rain!"). One of the reasons I'm here is to dispel myths; if people still want to believe myths after I've dispelled them, that's their business. ;)
Fearless, just to be extra clear: we don't do this. My personal opinion is that our privacy policy would let us. But as of right now, if someone tells you that the toolbar caused their page to be crawled, they're mistaken. Hope that's definitive enough for ya? ;)
P.S. While we're on common myths, advertising on Google doesn't cause sites to show up in the index either.
OK. I get it.
that's definitive enough for ya?
I was taught about "post hoc, ergo propter hoc" a long time ago. AND I see a lot of that confusion going on all the time in this forum. Temporality does not equal causality.
However, on my "hobby" site, I accidentally (I think) replicated Lisa's experiment and got the same result.
Many people have made posts to this forum to the effect of how their new site has been crawled and indexed in "48 hours."
Which certainly is not consistent with my experience of late. In one thread you went so far as to say "it's sort of a curve" or something to that effect. In other threads you've referred to sites linking to mine AND to "people finding your site." (You've used that phrase more than once if my feeble memory serves me right.) To me, that implies some measure of traffic.
How do you measure that?
And if you can be "definitive": do backlinks via jumpmenus count in your evaluation of a new site?
How about backlinks via php pages? How about other scripts like asp? cfm?
I'm not talking about links within my site being crawled. I mean with a new site will this type of link help us over "the curve?"
Do the hackers or Google have a magic key that gives their crawler wide open access to this /maillist/maillist_signin.asp program?
A magic key? I thought this was from a 'security news portal'?
However if I was Google (or GoogleGuy) I would certainly be concerned about the amount of negative publicity they've been attracting recently. The white knight of web search is becoming decidedly grey...
And if you can be "definitive": do backlinks via jumpmenus count in your evaluation of a new site? How about backlinks via php pages? How about other scripts like asp? cfm?
a side note: This thread title was changed (for obvious reasons), but it was changed to the same exact title as another current thread.
Referral leaking is a common way that we discover unindexed pages, but that has nothing to do with the toolbar. I still differ with you Brett, but feel free to mail me some examples (deep pages--none of this root page stuff ;).
GoogleGuy smiles and maintains his position.
I should make a t-shirt like that, but not too many people would 'get it' ;)
I understand people coming at this from a 'business' perspective when their livelihood depends on the free traffic they can muster from the SEs, but that isn't an excuse (imho) for all the whining and accusations and conspiracy theories that get bandied about.
Not all 'business' needs to be so brutal and cut-throat. Sure, some will say it's needed in a capitalistic society, but I don't think so.
Apply the integrity and 'good content' theory to your whole business perspective. It pays off, I believe.
I have to admit I didn't even read the whole article.
Anyway, that's my 2 shekels.
You mean four variables? Google will do one or two, maybe three, maybe not, but not four or more. Use mod_rewrite to make the URL into something that looks like folders instead. If any of the variables contain id= then Google will ignore the link as having a session ID.
If any of the variables contain id= then Google will ignore the link as having a session ID.
You sure? That's a bit amateurish for Googlebot - after all, 9/10 newbie database-driven pages pass "id" as a parameter referring to the ID of a row in a database,
viewproduct.asp?id=1234
"sessid" or anything with "ses" in it I can understand (or perhaps id= where the value is over a certain number of characters and contains letters aswell as numbers), but simply not crawling on "id" alone would be a bit restrictive I would have thought.
www.somedomain.com/category_list.cfm?CatID=45
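Just to illustrate the mod_rewrite suggestion above, here's a rough sketch using that category_list.cfm example. The /category/45 folder-style path and the .htaccess placement are my own assumptions for the example, not anything Google has specified:

    # .htaccess sketch (assumes Apache with mod_rewrite enabled)
    RewriteEngine On
    # map the folder-style url onto the real query-string page
    RewriteRule ^category/([0-9]+)/?$ /category_list.cfm?CatID=$1 [L]

Internal links would then point at www.somedomain.com/category/45 instead of category_list.cfm?CatID=45, so the spider only ever sees urls that look like static folders and there's no id= parameter for it to trip over.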
One of the quickest ways to find what file types Google will index backlinks from is to check pages' backlinks. They index plenty of stuff. I tend to see a much greater variety of filetypes with the different types of searches (link:, site:, etc.) rather than in regular serps.
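For anyone who wants to run the same check, the searches meant here are just the standard operators (yourdomain.com is a placeholder):

    link:www.yourdomain.com
    site:yourdomain.com

The link: results are where any script-generated or jumpmenu backlinks would show up, if Google is reporting them at all.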
That's what I was trying to say. I went back through my older established sites and checked "link" and didn't find any script or jumpmenu generated backlinks. And yet, that's clearly the future of the web.
In my case, from a "reality" perspective, those sites are the most significant ones that link to our organizational sites. Way more significant than some plain HTML links that Google does appear to pick up.
If Google is looking at backlinks from script-generated pages in their assessment of a new site (as G-guy seems to imply), they aren't showing those backlinks in the "link:" search function, at least in the sites that I checked. (And I double-checked a few known ones to make certain they were PR 4 or higher.)