Is Google crawling IRC?


Is Google logging IRC channels?

punta

10:57 am on Nov 5, 2003 (gmt 0)

This site has reports of some people seeing Google machines on their IRC channels [manero.org...]
This is interesting. Perhaps Google are planning to log Internet Relay Chats so people can search them, like they can with USENET threads.

It's quite an interesting development. What do people think to this?

Brett_Tabke

3:42 pm on Nov 11, 2003 (gmt 0)

Very interesting. I see there is also a Register article on it this morning. Unfortunately the Reg article is once again by the notorious Google hater Orlowski.

rfgdxm1

3:56 pm on Nov 11, 2003 (gmt 0)

>This is interesting. Perhaps Google are planning to log Internet Relay Chats so people can search them, like they can with USENET threads.

I doubt it, because of the privacy implications. For those unaware of Usenet (I'm a heavy Usenet rat), it is sort of a decentralized message board system. Posting something to Usenet is analogous to tacking flyers to walls all over town. There is no expectation of privacy; in fact the idea is for display of the post to the world. IRC is different.

My best guess: Google is scraping Usenet for URLs. Sites may be mentioned on Usenet long before Googlebot can find them. If Google is trying for the biggest, most complete index, it makes sense to look for URLs in as many places as possible. Most people who are on IRC never post to Usenet. Different subculture.

bakedjake

4:00 pm on Nov 11, 2003 (gmt 0)

>IRC is different.

How so? If they're crawling public networks/public channels, I disagree with you. There should be no expectation of privacy.

Now, I don't really understand the point of Google crawling IRC. Meaning, why would they want to index people's conversation on IRC?

punta

4:03 pm on Nov 11, 2003 (gmt 0)

>My best guess: Google is scraping Usenet for URLs. Sites may be mentioned on Usenet long before Googlebot can find them. If Google is trying for the biggest, most complete index, it makes sense to look for URLs in as many places as possible.

I don't believe they are trying to obtain the biggest index. Google don't index 'island' pages. If a page hasn't got any links to it and it's submitted to google via add URL, it doesn't stay in the index for long (if it gets there at all). Surely indexing these pages would increase the size of the index.

Solution1

4:14 pm on Nov 11, 2003 (gmt 0)

URLs coming from Usenet or IRC are just as much recommendations for webpages by people as links from the WWW. I think it definitely makes sense for Google to count these recommendations in their PageRank algorithm as they do with WWW links. Their value would be comparable to that of links from blogs or message boards.

punta

4:18 pm on Nov 11, 2003 (gmt 0)

>URLs coming from Usenet or IRC are just as much recommendations for webpages by people as links from the WWW

It's too easy to abuse. People would be flocking to IRC channels to namedrop their own web sites.

bcolflesh

4:20 pm on Nov 11, 2003 (gmt 0)

>People would be flocking to IRC channels to namedrop their own web sites.

The bots are primed and ready as we speak!

ukgimp

4:21 pm on Nov 11, 2003 (gmt 0)

Why the hell would you want to search IRC?

G need to concentrate on getting spam out of their index before they bugger around with something that could be abused easily.

rfgdxm1

4:28 pm on Nov 11, 2003 (gmt 0)

>Google don't index 'island' pages. If a page hasn't got any links to it and it's submitted to google via add URL, it doesn't stay in the index for long (if it gets there at all). Surely indexing these pages would increase the size of the index.

They have been, increasingly. Search here for "Supplemental Results."

bakedjake

4:29 pm on Nov 11, 2003 (gmt 0)

>Why the hell would you want to search IRC?

Law enforcement.

But why is Google indexing it?

rfgdxm1

4:30 pm on Nov 11, 2003 (gmt 0)

>Why the hell would you want to search IRC?

>G need to concentrate on getting spam out of their index before they bugger around with something that could be abused easily.

What if Google assigns pages found only on IRC PR0? This would mean IRC wouldn't be useful to spam. However, Google could find more content that would match searches.

punta

4:39 pm on Nov 11, 2003 (gmt 0)

>Why the hell would you want to search IRC?
>G need to concentrate on getting spam out of their index before they bugger around with something that could be abused easily.

Maybe that's exactly what they are doing. Spammers often run bots that sit on IRC channels spamming people with website URLs whenever they join the channel.

Perhaps googlebotIRC is collecting these spam URLs with the intent of removing them from the google index.

bakedjake

4:44 pm on Nov 11, 2003 (gmt 0)

>Perhaps googlebotIRC is collecting these spam URLs with the intent of removing them from the google index.

I would love that. Really, I would. But they're certainly not doing that; it would be way too easy to delist your competitors.

ukgimp

4:45 pm on Nov 11, 2003 (gmt 0)

>Perhaps googlebotIRC is collecting these spam URLs with the intent of removing them from the google index.

Now all a bad person has to do is spam IRC with their competitors' URLs.

BakedJ > That was what I thought re the law, and it is mentioned on Slashdot.

anxvariety

7:25 am on Nov 14, 2003 (gmt 0)

I bet it's to find people that go on IRC and tell people to click on their AdSense ads.

Sanenet

6:53 pm on Nov 14, 2003 (gmt 0)

On that link, in the google reply:

>The unusual activity you are observing is part of an experiment aimed at improving Google's search quality. Please be assured that this behavior is only temporary.

Probably just some programmer mucking around. Don't they encourage their techies to experiment as much as possible?

Brett_Tabke

2:23 pm on Nov 15, 2003 (gmt 0)

I think it was a promotion stunt.

Look on irc today and you'll find every variation of "googlebot" as nicknames available.

HughMungus

2:57 pm on Nov 15, 2003 (gmt 0)

Sounds to me like Google is looking for other sources of "relevant" links. It's the same reason (many think) that Google bought Blogger. The thinking is that people with an individual presence on the web in the form of a blog or on chat are posting/sending links that are way more relevant than a link just sitting on a page somewhere, especially if it's a link that was explicitly requested.

e.g.,

Chatter 1: "Do you know of a website that sells concert tickets?"

Chatter 2: "Yes, go to http://widgets.com/."

widgets.com becomes more relevant to "concert tickets" because it's being referred to directly from one person to another.

It's certainly interesting. Too bad spam ruins stuff like this for everyone.

Craig_F

3:08 pm on Nov 15, 2003 (gmt 0)

A stunt/test sounds most likely to me at this point. However, I wouldn't be surprised if they are checking all that data for usefulness. There's a lot that can be learned and done with that information. Privacy is a problem but there are many ways around that.

ciml

8:04 pm on Nov 15, 2003 (gmt 0)

If Eddie's asking for LINKS and not LIST, then presumably it is crawling servers, rather than channels or content.

I've no idea if channel names and MOTDs are enough for a useful IRC channel search, but I guess the way to find out is to try.
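To make the LINKS versus LIST distinction concrete, here is a minimal sketch, assuming the numeric replies defined in RFC 1459 (364 for each LINKS entry, 322 for each LIST entry). The function name `classify_reply` and the sample lines are invented for illustration, and real servers vary in the details of what they send.

```python
# Sketch only: tell server-map replies (LINKS) apart from channel-list
# replies (LIST) using the RFC 1459 numerics. All sample data is invented.

RPL_LIST = "322"   # one line per channel: <channel> <visible users> :<topic>
RPL_LINKS = "364"  # one line per server:  <mask> <server> :<hopcount> <info>


def classify_reply(line):
    """Return ('server', name), ('channel', name, users, topic) or None."""
    # The trailing parameter starts at the first " :" in the line.
    head, _, trailing = line.partition(" :")
    parts = head.split()
    # parts[0] is the sending server, parts[1] the numeric, parts[2] our nick.
    if len(parts) < 5:
        return None
    numeric, params = parts[1], parts[3:]
    if numeric == RPL_LINKS:
        return ("server", params[1])
    if numeric == RPL_LIST:
        return ("channel", params[0], int(params[1]), trailing)
    return None


links_line = ":irc.example.net 364 mybot *.example.net hub.example.net :1 Example hub"
list_line = ":irc.example.net 322 mybot #tickets 42 :Concert tickets at http://widgets.com/"

print(classify_reply(links_line))
print(classify_reply(list_line))
```

A client that only ever sends LINKS gets back server names and hop counts, which says nothing about what is said in any channel. LIST is the command that exposes channel names, user counts and topics.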

pleeker

10:05 pm on Nov 15, 2003 (gmt 0)

I'm not entirely familiar with everything going on via IRC, and my gut reaction (posted in a similar thread a couple weeks ago), is that most of what happens in IRC would have about as much value as guestbooks on the www.

But on second thought ... surely there are some areas of the IRC world that do have valuable content and information. So Google, in its attempt to organize the world's information, wants to find those areas and index them.

The concern we're talking about here is the abuse of links posted in IRC chats. But if G can analyze links on the www and determine which ones are more valuable, more relevant, less spammy, etc., surely they can do the same on an IRC channel ... can't they? Why are some assuming G can't extend their technology to that type of content?

JasonHamilton

1:32 am on Nov 16, 2003 (gmt 0)

The googlebot has been seen joining channels, but not staying for long.

It's anyone's guess if they plan on joining channels to log discussions. At this point there isn't enough information to form that kind of conclusion.

Without meaning to sound arrogant, I think I know about as much about IRC as one can know. I've run IRC servers for AT&T for many years on Undernet, I'm an operator on servers on EFnet. I've run several smaller IRC networks, written IRC server code, numerous IRC robots (Ranging from game bots to talking AI robots), IRC network services, and developed several IRC related websites.

See my profile to see one of my current IRC projects - it already searches IRC in ways no other IRC search currently does. It resolves the problems of dealing with unstable smaller networks that merge constantly (two separate networks that merge will end up with duplicate channel listings), network splits, and networks just shutting down. My site doesn't list just the larger channels, it lists ALL publicly viewable channels, even those with 1 user, while at the same time weeding out channels that are only temporary. There are no arbitrary limits on minimum network size since we can deal with the network instabilities that are so inherent in the smaller startup networks.

I also cache the server motds, though it's unlikely that google would find that information useful to Joe user.

Anyway, back to the subject of Google on IRC... if google wants channels, I've already got that done, better than anyone else. If they want user discussions, one must keep in mind that it's been tried before, and even with several million in funding (reportedly 10 million), ChatScan failed miserably at it. It's not enough to know of IRC, you must understand its userbase.

Some people are drawing parallels with IRC and news groups. However, you must keep in mind that news groups are posts sent to servers, then distributed across the internet to all the news servers who want it. It's understood that your post will be seen by anyone who wants it. IRC discussions are meant only for the users within the channel. Users on IRC tend to be a little more technical in nature than those elsewhere on the net. I know this is a generalization, but IRC use requires a little more knowledge than entering a url on your web browser, or typing out an email.

At any rate, privacy is an important aspect to all of this. Let me give a few examples: Mrs. Henderson (the name is made up) is in a support channel for battered women. She's speaking about her husband to the support group. Imagine her horror if she then found out her discussions were published on the web for all to read? How about a depression channel like #asd? I can come up with many more examples of why channel logging isn't going to fly, but I think you can get the point.

I do have some ideas for useful things Google could do with IRC information. I've already implemented another website that extracts graphic URLs from channel topics across all the IRC networks, allowing users to browse through about 300-400 pages worth of images. It's like an image gallery, but it's constantly changing (about 3 megs of channel topics are added per minute) and can contain images of virtually anything. And that's just a starting point.

JasonHamilton

1:37 am on Nov 16, 2003 (gmt 0)

Oh, I said a lot in my previous post, but I forgot to post what I had originally wanted to say.

There are technical reasons against logging chat on IRC.

1) It's not designed to be logged on that kind of scale. Only users within a channel are set to receive the messages said in a channel.
2) Even if google had a hub on each network, messages to a channel only pass through the hub if a recipient is on a server on the other side of the hub.
3) Joining thousands of channels is not possible - most IRC networks limit users to 10 channels at a time. Thousands of robots on the major networks is not likely to be allowed, no matter how nicely google asks.
4) Google linking pseudo-servers to each network and launching pseudo-users into the channels isn't likely to happen either - there are reasons why services like ChanServ do not sit inside channels. The net.burst tends to be several megs, eating up bandwidth and resources as the networks rush to sync up with each other.

Ultimately, I just don't see users wanting their discussions published in a public forum, especially when it's for the monetary benefit of a company. HOWEVER, that isn't to say that IRC itself has no commercial value or no use to google. I just don't see channel logging *specifically* as having any place.

ALbino

6:19 am on Nov 16, 2003 (gmt 0)

Why this discussion is even taking place is mind boggling. There is no practical use for logging IRC traffic. The majority of it is "u r hot... do j00 wanna cyber w/ me?!?!". Anything interesting worth logging is either incredibly technical and intricate or illegal. Even if they did log it and make it searchable you would have to sort through literally tons of unuseful information to find the stuff you wanted. Within every worthwhile conversation is 20 "u r hot" conversations. With that said, if they did want to log it, all they would have to do is be hubbed and all IRC traffic would go through them. The irony is that the networks with the technical and interesting users you would actually want to log would never allow Google to do that. So really this whole thread after the first few points is moot :)

AL.

JasonHamilton

3:46 pm on Nov 16, 2003 (gmt 0)

ALbino wrote:
<<With that said, if they did want to log it, all they would have to do is be hubbed and all IRC traffic would go through them>>

ALbino, you must have missed my post. I clearly explained that it's not possible to properly log discussions via a hub. All messages do NOT pass through a hub. IRC messages are only routed on a need to know basis. Unless there is a target user on the OTHER side of a hub, the messages will never be seen by the hub. So:

UserA (On ServerA) talking to UserB (On ServerA) would never be seen by any hub or any server other than ServerA. If there is a UserC on ServerC AND the logging hub sits on the path between ServerA and ServerC, then and only then would the message be logged. Basically the logging would be inconsistent, and completely dependent on what users are in what channels in order to see any of the discussion.
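The routing rule described above can be sketched as a small simulation. This is an illustration under stated assumptions, not real server code: it assumes the network is a tree of servers, and the names `TREE`, `LogHub`, `path` and `servers_seeing_message` are hypothetical.

```python
from collections import deque

# Hypothetical three-server network: ServerA -- LogHub -- ServerC
TREE = {
    "ServerA": ["LogHub"],
    "LogHub": ["ServerA", "ServerC"],
    "ServerC": ["LogHub"],
}


def path(tree, start, goal):
    """Breadth-first search for the route between two servers in a tree."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour in tree[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    # Walk back from the goal to the start, then reverse.
    route = []
    node = goal
    while node is not None:
        route.append(node)
        node = parents[node]
    return route[::-1]


def servers_seeing_message(tree, origin, member_servers):
    """Servers a channel message touches: the origin plus every server on
    a route from the origin to a server that has a channel member."""
    seen = {origin}
    for server in member_servers:
        seen.update(path(tree, origin, server))
    return seen


# UserA and UserB both on ServerA: the hub never sees the message.
print(sorted(servers_seeing_message(TREE, "ServerA", {"ServerA"})))
# UserC joins from ServerC: now the message crosses the hub.
print(sorted(servers_seeing_message(TREE, "ServerA", {"ServerA", "ServerC"})))
```

In the first case only ServerA handles the message, so a logger sitting on the hub records nothing. In the second the hub is on the route and would see it, which is the inconsistency described above.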

mary

3:58 pm on Nov 16, 2003 (gmt 0)

>The majority of it is "u r hot... do j00 wanna cyber w/ me?!?!". Anything interesting worth logging is either incredibly technical and intricate or illegal.

Well, if these figures are correct:
There are 1,279,887 people in 638,042 chatrooms right now!

It would seem that there are a lot of people out there saying "u r hot".

I assume those one million people aren't sitting online all day, and this being Sunday morning the numbers are likely low. Those are very impressive numbers by anyone's standards.

JasonHamilton

4:45 pm on Nov 16, 2003 (gmt 0)

I don't claim to have accurate stats for each and every IRC network, but I do have stats from when I ran the largest server on Undernet for 4 years. Given a server that has 9,000 - 10,000 concurrent users, it would avg around 120,000 user connections per day. From that, you can extrapolate connects/hour, and get some kind of estimate on what a concurrent population of 1,200,000 users means in terms of daily connections. It's a little difficult to estimate actual unique IRC users globally due to the fact that users reconnect all the time, and there are bots, clones, etc.
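As a rough worked version of that extrapolation, here is a back-of-envelope sketch. It rests on one large assumption, flagged in the code: that the ratio of daily connections to concurrent users seen on a single large Undernet server holds across every network. The function and constant names are made up for illustration.

```python
# Back-of-envelope only. The big assumption is that the connections-per-
# concurrent-user ratio from one large Undernet server holds everywhere.


def estimate_daily_connections(concurrent_users, sample_concurrent, sample_daily):
    """Scale a measured daily connection count to a larger concurrent population."""
    return concurrent_users * sample_daily / sample_concurrent


SAMPLE_DAILY = 120_000         # connections/day on the sample server
GLOBAL_CONCURRENT = 1_200_000  # concurrent population discussed in the thread

hourly = SAMPLE_DAILY / 24
low = estimate_daily_connections(GLOBAL_CONCURRENT, 10_000, SAMPLE_DAILY)
high = estimate_daily_connections(GLOBAL_CONCURRENT, 9_000, SAMPLE_DAILY)

print(f"sample server: about {hourly:,.0f} connections/hour")
print(f"global estimate: {low:,.0f} to {high:,.0f} connections/day")
```

Under that assumption the sample server handles about 5,000 connections an hour, and 1,200,000 concurrent users would correspond to roughly 14 to 16 million connections a day. These are connections, not unique people, for the reasons given above (reconnects, bots, clones).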

Suffice to say, the IRC userbase is not diminutive in any way, nor can it be argued that it is statistically unlikely to contain any useful information.

ciml

6:08 pm on Nov 16, 2003 (gmt 0)

My gut reaction is that the privacy issue comes down to IRC channel settings and public awareness. Even so, people will often think of a public #myfriendsname channel as being private, compared with 'publishing' a Web page or Usenet post. If a 'mega brand' search engine like Google were to log and provide access to IRC traffic, PR could be very tricky.

Thanks for sharing your experience Jason, your technical reasons against IRC logging are pretty clear.

So assuming that this is not the purpose, we can maybe guess some kind of channel directory or language experiment.

punta

10:38 am on Nov 17, 2003 (gmt 0)

>I also cache the server motds, though it's unlikely that google would find that information useful to Joe user.

Hmm, maybe not MOTD, but what about channel topics? Often channel operators on IRC put useful/interesting links in IRC topics. My theory...

Google gets a listing of all channels and topics on an IRC network. When it sees a URL in the topic of a channel, it enters the channel. Once it's in, it gets a list of all users.

It then performs analysis on this information to determine the importance of the link. If a channel has just one person in it, then it could easily be spam so it won't have much importance assigned to it. However, if that channel has close to a hundred users, then the link in the topic would be more important.
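A minimal sketch of that theory, for illustration only. Nothing here is known about what Google actually does: the log-scaled weighting, the URL pattern, the function name `score_topic_links` and the sample channels are all assumptions.

```python
import math
import re

# Deliberately simple URL matcher; an assumption, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def score_topic_links(channels):
    """channels: iterable of (name, user_count, topic). Returns {url: score}.

    Hypothetical weighting: a link in a one-user channel scores 0, and the
    score grows with log2 of the user count so huge channels do not dominate.
    """
    scores = {}
    for _name, users, topic in channels:
        weight = math.log2(users) if users > 1 else 0.0
        for url in URL_PATTERN.findall(topic):
            url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
            scores[url] = scores.get(url, 0.0) + weight
    return scores


sample = [
    ("#tickets", 96, "Concert tickets: http://widgets.com/ | be nice"),
    ("#lonely", 1, "visit http://spam.example/ now!"),
]
scores = score_topic_links(sample)
for url, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:6.2f}  {url}")
```

The one-user channel contributes nothing, while the link in the busy channel gets a positive score. A bot farm filling a channel with clones would defeat a user count this naive, which is the abuse worry raised earlier in the thread.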

>Some people are drawing parallels with IRC and news groups. However, you must keep in mind that news groups are posts sent to servers, then distributed across the internet to all the news servers who want it.

IRC messages are messages sent to a server and then distributed to all the other servers in that network.

>It's understood that your post will be seen by anyone who wants it. IRC discussions are meant only for the users within the channel.

It's understood that all IRC messages, even private messages, are sent across an open network and can be intercepted by nosey network administrators.

>Users on IRC tend to be a little more technical in nature than those elsewhere on the net.

Surely you mean arrogant, not technical?

>I know this is a generalization, but IRC use requires a little more knowledge than entering a url on your web browser, or typing out an email.

USENET used to be just as hard (if not harder), before Google broke the clique and made it easy to use.

>At any rate, privacy is an important aspect to all of this. Let me give a few examples: Mrs. Henderson (the name is made up) is in a support channel for battered women. She's speaking about her husband to the support group. Imagine her horror if she then found out her discussions were published on the web for all to read?

Now you're just showing your ignorance. There's plenty of support groups on USENET, and it's easier to be anonymous on USENET than it is on IRC.
