It's quite an interesting development. What do people think to this?
I doubt it, because of the privacy implications. For those unaware of Usenet (I'm a heavy Usenet rat), it is sort of a decentralized message board system. Posting something to Usenet is analogous to tacking flyers to walls all over town. There is no expectation of privacy; in fact the idea is for display of the post to the world. IRC is different.
My best guess: Google is scraping Usenet for URLs. Sites may be mentioned on Usenet long before Googlebot can find them. If Google is trying for the biggest, most complete index, it makes sense to look for URLs in as many places as possible. Most people who are on IRC never post to Usenet. Different subculture.
My best guess: Google is scraping Usenet for URLs. Sites may be mentioned on Usenet long before Googlebot can find them. If Google is trying for the biggest, most complete index, it makes sense to look for URLs in as many places as possible.
I don't believe they are trying to obtain the biggest index. Google don't index 'island' pages. If a page hasn't got any links to it and it's submitted to Google via Add URL, it doesn't stay in the index for long (if it gets there at all). Surely indexing these pages would increase the size of the index.
They have been increasingly. Search here about "Supplemental Results."
What if Google assigns pages found only on IRC PR0? This would mean IRC wouldn't be useful to spam. However, Google could find more content that would match searches.
Why the hell would you want to search IRC?
G need to concentrate on getting spam out of their index before they bugger around with something that could be abused easily.
Maybe that's exactly what they are doing. Spammers often run bots that sit on IRC channels spamming people with website URLs whenever they join the channel.
Perhaps googlebotIRC is collecting these spam URLs with the intent of removing them from the google index.
The unusual activity you are observing is part of an experiment aimed at improving Google's search quality. Please be assured that this behavior is only temporary.
Probably just some programmer mucking around. Don't they encourage their techies to experiment as much as possible?
e.g.,
Chatter 1: "Do you know of a website that sells concert tickets?"
Chatter 2: "Yes, go to http://widgets.com/."
widgets.com becomes more relevant to "concert tickets" because it's being referred to directly from one person to another.
It's certainly interesting. Too bad spam ruins stuff like this for everyone.
But on second thought ... surely there are some areas of the IRC world that do have valuable content and information. So Google, in its attempt to organize the world's information, wants to find those areas and index them.
The concern we're talking about here is the abuse of links posted in IRC chats. But if G can analyze links on the www and determine which ones are more valuable, more relevant, less spammy, etc., surely they can do the same on an IRC channel ... can't they? Why are some assuming G can't extend their technology to that type of content?
It's anyone's guess if they plan on joining channels to log discussions. At this point there isn't enough information to form that kind of conclusion.
Without meaning to sound arrogant, I think I know about as much about IRC as one can know. I've run IRC servers for AT&T for many years on Undernet, I'm an operator on servers on EFnet. I've run several smaller IRC networks, written IRC server code, numerous IRC robots (Ranging from game bots to talking AI robots), IRC network services, and developed several IRC related websites.
See my profile to see one of my current IRC projects - it already searches IRC in ways no other IRC search currently does. It resolves the problems of dealing with unstable smaller networks that merge constantly (two separate networks that merge will end up with duplicate channel listings), network splits, and networks just shutting down. My site doesn't list just the larger channels, it lists ALL publicly viewable channels, even those with 1 user, while at the same time weeding out channels that are only temporary. There are no arbitrary limits on minimum network size since we can deal with the network instabilities that are so inherent in the smaller startup networks.
I also cache the server motds, though it's unlikely that google would find that information useful to Joe user.
Anyway, back to the subject of Google on IRC... if google wants channels, I've already got that done, better than anyone else. If they want user discussions, one must keep in mind that it's been tried before, and even with several million in funding (reportedly 10 million), ChatScan failed miserably at it. It's not enough to know of IRC, you must understand its userbase. Some people are drawing parallels with IRC and news groups. However, you must keep in mind that news groups are posts sent to servers, then distributed across the internet to all the news servers who want it. It's understood that your post will be seen by anyone who wants it. IRC discussions are meant only for the users within the channel. Users on IRC tend to be a little more technical in nature than those elsewhere on the net. I know this is a generalization, but IRC use requires a little more knowledge than entering a url on your web browser, or typing out an email. At any rate, privacy is an important aspect to all of this. Let me give a few examples: Mrs. Henderson (the name is made up) is in a support channel for battered women. She's speaking about her husband to the support group. Imagine her horror if she then found out her discussions were published on the web for all to read? How about a depression channel like #asd? I can come up with many more examples of why channel logging isn't going to fly, but I think you can get the point.
I do have some ideas for useful things Google could do with IRC information. I've already implemented another website that extracts graphic URLs from channel topics across all the IRC networks, allowing users to browse through about 300-400 pages' worth of images. It's like an image gallery, but it's constantly changing (about 3 megs of channel topics are added per minute) and can contain images of virtually anything. And that's just a starting point.
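For anyone curious what pulling graphic URLs out of channel topics might look like, here is a minimal sketch. It is not the code behind the site described above; the regular expression, the list of image extensions, and the function name image_urls are all assumptions made for illustration.

```python
import re

# Assumed pattern: an http(s) URL that ends in a common image extension.
# The extension list is illustrative, not exhaustive.
IMAGE_URL = re.compile(r"https?://[^\s<>\"]+?\.(?:jpe?g|png|gif)\b", re.IGNORECASE)

def image_urls(topic: str) -> list[str]:
    """Return image URLs found in a channel topic, in order, without duplicates."""
    seen = set()
    found = []
    for url in IMAGE_URL.findall(topic):
        if url not in seen:
            seen.add(url)
            found.append(url)
    return found

if __name__ == "__main__":
    topic = "Welcome! pic of the day: http://example.com/a.jpg | rules at http://example.com/"
    print(image_urls(topic))
```

A real crawler would also have to pull the topics themselves (for example from the output of the LIST command) and cope with topics that change constantly; this only covers the extraction step.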
There are technical reasons against logging chat on IRC.
1) It's not designed to be logged on that kind of scale. Only users within a channel are set to receive the messages said in a channel.
2) Even if google had a hub on each network, messages to a channel only pass through the hub if a recipient is on a server on the other side of the hub.
3) Joining thousands of channels is not possible - most IRC networks limit users to 10 channels at a time. Thousands of robots on the major networks are not likely to be allowed, no matter how nicely google asks.
4) Google linking pseudo-servers to each network and launching pseudo-users into the channels isn't likely to happen either - there are reasons why services like ChanServ do not sit inside channels. The net.burst tends to be several megs, eating up bandwidth and resources as the networks rush to sync up with each other.
Ultimately, I just don't see users wanting their discussions published in a public forum, especially when it's for the monetary benefit of a company. HOWEVER, that isn't to say that IRC itself has no commercial value or no use to google. I just don't see channel logging *specifically* as having any place.
AL.
ALbino, you must have missed my post. I clearly explained that it's not possible to properly log discussions via a hub. All messages do NOT pass through a hub. IRC messages are only routed on a need to know basis. Unless there is a target user on the OTHER side of a hub, the messages will never be seen by the hub. So:
UserA (On ServerA) talking to UserB (On ServerA) would never be seen by any hub or any server other than ServerA. If there is a UserC on ServerC AND the logging hub sits on the path between ServerA and ServerC, then and only then would the message be logged. Basically the logging would be inconsistent, and completely dependent on which users are in which channels for the hub to see any of the discussion.
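The need-to-know routing described above can be sketched as a small simulation. This is a toy model under stated assumptions, not real IRC server code: it treats the network as a tree of server links and assumes a channel message travels only along the paths from the sender's server to servers that host a channel member. ServerA and ServerC come from the example above; Hub, LogLeaf and the function name are hypothetical.

```python
from collections import deque

def servers_seeing_message(links, origin, member_servers):
    """Return the set of servers a channel message passes through.

    links: iterable of (server, server) pairs forming a tree.
    origin: the server the sender is connected to.
    member_servers: servers hosting at least one member of the channel.
    """
    # Build an undirected adjacency map from the link list.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    graph.setdefault(origin, set())

    # Breadth-first search from the origin, remembering each server's parent.
    parent = {origin: None}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    # Walk back from every member server to the origin; only servers on
    # those paths ever see the message.
    seen = {origin}
    for server in member_servers:
        node = server
        while node is not None and node in parent:
            seen.add(node)
            node = parent[node]
    return seen

if __name__ == "__main__":
    links = [("ServerA", "Hub"), ("Hub", "ServerC"), ("Hub", "LogLeaf")]
    # UserA and UserB both on ServerA: the hub sees nothing.
    print(servers_seeing_message(links, "ServerA", {"ServerA"}))
    # UserC joins from ServerC: now the hub is on the path.
    print(servers_seeing_message(links, "ServerA", {"ServerA", "ServerC"}))
```

In this model a logging server attached as a leaf (LogLeaf here) with no channel members of its own never receives the message at all, which is the inconsistency being described.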
The majority of it is "u r hot... do j00 wanna cyber w/ me?!?!". Anything interesting worth logging is either incredibly technical and intricate or illegal.
Well, if these figures are correct:
There are 1,279,887 people in 638,042 chatrooms right now!
It would seem that there are a lot of people out there saying "u r hot".
I assume those one million people aren't sitting online all day, and this being Sunday morning the numbers are likely low. Those are very impressive numbers by anyone's standards.
Suffice to say, the IRC userbase is not small by any measure, nor can it be argued that it is statistically unlikely to contain any useful information.
Thanks for sharing your experience Jason, your technical reasons against IRC logging are pretty clear.
So assuming that this is not the purpose, we can maybe guess some kind of channel directory or language experiment.
I also cache the server motds, though it's unlikely that google would find that information useful to Joe user.
Hmm, maybe not MOTD, but what about channel topics? Often channel operators on IRC put useful/interesting links in IRC topics. My theory...
Google gets a listing of all channels and topics on an IRC network. When it sees a URL in the topic of a channel, it enters the channel. Once it's in, it gets a list of all users.
It then performs analysis on this information to determine the importance of the link. If a channel has just one person in it, then it could easily be spam so it won't have much importance assigned to it. However, if that channel has close to a hundred users, then the link in the topic would be more important.
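To make the theory concrete, here is a rough sketch of that kind of scoring. Nothing here is known about what Google actually does; the log2 weighting, the URL pattern, and the function name score_topic_links are assumptions picked only so that a one-user channel contributes nothing and a channel with close to a hundred users contributes a lot more.

```python
import math
import re

# Assumed pattern for any http(s) URL in a topic.
URL = re.compile(r"https?://[^\s<>\"]+")

def score_topic_links(channels):
    """Score URLs found in channel topics by how many users sit in the channel.

    channels: iterable of (name, topic, user_count) tuples.
    Returns a dict mapping each URL to its accumulated score.
    """
    scores = {}
    for name, topic, user_count in channels:
        # Assumed weighting: log2 of the user count, so a channel with
        # one user (likely spam) adds zero.
        weight = math.log2(user_count) if user_count > 1 else 0.0
        # set() so a URL repeated within one topic is counted once.
        for url in set(URL.findall(topic)):
            scores[url] = scores.get(url, 0.0) + weight
    return scores

if __name__ == "__main__":
    channels = [
        ("#solo", "visit http://spam.example/", 1),
        ("#big", "docs at http://docs.example/ today", 96),
    ]
    for url, score in score_topic_links(channels).items():
        print(f"{score:6.2f}  {url}")
```

User count alone is easy to fake with clone bots, so a real system would presumably need more signals than this.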
Some people are drawing parallels with IRC and news groups. However, you must keep in mind that news groups are posts sent to servers, then distributed across the internet to all the news servers who want it.
IRC messages are messages sent to a server and then distributed to all the other servers in that network.
It's understood that your post will be seen by anyone who wants it. IRC discussions are meant only for the users within the channel.
It's understood that all IRC messages, even private messages, are sent across an open network and can be intercepted by nosey network administrators.
Users on IRC tend to be a little more technical in nature than those elsewhere on the net.
Surely you mean arrogant, not technical?
I know this is a generalization, but IRC use requires a little more knowledge than entering a url on your web browser, or typing out an email.
USENET used to be just as hard (if not harder), before Google broke the clique and made it easy to use.
At any rate, privacy is an important aspect to all of this. Let me give a few examples: Mrs. Henderson (the name is made up) is in a support channel for battered women. She's speaking about her husband to the support group. Imagine her horror if she then found out her discussions were published on the web for all to read?
Now you're just showing your ignorance. There's plenty of support groups on USENET, and it's easier to be anonymous on USENET than it is on IRC.