Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Looksmart using spider with no user-agent for MSN?
A Looksmart spider came that had an IE user-agent...
WileE




msg:396634
 12:25 am on Oct 17, 2003 (gmt 0)

One of our sites started getting referrals from MSN with session IDs attached to the URL (i.e. [xyz.com...])

(I changed the actual ID for publication here)

Well, that seemed odd, because we filter requests by user-agent (Googlebot, Slurp, bot, etc.) and don't generate session IDs for spiders. So I went through the log to find the first time that this session ID had shown up, as this would be the time it was generated. To my surprise, I found:

sv-fw.looksmart.com - - [08/Oct/2003:13:53:28 -0400] "GET /index.php?session_id=2106002191ad2fee7a94178dbb33deac HTTP/1.1" 200 31238 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; YComp 5.0.0.0)"
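
For anyone wanting to repeat that kind of log search, here is a minimal Python sketch. It assumes the server writes Apache's combined log format (which the line above matches) and that the log is in chronological order. `LOG_RE` and `first_hit_for_session` are names made up for this illustration, not part of the original script.

```python
import re
from typing import Iterable, Optional

# Regex for the Apache "combined" log format, as seen in the line quoted above.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def first_hit_for_session(lines: Iterable[str], session_id: str) -> Optional[dict]:
    """Return the parsed fields of the first line whose request path
    carries the given session ID, or None if no line matches."""
    needle = "session_id=" + session_id
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and needle in match.group("path"):
            return match.groupdict()
    return None


# The log line from the post (with the altered session ID) as sample input.
sample = (
    'sv-fw.looksmart.com - - [08/Oct/2003:13:53:28 -0400] '
    '"GET /index.php?session_id=2106002191ad2fee7a94178dbb33deac HTTP/1.1" '
    '200 31238 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; YComp 5.0.0.0)"'
)
hit = first_hit_for_session([sample], "2106002191ad2fee7a94178dbb33deac")
print(hit["host"], hit["time"], hit["agent"])
```

In practice the `lines` argument would be an open log file rather than a one-element list. The first match only marks when the ID was generated if the file really is in time order.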

So our script had generated a sessionID for Looksmart because it didn't have a user-agent that looked like a spider. It looks like a user.
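
To make the failure mode concrete, here is a rough Python sketch of a user-agent filter along the lines described. The actual script is not shown in this thread, so the pattern list (taken from the substrings mentioned above) and the function names are assumptions for illustration only.

```python
import re

# Substrings mentioned in the post (Googlebot, Slurp, bot); a real list
# would be longer. This pattern is an assumption, not the poster's code.
SPIDER_PATTERN = re.compile(r"googlebot|slurp|bot", re.IGNORECASE)


def looks_like_spider(user_agent: str) -> bool:
    """True if the User-Agent header contains a known spider substring."""
    return SPIDER_PATTERN.search(user_agent or "") is not None


def should_start_session(user_agent: str) -> bool:
    """Only requests that do not look like spiders get a session ID."""
    return not looks_like_spider(user_agent)


looksmart_ua = (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; "
    "T312461; YComp 5.0.0.0)"
)
print(should_start_session("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
print(should_start_session(looksmart_ua))
```

The second call shows the problem: the Looksmart request carries an ordinary IE string, so any filter based only on the user-agent treats it as a visitor and hands out a session ID.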

1. Has anyone else noticed this?

2. I suppose this could be to avoid cloaking, but there are legitimate reasons to be detecting spiders (such as mine).

3. Is the only alternative to try and filter by IP addresses of known spiders... which will be quite a headache to keep up with...?

(in case anyone is curious, I found that T312461 is an IE security update)

 

stechert




msg:396635
 1:23 am on Oct 17, 2003 (gmt 0)

I think this is a real person looking at your web page and then entering it into the directory (wrongly) with the session ID still attached. I don't know if that session ID was the exact one or if it had been edited, but searches for that string on search.msn.com and looksmart.com don't resolve, so I assume the problem's been fixed (probably through editorial review).

Cheers,
Andre

WileE




msg:396636
 4:41 am on Oct 17, 2003 (gmt 0)

As I mentioned, I changed the actual session id before posting here on WebmasterWorld... although it makes little difference. If I search for the actual session ID string, it returns nothing. If I search for some actual keywords, the resulting link DOES still include the session ID.

Are they really finding sites by person and inserting into the directory? Seems terribly inefficient.

closed




msg:396637
 4:54 am on Oct 17, 2003 (gmt 0)

Are they really finding sites by person and inserting into the directory? Seems terribly inefficient.

Yes they are, and yes it may seem inefficient.

A page of mine was added into the directory that way. I didn't submit the location or anything. A LookSmart editor added it. No complaints here.

stechert




msg:396638
 7:18 am on Oct 17, 2003 (gmt 0)

Looksmart has both an algorithmic search platform (Wisenut) and a search tuned by people (Zeal/Directory). The directory is where folks would have been adding links by hand.

Cheers,
Andre

WileE




msg:396639
 9:37 pm on Oct 17, 2003 (gmt 0)

The plot thickens.

Today, Inktomi spidered that same session_id.

Could Inktomi have followed a result from the MSN results page?

Also, I'm not all too happy about engines putting a session_id in their SERP pages...

closed




msg:396640
 3:31 am on Oct 18, 2003 (gmt 0)

I guess the thing to do is to find out what the source of the link is.

Put the title of the page you're looking for in MSN Search. What heading do you get in the results? Do you get "SPONSORED SITES", "WEB DIRECTORY SITES", or "WEB PAGES"?

WileE




msg:396641
 4:03 am on Oct 18, 2003 (gmt 0)

Web Directory Sites.

jdMorgan




msg:396642
 5:20 am on Oct 18, 2003 (gmt 0)

WileE,

I'd do three things: First, contact Looksmart directory and request that the link be corrected. Second, temporarily patch your script to recognize and ignore that particular session ID, so you can assign a new one to real visitors that follow one of these messed-up links. Third, temporarily patch your script to check the requestor's IP address against Looksmart's IP block, and don't assign sessions for those requests. You could check the sv-fw.looksmart.com hostname instead, but that is slower and may not work all the time.

You can remove the first patch when they fix the link, and the second patch when you are sure this was a human editor error, and not some new project robot.
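
The two script patches could look roughly like this in Python. This is only a sketch under stated assumptions: the thread gives neither the real session ID nor Looksmart's real IP block, so the ID below is the altered one from the first post and `192.0.2.0/24` is a reserved documentation range standing in for the real block. All names are hypothetical.

```python
import ipaddress
import secrets
from typing import Optional

# Placeholder values, NOT real data: the posted (altered) session ID and a
# documentation-only network standing in for Looksmart's IP block.
LEAKED_SESSION_IDS = {"2106002191ad2fee7a94178dbb33deac"}
NO_SESSION_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def in_no_session_network(remote_addr: str) -> bool:
    """True if the requester's IP falls inside one of the listed blocks."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in NO_SESSION_NETWORKS)


def resolve_session(remote_addr: str, requested_id: Optional[str]) -> Optional[str]:
    """Decide which session ID (if any) a request should end up with."""
    # IP-block patch: no session at all for requests from the listed block.
    if in_no_session_network(remote_addr):
        return None
    # Leaked-ID patch: ignore the ID baked into the bad link and issue a
    # fresh one, so real visitors do not all share the same session.
    if requested_id is None or requested_id in LEAKED_SESSION_IDS:
        return secrets.token_hex(16)
    return requested_id


print(resolve_session("192.0.2.10", None))
print(resolve_session("203.0.113.5", "2106002191ad2fee7a94178dbb33deac"))
```

The IP test is a local comparison, whereas matching on the hostname would need a reverse DNS lookup for each request, which fits the point about it being slower and not always working.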

Really messy problem... This is a good example of why cloaking is so tough!

Best,
Jim

closed




msg:396643
 5:33 am on Oct 18, 2003 (gmt 0)

I'm guessing it's in the Zeal directory. All you need to do is edit the page's profile. To do that, you'll either have to be a Zeal Contributor or ask someone else who is a Contributor or higher to edit it for you.

I'd recommend becoming a Zeal Contributor. It's not really that hard. All you have to do is register, then pass the Member Quiz. If you do it, then you get the points for the edit, plus you get some experience with the Zeal/LookSmart way of doing things.

I could do the edit for you, but I think it would be better in the long run for you to be a member so that if the same thing happens again, you'll already know what to do.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved