
javascript hrefs / cookies and spiders

Can Google's spiders crawl javascript hrefs?


Musicroom

10:22 am on Sep 3, 2004 (gmt 0)

10+ Year Member



We have a site which requires users to select a country before they enter. We then set a cookie so that we can serve them the appropriate content. However, this necessary 'splash' page will not let a spider through. We could put small static links to our US and UK sites at the bottom of the splash page, but then we would have to cookie the spider so that it does not get sent back to the splash page.
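
A minimal sketch of the splash-page logic described above, assuming the cookie name and country codes (the original post does not give them): every request without a valid country cookie gets sent to the splash page, which is exactly what keeps a cookieless spider out.

```python
# Hypothetical names: the cookie name and country codes are assumptions,
# not taken from the original site.
COUNTRY_COOKIE = "site_country"
VALID_COUNTRIES = {"us", "uk"}

def needs_splash(cookies: dict) -> bool:
    """Return True if the visitor has no valid country cookie and
    should be redirected to the country-selection splash page."""
    return cookies.get(COUNTRY_COOKIE) not in VALID_COUNTRIES
```

Since a crawler never presents a cookie, `needs_splash({})` is always true, so the spider is intercepted on every page of the site.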

Google's official guidelines seem to be behind their actual capabilities - I saw an article yesterday saying they could now crawl Flash content. So any idea if Google's spiders can crawl javascript hrefs? Assuming Google doesn't have a valid cookie set, I can't think of how to allow them into the site. We intercept anyone without a valid cookie on every page.

Google's guidelines don't give us a lot of info on *how* to actually allow search bots to crawl. Do bots access with a different protocol that we can identify?

Lord Majestic

10:48 am on Sep 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is safe to assume that bots won't understand JavaScript, apart from (and I am speculating here) the obvious cases where a function takes a full URL as a parameter that can be recognised and extracted.
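
To illustrate the speculation above: a crawler that does not execute JavaScript could still harvest links whenever a full URL appears literally as a quoted string, e.g. inside `window.open('http://...')`. A rough sketch of that kind of extraction (the pattern is illustrative, not Google's actual behaviour):

```python
import re

# Match a quoted absolute URL anywhere in page source, e.g. the
# argument of window.open('http://...') or location.href = "http://...".
URL_IN_JS = re.compile(r"""['"](https?://[^'"]+)['"]""")

def extract_js_urls(source: str) -> list:
    """Return full URLs that appear as quoted literals in the source."""
    return URL_IN_JS.findall(source)
```

A JavaScript link built up from fragments or computed at runtime would not be caught this way, which is why only "obvious" JavaScript hrefs could plausibly be followed.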

I'd suggest creating unique URLs for each country and linking (say) flags on the front page to those pages. Naturally, do not require a cookie to be set in order for the bot to access these country-specific URLs.
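
The suggestion above can be sketched as follows, with hypothetical paths: each country's content lives at its own URL, and visiting that URL *sets* the cookie rather than requiring it, so a crawler with no cookie can still fetch every country page.

```python
# Hypothetical URL scheme: /us/... and /uk/... are assumptions.
COUNTRY_PATHS = {"/us/": "us", "/uk/": "uk"}

def handle_request(path: str, cookies: dict):
    """Return (page, cookie_to_set); cookie_to_set is None if unchanged."""
    # Country-specific URLs are always served, cookie or not,
    # and set the country cookie as a side effect.
    for prefix, country in COUNTRY_PATHS.items():
        if path.startswith(prefix):
            return f"content:{country}", country
    # Elsewhere, fall back on the cookie, else show the splash page.
    if cookies.get("site_country") in COUNTRY_PATHS.values():
        return f"content:{cookies['site_country']}", None
    return "splash", None
```

With this scheme the splash page only has to carry plain HREF links to `/us/` and `/uk/`, which any bot can follow.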

oshatz

11:30 am on Sep 3, 2004 (gmt 0)

10+ Year Member



From the Googlebot information page (http://www.google.com/bot.html):

What kinds of links does Googlebot follow?

Googlebot follows HREF links and SRC links.

end quote.

I use JavaScript redirects, and Google doesn't seem to follow them. I ran a test: I put up a new page reachable only via a JavaScript link, and at the same time added another page with a regular link. Only the regular one got spidered.

Oren

Musicroom

12:24 pm on Sep 3, 2004 (gmt 0)

10+ Year Member



Thanks for the advice so far...

I'm now checking whether the request's user-agent string matches the user agent of the Google robot. This is from:

[robotstxt.org...]

If the user agent is determined to be a valid robot (in this case, only Google), then no redirect occurs; the request goes straight into the site. I tested this using the browser from:

[kmeleon.sourceforge.net...]

To make it as primitive as possible, I went into the preferences, turned off Java and JavaScript, and set it not to accept cookies. I then set the user agent string directly in the browser's preferences, to:

Googlebot/2.X (+http://www.googlebot.com/bot.html)

I could then browse the site as a robot would see it.
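
The same "browse as the robot" check can be done from a script instead of a browser: send a request carrying Googlebot's user-agent string and no cookies, and see what the server returns. A sketch (the URL is a placeholder):

```python
import urllib.request

# User-agent string quoted earlier in the thread.
UA = "Googlebot/2.X (+http://www.googlebot.com/bot.html)"

def bot_request(url: str) -> urllib.request.Request:
    """Build a cookieless request that identifies itself as Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": UA})

# Fetching would then be: urllib.request.urlopen(bot_request(url)).read()
```

Note that this only shows what the server does for that user-agent string; as the next reply points out, serving robots different content from browsers is risky in itself.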

Any further experiences on this front?

Lord Majestic

12:33 pm on Sep 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the user agent is determined to be a valid robot (in this case, only google), then no redirect occurs.

This sounds like cloaking to me. One would assume that a search engine would do a test run over all domains, or those suspected of cloaking, by requesting pages with a typical browser user agent. Your setup might be innocent, but in an age of automated checks it might be treated with extreme prejudice. I'd rethink this approach.