Forum Moderators: phranque


Need to find out where Gbot is hanging

What is the best way to emulate a spider?


stuntdubl

1:20 pm on Mar 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wasn't sure where else to post this, so I hope you guys can help...

I am working on a new site that was created in osC (osCommerce). It uses session IDs, but the site is only reachable via a JavaScript dropdown that starts a session.

The site is a low PR5, has been around for a while, and probably has at least 10,000 pages (guessing here), but only about 150 are indexed.

What are the best ways for me to go about experimenting to find out why the site isn't getting more fully indexed?

I have made some changes (like adding the sitemap to every page) that I am hoping will help, but suggestions would definitely be appreciated.

domokun

3:12 pm on Mar 18, 2004 (gmt 0)

10+ Year Member



Well, first up I'd check your pages in the Lynx browser to see what a spider might make of your content.
Then run your pages through the sim-spider here at WebmasterWorld. There are plenty of others to choose from; just Google for 'sim-spider'.
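The core of what those simulators check can be sketched in a few lines: a basic spider only follows `<a href>` links, so product URLs buried in a JavaScript dropdown's `<option>` values are invisible to it. This is a minimal illustration in Python's stdlib; the HTML fragment is a made-up example in the shape of an osC dropdown, not the poster's actual markup.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, the way a basic spider would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A page whose navigation is a JavaScript dropdown: the product URLs live
# in <option> values, not in <a href>, so a spider finds none of them.
page = """
<select onchange="location=this.value">
  <option value="product.php?id=12&osCsid=abc123">Widget</option>
</select>
<a href="sitemap.php">Site map</a>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # only the plain <a> link is found: ['sitemap.php']
```

If this parser comes back nearly empty on your category pages, the spiders are seeing the same thing.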

Have a looksee at your log files and see if a spider is trying to access some of your links but finding only error pages.
Also, that sounds like an awful lot of pages. I hope there aren't links to all of them on one page! Google limits itself to about 100 links per page; any more than that and it balks.
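The log check is easy to script. Here's a rough sketch that pulls Googlebot requests out of Combined Log Format lines and flags error responses; the log lines themselves are invented for the example, so point it at your own access log in practice.

```python
import re

# Made-up access-log lines in Combined Log Format, for illustration only.
log_lines = [
    '66.249.66.1 - - [18/Mar/2004:10:00:01 +0000] "GET /product.php?id=12 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [18/Mar/2004:10:00:02 +0000] "GET /product.php?id=99 HTTP/1.1" 404 512 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [18/Mar/2004:10:00:03 +0000] "GET /index.php HTTP/1.1" 200 2048 "-" "Mozilla/4.0"',
]

pattern = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

errors = []
for line in log_lines:
    if "Googlebot" not in line:
        continue  # only interested in what the spider saw
    m = pattern.search(line)
    if m and m.group("status").startswith(("4", "5")):
        errors.append((m.group("path"), m.group("status")))

print(errors)  # [('/product.php?id=99', '404')]
```

A pile of 404s on session-ID URLs here would explain a lot about the missing pages.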

What you may also want to consider is duplicate content. Make sure the pages that are indexed don't share a huge amount of similar content with those that are not.
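One crude way to eyeball that: compare the visible text of an indexed page against an unindexed one and see how close they are. This sketch uses `difflib.SequenceMatcher` as a stand-in similarity measure; the two product blurbs are invented examples, and real duplicate-content detection is more involved than a string ratio.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough text-similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical product blurbs: same template, one word changed.
indexed   = "Blue widget, 3 inch, durable steel construction. Ships in 2 days."
unindexed = "Red widget, 3 inch, durable steel construction. Ships in 2 days."

ratio = similarity(indexed, unindexed)
print(round(ratio, 2))  # very high: the pages are near-duplicates
```

If most of your unindexed pages score this close to already-indexed ones, boilerplate templates with only a product name swapped may be part of the problem.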

pageoneresults

3:39 pm on Mar 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It uses session ID's

I think that might be the very first issue you need to address. Google does not like session IDs. I've seen it index some sites with IDs in the query string, but I personally feel that's the biggest roadblock for any spider.

You may want to consider rewriting the URIs and removing all parameters from the query string...

www.example.com/category/product/12/

...or something to that effect.
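On Apache that kind of rewrite is usually done with mod_rewrite. This is a hypothetical `.htaccess` fragment only: the script and parameter names (`product_info.php`, `products_id`) follow the usual osCommerce pattern, but check them against your own install before using anything like this.

```apache
# Hypothetical sketch: map the clean URI onto the real parameterised script.
# /category/product/12/  ->  product_info.php?products_id=12
RewriteEngine On
RewriteRule ^category/product/([0-9]+)/$ product_info.php?products_id=$1 [L,QSA]
```

The internal links on the site then need to use the clean form, so the spiders only ever see parameter-free URIs.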

There are plenty of programs out there that will spider your site. Download one and run it. See what it finds; in most cases, that is what the SEs will find.
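What those programs do boils down to a breadth-first crawl over `<a href>` links. A toy version, to show the mechanics: here the "site" is a canned dict standing in for HTTP fetches so the sketch is self-contained; a real spider would fetch over the network and stay on one host.

```python
from collections import deque
from html.parser import HTMLParser

class HrefParser(HTMLParser):
    """Pulls every href out of <a> tags on one page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

# Canned three-page site standing in for real HTTP fetches (demo assumption).
FAKE_SITE = {
    "/": '<a href="/cat1/">Cat 1</a>',
    "/cat1/": '<a href="/cat1/item1/">Item</a> <a href="/">Home</a>',
    "/cat1/item1/": "A product page with no further links.",
}

def crawl(start):
    """Breadth-first crawl; returns pages in the order a spider reaches them."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = HrefParser()
        parser.feed(FAKE_SITE.get(url, ""))
        for link in parser.hrefs:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # ['/', '/cat1/', '/cat1/item1/']
```

Whatever such a crawl can't reach from your home page, a search engine likely can't reach either.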

Once you've done the rewrite, you will need to devise a way to direct the spider to those new URIs. If the database is capable of generating 10,000 pages, then you have some work to do. I usually like to provide index pages for each main category. Those index pages contain the rewritten URIs. Once a bot gets in there and starts traversing the rewritten URIs, tis a beautiful sight! ;)

Note: It will most likely take a few crawls from Google before it gets most of what you have available.

P.S. There are a few OsC specific topics floating about discussing these issues. I'd search the board and review those before jumping into this.