More dynamic page crawling problems

some content taken, some left alone

         

John_Caius

12:48 am on Nov 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We have pages of the form [domain.com ] that are in the Google index. They contain links to pages of the form [domain.com ] and Google doesn't spider the links. Any ideas why? The first type of page has been in the index for at least nine months.

Slade

1:24 am on Nov 28, 2002 (gmt 0)

10+ Year Member



What is "u" in "ID=u"?

Is it random, sessionid, constant?

Is googlebot spidering the "second type" links?

John_Caius

1:33 am on Nov 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The part of the site that Google has spidered is an A-Z sitemap of 27,000 pages of informational content. ID=u refers to the 'U' section of the A-Z list and start=1 refers to 'start at the first item beginning with u and give the next 20'. So the parameters are constant for the page in question. About 350 pages of the sitemap are in the index.

Each link in the sitemap is of the second type and Google has never spidered any of them. We've also added a static sitemap page, linked from our homepage, that points to some of our most popular pages. Google found that sitemap but again doesn't follow the links.

Slade

1:53 am on Nov 28, 2002 (gmt 0)

10+ Year Member



This is only a guess, and I have no proof, only a memory of someone throwing it out as an idea.

On one page, say that static sitemap, change "ID" to something else, like "product" for a few of the links.

Googlebot should be doing a full crawl again in the next two weeks, so you should be able to tell fairly quickly whether it worked.

Try running a few pages as is through the sim-spider here [searchengineworld.com...] just to see if anything flaky pops up.

Slade

3:33 am on Nov 28, 2002 (gmt 0)

10+ Year Member



...meanwhile... John stickied me the URL...

I looked at the sim-spider of that page and it doesn't include all those links. (Well, duh, you knew that already, you're not getting spidered!)

I took a peek at the code and noticed that the dynamically generated sections looked like:


<a href='http://yadda.com'>Yadda</a>

When I think they should look like:

<a href="http://yadda.com">Yadda</a>

I copied the page, replaced all the single quotes with double quotes and ran it through the sim-spider again, and this time it did show all your links.

I wouldn't think it should matter, but apparently it does. If it weren't that a great deal of your site is non-indexed because of it, I'd say you found a nice glitch...
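
One plausible explanation is a link extractor that only recognises double-quoted href values. That is an assumption, since the sim-spider's own code isn't shown in this thread. A minimal Python sketch of that failure mode, with patterns and function names made up purely for illustration:

```python
import re

# Hypothetical naive pattern: only recognises double-quoted href values.
STRICT_HREF = re.compile(r'<a\s+[^>]*href="([^"]*)"', re.IGNORECASE)

# More tolerant pattern: accepts double-quoted, single-quoted or unquoted values.
TOLERANT_HREF = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)

def strict_links(html):
    """Links a double-quote-only extractor would see."""
    return STRICT_HREF.findall(html)

def tolerant_links(html):
    """Links found regardless of quoting style."""
    return [a or b or c for a, b, c in TOLERANT_HREF.findall(html)]

page = (
    "<a href='http://yadda.com'>Yadda</a>"
    '<a href="http://example.com">Example</a>'
)
print(strict_links(page))    # the single-quoted link is missing
print(tolerant_links(page))  # both links are found
```

Whether Googlebot itself behaves like the strict version is not something this sketch can tell you; it only shows why switching to double quotes is a cheap thing to try.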

GilbertZ

8:03 am on Nov 28, 2002 (gmt 0)



Actually it was my stickymail in case John is confused. Nice catch there. I wonder if this is an issue with the sim spider only or the google spider?

P.S. when I've looked at vbulletin sites, their content does not get indexed very often...

John_Caius

11:33 am on Nov 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Many thanks for such prompt help and assistance! :) We'll try changing this and let you know how we get on.

John_Caius

11:05 pm on Nov 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We looked at our code for the A-Z type pages (the first type) and we could only find the " characters around the URL in the hyperlink, no ' characters. Slade - could you stickymail me with the particular URL you tried? Thanks very much.

We have a static sitemap of our most popular pages that's been in the Google index for a few months too, in an effort to solve this problem. The page.cfm?ID=12345678 pages have again not been spidered, although we've made some specific static copies and linked them from here - the static copies have been spidered. So it definitely seems to be something to do with the second type of dynamic URL.

As explained previously, each page of content has a dynamic URL, but that URL doesn't change for the page and there's no sessionID or anything like it. So there shouldn't be a problem with the spider getting into infinite loops.

Would the Apache mod_rewrite system help in our case by getting rid of the question marks? How do you go about doing that - I don't know anything about server-side programming but we have technicians who do.

We're a health information database, not a database of products, and we have no banner advertising or popups. We just want people to have access to our content from the search engines because it's been ten years in development and it's a really comprehensive resource. Any further help or suggestions gratefully received...

GilbertZ

9:45 am on Nov 29, 2002 (gmt 0)



John, I think he confused you and me because I sent him a stickymail and had that problem and fixed it...Hopefully within the next update or two the content will be indexed :)

Have you tried the sim spider? It's awesome and helped me figure out that a page that wasn't being followed didn't have a good status code so I was able to fix it!

John_Caius

9:59 am on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh I see! :) We've tried the sim spider and it definitely sees the links. But they haven't been followed in the last six to eight updates, so I don't see why they're likely to get followed in the next one or two unless we change something.

Does anyone have experience with using mod_rewrite with this kind of URL?

kleinguru

10:32 am on Nov 29, 2002 (gmt 0)


hi John,

you should try something like this:

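RewriteEngine On

# The leading slash assumes server or virtual-host config (drop it in .htaccess); [P] needs mod_proxy
# /list.cfm/u/1 -> /list.cfm?ID=u&start=1
RewriteRule ^/list\.cfm/([^/]+)/([^/]+)$ http://www.domain.com/list.cfm?ID=$1&start=$2 [P]
# /page.cfm/12345678 -> /page.cfm?ID=12345678
RewriteRule ^/page\.cfm/([^/]+)$ http://www.domain.com/page.cfm?ID=$1 [P]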

these rules will rewrite http://www.domain.com/list.cfm/u/1 to http://www.domain.com/list.cfm?ID=u&start=1 and http://www.domain.com/page.cfm/-12345 to http://www.domain.com/page.cfm?ID=-12345
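
To check the mapping without touching the server, the same two rules can be mimicked in a few lines of Python. This is only a sketch: the `rewrite` function and `RULES` table are made up for illustration, the patterns match one path segment per parameter (an assumption about what the IDs look like), and only the path is handled, not the host.

```python
import re

# Hypothetical Python equivalents of the two rewrite rules, path part only.
RULES = [
    (re.compile(r"^/list\.cfm/([^/]+)/([^/]+)$"), r"/list.cfm?ID=\1&start=\2"),
    (re.compile(r"^/page\.cfm/([^/]+)$"), r"/page.cfm?ID=\1"),
]

def rewrite(path):
    """Return the dynamic URL a friendly path maps to, or the path unchanged."""
    for pattern, target in RULES:
        if pattern.match(path):
            return pattern.sub(target, path)
    return path

print(rewrite("/list.cfm/u/1"))
print(rewrite("/page.cfm/-12345"))
```

Paths that match neither rule come back untouched, which is also how the server treats URLs that no RewriteRule applies to.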

GilbertZ

10:40 am on Nov 29, 2002 (gmt 0)



John,

Are you getting a 200 success code?
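
For reference, 200 is the HTTP status code a server sends when a request succeeded; codes in the 400s and 500s signal errors. A small Python sketch for checking what a URL returns, where `fetch_status` and `is_success` are illustrative helper names rather than part of any tool mentioned in this thread:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the server is unreachable.

    urlopen follows redirects, so this is the status of the final response.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is still useful.
        return err.code
    except URLError:
        return None

def is_success(status):
    """True for any 2xx status code."""
    return status is not None and 200 <= status < 300
```

You would call it as `is_success(fetch_status(url))` with one of your own page URLs. Bear in mind that a spider may send different headers than this script, so a 200 here is a good sign but not proof of what Googlebot sees.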

John_Caius

11:07 am on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Don't know - what's a 200 success code? I'll ask the programmers!

John_Caius

11:11 am on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks kleinguru :) - we'll try that and see whether it works.

John_Caius

11:40 am on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Site is now available in my profile.

John_Caius

11:44 am on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I put the sitemap (bottom of our front page) into the simspider and it came up with all 200 codes, no unspiderable links. Google can follow the static ones but not the ones with a ? in them.

John_Caius

12:03 pm on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hm, just found that our site runs on a Cold Fusion server system, not Apache. Is there an equivalent function to the Apache mod_rewrite in Cold Fusion? As you can see, I'm not an expert in this aspect of our site...

Slade

12:14 am on Dec 3, 2002 (gmt 0)

10+ Year Member



My suggestion is to build these rewrites into your application.cfm. There you get the raw, unmodified URL string and can rewrite it (internally) into what the rest of your pages expect.
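
The idea, sketched in Python rather than ColdFusion just to show the logic: take the path after the host, split off the script name, and turn the remaining segments into the parameters that script expects. The `PARAMS` table and the `split_friendly_url` function are hypothetical, and how application.cfm actually gets at the raw URL string is not shown here.

```python
# Hypothetical table: the query parameters each script expects, in the
# order they would appear as path segments.
PARAMS = {
    "list.cfm": ["ID", "start"],
    "page.cfm": ["ID"],
}

def split_friendly_url(path):
    """Split '/page.cfm/12345678' into ('page.cfm', {'ID': '12345678'}).

    Returns None when the script is unknown or the segment count is wrong.
    """
    segments = [s for s in path.split("/") if s]
    if not segments or segments[0] not in PARAMS:
        return None
    script, values = segments[0], segments[1:]
    names = PARAMS[script]
    if len(values) != len(names):
        return None
    return script, dict(zip(names, values))

print(split_friendly_url("/list.cfm/u/1"))
print(split_friendly_url("/page.cfm/12345678"))
```

Keeping the parameter names in one table means the individual pages keep reading ID and start exactly as they do now; only the links you publish change shape.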