Is a CMS bad for indexing?

         

marwal

2:17 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I've built a CMS (content management system) where all pages are stored in a database and retrieved with a URL like this:
getDoc.php?pid=1137@content=2121@meta=1119

Can Google follow these links?

/M. Wallin

ikbenhet1

2:22 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I think they can, but they often don't. Keep it to a maximum of 2 variables in the URL, and preferably avoid 'id' in the URL.

<added>That's an @, not a &. I'll have to check on that.

ikbenhet1

4:21 pm on Mar 15, 2003 (gmt 0)

10+ Year Member




It seems not. I did some searches: @ is treated the same as a space. I could not find any URLs indexed with @ in them (nor with three '=' in one variable).
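To illustrate why the separator matters, here is a minimal sketch of how a standard query-string parser reads the two variants. It uses Python's urllib.parse purely as a stand-in; how Google's crawler actually tokenises URLs is not something this thread establishes, and the '&' variant of the URL is an assumption about what the CMS could emit instead.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the query parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)

# URL exactly as posted, with '@' between the pairs.
at_url = "getDoc.php?pid=1137@content=2121@meta=1119"
# Hypothetical variant using the conventional '&' separator.
amp_url = "getDoc.php?pid=1137&content=2121&meta=1119"

at_params = query_params(at_url)
amp_params = query_params(amp_url)

print(at_params)   # one parameter whose value contains two more '=' signs
print(amp_params)  # three separate parameters
```

With '@' the parser sees a single variable `pid` whose value is `1137@content=2121@meta=1119`, which matches the "3 times '=' in 1 variable" observation above.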

plasma

7:57 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



--------------------------8<-------------------------------------------
Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead.

Allow search bots to crawl your sites without session ID's or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.

If you decide to use dynamic pages (i.e., the URL contains a '?' character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them small.
---------------------------8<------------------------------------------

taken from [google.com...]
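A minimal sketch of what supporting If-Modified-Since can look like for a database-driven page. It is written in Python for illustration only; the function name `conditional_response` and the idea that the CMS stores a last-change timestamp per page are assumptions, not something the guideline above prescribes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a GET request.

    if_modified_since: raw If-Modified-Since header value, or None
    last_modified: timezone-aware datetime of the page's last change
    """
    # HTTP dates have one-second resolution and are expressed in GMT.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it and send the full page
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # not modified: send headers only, no body

    return 200, headers  # changed (or no usable header): send the full page

page_changed = datetime(2003, 3, 15, 14, 17, tzinfo=timezone.utc)
status, headers = conditional_response("Sat, 15 Mar 2003 14:17:00 GMT", page_changed)
print(status, headers["Last-Modified"])
```

The 304 branch is where the bandwidth saving comes from: the crawler gets the headers but no page body.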

ikbenhet1

8:02 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I don't know exactly what you're trying to say, but: taken from WebmasterWorld, [webmasterworld.com...] (message #7)

Note, he also uses @, not only &. I missed that too the first time.

plasma

8:32 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



> Note, he also uses @, not only &. I missed that too the first time.

I don't even see a reason for using '?' either ;)

1. I recommend URL hashing: replace each (dynamic) URL with a hash and use mod_rewrite to translate the incoming requests back.

2. Implement If-Modified-Since. Google likes it, and so do browser caches ;)

I'm also a CMS developer, and every single sub-page of our customers' sites gets indexed by Google :)
Before I took the above steps things were different :¦

A nice side effect is that people don't fiddle with the query strings.
You can even disallow any query string that didn't originally come from the CMS.

hih
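A rough sketch of the URL-hashing idea, in Python for illustration. The post does not say how the hashes are computed or stored, so the SHA-1 digest, the 16-character length, the class name `UrlHasher` and the in-memory table (standing in for a database table) are all assumptions.

```python
import hashlib

class UrlHasher:
    """Maps dynamic URLs to short hashes and back (hypothetical helper)."""

    def __init__(self, length=16):
        self.length = length
        self._table = {}  # hash -> original dynamic URL; a DB table in practice

    def public_path(self, dynamic_url):
        """Return the hashed path to put into generated links."""
        digest = hashlib.sha1(dynamic_url.encode("utf-8")).hexdigest()[: self.length]
        self._table[digest] = dynamic_url
        return "/" + digest + "/"

    def resolve(self, path):
        """Translate an incoming hashed path back, or None if the CMS never issued it."""
        return self._table.get(path.strip("/"))

hasher = UrlHasher()
link = hasher.public_path("getDoc.php?pid=1137&content=2121&meta=1119")
print(link, "->", hasher.resolve(link))
print(hasher.resolve("/0000000000000000/"))  # unknown hash
```

Because `resolve` only knows hashes the CMS generated itself, a hand-edited or made-up URL simply fails to resolve, which is the "disallow any query string" side effect mentioned above.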

plasma

9:38 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



Another step you should take:

[ircache.net...]

Enter your URL there.
It will tell you how cacheable it is.

AFAICT cacheable (dynamic) sites have a much higher chance of being indexed by spiders.

cu

ASleep

11:06 pm on Mar 15, 2003 (gmt 0)



> AFAICT cacheable (dynamic) sites have a much higher chance of being indexed by spiders.

I tend to disagree. Most sites I deal with do not want to be cached due to the dynamic nature of the content and use things like must-revalidate, etc. to stay fresh.

I've never seen this affect a site's ability to be spidered. The URLs may have an impact (albeit not as much as years ago), but I doubt cacheability has any.

hitchhiker

11:28 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



plasma:

Do you hash the entire URL, or just the query string?

i.e.:
www.mydomain.com/index.aspx?2439876345fhs324
or
www.mydomain.com/2439876345fhs324.aspx

which would also be possible?
I think I'm having a similar problem. If it's OK, could you give more detail, i.e. what happened before, and how long did it take to change?

tnx

Sorry, I meant to say 'how long did it take Google to see the change and spider properly?'

[edited by: hitchhiker at 11:29 pm (utc) on Mar. 15, 2003]

plasma

11:29 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



> I've never seen this affect a site's ability to be spidered. The URLs may have an impact (albeit not as much as years ago), but I doubt cacheability has any.

OK, then maybe it's not the caching that the spiders like, but the presence of freshness validators.
Caching doesn't hurt, though :)
Of course you should fine-tune things so that pages that MUST always be fresh are not served from a cache.
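A small sketch of what "freshness validators plus fine-tuned caching" could look like, in Python for illustration. The function names, the SHA-1 based ETag and the one-hour `max-age` are assumptions chosen for the example; only the header names and directives themselves are standard HTTP.

```python
import hashlib

def validator_headers(body, must_stay_fresh=False):
    """Build response headers carrying a validator and a caching policy.

    body: the rendered page as bytes
    must_stay_fresh: True for pages that must be revalidated on every request
    """
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    # 'no-cache' lets caches store the page but forces revalidation each time.
    cache_control = "no-cache" if must_stay_fresh else "max-age=3600, must-revalidate"
    return {"ETag": etag, "Cache-Control": cache_control}

def is_not_modified(if_none_match, etag):
    """True if the client's If-None-Match header matches the current ETag."""
    if not if_none_match:
        return False
    candidates = []
    for tag in if_none_match.split(","):
        tag = tag.strip()
        if tag.startswith("W/"):  # weak comparison ignores the W/ prefix
            tag = tag[2:]
        candidates.append(tag)
    return "*" in candidates or etag in candidates

headers = validator_headers(b"<html>page 1137</html>")
print(headers["Cache-Control"], is_not_modified(headers["ETag"], headers["ETag"]))
```

Either way the page carries a validator, so a spider or browser can ask "has this changed?" even when the page itself is never served stale.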

plasma

1:13 pm on Mar 16, 2003 (gmt 0)

10+ Year Member



hitchhiker

I would recommend using the hash as a subdirectory, so that the template's filename can still be seen, e.g. if you want (dynamic) images to be spidered too.

[foo.bar...]
[foo.bar...]

Actually, I have another subdirectory one level above:

[foo.bar...]
[foo.bar...]

because I didn't want the hashes to be at the top level, but now that I think about it, it seems unnecessary.

BTW: wasn't there an issue with Google PR and subdirectories?

Maybe this is even better (although it is too ugly in my eyes):

[foo.bar...]
[foo.bar...]

Hopefully you have direct control over your Apache.
You can put mod_rewrite code in .htaccess files, but maximum performance is only achieved by using the conf files.
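To make the translation step concrete, here is a sketch of the kind of mapping a rewrite rule would perform, written as a Python function rather than actual mod_rewrite syntax. The example URLs in this post are truncated, so the path layout `/<hash>/<template>`, the 16-hex-digit hash and the internal parameter name `h` are purely hypothetical.

```python
import re

# Hypothetical public layout: /<16 hex chars>/<template filename>
HASHED_PATH = re.compile(r"^/(?P<hash>[0-9a-f]{16})/(?P<template>[A-Za-z0-9_.-]+)$")

def rewrite(path):
    """Map a hashed public path to the internal script call, or None if it doesn't match."""
    match = HASHED_PATH.match(path)
    if match is None:
        return None  # not a hashed URL: reject, or let the server handle it normally
    return "/{}?h={}".format(match.group("template"), match.group("hash"))

print(rewrite("/0123456789abcdef/getDoc.php"))
print(rewrite("/getDoc.php?pid=1137"))
```

The template filename stays visible in the public URL while the query string is hidden behind the hash, which is the point of the subdirectory layout.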

cu

andreasfriedrich

3:31 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>BTW: wasn't there an issue with google PR and subdirectories?

Absolutely not, despite all the talk about it.

Andreas

hitchhiker

4:54 pm on Mar 16, 2003 (gmt 0)

10+ Year Member



Thanks plasma.
How long did it take Google to notice, i.e. deepcrawl -> SERP time?

plasma

10:32 pm on Mar 16, 2003 (gmt 0)

10+ Year Member



I don't know if the problem was really related to this.
We changed many things besides the technical ones.

We also mailed Google and asked whether there were penalties on our site, because it might think this is cloaking. They told us that there were no penalties and that it could be related to the EverFlux.

After 3 weeks or so, FreshCrawl picked up the pages and everything is OK now.