Dynamic Pages Not Crawled

dynamic pages to static pages

         

gesbos1

5:10 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Hi guys,

I'm new to this forum and I have been given a task that I need help with. Our site is composed of pretty much all dynamic web pages, and it is a rather large site. Consequently, my boss tells me that google spiders and robots are not traversing the site. My first question is, how can I tell if googlebot is accessing my site and how deep it is crawling?

My second question is how do I make my dynamic pages look static to googlebot? I've done some reading on this site and see where several webmasters use mod_rewrite to do this. However, I get confused about how to properly use this tool. All links are currently dynamic, so does this mean we have to change all links to be friendly and then have mod_rewrite rewrite them to the dynamic pages? Or, can we change the dynamic pages to static which would be easier?

HELP

AthlonInside

5:38 pm on Feb 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To find googlebot activities, look at your raw log files.

When using mod_rewrite, your pages are still dynamic; the URLs are just prettier and do not look 'dynamic'.

For example, instead of having a lot of pages that look like this

http://www.example.com/viewarticle?page=1568

you can have

http://www.example.com/viewarticle/1568/

for the same page.

[edited by: tedster at 9:23 pm (utc) on Feb. 17, 2004]
[edit reason] switch to example.com [/edit]

pcgamez

5:50 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Can you give any links to good tutorials for mod_rewrite?

Abdelrhman Fahmy

6:02 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



I hope this will help:

[razertech.com...]
[engelschall.com...]

Also, don't forget to search for it on WebmasterWorld; there are a lot of threads here about this topic.

XtendScott

6:24 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



FYI,

Mod_rewrite is for Apache servers.
ISAPI_rewrite is for IIS servers.

thevoodoo

6:36 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Gesbos,

I just wanted to add that if you are going to employ URL rewriting, meaning mod_rewrite or ISAPI_Rewrite, you will need to keep a few things in mind.

If your website is big enough, then you need to think of a long-term solution rather than a regular URL mapping solution. You will need to write, or have somebody develop, a URL rewriting engine for you. This will be a rather intelligent engine which processes URLs based on criteria that you set. For example, you can tell the engine to process only URLs ending in "HTM" and not "HTML". Or you can have it process only the capitalized version of pages. In this manner, you will be able to keep a namespace free for static uses, such as promotions or contact pages.

You can always take the easier route by just replacing all the question marks "?" and ampersands "&" with forward slashes "/". But this will still result in very deep directory structures, which can be as annoying for SEs as dynamic URLs with long query strings.

As a final word, I made a huge change on one of my old sites from full-on dynamic URLs to root-directory HTML pages, but dropped quite a bit in rankings, and am now seeing long dynamic URLs rank pretty well. So whichever route you decide to go, I would appreciate it if you updated me on the results, to see if your rankings improve or drop like mine.

TheVoodoo

gesbos1

6:45 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



thanks guys.

I do have Apache HTTP server. Which log file has info about googlebot?

Answer me this: Let's say right now most links on my home page are in this format:

[parkseed.com...]

Can someone give me an example that would rewrite the above http statement to:
[parkseed.com...]

or something like that. Something more crawlable. So a simple mod_rewrite rule that changes ? to /, = to -, and & to /.

Would this work?

ppg

9:07 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Hi gesbos1,

I notice the word 'servlet' in your URL - you're using server-side Java, right?

I'm not 100% sure, but I don't think you can use mod_rewrite with server-side Java, since the requests for dynamic pages are served by the Java application server (whichever one you're using) rather than Apache, which will probably be serving your static pages only. That was the case the last time I checked.

All is not lost if this is the case, though. The way round it would be to pass all requests for dynamic URLs through a controller servlet, which can translate the static-looking URLs in your links to dynamic ones by matching patterns you define before passing the request on to the correct resource.

Please note I haven't done this myself, but I've seen it used to good effect.

You can see visits from Googlebot in your normal Apache log files; just look for 'googlebot' in the user agent part of the log entry.
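As a rough illustration of that log check, here is a minimal Python sketch. It assumes the Apache "combined" log format, where the user agent is the last quoted field; the sample log lines and the function name `googlebot_paths` are made up for the example and are not from anyone's real server.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format:
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(lines):
    """Count requested paths whose user agent mentions 'googlebot'."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "googlebot" in m.group("agent").lower():
            hits[m.group("path")] += 1
    return hits

# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [16/Feb/2004:17:10:00 +0000] "GET /index.html HTTP/1.0" '
    '200 512 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.2 - - [16/Feb/2004:17:11:00 +0000] "GET /index.html HTTP/1.1" '
    '200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.1 - - [16/Feb/2004:17:12:00 +0000] '
    '"GET /servlet/StoreCatalogDisplay?catalogId=10066 HTTP/1.0" '
    '200 900 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

if __name__ == "__main__":
    for path, count in googlebot_paths(SAMPLE).most_common():
        print(count, path)
```

For a large log, an open file object can be passed in place of `SAMPLE`, since the function reads one line at a time. The list of paths it returns gives a rough picture of how deep into the site the crawler is getting.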

HTH - I'd look into whether or not you _can_ use mod_rewrite before you spend a lot of time learning about it.

gesbos1

9:27 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Thanks ppg.

Yes, we use servlets. Has anyone else heard about mod_rewrite and java servlets? Can we use mod_rewrite with servlets?

thevoodoo

10:03 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Gesbos,

1. mod_rewrite in Apache 1.3 will let you map URLs to URLs and even filenames to filenames, but the server backend only provides a URL-to-filename mapping. This means that any rewrite that is not URL-to-filename is first translated to a URL-to-filename mapping, the necessary mappings are applied, and the result is translated back. In Apache 2.0 the two missing direct mappings may have been added, but I have not checked. You will need to see whether that resolves any Java issues.

2. URL rewriting with Apache can be done at two different levels:

  • Server level, with the "httpd.conf" file or your Apache config file.
  • Directory level, with ".htaccess" files.

The second one is always more of a burden than the first one, so use the first one if possible.

Now for the example you requested. You need to determine your website's linking and content generation structure. Taking the example URL you provided: if all of the pages that you want to map are displayed through the file "StoreCatalogDisplay", then you may be able to map like this:

RewriteRule ^(.+)\/(.+)\/(.+)\-(.+)\.html$ servlet/StoreCatalogDisplay\?catalogId=$2&storeId=$1&langId=-$4&mainPage=$3

In a perfect world, this should map

/10101/10066/page1-1.html

to

servlet/StoreCatalogDisplay?catalogId=10066&storeId=10101&langId=-1&mainPage=page1

You definitely need to know regular expressions, and know them well, to do good URL rewriting. Briefly, however, here is what the line does.

It first tells the rewrite engine that this is a rule to be obeyed when rewriting anything that has two subfolders (two "/") and a file with at least one dash "-" and an "html" extension.

Then it tells the engine to grab whatever is between the beginning "^" of the URL and the first forward slash "/"; this will be the first folder. Then it grabs anything between the first forward slash and the second one; this will be the second folder. Finally, it grabs whatever is on either side of the dash in the file name ending with ".html". It stores them in the order retrieved as $1, $2, $3 and $4. In the next step it just fills in the blanks, and ta-da.
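To see the four captures in action outside Apache, the same regular expression can be tried in Python. This is only a sketch of the pattern matching, not of mod_rewrite itself: the function name `rewrite` is made up, Python writes the backreferences as \g<1> to \g<4> rather than $1 to $4, and the input is assumed to have no leading slash.

```python
import re

# Same pattern as the RewriteRule above, without the optional backslashes.
PATTERN = re.compile(r"^(.+)/(.+)/(.+)-(.+)\.html$")

# Python spells the backreferences \g<1>..\g<4> instead of $1..$4.
TARGET = (r"servlet/StoreCatalogDisplay"
          r"?catalogId=\g<2>&storeId=\g<1>&langId=-\g<4>&mainPage=\g<3>")

def rewrite(path):
    """Return the rewritten path, or None if the pattern does not match."""
    m = PATTERN.match(path)
    return m.expand(TARGET) if m else None

if __name__ == "__main__":
    # Assumes the leading slash has already been stripped from the path.
    print(rewrite("10101/10066/page1-1.html"))
    print(rewrite("contact.html"))  # no slashes or dash, so no match
```

One thing the sketch makes visible: if the path is matched with its leading slash ("/10101/10066/page1-1.html"), the first group captures "/10101" rather than "10101", so it is worth checking which form of the path your rule actually sees.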

I am sorry if I sound vague. If you need more help, post here so that everyone can read it, and I will try to respond.

TheVoodoo

ppg

10:23 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



gesbos1, what servlet container does your site use?

gesbos1

10:24 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



thevoodoo,

When you say:

"In a perfect world, this should map

/10101/10066/page1-1.html

to

servlet/StoreCatalogDisplay?catalogId=10066&storeId=10101&langId=-1&mainPage=page1"

Does this mean that the link on the page has to be /10101/10066/page1-1.html and then mod_rewrite rewrites it to servlet/StoreCatalogDisplay?catalogId=10066&storeId=10101&langId=-1&mainPage=page1,
or will this mapping be the other way around?

The way our site is set up right now, we have hundreds, if not thousands, of links on the site that are dynamic links. So I need to know if we would have to make all of these links the static pretty links and then have mod_rewrite rewrite them all to their dynamic counterparts, or can we leave the structure as is with dynamic links (which would be the preferred way) and have mod_rewrite make them appear static?

I really don't want to have to change all the dynamic links that are already there to static links. Am I making sense?

thevoodoo

10:36 pm on Feb 16, 2004 (gmt 0)

10+ Year Member



Gesbos,

Good point! My previous post got so long that I forgot to explain that.

What I mean is that you can enter the prettier URL in the browser and internally get the page which would have been created by the uglier one (no offense) ;-)

However, here are the two approaches.

  1. If you want to use the URL rewriting to only get your pages indexed.

  2. If you want to create a solid structure to be indexed by the SEs and provide a permanent solution.

If it is the first one, then you can leave your current dynamic URLs. But you have to add the static ones somewhere on the site, maybe in a multipage sitemap, so that they get indexed. But this way you will have double the number of pages indexed with duplicate content, half with dynamic URLs and half with static ones. This cannot be good in the long run.

If you take the second choice, then the fact that your website is dynamic comes to your aid. Since your website is generated on the fly with a dynamic engine, you will be able to change the dynamic links on the pages to static ones much more easily. All you need to do is reverse-engineer your RewriteRules.

In this case, when the servlet is generating a link for the store "10101", catalog "10066", called "page1" and in the language number "1", instead of filling out these blanks:

catalogId=$2&storeId=$1&langId=-$4&mainPage=$3

It will fill out these blanks:

/$1/$2/$3-$4.html
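The two ways of filling in the blanks can be sketched side by side. This is Python purely for illustration (the site itself uses Java servlets), and the function names `dynamic_link` and `pretty_link` are made up; only the two URL shapes come from the rule discussed in this thread.

```python
def dynamic_link(store_id, catalog_id, main_page, lang_id):
    """The link format currently written into the pages."""
    return (
        "servlet/StoreCatalogDisplay"
        f"?catalogId={catalog_id}&storeId={store_id}"
        f"&langId=-{lang_id}&mainPage={main_page}"
    )

def pretty_link(store_id, catalog_id, main_page, lang_id):
    """The static-looking format /$1/$2/$3-$4.html for the same page."""
    return f"/{store_id}/{catalog_id}/{main_page}-{lang_id}.html"

if __name__ == "__main__":
    args = ("10101", "10066", "page1", "1")
    print(dynamic_link(*args))
    print(pretty_link(*args))
```

Both functions take the same four values, so switching the site over means changing the one place where links are generated rather than editing every link by hand.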

Does it make sense?!

TheVoodoo

gesbos1

1:03 am on Feb 17, 2004 (gmt 0)

10+ Year Member



thevoodoo,

So, you are saying that we simply need to change the way the servlets write out the links and then have mod_rewrite convert these to the dynamic version when they are clicked?

Am I getting this right?

Thanks for all your help voodoo!

gesbos1

1:35 am on Feb 17, 2004 (gmt 0)

10+ Year Member



oh yeah,

Once again, which log do I analyze to see if Googlebot is traversing my site and how far it is getting? My AccessLog.log files are over 400 MB! Is there some tool I can use that will tell me what I want to know?

Thanks

thevoodoo

5:37 pm on Feb 17, 2004 (gmt 0)

10+ Year Member



Gesbos,

You got it exactly right!
As for log files, I am pretty sure your server already comes with at least one log file analyzer. Are you using any server/hosting management programs like PSA, cPanel/WHM, Ensim, etc.?

They all come pre-equipped with at least one of the following:

Webalizer [mrunix.net]
Awstats [awstats.sourceforge.net]
Analog [analog.cx]

TheVoodoo

dhaliwal

4:42 am on Feb 18, 2004 (gmt 0)

10+ Year Member



Well, our website is also updated and cached by Googlebot every day,
but the further links to some pages are not indexed. I don't know why.

If Google prevents a website from being listed, are all of its pages prevented? Or may just part of that website be prevented from being listed, due to something called search engine spamming?

regards,
dhaliwal

gesbos1

7:47 pm on Feb 18, 2004 (gmt 0)

10+ Year Member



Remember in my discussion how I told you guys that we use WebSphere and .jsp pages? I keep getting an exception error from the WebSphere level every time I try to click on a link that I have made "pretty", saying the page is not found, even though I have my mod_rewrite rule set up. Any ideas? Of course the page is not found; that's what the rewrite rule is for. Is this an application server thing?

This is the sample I'm using:

RewriteRule ^(.+)\/(.+)\/(.+)\-(.+)\.html$ servlet/StoreCatalogDisplay\?catalogId=$2&storeId=$1&langId=-$4&mainPage=$3

In a perfect world, this should map

/10101/10066/page1-1.html

to

servlet/StoreCatalogDisplay?catalogId=10066&storeId=10101&langId=-1&mainPage=page1