Forum Moderators: open


How do SE bots treat dynamic pages?

Does the bot read the asp file itself, or the html generated by the asp file?

         

brickwall

3:13 pm on Nov 30, 2004 (gmt 0)

10+ Year Member



Hello guys. Newbie question here.

My site consists mostly of dynamically created pages. I use a combination of SSI, database pulls, reading from textfiles using FileSystemObject, and Response.Write to build my html pages using asp. My question is: when G or other SEs crawl sites like mine, do the bots read the content of the asp file directly, or do the bots behave like a browser and go to the server to "request" the html generated by the asp file, and base the indexing decision on that generated html?

If the bots read the asp file directly, then I have no idea how I can optimize my site for SEs without turning my back on the convenience my scripts allow.

If the bots "request" the output html, then I guess I have already prepared for this, and the generated html is SE friendly.

BTW, my querystrings are all very simple and short as in widget.com/showwidget.asp?id=1

Please educate me on this oh great forum members.

uncle_bob

3:17 pm on Nov 30, 2004 (gmt 0)

10+ Year Member



The bots have no choice but to receive the output html; there should be no way for them (or anyone else) to access the source code of your asp pages.
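A quick way to convince yourself of this: a crawler is just an HTTP client. The toy server below (Python, purely illustrative; the `SOURCE` string stands in for your asp script) generates html from the querystring, and the client, crawler-style, only ever sees that output:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Stand-in for the server-side script: this text never leaves the server.
SOURCE = 'Response.Write "<html><body>Widget " & id & "</body></html>"'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server runs the "script" and sends back only its output.
        qs = parse_qs(urlparse(self.path).query)
        widget_id = qs.get("id", ["?"])[0]
        body = f"<html><body>Widget {widget_id}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence request logging

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/showwidget.asp?id=1"
html = urllib.request.urlopen(url).read().decode()
print(html)            # the generated html, exactly what a bot receives
print(SOURCE in html)  # False: the script source never leaves the server
server.shutdown()
```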

brickwall

3:29 pm on Nov 30, 2004 (gmt 0)

10+ Year Member



Thanks, uncle_bob, for the reassurance. I had thought of this too, but I am bothered by what G always tells us about the bots being "dumb", such that they don't recognize long and complicated querystrings. I thought, if bots behave like a browser and "just" request the output html from the server, then why should the querystring matter at all? Isn't it the server's job to worry about that?

I'm not an expert in anything technical, so I guess I'm just missing something here.

Xoc

3:44 am on Dec 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are issues that the bots can run into with dynamically generated pages. Here are three similar, but slightly different, examples:

1) First, part of the querystring may be something that doesn't lead to a unique page. Example:

example.com?a=1&userid=123
example.com?a=1&userid=456
example.com?a=1&userid=789
etc.

2) Second, let's suppose that I have a web page titled "Numbers under 1 billion":

example.com?a=1
example.com?a=2
...
example.com?a=999999999

3) Third, let's suppose that I have a dynamically generated page that generates hyperlinks like this:

<a href="example.com?randomnumber=562951413">randomly generated querystring</a>

where any random number in the querystring is handled dynamically, and the generated page itself contains more links of the same form.

The first example has essentially an infinite number of pages from different user ids. In each case, the content is the same. The spider has to deal with that.

The second example has pages that are almost identical, generated from the querystring.

The third example creates a spider trap, where the spider can index the site from now to eternity, since any querystring will generate another page.

In all three cases this creates a burden on the search engine trying to provide relevant results. They cannot afford to spend time indexing millions of pages that are identical. Nor can they store the results of millions of pages that are almost identical. Finally, they cannot index the same site forever.

Any one of these on your site will likely kill your ranking. The more complicated the querystring, the more likely the spider will run into one of these situations, so the engines don't bother indexing complicated querystrings.
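The three situations above suggest the kinds of defenses a crawler needs. The sketch below is purely hypothetical (the parameter names and the per-site budget are my own assumptions, not how any particular engine works): strip parameters that don't change the content, and cap how many URLs per site get crawled so a trap can't be followed forever.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Assumed names of parameters that don't change page content (illustrative only).
SESSION_PARAMS = {"userid", "sessionid", "sid"}

def canonicalize(url):
    """Drop querystring parameters that don't lead to a unique page."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

seen = set()
budget = 2  # per-site cap, so a spider trap can't run "from now to eternity"
frontier = [
    "http://example.com/?a=1&userid=123",
    "http://example.com/?a=1&userid=456",  # same page once canonicalized
    "http://example.com/?a=2",
    "http://example.com/?a=3",             # over budget, skipped
]
crawled = []
for url in frontier:
    canon = canonicalize(url)
    if canon in seen or len(crawled) >= budget:
        continue  # duplicate content, or the site's crawl budget is spent
    seen.add(canon)
    crawled.append(canon)

print(crawled)
```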

brickwall

6:51 am on Dec 1, 2004 (gmt 0)

10+ Year Member



Hello Xoc, thanks for that detailed explanation. I really appreciate it.

Are you essentially saying that the bot is trying to guess the range of values contained in querystrings?

In your second example, widget.com?a=certainvalue, doesn't the bot just take the certainvalue from the link on the referring page? Why must it guess all possible values?

I have my site setup like this

level 1 - main page (myindexpage.asp)
level 2 - subject pages (mysubject.asp) contains links to individual article pages whose URL are HARDCODED with individual querystring id values as in showarticle.asp?id=1, showarticle.asp?id=100 and so on
level 3 - article pages (showarticle.asp?id=value) all dynamically created, template-based, whose actual content varies significantly

Do you think the bot will have a hard time dealing with my querystring values in level 2 even if they are right there on the page hardcoded?

If the answer to this question is YES, then I still don't get it.
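For what it's worth, the level 2 situation can be sketched like this (the subject page below is made up): a crawler doesn't guess querystring values, it extracts whatever links are hardcoded on the page and follows only those.

```python
from html.parser import HTMLParser

# A made-up level 2 subject page with hardcoded article links.
SUBJECT_PAGE = """
<html><body>
<a href="showarticle.asp?id=1">First article</a>
<a href="showarticle.asp?id=100">Another article</a>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag, crawler-style."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(SUBJECT_PAGE)
print(parser.links)  # exactly the hardcoded URLs, nothing guessed
```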

beauzero

4:41 pm on Dec 3, 2004 (gmt 0)

10+ Year Member



It's been my experience with G, for your level 2 question, that you can add a "hidden" index.htm page reached through a "site index" link at the bottom of all of your pages. This will help. I have had to mess with the formatting quite a bit, though.
This is what I have found (please, if you have advice, let me know):

1. font size matters on the link to the "index" page.
2. Descriptions in the links on the index page MATTER. I.e., don't use "ISBN: 0000000001" as the link text. Use the title of the book, or in your case the title of the article.
3. Make SURE the page's title contains text similar to the link text from point 2. Exactly the same has worked best for me.
4. Make sure that keywords from the link are also found in the page referenced.

I have gotten much better breakdown through G and Inktomi this way.

If you know of any other ways of doing this please let me know.

tomda

4:57 pm on Dec 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




1) First is that some of the query string may be something that doesn't lead to a unique page. Example

example.com?a=1&userid=123
example.com?a=1&userid=456
example.com?a=1&userid=789
etc.

The first example has essentially an infinite number of pages from different user ids. In each case, the content is the same. The spider has to deal with that.

Not if you modify the robots meta tag depending on your variable:
if ($userid == "0") { $robotsmetatag = "all"; } else { $robotsmetatag = "none"; }
echo "<meta name='robots' content='" . $robotsmetatag . "'>";

This way the SE will only index one page (out of millions) and you are safe regarding duplicate content.
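The same idea sketched in Python for comparison (robots_meta is a made-up helper; note that a robots meta tag keeps duplicate pages out of the index, it does not stop the bot from fetching them in the first place):

```python
def robots_meta(userid):
    """Allow indexing only for the canonical (no-userid) version of a page."""
    # "none" is shorthand for noindex,nofollow
    content = "all" if userid == "0" else "none"
    return f"<meta name='robots' content='{content}'>"

print(robots_meta("0"))    # canonical page: indexable
print(robots_meta("123"))  # duplicate carrying a userid: kept out of the index
```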