Forum Moderators: open


Long URLs being indexed

at least they are for Amazon


dvduval

4:57 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see that Google is now indexing URLs over 100 characters long for Amazon.com, with results even wrapping to the next line.

Questions:
How come Amazon gets all the breaks?
What makes their products different than mine?
When is Google going to start indexing Miva carts like they index Amazon's cart?

Chris_R

5:01 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Amazon's PR is most likely higher than yours.

Do you use any & or ? in your URLs?

This will decrease your chance of being spidered.

NFFC

5:02 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>How come Amazon gets all the breaks?
What makes their products different than mine?

Good PageRank will always help in getting "problem" URLs indexed; you have to make G! really want the content.

This does no harm either [google.com...]

dvduval

5:04 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



QVC is another good example of long URLs being indexed, while HSN seems to have none.

I want my long URLs to be indexed.

Can we clarify the rules for indexing long URLs so that I can be sure and abide by them?

dvduval

5:08 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The PageRank of the pages with long URLs is often 3 or less. I have direct links to Miva pages from PR7 pages that are not being indexed.

dvduval

5:11 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The QVC pages have as many &s in their URLs as I do.

andreasfriedrich

5:26 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think that Google has had any problems with long URLs in the last two years, as long as they look static. One of my sites has URLs up to 200 characters. Some of these pages have a PR of only 2, and they have been in the index for two years now.

Andreas

dvduval

5:33 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you saying that Google rewards people who attempt to mislead the bot into thinking it's seeing a static page? Please clarify.
Note: I'm not trying to be a smart a** here. I just want to know more about what the criteria are for indexing.

andreasfriedrich

6:26 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Traditionally, SEs refrained from indexing dynamic URLs for fear of adding whole databases or other SEs' SERPs to their index, or of causing unwanted side effects on the server. Another reason might have been that in those days the content worth indexing was contained in static pages. Since this has changed, Google has started to index resources that are identified by URIs containing query strings (dynamic URIs).

There is, however, no standard or RFC or anything else that requires dynamic content to be identified by dynamic URIs only. RFC 2616 states explicitly that "[i]f the Request-URI refers to a data-producing process, it is the produced data which shall be returned". So your statement about "mislead[ing] the bot into thinking it's seeing a static page" would only make sense if the bot had a reasonable expectation of getting only static pages when it requests a resource identified by a static URL. Since this is not the case, the bot is not given false or misleading information, and thus it is not misled.

Think of the HTTP protocol as an interface between your web server and the outside world, just as the door to your house is your interface to the world. What happens behind closed doors is nobody else's business. (I do realize that this notion of privacy is not without problems. But since it is applied almost uniformly by the US courts, it seems to be a recognized, albeit problematic, idea. I believe that it will do here.) Whether your pages are created dynamically by some script, a large human staff, or trained rats should not concern the user requesting that resource. All that should matter to them is that they get what they might reasonably expect: a valid resource of the type specified in the Content-Type header.

So there is nothing dodgy about using static URLs for dynamic content. And since SEs tend to index static URLs more easily why not give them what they want?
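The approach described above can be sketched with Apache's mod_rewrite. This is a hypothetical illustration, not anything from the thread: the script name product.php, the URL pattern, and the id parameter are all invented for the example.

```apache
# Sketch only: serve dynamic content from a static-looking URL.
RewriteEngine On
# Map /products/widget-123.html to /product.php?id=widget-123
# (script and parameter names are assumptions, not Miva specifics).
RewriteRule ^products/([A-Za-z0-9-]+)\.html$ /product.php?id=$1 [L,QSA]
```

To the spider, the .html URL looks static; the query string never appears in the address it crawls.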

Andreas

dvduval

9:52 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Andreas,

I didn't completely follow the part about the RFC. Are you saying that there is a better way to control the HTTP headers that are passed to the spider? If so, where might I look for information regarding instructions for modifying these headers?

andreasfriedrich

11:05 pm on Nov 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All I was saying is that even RFC 2616 suggests that a static URI can point to either a static or a dynamically created resource. This is not about headers.

However, if you are serving dynamically created content from a static URL, where the created content may depend on variables not encoded in the URI, you would need to make sure that those resources are not cached.

See http header for "dont cache"? [webmasterworld.com]
[webmaster.info.aol.com...]

GoogleGuy suggested that you implement handling for conditional requests using the if-modified-since [webmasterworld.com] header field.

See this post [webmasterworld.com] for an implementation in PHP.
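As a rough sketch of that suggestion (this is not the PHP implementation in the linked post; the function name and details are my own illustration), handling a conditional request boils down to comparing the resource's last-modified time with the date the client sends in If-Modified-Since:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime, format_datetime
from typing import Optional

def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Decide between 200 (send the full body) and 304 (Not Modified).

    last_modified: when the resource last changed (tz-aware, UTC).
    if_modified_since: raw If-Modified-Since request header, if any.
    """
    if if_modified_since:
        try:
            cached = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparsable header: just serve the page
        if cached.tzinfo is None:
            cached = cached.replace(tzinfo=timezone.utc)
        if last_modified <= cached:
            return 304  # client's copy is current: headers only, no body
    return 200
```

A 304 response must omit the body entirely; and for pages whose content depends on variables not encoded in the URL, it is safer to send "don't cache" headers than to rely on this check.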

Let me stress this again: just because your pages are built dynamically does not mean you must use dynamic URIs. A URI should identify a resource: [aaron.tld...] does that just as well as, if not better than, [aaron.tld...]

So there is nothing dodgy about using static URLs for dynamic content. And since SEs tend to index static URLs more easily why not give them what they want?

Andreas

jasonh

2:16 am on Dec 23, 2002 (gmt 0)

10+ Year Member



The problem with Google and Miva pages is the store code.
Having ? and & is not the problem per se. Search for Google and Miva Merchant for more info.