Handling of '?' in URLs in Google

huppy99

4:56 am on Jan 7, 2002 (gmt 0)

10+ Year Member



Hi.

We have a number of URLs such as this :
domain.com/cgi-bin/browse.cgi?itcMenu=nav&category=12

How does Google handle '?' characters in URLs?
We tend to rank very well, but our online shop section could do better, and I suspect it might be because of this...

Cheers.

Brett_Tabke

9:40 am on Jan 7, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



They will index it, but it is very rare that such URLs rank well. It's best if you can move some of the content to files with standard extensions (such as .html, .htm, or even .shtml).

TopRankingSEO

10:42 am on Jan 7, 2002 (gmt 0)



Brett, could you please explain why, in your view, a URL with a ? won't rank highly?

BTW, please follow the link below and look at the #3 listing.

[google.com...]

agerhart

1:35 pm on Jan 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The keyword search you linked to is not very competitive, with fewer than 200,000 pages returned. When a search returns over 2,000,000 pages, it is much rarer to see URLs with a ? in them.

TopRankingSEO

5:18 pm on Jan 7, 2002 (gmt 0)



>When a search returns over 2,000,000 pages, it is much rarer to see URLs with a ? in them.

Yes, that's possible. But it doesn't prove that URLs with a ? won't rank well.

Please feel free to give your comments on the #2 listing: [google.com...]

agerhart

5:23 pm on Jan 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You are correct that it is not an impossibility, but it is not the norm.

Brett_Tabke

5:29 pm on Jan 7, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



PageRank is still the gatekeeper. I think (know) that the page you mentioned above would rank even higher if it didn't have the ? in the URL.

Beyond

12:48 am on Jan 13, 2002 (gmt 0)

10+ Year Member



I believe there is a limit to how many variables Google likes in a ? URL. I see it indexing most shorter ones, but it seems to avoid ones that have more than four or five variables. Perhaps there is a character limit? With ? URLs, the shorter the better, it seems.

bobriggs

1:03 am on Jan 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow. There was a post I remembered from a couple of months back about Google and dynamic sites.

Sorry, I can't give you the link because it must be a POST, not a GET. But a search on the WebmasterWorld site search [searchengineworld.com] with the terms 'google dynamic' returns 193 results.

The consensus seemed to be that you're correct, Beyond. For a while, Google would pick up the first level of ? pages, but would not count them as links to other pages (if there were other links). You could also see that the number of & characters (i.e., parameters) was lower in highly indexed sites. All of this changes month to month... Google has picked up more dynamic sites, but so far I don't think anyone can recognize a definite pattern of how deep it will go, which pages count as linking pages, etc.

Macphisto

8:39 pm on Jan 16, 2002 (gmt 0)

10+ Year Member



Since I changed most dynamic URLs to work with / instead of ? and &, Google has crawled nearly all of them, and they ranked well in the results before I got the PR drop last December.
The little hack I use to change the URLs works well with nearly all the PHP stuff I use on my sites.

Matt
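Macphisto doesn't show the hack itself, but the general shape of this kind of path-based rewriting can be sketched (in Python rather than PHP; the key/value path layout below is an invented convention, not his actual scheme):

```python
# Hypothetical sketch of turning a "/"-style URL path back into the
# parameters a dynamic script expects, e.g. /category/12/item/5
# instead of ?category=12&item=5. The pairing convention is invented.
def path_to_params(path):
    """Map /key1/val1/key2/val2 to a {key: value} dict."""
    parts = [p for p in path.strip("/").split("/") if p]
    # Pair consecutive segments as key/value.
    return dict(zip(parts[0::2], parts[1::2]))
```

A front script (or a 404 handler, as discussed later in the thread) would call something like this and then dispatch to the real dynamic page.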

Brett_Tabke

6:52 pm on Jan 22, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>I believe there is a limit to how many variables

I've been watching for this, Beyond. I still think the file extension itself is most important. I have some page.html?anything=you-want pages that are ranking infinitely higher than page.cgi?anything=you-want.

Beyond

7:40 pm on Jan 22, 2002 (gmt 0)

10+ Year Member



Re: .cgi vs. .html, that's interesting. I can understand why Google would want to rank .html higher than, say, .pdf or .doc files, but I don't see why they would do this with .cgi, since about 99% of the time .cgi returns HTML-formatted output. Are you taking everything into account with this example? Is it an apples-to-apples comparison?

Maybe our buddy GoogleGuy would like to comment on file extensions and save us the time of making up a bunch of test files to fully verify this.

william_dw

7:51 pm on Jan 22, 2002 (gmt 0)

10+ Year Member



IMHO,
Google would be likely to give some kind of penalty to dynamic pages, if only for the simple reason that I could code a dynamic page which displayed 20 links, all pointing back to itself but with a different query string; hey presto, Googlebot runs around in circles.

I'm sure Google has a very elegant way of handling this, but they can't just index the page once (i.e., exclude the query string when checking whether the page has already been spidered), because then almost all of the dynamic content would be lost.
So I figure Google combines a per-site limit on pages with a penalty for every dynamic page, perhaps increasing every time Google follows another dynamic link from a dynamic page.

Just a guess + my $0.02,
William
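One way a crawler could defuse that kind of self-linking trap is to canonicalize query strings before deduplicating URLs; this is only a guess at the mechanism, sketched in Python:

```python
# Illustrative only: collapse URLs that differ just in parameter
# order, so ?a=1&b=2 and ?b=2&a=1 count as the same page. Nobody in
# this thread knows what Google actually does.
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def canonicalize(url):
    scheme, netloc, path, query, _ = urlsplit(url)
    params = sorted(parse_qsl(query))  # stable parameter order
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

# Both spellings of the same page collapse to one canonical URL.
seen = {canonicalize(u) for u in (
    "http://example.com/p?b=2&a=1",
    "http://example.com/p?a=1&b=2",
)}
```

Combined with a per-site page budget, this would keep a looping set of 20 self-links from consuming the whole crawl.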

hutchins13

6:44 am on Jan 23, 2002 (gmt 0)

10+ Year Member



It's been my experience that Google does not follow any links from a dynamic page.

Also, I have had no luck getting CGI pages indexed (e.g. shop/index.cgi?ID=&task=item&ItemID=IT14), but I know other people have.

Brett_Tabke

8:50 am on Jan 25, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hmmm. Designing a test.

[webmasterworld.com...]

[webmasterworld.com...]

huppy99

10:22 am on Jan 25, 2002 (gmt 0)

10+ Year Member



Guys,

Thanks for the thoughts so far. We are thinking of changing the URLs for our shop... it deserves to get more traffic!

As far as AltaVista goes, every time it hits our servers we see very high load... we have maybe 20,000 pages of content, and thinking about it, if it indexed the dynamic shop pages, it might kill us completely...

But that is an infrastructure matter...

wardbekker

1:02 pm on Jan 25, 2002 (gmt 0)

10+ Year Member



I'm using URL rewriting on a website. Because of limitations in ASP 3.0, I use a redirect in my 404.asp page to the correct URL.

Does Google catalog these links? And if so, which one: the friendly "www.foo.com/article/32646.html"
or the "www.foo.com/article.asp?id=32646" URL?

And the most important thing: can I get penalized for the redirect trick?

joshie76

1:56 pm on Jan 25, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>redirect in my 404.asp page

I'm not 100% sure, but I think this results in Google being returned a 404 for any request it makes on the "www.foo.com/article/32646.html" page. That probably isn't good, and I wouldn't be surprised if Google didn't index such responses at all.

Let me know if I'm wrong, as this would be a great way for us ASP 3.0 guys to deal with this issue (short of writing our own APIs).

Josh

wardbekker

2:21 pm on Jan 25, 2002 (gmt 0)

10+ Year Member



If you include a Response.Status of "200 OK", it should send an OK header. I even managed a great workaround for the missing Server.Transfer problem. Here it comes:

Function HTTPGet(strURL) 'As String
    Dim strReturn ' As String
    Dim objHTTP ' As MSXML.XMLHTTPRequest
    If Len(strURL) Then
        Set objHTTP = Server.CreateObject("Microsoft.XMLHTTP")
        objHTTP.open "GET", strURL, False ' synchronous request
        objHTTP.send ' fetch the page
        strReturn = objHTTP.responseText
        Set objHTTP = Nothing
    End If
    HTTPGet = strReturn
End Function

Dim theCorrectURL

' Here you parse the missing URL that is passed to the 404.asp page.
' The output is the URL of the ASP page you want to load
' (for example, article.asp?id=62346&foo=73443).
theCorrectURL = [foo.com...]

Response.Write HTTPGet(theCorrectURL)

*TADAAA*

Any other ActiveX control that reads files via HTTP will do ;-)
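For readers outside ASP, the core of the trick above, fetching the real dynamic URL server-side and returning its body so the client sees an ordinary 200 page, can be sketched in Python (a hypothetical helper, standard library only):

```python
# Rough Python analogue of the HTTPGet workaround above: the error
# handler fetches the real URL itself and returns the body, instead
# of redirecting the visitor.
from urllib.request import urlopen

def http_get(url):
    """Fetch url and return its body as text; empty input yields ""."""
    if not url:
        return ""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The key point, as Brett notes next, is not the fetching code but the headers: the handler must still emit a 200 status of its own for the rewritten page.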

Brett_Tabke

2:25 pm on Jan 25, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The code isn't important; the sequence of headers is. What is being fed to the spider?

(and btw, welcome to the forums Ward).

wardbekker

2:31 pm on Jan 25, 2002 (gmt 0)

10+ Year Member



Good question... I will record the header info for you and post it here. Hang on ;-)

rpking

2:33 pm on Jan 25, 2002 (gmt 0)

10+ Year Member



I can confirm that as long as a 200 OK header is passed to Googlebot, this system does work. I've had thousands of pages indexed via a very similar system using PHP.

wardbekker

3:24 pm on Jan 25, 2002 (gmt 0)

10+ Year Member



rpking :

Nice!

Brett_Tabke :

HTTP/1.1 200 OK
Server: Microsoft-IIS/4.0
Date: Fri, 25 Jan 2002 15:18:50 GMT
Content-Type: text/html
Set-Cookie: ASPSESSIONIDQGQGGHQK=LFMELLFDIHDMMKKFFFANGGOB; path=/
Cache-control: private

I hope this is the info you asked for. I used [rexswain.com...]

wardbekker

3:48 pm on Jan 25, 2002 (gmt 0)

10+ Year Member



Because the XMLHTTP component is rather slow, I would recommend using some sort of 'cloaking' for a performance improvement:

Dim useragent, found, searchspiders, spider

' User-agent fragments for the spiders we want to serve directly.
searchspiders = Array("Googlebot", "ArchitextSpider", "Scooter", "Ultraseek", "InfoSeek", "Lycos_Spider_(T-Rex)", "Gulliver", "FAST-WebCrawler")

found = False
useragent = Request.ServerVariables("HTTP_USER_AGENT")

For Each spider In searchspiders
    If InStr(useragent, spider) Then
        found = True
        Exit For
    End If
Next

' newQueryStr is the rewritten query string parsed earlier in 404.asp.
If found Then
    ' Spiders get the page body directly, with a 200 status.
    Response.Write HTTPGet("http://www.foo.com/" & newQueryStr)
Else
    ' Ordinary visitors are simply redirected.
    Response.Redirect "http://www.foo.com/" & newQueryStr
End If
Response.End

john316

7:24 pm on Jan 25, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does anyone know which of these would be better?

html?terms=wordone&wordtwo
html?terms=wordone_wordtwo
html?terms=wordone wordtwo (and let the address go %20)

Is there any penalty for the &?

Thanks

Brett_Tabke

9:19 pm on Jan 25, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The 200 looks fine, Ward. I thought you were redirecting once to get to the final URL (not something you want).

John, I'd stay away from anything with spaces and go with the _ underscore. The underscore is unique in that it can't appear in a domain name.

rogerd

10:27 pm on Jul 20, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



First, thanks to wardbekker for some nifty code! I've incorporated it into a test 404 handler for converting apparent directories and .htm files into ASP queries. So far so good.

One question, though. For actual errors, i.e., requests for nonexistent pages that don't parse into a query, I need to return an error page AND a 404 result code to the spider. If I just branch to a Response.Write(HTTPGet(pagenotfound)), I still return a 200.

I think I need to include the line
Response.Status = "404 Object Not Found"
but where does it go? I've tried putting it in the error page just before I write the "not found" page, but I'm still returning a 200...

rogerd

11:00 pm on Jul 20, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Hmmm, one other weird issue. My error page was working fine in test mode, until I installed it as the actual error page. It seems like it doesn't execute. If I call the page directly, e.g., error.htm (the server is set to process .htm as .asp), it parses the URL properly. If you request a nonexistent page, like bogus.asp, it doesn't seem to run... I know it's calling the right page, because if I stick a plain HTML error page in, it works fine. Any reason why the server wouldn't execute the ASP code when it's redirecting to the error page?

bcc1234

3:08 pm on Jul 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What if I have a really long file name that appears to be static? Is that bad?
Something like:
/DisplayPage_q_site_d_refnum_e_6006_a_page_d_refnum_e_30103_a_cat_d_refnum_e_6951.htm

That's an actual file from one of my sites. Should I try to make them shorter?

Thanks.
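That filename looks like a query string with the reserved characters encoded as tokens. Assuming the mapping _q_ = "?", _a_ = "&", _e_ = "=", _d_ = "." (a guess at what that rewriter does, not something confirmed in the thread), decoding it back is a straight substitution:

```python
# Guessed token mapping for the "static-looking" filenames above;
# the actual rewriter's scheme is unknown.
TOKENS = (("_q_", "?"), ("_a_", "&"), ("_e_", "="), ("_d_", "."))

def decode_name(name):
    """Substitute each token back to the character it encodes."""
    for token, char in TOKENS:
        name = name.replace(token, char)
    return name

decode_name("DisplayPage_q_site_d_refnum_e_6006_a_cat_d_refnum_e_6951")
# Under the guessed mapping: "DisplayPage?site.refnum=6006&cat.refnum=6951"
```

If that reading is right, the .htm name is carrying three parameters, which is exactly the kind of URL length the earlier posts suggest trimming.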

rogerd

6:57 pm on Jul 21, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



bcc, that's a pretty ugly URL. I would certainly try to find a way to shorten them. It looks like the output of one of those ISAPI-redirect-type programs. Beyond search engine effects, just trying to e-mail someone that URL could be problematic. (And forget about typing it!)

On my two questions above, I've narrowed things down a bit. I've managed to make the 404 code work for truly bogus pages, and I've traced the non-execution in error-referral mode to a server issue (I think). I'll repost in the script forum if I'm still stuck. Thanks.

This 33-message thread spans 2 pages.