1) The advice about rewriting URLs with ? in them is sound. Only use ? on pages you don't want spiders to mess around in, and only use rewriting for pages you DO want the spiders to see. WebmasterWorld should use it, for example!
2) Hide all access to dynamic content you don't want spidered behind POST-based forms. The classic example is shopping carts. You don't want googlebot or anyone else browsing your entire site and buying everything!
3) Avoid what I call "stupid ASP tricks". The classic one is a page redirecting to itself to set a cookie, or using JavaScript for essential site functions. These confuse spiders.
4) A huge mistake that a lot of sites make is setting an individual context as soon as they get a visitor (i.e. redirecting to a URL with ?id=whatever). This is a variant of the cookie stupid ASP trick. Sites should only establish an individual context for a user when they absolutely need to (e.g. when the user adds the first item to their shopping cart).
Every part of your site that you want spiders to see ought to be viewable with a minimally configured browser. No java, no scripting, no DHTML.
Best
R
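A minimal sketch of the rewriting advice in point 1, assuming Apache with mod_rewrite enabled (the script name, directory, and `id` parameter here are placeholders, not anything from the thread):

```apache
# Expose a clean, spider-friendly URL while the real page stays dynamic.
# A request for /products/123.html is served internally by /product.asp?id=123;
# the browser and the spider never see the query string.
RewriteEngine On
RewriteRule ^products/([0-9]+)\.html$ /product.asp?id=$1 [L]
```

Links on the site then only ever use the clean form, and the ? version never needs to appear anywhere a spider can find it.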
Kapow, data is deprecated. I don't think Google has as much of a problem with PHP or any dynamic URL as they used to. (Just keep them low-key, without too many parameters.)
trebor, I can see your point. If it were just search engines I was concerned about, I'd agree with you 100%. However, rogue spiders are the worst. The ? URLs with random cache-busting strings have the effect of deterring the rogues. It's also why only about 50% of the site is accessible to unauthorized spiders (ya, I cloak out of self-preservation).
[domain.com...]
or
[domain.com...]
Would Google not decrease my PageRank by 1 for each directory level deep if I let it find the "default" index.html page?
Where links are known, Google calculates the PR in a more complex way (based on the number of links and the PR of the pages on which they appear) - it's possible for sub-directories to have much higher PRs than the homepage.
Brett: Yep, and the side effect is that your site search works better and search engines index your pages appropriately. All because you've taken the time to get the information before returning the headers. There are so many forums where the title is "powered by myforum" for every post - in fact I'm about to customise a forum to get around these limitations very soon.
rpking: Put a post in the serverside or website technologies forums and we'll see if we can get to the bottom of the mod_rewrite prob.
My concern (and that of my server admin) is that mod_rewrite is a security risk, since it requires FollowSymLinks.
Are there ways of overcoming the security risks of mod_rewrite? This method seems to be the most elegant way of serving PHP pages with query strings.
In our case it provided a doorway for hackers to fiddle with the website and database.
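One mitigation worth knowing about, offered here as a sketch rather than a guarantee: per-directory rewrites need symlink-following enabled, but Apache also accepts SymLinksIfOwnerMatch, which only follows a symlink when the link and its target have the same owner - that closes the classic "symlink into someone else's docroot" trick. The directory path below is a placeholder:

```apache
# Allow mod_rewrite in .htaccess without a blanket FollowSymLinks.
# Apache follows a symlink only if link and target share an owner.
<Directory "/var/www/html">
    Options SymLinksIfOwnerMatch
    AllowOverride FileInfo Options
</Directory>
```

Whether this satisfies a given server admin's policy is obviously their call, but it is narrower than FollowSymLinks.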
If the error page is defined like [anotherdomain.com,...] the browser shows the new address.
I tried to follow up your query on FollowSymLinks being a security risk, but was unable to find anything newer than 1998. I would hope that any holes would have been fixed in the four years since!
If it is still a risk then I'd be interested in knowing more.
DenRomano and Ahmad: The discussion here provides some additional ways of providing mod_rewrite style functionality without mod_rewrite - [webmasterworld.com...]
One of them is the custom 404 script - I used to use that method ;)
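For anyone curious, the custom-404 trick mentioned above works roughly like this (the handler name below is made up): every request for a file that doesn't exist is routed to one script, which inspects the originally requested URL and serves the matching dynamic content itself.

```apache
# Hand every "not found" request to a single handler script.
# Under Apache, the handler can recover the original path from the
# REDIRECT_URL environment variable; /rewrite.asp is a placeholder name.
ErrorDocument 404 /rewrite.asp
```

The one gotcha: the handler must send its own 200 status, otherwise every page it serves goes out as a 404 and spiders will drop it.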
I am designing a website with purely dynamic content. The site has an online input interface so editors with 0% HTML knowledge can edit content, and each page can change from one second to the next. Thus I had to build in a way of ensuring that the user sees the newest possible version of the page, not a cached copy from his/her hard disk - i.e. I integrated a date/time stamp into each link to ensure that no page would be cached. In some cases I even added ?timestamp to img tags to ensure the right picture would show. I have based my server-side scripting on ASP. Yes, ASP. Don't go cursing me; it has done the trick for me so far.
But this also means that my URLs are absolutely spider unfriendly.
Now, has anybody actually come across this problem? Each page must never be cached on the user's machine, under any circumstances - it must always be reloaded from the server - but the site should be spider-friendly as well.
And if so, have you found an answer other than putting hidden links inside a div with style=display:none, so spiders can crawl without timestamps?
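The timestamp trick described above, sketched in Python purely for illustration (the original site is ASP, and the function and parameter names here are invented):

```python
import time
from urllib.parse import urlencode

def bust_cache(url: str) -> str:
    """Append a timestamp parameter so the browser treats every
    request as a fresh resource and skips its cache."""
    sep = "&" if "?" in url else "?"
    return url + sep + urlencode({"ts": int(time.time())})

# Every generated link gets a unique query string the browser has never seen:
link = bust_cache("/article.asp?id=7")
```

This is exactly what makes the URLs spider-unfriendly: every crawl sees a different "page", which is the problem the no-cache headers below this post solve at the source.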
I eliminate browser and server caching with this code at the top of my ASP pages.
<%
Response.Buffer = True
'Eliminate browser and proxy caching
Response.Expires = -1                 'expire the page immediately
Response.ExpiresAbsolute = Now() - 2  'absolute expiry date two days in the past
Response.AddHeader "pragma", "no-cache"
Response.AddHeader "cache-control", "private"
Response.CacheControl = "no-cache"
%>
This way you won't have to use strange query-string links.
There is also an excellent ASP forum with some really knowledgeable people. All NT-related stuff is discussed, including ASP 3.0 and .NET:
[webmasterworld.com...]
I tried the Response.Expires one before, and the ExpiresAbsolute one too, but they did not quite work on all browsers or platforms. I did not try your combination of buffering, Expires, ExpiresAbsolute, cache-control and header modification. I can see that you have had many sleepless nights over this problem too.
Should this work, I will credit you on every page where I use this combination. Promise. You have just saved me about 50 sleepless nights. I wish I had figured this out earlier, as by now I must have added timestamps to about 500 links in my website.
Case in point: my site (virginiaartists.net) started out as all HTML, evolved to a few PHP pages, and recently went to all PHP. That's right, every stinkin' page. (Why? I like include and MySQL.)
As each step of the switch from .html to .php extensions went online, I noticed the number of indexed pages dropping. At first I thought they didn't like something else about my site (who knows?). Now that all the pages have been converted, I have dropped from 120+ indexed pages to 1. Just the home page. That's it.
Now I'm going to try the AddType trick to parse html as php, and change all the extensions, again! Betcha anything the pages get indexed around the 20th...
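For reference, the AddType trick mentioned above usually amounts to one line of Apache configuration (this is the classic PHP 4-era handler type; the exact type string varies by PHP setup):

```apache
# Parse .html files through the PHP interpreter, so the old URLs keep
# their extensions (and, hopefully, their index entries).
AddType application/x-httpd-php .html .htm
```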
BTW: Does anyone make a tool that will evaluate a site for Google niceness?