| This 44 message thread spans 2 pages |
|Dynamic = Static|
It's all Google's problem
Thinking about dynamic pages like PHP, ASP and CGI pages, most people know that some SEs have serious trouble indexing these kinds of web pages. But why?
In the early days we had MS-DOS; the first Windows versions were built on top of that operating system, just to give users the idea that they were not typing the 'difficult' commands DOS needed.
Just a little spin-off...
Now we have Apache (and IIS, but ssshh!) as the server; it runs and serves your website. The most basic version of a website just sends out .HTML pages: no scripts, nothing at all.
Install a script engine such as PHP in your Apache configuration and you can build dynamic pages... and with some knowledge you could even serve the .PHP files as .HTML (as far as I know). With that kind of setup, the SE just doesn't see that it's a PHP file, right?
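One common way to do that in Apache (assuming mod_php is installed; the exact handler name varies by version and setup) is to map the .html extension onto the PHP handler. A minimal .htaccess sketch:

```apache
# Hypothetical .htaccess fragment: run .html files through PHP
# (Apache with mod_php; exact MIME type depends on your PHP version)
AddType application/x-httpd-php .html
```

Search engines then only ever see .html URLs, even though PHP generates every page.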
If you use parameters in your URLs, you can get into serious trouble with today's SEs. That's why most SEs advise against putting a Session Identifier in the URL: every user gets a different one.
Some webmasters use a parameter called 'id='; that's a dangerous one. SEs, again, may think that the id= parameter is used for the Session ID. That's why you are better off using another parameter name for identifying different pages such as articles.
Think about this: you have 'articles/reader.php?a=83'. IMO that's a page and a unique URL. Be happy if it is indexed in the SERPs, but what if a user called Joe Doe links to that URL/page with an extra parameter, like 'articles/reader.php?a=83&trick=joohoo'?
To be honest, if you do that on my pages, the script just serves the ?a=83 article: same page BUT a different URL! SEs do see this as duplicate content. That's why you need to code your .PHP, .ASP, etc. pages so that if a user adds parameters that have no effect, the script returns a 404.
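As a sketch of that idea (Python rather than PHP, with a hypothetical whitelist): compare the request's query parameters against the set the script actually uses, and signal a 404 for anything extra.

```python
from urllib.parse import urlparse, parse_qs

# Parameters this hypothetical article script actually uses.
ALLOWED_PARAMS = {"a"}

def status_for(url):
    """Return 404 if the URL carries unexpected parameters, else 200."""
    params = parse_qs(urlparse(url).query)
    if set(params) - ALLOWED_PARAMS:
        return 404
    return 200
```

With this, 'articles/reader.php?a=83' returns 200, while the '&trick=joohoo' variant returns a 404 instead of duplicating the article.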
You can also do things with the robots.txt file. I have some pages that use extra parameters for sorting features, but those are only meant for humans, and an SE could see duplicate content on those pages, so I added the following to robots.txt: /articles/index.php?s=
Google understands this: it will crawl all the pages in /articles/ and notice that it can index /articles/index.php?paging=1,2,3 etc., but NOT /articles/index.php?paging=1&s=DESC.
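For reference, that rule would look something like this in robots.txt. Note that standard Disallow rules are prefix matches, so the plain rule only blocks URLs where s= is the first parameter; Google additionally supports * wildcards for the case where it appears later:

```
User-agent: *
# Prefix match: blocks /articles/index.php?s=... (s as the first parameter)
Disallow: /articles/index.php?s=
# Google-style wildcard: also blocks s= appearing later in the query string
Disallow: /articles/index.php?*&s=
```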
I've just noticed that various SEs still don't understand the meaning of a robots.txt file. But they are waking up; they are putting man-hours into building better robots.txt handling into their engines.
In my opinion it's fine that SEs crawl the pages that are Disallowed by robots.txt... but if a page is disallowed, crawling may be OK, yet indexing is forbidden! I like the reports in Google Sitemaps where you can see the pages, including dynamic ones, that are disallowed. So don't think 'hey, they are overriding the rules'.
This is what I wanted to share with you. It's really an English exam for me; I hope you understand at least a little.
Nothing like a bit of Google-bashing to make you feel better, huh? ;)
Dynamic URLs in and of themselves cause NO problems whatsoever (as long as the URL has three or fewer parameters).
The problems mainly come from "duplicate content" issues caused by poor design.
Duplicate content is caused by having the same content at multiple URLs. This happens when the same content is available at:
- URLs with the same parameters, but the parameters are in a different order.
- parameters are slightly different (for example, one has print-friendly on and the other does not)
- non-www and www versions of the URL (get the 301 redirect in place).
- several related domains serving "200 OK" content (that redirect will be useful again).
- http and https allowing full access, etc.
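Most of these variants can be collapsed programmatically before they ever reach a search engine. A sketch (Python; the host and path are placeholders) that normalizes scheme, host, and parameter order, so that two equivalent URLs compare equal:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url):
    """Force http + www host and sort query parameters alphabetically."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.startswith("www."):
        host = "www." + host
    # Sorting the (key, value) pairs makes parameter order irrelevant.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit(("http", host, parts.path, query, ""))
```

In practice you would 301-redirect any request whose URL differs from its canonical form, rather than just comparing strings.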
Make sure, too, that every page of the site has a unique title tag and a unique meta description. Failure to do so can easily catch you out.
Additionally, do not serve bots with session IDs or with anything that could look like a session ID. That will always cause you a lot of problems.
Finally, make sure that every page of the site links back to http://www.domain.com/ in exactly that format. Do NOT link to /index.html ever.
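If stray links to /index.html already exist, a mod_rewrite 301 can mop them up (hypothetical .htaccess sketch; example.com is a placeholder, and the THE_REQUEST condition keeps internal DirectoryIndex subrequests from looping):

```apache
RewriteEngine On
# 301 any direct request for index.html back to the bare directory URL
RewriteCond %{THE_REQUEST} /index\.html[?\s]
RewriteRule ^(.*/)?index\.html$ http://www.example.com/$1 [R=301,L]
```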
|That's why you need to code your .PHP, .ASP, etc. pages so that if a user adds parameters that have no effect, the script returns a 404. |
Do you realize what an incredible undertaking that would be?
Even Google doesn't protect itself from that.
Try it yourself.
If that's the case, then even static pages would have to protect themselves from someone simply linking to them like
The simplest thing to do is to have Google Sitemaps ONLY spider the dynamic URLs that you give them. If you get dupe content, it is because you did it to yourself, not because someone was malicious.
|Finally, make sure that every page of the site links back to [domain.com...] in exactly that format. Do NOT link to /index.html ever. |
Do you refer to dynamic sites or is it a general rule? Why is it so important?
If you link back to /index.html or to /index.php from every page of your site, then you are sending all your internal PageRank back to that URL.
Trouble is, Google wants to list you as www.domain.com/ and that URL has no PR being passed to it from within your site. It only has PR being fed to it from external links.
Oh, and www.domain.com/index.html is duplicate content of www.domain.com/ and that is another thing you want to be avoiding.
So link back only to the canonical form: http://www.domain.com/ and that will fix the problem.
I do agree it's all Google's problem... :c)
But "screw your competitor idea" is kind of cool...
Check this out - let's say your competitor has a url like this:
Now, one can put a bunch of links to the pages like this:
The default behaviour of PHP (and many other languages) is to display exactly the same page for all the links above (it ignores the unused vars)...
As Bewenched said, it would be close to impossible for an individual webmaster to resolve this, especially if the website is complex...
It would almost seem that Google will have to figure out which variables are more important! The ones that make the page change compared to the ones that don't...
I've got a script page that reacts to the parameter '?version=print' or something similar; just be sure to serve that kind of page with '<meta name="robots" content="NOINDEX,NOFOLLOW">' between its <HEAD></HEAD> tags.
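The logic behind that, sketched in Python (a hypothetical template helper; in PHP you would check $_GET['version'] the same way): emit the noindex tag only when the print parameter is present.

```python
from urllib.parse import urlparse, parse_qs

def robots_meta(url):
    """Return a noindex meta tag for print-friendly views, else nothing."""
    params = parse_qs(urlparse(url).query)
    if params.get("version") == ["print"]:
        return '<meta name="robots" content="NOINDEX,NOFOLLOW">'
    return ""
```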
Just as an example, I had script pages that react to sorting parameters, e.g. '?alfabeth=d' returns everything beginning with the letter D. I included a 'rel="nofollow"' attribute in the <a href...>; I thought I was safe, but no! I found URLs with the ?alfabeth param in the SERPs...
I just added Disallow: /articles/?alfabeth to my robots.txt. What's interesting for me is to look at Google's own robots.txt file... [google.com...]
especially this one:
:) solves everything!
Nothing against Google btw; it's about all SEs.
The issue about linking to [domain.com...] or [domain.com...] is true, but that's Google's problem ;)
Be sure, when you send out links to your homepage, to point to your root without giving the filename, e.g. [domain.com...]
I thought that with Google Analytics you could specify which page is your 'root page', i.e. index.php or index.html, so that indicates Google is aware of this problem. Still, other sites could send out links to /index.php or /index.html; I never saw this, but it could happen.
|The simplest thing to do is to have google sitemaps ONLY spider the dynamic urls that you give them. If you get dupe content it is because you did it to yourself and not someone being malicious. |
Is it correct that you could do this by creating a robots.txt with this:
That said, it is more elegant to do such a trick in your script than in the external robots.txt (which gives away a lot of information to competitors).
Some new updates today for sure...
dynamic and static. Here comes Gooo for the Fall.
Also, while we are on the subject of passed variables: what about the ones that tell us where our traffic is coming from, say if you're running an ad campaign?
Example: your page really is
Say you're running a Google campaign and you want to track its success.
Now say you're running Analytics to track this stuff and Google sees this other URL (and we know they are using this data). So should these be seen as two URLs? No, they shouldn't, but they could be.
Should the value of the URL be based on the content? Absolutely. So you're passing
and also passing from somewhere else
They obviously aren't going to be displaying the same content... or at least they shouldn't. If they do, then supplemental the page, etc.
Surely we aren't all going to have to face doing mod_rewrites or some sort of URL rewriting for the sake of the search engines... gawd, I hate that stuff. It looks so spammy, and if you are a developer it's a pain in the rear to find the file that has an error or needs tweaking.
And even if you are using a rewrite, when a product or a product category changes, the URL changes, and how are you going to do proper 301s? Can you imagine! I'd be writing those darn things all day long.
Hasn't google been saying all along "write the site for the user .. not the search engine"
Maybe google sitemaps should have an option of "Only spider what I give you" That would save all of us trouble and save them server resources.
Enough of a rant... I've got to go write a 301 redirect, since Google has now spidered my default pages (it got these from Analytics; there are no links to them). Joy!
Yeah, I hope Google is reading this, 'cause that's what I want as a webmaster.
I have a dynamic page:
When I pass a variable $x to this page, it displays different content:
I want both of these pages to be listed in the search engine - cause they are different.
But sometimes I have multiple links to the same page:
In this case I do not want both pages to be listed since the actual page hasn't changed...
So, basically, Google will have to figure out what the ROOT page is. It has to figure out, when different variables are passed to that page, which ones change it and which ones don't. I've created a site for my visitors; I expect Google to figure it out...
P.S. If they figure out the above, there's still a little problem: there are sites (like most of mine) that have "totally dynamic" pages, where everything changes every time the page loads... not sure how they would deal with that...
No. You figure it out. You're the designer.
You present the bots with the footprint that you want to be indexed.
You tell search engines what you want to be indexed, and what they should ignore.
As the designer/SEO, that's what the client is paying you to do.
I've designed my pages for my visitors, who don't care how many variables there are in the URL. I didn't expect search engines to hit my pages from different angles like that and then blame ME for duplicate content.
Considering the complexity of some pages, I don't think adding/removing "noindex" tags on the page is a good solution. robots.txt isn't a good idea either, since I have hundreds of types of pages; the rules would be insane.
I know one place where I've put all the pages that I do want listed, and none that I don't: SITEMAPS. If I could tell Google to list only the pages that are there and no others for that domain, it would save everybody a lot of trouble. Not only that, it would cut spidering down to 1/10th of what it is now!
I have just fixed a site that had 50 000 content pages but was exposing over 750 000 individual URLs to search engines.
Now it has 50 000 URLs listed in Google. It can be done. It takes some work, but that work is very much worth doing.
Yes, I agree. It is worth it... For me it's not even a choice - I must do it...
I was mostly complaining about the ever-increasing workload... It'll be yet another thing I have to do on my site to make it work better in the search engines, something they can't figure out themselves... (and something Google asked me not to do...)
|didn't expect Search Engines to hit my pages from different angles like that and then blaming ME for duplicate content. |
They aren't "blaming" you; they are simply cleaning up their index. It isn't "blame", it isn't a "penalty"; it is simply a search engine trying to serve good content to its users.
Go ahead and design your site for the user. Just remember that if you want to be listed, one of your users is Googlebot.
|I have just fixed a site that had 50 000 content pages but was exposing over 750 000 individual URLs to search engines. Now it has 50 000 URLs listed in Google. It can be done. It takes some work, but that work is very much worth doing. |
Yes it can .. I agree...and I had to do it myself over the last few months.
Massive amounts of 301's to make sure there wasn't dupe content
Making sure no ssl spidering can be done
Optimizing code as much as possible
Making sure titles are unique (difficult with a big dynamic site)
Putting in pop out of frame scripts
Blocking scraper ips
Reporting index spam
Analyzing my server logs to make absolutely sure of no dupes or missed 301s
In fact I did so much coding lately I wore a smooth spot on my keyboard space bar and broke two mice :O
Now maybe I can get to the good stuff of adding more content and adding more stuff for my customers .. and take about a 2 week nap.
Wow, your hands must be tired.....
We have a PHP-based site, but we used the Zen Cart template. Part of their template has some PHP coding to prevent bots from indexing duplicate content. Part of the coding is listed below.
Basically, since it's a template, it serves up a noindex meta tag when a bot goes to the above-listed template URL. If you are very good at programming PHP, you can insert simple code in a few places and not have to 301 a bunch of pages.
<edit reason - fix sidescroll>
[edited by: tedster at 5:25 pm (utc) on Sep. 8, 2006]
No one forces a webmaster to use "dynamic" URLs with GET variables. It is quite close to a contradictio in adjecto if, in the same breath, you expect SEs to index the presumed "static" content of these "dynamic" pages. Look at WebmasterWorld. What a "dynamic" site, constantly changing/adding content, but I only see static URLs, pure HTML.
If anything goes wrong, it is just a consequence of your coding.
[edited by: tedster at 10:17 pm (utc) on Sep. 8, 2006]
Yes, you are correct that a dynamic site can be made to look static, like WebmasterWorld. But it's a fairly simple site with only a couple of functions: posting and editing.
Think about a site like Myspace, where you have many, many different types of pages: view profile, view picture, comment on picture, add friend, send message, forward to friend, add to group... It's hard to make that kind of site without GET variables in the URL.
So "no one forces a webmaster to use dynamic URLs with get-vars" is not right: the design and complexity of the website forces me to use them.
So, technically speaking, WebmasterWorld is in violation of these Google webmaster guidelines:
If making a dynamic page to look like a static page isn't the trick to improve the search engine rankings, then I must be totally clueless about this whole SEO thing...
|Avoid tricks intended to improve search engine rankings. |
Forum scripts are relatively simple to make search engine friendly. Keep all the "log in", "start thread", "reply to post", "send PM", etc, URLs out of the index using noindex meta tags.
Likewise exclude any duplicate content URLs for all threads from the index (such as print friendly, and threaded views, etc).
You can control the footprint of what is indexed. Use it to your advantage. And, no, it is not cheating or devious manipulation. It is proper site design.
Here's some clues to get started: [webmasterworld.com...]
>>>>>I know one place where I've put all the pages that I do want listed - and none that I don't -- SITEMAPS. If I could tell google to list only the pages that are there and no other for that domain, it would save everybody a lot of trouble. Not only that, it'll cut spidering down to 1/10th of what it is now! <<<<<
I wholeheartedly agree! And what the he** is wrong with Google that it won't pay attention to a specially made sitemap created at their request and to their specifications? NO other SE has the problems Google has. Good grief! The need for a 301 would be almost non-existent if not for Google.
Oliver, here is a little experiment I did:
For this thread:
Will all bring up this very same thread - but URL's are different.
If someone puts up a bunch of links like these and Google comes along and swallows them, then downloads all the pages, and their "super smart algo" determines it's all the same page, then SCCCHAZAM: WebmasterWorld is "kicked in the balls" for duplicate content. In your words, that's because the webmaster of WebmasterWorld didn't do their job right... now, are you sure?
Yes, they will show up for a few weeks, and then the internal links to the "real" canonical URL will trump those "stray" duplicates. They will be marked unimportant by Google, and fade away.
Some months ago, I began a thread with the following post, but it received zero replies...
|It seems to me that there has long been a need for search engines to be able to determine which url parameters are function-related and which are content-related. I do not believe there is any way in which this is currently possible but I have a simple suggestion. |
Currently, search engines ignore everything in the URL beyond the #. That's OK, but if search engines ignored everything beyond a null parameter, all function-related parameters could be positioned so as to be ignored too.
This would not have any sort of immediate impact, but over time, it could have a dramatic effect if widely adopted.
Of course, there may be a better way of identifying function-related parameters.
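The suggestion can be made concrete with a small sketch (Python; the sentinel name '_' is an arbitrary choice for illustration): everything after a null-valued marker parameter would be treated as function-related and stripped before indexing.

```python
from urllib.parse import urlsplit, urlunsplit

SENTINEL = "_"  # hypothetical null parameter marking the cut-off point

def content_url(url):
    """Drop the sentinel and every parameter that follows it."""
    parts = urlsplit(url)
    kept = []
    for pair in parts.query.split("&"):
        if pair == SENTINEL or pair == SENTINEL + "=":
            break  # everything from here on is function-related
        if pair:
            kept.append(pair)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       "&".join(kept), ""))
```

So 'reader.php?a=83&_&sort=desc' would be indexed simply as 'reader.php?a=83'.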
If everything is OK, all the server output is HTML.
A webmaster can hide the dynamic part and use mod_rewrite to show "normal" URLs.
But what is normal?
Yep, this issue has been around for a while. I was complaining about it for a few months too. See, the thing is that people don't respond to what they think is not affecting them. This, however, affects everybody... you just need to put it in a way that most people will "translate" and "hypothetically apply" to their own sites...
I would love a definitive answer from Google on this issue... I mean, I don't mind doing it either way they "prefer", but I need to know: in this particular situation, which way do they "prefer"?
Hey, g1smd. The fact that
is inefficient. Is it possible NOT to have that...?
|they will show up for a few weeks |
But I don't want to lose them! I want them to count, just like the others. Yeah, I know the first thought: do the 301, silly... but that's highly inefficient too... any suggestions?
|They will be marked unimportant |
|Think about a site like Myspace, where you have many many different types of pages: view profile, view picture, comment on picture, add friend, send message, forward to friend, add to group... It's hard to make this kind of site without the GET variables in the URL. |
I must admit I have not wasted my time at Myspace yet, but I doubt that. For instance, the syntax for "view picture" is <a href="path/imageresource">view picture</a>. No need for any GET variable. On the contrary, I don't actually want the SERPs to be spammed by the results of any "forward to friend" action.
This is truly a bad counterexample: I know that you may configure your webserver in such a way that it parses script languages inside .htm pages. The W3C specifications for HTML, however, don't say anything about processing GET variables. (Or did I miss something? Why the hell, then, did I learn PHP?) So there is no need for an SE to assume these are different URLs.
I'd even go further: deciding carefully between a) actions to be processed by a CGI script and b) pure URIs worth being indexed by a search engine very much helps a programmer get clear in his mind about what he is actually doing. Which, as we all know, is not easy at all.
I have read on many occasions that Google recommends URLs contain no more than three GET variables if you want them to be indexed. I'd speculate that this has to do with the overall topic of this thread: given a maximum of three such parameters, you are left with a combinatorial matrix of 12 possible occurrences of these as a GET appendix to the URL of the same CGI script. If four variables were allowed, you'd get 60, and so on, with the usual tremendous growth rate of such curves.
I believe that this limitation has to do with the fact that Google is willing to check these 12 in order to NOT apply a duplicate content filter due to glitches in parameter order made by a webmaster, but that Google is not willing to check ALL possible combinations of URLs with any higher number of such variables.
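The counts quoted above (12 for three variables, 60 for four) correspond to the number of ordered arrangements of two or more of the parameters; whether that is actually the reasoning behind Google's three-parameter advice is the poster's speculation, but the arithmetic checks out:

```python
from itertools import permutations

def orderings(params, min_size=2):
    """Count ordered arrangements of at least min_size of the parameters."""
    total = 0
    for k in range(min_size, len(params) + 1):
        total += len(list(permutations(params, k)))
    return total
```

orderings(["a", "b", "c"]) gives 6 + 6 = 12, and orderings(["a", "b", "c", "d"]) gives 12 + 24 + 24 = 60.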
If adding arbitrary GET-variable sequences to links were a means for a competitor to tank your site, the technique would have been applied long ago.
Nevertheless it would be helpful, if someone from inside google could clear this issue once and for all.
I haven't a clue what Google does or doesn't actually do;
I can only relate what I've seen happen: more than one person has had page ranking problems when a page was linked to using query strings (so-called tracking variables, by some).
I think that what g1smd says would be correct given time, however a lot can happen in a short period of time. Ask MikeNoLastname about it.
I do a 301 to the page with the query string removed; some folks send 404s; some don't even know it happens and get whatever result the server provides, (from what I've seen) normally a 200 and the page as though there were no query string.
In any event, Google really only wants one copy of the content, so I'd suggest that, to be safe, you only serve the data from one URL, stripping the query string via a RewriteRule and a 301, or returning a "hey, this ain't here" in response to such requests.
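A minimal sketch of that RewriteRule approach (.htaccess; page.html is a placeholder, and you would only do this on URLs whose scripts genuinely ignore the query string):

```apache
RewriteEngine On
# If any query string is present on this page, 301 to the bare URL;
# the trailing "?" in the substitution discards the original query string
RewriteCond %{QUERY_STRING} .
RewriteRule ^page\.html$ /page.html? [R=301,L]
```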
I've been writing code for a long time and I trust exactly zero systems to "get it", if you get my drift.
The case in point is a perfect example of a possible problem (note the use of weasel words): you can make up any variable name to use in a query string, which makes the possible number of variations unlimited.
Google would be daft to attempt to contain such a mess.
Now, I know all the problems this can cause for folks building certain kinds of sites. But if they can build them, they can also take the time to tame them.
Yes, but Google will know what URLs are seen in links that point to your site from within your own site, and what URLs are seen only in links that point to your site only from other sites, and what URLs are seen in links both from within your own site and from other sites. From that it can deduce a certain amount of "trust" in the site itself, and resolve duplicate URLs in favour of those that are internal links within the site itself. I tested this a year ago and saw it happen then.