Forum Moderators: Robert Charlton & goodroi
In the early days we had MS-DOS; the first Windows versions were built on top of that operating system, just to give users the impression they weren't typing the 'difficult' commands DOS required.
Just a little spin-off...
Now we have Apache (and IIS, but shh!) playing that role: it runs and serves your website. The most basic version of a website just sends out .html pages... no scripts, nothing at all.
Install a script engine such as PHP in your Apache configuration and you can build dynamic pages... with some knowledge you could even serve the .php files as .html (as far as I know). With that kind of setup, the search engines just don't see that it's a PHP file, right?
If you use parameters in your URLs, you can get into serious trouble with today's search engines. That's why most of them advise against putting a session identifier in the URL: every user gets a different one.
Some webmasters use a parameter called 'id='; that's a dangerous one, because search engines again may think the id= parameter is a session ID. That's why you're better off using another parameter name for identifying different pages, such as articles.
Think about this: you've got 'articles/reader.php?a=83'. IMO that's a page and a unique URL. Be happy if it's indexed in the SERPs, but what if a user called Joe Doe links to that URL/page with an extra parameter, like 'articles/reader.php?a=83&trick=joohoo'?
To be honest, if you do that on my pages, the script just serves the ?a=83 article... the same page, BUT at a different URL! Search engines do see this as duplicate content. That's why you should code your .php, .asp, etc. pages so that if a user adds parameters that have no effect, the script returns a 404.
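The "unknown parameter means 404" idea can be sketched like this (in Python for brevity; the same check is easy in PHP). The whitelisted name 'a' comes from the ?a=83 example above; everything else is illustrative:

```python
# Sketch of the "unexpected query parameter -> 404" idea.
# 'a' matches the /articles/reader.php?a=83 example; the rest is made up.
from urllib.parse import parse_qs, urlparse

ALLOWED_PARAMS = {"a"}  # the only query parameter this page understands

def status_for(url: str) -> int:
    """Return 200 if every query parameter is whitelisted, else 404."""
    params = set(parse_qs(urlparse(url).query))
    return 200 if params <= ALLOWED_PARAMS else 404

print(status_for("/articles/reader.php?a=83"))               # 200
print(status_for("/articles/reader.php?a=83&trick=joohoo"))  # 404
```

With a check like this, the &trick=joohoo link from outside never creates a second indexable copy of the article.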
You can also do things with the robots.txt file. I have some pages that use extra parameters for sorting features... but those are only meant for humans, and the search engines could see duplicate content on these pages, so I added the following Disallow to robots.txt: /articles/index.php?s=
Google understands this: it will crawl all the pages in /articles/ and notice that it can index /articles/index.php?paging=1,2,3 etc., but not /articles/index.php?paging=1&s=DESC
I just noticed that some search engines still don't understand the meaning of a robots.txt file... but they're waking up; they're putting man-hours into building better robots.txt handling into their engines.
In my opinion it's fine that search engines crawl the pages that are Disallowed by robots.txt... but if a page is disallowed, crawling may be OK, indexing is forbidden! I like the reports in Google Sitemaps where you can see the pages, including dynamic ones that are disallowed. So don't think 'hey, they are overriding the rules.'
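The crawl-versus-index distinction being made here can also be expressed in markup: a robots.txt Disallow blocks crawling of the URL entirely, while a robots meta tag lets the bot fetch the page but asks it not to index it (a sketch; the `follow` value is optional):

```html
<!-- robots.txt "Disallow" stops the URL being crawled at all;
     this meta tag allows crawling but asks engines not to index the page. -->
<meta name="robots" content="noindex, follow">
```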
That's what I wanted to share with you. Writing it was a real English exam for me; I hope you understand at least a little.
In which case no one should _ever_ have any problems with ?something=x attached to pages. Not even a little wobble.
If Google is doing a validity check of that nature, they would need to have the entire site spidered. Do they always spider a site fully, in a timely manner, to detect "bum" IBLs of this nature? And what about the fact that pages may actually exist outside of what the site allows Google to reach through the links on the site?
Isn't it nice that the web is so large and always changing?
So you have a choice: prevention, or suffer any possible cold and wait it out.
Thus the saga of the Wacky Wobble World continues.
[edited by: theBear at 1:50 am (utc) on Sep. 11, 2006]
There are several threads currently running on WebmasterWorld dealing with this issue.
[webmasterworld.com...]
and
[webmasterworld.com...]
[edited by: theBear at 2:16 am (utc) on Sep. 11, 2006]
Anyway, I was very busy with the robots.txt and <meta noindex> options; let me explain.
I've got some pages with 25 links to published articles; the URL is /articles/index.php
I give users the opportunity to sort the list on different columns... this option generates URL parameters, e.g. /articles/index.php?sort=date
I didn't want these URLs to be indexed by Googlebot, for two reasons:
1: the possibility of duplicate content;
2: IMO the SERPs need logical URLs... better for the searcher/user.
I have some AdSense on these pages, and now you see the problem... I disallowed * in robots.txt for these sorting URLs... and guess what...
AdSense has some problems with that!
A lot of you know this already, but to respond to this topic: be careful with <meta noindex> on pages that carry AdSense... I fixed this with:
# Robots.txt
User-agent: *
Disallow: /articles/?s=
Disallow: /articles/index.php?s=
User-agent: Mediapartners-Google
Allow: /articles/?s=
Allow: /articles/index.php?s=
Language frequently makes understanding difficult.
One of the sites I work on has many, many pages produced by a highly customised CMS. We have far more of the site off limits to indexing than we allow the search engines to index, but it is not off limits to viewing by the AdSense bot.
It is very easy for us to design user-friendly pages that would lead to the generation of massive amounts of duplicate content.
The example of sorting is one: you could make most of those options available as form actions rather than as query-string variables on a URL. Someone will correct me if I'm wrong, but I don't think the SEs currently do forms. If they do, I've got a bit of work to do.
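A sketch of that idea: sort options submitted via a POST form create no new crawlable URLs, unlike ?sort= links (the field names here are made up for illustration):

```html
<!-- Sorting via POST: the list URL stays /articles/index.php,
     so no ?sort=... variants exist for a spider to collect. -->
<form action="/articles/index.php" method="post">
  <select name="sort">
    <option value="date">Date</option>
    <option value="title">Title</option>
  </select>
  <input type="submit" value="Sort">
</form>
```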
We try to keep as much out of the robots.txt file as possible. We already have more than enough places to make mistakes. Mistakes? What, we make mistakes? Surely you are jesting ;).
If you have ever owned a site in the past, say 4 years ago, where the URLs looked like
www.somesite.com/product.cfm?p=1&c=2&blah=whatever, look at your old access_logs: you will find that Google had no problem whatsoever indexing these pages. They still have no problem indexing them. Of course, session IDs never helped.
I personally believe that their filters downgrade sites with such URLs based on the user experience. Their index looks a lot cleaner with normal URLs.
Actually, when you get right down to it, the various search engines don't even have to show the URL.
I don't even think that Google's filters downgrade pages just because they have an attached query string.
As g1smd says, it's the other things that get you: causing duplicate content, and tons of pages that show the same title and meta information. Think about all the search routines that pump out page after page titled "keyword1 keyword2 etc... results for mydomain.com".
"Sessionid" is a synonym for index spam. Don't give the bot more names for the same stuff; while the bot doesn't care, the rest of the system will have more room for real content.
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
Obviously, it would be silly if the webmaster himself linked to his own pages in this manner. But what if some bad person from outside put thousands of links to the site like this... is there a possibility Google would follow them, download the pages, and detect duplicates?
Hi, r3nz0, good example with robots.txt! But what would you do if you need some of the pages with ? to be listed, but some not? For example:
http://example.com/index.php?x=1 - listed
but
http://example.com/index.php?z=2 - not listed
http://example.com/index.php?x=1&y=2 - not listed
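One way to get that split, relying on robots.txt's plain prefix matching (and, for Googlebot, its documented * wildcard extension), is to disallow only the unwanted patterns. A sketch, not tested against every engine:

```
User-agent: *
# blocks ?z=... entirely
Disallow: /index.php?z=
# blocks ?x=1&y=... but leaves plain ?x=1 crawlable,
# since no rule is a prefix of /index.php?x=1
Disallow: /index.php?x=1&y=
# Googlebot also understands wildcards, which would catch y= in any
# position in the query string:
# Disallow: /index.php?*y=
```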
ashear, indeed, you're right. I personally didn't have a problem with search engines listing my dynamic pages (even the ones with 3+ variables). The problem we are discussing is unseen: we think that Google and others might be downloading unneeded dynamic pages, then detecting the duplicate content, then dropping those pages. I doubt they would actually admit it, but we are at least trying to get out of them an idea of how they deal with that issue.
It would.
But most forum, cart, CMS, and other types of dynamic sites DO have these flaws built into them - by the bucketload.
Check out what I wrote about vBulletin, as just one example, just a few months ago.
[webmasterworld.com...]
They do get seen. They do get indexed. They fade away after a few weeks if nowhere on the site links to the same URL.
They are seen as being duplicate content.
I have tested what happens, several times, over the last 18 months or more.
phpBB, if not tamed, will spam the index.
There are all kinds of ways to cause duplicate content problems for a site.
Default server setups _can_ leave multiple paths to a site. When those paths get discovered, a single link to the site can (if the internal linking does not specify a full path for all of the links on its pages) cause both the site to be duplicated and the site itself to actually tell Google that all of those duplicate pages are for real.
There have been multiple 1000+-message threads related to that situation. These huge threads have occurred on and off for several years.
www/non-www issues; sites being replicated on an IP address as well as on www/non-www; and some folks finding the site also appearing on mail.domain and on parked sites.
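The usual fix for the www/non-www half of that, assuming Apache with mod_rewrite enabled (example.com is a placeholder hostname), is a site-wide 301 to one canonical host:

```
# .htaccess sketch: redirect every non-www request to the www hostname,
# so only one copy of each URL exists for the search engines.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The same pattern also covers the bare-IP and mail.domain duplicates if the condition is "anything that is not the canonical host".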
Also look up Googlewash.
[edited by: theBear at 1:57 am (utc) on Sep. 12, 2006]