Forum Moderators: Robert Charlton & goodroi


Dynamic = Static

It's all Google's problem

         

r3nz0

10:35 am on Sep 6, 2006 (gmt 0)

10+ Year Member



Thinking about dynamic pages - PHP, ASP and CGI pages - most people know that some SEs have serious trouble indexing this kind of webpage. But why?

In the early days we had MS-DOS; the first Windows versions were built on top of that operating system, just to give users the idea they were not using the 'difficult' commands DOS required.

Just a little spin-off...

Now we have Apache (and IIS, but sssst!) as the server: it runs on the OS and serves your website. The most basic version of a website just sends out .HTML pages - no script, nothing at all.

Install a script engine such as PHP in your Apache configuration and you can build dynamic pages... and with some knowledge you could even serve the .PHP files as .HTML (as far as I know). With that kind of setup, an SE just doesn't see that it's a PHP file, right?

If you use parameters in your URLs, though, you can get into serious trouble with today's SEs. That's why most SEs advise against using a session identifier in your URL: every user gets a different one.

Some webmasters use a parameter called 'id='. That's a dangerous one: SEs, again, may think the id= parameter is a session ID. That's why you're better off using another parameter name for identifying different pages, such as articles.

Think about this: you've got 'articles/reader.php?a=83'. IMO that's a page and a unique URL. Be happy if it is indexed in the SERPs - but what if a user called Joe Doe links to that URL/page with an extra parameter, like 'articles/reader.php?a=83&trick=joohoo'?

To be honest, if you do that to my pages, they just serve the ?a=83 article... same page BUT different URL! SEs do see this as duplicate content. That's why you need to code your .PHP, .ASP, etc. pages so that the script returns a 404 if a user supplies parameters that have no effect.
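The 404 trick can be sketched roughly like this - a Python sketch standing in for the PHP, where the whitelist {'a'} just mirrors the example URL above; nothing here is real site code:

```python
from urllib.parse import urlparse, parse_qs

# The only query parameter this hypothetical reader script actually uses.
ALLOWED_PARAMS = {"a"}

def status_for(url):
    """Return 404 for any URL carrying parameters the script ignores,
    so '?a=83&trick=joohoo' cannot shadow '?a=83' as a duplicate."""
    params = parse_qs(urlparse(url).query)
    if set(params) - ALLOWED_PARAMS:
        return 404  # unknown parameter: refuse rather than serve a duplicate
    return 200

print(status_for("/articles/reader.php?a=83"))               # 200
print(status_for("/articles/reader.php?a=83&trick=joohoo"))  # 404
```

The real check would live in the page script itself, but the idea is the same: only recognised parameters get a 200.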

You can also do things with the robots.txt file. I have some pages that use extra parameters for sorting features, but those are only meant for humans, and an SE could see duplicate content on those pages, so I added a Disallow rule for /articles/index.php?s= to the robots.txt.

Google understands this: they will crawl all the pages in /articles/ and notice that they can index /articles/index.php?paging=1,2,3 etc., but not /articles/index.php?paging=1&s=DESC.

I've noticed that various SEs still don't understand the meaning of a robots.txt file... but they're waking up, and they're putting man-hours into building better robots.txt handling into their engines.

In my opinion it's fine that SEs crawl pages that are Disallowed by robots.txt - but if disallowed, crawling is OK, indexing is forbidden! I like the reports in Google Sitemaps where you can see the pages, including dynamic ones, which are disallowed. So don't think "hey, they are overriding the rules".

That's what I wanted to share with you. It's really an English exam for me; I hope you understood a little.

theBear

1:27 am on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



g1smd,

In which case no one should _ever_ have any problems with ?something=x attached to pages. Not even a little wobble.

If Google is doing a validity check of that nature, they would need to have the entire site spidered. Do they always spider a site fully, in a timely enough manner, to detect "bum" IBLs of this nature? And what about the fact that pages may actually exist outside of what the site allows Google to access through the links on the site?

Isn't it nice that the web is so large and always changing.

g1smd

1:32 am on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that's why I clarified my observation that the "fake" external-only URLs were quickly indexed, but later delisted in favour of the other alternative URL for that content. After a few weeks only the one on any internally generated links still appeared in the SERPs. The "internal" URLs did not get dropped.

theBear

1:49 am on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So we have what will be a self-correcting short-term situation, one that can also be prevented from even occurring by the IBL target site.

So you have a choice: prevention, or suffer through any possible cold and wait it out.

Thus the saga of the Wacky Wobble World continues.

[edited by: theBear at 1:50 am (utc) on Sep. 11, 2006]

theBear

2:13 am on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Now, if the site itself points to pages in this manner, then it has in effect told Google that they are all valid pages; they will be left in the index and, depending on what is actually in each page, may or may not cause a "duplicate" content issue for the site. It might even trip a too-many-pages-in-a-short-time flag.

There are several threads currently running on WebmasterWorld dealing with this issue.

[webmasterworld.com...]

and

[webmasterworld.com...]

[edited by: theBear at 2:16 am (utc) on Sep. 11, 2006]

r3nz0

9:56 am on Sep 11, 2006 (gmt 0)

10+ Year Member



OK, this Dutchman has to read very carefully to understand you guys :)

Anyway, I have been very busy with the robots.txt and <meta noindex> options; let me explain.

I've got some pages with 25 links to published articles; the URL is /articles/index.php.

I give users the opportunity to sort the list on different columns. This option generates URL parameters, e.g. /articles/index.php?sort=date.

I didn't want these URLs to be indexed by Googlebot, for two reasons:

1: the possibility of duplicate content
2: IMO the SERPs need logical URLs - better for the searcher/user

I have some AdSense on these pages, and now you see the problem: I Disallowed these ?sort=date URLs for User-agent: * in robots.txt... and guess what:
AdSense has some problems with that!

A lot of you know this already, but to respond on this topic: be careful with <meta noindex> on pages that carry AdSense. I fixed it with:

# Robots.txt

User-agent: *
Disallow: /articles/?s=
Disallow: /articles/index.php?s=

User-agent: Mediapartners-Google
Allow: /articles/?s=
Allow: /articles/index.php?s=

theBear

1:55 pm on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



r3nz0,

Language frequently makes understanding difficult.

One of the sites I work on has many, many pages produced by a highly customised CMS. We have far more of the site off limits to indexing than we allow the search engines to index - but it is not off limits to the AdSense bot.

It would be very easy for us to design user-friendly pages that would lead to the generation of massive amounts of duplicate content.

The sorting example would be one: you could make most of those options available as form actions rather than as query-string variables on a URL. Someone will correct me if I'm wrong, but I don't think the SEs currently do forms. If they do, I've got a bit of work to do.
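For illustration, a sort control like that could be moved into a POST form, so the choice never appears in a crawlable URL (hypothetical markup - the field names are made up):

```html
<!-- Hypothetical sketch: the sort option travels in the POST body,
     so no ?sort=... query string ever appears in a link the bot can follow -->
<form method="post" action="/articles/index.php">
  <select name="sort">
    <option value="date">Date</option>
    <option value="title">Title</option>
  </select>
  <input type="submit" value="Sort">
</form>
```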

We try to keep as much out of the robots.txt file as possible; we already have more than enough places to make mistakes. Mistakes? Us, make mistakes? Surely you are jesting ;).

g1smd

6:45 pm on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would use the meta robots noindex tag on all of the alternative sort options.
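For reference, that tag sits in the <head> of each sort-variant page; "noindex,follow" is a common choice because it blocks indexing while still letting the bot follow the links on the page:

```html
<head>
  <!-- keep this sort-variant page out of the index, but follow its links -->
  <meta name="robots" content="noindex,follow">
</head>
```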

ashear

7:06 pm on Sep 11, 2006 (gmt 0)

10+ Year Member



I find it odd that the SEO community feels that Google has a "problem indexing dynamic URLs".

If you ever owned a site in the past - say 4 years ago - where the URLs looked like

www.somesite.com/produc.cfm?p=1&c=2&blah=whatever, then look at your old access_logs: you will find that Google had no problem whatsoever indexing those pages. They still have no problem indexing these pages. Of course, session IDs never helped.

I personally believe that their filters downgrade sites with such URLs based upon the user experience. Their index looks a lot cleaner with normal URLs.

g1smd

7:17 pm on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They do not naturally downgrade such URLs.

The problem is that mainly due to poor site design most such sites are serving duplicate content in some way or another.

That is the real problem, not the dynamic URLs themselves.

theBear

8:46 pm on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"I personally believe that their filters downgrade sites with such URLs based upon the user experience. Their index looks a lot cleaner with normal URLs."

Actually, when you get right down to it, the various search engines don't even have to show the URL.

I don't even think that Google's filters downgrade pages just because they have an attached query string.

As g1smd says, it's the other things that get you: causing duplicate content, and tons of pages that show the same title and meta information. Think about all of the search routines that pump out page after page titled "keyword1 keyword2 etc... results for mydomain.com".

A session ID is a synonym for index spam. Don't give the bot more names for the same stuff; while the bot doesn't care, the rest of the system will have more room for real content.

goubarev

10:02 pm on Sep 11, 2006 (gmt 0)

10+ Year Member



Ok, Oliver, here is the better counter-counter example :c)

[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

Obviously, it would be silly if the webmaster himself linked to his own pages in this manner. But what if some bad person on the outside put 1000s of links to the site like this? Is there a possibility Google would follow them, download the pages, and detect duplicates?

Hi r3nz0, good example with robots.txt! But what would you do if you needed some of the pages with a ? listed, but some not? For example:
http://example.com/index.php?x=1 - listed
but
http://example.com/index.php?z=2 - not listed
http://example.com/index.php?x=1&y=2 - not listed

ashear, indeed, you're right. I personally didn't have a problem with search engines listing my dynamic pages (even the ones that have 3+ variables). The problem we are discussing is unseen: we think that Google and others might be downloading unneeded dynamic pages, might then be detecting the duplicate content, and then dropping those pages. I doubt they would actually admit it, but we are at least trying to get some idea out of them of how they deal with that issue.
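One way to handle the "some parameters listed, some not" case - sketched in Python, with the whitelist {'x'} mirroring the example URLs above; purely illustrative, not anyone's actual script - is to keep only the recognised parameters and 301-redirect when anything was stripped, so only the clean URL can be indexed:

```python
from urllib.parse import urlparse, parse_qsl, urlencode

# Parameters this hypothetical page legitimately responds to.
ALLOWED = {"x"}

def canonical(url):
    """Drop unknown query parameters; return (cleaned_url, changed_flag).
    If changed_flag is True, the script could issue a 301 to cleaned_url,
    collapsing ?x=1&y=2 and ?z=2 variants onto a single indexable URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED]
    query = urlencode(kept)
    cleaned = parts.path + ("?" + query if query else "")
    return cleaned, query != parts.query

print(canonical("/index.php?x=1"))      # ('/index.php?x=1', False)
print(canonical("/index.php?x=1&y=2"))  # ('/index.php?x=1', True)
```

A 404, as r3nz0 suggested earlier, is the stricter alternative; a 301 consolidates the variants instead of rejecting them.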

g1smd

10:15 pm on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> Obviously, it'll be silly if the webmaster himself would link to his own pages in this manner. <<

It would.

But most forum, cart, CMS, and other types of dynamic sites DO have these flaws built into them - by the bucketload.

Check out what I wrote about vBulletin, as just one example, just a few months ago.

[webmasterworld.com...]

g1smd

10:18 pm on Sep 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> The problem we are discussing is unseen - we think that google and others might be downloading un-needed dynamic pages then, might be detecting the duplicate content, then dropping those pages - I doubt, they would actually admit it. <<

They do get seen. They do get indexed. They fade away after a few weeks if nowhere on the site links to the same URL.

They are seen as being duplicate content.

I have tested what happens, several times, over the last 18 months or more.

theBear

1:56 am on Sep 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To add another example, from the forum software arena:

phpBB will, if not tamed, spam the index.

There are all kinds of ways to cause duplicate content problems for a site.

Default server setups _can_ leave multiple paths to a site. When those paths get discovered, a single link to the site can (if the internal linking does not specify a full path for all of the links on its pages) cause the site to be duplicated, and cause the site itself to actually tell Google that all of those duplicate pages are for real.

There have been multiple 1000+ message threads related to that situation. These huge threads have occurred on and off for several years.

www/non-www issues; sites being replicated on an IP address as well as on www/non-www; and some folks finding the site also appearing on mail.domain and on parked sites.
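The usual cure for the www/non-www half of that list is a server-side 301 to one canonical hostname. A minimal Apache mod_rewrite sketch, assuming .htaccess rewriting is enabled and with example.com as a placeholder:

```apache
# Hypothetical sketch: collapse every hostname variant onto www.example.com
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```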

Also look up Googlewash.

[edited by: theBear at 1:57 am (utc) on Sep. 12, 2006]

This 44 message thread spans 2 pages.