Forum Moderators: Robert Charlton & goodroi

Best way to remove dupe pages with sid session IDs?

         

markovald

3:51 am on Sep 9, 2015 (gmt 0)

10+ Year Member



I have a custom CMS that generates duplicate content for some files...

example:
homepage.php?214318&sid=r2342323242342
homepage.php?184343&sid=s12940343436
homepage.php/-searchsdfsdfs

I'm talking about dozens of dynamically generated files... I can block these files with robots.txt, but they will still be indexed. I want to remove the pages and block them.

Do you know a wildcard or some other way to block dozens of duplicate files with similar names (example: homepage.php?XXX)?
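For reference, here is a robots.txt sketch of the kind of wildcard being asked about. Googlebot (and most major crawlers) support the * wildcard in Disallow rules; the file and parameter names below are taken from the example URLs above, so adjust them to the real site:

```
User-agent: *
# Block any homepage.php URL that carries a query string
Disallow: /homepage.php?
# Block any URL with a sid parameter anywhere in the query string
Disallow: /*sid=
```

Note that a robots.txt Disallow only stops crawling; URLs that are already indexed can remain in the index, which is exactly the problem described above.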

Thank you for your assistance!

Robert Charlton

10:57 am on Sep 9, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



markovald, welcome again. I've taken the liberty of changing your title from "What is the best way to remove pages?" to add a reference to the "sid" string, which refers to session IDs. This, I think, will attract members who can provide better answers.

I mention it to encourage members to bring you up to speed on what these are and perhaps how you can avoid generating them, as well as how you can clean them up.

If your CMS is a known brand, rather than home grown, it might be helpful to identify it. I'm assuming that this is an ecommerce site.

not2easy

12:08 pm on Sep 9, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In Google Search Console (formerly GWT) you can add URL Parameters like "sid" to control the crawling. You can expect to see the number of pages reported for your site decline, but that is because these dynamic variations of the same content have been artificially inflating the numbers. I have had to add URL Parameters to prevent a ton of 404s when Google would list "/SEARCH" URLs that only exist when that search is executed.

There are other actions you can take, but those depend on working within your CMS. Because sid URLs relate to individual visitors, you don't want to rewrite the URL or add a canonical tag.

markovald

12:18 pm on Sep 9, 2015 (gmt 0)

10+ Year Member



Thank you, Robert Charlton! I appreciate that very much.

It's a custom CMS, we sell a service but it's something like an e-commerce.

Thank you all for your assistance.

So if I have this duplicate page:
namepage.php?214318&sid=r2342323242342
(namepage.php itself is also a "useless" duplicate page)
If I remove namepage.php from the index
and I block the "sid" parameter,

will that resolve all the duplication for this page?

I've read that in some cases it's better to block some URLs than to use noindex, because the crawler will keep re-checking noindexed pages; if you block the pages, you can save some "crawl budget". Is this true? How can I know when it's better to use noindex and when the best approach is to remove the page from the index?
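For context on the two mechanisms being compared here (a general illustration, not advice specific to this site): a noindex directive requires the page to be crawled so the crawler can see it, after which the page is dropped from the index; a robots.txt Disallow stops crawling entirely, but an already-indexed URL can remain in the index. The two therefore don't combine well: a page blocked in robots.txt can never show its noindex tag to the crawler.

```
<!-- noindex: placed in the page's <head>; the page must remain crawlable -->
<meta name="robots" content="noindex">
```

The equivalent robots.txt block (Disallow: /namepage.php) saves crawl budget but cannot remove a URL that is already indexed.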

markovald

11:29 pm on Sep 12, 2015 (gmt 0)

10+ Year Member



Please, I need your feedback to improve my website. :-)

It's very important to me! :-)

Thank you!

not2easy

11:53 pm on Sep 12, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The sid URL changes with each session; you don't want those "pages" indexed, because namepage.php is the page that should show in the search results, not namepage with someone's session ID attached. Don't noindex the page itself; just add the "&sid=" parameter and set it to crawl No URLs. This will avoid duplicate pages with different URLs and prevent the useless crawling.

Robert Charlton

8:20 pm on Sep 13, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



This is an area outside of what I generally handle, but it's background you should have.

You need to avoid displaying session IDs on public pages in the first place. Beyond creating duplicate content, your CMS may be set up in a way that's not secure, as publicly visible session URLs could be copied by others. The degree of importance here depends, of course, on what information you're storing.

A frequently employed approach is to use cookies, which are unique to a user's browser, for general tracking and as an "exchange mechanism" for session IDs, and to limit session IDs in URLs to areas where users need to log in.
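As an illustration of that approach, and assuming a PHP back end (the example URLs end in .php, but this is an assumption about the custom CMS), PHP's standard session settings can force cookie-only sessions and stop the engine from ever appending a sid to URLs:

```
; php.ini (or ini_set() calls before session_start())
session.use_cookies = 1        ; store the session ID in a cookie
session.use_only_cookies = 1   ; never accept a session ID passed in the URL
session.use_trans_sid = 0      ; never rewrite links to append the sid
```

With use_trans_sid off, the ?sid=... variants simply stop being generated, which addresses the duplication at its source rather than after indexing.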

Log-in measures and "authentication" routines are necessary to keep private information private. Various authentication mechanisms are used to do this, and I'm really not qualified to explain how to set these up.

The following reference, though, might provide some general perspective and helpful background on the security implications. As noted below in a section I'll quote, problems generally occur in custom CMS systems, for the reasons described...

Top 10 2013-A2-Broken Authentication and Session Management
Open Web Application Security Project
[owasp.org...]

Developers frequently build custom authentication and session management schemes, but building these correctly is hard. As a result, these custom schemes frequently have flaws in areas such as logout, password management, timeouts, remember me, secret question, account update, etc. Finding such flaws can sometimes be difficult, as each implementation is unique.

So, part of what you need to consider is what kind of info you have been storing in your sessions.