Forum Moderators: Robert Charlton & goodroi
I have somewhere around 500-600 legitimate pages. Google reports about 600, but when I start browsing the pagination it drops to about 450, so those are the only URLs I can inspect. A session ID doesn't appear in any of those 450, but what about the other 150?
Yahoo Site Explorer reports about 1200 pages of which about half are duplicated with a session ID. Those duplicated URLs must be very old because I'm sure the site is properly suppressing session IDs at this point.
How would you handle this?
If so, I'd say you have no concern at all. There are no "hidden" URLs affecting you; it's just an artifact of the complex way Google has to collect page-count information from data that is sharded in many different ways across their proprietary database format.
My site suppresses all session IDs from appearing in the URL if a robot user agent is detected, but I'm concerned that some may have slipped through anyway.
If detecting based on UA and not IP, I would think you are subject to a few issues.
Yahoo Site Explorer reports about 1200 pages of which about half are duplicated with a session ID.
I'd be concerned about those.
Those duplicated URLs must be very old because I'm sure the site is properly suppressing session IDs at this point.
Was there a point where things were not working properly? Take a sampling of those URIs with the indexed IDs and see if you can find references to them.
If detecting based on UA and not IP, I would think you are subject to a few issues.
So you think a high-profile robot like Yahoo or Google will sometimes spider a page without their usual UA?
Was there a point where things were not working properly? Take a sampling of those URIs with the indexed IDs and see if you can find references to them.
I've been in business for years and I wasn't always filtering session IDs. Do you mean I should see if Yahoo is currently trying to access those URIs?
So you think a high-profile robot like Yahoo or Google will sometimes spider a page without their usual UA?
No, not their "normal" bots. I tend to think they have "other bots" that perform "other functions" and some may not be easily identified, even by IP.
Do you mean I should see if Yahoo is currently trying to access those URIs?
I'd want to find out why they are, yes. I'd be performing advanced searches to see where and if there are any hard-coded references to those session IDs. Maybe not from your own site, but from someone else's?
I would investigate the list of indexed URLs and copy down the session-ID-string of all those indexed session IDs. If the number of indexed strings is fairly small (perhaps a few dozen, or so) I would set up a 301 redirect for any requests for URLs with those session IDs included.
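As a sketch of that approach, a mod_rewrite rule along these lines could match a short list of known indexed session IDs. The parameter name "session" and the IDs "abc123"/"def456" are placeholders, not taken from the thread:

```apache
RewriteEngine on
# Placeholder session IDs: substitute the actual strings copied
# down from the index, separated by "|"
RewriteCond %{QUERY_STRING} (^|&)session=(abc123|def456)(&|$)
# Redirect to the same path with the query string stripped (the trailing "?")
RewriteRule (.*) /$1? [R=301,L]
```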
Rather than detect bots and suppress session IDs for them, I would suppress session IDs for all users who have not logged in to the site. Bots can't log in, therefore they never see session IDs. This doesn't rely on any user-agent, IP, or referrer data and is much more reliable.
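A minimal sketch of that login-based approach, in Python for illustration (the function and parameter names are my own, not from the thread):

```python
# Sketch: only logged-in users ever get a session ID in the URL.
# Bots can't log in, so they always receive clean URLs.
def add_session_to_url(path, session_id, logged_in):
    """Append the session parameter only for authenticated users."""
    if not logged_in:
        return path
    return f"{path}?session={session_id}"

print(add_session_to_url("/catalog", "abc123", logged_in=False))  # /catalog
print(add_session_to_url("/catalog", "abc123", logged_in=True))   # /catalog?session=abc123
```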
If the number of indexed strings is fairly small (perhaps a few dozen, or so) I would set up a 301 redirect for any requests for URLs with those session IDs included.
Actually, none of those hundreds of session IDs should match. Which of these would you do:
1. 301 them all
2. set up Yahoo's "Dynamic URL Rewriting"
3. set up robots.txt to filter the session IDs
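For option 3, a robots.txt rule along these lines is one possibility, using the wildcard syntax that Googlebot and Slurp support as an extension to the original robots.txt standard (the parameter name "session" is a placeholder):

```
User-agent: *
Disallow: /*?session=
Disallow: /*&session=
```

One caveat: robots.txt blocks crawling but does not by itself remove URLs that are already indexed, so already-indexed session ID URLs may linger as URL-only references.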
Rather than detect bots and suppress session IDs for them, I would suppress session IDs for all users who have not logged in to the site.
The point of including the session ID in the URL for non-robots is to prevent a new session from being created upon each access from a user who doesn't accept cookies. Otherwise, that user can't use the site and the server has to work hard creating all those sessions. This is a really tricky balancing act that I've struggled with quite a bit.
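One common way to handle that balancing act, sketched here under assumed names rather than the poster's actual implementation, is to set a test cookie on the first hit and fall back to URL-based session IDs only for clients that fail to return it:

```python
# Sketch: decide whether to embed the session ID in URLs based on
# whether the client echoed back a previously set test cookie.
# All names here are illustrative.
def needs_url_session(request_cookies, is_first_visit):
    """URL-based sessions only for repeat visitors who returned no cookie."""
    if is_first_visit:
        # First hit: we haven't had a chance to test cookies yet,
        # so keep URLs clean and set a test cookie in the response.
        return False
    # Repeat hit with no cookie echoed back: client rejects cookies.
    return "test_cookie" not in request_cookies

print(needs_url_session({}, is_first_visit=True))                     # False
print(needs_url_session({"test_cookie": "1"}, is_first_visit=False))  # False
print(needs_url_session({}, is_first_visit=False))                    # True
```

Under this scheme, crawlers (which start every fetch as a "first visit" and never echo cookies back within a session) still see clean URLs on the initial hit, which is when indexing matters most.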
edit:
Is there a possibility Google has some session ID URLs in its index like Yahoo does, even though they do not show up with site:www.example.com?
[edited by: Tonearm at 5:34 pm (utc) on Aug. 27, 2007]
(... where "session" is whatever you use as the parameter name).
I recommend using the same detection method you are currently using.
IOW: If you use IP, use the REMOTE_ADDR code and if you are using user agent, use the HTTP_USER_AGENT code.
RewriteEngine on
# IP.NUM.NUM.NUM is a placeholder: substitute the bot's actual IP address
RewriteCond %{REMOTE_ADDR} ^IP\.NUM\.NUM\.NUM$
# Only act on requests that carry a query string
RewriteCond %{THE_REQUEST} \?
# Redirect to the same path with the query string stripped (the trailing "?")
RewriteRule (.*) /$1? [R=301,L]
OR (for user agent based)
RewriteEngine on
# "robotstring" is a placeholder: substitute the bot's actual user-agent string
RewriteCond %{HTTP_USER_AGENT} ^robotstring$
# Only act on requests that carry a query string
RewriteCond %{THE_REQUEST} \?
# Redirect to the same path with the query string stripped (the trailing "?")
RewriteRule (.*) /$1? [R=301,L]
If you need query strings preserved for some URLs requested by bots, you can match your session variable specifically in THE_REQUEST instead:
If 'sessionID' is the first variable:
RewriteCond %{THE_REQUEST} \?sessionID=
OR
If 'sessionID' is not the first variable, match the first variable's name, followed by '=', followed by one or more characters that are not an '&', followed by an '&', followed by 'sessionID':
RewriteCond %{THE_REQUEST} \?var=[^&]+&sessionID=
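For illustration, here are the same two patterns checked with Python's re module against sample request lines ('sessionID' and 'var' are the placeholder names used above, not real parameter names):

```python
import re

# Mirror the two RewriteCond patterns so you can check which
# THE_REQUEST lines they would match.
first_var = re.compile(r"\?sessionID=")
second_var = re.compile(r"\?var=[^&]+&sessionID=")

req_a = "GET /page.php?sessionID=abc123 HTTP/1.1"
req_b = "GET /page.php?var=x&sessionID=abc123 HTTP/1.1"
req_c = "GET /page.php?var=x HTTP/1.1"

print(bool(first_var.search(req_a)))   # True: sessionID is the first variable
print(bool(second_var.search(req_b)))  # True: sessionID follows another variable
print(bool(first_var.search(req_c)) or bool(second_var.search(req_c)))  # False
```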
See the Apache Forum [webmasterworld.com] for more details on setting this up correctly.
Justin
Isn't that what I did above?
I'm not sure because we are examplifying. I'd take the full URI string, everything, put it between quotes and perform the searches. What you will typically find are references to those strings in pages where they may not be linked. Or, they were linked at one point and are now no longer linked. I've seen all sorts of stuff going on when doing "quoted searches" with URI strings.
"www.example.com/?session="
That example above should include the full Session ID, all the way to the end of the string: not just part of it, but the entire string.
"www.example.com/?session=abc123"
as opposed to just:
"www.example.com/?session="
Is that right? Are you saying I should take the URLs Yahoo has which include a session ID and see if Google has them too? Then I can use that information to track down a link? Is that the idea?
Basically we are trying to find where and if there are any references to those URIs with the Session IDs. One sure way to do this is searching for those strings in quotes.
It's very difficult to determine what your particular issues are, as you state Yahoo! reports 1200 links and half contain a Session ID. They had to come from somewhere. And if they are still there and have "fresh dates", they are still being served. Now, where the heck are they being served from? A scraper site somewhere?
[edited by: pageoneresults at 6:43 pm (utc) on Aug. 28, 2007]
Google definitely doesn't have a session ID URL in a site: search or a link: search. I checked those manually too. Yahoo sure is loaded up with them though. I guess they must be from old crawls or "UA-disguised" crawls.
DUDE! Don't do a site: or link: search. Do the quoted string searches as I've specified above. The other two advanced searches are not going to give you what we are looking for, which are references to those URI strings whether they are linked or not.
The link: command is not going to show you all of the links out there that Google knows about; it has been that way for years. Yahoo!'s Site Explorer will give you a larger number of results when performing those advanced searches, and so will Live. Google throttled that back years ago.
If you perform a quoted string search in Google and cannot find references to those then you're probably okay there. Now, where is Yahoo! getting them from? Check the cache dates.
What does this search [google.com...] give you?
(... where "session" is whatever you use as the parameter name).
Are you saying I should do a quoted Google search for the 600 or so URIs Yahoo has which include a session ID? Where do I find the Yahoo cache date? The "Yahoo" portion of the cache display is obscured because I use CSS absolute positioning for source-ordered content. I can't find any such date by viewing the source code of the cache either.
g1smd,
I'm still getting nothing after de-examplifying that search, so I guess that's a good sign.
Are you saying I should do a quoted Google search for the 600 or so URIs Yahoo has which include a session ID?
No, just a sampling. I'd take about 10 of them, perform the quoted string searches, and see if and where the references to those URIs are. One of the first things you'll see when doing those "quoted string searches" is most likely bolded references to the string sitting on a page somewhere. In many instances, it's a scraper or some other bottom-feeding creature.
I've even seen instances, mind you they were very rare, where someone hard-coded a Session ID by mistake: a user of a WYSIWYG program did a cut and paste from a browser session into one of their Geocities pages. It happens.
Go to Copyscape and see if any of your content is sitting out there in the "muck".
So you think a high-profile robot like Yahoo or Google will sometimes spider a page without their usual UA?
If you do not find those Session IDs in Google or Live, or any other search engine besides Yahoo!, then that would lead me to believe there was/is a technical error in your UA detection that served Session IDs to Slurp.