| 10:10 pm on Apr 18, 2012 (gmt 0)|
You mean how it ended up in Google's index? There are several possible scenarios: the bot saw a link like <a href="www.example.com/scripts/siteUtil.js?somenumber"> and followed it, there is some broken HTML somewhere, or the bot saw the string and thought it was a URL.
Since you can see the code, it should be easy to find out which file uses it.
The js code looks like it returns a string of a start and end year. Maybe it tries to create a copyright date range of the document to display it somewhere?
| 10:18 pm on Apr 18, 2012 (gmt 0)|
|You mean how it ended up in Google's index? |
Yes. That is one thing that I am trying to figure out.
Could this be related to Adsense?
| 10:45 pm on Apr 18, 2012 (gmt 0)|
Remember the old rule about child-proofing?
If a child can see it, the child can reach it.
If a child can reach it, the child can touch it.
If a child can touch it, the child can hold it.
If a child can hold it, the child can break it (or put it in its mouth, or use it to destroy your home, or ... et cetera, depending on what "it" is).
Change a few words and you've got Today's Google and URLs.
| 11:57 pm on Apr 18, 2012 (gmt 0)|
If you want to avoid such URLs appearing in search results, I would recommend you look at the (sometimes complex!) subjects of canonicalisation and robots exclusion.
| 1:30 pm on Aug 17, 2012 (gmt 0)|
|The js code looks like it returns a string of a start and end year. Maybe it tries to create a copyright date range of the document to display it somewhere? |
Can you tell me where the js code might be trying to display the copyright date range? Is it somewhere on the site?
Or is it Google that is trying to do this and display it somewhere in the SERP?
|You mean how it ended up in Google's index? |
Also, can urls such as www.example.com/scripts/siteUtil.js?somenumber showing up in Google's index have an impact on rankings?
|If you want to avoid such URLs appearing in search results, I would recommend you look at the (sometimes complex!) subjects of canonicalisation and robots exclusion. |
I don't have access to the root host file so I don't think that I can do anything with robots exclusion. Is there anything else that I can do?
Canonicalisation, I have heard that this is something that is used to set the preferred domain (www.example.tld instead of example.tld), but is this something that I can use for this situation as well? And would I need access to the root host file to do this?
| 5:25 pm on Aug 17, 2012 (gmt 0)|
If you can't create a new file at the root - then you are at a big disadvantage. However, if you can upload a new text file so that it has the address http://www.example.com/robots.txt then you are set. Anyone responsible for a website should have this kind of access. If you don't then I think you should ask for it!
Canonical issues come up any time a server file can be accessed with more than one URL - and any difference in the URL at all makes a second URL. That can be spelling, capitalization, order of variables, extra characters - on and on. The "canonical" URL is the exact character string that you intended to be indexed.
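To make that concrete, here is a small sketch in JavaScript. The variant URLs and the `toCanonical` helper are made up for illustration, and it assumes the intended canonical form is the www host with no query string; the rule for your own site may well differ.

```javascript
// Hypothetical URL variants that a server might answer with the same file.
const variants = [
  "http://www.example.com/page.html",
  "http://example.com/page.html",
  "http://www.example.com/page.html?",
  "http://www.example.com/page.html?a=1&b=2",
  "http://www.example.com/page.html?b=2&a=1",
];

// As plain character strings, every variant is a different URL.
const distinctBefore = new Set(variants).size;

// Assumed canonical rule (illustration only): fixed host, no query, no fragment.
function toCanonical(rawUrl, canonicalHost) {
  const url = new URL(rawUrl);
  url.hostname = canonicalHost;
  url.search = "";
  url.hash = "";
  return url.toString();
}

const canonical = new Set(variants.map((v) => toCanonical(v, "www.example.com")));

console.log(distinctBefore); // how many distinct strings we started with
console.log([...canonical]); // what they collapse to under the assumed rule
```

The point of the sketch is only that a search engine starts from the left-hand view (many strings) unless the site tells it which single string is meant.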
There's a thread on many of those possibilities in our Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page. See Canonical URL Issues - including some new ones [webmasterworld.com].
| 3:02 am on Aug 19, 2012 (gmt 0)|
Thanks for the response and the recommendation.
One thing that I wanted to ask: if I am able to access my robots.txt file, what would I write in it so that these urls don't appear in Google's index?
Also, I am not sure how to interpret the fact that these urls are appearing in Google's index when I do a site: operator search but they are not appearing in Bing's index?
The function code that I am seeing on these urls appears to be some sort of calendar script. Where on my site are these pages located?
A reason that I feel it is not Adsense is I don't think that a calendar script is related to adsense, but at the same time, I don't know what it could be.
Could the presence of these urls in Google's index affect the site's rankings?
This is an area that I am not very familiar with, and I would really appreciate your thoughts.
| 1:43 pm on Aug 19, 2012 (gmt 0)|
It is not related to Adsense. In the robots.txt file, you would want to disallow the /scripts directory; I do not see any reason to let that directory get indexed.
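As a rough sketch of what that could look like: the two robots.txt lines held in the string below are the usual way to write such a rule, and the small JavaScript helpers (hypothetical and heavily simplified, handling only plain prefix rules) show which paths the rule would cover. The number 12345 is just a placeholder for "somenumber".

```javascript
// Contents one might place at http://www.example.com/robots.txt
const robotsTxt = [
  "User-agent: *",
  "Disallow: /scripts/",
].join("\n");

// Simplified illustration: collect Disallow prefixes, ignoring user-agent
// groups, Allow lines and wildcards, which real parsers also handle.
function disallowedPrefixes(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^disallow:/i.test(line))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((prefix) => prefix !== "");
}

// A path counts as blocked when it starts with any disallowed prefix.
function isBlocked(path, text) {
  return disallowedPrefixes(text).some((prefix) => path.startsWith(prefix));
}

console.log(isBlocked("/scripts/siteUtil.js?12345", robotsTxt));
console.log(isBlocked("/index.html", robotsTxt));
```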
| 3:07 pm on Aug 19, 2012 (gmt 0)|
Could the presence of these URLs in Google's index spread the page rank of the site and impact rankings?
So instead of all of the site's page rank going to the content pages, it is also going to these URLs and having an impact on the terms that the site's content pages rank for.
I am also wondering if these URLs could be related to the website's external stylesheets? The reason I ask is I looked at the site logs and I saw URLs such as:
and somenumber are the same numbers that I am seeing in the URLs in Google's index.
I would appreciate if you can help me to make sense of all this.
I am trying to figure out how it all fits together.
| 11:25 pm on Aug 19, 2012 (gmt 0)|
the question mark (?) in a url is followed by a query string which the server passes to the requested resource.
however these urls are non-canonical for these resources and these requests should be externally redirected to the canonical url (i.e. with no query string appended)
how you do this depends on your server configuration.
|Could the presence of these URLs in Google's index spread the page rank of the site and impact rankings? |
these urls can't pass page rank and you probably aren't linking to them with an anchor tag.
the only search impact they might have on your site is wasted crawl budget.
|In the robots.txt file, you would want to disallow the /scripts directory; I do not see any reason to let that directory get indexed. |
disallowing a url pattern in robots.txt will not prevent a url from being discovered and indexed without being crawled.
| 2:11 am on Aug 20, 2012 (gmt 0)|
Thanks for the explanation.
If I am seeing http://www.example.tld/color_2.css?somenumber for example, should it be redirected to
(1) http://www.example.tld/color_2.css? (2) http://www.example.tld/color_2.css or (3) http://www.example.tld
Which of the above is the canonical URL? If it is something else, I would appreciate if you can tell me.
I don't have access to my root host file. Is there something that I can do to externally redirect the URLs that I am seeing to the canonical url or is access to the root host file necessary to do this? If access to the root host file is necessary and I don't have it, is there another way that I can accomplish this?
I am also wondering, for several years on this site, I did not have URLs such as
appearing in the SERP. What could be causing them to appear in the SERP now?
If I have some idea of what is causing this, then maybe I can focus on that area(s).
I really appreciate your help with this.
| 5:43 am on Aug 20, 2012 (gmt 0)|
http://www.example.tld/color_2.css?somenumber should be redirected (with a 301 status code) to http://www.example.tld/color_2.css
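in practice this would normally be a rewrite rule on the server rather than javascript, but the decision itself can be sketched like this. the function name `redirectFor` and the rule "css and js files never take a query string" are assumptions for illustration, and 12345 stands in for somenumber.

```javascript
// Hypothetical rule: static .css and .js files should never carry a query string.
function redirectFor(requestUrl) {
  const url = new URL(requestUrl);
  const isStaticAsset = /\.(css|js)$/i.test(url.pathname);
  // Look for "?" before any fragment, so a bare trailing "?" also counts.
  const hasQuery = requestUrl.split("#")[0].includes("?");
  if (!isStaticAsset || !hasQuery) {
    return null; // nothing to fix, serve the request normally
  }
  url.search = ""; // drop the query string, including a bare "?"
  return { status: 301, location: url.toString() };
}

console.log(redirectFor("http://www.example.tld/color_2.css?12345"));
console.log(redirectFor("http://www.example.tld/color_2.css"));
```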
|I don't have access to my root host file. |
i'm not sure what that means technically.
do you mean you don't have access to the server configuration file or the document root directory or ...?
what type of server is it?
|What could be causing them to appear in the SERP now? |
most likely it was this series of events:
- google discovered the url
- googlebot requested the url from your server
- your server responded to that request with a 200 OK status code
- the url was indexed
| 2:31 am on Aug 21, 2012 (gmt 0)|
|do you mean you don't have access to the server configuration file or the document root directory or ...? |
I don't have access to the document root directory (I believe that is the same as the server configuration file). I can't upload a .htaccess file.
|what type of server is it? |
It is a linux platform.
With the way things are, is there a way for me to externally redirect the non-canonical urls to the canonical url?
I would definitely like to make good use of my crawl budget.
I appreciate your help.
| 4:59 am on Aug 21, 2012 (gmt 0)|
linux is an operating system, not a server.
i'll assume you meant it's an apache server.
|I don't have access to the document root directory (I believe that is the same as the server configuration file). |
it had better not be the same!
i'm not sure there's anything you can do without the intervention of your hosting service.
| 12:05 pm on Aug 21, 2012 (gmt 0)|
Thanks for the information about the server.
|it had better not be the same! |
The template is CSS enabled so I think that this is where the CSS files come from.
Also, can you tell me the difference between a document root directory and the server configuration file?
I apologize for the questions, but I think that this will help me to figure out if I can do something about the urls that I am seeing in the SERP.
| 5:28 pm on Aug 25, 2012 (gmt 0)|
I observed something, but I am not sure what it means. I would appreciate if you guys can give me your opinion.
I did a site:operator search recently and did not see the urls such as www.example.com/scripts/siteUtil.js?somenumber in the SERP.
The next day, when I performed a site:operator search, I saw those urls in the SERP.
A couple of days later, I was looking at the cache versions of the site's pages and I saw that they were around the same time as when I did the site:operator search and did not see the urls such as www.example.com/scripts/siteUtil.js?somenumber in the SERP.
Does this mean something? Can you help to interpret this?
| 7:48 am on Aug 27, 2012 (gmt 0)|
Go to the very last page of the site: listings and click on the "show omitted results" link. Google sometimes hides the duplicates, sometimes not.
Why are css and js files in the search results anyway?
| 9:31 am on Aug 27, 2012 (gmt 0)|
|Why are css and js files in the search results anyway? |
If you have a small enough site-- in filecount, not traffic-- you can get the entire site to show up in Unwanted Smiley Search simply by searching for something like the letter "e" and constraining it to your domain.
| 1:31 pm on Aug 27, 2012 (gmt 0)|
|Go to the very last page of the site: listings and click on the "show omitted results" link. Google sometimes hides the duplicates, sometimes not. |
When I do this, I see urls such as www.example.com/scripts/siteUtil.js?somenumber
somenumber is different for all the urls but if I click on the link, I think that I am seeing what I have in my opening post for all of them.
|Why are css and js files in the search results anyway? |
That is something that I am trying to understand :)
Can you tell me why this might be happening?
Also, where could these files be coming from?
I am not sure if this helps, but for several years, when I performed a site:operator search, I did not see this. Now I do.
| 4:41 pm on Aug 27, 2012 (gmt 0)|
It sounds as if you have very little control over your site content, both from the production side (css or js files that you don't know about, parameters that you can't explain) and the upload side (no access to the physical site directory, if I'm reading right).
It's tricky without naming names, but I for one would understand better if you explained how your CMS works and what your hosting setup is. Or, if that sentence was Hungarian to you: How do you make the website? How do you change it?
| 2:18 am on Aug 28, 2012 (gmt 0)|
The website is made with a WYSIWYG editor. Changes are made by going to the particular page that you want to make a change to and adding text, etc. (I know that you know how WYSIWYG editors work, but I just thought that I would mention it.) It is shared hosting.
The WYSIWYG editor has good features and is user-friendly, but I don't have root access. By not having root access, I don't think that I can put a robots.txt or .htaccess file in the site's directory.
On the site, I believe that I can add meta tags to the site's pages.
Is there anything that I can do about the urls appearing in the SERP? Are there some functions and features that I should see if I have in the hosting setup and/or site content that might help me?
| 4:12 am on Aug 28, 2012 (gmt 0)|
do you have any type of web-based control panel access to manage your server?
something like cpanel or plesk?
| 5:20 am on Aug 28, 2012 (gmt 0)|
When you say "don't have root access", do you mean that you can't FTP into the site (or SFTP or SSH or something web-based or, or, or), look at everything that's there, and move/upload/delete files? Is this your own domain or are you piggybacking on someone else's? Or is it one of those package deals where the site gives you the software and it uploads itself?
Oh, wait. Didn't you start out saying it was your own domain? That would tend to exclude WordPress-type things where, as far as I know, they take you by the hand and do everything for you.
This may seem like an awful lot of questions, but it's hard to give concrete advice without knowing the exact physical setup. I know I'd be very unpleasantly surprised if a site:mydomain search turned up anything like
function getCopyrightDate(iStartYear, iRangeSize, separatorString)
var date = new Date();
// if no start year is passed in, then use the current year
if (iStartYear == null || iStartYear == 0 || iStartYear == "")
iStartYear = date.getFullYear();
et cetera because I know very well I haven't done anything like that myself. And the only thing in my domain that I didn't personally make is piwik, which is strictly off limits to robots. ("Strictly" = htaccess 403, no futzing about with robots.txt)
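For anyone wondering what a function with that opening might go on to do, here is one guess at a completion. Only the signature and the first check come from the excerpt; the meaning of iRangeSize (taken here as the number of years to span), the default separator and the return format are all assumptions, and the function is renamed so it is not mistaken for the real one.

```javascript
// Hypothetical completion; only the signature and the first check come from
// the excerpt above. Everything after that check is a guess.
function getCopyrightDateSketch(iStartYear, iRangeSize, separatorString) {
  const date = new Date();
  // if no start year is passed in, then use the current year
  if (iStartYear == null || iStartYear == 0 || iStartYear == "") {
    iStartYear = date.getFullYear();
  }
  const startYear = Number(iStartYear);
  const rangeSize = Number(iRangeSize) || 0; // assumed: years to span
  const separator = separatorString || "-";  // assumed default
  if (rangeSize === 0) {
    return String(startYear); // a single year, no range
  }
  return startYear + separator + (startYear + rangeSize);
}

console.log(getCopyrightDateSketch(2005, 7, "-"));
```

Something like this is typically written into a page footer by the template, which would explain a script file that the site owner never created by hand.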
| 3:38 pm on Aug 29, 2012 (gmt 0)|
I am figuring out the answers to the questions that you asked, but I wanted to ask a couple of questions that I think might help to figure out why this is happening. They expand, to some extent, on some of the questions that I asked in earlier posts.
The first thing is that the urls such as www.example.com/scripts/siteUtil.js?somenumber are only appearing in Google's index. They are not appearing in Bing's index. What could be the reason for this? Are there different ways to crawl a site and maybe the search engines are using different methods? I am not sure if this is possible, but if it is, can you mention what these methods could be?
Another thing is that for several years, I did not see urls such as www.example.com/scripts/siteUtil.js?somenumber in Google's index. How did they start being discovered after several years? Could there have been some kind of change in crawling method? (This probably relates to the question that I asked in the previous paragraph.)
One more thing that might help to figure out what is happening: the number of these urls that I am seeing in Google's index has increased over the past couple of months. When I perform a site:operator search, I have gradually seen more of them. They did not all appear in the index at the same time.
| 4:19 pm on Aug 29, 2012 (gmt 0)|
their respective crawlers discover urls in different places and parse documents in different ways.
and they will obviously go new places and use new methods over time.
as long as those urls aren't referred to by anything on your domain i would focus on providing the proper redirect or error response.
| 5:44 pm on Aug 29, 2012 (gmt 0)|
|as long as those urls aren't referred to by anything on your domain i would focus on providing the proper redirect or error response. |
Does "as long as those urls aren't referred to by anything on your domain" mean as long as I myself am not creating links to these urls on the site, the way that I would link text from a paragraph on the homepage to an inner page, for example?
| 8:31 pm on Aug 29, 2012 (gmt 0)|
Yes. But don't overlook invisible links. To sum up scattered remarks from earlier in this thread:
So if the robot is exceptionally stupid and/or in an unusually bad mood, you've suddenly got a pile of Duplicate Content for stuff it was never meant to see in the first place.
* Hasty edit here to remove actual name of program ;)
| 2:59 am on Aug 30, 2012 (gmt 0)|
|But that invisible link such as <a href = "http://www.example.com/trackername/trackername.php">* is visible to any passing robot, because they're reading the raw html, not the surface of the page. |
Is the raw html what you see in the source code? Or is it something else, something that a robot can see when crawling a page but isn't visible in the source code?
Also, can these invisible links be something that a robot did not pick up for years and then started to pick up? Or do you think that there might have been some type of change made to the hosting settings and/or template of the site that caused these urls to get indexed?
| 5:40 am on Aug 30, 2012 (gmt 0)|
Raw HTML = source code. Yes. (Edit: Well, sort of. If you've got php or even simple Includes, "raw html" can mean different things depending on context. But basically, search engines see what you see if you're online and select View Source or equivalent.)
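A quick way to see the difference between the surface of the page and the source is to pull the URLs out of the markup yourself. This is only a sketch: the page source below is invented, it assumes purely for illustration that the template references its script and stylesheet with a number appended, and real crawlers use full HTML parsers rather than a regex.

```javascript
// Hypothetical page source; the file names and the number are placeholders.
const rawHtml = `
<html><head>
<link rel="stylesheet" href="/color_2.css?12345">
<script src="/scripts/siteUtil.js?12345"></script>
</head><body>
<p>Visible text with <a href="/about.html">one visible link</a>.</p>
</body></html>`;

// Rough illustration: pull every href/src value out of the markup and
// resolve it against the page address, as a robot reading the source could.
function extractUrls(html, baseUrl) {
  const pattern = /\b(?:href|src)\s*=\s*["']([^"']+)["']/gi;
  return [...html.matchAll(pattern)].map((m) => new URL(m[1], baseUrl).toString());
}

const found = extractUrls(rawHtml, "http://www.example.com/");
console.log(found);
```

A visitor sees one link on that page; the source contains three URLs, two of them with query strings attached.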