Forum Moderators: Robert Charlton & goodroi


When is this URL-only listing thing going to end!?


internetheaven

10:23 pm on Feb 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My users' admin areas are blocked in the robots.txt file. Of course, several of the user pages are linked to from the website, and the users enter their own personal login pages manually ... obviously using a Google Toolbar too, because Googlebot keeps trying to reach them.

Now, there are more than 500 URL-only listings under my site:example.com search on Google, all admin-area URLs that are blocked in robots.txt.

It's the same story for several of my other sites: URL-only listings (mostly affiliate links) keep popping up in my regular search results on Google too.

How hard is it for Google engineers to add a little bit of code that says "robots.txt blocked = don't list it in the search results"?

Honestly, you programmers out there: after more than a decade, wouldn't you have worked out a way to sort this out?

TheMadScientist

12:27 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How about unblocking the URLs in robots.txt and just using a bit of UserAgent detection to serve Googlebot a custom 403 Forbidden page with a noindex robots meta tag on it for the URLs you don't want in the index?

They'll disappear after the next spidering and won't get spidered anywhere near as much... That's my experience, anyway.
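A minimal sketch of that approach, assuming a Python front-end where you can branch on the request before serving the page (the `/admin/` prefix and the substring check on the UserAgent are illustrative placeholders, not details from this thread, and UA sniffing alone is not bulletproof):

```python
def response_for(user_agent: str, path: str):
    """Sketch: serve Googlebot a 403 page carrying a noindex meta tag
    for admin URLs, and the normal page to everyone else.
    ADMIN_PREFIX and the 'googlebot' check are illustrative only."""
    ADMIN_PREFIX = "/admin/"
    if path.startswith(ADMIN_PREFIX) and "googlebot" in user_agent.lower():
        body = ('<html><head>'
                '<meta name="robots" content="noindex">'
                '</head><body>403 Forbidden</body></html>')
        return 403, body
    return 200, "<html><body>normal page</body></html>"
```

The point is that Googlebot gets a crawlable response that explicitly says "don't index this," instead of the silence a robots.txt block produces.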

How hard is it for Google engineers to add a little bit of code that says "robots.txt blocked = don't list it in the search results"?


That's not what blocking in robots.txt means. It means "don't access the page," so Google has no way of knowing whether it's a content-rich page people expect to find when they search; if the inbound links indicate it's the correct result, they show it. Noindex means "don't show it in the results." Robots.txt means "don't access it." They're two totally different things IMO.

steveb

12:45 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Use nofollow on the links.

internetheaven

7:29 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Use nofollow on the links.


Already done. Like I said, most of these links don't exist on the site. The Google Toolbar must be collecting them when my users type them in or click them from admin emails.

internetheaven

7:33 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



serve a custom 403 Forbidden page


Someone suggested that the last time I complained about this. After I made the suggested change, GWT showed 200+ URLs as unreachable/error and my rankings tanked - even vanishing in some places.

After a year and two re-inclusion requests, I still haven't got my rankings back.

NoIndex means don't show it in the results. Robots.txt means don't access it. They're two totally different things IMO.


Yeah, but I'm not sure I get your point. Are you saying that "don't access it" means "but please index the URL and list it in the results"? Are you one of those guys who thinks "no" means "yes"? ;)

tangor

7:45 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Having your cake and eating it too is dang difficult.

Don't want it to appear? Forbid it in the most stringent terms possible (.htaccess, for example... nofollow, nocache elsewise) or live with it. Not that many choices. Worse, if your kiddies provide links beyond your control, you can expect visits from all the bots; that's what they do.

Robots.txt does one thing... and it is not the thing you want to do. Which leaves the other thing, and maybe you don't want to do that either...

I'm having a chuckle, old friend, because we are all caught in the same confusing problem. I will say that I deal with it severely, i.e.: if I don't want the bots, they are forbidden in both robots.txt AND .htaccess. I sleep better at night that way. Just my solution for my particular problem. YMMV.
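That belt-and-braces setup might look something like this in the Apache 2.2-era syntax current at the time (a sketch only; the `/admin/` directory and the IP address are placeholders, not details from this thread):

```apache
# robots.txt: ask well-behaved bots to stay out (advisory only)
#   User-agent: *
#   Disallow: /admin/

# .htaccess inside /admin/: actually refuse everyone except yourself.
# 203.0.113.10 is a placeholder for your own IP address.
Order Deny,Allow
Deny from all
Allow from 203.0.113.10
```

robots.txt keeps polite crawlers away; the .htaccess rule enforces it for everything else.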

TheMadScientist

8:17 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, but I'm not sure I get your point. Are you saying that "don't access it" means "but please index the URL and list it in the results"?


...so they have no way of knowing whether it's a content-rich page people expect to find when they search; if the inbound links indicate it's the correct result, they show it.

Robert Charlton

8:20 am on Feb 14, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



My emphasis added...
My users' admin areas are blocked in the robots.txt file. Of course, several of the user pages are linked to from the website, and the users enter their own personal login pages manually ... obviously using a Google Toolbar too, because Googlebot keeps trying to reach them.


If the references are publicly available, Google is likely to index them. I'd remove the robots.txt block and put a meta robots noindex tag in the head sections of the pages you don't want referenced:

<meta name="robots" content="noindex, nofollow">

Removing the robots.txt block allows Googlebot access to the page so it can read the robots meta tag.

Chances are it's your own links rather than the Google Toolbar that are getting the pages indexed. For more possibilities, see this discussion....

Why is Google indexing my entire web server?
http://www.webmasterworld.com/google/3396393.htm

Note that the link in the thread to the Google FAQ page that discusses the "secret server" is again broken, but the quoted section of the Google FAQ gives you the basic info.

Also note that it's been reported that Bing apparently doesn't obey the meta robots noindex tag, and it's likely that password protection is the best solution.

wilderness

1:14 pm on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also note that it's been reported that Bing apparently doesn't obey the meta robots noindex tag, and it's likely that password protection is the best solution.


MSN/Bing, like Google, has its own specific meta tag requirements; however, if you include the required tags on each page you care about, it will conform.

rainborick

4:25 pm on Feb 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I was about to post the same advice that Robert Charlton suggested, but it occurred to me that there might be an advantage in this case to omitting the "nofollow" instruction initially. In this situation, you want Googlebot to see those "noindex" <meta> tags ASAP, and by allowing the links to be crawled, you might well speed up the process. I'd also suggest submitting a few of the critical URLs through the AddURL page, since these pages are probably low-priority for crawling within your site as it stands. If you're on an Apache-based server, you can use the

<FilesMatch "(appropriate regexp)">
Header set X-Robots-Tag "noindex"
</FilesMatch>

instruction to apply these changes globally with an .htaccess file in your admin directory, both before and after the pages are de-indexed (assuming there are no existing robots <meta> tags on these pages). Then, once they're out, you can add the "nofollow".

Robert Charlton

3:53 am on Feb 15, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



...but it occurred to me that there might be an advantage in this case to omit the "nofollow" instruction initially. I was just thinking that in this situation, you want Googlebot to see those "noindex" <meta> tags ASAP, and by allowing the links to be crawled, you might well speed up the process....


rainborick - Makes sense. You definitely want to encourage Googlebot to crawl those pages and see the meta robots "noindex" tags.

I suggested the "nofollow" content attribute only because I assume that these admin pages aren't going to be linking to anything that you'd be wanting to accrue PageRank, but of course that's later, not now.