Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Can't stop googlebot crawling specific forum
persons




msg:4622189
 12:33 am on Nov 9, 2013 (gmt 0)

I'm using the latest version of Invision Power Board (IPB).

I've tried everything in the book to get Googlebot to stop indexing and crawling a part of my forum, but it won't, and this is causing a dupe content penalty for my site. Google is picking up every new thread I make in that forum, which is what I don't want. I want users to be able to see the content, so I can't set it to guests only. I need to block it from Googlebot only. My setup is like this:



Wordpress site / Forum Here



www.mysite.com/forums/



I have mod_rewrite enabled, so a topic URL would look like this:



[mysite.com...]



My robots.txt is like this:

Disallow: /forums/forum/17-cool-section

Disallow: /forum/17-cool-section

Disallow: /forums/index.php?/forum/17-cool-section

Disallow: /forums/forum/17-cool-section/

Disallow: /forum/17-cool-section/

Disallow: /forums/index.php?/forum/17-cool-section/



I've tried all the combinations and it still doesn't work. I also unchecked the XML sitemap for these specific forums in IPB, but Google still manages to index any new topics I create in that forum.



Any help would be appreciated, thanks. By the way, I'm running the latest IPB.

 

lucy24




msg:4622196
 2:14 am on Nov 9, 2013 (gmt 0)

indexing and crawling

There's your first problem. You have two mutually exclusive choices:
(a) block page in robots.txt, which will prevent search engines from knowing the page content, but does not keep the page out of the index if, for example, someone links to it using some unusual phrase that could come up in searches
(b) allow robot to crawl, but say "noindex" either in the page itself (meta robots) or as part of the response header

There is currently no way to say "Neither crawl nor index" short of hitting the robot with a 403-- which seems extreme. Though it can be a useful backup if you edit your robots.txt and accidentally remove the block on one subdirectory. Ask me how I know.

Is google currently both crawling and indexing? Your logs will tell you if it's crawling. A site: search in google will tell you if the pages are indexed.
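
For the crawling half of that question, here is a rough Python sketch of a log check. It assumes an Apache-style combined log format. The sample lines, the IP addresses and the function name googlebot_hits are made up for illustration. A user-agent string can be spoofed, so treat a match as a hint rather than proof.

```python
import re

# Assumption: combined log format, i.e. the request line in quotes, then the
# status code, and the user agent as the last quoted field on the line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines, prefix):
    """Return (path, status) for requests under `prefix` whose UA mentions Googlebot."""
    hits = []
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue  # line is not in the assumed format
        if "Googlebot" in match.group("agent") and match.group("path").startswith(prefix):
            hits.append((match.group("path"), int(match.group("status"))))
    return hits


# Hypothetical sample lines, not taken from any real log.
SAMPLE = [
    '192.0.2.10 - - [09/Nov/2013:00:33:00 +0000] "GET /forums/forum/17-cool-section/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.20 - - [09/Nov/2013:00:34:00 +0000] "GET /forums/forum/17-cool-section/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"',
]

if __name__ == "__main__":
    for path, status in googlebot_hits(SAMPLE, "/forums/forum/17-cool-section"):
        print(status, path)
```

If nothing comes back for the blocked path over a few days of real logs, the robots.txt block is probably being honored, and any remaining listings are a matter of indexing rather than crawling.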

my robots is like this

I hope you don't mean literally like that. Are the blank lines an artifact of whatever cut-and-paste you used to create the post? If not, you're in trouble, because blank lines have semantic meaning in robots.txt
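
To see why the blank lines could matter, here is a small sketch using Python's standard-library robots.txt parser. This is not Googlebot's own parser, which may be more forgiving, and the "User-agent: Googlebot" line is an assumption because the post shows only the Disallow lines. The paths are the ones quoted in this thread.

```python
from urllib.robotparser import RobotFileParser


def can_googlebot_fetch(robots_txt: str, url: str) -> bool:
    """Parse robots.txt text and ask whether Googlebot may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# One record, no blank lines: both Disallow rules belong to Googlebot.
COMPACT = """\
User-agent: Googlebot
Disallow: /forums/forum/17-cool-section
Disallow: /forums/forum/20-my-thread
"""

# Same rules separated by a blank line: this parser treats the blank line as
# the end of the record, so the second Disallow is not attached to any agent.
SPACED = """\
User-agent: Googlebot
Disallow: /forums/forum/17-cool-section

Disallow: /forums/forum/20-my-thread
"""

URL = "http://www.mysite.com/forums/forum/20-my-thread/"

if __name__ == "__main__":
    print("compact file allows fetch:", can_googlebot_fetch(COMPACT, URL))
    print("spaced file allows fetch:", can_googlebot_fetch(SPACED, URL))
```

With this parser the compact file blocks the URL and the spaced file lets it through. Note also that Disallow is a plain prefix match on the path, so a rule only covers URLs that actually start with it.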

phranque




msg:4622201
 3:02 am on Nov 9, 2013 (gmt 0)

welcome to WebmasterWorld, persons!


it's often helpful to use "fetch as googlebot" in GWT to see how googlebot sees your content.

as lucy24 mentioned, you must allow crawling so googlebot sees your noindex signal or 403 Forbidden response.

how are your forum urls indexed in the SERP when you do a site:example.com/forum search?
if you are excluding googlebot from crawling you will see:
A description for this result is not available because of this site's robots.txt learn more [support.google.com].

persons




msg:4622203
 5:28 am on Nov 9, 2013 (gmt 0)

thanks for the reply guys

I'm trying to block Google from indexing certain forums on my site, and I do see the robots.txt notice listed, as shown below:

"A description for this result is not available because of this site's robots.txt learn more"


But what it does is block out the forum itself, while Google still indexes every new post I make within that forum.

For example, if I were to paste a URL like this

"http://site.com/forums/forum/20-my-thread/"

I would see the actual listing on Google. I'd prefer that Google not crawl or index it at all, as it's wreaking havoc on my SERPs, because I post news on WordPress and copy it onto my forums.

I'm not sure if it could be a mod_rewrite issue or something, but no matter what robots.txt rules I put in, Google seems to be indexing the posts.

persons




msg:4622204
 5:30 am on Nov 9, 2013 (gmt 0)

^lucy, it's not blank lines, just an example =)

lucy24




msg:4622206
 6:26 am on Nov 9, 2013 (gmt 0)

i do see the robots listed as shown below
"A description for this result is not available because of this site's robots.txt learn more"
<snip>
I'm not sure if it could be a mod_rewrite issue or something, but no matter what robots.txt rules I put in, Google seems to be indexing the posts.

You can EITHER block crawling OR you can block indexing. You cannot do both. (Do not be unhappy. It took me at least a year to wrap my brain around this concept.)

If it is most important to block indexing, you have to permit crawling. Give each page a meta that says
<meta name="robots" content="noindex">
(If I misspelled something there, phranque will come along and fix it.)
If you are already using some kind of CMS-- which it sure seems as if you are-- there is probably some very simple change you can make so this happens automatically everywhere. A mouse click here, a plugin there. Don't look at me.

If you can't get the pages to behave as desired, Option B is to put a small htaccess file in the directory where all your forums live. You can't have <Directory> sections in htaccess, so you have to create a separate htaccess file and put it in the appropriate directory. It would say

:: shuffling papers ::

Header set X-Robots-Tag "noindex"


Just the one line.
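
To confirm that either signal is actually being served, here is a minimal Python sketch that looks for "noindex" in an X-Robots-Tag header or a robots meta tag. It works on headers and HTML you have already fetched. The function name has_noindex is hypothetical, and the sketch ignores per-crawler prefixes such as "googlebot: noindex" in the header, so it is a simplification.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() in ("robots", "googlebot"):
            content = attributes.get("content", "")
            self.directives.extend(part.strip().lower() for part in content.split(","))


def has_noindex(headers: dict, html: str) -> bool:
    """True if the response header or the page markup carries a noindex directive."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [part.strip().lower() for part in header_value.split(",")]

    meta_parser = MetaRobotsParser()
    meta_parser.feed(html)
    return "noindex" in header_directives or "noindex" in meta_parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex({}, page))
    print(has_noindex({"X-Robots-Tag": "noindex"}, "<html></html>"))
```

Either signal only works if the robot is allowed to fetch the page, so the robots.txt block for those URLs would have to come out first.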

i would see the actual listing on google

Do you mean an index entry for the new URL, or the content of the page?

From first post:
this is causing dupe content penalty to my site

If the googlebot can't crawl, how does google know there is duplicate content? Do you factually know that you're being penalized, or are you just getting a bad feeling?

I'm not sure that "penalty" and "duplicate content" really belong in the same sentence anyway. That is, I don't think the algorithm says "OK, this identical content occurs in three different URLs on the site, so we'll drop each one 50 spots from where it would otherwise appear."

netmeg or someone like her would know, but I don't think she hangs out in this subforum.

persons




msg:4622211
 7:14 am on Nov 9, 2013 (gmt 0)

Hi lucy, thanks for the reply. I've figured out a solution with an old plugin that blocks bots, so problem solved *yay*. I knew I was getting a dupe penalty because I was testing it out for weeks, setting the forum to members only for a while and then opening it to everyone. My rankings dropped out of Google and came back when I set it to members only. All in all, I think my solution for now is to use this plugin I found. Thank you again, and thanks to phranque too, for the feedback =)

phranque




msg:4622212
 7:19 am on Nov 9, 2013 (gmt 0)

an old plugin that blocks bot

technically speaking, what does the plugin do?

persons




msg:4622227
 10:27 am on Nov 9, 2013 (gmt 0)

this is the bot [community.invisionpower.com...]

What it does is set most robot crawlers as a usergroup, allowing me to make a separate usergroup for bots, which lets me restrict bot access to specific forums.

martinibuster




msg:4622297
 4:44 am on Nov 10, 2013 (gmt 0)

what it does is set most robot crawlers as a usergroup


When I started reading this thread I had it in mind to suggest you use the bot group to solve your problem. Glad to hear you discovered it! It's a great feature that's also a part of phpBB3. You can then use PHP to keep bots from doing things like crawling certain links. You can write one of those IF/ELSE expressions that checks whether the visitor is a bot and then, for example, choose not to show bots X-link or the REPORT-A-Post button.

Just be careful not to use it to do so-called "clever tricks" like trying to "shape" PageRank.

The best use is to stop the bots from dynamically generating duplicates or hundreds of thousands of empty pages, that kind of thing.
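
The IF/ELSE idea above, sketched in Python rather than PHP to show the shape of the check. This is not IPB or phpBB code. The token list, the link labels and the function names are all made up for illustration, and a real board would rely on its own bot group instead of a bare user-agent match.

```python
# Hypothetical substrings that identify crawlers in a User-Agent header.
BOT_TOKENS = ("googlebot", "bingbot", "slurp")


def is_bot(user_agent: str) -> bool:
    """Very rough bot check based on the User-Agent string alone."""
    lowered = user_agent.lower()
    return any(token in lowered for token in BOT_TOKENS)


def topic_links(user_agent: str) -> list:
    """Return the action links to render under a post for this visitor."""
    links = ["Reply", "Report a Post"]
    if is_bot(user_agent):
        # Bots get no report link, so they never crawl the report form URLs.
        links = [link for link in links if link != "Report a Post"]
    return links


if __name__ == "__main__":
    print(topic_links("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
    print(topic_links("Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"))
```

The page content stays the same for everyone here; only utility links that would spawn duplicate or empty pages are withheld from crawlers.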

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved