
Sitemaps, Meta Data, and robots.txt Forum

    
Blocked URLs by robots.txt in Google Webmaster Tools
How to fix it?
Evan_Rachel




msg:4621571
 2:50 pm on Nov 6, 2013 (gmt 0)

I am having a problem in Google Webmaster Tools > Crawl > Blocked URLs: it reports 58 blocked URLs. How do I solve it?

My robots.txt file:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: http://www.example.com/sitemap.xml.gz

Anyone?

Thanks

 

lucy24




msg:4621680
 10:15 pm on Nov 6, 2013 (gmt 0)

What's the problem?

Some of the things wmt tells you are not errors. It's just information. The same thing goes for reports of 404/410 errors. "Yes, thank you, Google, I know. I did it on purpose."

If those are the actual names of your blocked directories, I might wonder how google even knows they exist, let alone that there are 58 pages involved. Are there publicly accessible pages in there, with links from the rest of the site?

So long as search engines are crawling all the parts of your site that you want them to crawl, I don't see the problem.

Evan_Rachel




msg:4621754
 7:23 am on Nov 7, 2013 (gmt 0)

robots.txt blocked 58 of my URLs and they are showing in Webmaster Tools. How do I get rid of them?

Yes, they are publicly accessible pages.

lucy24




msg:4621768
 9:48 am on Nov 7, 2013 (gmt 0)

I don't understand what you want to get rid of. Is robots.txt blocking pages you didn't want blocked? Edit the file, and changes should kick in within 24 hours.

Or do you not want gwt to mention that it has found 58 separate links to pages it isn't allowed to crawl? You can't do anything about that. It's simply giving you information; that's its job. And if someone can figure out how to change the gwt settings so it doesn't always show the "blocked by robots.txt" row by default, I'd like to hear about it, because I've never managed to change it permanently. This area isn't like the "crawl errors" tab where you can click "fixed" and the report goes away. It's permanent.

In some cases you might find it more appropriate to remove a robots.txt block and replace it with a meta noindex header. But if you're talking about wp admin files, frankly the whole directory is none of google's business. (Or, of course, any other law-abiding robot.)
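
(For anyone following along, here is roughly what that looks like. The file name is only a placeholder, and the .htaccess lines assume Apache with mod_headers enabled.)

<meta name="robots" content="noindex">

# or, for non-HTML files, send the noindex as an HTTP header from .htaccess:
<Files "example.pdf">
Header set X-Robots-Tag "noindex"
</Files>

Either way, the URL must not be blocked in robots.txt, or the crawler will never get to see the noindex.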

Someone who speaks WordPress may want to weigh in here and explain what kinds of files customarily live in the two named directories. /includes/ sounds like the kind of thing that no visitor-- whether human or robotic-- would even know about. Unless, urk, there's been an error so your raw code is visible to any passing search engine. And /wp-admin/ is one of those directories that people are generally advised to rename, just to thwart the malign robots that keep looking for it. (I've never used WP in my life but still get bombarded by robots asking for the 20 most common variants of the name.)

Evan_Rachel




msg:4622508
 12:36 pm on Nov 11, 2013 (gmt 0)

Thanks for your reply

Yes, exactly: robots.txt is blocking pages I didn't want blocked, and I didn't give it permission to block them. Google automatically blocked those URLs, and the total comes to 58.

I also uploaded a simple robots.txt file but can't overcome those 58 blocked URLs.

aakk9999




msg:4622509
 12:43 pm on Nov 11, 2013 (gmt 0)

Yes, exactly: robots.txt is blocking pages I didn't want blocked

You control robots.txt - Google just follows the directive in there.

Basically, what your robots.txt says to googlebot (and other bots) is:

Do not crawl any URL whose path starts with /wp-admin/
Do not crawl any URL whose path starts with /wp-includes/

Therefore, Google did not automatically block these URLs. Google followed the instructions given in robots.txt which you control.
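
For example (these are placeholder URLs, not taken from your site), with the robots.txt you posted:

http://www.example.com/wp-admin/options.php      - blocked (path starts with /wp-admin/)
http://www.example.com/wp-includes/js/jquery.js  - blocked (path starts with /wp-includes/)
http://www.example.com/some-post/                - not blocked

So the 58 blocked URLs reported are most likely just URLs Google has found that fall under those two Disallow rules.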

To be honest, I am not sure why you would want to allow Google to crawl these pages anyway, especially /wp-admin/.

As Lucy said, when Google Webmaster Tools reports that there are nn URLs blocked by robots.txt, it is for your information only. It does not necessarily mean anything is wrong.

lucy24




msg:4622515
 1:04 pm on Nov 11, 2013 (gmt 0)

I also uploaded a simple robots.txt file but can't overcome those 58 blocked URLs.

Do you mean that you changed robots.txt and the pages are still listed as blocked? googlebot may go up to 24 hours without picking up robots.txt, and then it may take still longer for it to try crawling the pages. It depends on how anxious it has been to see them ;)

You might try "Fetch as googlebot" on some selected page in the previously excluded group. Sometimes this prompts it to re-check robots.txt right away.

I didn't give it permission to block them

I am not sure what you mean by this. Can you say the same thing again in different words?

Evan_Rachel




msg:4623587
 6:54 pm on Nov 15, 2013 (gmt 0)

Do you mean that you changed robots.txt and the pages are still listed as blocked?


Yes, the pages are still listed! I changed the robots.txt file a week ago but still cannot fix the issue.

Evan_Rachel




msg:4623589
 7:05 pm on Nov 15, 2013 (gmt 0)

@aakk9999

To be honest, I am not sure why you would want to allow Google to crawl these pages anyway, especially /wp-admin/.

I am using a simple robots.txt file, but it failed:


User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: http://www.example.com/sitemap.xml.gz

I uploaded the above file and am still facing the same issue; I also deleted the robots.txt file, but the same issue remains.

lucy24




msg:4623625
 9:26 pm on Nov 15, 2013 (gmt 0)

#1 Are there publicly accessible pages located inside /wp-admin/ and/or /wp-includes/ ? Can't you relocate them? By default, /wp-admin/ is supposed to be for pages that only you use. I don't know about /wp-includes/ but in a directory with that kind of name I'd expect to find, well, include files. These have no independent existence as pages and nobody should even know their names. robots.txt won't prevent an include from being included, though a carelessly worded 403 or 410 might do it. Ask how I know.

#2 Is google avoiding pages located in other directories? This is the part you test via "fetch as googlebot". And then we need to figure out why it thinks it isn't allowed to crawl.

#3 Is google crawling any pages anywhere on your site?

not2easy




msg:4623660
 12:22 am on Nov 16, 2013 (gmt 0)

I am using a simple robots.txt file, but it failed:


User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: http://www.example.com/sitemap.xml.gz


It looks like the addition of a Disallow: line might address the issue if that's all there is in the file. There is nothing allowed, only disallowed.

User-agent: *
Disallow:
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: http://www.example.com/sitemap.xml.gz


This lets Googlebot know that only those two commonly disallowed directories are disallowed.

aakk9999




msg:4623662
 12:34 am on Nov 16, 2013 (gmt 0)

It looks like the addition of a Disallow: line might address the issue if that's all there is in the file. There is nothing allowed, only disallowed.

@not2easy, it does not need to have a blank Disallow: line. By default, everything is allowed unless it is disallowed in robots.txt.

@Evan_Rachel
Once you changed robots.txt, how long did you wait before checking SERPs?

Do you have a Google Webmaster Tools account? If you do, you can go to the Robots section, where you can see the latest robots.txt that Google has obtained; it is shown there with whatever rules Google has read from it. Note that once you change robots.txt, it can take at least a day for Google to read it, and then another day for Google to show what it has read in Webmaster Tools.

Once you have verified in Webmaster Tools that Google has the robots.txt you want, you can use the boxes below it to test whether a URL is blocked: enter the full URL and click the test button.

If the result is that the URL is blocked, you will need to wait some time for Google to drop these URLs from the index; this may take a few days or a few weeks.

Or, if you want a URL allowed, that can also take some time, and if these are not important pages (and hence not crawled often) the wait may even be months.

It is also possible that Google does not drop blocked URLs from the index and instead still displays your page in the search results, but with the description "A description for this result is not available because of this site's robots.txt".

But to return to your opening question: the blocked URLs reported in Webmaster Tools are for your information only. They are not errors. They would only be errors if you wanted Google to index those pages, but WordPress pages in these folders should not need indexing. So you can safely ignore this message in Webmaster Tools.

Evan_Rachel




msg:4623737
 12:59 pm on Nov 16, 2013 (gmt 0)

Thanks for your replies.

This can happen when we switch to another server and add a site as an addon domain: it changes the root directory of the site. For example, my site's root directory used to be:

/home/mysitename/public_html

At that time I had one site on one hosting account with one domain.

The current site root directory is:

/home/main site/public_html/mysitename

Now I have two sites hosted on a single hosting account: the main site was hosted first and mysitename was added afterwards.

I think this is the reason, because my XML sitemap plugin gives the same error even when I generate the XML sitemap manually.

aakk9999




msg:4623753
 1:19 pm on Nov 16, 2013 (gmt 0)

Reading your latest post, I am not sure what the problem is, exactly.

How your directories are organised on the server is irrelevant; what is important is what you see when you request robots.txt or sitemap.xml, which should be in the domain root of each of your domains.

What do you have when you request the following for either of your domains?

www.example.com/robots.txt
www.example.com/sitemap.xml

And what is the actual problem you are trying to solve?

[edited by: aakk9999 at 1:20 pm (utc) on Nov 16, 2013]

Evan_Rachel




msg:4623754
 1:19 pm on Nov 16, 2013 (gmt 0)

I also included these commands in the robots.txt file:

User-agent: *
Allow: /

# Google Image
#User-agent: Googlebot-Image
#Allow: /*
Allow: /wp-content/uploads/

# Google AdSense
User-agent: Mediapartners-Google*
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /*

User-agent: Googlebot
Disallow: /*?
Disallow: /*newwindow=true
Disallow: /*dur=124
Disallow: /*dur=0
Disallow: /*replytocom
Disallow: /*refresh=1

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Allow: /z/
Allow: /about/
Allow: /wp-content/
Allow: /tag/
Allow: /category/
Allow: /manual/*
Allow: /docs/*
Allow: /*.html$
Allow: /*.php$
Allow: /*.js$
Allow: /*.inc$
Allow: /*.css$
Allow: /*.gz$
Allow: /*.cgi$
Allow: /*.wmv$
Allow: /*.cgi$
Allow: /*.xhtml$
Allow: /*.php*
Allow: /*.gif$
Allow: /*.jpg$
Allow: /*.png$
Allow: /*.jpeg$

Sitemap: http://mysiteurl/sitemap.xml.gz


It was OK when I added these lines, but the very next day Google blocked 29 more URLs.

I am also using the Yoast SEO plugin. In the Taxonomies tab I ticked "Meta Robots: noindex, follow" for Categories, Tags and Format. Is that fine?

aakk9999




msg:4623755
 1:36 pm on Nov 16, 2013 (gmt 0)

I think your robots.txt is a bit of a mess.

Googlebot will only follow the following section in your robots.txt:

User-agent: Googlebot
Disallow: /*?
Disallow: /*newwindow=true
Disallow: /*dur=124
Disallow: /*dur=0
Disallow: /*replytocom
Disallow: /*refresh=1


So for Googlebot, the blocked URLs are:
- anything with a query string (any URL containing ?)
- the remaining patterns are all query-string parameters, so they are already covered by the first line
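
To illustrate with a placeholder URL:

Disallow: /*?
# this already blocks e.g. http://www.example.com/some-post/?replytocom=123
# so in practice the separate Disallow: /*replytocom pattern adds nothing

Redundant patterns are not harmful, but they make the file harder to maintain.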

You have User-agent: * specified twice. Those two sections should be merged into one.

Allow: /
This is not standard. Instead of this, you should have:
Disallow:

(i.e. the word Disallow: with nothing after it)

This does nothing for Googlebot-Image, as the User-agent line is commented out:
# Google Image
#User-agent: Googlebot-Image
#Allow: /*
Allow: /wp-content/uploads/

But if you removed the comment marker from the User-agent line, everything would be allowed for Googlebot-Image. You do not need to allow something explicitly for it to be allowed, unless you disallow everything else first (but note: the Allow directive is an extension that Google supports; not every crawler does). For example, this will allow the content of /wp-content/uploads/ only:

# Google Image
User-agent: Googlebot-Image
Disallow: /
Allow: /wp-content/uploads/

If you want to allow everything for Googlebot-Image, you only need:

User-agent: Googlebot-Image
Disallow:

Have you read The Web Robots Pages [robotstxt.org], which explain how to use robots.txt?

aakk9999




msg:4623756
 1:54 pm on Nov 16, 2013 (gmt 0)

Here is your robots.txt with duplicates/unnecessary lines removed, which results in the same exclusions you had, and comments telling you what happens:


# Allow Google Image bot on entire site
User-agent: Googlebot-Image
Disallow:

# Allow Google AdSense bot on entire site
User-agent: Mediapartners-Google
Disallow:

# Allow Google Adwords bot on entire site
User-agent: Adsbot-Google
Disallow:

# Allow Googlebot for mobile on entire site
User-agent: Googlebot-Mobile
Disallow:

# Googlebot - Disallow all URLs with query string parameters, allow the rest
User-agent: Googlebot
Disallow: /*?

# Any other bot not listed above - disallow wp-admin and wp-includes directories
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: http://mysiteurl/sitemap.xml.gz


Evan_Rachel




msg:4623784
 3:05 pm on Nov 16, 2013 (gmt 0)

@aakk9999

Reading your latest post, I am not sure what the problem is, exactly.

How your directories are organised on the server is irrelevant; what is important is what you see when you request robots.txt or sitemap.xml, which should be in the domain root of each of your domains.

When I moved my site to the other hosting account, the home root directory changed a little bit, and on the very first day Google blocked 58 URLs.

I was thinking that could be the issue.

Evan_Rachel




msg:4623785
 3:11 pm on Nov 16, 2013 (gmt 0)


# Allow Google Image bot on entire site
User-agent: Googlebot-Image
Disallow:


The above code means that the Google image bot is allowed to crawl the website's images, right?

If we don't write the above command, will Google not crawl my website's images?

I want Google to crawl my website's images and index them in Google Images.

aakk9999




msg:4623809
 5:53 pm on Nov 16, 2013 (gmt 0)

Lines starting with # are comments.

The line:
Disallow:
means: disallow nothing; in other words, it allows everything.

So the quoted code above will let the Google image bot crawl all your images.
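
For completeness, since the two forms are easy to confuse (this is standard robots.txt behaviour, nothing specific to your site):

User-agent: Googlebot-Image
Disallow:
# blank value: nothing is disallowed - all images may be crawled

User-agent: Googlebot-Image
Disallow: /
# a single slash: the whole site is disallowed for the image bot

And if there were no Googlebot-Image section at all, the image bot would simply fall back to the User-agent: * rules, so your images would still be crawlable as long as they are not inside a disallowed directory.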
