Forum Moderators: Robert Charlton & goodroi


Googlebot is not indexing my wordpress website (Googlebot error message)


tihami

4:22 am on Oct 27, 2014 (gmt 0)



Hi,
My site is not being indexed by Google. I got this message in Webmaster Tools:

http://www.example.com/: Googlebot can't access your site (October 24, 2014)

Over the last 24 hours, Googlebot encountered 24 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%.

My website is built in WordPress and I am using the Yoast WordPress SEO plugin v1.5.5.3.

my current robots.txt is
User-agent: Google
Disallow:
User-agent: *
Disallow: /cgi-bin/
Sitemap: [example.com...]
User-agent: rogerbot
Crawl-delay: 3


(The rogerbot crawl-delay is there to throttle any over-active crawling by that bot on the server.)

I also checked robots.txt in Webmaster Tools and it shows as allowed, but when I check the site with "Fetch as Google" it cannot access it.
I also converted the site from HTTP to HTTPS.

During this issue I also checked these queries:
site:example.com inurl:https shows 190 pages
site:example.com inurl:http shows 80 pages

Can anyone help resolve the Googlebot indexing problem?

Thanks

[edited by: aakk9999 at 1:14 pm (utc) on Oct 27, 2014]
[edit reason] Replaced with example.com to avoid auto-linking [/edit]

aakk9999

1:54 pm on Oct 27, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Firstly, your robots.txt does not seem correct. Google's user agent for web search should be "Googlebot" and not "Google".

Secondly, you should leave a blank line between each User Agent directive.

And lastly, the catch-all user agent * should go at the end.
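Putting those three points together, a corrected version of the posted file would look something like this (the Sitemap URL below is a placeholder, since the original was elided):

```
User-agent: Googlebot
Disallow:

User-agent: rogerbot
Crawl-delay: 3

User-agent: *
Disallow: /cgi-bin/

# Placeholder URL - substitute your real sitemap location
Sitemap: http://www.example.com/sitemap.xml
```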

Google crawlers
https://support.google.com/webmasters/answer/1061943?hl=en [support.google.com]

Have you tried Google's robots.txt tester?
https://support.google.com/webmasters/answer/6062598 [support.google.com]

This also may help with the correct syntax:
About /robots.txt
http://www.robotstxt.org/robotstxt.html [robotstxt.org]

The second question is - when you used "Fetch as googlebot" and you got the error, what does the error say? Is there a link you can click on to tell you whether it is blocked by robots, or cannot access robots.txt or gets 404 or whatever else?

What exactly is the response when you execute "Fetch as googlebot" ?
Can you try to fetch not just your domain root but also your domain robots.txt and see what results you get.

I would also look at the server logs to see whether the request reached the server and what the server is responding with. I have seen a case where robots.txt responded with HTTP 500 (server error) and Google dropped all pages out of its index as a result.
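That log check can be sketched in a few lines of Python. This assumes a combined-format access log; the log path in the comment is an assumption, so adjust it for your host:

```python
# Sketch: tally the HTTP status codes your server returned for robots.txt
# requests, scanned out of a combined-format access log.
import re
from collections import Counter

# Matches e.g. '"GET /robots.txt HTTP/1.1" 500' and captures the status code.
ROBOTS_REQ = re.compile(r'"(?:GET|HEAD) /robots\.txt[^"]*" (\d{3})')

def robots_status_counts(lines):
    """Count status codes served for robots.txt requests in log lines."""
    counts = Counter()
    for line in lines:
        m = ROBOTS_REQ.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Example with one sample log line; on a real server you would pass
# open("/var/log/apache2/access.log") instead (path is an assumption).
sample = ['66.249.66.1 - - [27/Oct/2014:04:00:01 +0000] '
          '"GET /robots.txt HTTP/1.1" 500 530 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"']
print(robots_status_counts(sample))  # Counter({'500': 1})
```

A run of 500s (or anything other than 200/404) against robots.txt lines like the one above is exactly the failure mode described here.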

lucy24

4:00 pm on Oct 27, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Over the last 24 hours, Googlebot encountered 24 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file

This means that although the robots.txt may have errors-- as detailed above-- it isn't the source of the problem, because Google couldn't get to robots.txt in the first place. So you need to look at your logs and see where all those googlebot requests went. You can also check with an add-on such as Firefox's Live HTTP Headers. This may not work in a second-hand request like "Fetch as googlebot" (or any other site that allows UA spoofing), but FF itself lets you send a made-up UA. It probably doesn't have to be exactly correct; just make up something containing the element "Googlebot".

Google's user agent for web search should be "Googlebot" and not "Google".

I once asked about this, and phranque or someone like him pointed to a section in the robots.txt spec that said the "User-Agent" line should be interpreted as broadly as possible. (Does this mean that if you say "User-Agent: e" then any robot whose UA string contains the letter "e" will consider itself bound by this rule? Naah, probably not.)

You said it's a WordPress site, right? And this is a brand-new site? Check your settings to make sure you've turned off any prefs for "Private" or "Development" or the like.

not2easy

4:25 pm on Oct 27, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If your sitemap is at
Sitemap: [example.com...] might your robots.txt be at https://www.example.com/robots.txt rather than at http://www.example.com/ as shown in your question? If the site is set to serve all content from https: instead of http:, you will want to do an address change in GWT and verify the "new" site so they will look for your files at the correct URL, especially if proper 301 redirects are in place.

anim8tr

5:13 pm on Oct 27, 2014 (gmt 0)

10+ Year Member



You might also want to doublecheck your Wordpress settings (Options -> Privacy) and see if the option to block search bots is enabled. It should be set to allow search bots to crawl your site.

There are similar bot-blocking settings in various plugins that do this on a page-by-page basis as well. Yoast is one of those.

aakk9999

6:17 pm on Oct 27, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This means that although the robots.txt may have errors-- as detailed above-- it isn't the source of the problem, because Google couldn't get to robots.txt in the first place.


This was one of my points. Last year I was investigating a site that suddenly got deindexed. I noticed that a request for the (non-existent) robots.txt was returning HTTP 500.

In the process of firstly fixing anything that showed any kind of error, I uploaded robots.txt that only had:

User-agent: *
Disallow:

This was just to ensure HTTP 200 is returned to robots.txt requests. Nothing else was done - and the site got back to index within a week. Go figure!

phranque

5:34 am on Oct 28, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



according to the original robots exclusion policy...
http://www.robotstxt.org/orig.html [robotstxt.org]:
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.


according to the latest draft standard...
http://www.robotstxt.org/norobots-rfc.txt [robotstxt.org]:
3.2.1 The User-agent line

Name tokens are used to allow robots to identify themselves via a simple product token. Name tokens should be short and to the point. The name token a robot chooses for itself should be sent as part of the HTTP User-agent header, and must be well documented.

These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User- Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited

The name comparisons are case-insensitive.


i haven't tested this recently but "User-agent: google" should match any user agent string that contains "+http://www.google.com/bot.html", for example.

Google crawlers - Webmaster Tools Help:
http://support.google.com/webmasters/answer/1061943?hl=en [support.google.com]
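As a quick sanity check of that substring rule: Python's standard-library robots.txt parser (urllib.robotparser) happens to implement User-agent matching as a case-insensitive substring test, so a record for "google" also binds "Googlebot". A minimal sketch, noting that real crawlers may implement matching differently:

```python
# Sketch: urllib.robotparser applies a record to any robot whose name token
# contains the User-agent value as a case-insensitive substring, per the
# draft standard quoted above. (Actual crawler behaviour may differ.)
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: google
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# "google" is a substring of "googlebot", so the record applies:
print(rp.can_fetch("Googlebot", "http://www.example.com/private/page"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/page"))          # True
# No record matches other bots and there is no catch-all, so access is open:
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/private/page"))  # True
```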

Clay_More

6:01 am on Oct 28, 2014 (gmt 0)

10+ Year Member



Kind of interesting:

I did not fully or partially understand the O.P.'s issue. Followup pedantic posts made the issue even more clouded in my mind.

Props to anim8tr, Occam's Razor defined.

phranque

7:09 am on Oct 28, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



assuming it's a wordpress/plugin problem, that would be awesome if anim8tr by chance posted the correct solution.

otherwise you have to open up the toolbox and rtfm. =8)

anim8tr

11:13 am on Oct 28, 2014 (gmt 0)

10+ Year Member



phranque, sorry about that. The Wordpress settings are located in Wordpress Admin under Settings -> Reading. On that page there is a checkbox for "Search Engine Visibility".

Checking this box will discourage search engines from indexing the site. A lot of people check this box when developing their site and then they forget to uncheck it when the site goes live.

netmeg

12:28 pm on Oct 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All it does is add a disallow all to robots.txt. But yea, people forget it's there. Certain SEO plugins will publish a banner at the top of the admin to remind you.
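For reference, with that box ticked a stock WordPress install serves a virtual robots.txt along these lines (exact output can vary by version and plugins):

```
User-agent: *
Disallow: /
```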

aakk9999

4:34 pm on Oct 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I did not fully or partially understand the O.P.'s issue. Followup pedantic posts made the issue even more clouded in my mind.

I am not sure what is unclear in this OP statement:
Over the last 24 hours, Googlebot encountered 24 errors while attempting to access your robots.txt.

This says that OP's WMT reported that robots.txt is not accessible (for whatever reason). Note that it says "not accessible", it does not say "not found".

I would do the things in order - which is investigate why googlebot cannot read robots.txt. So start with making sure that robots.txt can be read by googlebot. Then move on to other issues - if they still remain there after this one is fixed.

I do not think a plugin can stop robots.txt being accessible. But an error in .htaccess could.



<added>
I would also read these threads:

Google won't crawl if robots.txt returns a 500 error
http://www.webmasterworld.com/google/3561281.htm [webmasterworld.com]

Google drops site after only 10 days of robots.txt returning HTTP 500
http://www.webmasterworld.com/google/4580855.htm [webmasterworld.com]

The best way to check if robots.txt is a problem is to use "Fetch as Googlebot" in WMT and fetch the home page and robots.txt file. If you get message "unreachable robots.txt" then this could be the problem even if robots.txt does not exist or never existed on the site - in which case go and check your response codes!

Also note that "Blocked URLs" option in WMT that "tests" the robots.txt is not a good way to test this particular case as it still reports home page as "Allowed".


I am not saying that this is the reason why OP's site is not indexed but the inability of Google to access robots.txt should be investigated first.

@tihami
Do you have access to your server logs? Could you check what is the response to googlebot when it requests robots.txt? Also, could you check if googlebot is requesting any other URLs from your site and what is the response?

When I had that problem I saw that googlebot was only ever requesting robots.txt whilst the request for the (non-existing) robots.txt was returning HTTP 500. Once the response was 200 OK, Googlebot started to crawl other pages (which were not blocked in robots.txt).

Clay_More

6:59 am on Nov 1, 2014 (gmt 0)

10+ Year Member



Personally (other people may want to do things differently) I serve robots.txt at:
http://example.com/robots.txt

Unlikely to run into issues with that.
OP states they are running WordPress along with at least one other plug-in. There are enough variables in the original post to where I believe there is no real possibility of an analytical solution. Apologies if feathers were ruffled.

Brownstownz

10:21 pm on Nov 3, 2014 (gmt 0)

10+ Year Member



If you haven't tried it already, you can also disable all of your plugins to see if one of them is creating a conflict that is causing the problem.