
Google AdSense Forum

    
[robots.txt] What is the correct user agent for the AdSense crawl?
Google AdSense crawl error - what is the right user agent?
bluemonster



 
Msg#: 4557036 posted 9:15 am on Mar 21, 2013 (gmt 0)

I happened to run into a Google AdSense crawl error and found a suggested fix.

It says that to let AdSense crawl your pages and show targeted ads, you should add these lines at the top of robots.txt:

User-agent: Mediapartners-Google
Disallow:


But my robots.txt file already has these lines:

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /


Can someone explain what these lines mean and what the difference is between the two?

Which is the right one? Do I need to make any changes, or can I just leave it as it is?

 

lucy24




 
Msg#: 4557036 posted 10:37 am on Mar 21, 2013 (gmt 0)

You don't need the Allow lines at all. They're only for identifying permitted areas inside of excluded areas -- and then only for robots, like googlebot, that recognize the word.

In robots.txt, simpler is better. If you don't say anything about the adbot by name, it will follow the same rules that apply to your generic robot:
User-Agent: *

If some areas of your site are blocked to most robots, but you want the adbot to go absolutely everywhere, give it a separate line that says
Disallow:

Just like that. Nothing after the word "disallow".
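A bare-bones sketch of how that fits together (the /private/ path here is just a placeholder for whatever you actually block on your own site):

# everyone else stays out of the blocked area
User-agent: *
Disallow: /private/

# the AdSense crawler: an empty Disallow means nothing is off limits
User-agent: Mediapartners-Google
Disallow: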

bluemonster



 
Msg#: 4557036 posted 5:35 pm on Mar 21, 2013 (gmt 0)

Now what should I do?

Can I replace the second set of rules with the first one? Or should I just leave it as it is?

phranque




 
Msg#: 4557036 posted 5:45 pm on Mar 21, 2013 (gmt 0)

do you have any other directives in your robots.txt?

bluemonster



 
Msg#: 4557036 posted 5:53 pm on Mar 21, 2013 (gmt 0)

This is my full robots.txt file:

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /go/
Disallow: /archives/
disallow: /*?*
Disallow: /wp-*
Disallow: /author
Disallow: /feed/
Disallow: /comments/feed/
Disallow: *?replytocom


User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

bluemonster



 
Msg#: 4557036 posted 5:54 pm on Mar 21, 2013 (gmt 0)

Now should I just leave it as it is, or should I change the last four lines (the ones related to AdSense)?

netmeg




 
Msg#: 4557036 posted 7:17 pm on Mar 21, 2013 (gmt 0)

I'd take out those last four lines altogether; you don't need them.

phranque




 
Msg#: 4557036 posted 5:55 am on Mar 22, 2013 (gmt 0)

disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom


although googlebot and others may allow exceptions, the robots exclusion protocol specifically does not support globbing or wildcarding, so don't expect those *s to work everywhere.


disallow: /*?*


although googlebot and others may allow exceptions, the robots exclusion protocol describes the directive as an upper-cased "Disallow:" and uses that exclusively in examples, so don't expect the lower-cased "disallow:" to work everywhere.


Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom


assuming the globbing/wildcarding support of googlebot and others, the first three lines quoted above are redundant given the fifth line (Disallow: /wp-*), and the last line is redundant given the fourth line (disallow: /*?*).
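so, if you're relying on that wildcard support anyway, the generic block could be trimmed to something like this (just a sketch -- keep or drop lines to match what you actually want excluded):

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /archives/
Disallow: /author
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /*?*
Disallow: /wp-*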


User-agent: *
Disallow: /wp-content/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/


"Allow:" is an extension to the robots exclusion protocol supported by google, so you're ok in this specific case.
however, the crawler is going to find the most specific user agent and respect the rules in that group.
therefore if you want any directives from the general rule (User-agent: *) to apply to a more specific rule (e.g. User-agent: Googlebot-Image) you will have to repeat all those rules within the more specific group.
for example if you want Googlebot-Image crawling /wp-content/uploads/ but nothing else in /wp-content/, you will need something more like this:

User-agent: Googlebot-Image
Disallow: /wp-content/
Allow: /wp-content/uploads/



User-agent: Mediapartners-Google*
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /


continuing with the behavior described above about respecting the most specific group, if you really want these 3 user agent strings to exclude nothing (i.e. crawl everything), you might as well follow general protocol and do this:

User-agent: Mediapartners-Google*
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Mobile
Disallow:




here's another thing to consider.
if your purpose is to preserve crawl budget and/or reduce bandwidth usage, then excluding crawlers from all those directories and wildcarded paths is the way to go.
however, this won't keep those urls out of the index.
if a url is discovered and you have excluded it from being crawled, the url can still appear in google SERPs with the following description in the snippet:
"A description for this result is not available because of this site's robots.txt - learn more."


http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 [support.google.com]:
To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag or x-robots-tag.
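for example, the meta tag goes in the head of the page itself, and the header can be sent by the server (the apache line is just a sketch assuming mod_headers is enabled, scoped to whatever responses you actually want kept out of the index):

<!-- in the <head> of the page -->
<meta name="robots" content="noindex">

# or as an http response header (apache + mod_headers)
Header set X-Robots-Tag "noindex"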



finally, for security reasons you should be sure that your server is properly configured for CGI, since you don't want anyone looking directly at scripts in your /cgi-bin/ directory.
robots.txt is for excluding well-behaved bots from crawling resources but it doesn't do anything about authenticating visitors or blocking requests.
you should also keep that in mind for other sensitive areas such as /wp-admin/.
and remember that http://example.com/robots.txt is publicly readable, so listing sensitive paths there hands malicious probes a map of where your vulnerabilities might be.
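one common extra layer is to put real access control in front of those areas at the server level -- e.g. an .htaccess file inside /wp-admin/ along these lines (just a sketch for apache; the AuthUserFile path is a placeholder, and you may need to leave admin-ajax.php reachable if front-end plugins use it):

# /wp-admin/.htaccess -- require a password before wordpress even sees the request
AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user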

bluemonster



 
Msg#: 4557036 posted 2:53 pm on Mar 25, 2013 (gmt 0)

thank you so much phranque :)
