
[Robots.txt] What is the correct User agent for Adsense Crawl

Google adsense crawl error - what is the right user agent

9:15 am on Mar 21, 2013 (gmt 0)

New User

joined:Dec 15, 2011
posts:15
votes: 0


I happened to face a Google AdSense crawl error and found the fix.

It says that to let AdSense crawl your pages and show targeted ads, you should add these lines at the top of robots.txt:

User-agent: Mediapartners-Google
Disallow:


But my robots.txt file already has these lines:

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /


Can someone explain what these lines mean and what the difference between the two is?

Which is the right one? Do I need to make any changes, or just leave it as it is?
10:37 am on Mar 21, 2013 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:13210
votes: 347


You don't need the Allow lines at all. They're only for identifying permitted areas inside of excluded areas-- and then only for robots like googlebot that recognize the word.

In robots.txt, simpler is better. If you don't say anything about the adbot by name, it will follow the same rules that apply to your generic robot:
User-Agent: *

If some areas of your site are blocked to most robots, but you want the adbot to go absolutely everywhere, give it a separate line that says
Disallow:

Just like that. Nothing after the word "disallow".
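
For example, a minimal sketch (the /private/ path is just a placeholder for whatever your generic rules actually block):

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /private/

The adbot gets its own group with an empty Disallow, so it may fetch everything; every other robot still follows the generic rules.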
5:35 pm on Mar 21, 2013 (gmt 0)

New User

joined:Dec 15, 2011
posts:15
votes: 0


Now what should I do?

Can I replace the second set of lines with the first one, or just leave it as it is?
5:45 pm on Mar 21, 2013 (gmt 0)

Administrator

phranque

joined:Aug 10, 2004
posts:10553
votes: 13


do you have any other directives in your robots.txt?
5:53 pm on Mar 21, 2013 (gmt 0)

New User

joined:Dec 15, 2011
posts:15
votes: 0


This is my full robots.txt file -

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /go/
Disallow: /archives/
disallow: /*?*
Disallow: /wp-*
Disallow: /author
Disallow: /feed/
Disallow: /comments/feed/
Disallow: *?replytocom


User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /
5:54 pm on Mar 21, 2013 (gmt 0)

New User

joined:Dec 15, 2011
posts:15
votes: 0


Now should I just leave it as it is, or should I change the last four lines (the ones related to AdSense)?
7:17 pm on Mar 21, 2013 (gmt 0)

Senior Member from US 

netmeg

joined:Mar 30, 2005
posts:12905
votes: 193


I'd take out those last four lines altogether; you don't need them.
5:55 am on Mar 22, 2013 (gmt 0)

Administrator

phranque

joined:Aug 10, 2004
posts:10553
votes: 13


disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom


although googlebot and others may allow exceptions, the robots exclusion protocol specifically does not support globbing or wildcarding, so don't expect those *s to work everywhere.


disallow: /*?*


although googlebot and others may allow exceptions, the robots exclusion protocol describes the directive as an upper-cased "Disallow:" and uses that exclusively in examples, so don't expect the lower-cased "disallow:" to work everywhere.


Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom


assuming googlebot and others support that globbing/wildcarding, the first three lines quoted above (/wp-admin/, /wp-includes/, /wp-content/) are redundant given the fifth line (Disallow: /wp-*), and the last line (*?replytocom) is redundant given the fourth line (disallow: /*?*).
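
in other words, assuming googlebot-style wildcard support, the generic group could be trimmed to something like this (a sketch only; keep whichever site-specific paths you actually need):

User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /archives/
Disallow: /author
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /*?*
Disallow: /wp-*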


User-agent: *
Disallow: /wp-content/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/


"Allow:" is an extension to the robots exclusion protocol supported by google, so you're ok in this specific case.
however, the crawler is going to find the most specific user agent and respect the rules in that group.
therefore if you want any directives from the general rule (User-agent: *) to apply to a more specific rule (e.g. User-agent: Googlebot-Image) you will have to repeat all those rules within the more specific group.
for example if you want Googlebot-Image crawling /wp-content/uploads/ but nothing else in /wp-content/, you will need something more like this:

User-agent: Googlebot-Image
Disallow: /wp-content/
Allow: /wp-content/uploads/



User-agent: Mediapartners-Google*
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /


continuing with the behavior described above about respecting the most specific group, if you really want these 3 user agent strings to exclude nothing (i.e. crawl everything), you might as well follow general protocol and do this:

User-agent: Mediapartners-Google*
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Mobile
Disallow:




here's another thing to consider.
if your purpose is to preserve crawl budget and/or reduce bandwidth usage, then excluding crawlers from all those directories and wildcarded paths is the way to go.
however this won't keep those urls out of the index.
if a url is discovered and you have excluded it from being crawled, the url will appear in google SERPs with the following description in the snippet:
"A description for this result is not available because of this site's robots.txt - learn more."


http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 [support.google.com]:
To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag or x-robots-tag.
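
as a rough example, either of these does the job, assuming the page itself is left crawlable so the directive can actually be seen:

<meta name="robots" content="noindex">

(placed in the <head> of the page), or the equivalent HTTP response header:

X-Robots-Tag: noindex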



finally, for security reasons you should be sure that your server is properly configured for CGI, since you don't want anyone looking directly at scripts in your /cgi-bin/ directory.
robots.txt is for excluding well-behaved bots from crawling resources but it doesn't do anything about authenticating visitors or blocking requests.
you should also keep that in mind for other sensitive areas such as /wp-admin/.
also remember that http://example.com/robots.txt is publicly readable, so the paths you list there can act as a road map for malicious probes and may expose your vulnerabilities.
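
if you actually need to block requests rather than just ask well-behaved crawlers to stay away, do it at the server level. a rough sketch for apache 2.4, assuming an .htaccess file inside /wp-admin/ and a purely hypothetical admin IP:

# /wp-admin/.htaccess (apache 2.4; AllowOverride must permit it)
Require ip 203.0.113.42

other servers have their own equivalents; the point is that access control lives in the server configuration, not in robots.txt.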
2:53 pm on Mar 25, 2013 (gmt 0)

New User

joined:Dec 15, 2011
posts:15
votes: 0


thank you so much phranque :)