Symbols in Robots.txt

Forum Moderators: goodroi

Meaning of the symbols in robots.txt

elvang

7:39 am on Apr 8, 2022 (gmt 0)

Hi, I want to learn what this robots.txt means. What is it saying? In particular, what do the .asp$ parts mean? Thank you.
User-agent: *
Disallow: /*.asp$
Disallow: /*.asp?$
Disallow: /*.asp?token=$
Disallow: /arama?*

robzilla

8:27 am on Apr 8, 2022 (gmt 0)

Google, Bing, and other major search engines support a limited form of wildcards for path values. These wildcard characters are:

* designates 0 or more instances of any valid character.
$ designates the end of the URL.

[developers.google.com...]

Disallow: /*.asp$

Will match:
/file.asp
/folder/file.asp

But not (so these can still be crawled):
/file.asp?key=value
/folder/file.asp?key=value

Disallow: /*.asp?token=$

Will match:
/file.asp?token=

But not (so this can still be crawled):
/file.asp?token=jSd9mbs3a392s

A strange set of rules, that's for sure.
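The matching described above can be sketched in Python. This is only an illustration of the * and $ semantics as explained in this post, not the code any search engine actually runs; the helper names rule_to_regex and is_disallowed are made up for the example, and Allow rules and longest-match precedence are ignored.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression.

    '*' becomes "any run of characters"; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is literal text.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where '*' stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_disallowed(path: str, rules: list[str]) -> bool:
    """Return True if any Disallow pattern matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

# The Disallow lines from the original question.
RULES = ["/*.asp$", "/*.asp?$", "/*.asp?token=$", "/arama?*"]

for path in ["/file.asp", "/file.asp?key=value",
             "/file.asp?token=", "/file.asp?token=jSd9mbs3a392s"]:
    print(path, "->", "blocked" if is_disallowed(path, RULES) else "crawlable")
```

Without the trailing $, a pattern is treated as a prefix, which is why /arama?* blocks every URL that begins with /arama? in this sketch.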

not2easy

11:45 am on Apr 8, 2022 (gmt 0)

The robots.txt file lets you ask robots like Googlebot not to crawl files or directories on your site. It controls crawling only, not indexing: if a search engine finds links to a disallowed URL somewhere else, it can still index that URL without crawling it. For example, if another site links to a page you asked Google not to crawl, Google may still list that URL in its results even though it has not fetched the page.

The .asp is a file extension, just as .html, .php, and .jpg are extensions for other file types. As robzilla explained, the symbols * and $ are defined and explained at the Google link he posted. Keep in mind that robots.txt does not prevent all robots from crawling the files and directories that you list.

Also, if you block crawling of resources that Google needs in order to render the page completely, it cannot render or evaluate your pages, so you don't want to block .js or .css files. Really, the best way to learn about robots.txt is to visit the link robzilla posted; it explains all aspects of using robots.txt.

elvang

12:54 pm on Apr 8, 2022 (gmt 0)

As far as I understand, they didn't disallow anything?

not2easy

1:05 pm on Apr 8, 2022 (gmt 0)

If you mean your specific example, robots.txt disallowed /file.asp and /folder/file.asp and /file.asp?token= as robzilla explained.

tangor

10:06 am on Apr 9, 2022 (gmt 0)

Bear in mind that only good bots will honor the robots.txt directives. Often such entries can lead to targeting by bad robots.

Just sayin'...

Robert Charlton

3:36 am on Apr 11, 2022 (gmt 0)

As far as I understand, they didn't disallow anything?

elvang, I'm not sure what this comment means. Are you asking about robots.txt just for general understanding, or are you applying this information to your robots.txt file on your site?

If you're applying this to your site, what are you trying to accomplish? If it's not working, what are you seeing that you're trying to fix?

Also, as tangor's answer suggested, your robots.txt is a very public document, so if you, say, are wanting to keep certain pages secret, or even out of the index, robots.txt may not be the way to do it.

It may be that you'd be better off using the robots "noindex" meta tag in the pages you want to keep out of Google's index (in which case, for a number of reasons, you don't want to block the same page with robots.txt), or it may be that development pages and templates should be kept away from Googlebot by password protection.

It may also be that you've dropped certain pages but are still seeing them in the SERPs. In that case, there may be no problem at all.

I think we can give you a better answer if you're having a problem with your site, if we understand what the problem is... and we can discuss what robots.txt does and doesn't do, and whether using it in certain situations is the correct approach.

Also, if you are asking about your site, thank you for obeying our posting guidelines and for not revealing the name of your domain.

Please do, though, give us some background... indicate how long your site has been online, what type of site it is (ecommerce, information, etc), whether it's WordPress or some other CMS, and how experienced you are with the technical side of the web.

elvang

11:15 am on Apr 11, 2022 (gmt 0)

Thank you tangor and robert charlton.

Actually, the CMS our company works with has banned all SEO tools' bots (Screaming Frog, Ahrefs, etc.). In other words, it only permits the bots of Google, Bing, Yandex, etc. I am trying to find a way to allow Screaming Frog's and Ahrefs's bots. I tried lots of SEO tools, and all of them reported that our CMS is rejecting their bots. For that reason, I thought I could allow those SEO tools' bots with the robots.txt file.

Or do you have any other advice on how to allow those tools' bots?

Thank you.

not2easy

12:36 pm on Apr 11, 2022 (gmt 0)

It COULD be possible to allow those bots, depending on how they are being blocked. If it is by a Cloudflare firewall, for example, then you can change the settings. If they are blocked by UA (User-Agent), then the .conf file or .htaccess file can be edited. You may see clues in the access logs or the server's error logs.

lucy24

4:57 pm on Apr 11, 2022 (gmt 0)

For that reason, I thought I could allow those SEO tools' bots with the robots.txt file.

At this point it becomes important to understand the difference between robots.txt and a configuration file such as htaccess.

robots.txt is, as the extension tells you, a text file. It is purely informational: robots (or nosy humans) can choose to look at it or not, and can choose to honor its directives or not. It is the equivalent of putting up a sign that says Employees Only or No Admittance.
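To make the "sign on the door" idea concrete, here is a rough Python sketch of what a well-behaved bot does: it reads the rules and decides for itself whether to make the request. It uses the standard library's urllib.robotparser; the bot name ExampleBot, the example.com URLs, and the /private/ rule are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Ask the parsed rules whether this (hypothetical) bot may request the URL."""
    return parser.can_fetch(user_agent, url)

for url in ("https://example.com/public/index.html",
            "https://example.com/private/report.html"):
    verdict = "will fetch" if may_fetch(url) else "skips voluntarily"
    print(url, "->", verdict)
```

Nothing in this check is enforced by the server. A bot that never calls may_fetch simply requests the page anyway.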

Configuration files are hard-and-fast rules. If something is blocked in a configuration file--by any means, whether user-agent or IP or headers or something more complicated--it is impossible for it to get in. You can't choose not to obey .htaccess. It is the equivalent of installing a deadbolt.

afaik, there is no internet-access equivalent to a door alarm: Sure, you can open the door, but you will attract immediate and unwelcome attention. Bummer.

phranque

10:03 pm on Apr 11, 2022 (gmt 0)

Actually, the CMS our company works with has banned all SEO tools' bots (Screaming Frog, Ahrefs, etc.). In other words, it only permits the bots of Google, Bing, Yandex, etc. I am trying to find a way to allow Screaming Frog's and Ahrefs's bots. I tried lots of SEO tools, and all of them reported that our CMS is rejecting their bots. For that reason, I thought I could allow those SEO tools' bots with the robots.txt file.

Or do you have any other advice on how to allow those tools' bots?

you can use robots.txt to Disallow a compliant bot which means it forgoes making the request.
you cannot use robots.txt to allow a request to avoid a CMS ban.

if your CMS is using the User-Agent string in the request to identify SEO tool bots, many of these tools (e.g., Screaming Frog) allow you to specify an alternative user agent string so that the request looks like it comes from your preferred visitor such as Googlebot, or a typical desktop browser, or an iPhone browser, etc.
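one rough way to check whether the block really keys on the User-Agent string is to request the same page from your own site with different strings and compare the status codes. the Python sketch below assumes nothing about your CMS; the function names and both user agent strings are placeholders invented for the example, and a block based on IP address or other headers would not show up this way.

```python
import urllib.error
import urllib.request

def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that announces the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def status_for(url: str, user_agent: str) -> int:
    """Return the HTTP status code the server sends back for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 403 and similar refusals arrive as exceptions; the code is what we want.
        return err.code

def compare_user_agents(url: str, user_agents: list[str]) -> dict[str, int]:
    """Map each User-Agent string to the status code it received."""
    return {ua: status_for(url, ua) for ua in user_agents}

# Placeholder strings, not the real ones sent by any tool or browser.
SAMPLE_AGENTS = [
    "ExampleSEOBot/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]
```

calling compare_user_agents with a URL on your own site and SAMPLE_AGENTS returns one status code per string. if the bot-like string gets a 403 while the browser-like one gets a 200, the block is probably based on the User-Agent.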

elvang

7:45 am on Apr 13, 2022 (gmt 0)

Thank you phranque tangor lucy24 not2easy