Understanding the structure of robots.txt
I'm wondering how to disallow subfolders. For example, I have 2 forums of the same type installed on the domain. Let's say I want to disallow using the following:
Now, the folder adm and the file download.php are not at the root; they are inside folders, for example:
adm path is
and the file download.php path is
Will the robots.txt above work, or should I give the full path? Of course, my robots.txt is at the root, such as:
I want to keep it at the root. I have the same type of forum installed 3 times, and I want the same folders and files to be disallowed for all 3 forums.
So using:
will disallow these folders and files, and it will understand automatically that they are not at the root?
Thanks in advance.
robots.txt parsers are very simple: They look at the URL and they look at your robots.txt file. If the prefix specified by the Disallow matches the prefix of the URL that they are checking, then they will not fetch that URL.
A rule such as "Disallow: /adm/" prevents them from crawling "example.com/adm/<nothing or anything at all here>".
If that was the only Disallow line in the file, they would fetch /forum/adm/, but they would not fetch /adm/, /adm/books, or /adm/books/fiction.
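The prefix rule described above can be sketched in a few lines of Python. This is only an illustration of simple prefix matching, not the code of any real crawler; the function name `is_disallowed` and the example.com URLs are assumptions made for the example.

```python
from urllib.parse import urlsplit

def is_disallowed(url, disallow_prefixes):
    """Return True if the URL's path (plus query) starts with any Disallow prefix."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # An empty Disallow value blocks nothing, so empty prefixes are skipped.
    return any(path.startswith(p) for p in disallow_prefixes if p)

rules = ["/adm/"]  # the only Disallow line in this hypothetical file
for url in (
    "http://example.com/adm/",
    "http://example.com/adm/books/fiction",
    "http://example.com/forum/adm/",
):
    print(url, "blocked" if is_disallowed(url, rules) else "allowed")
```

Only the first two URLs are reported as blocked. The /forum/adm/ URL does not start with /adm/, so a simple robot would still fetch it.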
Therefore, to Disallow fetching of /forum/adm/ and /forum2/adm/, you'd need a separate Disallow line for each full path: "Disallow: /forum/adm/" and "Disallow: /forum2/adm/".
However, Google and a few other major search engines support wild-cards. But if you use wild-cards, you cannot use them in a "User-agent: *" policy record, because this would confuse many other robots which do not support wild-cards. So, to support both advanced and simple robots, you'd have to use something like:
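One possible layout is sketched here in Python so that the matching can be tried out. The wildcard patterns, the folder names, and the helper `wildcard_matches` are assumptions made for illustration, not the rules from the original post. The helper follows the commonly documented wildcard behavior (`*` matches any run of characters, a trailing `$` anchors the end of the URL) and is not any search engine's actual parser. Note that the Googlebot record repeats everything it needs, since a robot that finds a record naming it generally ignores the "User-agent: *" record.

```python
import re

# Hypothetical two-record layout: wildcard rules for Googlebot only,
# plain prefix rules for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /*/adm/
Disallow: /*/download.php

User-agent: *
Disallow: /forum/adm/
Disallow: /forum/download.php
Disallow: /forum2/adm/
Disallow: /forum2/download.php
"""

def wildcard_matches(pattern, path):
    """Match a path against a pattern where * is any run of characters
    and a trailing $ anchors the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(ROBOTS_TXT)
for path in ("/forum/adm/index.php", "/forum2/download.php?id=3", "/adm/"):
    hit = wildcard_matches("/*/adm/", path) or wildcard_matches("/*/download.php", path)
    print(path, "blocked" if hit else "allowed")
```

In this sketch the wildcard record is shorter than the plain one only because each pattern covers every forum folder at once; with two or three forums the saving is small.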
Since using the wild-cards may not make your robots.txt file shorter, it might be best to use the simplest robots.txt structure possible, and simply Disallow each of the /forum/adm/ subdirectories.
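If the same paths have to be repeated for several forums, the simple file can be generated rather than typed by hand. The following is a minimal sketch, assuming forum folders named `forum` and `forum2` and the blocked items `adm/` and `download.php` from the question; `build_robots_txt` is a hypothetical helper. The result is checked with Python's standard `urllib.robotparser`, which does plain prefix matching.

```python
from urllib.robotparser import RobotFileParser

FORUM_DIRS = ["forum", "forum2"]      # assumed install folders
BLOCKED = ["adm/", "download.php"]    # paths relative to each forum folder

def build_robots_txt(forum_dirs, blocked):
    """Write one Disallow line per forum folder and blocked item."""
    lines = ["User-agent: *"]
    for folder in forum_dirs:
        for item in blocked:
            lines.append(f"Disallow: /{folder}/{item}")
    return "\n".join(lines) + "\n"

robots_txt = build_robots_txt(FORUM_DIRS, BLOCKED)
print(robots_txt)

# Check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for url in (
    "http://example.com/forum2/adm/index.php",
    "http://example.com/forum2/viewtopic.php",
):
    print(url, "allowed" if parser.can_fetch("*", url) else "blocked")
```

Adding a third forum then only means adding its folder name to `FORUM_DIRS`.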
Thank you very much.
You provided me with all the details. Yes, what you said is what I just discovered when I was trying the robots.txt tool in Google Webmaster Tools. I will follow your advice and use the simple robots.txt. I appreciate your help.
About the length of the robots.txt: will it affect anything?
The file size of the robots.txt should not matter. A few years ago some search engines had problems opening robots.txt files that were larger than 500 KB. That issue has been mostly resolved. To be extra safe, I would keep the robots.txt file under 100 KB. Do not worry about being limited; that is a lot of space. Most of my robots.txt files do not exceed 20 KB.
Now I have restricted the files and folders from being crawled, and all is working fine, but the problem is that I started to see these errors in Webmaster Tools: URL restricted by robots.txt.
I have 17,000 errors!
I searched all the sitemaps and removed from them all the links I restricted in robots.txt, but Google still gives these errors for search.php and some other files.
So what is the reason? I thought maybe I should wait until Google had crawled all the sitemaps after I changed them and removed the things I restricted. Now all have been crawled, and Google still produces these errors!
Thanks in advance.
A) Wait one or two months. Google sometimes updates GWT reports very slowly, and the dates they show are sometimes not accurate -- The reports are not always generated from the most-recently-crawled data.
- or -
B) Don't worry about it. Google is simply telling you that the pages you Disallowed in robots.txt *are* disallowed and cannot be crawled. So they are telling you that your robots.txt changes worked.
The only thing I'd look into is your search.php Disallows -- Either Disallow search.php completely, or make sure you've got the query string Disallow syntax right in your robots.txt file. It probably *is* right, but it's worth checking. You might want to take steps to prevent *all* search.php URL+query-string variations from being spidered -- Otherwise, it's easy to create an almost infinitely large number of "search" URLs.
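Because matching is by prefix, the difference between the two options can be seen with a small Python sketch. The query-string parameters below are made-up examples, and `blocked_by_prefix` is a hypothetical helper that mimics a simple prefix-matching robot.

```python
def blocked_by_prefix(path_and_query, prefix):
    """A simple robot blocks any URL whose path+query starts with the prefix."""
    return path_and_query.startswith(prefix)

# Hypothetical search URLs; real parameter names may differ.
variations = [
    "/forum/search.php",
    "/forum/search.php?keywords=robots",
    "/forum/search.php?search_id=active_topics",
    "/forum/search.php?keywords=robots&start=25",
]

broad = "/forum/search.php"              # Disallow search.php completely
narrow = "/forum/search.php?search_id="  # Disallow one query-string form only

for url in variations:
    print(
        url,
        "| broad:", blocked_by_prefix(url, broad),
        "| narrow:", blocked_by_prefix(url, narrow),
    )
```

The broad prefix covers every variation, while the narrow one covers only URLs that begin with exactly that query string, so any other parameter order or name slips through.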
Thanks very much for your reply. In fact, search.php is a file in my phpBB forum, and I'm using: Disallow: /forum/search.php &
Logically, Google should only report an error if I'm telling it to follow a link in the sitemaps. It would be saying: I tried to follow the link you told me to follow, but at the same time you're blocking me through your robots.txt!
I have downloaded all the sitemaps since I changed the robots file and searched them with Ctrl+F for the word "search"; it is not included in any sitemap. That was to make sure.
I believe what you said is correct; maybe it just needs some time.
Here you can see my robots.txt < sorry, no personal links >
The forum, community, and how folders are all phpBB 3 forums.
I'm disallowing all unimportant content since I was facing a slow indexing rate. This step fixed the problem, and now Google regularly indexes my pages. But I still see these robots errors.
[edited by: tedster at 8:40 pm (utc) on Aug. 30, 2008]
People, the robots errors are increasing.
It's 30k now.
I have over 3000 "errors" in my GWT report. They've been there for years. It is not a true error. It says "Disallowed by robots.txt" and since *you* disallowed those URLs, you should expect these "errors".
Google Webmaster Tools are not perfect. Google considers it an error when you Disallow *anything* because they want to crawl all of it.
Only "Errors for URLs in Sitemaps" are important in this case.
Stop changing your site/robots.txt/SiteMap.
Wait 3 months after the last change (do something else profitable while waiting) to let Google crawl the site, get new data and update your GWT report, then check again. :)
You can search Google in milliseconds. But crawling, ranking, updating Toolbar PageRank, and updating GWT reports can take months.
Thanks again, Gim.
I'm not playing with GWT any more since the last time you told me to wait, but it used to have 15,000 errors and now has 30,000, so I was wondering if there is anything I can do.
I will leave them, wait, and see what happens.