Understanding the structure of robots.txt subdirectories

alahamdan
msg:3726334 - 1:52 pm on Aug 19, 2008 (gmt 0)

Hello people,

I'm wondering how to disallow subfolders. For example, I have two forums of the same type installed on the domain, and I want to disallow some of their paths using the following:

User-agent: *
Disallow: /adm/
Disallow: /download.php

Now, the folder adm and the file download.php are not at the root; they are inside other folders. For example, the adm path is

domainname.com/forum/adm

and the download.php path is

domainname.com/forum/subfolder/download.php

Will the robots.txt above work, or should I write out the full path? Of course, my robots.txt is at the root:

domainname.com/robots.txt

I want to keep it at the root. The same type of forum is installed three times, and I want the same folders and files disallowed for all three forums.

So, using:

User-agent: *
Disallow: /adm/
Disallow: /download.php

will these folders and files be disallowed? Will crawlers automatically understand that they are not at the root?

Thanks in advance.

 

jdMorgan
msg:3726387 - 2:57 pm on Aug 19, 2008 (gmt 0)

robots.txt parsers are very simple: they look at the URL and they look at your robots.txt file. If the prefix specified by a Disallow line matches the prefix of the URL they are checking, then they will not fetch that URL.

So,
Disallow: /adm/
prevents them from crawling "example.com/adm/<nothing or anything at all here>".

If that were the only Disallow line in the file, they would fetch /forum/adm/, but they would not fetch /adm/, /adm/books, or /adm/books/fiction.

Therefore, to Disallow fetching of /forum/adm/ and /forum2/adm/, you'd need
Disallow: /forum/adm/
Disallow: /forum2/adm/
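
You can check this prefix-matching behaviour yourself with Python's built-in urllib.robotparser, which implements the same simple rules (no wild-cards). A minimal sketch, using the question's paths as stand-ins:

from urllib.robotparser import RobotFileParser

# The rules from the question: plain prefix matches against the URL path.
rules = """User-agent: *
Disallow: /adm/
Disallow: /download.php""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# /adm/ at the root is blocked...
print(parser.can_fetch("*", "http://domainname.com/adm/books"))            # False
# ...but the same folder one level down is NOT, because the prefix differs.
print(parser.can_fetch("*", "http://domainname.com/forum/adm/index.php"))  # True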

However, Google and a few other major search engines support wild-cards. But if you use wild-cards, you cannot put them in a "User-agent: *" policy record, because they would confuse the many other robots that do not support wild-cards. So, to support both advanced and simple robots, you'd have to use something like:

User-agent: Googlebot
User-agent: Slurp
Disallow: /*/adm/

User-agent: *
Disallow: /forum/adm/
Disallow: /forum2/adm/

Since using wild-cards may not make your robots.txt file any shorter, it might be best to use the simplest robots.txt structure possible and simply Disallow each forum's /adm/ subdirectory explicitly.
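
For intuition only, here is a rough sketch of how a wild-card-aware crawler might interpret a pattern like /*/adm/ -- this is an assumption about the matching model ('*' matches any run of characters, a trailing '$' anchors the end), not Google's actual code:

import re

def robots_pattern_to_regex(pattern):
    # Assumed Google-style semantics: '*' matches any characters,
    # '$' anchors the end; everything else is a literal prefix.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*/adm/")
print(bool(rule.match("/forum/adm/index.php")))  # True  - blocked
print(bool(rule.match("/forum2/adm/")))          # True  - blocked
print(bool(rule.match("/adm/")))                 # False - a root-level /adm/ needs its own rule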

Jim

alahamdan
msg:3726393 - 3:06 pm on Aug 19, 2008 (gmt 0)

Dear Jim,

Thank you very much.

You provided me with all the details. Yes, I discovered exactly what you said when I was trying the robots.txt tool in Google Webmaster Tools. I will follow your advice and use the simple robots.txt. I appreciate your help.

About the length of the robots.txt: will it affect anything?

Thanks again

goodroi
msg:3727149 - 1:48 pm on Aug 20, 2008 (gmt 0)

The file size of the robots.txt should not matter. A few years ago some search engines had problems opening robots.txt files larger than 500 KB, but that issue has mostly been resolved. To be extra safe, I would keep the robots.txt file under 100 KB. Do not worry about being limited; that is a lot of space. Most of my robots.txt files do not exceed 20 KB.

alahamdan
msg:3734993 - 3:34 pm on Aug 30, 2008 (gmt 0)

Hello again,

I have now restricted the files and folders from being crawled, and everything is working fine, but I have started to see these errors in Webmaster Tools: "URL restricted by robots.txt".

I have 17,000 errors!

I went through all the sitemaps and removed from them all the links I restricted in robots.txt, but Google still gives these errors for search.php and some other files.

So what is the reason? I thought maybe I should wait until Google crawled all the sitemaps after I changed them and removed the restricted entries; now everything has been crawled and Google still produces these errors!

Any idea?

Thanks in advance.

jdMorgan
msg:3735033 - 4:13 pm on Aug 30, 2008 (gmt 0)

A) Wait one or two months. Google sometimes updates GWT reports very slowly, and the dates they show are sometimes not accurate -- The reports are not always generated from the most-recently-crawled data.

- or -

B) Don't worry about it. Google is simply telling you that the pages you Disallowed in robots.txt *are* disallowed and cannot be crawled. So they are telling you that your robots.txt changes worked.

The only thing I'd look into is your search.php Disallows -- Either Disallow search.php completely, or make sure you've got the query string Disallow syntax right in your robots.txt file. It probably *is* right, but it's worth checking. You might want to take steps to prevent *all* search.php URL+query-string variations from being spidered -- Otherwise, it's easy to create an almost infinitely large number of "search" URLs.
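
One note on the query-string point: since Disallow lines are prefix matches, a plain Disallow: /forum/search.php already covers every ?query variation, and a trailing '*' adds nothing. A quick check with Python's built-in parser, reusing this thread's paths:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""User-agent: *
Disallow: /forum/search.php""".splitlines())

# The prefix rule blocks the bare URL and every query-string variation.
for url in ("http://domainname.com/forum/search.php",
            "http://domainname.com/forum/search.php?keywords=robots&start=25"):
    print(url, "->", parser.can_fetch("*", url))  # both print False (blocked)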

Jim

alahamdan
msg:3735103 - 7:00 pm on Aug 30, 2008 (gmt 0)

Dear jdMorgan,

Thanks very much for your reply. In fact, search.php is a file in my phpBB forum, and I'm using:

Disallow: /forum/search.php
Disallow: /forum/search.php*

Logically, Google should only report an error if I am telling it to follow a link in the sitemaps -- as if it were saying, "I tried to follow the link you asked me to follow, but at the same time you are preventing me through your robots.txt."

To make sure, I downloaded all the sitemaps generated since I changed the robots file and searched them with Ctrl+F for the word "search"; it is not included in any sitemap.

I believe what you said is correct; maybe it just needs some time.

Here you can see my robots.txt < sorry, no personal links >

The forum, community, and how folders are all phpBB 3 forums.

I'm disallowing all the unimportant content because I was facing a slow indexing rate. This step fixed the problem -- Google now indexes my pages regularly -- but I still see these robots errors.

Thanks again

[edited by: tedster at 8:40 pm (utc) on Aug. 30, 2008]

alahamdan
msg:3738528 - 7:29 pm on Sep 4, 2008 (gmt 0)

People, the robots errors are increasing.

It's at 30,000 now.

Any idea?

jdMorgan
msg:3738579 - 8:32 pm on Sep 4, 2008 (gmt 0)

I have over 3000 "errors" in my GWT report. They've been there for years. It is not a true error. It says "Disallowed by robots.txt", and since *you* disallowed those URLs, you should expect these "errors".

Google Webmaster Tools are not perfect. Google considers it an error when you Disallow *anything* because they want to crawl all of it.

Only "Errors for URLs in Sitemaps" are important in this case.

Stop changing your site, robots.txt, and sitemap.
Wait 3 months after the last change (do something else profitable while waiting) to let Google crawl the site, get new data and update your GWT report, then check again. :)

You can search Google in milliseconds. But crawling, ranking, updating Toolbar PageRank, and updating GWT reports can take months.

Jim

alahamdan
msg:3738740 - 12:00 am on Sep 5, 2008 (gmt 0)

Thanks again, Jim.

I haven't touched GWT since the last time you told me to wait, but it used to show 15,000 errors and now shows 30,000, so I was wondering whether there is anything I can do.

I will leave them alone, wait, and see what happens.

Thanks again.
