
Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
how to ban Google from indexing ALL .php files
I'm using rewrite and don't want a dupe penalty
walkman




msg:1526495
 5:19 am on Apr 27, 2005 (gmt 0)

I want to make sure that I read this (see below) right. If I use:
User-agent: Googlebot
Disallow: /*.php$

Google will not index any pages that have .php as an extension, correct? If so: my index page is index.php, but all my links point to mydomain.com/. Will I have crawling problems? Any ideas?

Thanks,

From [google.com...]
"To disallow a specific file type, simply modify the Disallow command in your robots.txt file. This works for all of the types of files Googlebot crawls, including HTML, GIFs and .docs. For example, to disallow Microsoft Word files with the ".doc" extension, you would add the following lines to your robots.txt file:

User-agent: Googlebot
Disallow: /*.doc$

 

larryhatch




msg:1526496
 5:24 am on Apr 27, 2005 (gmt 0)

I'd be worried about your index.php not getting crawled.
Are you sure that's what you want? -Larry

walkman




msg:1526497
 5:40 am on Apr 27, 2005 (gmt 0)

"Are you sure that's what you want? "

Not really ;). I want Google to crawl my site. How does Google see the main page, as index.ext or as domain.com/? Does anyone know?

thanks for replying Larry,

Brett_Tabke




msg:1526498
 2:34 am on Apr 29, 2005 (gmt 0)

drop the slash and just use *.php

walkman




msg:1526499
 4:30 am on Apr 29, 2005 (gmt 0)

Brett,
thanks for the reply. One concern is still out there:
will my home page get indexed? It's an "invisible" index.php. If you banned Google from indexing .htm pages, would WebmasterWorld get indexed when 100% of the links are to the root, not /index.ext?

jdMorgan




msg:1526500
 4:39 am on Apr 29, 2005 (gmt 0)

Googlebot works using URLs - It has no visibility into the internal workings of your server.

example.com/ and example.com/index.php are two different URLs. If you disallow *.php, then /index.php is disallowed, but "/" is not disallowed.
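
A rough Python sketch of that distinction, assuming Googlebot treats * as "any run of characters" and a trailing $ as "end of URL", as the Google help text quoted at the top of the thread describes. The function names are made up for illustration, and this is not Google's actual matcher.

```python
import re

def rule_to_regex(pattern):
    """Translate a Googlebot-style Disallow pattern into a compiled regex.

    Assumed semantics: '*' matches any characters, a trailing '$' anchors
    the end of the URL, and every other character is taken literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(url_path, pattern):
    # re.match anchors at the start of the string, giving prefix matching.
    return rule_to_regex(pattern).match(url_path) is not None

for path in ["/", "/index.php", "/dir/page.php", "/index.php?id=1"]:
    print(path, is_disallowed(path, "/*.php$"))
```

Under these assumptions "/" stays crawlable while "/index.php" is blocked. Note that "/index.php?id=1" is not matched by the $-anchored rule, because the URL no longer ends in ".php".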

If you're worried about it, then check "/" using the WebmasterWorld server headers checker, and make sure you get a 200-OK and not a redirect (301 or 302) to /index.php due to some "misimplementation."
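
If you'd rather script the header check than use an online tool, something like the following should do. It is only a sketch using Python's standard http.client, which does not follow redirects on its own; the host name in the usage comment is a placeholder, and the function names are invented for this example.

```python
import http.client
from typing import Optional, Tuple

def fetch_root_status(host: str) -> Tuple[int, Optional[str]]:
    """Send 'HEAD /' and return (status code, Location header).

    http.client does not follow redirects, so a 301/302 is reported as-is.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def verdict(status: int, location: Optional[str]) -> str:
    """Turn the raw response into the answer we care about."""
    if status == 200:
        return "ok: / is served directly"
    if status in (301, 302):
        return f"redirect: / points to {location}"
    return f"unexpected status {status}"

# Usage (placeholder host, needs network access):
#   print(verdict(*fetch_root_status("www.example.com")))
```

A "redirect" verdict pointing at /index.php would mean the root URL depends on a URL that the *.php rule blocks.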

Jim

walkman




msg:1526501
 6:53 am on Apr 29, 2005 (gmt 0)

thank you Jd,
It makes sense, I just wanted to make sure.
will check the headers that way.

kevsh




msg:1526502
 8:44 pm on May 17, 2005 (gmt 0)

Okay, this seems to be along the lines of my issue so I'll post here instead of a new thread.

I have a bunch of incoming links starting with "?" as in:

www.mydomain.com/?=123
www.mydomain.com/?=abc

The query is handled by index.php. Problem is, Google seems to be seeing these as unique pages and not indexing any one of them (possibly flagging them as duplicate content?)

In any event, I want to block all incoming links beginning with "?" AND any PHP page with a "?" appended, only.

Examples of pages I WANT blocked:

www.mydomain.com/?=123
www.mydomain.com/index.php?id=123
www.mydomain.com/dir/file.php?id=abc

Pages I DO NOT WANT blocked:

www.mydomain.com/index.php
www.mydomain.com/file.html
etc.

I am thinking (hoping) this will work - at least for Google:

User-Agent: googlebot
Disallow: /?
Disallow: /*.php?

If not, any suggestions on how to handle the above scenario?
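
One way to sanity-check the two rules against the example URLs is a small Python sketch. It assumes Googlebot-style matching in which * means "any characters", "?" is an ordinary literal character, and rules are matched from the start of the URL path. The helper names are hypothetical, and other crawlers may not support wildcards at all.

```python
import re

RULES = ["/?", "/*.php?"]  # the two Disallow lines proposed above

def rule_matches(url_path, rule):
    # Escape literals (including '?') and turn each '*' into '.*'.
    regex = ".*".join(re.escape(piece) for piece in rule.split("*"))
    # re.match anchors at the start only, i.e. prefix matching.
    return re.match(regex, url_path) is not None

def is_blocked(url_path, rules=RULES):
    return any(rule_matches(url_path, rule) for rule in rules)

want_blocked = ["/?=123", "/index.php?id=123", "/dir/file.php?id=abc"]
want_allowed = ["/index.php", "/file.html"]

for path in want_blocked + want_allowed:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Under those assumptions the three query-string URLs come out blocked and the two plain pages come out allowed.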

Reid




msg:1526503
 9:00 pm on May 17, 2005 (gmt 0)

what about this?

user-agent: googlebot
disallow: *.php?

index.php will not be blocked but index.php?.... will be blocked.

I'm not sure about this - just a suggestion

or in the previous case of wanting to disallow site.com?... urls

disallow:? (that's scary looking)

I wouldn't want to chance
disallow: /? because if it misinterprets the "?" then you are disallowing the root.

walkman




msg:1526504
 9:07 pm on May 17, 2005 (gmt 0)

"12. How do I tell Googlebot not to crawl dynamically generated pages on my site?"
[google.com...]

User-agent: Googlebot
Disallow: /*?

ThomasB




msg:1526505
 5:26 pm on May 18, 2005 (gmt 0)

Why not just do a 301 to the / or any directory you specifically exclude?

Never forget that there are other engines out there as well.

Dijkgraaf




msg:1526506
 3:48 am on Jun 2, 2005 (gmt 0)

I wonder if "Disallow: /*?" actually means that they just won't follow URLs with query strings, rather than that they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?

Billy Batson




msg:1526507
 6:57 am on Jun 11, 2005 (gmt 0)

Hi,

I'm also trying to avoid a dupe penalty. (Actually, I think I already incurred one and I'm trying to fix my site...)

I want to disallow the printer friendly version of my pages from Google.

Will this work?


User-agent: Googlebot
Disallow: /*,print.htm$

All the printer friendly urls end with ",print.htm"

Thanks!

walkman




msg:1526508
 6:47 pm on Jun 11, 2005 (gmt 0)

Billy Batson,
not sure if it will make a difference, but does the , (comma) have to be there?
Probably only GoogleGuy can give you a definite answer. Not sure how Gbot handles commas, if it does at all.

Billy Batson




msg:1526509
 9:04 pm on Jun 11, 2005 (gmt 0)

Hi Walkman,

No, the comma doesn't have to be there, now that I think about it.

Will implement the comma-less code.

Thanks.

Reid




msg:1526510
 5:12 pm on Jun 15, 2005 (gmt 0)

"I wonder if "Disallow: /*?" actually means that they just won't follow URLs with query strings, rather than they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?"

robots.txt is based on prefix-matching, meaning it only compares text strings against the URL, starting at the root. It knows nothing about file types.

disallow: /*?
means:
/ (the root must always be present)
* any text in between "/" and "?"
? if "?" appears anywhere within the URL then it is disallowed

disallow: /*prnt
/ root
* any text string between "/" and "prnt"
prnt if "prnt" appears anywhere within the URL it is disallowed

without a wildcard, the URL must begin with the rule exactly as written
disallow: /prnt

/prnta.html is disallowed
/prnta/ directory is disallowed
/aprnt.html is allowed because it does not match "/prnt"
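
The non-wildcard case is simple enough to show in a couple of lines of Python. This is just an illustration of plain prefix matching with a made-up function name, not any crawler's real code.

```python
def disallowed_by_prefix(url_path, rule):
    # Plain robots.txt matching: the URL path must start with the rule text.
    return url_path.startswith(rule)

for path in ["/prnta.html", "/prnta/", "/aprnt.html"]:
    state = "disallowed" if disallowed_by_prefix(path, "/prnt") else "allowed"
    print(path, state)
```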


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved