

how to ban Google from indexing ALL .php files

I'm using rewrite and don't want a dupe penalty

     
5:19 am on Apr 27, 2005 (gmt 0)



I want to make sure that I read this (see below) right. If I use:
User-agent: Googlebot
Disallow: /*.php$

Google will not index any pages that have .php as an extension, correct? If true: my home page is index.php, but all my links point to mydomain.com/. Will I have crawling problems? Any ideas?

Thanks,

From [google.com...]
"To disallow a specific file type, simply modify the Disallow command in your robots.txt file. This works for all of the types of files Googlebot crawls, including HTML, GIFs and .docs. For example, to disallow Microsoft Word files with the ".doc" extension, you would add the following lines to your robots.txt file:

User-agent: Googlebot
Disallow: /*.doc$

5:24 am on Apr 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd be worried about your index.php not getting crawled.
Are you sure that's what you want? -Larry
5:40 am on Apr 27, 2005 (gmt 0)



"Are you sure that's what you want? "

Not really ;). I want Google to crawl my site. How does Google see the home page: as index.ext or as domain.com/? Does anyone know?

thanks for replying Larry,

2:34 am on Apr 29, 2005 (gmt 0)

brett_tabke (WebmasterWorld Administrator)



drop the slash and just use *.php
4:30 am on Apr 29, 2005 (gmt 0)



Brett,
thanks for the reply. One concern is still out there:
will my home page get indexed? It's an "invisible" index.php. If you banned Google from indexing .htm pages, would WebmasterWorld get indexed when 100% of the links are to the root, not /index.ext?
4:39 am on Apr 29, 2005 (gmt 0)

jdmorgan (WebmasterWorld Senior Member)



Googlebot works using URLs - It has no visibility into the internal workings of your server.

example.com/ and example.com/index.php are two different URLs. If you disallow *.php, then /index.php is disallowed, but "/" is not disallowed.

If you're worried about it, then check "/" using the WebmasterWorld server headers checker, and make sure you get a 200-OK and not a redirect (301 or 302) to /index.php due to some "misimplementation."

Jim
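If you don't have a header checker handy, the same check can be sketched in a few lines of Python. This is only an illustration of the idea above, not any particular tool: classify_status and check_root are made-up helper names, example.com is a placeholder host, and the request is plain HTTP. http.client does not follow redirects on its own, so a 301 or 302 shows up as-is.

```python
import http.client

def classify_status(status, location=None):
    """Summarize what a response status means for the root URL."""
    if status == 200:
        return "ok: served directly"
    if status in (301, 302, 303, 307, 308):
        return f"redirect to {location}"
    return f"unexpected status {status}"

def check_root(host, path="/"):
    """Send a HEAD request for path on host without following redirects."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return classify_status(response.status, response.getheader("Location"))
    finally:
        conn.close()
```

Calling check_root("example.com") with your own host name should report "ok: served directly" if "/" is answered with a 200, and name the redirect target if the server bounces "/" to /index.php.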

6:53 am on Apr 29, 2005 (gmt 0)



thank you Jd,
It makes sense, I just wanted to make sure.
will check the headers that way.
8:44 pm on May 17, 2005 (gmt 0)

10+ Year Member



Okay, this seems to be along the lines of my issue so I'll post here instead of a new thread.

I have a bunch of incoming links starting with "?" as in:

www.mydomain.com/?=123
www.mydomain.com/?=abc

The query is handled by index.php. Problem is, Google seems to be seeing these as unique pages and not indexing any one of them (possibly flagging them as duplicate content?)

In any event, I want to block all incoming links beginning with "?" AND any PHP page with a "?" appended, only.

Examples of pages I WANT blocked:

www.mydomain.com/?=123
www.mydomain.com/index.php?id=123
www.mydomain.com/dir/file.php?id=abc

Pages I DO NOT WANT blocked:

www.mydomain.com/index.php
www.mydomain.com/file.html
etc.

I am thinking (hoping) this will work - at least for Google:

User-Agent: googlebot
Disallow: /?
Disallow: /*.php?

If not, any suggestions on how to handle the above scenario?
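One way to sanity-check the two rules above is to model the matching in a few lines of Python. This is only a sketch: it assumes the wildcard behavior Google describes for Googlebot ("*" matches any run of characters, a trailing "$" anchors the end of the URL, anything else is a prefix match), and rule_matches and is_blocked are hypothetical helper names, not part of any library. Other crawlers may not honor wildcards at all.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Googlebot-style Disallow pattern matches the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and let each "*" stand for any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # "$" requires the whole path to match; otherwise a prefix is enough.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None

def is_blocked(path: str, disallow_rules) -> bool:
    """Return True if any Disallow rule matches the path."""
    return any(rule_matches(rule, path) for rule in disallow_rules)

# The two rules proposed in the post above.
RULES = ["/?", "/*.php?"]

for path in ["/?=123", "/index.php?id=123", "/dir/file.php?id=abc",
             "/index.php", "/file.html"]:
    print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

Under those assumptions the first three paths come out blocked and the last two allowed, which is the split asked for. The same helper also shows the point made earlier in the thread: rule_matches("/*.php$", "/index.php") is true, while rule_matches("/*.php$", "/") is false.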

9:00 pm on May 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



what about this?

user-agent: googlebot
disallow: *.php?

index.php will not be blocked but index.php?.... will be blocked.

I'm not sure about this - just a suggestion

or in the previous case of wanting to disallow site.com?... urls

disallow:? (that's scary looking)

I wouldn't want to chance
disallow: /? because if a bot misinterprets the "?", then you are disallowing the root.

9:07 pm on May 17, 2005 (gmt 0)



"12. How do I tell Googlebot not to crawl dynamically generated pages on my site?"
[google.com...]

User-agent: Googlebot
Disallow: /*?

5:26 pm on May 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why not just do a 301 to the / or any directory you specifically exclude?

Never forget that there are other engines out there as well.

3:48 am on Jun 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wonder if "Disallow: /*?" actually means that they just won't follow URLs with query strings, rather than that they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?
6:57 am on Jun 11, 2005 (gmt 0)

10+ Year Member



Hi,

I'm also trying to avoid a dupe penalty. (Actually, I think I already incurred one and I'm trying to fix my site...)

I want to disallow the printer friendly version of my pages from Google.

Will this work?


User-agent: Googlebot
Disallow: /*,print.htm$

All the printer friendly urls end with ",print.htm"

Thanks!

6:47 pm on Jun 11, 2005 (gmt 0)



Billy Batson,
not sure if it will make a difference, but does the , (comma) have to be there?
Probably only GoogleGuy can give you a definite answer. Not sure how Gbot handles commas, if it does at all.
9:04 pm on Jun 11, 2005 (gmt 0)

10+ Year Member



Hi Walkman,

No, the comma doesn't have to be there, now that I think about it.

Will implement the comma-less code.

Thanks.

5:12 pm on Jun 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"I wonder if "Disallow: /*?" actually means that they just won't follow URLs with query strings, rather than that they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?"

robots.txt is based on prefix matching, meaning each rule is compared as a text string against the URL path, starting from the root.

disallow: /*?
means:
/ (the root must always be present)
* any text in between "/" and "?"
? if "?" appears anywhere within the url then it is disallowed

disallow: /*prnt
/ root
* any text string between "/" and "prnt"
prnt if "prnt" appears anywhere within the URL it is disallowed

without a wildcard, the URL path must start with the exact string
disallow: /prnt

/prnta.html is disallowed
/prnta/ directory is disallowed
/aprnt.html is allowed because it does not match "/prnt"
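The non-wildcard case can be tried out with Python's standard library. Treat this as a rough check rather than a model of Googlebot: as far as I know, urllib.robotparser applies rules as plain prefixes and does not implement the "*" extension, so it only covers the "disallow: /prnt" example. example.com is a placeholder host and allowed is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /prnt
"""

parser = RobotFileParser()
# parse() takes the robots.txt content as an iterable of lines.
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path: str) -> bool:
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch("Googlebot", "http://example.com" + path)

for path in ["/prnta.html", "/prnta/", "/aprnt.html"]:
    print(path, "allowed" if allowed(path) else "disallowed")
```

The first two paths start with "/prnt" and are reported as disallowed; "/aprnt.html" does not start with it and is allowed.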

 
