how to ban Google from indexing ALL .php files

I'm using rewrite and don't want a dupe penalty

     
5:19 am on Apr 27, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


I want to make sure that I read this (see below) right. If I use:
User-agent: Googlebot
Disallow: /*.php$

Google will not index any pages that have .php as an extension, correct? If so: my index page is index.php, but all my links point to mydomain.com/. Will I have crawling problems? Any ideas?

Thanks,

From [google.com...]
"To disallow a specific file type,simply modify the Disallow command in your robots.txt file. This works for all of the types of files Googlebot crawls,including HTML, GIFs and .docs. For example, to disallow Microsoft Word files with the ".doc" extension, you would add the following lines to your robots.txt file:

User-agent: Googlebot
Disallow: /*.doc$
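
For anyone who wants to see that pattern spelled out, here is a rough Python sketch of Google-style wildcard matching (a simplified model for illustration, not Google's actual code) showing that /*.php$ catches /index.php but leaves "/" alone:

import re

def google_style_match(pattern, path):
    # Simplified model of a Googlebot Disallow pattern:
    # '*' matches any run of characters, a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(google_style_match("/*.php$", "/"))                # False - the root is not blocked
print(google_style_match("/*.php$", "/index.php"))       # True  - any URL ending in .php is blocked
print(google_style_match("/*.php$", "/index.php?id=1"))  # False - the $ stops it matching query-string URLs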

5:24 am on Apr 27, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 13, 2004
posts:1425
votes: 0


I'd be worried about your index.php not getting crawled.
Are you sure that's what you want? -Larry

5:40 am on Apr 27, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


"Are you sure that's what you want? "

Not really ;). I want Google to crawl my site. How does Google see the main site, as index.ext or as domain.com/? Does anyone know?

Thanks for replying, Larry.

2:34 am on Apr 29, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38048
votes: 12


drop the slash and just use *.php
4:30 am on Apr 29, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


Brett,
thanks for the reply. One concern is still out there: will my home page get indexed? It's an "invisible" index.php. If you banned Google from indexing .htm pages, would WebmasterWorld still get indexed when 100% of the links point to the root rather than to /index.ext?
4:39 am on Apr 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Googlebot works using URLs; it has no visibility into the internal workings of your server.

example.com/ and example.com/index.php are two different URLs. If you disallow *.php, then /index.php is disallowed, but "/" is not disallowed.

If you're worried about it, then check "/" using the WebmasterWorld server headers checker, and make sure you get a 200-OK and not a redirect (301 or 302) to /index.php due to some "misimplementation."

Jim
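
If it helps, here's one quick way to run the same check yourself - a minimal Python sketch using the requests library (an assumption on my part; any HTTP client that shows the raw status code will do). example.com is a placeholder for your own domain:

import requests

# Fetch "/" without following redirects, so we see the server's first answer.
response = requests.get("http://www.example.com/", allow_redirects=False)

print(response.status_code)               # want 200, not 301 or 302
print(response.headers.get("Location"))   # should be None; a URL here means "/" redirects (e.g. to /index.php)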

6:53 am on Apr 29, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


Thank you, Jd.
It makes sense; I just wanted to make sure.
I will check the headers that way.
8:44 pm on May 17, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 7, 2005
posts:45
votes: 0


Okay, this seems to be along the lines of my issue so I'll post here instead of a new thread.

I have a bunch of incoming links starting with "?" as in:

www.mydomain.com/?=123
www.mydomain.com/?=abc

The query is handled by index.php. The problem is, Google seems to be seeing these as unique pages and not indexing any of them (possibly flagging them as duplicate content?).

In any event, I want to block only incoming links beginning with "?" AND any PHP page with a "?" appended.

Examples of pages I WANT blocked:

www.mydomain.com/?=123
www.mydomain.com/index.php?id=123
www.mydomain.com/dir/file.php?id=abc

Pages I DO NOT WANT blocked:

www.mydomain.com/index.php
www.mydomain.com/file.html
etc.

I am thinking (hoping) this will work - at least for Google:

User-Agent: googlebot
Disallow: /?
Disallow: /*.php?

If not, any suggestions on how to handle the above scenario?
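
For what it's worth, here is a quick Python sanity check of those two rules against the example URLs, using a simplified model of Google's wildcard matching (illustration only, not an official tester):

import re

def blocked_by(pattern, path):
    # Simplified Google-style matching: '*' matches any characters, and with no
    # trailing '$' the pattern only has to match the beginning of the URL path.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

rules = ["/?", "/*.php?"]
urls = ["/?=123", "/index.php?id=123", "/dir/file.php?id=abc",   # should be blocked
        "/index.php", "/file.html"]                              # should stay crawlable
for url in urls:
    verdict = "blocked" if any(blocked_by(rule, url) for rule in rules) else "allowed"
    print(url, verdict)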

9:00 pm on May 17, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 16, 2004
posts:693
votes: 0


What about this?

user-agent: googlebot
disallow: *.php?

index.php will not be blocked, but index.php?... will be blocked.

I'm not sure about this - just a suggestion.

Or, in the previous case of wanting to disallow site.com?... URLs:

disallow: ? (that's scary looking)

I wouldn't want to chance
disallow: /?
because if it misinterprets the "?" then you are disallowing the root.

9:07 pm on May 17, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


"12. How do I tell Googlebot not to crawl dynamically generated pages on my site?"
[google.com...]

User-agent: Googlebot
Disallow: /*?

5:26 pm on May 18, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 5, 2002
posts:1562
votes: 0


Why not just do a 301 to the / or any directory you specifically exclude?

Never forget that there are other engines out there as well.

3:48 am on June 2, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 31, 2005
posts:1108
votes: 0


I wonder if "Disallow: /*?" actually means that they just won't follow URL's with query strings, rather than they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?
6:57 am on June 11, 2005 (gmt 0)

New User

10+ Year Member

joined:May 30, 2005
posts:21
votes: 0


Hi,

I'm also trying to avoid a dupe penalty. (Actually, I think I already incurred one and I'm trying to fix my site...)

I want to disallow the printer friendly version of my pages from Google.

Will this work?


User-agent: Googlebot
Disallow: /*,print.htm$

All the printer-friendly URLs end with ",print.htm".

Thanks!

6:47 pm on June 11, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


Billy Batson,
not sure if it will make a difference, but does the , (comma) have to be there?
Probably only GoogleGuy can give you a definite answer. Not sure how Gbot handles commas, if it does at all.
9:04 pm on June 11, 2005 (gmt 0)

New User

10+ Year Member

joined:May 30, 2005
posts:21
votes: 0


Hi Walkman,

No, the comma doesn't have to be there, now that I think about it.

Will implement the comma-less code.

Thanks.
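
For what it's worth, under the same simplified model of Google's wildcard matching used above (my own illustration, not Google's code), the comma is just an ordinary character, so both versions should catch the print pages - the comma simply makes the rule a little more specific:

import re

def google_style_match(pattern, path):
    # '*' matches any run of characters, a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

for pattern in ["/*,print.htm$", "/*print.htm$"]:
    print(pattern)
    print(" ", google_style_match(pattern, "/widget,print.htm"))  # True for both - print version blocked
    print(" ", google_style_match(pattern, "/widget.htm"))        # False for both - normal page still crawlable
    print(" ", google_style_match(pattern, "/newsprint.htm"))     # False with the comma, True without it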

5:12 pm on June 15, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 16, 2004
posts:693
votes: 0


I wonder if "Disallow: /*?" actually means that they just won't follow URL's with query strings, rather than they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?

robots.txt is based on prefix-matching, meaning it is only looking for text-strings within URL's.

disallow: /*?
means:
/(the root must always be present)
* any text inbetween "/" and "?"
? if "?" appears anywhere within the url then it is disallowed

disallow: /*prnt
/ root
* any text string between "/" and "prnt"
prnt if "prnt" appears anywhere within the URL it is disallowed

non-wildcard must be an exact match
disallow: /prnt

/prnta.html is disallowed
/prnta/ directory is disallowed
/aprnt.html is allowed because it does not match "/prnt"
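
To put the same thing in code, here is a rough sketch of the plain prefix rule versus the wildcard rule (again, a simplified model of my own, not Google's implementation):

import re

def matches(pattern, path):
    # Plain rules are a simple "does the URL start with this?" prefix test;
    # '*' (Google's extension) is treated as "any run of characters".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

# disallow: /prnt  (no wildcard - exact prefix match)
print(matches("/prnt", "/prnta.html"))   # True  - disallowed
print(matches("/prnt", "/prnta/"))       # True  - the whole directory is disallowed
print(matches("/prnt", "/aprnt.html"))   # False - allowed, does not start with /prnt

# disallow: /*prnt  (wildcard - "prnt" anywhere after the root)
print(matches("/*prnt", "/aprnt.html"))  # True  - now disallowed

# disallow: /*?  ("?" anywhere in the URL)
print(matches("/*?", "/index.php?id=1")) # True  - query-string URLs are disallowed
print(matches("/*?", "/index.php"))      # False - plain .php page still allowed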