Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
how to ban Google from indexing ALL .php files
I'm using rewrite and don't want a dupe penalty
walkman



 
Msg#: 622 posted 5:19 am on Apr 27, 2005 (gmt 0)

I want to make sure that I read this (see below) right. If I use:
User-agent: Googlebot
Disallow: /*.php$

Google will not index any pages that have .php as an extension, correct? If true, my index is index.php, but all my links are as mydomain.com/. Will I have crawling problems? Any ideas?

Thanks,

From [google.com...]
"To disallow a specific file type, simply modify the Disallow command in your robots.txt file. This works for all of the types of files Googlebot crawls, including HTML, GIFs and .docs. For example, to disallow Microsoft Word files with the ".doc" extension, you would add the following lines to your robots.txt file:

User-agent: Googlebot
Disallow: /*.doc$
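
To make the quoted rule concrete, here is a minimal Python sketch of how a pattern such as /*.doc$ or /*.php$ could be evaluated. It assumes the wildcard semantics Google describes: "*" matches any run of characters, a trailing "$" anchors the end of the URL, and everything else is matched literally from the start of the path. The function name rule_matches is made up for illustration; this is not Googlebot's actual code.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Googlebot-style Disallow pattern matches the URL path."""
    # A trailing "$" means the pattern must reach the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with "match anything".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the usual prefix behaviour.
    return re.match(regex, path) is not None

print(rule_matches("/*.php$", "/index.php"))       # True: ends in .php
print(rule_matches("/*.php$", "/"))                # False: the bare root has no .php
print(rule_matches("/*.php$", "/index.php?id=1"))  # False: "$" needs .php at the very end
```

Under these assumptions the rule blocks /index.php but says nothing about "/", which is the case the original question turns on.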

 

larryhatch

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 622 posted 5:24 am on Apr 27, 2005 (gmt 0)

I'd be worried about your index.php not getting crawled.
Are you sure that's what you want? -Larry

walkman



 
Msg#: 622 posted 5:40 am on Apr 27, 2005 (gmt 0)

"Are you sure that's what you want? "

Not really ;). I want Google to crawl my site. How does Google see the main site, as index.ext or as domain.com/? Does anyone know?

thanks for replying Larry,

Brett_Tabke

WebmasterWorld Administrator WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 622 posted 2:34 am on Apr 29, 2005 (gmt 0)

drop the slash and just use *.php

walkman



 
Msg#: 622 posted 4:30 am on Apr 29, 2005 (gmt 0)

Brett,
thanks for the reply. One concern is still out there:
will my home page get indexed? It's an "invisible" index.php. If you banned Google from indexing .htm pages, would WebmasterWorld get indexed when 100% of the links are to the root, not /index.ext?

jdMorgan

WebmasterWorld Senior Member WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 622 posted 4:39 am on Apr 29, 2005 (gmt 0)

Googlebot works using URLs - It has no visibility into the internal workings of your server.

example.com/ and example.com/index.php are two different URLs. If you disallow *.php, then /index.php is disallowed, but "/" is not disallowed.

If you're worried about it, then check "/" using the WebmasterWorld server headers checker, and make sure you get a 200-OK and not a redirect (301 or 302) to /index.php due to some "misimplementation."

Jim
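
For anyone who prefers to run the header check from a script, a rough Python sketch of the same idea follows. It requests "/" without following redirects and reports whether the answer is a 200 or a 301/302. The function names and the wording of the result strings are illustrative assumptions, not part of any WebmasterWorld tool.

```python
import http.client
from typing import Optional, Tuple

def classify_root_status(status: int) -> str:
    """Describe what a status code for "/" means for the scenario above."""
    if status == 200:
        return "ok: / is served directly"
    if status in (301, 302):
        return "redirect: / may be pointing at a disallowed URL"
    return "other: investigate"

def fetch_root_status(host: str) -> Tuple[int, Optional[str]]:
    """Send a HEAD request for "/" and return (status, Location header).

    http.client does not follow redirects, so a 301/302 is reported as-is.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

print(classify_root_status(200))
print(classify_root_status(302))
```

Calling fetch_root_status with your own host name returns the status and any Location header; if the location is /index.php while *.php is disallowed, the home page is effectively being pushed to a blocked URL.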

walkman



 
Msg#: 622 posted 6:53 am on Apr 29, 2005 (gmt 0)

thank you Jd,
It makes sense, I just wanted to make sure.
will check the headers that way.

kevsh

5+ Year Member



 
Msg#: 622 posted 8:44 pm on May 17, 2005 (gmt 0)

Okay, this seems to be along the lines of my issue so I'll post here instead of a new thread.

I have a bunch of incoming links starting with "?" as in:

www.mydomain.com/?=123
www.mydomain.com/?=abc

The query is handled by index.php. Problem is, Google seems to be seeing these as unique pages and not indexing any one of them (possibly flagging them as duplicate content?)

In any event, I want to block all incoming links beginning with "?" AND any PHP page with a "?" appended, only.

Examples of pages I WANT blocked:

www.mydomain.com/?=123
www.mydomain.com/index.php?id=123
www.mydomain.com/dir/file.php?id=abc

Pages I DO NOT WANT blocked:

www.mydomain.com/index.php
www.mydomain.com/file.html
etc.

I am thinking (hoping) this will work - at least for Google:

User-Agent: googlebot
Disallow: /?
Disallow: /*.php?

If not, any suggestions on how to handle the above scenario?
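
One way to sanity-check the two rules against the example URLs is a small Python sketch like the one below. It assumes Google's documented wildcard semantics ("*" matches any characters, a trailing "$" anchors the end, and "?" is an ordinary literal character) and ignores Allow lines and other crawlers. The helper names compile_rule and is_blocked are hypothetical.

```python
import re

def compile_rule(pattern):
    """Compile a wildcard Disallow pattern ("*" = anything, trailing "$" = end)."""
    end = "$" if pattern.endswith("$") else ""
    body = pattern[:-1] if end else pattern
    return re.compile(".*".join(map(re.escape, body.split("*"))) + end)

def is_blocked(rules, path):
    """True if any Disallow pattern matches the path (no Allow rules here)."""
    return any(compile_rule(rule).match(path) for rule in rules)

rules = ["/?", "/*.php?"]

want_blocked = ["/?=123", "/index.php?id=123", "/dir/file.php?id=abc"]
want_allowed = ["/index.php", "/file.html", "/"]

for path in want_blocked + want_allowed:
    print(f"{path:25} blocked={is_blocked(rules, path)}")
```

With those assumptions the three query-string URLs come out blocked and the plain pages, including the root, do not. Crawlers that do not support wildcards may read the second rule differently.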

Reid

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 622 posted 9:00 pm on May 17, 2005 (gmt 0)

what about this?

user-agent: googlebot
disallow: *.php?

index.php will not be blocked but index.php?.... will be blocked.

I'm not sure about this - just a suggestion

or in the previous case of wanting to disallow site.com?... urls

disallow:? (that's scary looking)

I wouldn't want to chance
disallow: /? because if it misinterprets the "?" then you are disallowing the root.

walkman



 
Msg#: 622 posted 9:07 pm on May 17, 2005 (gmt 0)

"12. How do I tell Googlebot not to crawl dynamically generated pages on my site?"
[google.com...]

User-agent: Googlebot
Disallow: /*?

ThomasB

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 622 posted 5:26 pm on May 18, 2005 (gmt 0)

Why not just do a 301 to the / or any directory you specifically exclude?

Never forget that there are other engines out there as well.

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 622 posted 3:48 am on Jun 2, 2005 (gmt 0)

I wonder if "Disallow: /*?" actually means that they just won't follow URL's with query strings, rather than they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?

Billy Batson

5+ Year Member



 
Msg#: 622 posted 6:57 am on Jun 11, 2005 (gmt 0)

Hi,

I'm also trying to avoid a dupe penalty. (Actually, I think I already incurred one and I'm trying to fix my site...)

I want to disallow the printer friendly version of my pages from Google.

Will this work?


User-agent: Googlebot
Disallow: /*,print.htm$

All the printer friendly urls end with ",print.htm"

Thanks!

walkman



 
Msg#: 622 posted 6:47 pm on Jun 11, 2005 (gmt 0)

Billy Batson,
not sure if it will make a difference, but does the , (comma) have to be there?
Probably only GoogleGuy can give you a definite answer. Not sure how Gbot handles commas, if it does at all.

Billy Batson

5+ Year Member



 
Msg#: 622 posted 9:04 pm on Jun 11, 2005 (gmt 0)

Hi Walkman,

No, the comma doesn't have to be there, now that I think about it.

Will implement the comma-less code.

Thanks.

Reid

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 622 posted 5:12 pm on Jun 15, 2005 (gmt 0)

"I wonder if "Disallow: /*?" actually means that they just won't follow URL's with query strings, rather than they won't spider .php, .asp etc.
I know what their FAQ says, but is it accurate?"

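robots.txt is based on prefix matching, meaning each Disallow value is compared against the start of the URL path. With Googlebot's wildcard extension, * stands for any run of characters, so a wildcard pattern can match text anywhere within a URL.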

disallow: /*?
means:
/(the root must always be present)
* any text in between "/" and "?"
? if "?" appears anywhere within the url then it is disallowed

disallow: /*prnt
/ root
* any text string between "/" and "prnt"
prnt if "prnt" appears anywhere within the URL it is disallowed

a non-wildcard pattern must match exactly from the start of the path
disallow: /prnt

/prnta.html is disallowed
/prnta/ directory is disallowed
/aprnt.html is allowed because it does not match "/prnt"
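
The non-wildcard case above comes down to a plain "starts with" test. A tiny Python sketch, with a made-up function name, reproduces the three examples:

```python
def disallowed_by_prefix(rule: str, path: str) -> bool:
    """Plain robots.txt matching: the rule must match the start of the path."""
    return path.startswith(rule)

for path in ["/prnta.html", "/prnta/", "/aprnt.html"]:
    print(path, disallowed_by_prefix("/prnt", path))
# /prnta.html True
# /prnta/ True
# /aprnt.html False
```

This covers only rules without "*" or "$"; wildcard rules need the extra pattern handling that Googlebot adds on top of the basic standard.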


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved