Google will not index any pages that have .php as an extension, correct? If true, my index page is index.php, but all my links point to mydomain.com/. Will I have crawling problems? Any ideas?
"To disallow a specific file type, simply modify the Disallow command in your robots.txt file. This works for all of the types of files Googlebot crawls, including HTML, GIFs and .docs. For example, to disallow Microsoft Word files with the ".doc" extension, you would add the following lines to your robots.txt file:
example.com/ and example.com/index.php are two different URLs. If you disallow *.php, then /index.php is disallowed, but "/" is not disallowed.
If you're worried about it, then check "/" using the WebmasterWorld server headers checker, and make sure you get a 200-OK and not a redirect (301 or 302) to /index.php due to some "misimplementation."
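If you would rather script that header check than use a web tool, here is a minimal Python sketch. It assumes only the standard library; the function names describe_response and check_root are made up for illustration, and example.com is a placeholder host. http.client never follows redirects on its own, so a 301 or 302 shows up as-is.

```python
import http.client


def describe_response(status, location=None):
    """Summarize what the status code returned for "/" means for crawling."""
    if status == 200:
        return "200 OK: '/' is served directly"
    if status in (301, 302):
        return f"{status} redirect to {location}: '/' depends on the target URL"
    return f"{status}: unexpected, worth investigating"


def check_root(host, use_https=True):
    """Send one HEAD request for "/" and report the result without following redirects."""
    conn_cls = http.client.HTTPSConnection if use_https else http.client.HTTPConnection
    conn = conn_cls(host, timeout=10)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        return describe_response(resp.status, resp.getheader("Location"))
    finally:
        conn.close()


# Usage (needs network access; the host is a placeholder):
# print(check_root("example.com"))
```

If the report shows a redirect to /index.php, then a rule that disallows *.php would end up affecting the home page as well. Some servers answer HEAD differently from GET, so treat the result as a first look only.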
I have a bunch of incoming links starting with "?" as in:
The query is handled by index.php. Problem is, Google seems to be seeing these as unique pages and not indexing any one of them (possibly flagging them as duplicate content?)
In any event, I want to block all incoming links beginning with "?" AND any PHP page with a "?" appended, only.
Examples of pages I WANT blocked:
Pages I DO NOT WANT blocked:
I am thinking (hoping) this will work - at least for Google:
If not, any suggestions on how to handle the above scenario?
index.php will not be blocked but index.php?.... will be blocked.
I'm not sure about this - just a suggestion
or in the previous case of wanting to disallow site.com?... urls
disallow:? (that's scary looking)
I wouldn't want to chance
disallow: /? because if the crawler misinterprets the "?", then you are disallowing the root.
I'm also trying to avoid a dupe penalty. (Actually, I think I already incurred one and I'm trying to fix my site...)
I want to disallow the printer friendly version of my pages from Google.
Will this work?
All the printer friendly urls end with ",print.htm"
I wonder if "Disallow: /*?" actually means that they just won't follow URLs with query strings, rather than that they won't spider .php, .asp, etc.
I know what their FAQ says, but is it accurate?
robots.txt is based on prefix matching, meaning a rule matches any URL whose path begins with the rule's text string.
/ (the root must always be present)
* any text in between "/" and "?"
? if "?" appears anywhere within the URL, then it is disallowed
* any text string between "/" and "prnt"
prnt if "prnt" appears anywhere within the URL, it is disallowed
Without a wildcard, the URL must begin with the rule text exactly:
/prnta.html is disallowed
/prnta/ directory is disallowed
/aprnt.html is allowed because it does not match "/prnt"
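To check rules like these before putting them live, here is a rough Python sketch of the matching described above. It assumes the Google-style extensions ("*" matches any run of characters, a trailing "$" anchors the end of the URL) on top of plain prefix matching. The names rule_to_regex and is_disallowed are made up for this example, and this is not how any particular crawler is actually implemented.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow path into a regex (assumed Google-style semantics)."""
    anchored = rule.endswith("$")  # a trailing "$" means "URL must end here"
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(rule, path):
    """Return True if the path (plus query string) is blocked by the rule."""
    if not rule:
        return False  # an empty Disallow line blocks nothing
    # re.match only matches at the start of the string, which gives prefix matching.
    return rule_to_regex(rule).match(path) is not None


print(is_disallowed("/prnt", "/prnta.html"))       # True
print(is_disallowed("/prnt", "/aprnt.html"))       # False
print(is_disallowed("/*prnt", "/aprnt.html"))      # True
print(is_disallowed("/*?", "/index.php?page=2"))   # True
print(is_disallowed("/*?", "/index.php"))          # False
```

Other crawlers may ignore the wildcard characters entirely, so the plain prefix rules are the safer bet if you need them honored everywhere.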