larryhatch

msg:1526496 | 5:24 am on Apr 27, 2005 (gmt 0) |
I'd be worried about your index.php not getting crawled. Are you sure that's what you want? -Larry
|
walkman

msg:1526497 | 5:40 am on Apr 27, 2005 (gmt 0) |
"Are you sure that's what you want? " Not really ;). I want Google to crawl my site. How does Google see the home page: as index.ext or as domain.com/? Does anyone know? Thanks for replying, Larry.
|
Brett_Tabke

msg:1526498 | 2:34 am on Apr 29, 2005 (gmt 0) |
drop the slash and just use *.php
|
walkman

msg:1526499 | 4:30 am on Apr 29, 2005 (gmt 0) |
Brett, thanks for the reply. One concern is still out there: will my home page get indexed? It's an "invisible" index.php. If you banned Google from indexing .htm pages, would WebmasterWorld get indexed when 100% of the links are to the root, not /index.ext?
|
jdMorgan

msg:1526500 | 4:39 am on Apr 29, 2005 (gmt 0) |
Googlebot works using URLs - It has no visibility into the internal workings of your server. example.com/ and example.com/index.php are two different URLs. If you disallow *.php, then /index.php is disallowed, but "/" is not disallowed. If you're worried about it, then check "/" using the WebmasterWorld server headers checker, and make sure you get a 200-OK and not a redirect (301 or 302) to /index.php due to some "misimplementation." Jim
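
A rough way to run the same check yourself, as a minimal Python sketch (an editorial illustration, not the WebmasterWorld tool). The helper names fetch_root_status and describe_root are made up for this example, and it assumes the site answers over plain HTTP. http.client does not follow redirects, so a 301 or 302 on "/" shows up as-is.

```python
import http.client

def fetch_root_status(host, path="/", timeout=10):
    """Send a HEAD request and return (status, Location header or None).

    Hypothetical helper: assumes plain HTTP on port 80.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def describe_root(status, location=None):
    """Summarise what the response for "/" means for crawling."""
    if status == 200:
        return "OK: '/' answers directly with 200"
    if status in (301, 302):
        return f"redirect ({status}) to {location}"
    return f"unexpected status {status}"
```

Usage would look like describe_root(*fetch_root_status("www.example.com")). If the result mentions a redirect to /index.php, then "/" depends on a URL that a *.php rule disallows.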
|
walkman

msg:1526501 | 6:53 am on Apr 29, 2005 (gmt 0) |
Thank you, Jd. It makes sense; I just wanted to make sure. Will check the headers that way.
|
kevsh

msg:1526502 | 8:44 pm on May 17, 2005 (gmt 0) |
Okay, this seems to be along the lines of my issue so I'll post here instead of starting a new thread. I have a bunch of incoming links starting with "?" as in:

www.mydomain.com/?=123
www.mydomain.com/?=abc

The query is handled by index.php. Problem is, Google seems to be seeing these as unique pages and not indexing any one of them (possibly flagging them as duplicate content?). In any event, I want to block all incoming links beginning with "?" AND any PHP page with a "?" appended, only.

Examples of pages I WANT blocked:

www.mydomain.com/?=123
www.mydomain.com/index.php?id=123
www.mydomain.com/dir/file.php?id=abc

Pages I DO NOT WANT blocked:

www.mydomain.com/index.php
www.mydomain.com/file.html
etc.

I am thinking (hoping) this will work - at least for Google:

User-Agent: googlebot
Disallow: /?
Disallow: /*.php?

If not, any suggestions on how to handle the above scenario?
|
Reid

msg:1526503 | 9:00 pm on May 17, 2005 (gmt 0) |
What about this?

user-agent: googlebot
disallow: *.php?

index.php will not be blocked, but index.php?.... will be blocked. I'm not sure about this - just a suggestion.

Or, in the previous case of wanting to disallow site.com?... URLs:

disallow: ?

(that's scary looking)

I wouldn't want to chance "disallow: /?" because if it misinterprets the "?" then you are disallowing the root.
|
walkman

msg:1526504 | 9:07 pm on May 17, 2005 (gmt 0) |
"12. How do I tell Googlebot not to crawl dynamically generated pages on my site?" [google.com...]

User-agent: Googlebot
Disallow: /*?
|
ThomasB

msg:1526505 | 5:26 pm on May 18, 2005 (gmt 0) |
Why not just do a 301 to the / or any directory you specifically exclude? Never forget that there are other engines out there as well.
|
Dijkgraaf

msg:1526506 | 3:48 am on Jun 2, 2005 (gmt 0) |
I wonder if "Disallow: /*?" actually means that they just won't follow URL's with query strings, rather than they won't spider .php, .asp etc. I know what their FAQ says, but is it accurate?
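
One way to probe this question is to model the wildcard matching the way the FAQ describes it and try the rules from this thread against sample paths. The Python below is a rough sketch under that assumption: rule_to_regex and is_blocked are names invented for illustration, only Disallow lines are modelled (no Allow precedence), and it says nothing certain about what Googlebot actually does internally.

```python
import re

def rule_to_regex(rule):
    """Translate a Disallow path into a regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is literal. Matching
    always starts at the beginning of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_blocked(path, disallow_rules):
    """True if any rule matches the path from its first character."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

faq_rule = ["/*?"]               # rule quoted from the FAQ
narrow_rules = ["/?", "/*.php?"]  # kevsh's proposed pair

for path in ["/?=123", "/index.php?id=123", "/dir/file.php?id=abc",
             "/index.php", "/file.html", "/file.html?x=1"]:
    print(path, is_blocked(path, faq_rule), is_blocked(path, narrow_rules))
```

Under this reading, "/*?" keys on the "?" and not on the file extension: it blocks /file.html?x=1 as well as the .php query URLs, while the "/?" plus "/*.php?" pair leaves /file.html?x=1 crawlable. Both rule sets leave /index.php and /file.html alone.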
|
Billy Batson

msg:1526507 | 6:57 am on Jun 11, 2005 (gmt 0) |
Hi, I'm also trying to avoid a dupe penalty. (Actually, I think I already incurred one and I'm trying to fix my site...) I want to disallow the printer-friendly versions of my pages from Google. Will this work?

User-agent: Googlebot
Disallow: /*,print.htm$

All the printer-friendly URLs end with ",print.htm". Thanks!
|
walkman

msg:1526508 | 6:47 pm on Jun 11, 2005 (gmt 0) |
Billy Batson, not sure if it will make a difference, but does the , (comma) have to be there? Probably only GoogleGuy can give you a definite answer. Not sure how Gbot handles commas, if it does at all.
|
Billy Batson

msg:1526509 | 9:04 pm on Jun 11, 2005 (gmt 0) |
Hi Walkman, No, the comma doesn't have to be there, now that I think about it. Will implement the comma-less code. Thanks.
|
Reid

msg:1526510 | 5:12 pm on Jun 15, 2005 (gmt 0) |
"I wonder if "Disallow: /*?" actually means that they just won't follow URL's with query strings, rather than they won't spider .php, .asp etc. I know what their FAQ says, but is it accurate?"

robots.txt is based on prefix matching: each rule is a text string that is compared against the URL path, starting from the root.

disallow: /*?
means:
/ (the root must always be present)
* any text in between "/" and "?"
? if "?" appears anywhere within the URL then it is disallowed

disallow: /*prnt
means:
/ root
* any text string between "/" and "prnt"
prnt if "prnt" appears anywhere within the URL then it is disallowed

A non-wildcard rule must match exactly from the start of the path:

disallow: /prnt
/prnta.html is disallowed
/prnta/ directory is disallowed
/aprnt.html is allowed because it does not start with "/prnt"
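
The non-wildcard case can be checked with Python's standard-library parser, which does plain prefix matching (as far as I know it does not interpret the "*" wildcard extension, so it is only useful for rules like the last one). A small sketch; the allowed helper is just a convenience name for this example:

```python
from urllib.robotparser import RobotFileParser

# Feed the rules in directly instead of fetching a live robots.txt.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /prnt",
])

def allowed(path):
    """True if the plain prefix rule above lets a crawler fetch the path."""
    return rules.can_fetch("Googlebot", path)

for path in ["/prnta.html", "/prnta/", "/aprnt.html"]:
    print(path, "allowed" if allowed(path) else "disallowed")
```

The first two paths start with "/prnt" and come back disallowed; /aprnt.html does not start with it and stays allowed.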
|