| 2:20 pm on Jan 12, 2003 (gmt 0)|
Two things immediately spring to mind:
- Has /cgi-bin/ been excluded via robots.txt? If this isn't your site, someone could create a link-out script and then block spiders from the cgi-bin directory, either for a genuine reason (e.g. it affects their tracking stats) or in an attempt to weasel their way out of giving a "real" link.
- Are the URLs overly long? Lots of data in a query string will often deter a search engine from crawling a URL, mostly for fear of hitting infinitely dynamic pages. That said, there are examples of people who say that Google did crawl similar links, so you might be safe...
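For the first check, here is a rough sketch of how you could test whether a link-out URL falls under a robots.txt exclusion, using Python's standard `urllib.robotparser`. The robots.txt content, the `is_crawlable` helper and the `links.html` URL are made up for illustration; the real site's file may look quite different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; substitute the real file from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    linkout = "http://www.example.com/cgi-bin/linkout?http://test.example.com"
    plain = "http://www.example.com/links.html"
    print(linkout, is_crawlable(ROBOTS_TXT, linkout))
    print(plain, is_crawlable(ROBOTS_TXT, plain))
```

This only tells you what a well-behaved spider is permitted to request. It says nothing about whether a particular engine will actually crawl or credit the link.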
| 11:19 am on Jan 14, 2003 (gmt 0)|
No, not all bots will follow the link, but the most important one (Google) will. Even if the link is routed through a cgi-bin script, it will get read by the sheer fact that it is in the URL. It doesn't necessarily have to "click" the link to "follow" the link (unless you encode the link).
| 11:58 am on Jan 14, 2003 (gmt 0)|
|It doesn't necessarily have to "click" the link to "follow" the link (unless you encode the link). |
David was talking about using the cgi-bin directory for links, specifically using a CGI script to link out to others. Although the link-through looks obvious enough, it is realistically the same as using an encoded value.
Now unless I'm missing something...
Although a spider may think that "www.example.com/cgi-bin/linkout?http://test.example.com" would link to "test.example.com", it can *never* be sure without requesting that script, because it's dealing with a dynamic server-side script and has no idea what the code behind that script actually does.
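To make the distinction concrete, here is a minimal sketch of the kind of guess a spider could make from the URL text alone, without ever requesting the script. The `guess_target` function and the `id=42` URL are hypothetical, and nothing here reflects how any real search engine works; it only shows that the target is readable when it sits in the query string and invisible when it is encoded.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def guess_target(link: str):
    """Guess the destination of a link-out URL from its query string.

    This is only a guess: the script behind the URL may send the visitor
    somewhere else entirely, and only a real request would reveal that.
    """
    query = urlsplit(link).query
    # Candidate 1: the whole query string is the target (linkout?http://...).
    candidates = [unquote(query)]
    # Candidate 2: the target is the value of some parameter (?name=http://...).
    candidates += [value for _, value in parse_qsl(query)]
    for candidate in candidates:
        if candidate.startswith(("http://", "https://")):
            return candidate
    return None  # Encoded or opaque link: nothing to read from the URL.


if __name__ == "__main__":
    print(guess_target("http://www.example.com/cgi-bin/linkout?http://test.example.com"))
    # Hypothetical encoded variant where the destination is looked up server-side.
    print(guess_target("http://www.example.com/cgi-bin/linkout?id=42"))
```

The first call yields a plausible target, the second yields nothing, which is the "encoded value" case: without fetching the script, the spider has no destination to note.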
| 11:59 am on Jan 14, 2003 (gmt 0)|
OK, clarification: even if cgi-bin is blocked by robots.txt, Google will read the following link to bar.com:
<a href="http://foo.com/cgi-bin?redirect=http://bar.com">Foo and Bar</a>
| 2:01 pm on Jan 14, 2003 (gmt 0)|
|Even if cgi-bin is blocked by robots.txt, Google will read the following link to bar.com |
I agree it will read/note the link and potentially include it in the index, but are you saying it will access the redirecting script (which it thinks points to "bar.com"), violating robots.txt in the process?