Frank_Rizzo

msg:3536342 | 12:14 pm on Dec 28, 2007 (gmt 0) |
Same here. Every crawl from ASK is 404'd for about two weeks now.
|
jdMorgan

msg:3536823 | 2:58 am on Dec 29, 2007 (gmt 0) |
I use something similar to the following .htaccess code to prevent abuse on Apache servers, but it also seems to put Teoma back on track after one of its errant "%20" requests:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-](\?[^\ ]*)?\ HTTP/ [NC]
RewriteCond %{DOCUMENT_ROOT}/%1 -f [OR]
RewriteCond %{DOCUMENT_ROOT}/%1 -d
RewriteRule .* http://www.example.com/%1 [R=301,L]
This basically allows the URL-path to contain only slashes and the characters 0-9, a-z, A-Z, periods, underscores and hyphens. If any other character is found in the URL-path, the URL-path is truncated at that point and, if the resulting URL resolves to an existing file or directory, a 301-Moved Permanently redirect to that truncated URL is invoked. The original query-string attached to the URL (if any) is retained. If you modify the [groups] in the pattern above, make sure that they match exactly, with the obvious exception of the "^" negation operator in the second group.
Jim
|
jdMorgan

msg:3536881 | 7:03 am on Dec 29, 2007 (gmt 0) |
Oops! I introduced some errors in the generalize-copy-and-paste, there. The first RewriteCond should read:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-\?\ ][^?\ ]*(\?[^\ ]*)?\ HTTP/ [NC]
Also, the most likely cause of this problem is search engine robots picking up on links posted in forums, where the poster or the forum software's auto-link routine has included the trailing space in the link. You'll also see this happening with a trailing period on the requested URL when the person posting the link puts a period at the end of it -- as in, "For more info, see http://example.com/widget.html." However, in this case the period is not hex-encoded, because it is a valid character to include in a URL, unlike a space. Jim
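The corrected condition can be tried outside Apache before it goes live. The following is only a rough Python emulation, not how mod_rewrite itself is implemented: the pattern is copied from the corrected RewriteCond, while `redirect_target`, the `exists` callback (standing in for the -f / -d checks) and the `site` contents are made up for illustration.

```python
import re

# Corrected THE_REQUEST pattern, copied as posted; [NC] becomes re.IGNORECASE.
REQUEST_RE = re.compile(
    r"^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-\?\ ][^?\ ]*(\?[^\ ]*)?\ HTTP/",
    re.IGNORECASE,
)


def redirect_target(request_line, exists, host="http://www.example.com"):
    """Return the 301 target for request_line, or None if no redirect applies.

    exists is a hypothetical stand-in for the -f / -d checks: a callable that
    takes the truncated path (without leading slash) and returns True/False.
    """
    match = REQUEST_RE.match(request_line)
    if match is None:
        return None  # the URL-path holds only allowed characters
    truncated, query = match.group(1), match.group(2) or ""
    if not exists(truncated):
        return None  # truncated path is not an existing file or directory
    return f"{host}/{truncated}{query}"


site = {"widget.html", "docs/"}  # made-up document root contents

print(redirect_target("GET /widget.html%20 HTTP/1.1", site.__contains__))
# → http://www.example.com/widget.html
print(redirect_target("GET /widget.html. HTTP/1.1", site.__contains__))
# → None
```

In this emulation the trailing-period request is left alone, because the period is inside the allowed character set and the pattern therefore never matches it.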
|
Marcia

msg:3536883 | 7:25 am on Dec 29, 2007 (gmt 0) |
To be honest, I'd try to use some kind of contact form to let Ask know, in hope that it'll get forwarded to their crawl people. Sometimes it works. I once had a problem on a site (server-side configuration error on the part of the host) that put Slurp into an endless loop; I wrote Yahoo and actually got a human reply back that was very nice, and they did pass it on. I was afraid of getting banned because of it, but they apparently dealt with it, no problems ensued. If there's a problem, it never hurts to try to communicate.
|
Marcia

msg:3536892 | 7:44 am on Dec 29, 2007 (gmt 0) |
Incidentally, this type of thing also happens with Yahoo, but you'll see /widgets?A=D and /widgets/?N=M (and various other combinations of letters, so many you can't keep up with them) appended to URLs when you do a site: search. And it does appear to have an adverse effect on listings. Jim, is it the same type of solution for that type of thing happening?
|
Frank_Rizzo

msg:3536995 | 12:44 pm on Dec 29, 2007 (gmt 0) |
Surely it is an Ask/Teoma problem rather than the message board, as other SEs are not having this problem. Besides, this is happening for every page on the site and not just pages linked to via the message board. I have filled in a feedback form but, like the OP, got no reply.
jdmorgan - does that piece of code stop all uses of % translations? I have a database lookup which translates spaces, such as:
/search_users.php?username=J%20Doe&widget=light%20blue
Would the script fail if I used your code?
[edited by: Frank_Rizzo at 12:47 pm (utc) on Dec. 29, 2007]
|
jdMorgan

msg:3537037 | 3:23 pm on Dec 29, 2007 (gmt 0) |
Frank:
"The original query-string attached to the URL (if any) is retained."
It will only remove %nn tails on the URL-path part. A query string is not part of the URL-path, but rather data attached to the URL to be passed to the resource at that URL.
Marcia: You can use a different code snippet to remove spurious query strings; it's been posted elsewhere on WebmasterWorld several times. But, as highlighted by Frank's question above, you do need to be sure that it won't break your site. :)
Jim
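Frank's case can be checked with a small Python sketch. It assumes the corrected pattern from earlier in the thread, and `path_would_be_truncated` is a made-up helper name; it only tests whether the pattern matches the request line, not the file and directory conditions.

```python
import re

# Corrected THE_REQUEST pattern from earlier in the thread; [NC] -> IGNORECASE.
REQUEST_RE = re.compile(
    r"^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-\?\ ][^?\ ]*(\?[^\ ]*)?\ HTTP/",
    re.IGNORECASE,
)


def path_would_be_truncated(request_line):
    """True if the pattern matches, i.e. the URL-path holds a disallowed character."""
    return REQUEST_RE.match(request_line) is not None


# %20 only inside the query string: the pattern does not match.
print(path_would_be_truncated(
    "GET /search_users.php?username=J%20Doe&widget=light%20blue HTTP/1.1"
))  # → False

# %20 tacked onto the URL-path itself: the pattern matches.
print(path_would_be_truncated(
    "GET /search_users.php%20?username=J%20Doe HTTP/1.1"
))  # → True
```

The first request is untouched because the character right after the allowed path characters is "?", which the second group explicitly excludes.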
|
Drreggae

msg:3537097 | 7:22 pm on Dec 29, 2007 (gmt 0) |
Indeed, same here; it's been crawling the entire *site* like that for a few weeks. No way all those 1000+ pages are linked from forums. Paranoid as I am, I have been thinking someone somehow managed to post a spoofed sitemap URL to them... but that would mean they screwed up big time... and if not, they also screwed up big time. It would be nice if the ask.com people read this forum.
|
vpathak

msg:3539757 | 6:49 pm on Jan 3, 2008 (gmt 0) |
Dear Drreggae,
We did experience a data error which caused us to crawl badly-formed URLs from a small number of sites. We identified the issue and corrected it on Dec 29th. Thanks for flagging and please let us know if you see any further problems.
Best regards,
Vivek Pathak
Infrastructure Product Manager
Ask.com
|
Drreggae

msg:3540262 | 11:20 am on Jan 4, 2008 (gmt 0) |
Thanks Vivek! I just came back here to post that the crawler has indeed been behaving normally again for a little under a week now. Now we hope to see the same traffic as before ;-) (sure, part of it is the holidays and the end of Q4 :-( ). Regards!
Btw, I'm going to ask this in another subforum, but is the crawler at 78.137.163.133, coming from digiweb.ie with "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1", by any chance a crawler from ask.com? I wonder because some Ask crawlers do come from .ie, if I am correct, and always use Minefield. This 78.137.163.133 is all over the web in visible access logs etc. (ip-78-137-163-133.dedi.digiweb.ie). Since it does not identify itself, I now block it via .htaccess, but I surely would not be doing that if I knew it was from ask.com.
[edited by: Drreggae at 11:27 am (utc) on Jan. 4, 2008]
|
incrediBILL

msg:3540723 | 9:56 pm on Jan 4, 2008 (gmt 0) |
It's not Ask but a dedicated server that's either scraping you or hosting a proxy which can be used to hijack your site in the SEs.
host 78.137.163.133
ip-78-137-163-133.dedi.digiweb.ie
whois -h whois.ripe.net 78.137.163.133
inetnum: 78.137.160.0 - 78.137.163.255
netname: DIGIWEB-HOSTING-NET
descr: Digiweb Hosting [3]
country: IE
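For anyone who wants to act on the whole hosting range rather than the single IP, the inetnum line can be turned into CIDR notation with Python's standard ipaddress module. This is just a sketch; `covering_blocks` is a made-up helper name, and whether blocking the entire range is wise is a separate judgment call.

```python
import ipaddress


def covering_blocks(first, last):
    """Return the CIDR blocks that exactly cover the inclusive range first..last."""
    return list(
        ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)
        )
    )


# Range taken from the inetnum line of the whois output.
blocks = covering_blocks("78.137.160.0", "78.137.163.255")
print(blocks)  # → [IPv4Network('78.137.160.0/22')]

# Confirm that the visitor's address falls inside the range.
visitor = ipaddress.ip_address("78.137.163.133")
print(any(visitor in block for block in blocks))  # → True
```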
|
incrediBILL

msg:3541721 | 9:07 pm on Jan 6, 2008 (gmt 0) |
While on the topic of ASK, they appear to be making screen shots via the Bloglines IP address so that little RSS feed reading IP is being used for double duty. Also, the same IP has been asking for robots.txt with a blank user agent. Very nice. Not. [edited by: incrediBILL at 9:09 pm (utc) on Jan. 6, 2008]
|
LunaC

msg:3574178 | 8:29 pm on Feb 13, 2008 (gmt 0) |
Careful with that .htaccess code. It works well, but it may redirect when you don't want it to (at least it did on my server, even after I removed all other rules to test). For example, if you have both of these on your server:
http://example.com/red/
http://example.com/red-widget/
When you link to http://example.com/red-widget/ it gets 301'd to http://example.com/red/
If you don't have http://example.com/red/ it works as expected.
|