Entire site "mis"-crawled with appended %20 codes

   
12:11 pm on Dec 28, 2007 (gmt 0)

10+ Year Member



For a couple of weeks now, the ask.com crawler,
crawler100.ask.com, has been crawling our entire website with
url-encoded spaces (%20) appended to each URL, resulting
in our Apache server telling it "File does not exist".

Since then we have seen our ask.com traffic gradually fading away.

I emailed them last week, but so far no reply.

Has anyone else seen this, or have an explanation for it?

12:14 pm on Dec 28, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Same here. Every crawl from ASK has been 404'd for about two weeks now.

2:58 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I use something similar to the following .htaccess code to prevent abuse on Apache servers, but it also seems to put Teoma back on track after one of its errant "%20" requests:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-](\?[^\ ]*)?\ HTTP/ [NC]
RewriteCond %{DOCUMENT_ROOT}/%1 -f [OR]
RewriteCond %{DOCUMENT_ROOT}/%1 -d
RewriteRule .* http://www.example.com/%1 [R=301,L]

This basically allows the URL-path to contain only the characters 0-9, a-z, A-Z, slashes, periods, underscores and hyphens. If any other character is found in the URL-path, the URL-path is truncated at that point, and --if the resulting URL resolves to an existing file or directory-- a 301-Moved Permanently redirect to that truncated URL is invoked.

The original query-string attached to the URL (if any) is retained.

If you modify the [groups] in the pattern above, make sure that they match exactly -- with the obvious exception of the "^" negation operator in the second group.

Jim

7:03 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Oops! I introduced some errors in the generalize-copy-and-paste, there. The first RewriteCond should read:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-\?\ ][^?\ ]*(\?[^\ ]*)?\ HTTP/ [NC]

Also, the most likely cause of this problem is search engine robots picking up on links posted in forums, where the poster or the forum software's auto-link routine has included the trailing space in the link.

You'll also see this happening with a trailing period on the requested URL when the person posting the link puts a period at the end of it -- as in, "For more info, see http://example.com/widget.html." However, in this case the period is not hex-encoded, because it is a valid character to include in a URL, unlike a space.
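
To see what the corrected pattern does to a given request line, here is a rough Python sketch. It is only an approximation: the pattern is transcribed for Python's re module (escaped spaces written as plain spaces, [NC] replaced by re.IGNORECASE), the function name truncated_path is made up for illustration, and the -f / -d file-existence checks from the other RewriteCond lines are not modelled. Apache's regex engine may also differ from Python's in details.

```python
import re

# Corrected pattern from the post above, transcribed for Python (assumption:
# Apache's "\ " is a literal space and [NC] maps to re.IGNORECASE).
PATTERN = re.compile(
    r"^[A-Z]{3,9} /([/0-9a-z._\-]*)[^/0-9a-z._\-? ][^? ]*(\?[^ ]*)? HTTP/",
    re.IGNORECASE,
)


def truncated_path(request_line):
    """Return the truncated URL-path the rule would redirect to, or None.

    Hypothetical helper: it only evaluates the first RewriteCond and does not
    check whether the truncated path exists as a file or directory.
    """
    match = PATTERN.match(request_line)
    if match is None:
        return None
    return "/" + match.group(1)


if __name__ == "__main__":
    for line in (
        "GET /widget.html%20 HTTP/1.1",        # trailing encoded space
        "GET /dir/page.html%20?id=5 HTTP/1.1",  # same, with a query string
        "GET /widget.html HTTP/1.1",            # clean request
        "GET /widget.html. HTTP/1.1",           # trailing period
    ):
        print(line, "->", truncated_path(line))
```

Note that the trailing-period request is not matched, since a period is one of the allowed characters, so this rule would not clean up that second kind of bad link.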

Jim

7:25 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



To be honest, I'd try to use some kind of contact form to let Ask know, in hope that it'll get forwarded to their crawl people. Sometimes it works.

I once had a problem on a site (server-side configuration error on the part of the host) that put Slurp into an endless loop; I wrote Yahoo and actually got a human reply back that was very nice, and they did pass it on. I was afraid of getting banned because of it, but they apparently dealt with it, no problems ensued.

If there's a problem, it never hurts to try to communicate.

7:44 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Incidentally, this type of thing also happens with Yahoo, but you'll see /widgets?A=D and /widgets/?N=M (and various other combinations of letters, so many you can't keep up with them) appended to URLs when you do a site: search. And it does appear to have an adverse effect on listings.

Jim, is it the same type of solution for that type of thing happening?

12:44 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Surely it is an Ask/Teoma problem rather than the message board as other SE's are not having this problem.

Besides, this is happening for every page on the site and not just pages linked to via the message board.

I have filled in a feedback form but, like the OP, have had no reply.

jdmorgan - does that piece of code stop all uses of % translations?

I have a database look up which translates spaces such as:

/search_users.php?username=J%20Doe&widget=light%20blue

would the script fail if I used your code?

[edited by: Frank_Rizzo at 12:47 pm (utc) on Dec. 29, 2007]

3:23 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Frank:
The original query-string attached to the URL (if any) is retained.
It will only remove %nn tails on the URL-path part. A query string is not part of the URL-path, but rather data attached to the URL to be passed to the resource at that URL.
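
As a quick illustration of the path/query split, here is a small Python sketch using the standard urllib.parse.urlsplit on Frank's example URL. The helper name path_is_clean and the ALLOWED_PATH pattern are assumptions made for this example; they mirror the allowed character set from the rule but are not the rule itself.

```python
import re
from urllib.parse import urlsplit

# Characters the rule allows in the URL-path (assumed equivalent to the
# first character class in the RewriteCond pattern, case-insensitive).
ALLOWED_PATH = re.compile(r"[/0-9a-z._\-]*", re.IGNORECASE)


def path_is_clean(request_uri):
    """True if the URL-path part contains only allowed characters.

    The query string is split off first and never inspected.
    """
    path = urlsplit(request_uri).path
    return ALLOWED_PATH.fullmatch(path) is not None


# Frank's example: the %20 codes live in the query string, not the path.
parts = urlsplit("/search_users.php?username=J%20Doe&widget=light%20blue")

if __name__ == "__main__":
    print("path: ", parts.path)
    print("query:", parts.query)
    print("clean:", path_is_clean("/search_users.php?username=J%20Doe"))
    print("clean:", path_is_clean("/widget.html%20"))
```

Only the path is checked here, so the %20 codes in the query string play no part in the result.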

Marcia:
You can use a different code snippet to remove spurious query strings - It's been posted elsewhere on WebmasterWorld several times. But, as highlighted by Frank's question above, you do need to be sure that it won't break your site. :)

Jim

7:22 pm on Dec 29, 2007 (gmt 0)

10+ Year Member



Indeed, same here: it's been crawling the entire *site* like that
for a few weeks. No way all those 1000+ pages are linked from forums. Paranoid as I am, I have been thinking someone somehow
managed to post a spoofed URL for a sitemap to them.. but that
would mean they screwed up big time... but if not.. they also screwed up big time.. it would be nice if ask.com people read this forum..

6:49 pm on Jan 3, 2008 (gmt 0)

5+ Year Member



Dear Drreggae

We did experience a data error which caused us to crawl badly-formed urls from a small number of sites. We identified the issue and corrected it on Dec 29th. Thanks for flagging and please let us know if you see any further problems.

Best regards,

Vivek Pathak
Infrastructure Product Manager
Ask.com

11:20 am on Jan 4, 2008 (gmt 0)

10+ Year Member



Thanks Vivek!
I just came back here to post that the crawler has indeed been
acting normally again for about a week.
Now we hope to see the same traffic as before again ;-)
(sure, part of it is the holidays and the end of Q4.. :-( )

regards!

Btw, gonna ask this in another subforum, but.. is that crawler: 78.137.163.133 coming from digiweb.ie with
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"

by any chance a crawler from ask.com? I wonder, cause some
ask crawlers indeed come from .ie if I am correct, and always
use Minefield..

This 78.137.163.133 is all over the web in visible access logs
etc, ( ip-78-137-163-133.dedi.digiweb.ie ).
Since it does not identify itself, I now block it via .htaccess
but surely would not be doing that if I knew it was from
ask.com..

[edited by: Drreggae at 11:27 am (utc) on Jan. 4, 2008]

9:56 pm on Jan 4, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




It's not Ask but a dedicated server that's either scraping you or hosting a proxy which can be used to hijack your site in the SE's.

host 78.137.163.133
ip-78-137-163-133.dedi.digiweb.ie

whois -h whois.ripe.net 78.137.163.133
inetnum: 78.137.160.0 - 78.137.163.255
netname: DIGIWEB-HOSTING-NET
descr: Digiweb Hosting [3]
country: IE
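
The same kind of check can be scripted. Below is a rough Python sketch, not a definitive verification method: it does a reverse DNS lookup with the standard socket.gethostbyaddr and tests whether the hostname falls under a given domain. The helper names reverse_dns and hostname_in_domain are made up for this example, and treating "ask.com" as the crawler's domain is an assumption based on the crawler100.ask.com hostname mentioned earlier in the thread. A PTR record alone can be faked by whoever controls the IP block, so a forward lookup of the returned hostname would be a sensible extra step.

```python
import socket


def reverse_dns(ip, resolver=socket.gethostbyaddr):
    """Return the PTR hostname for an IP address, or None if lookup fails.

    The resolver argument exists only so the lookup can be stubbed out.
    """
    try:
        return resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


def hostname_in_domain(hostname, domain):
    """True if hostname equals domain or is a subdomain of it."""
    if hostname is None:
        return False
    hostname = hostname.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return hostname == domain or hostname.endswith("." + domain)


if __name__ == "__main__":
    ip = "78.137.163.133"  # the address discussed in this thread
    host = reverse_dns(ip)  # performs a live DNS query
    print(ip, "->", host, "| under ask.com:", hostname_in_domain(host, "ask.com"))
```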

9:07 pm on Jan 6, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



While on the topic of ASK, they appear to be making screen shots via the Bloglines IP address so that little RSS feed reading IP is being used for double duty.

Also, the same IP has been asking for robots.txt with a blank user agent.

Very nice.

Not.

[edited by: incrediBILL at 9:09 pm (utc) on Jan. 6, 2008]

8:29 pm on Feb 13, 2008 (gmt 0)

5+ Year Member



Careful with that .htaccess code: it works well, but it may redirect when you don't want it to (at least it did on my server, even after I removed all other rules to test).

For example if you have both of these on your server:
http://example.com/red/
http://example.com/red-widget/

When you link to http://example.com/red-widget/ it gets 301'd to http://example.com/red/

If you don't have http://example.com/red/ it works as expected.

 
