
Ask - Teoma Forum

    
Entire site "mis"-crawled with appended %20 codes
Drreggae




msg:3536341
 12:11 pm on Dec 28, 2007 (gmt 0)

For a couple of weeks now, the ask.com crawler,
crawler100.ask.com, has been crawling our entire website with
url-encoded spaces (%20) appended to each URL, resulting
in our Apache server reporting "File does not exist".

Since then we have seen our ask.com traffic gradually fade away.

I emailed them last week, but so far no reply.

Has anyone else seen this, or have an explanation for it?

 

Frank_Rizzo




msg:3536342
 12:14 pm on Dec 28, 2007 (gmt 0)

Same here. Every crawl from ASK is 404'd for about two weeks now.

jdMorgan




msg:3536823
 2:58 am on Dec 29, 2007 (gmt 0)

I use something similar to the following .htaccess code to prevent abuse on Apache servers, but it also seems to put Teoma back on track after one of its errant "%20" requests:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-](\?[^\ ]*)?\ HTTP/ [NC]
RewriteCond %{DOCUMENT_ROOT}/%1 -f [OR]
RewriteCond %{DOCUMENT_ROOT}/%1 -d
RewriteRule .* http://www.example.com/%1 [R=301,L]

This basically allows the URL-path to contain only the characters 0-9, a-z, A-Z, slashes, periods, underscores and hyphens. If any other character is found in the URL-path, then the URL-path is truncated at that point, and --if the resulting URL resolves to an existing file or a directory-- a 301-Moved Permanently redirect to that truncated URL is invoked.

The original query-string attached to the URL (if any) is retained.

If you modify the [groups] in the pattern above, make sure that they match exactly -- with the obvious exception of the "^" negation operator in the second group.

Jim

jdMorgan




msg:3536881
 7:03 am on Dec 29, 2007 (gmt 0)

Oops! I introduced some errors in the generalize-copy-and-paste, there. The first RewriteCond should read:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([/0-9a-z._\-]*)[^/0-9a-z._\-\?\ ][^?\ ]*(\?[^\ ]*)?\ HTTP/ [NC]

Also, the most likely cause of this problem is search engine robots picking up on links posted in forums, where the poster or the forum software's auto-link routine has included the trailing space in the link.

You'll also see this happening with a trailing period on the requested URL when the person posting the link puts a period at the end of it -- as in, "For more info, see http://example.com/widget.html." However, in this case the period is not hex-encoded, because it is a valid character to include in a URL, unlike a space.
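
For anyone who wants to experiment with the corrected condition outside Apache, here is a rough Python sketch of the same logic. It is an approximation only: Python's re module is not Apache's regex engine, and the function name redirect_target and its document_root argument are made up for illustration.

```python
import os
import re
import tempfile

# Python rendering of the corrected RewriteCond pattern;
# re.IGNORECASE stands in for the [NC] flag.
THE_REQUEST_RE = re.compile(
    r"^[A-Z]{3,9} /([/0-9a-z._\-]*)[^/0-9a-z._\-? ][^? ]*(\?[^ ]*)? HTTP/",
    re.IGNORECASE,
)


def redirect_target(request_line, document_root):
    """Return the truncated URL-path to 301 to, or None if no redirect applies."""
    match = THE_REQUEST_RE.match(request_line)
    if match is None:
        return None  # URL-path contains only allowed characters
    truncated = match.group(1)
    # Same string concatenation as %{DOCUMENT_ROOT}/%1 in the conditions.
    candidate = document_root + "/" + truncated
    # Mirrors the "-f [OR] -d" pair: redirect only if the target exists.
    if os.path.isfile(candidate) or os.path.isdir(candidate):
        return "/" + truncated
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "widget.html"), "w").close()
        print(redirect_target("GET /widget.html%20 HTTP/1.1", root))   # /widget.html
        print(redirect_target("GET /widget.html HTTP/1.1", root))      # None
        print(redirect_target("GET /missing.html%20 HTTP/1.1", root))  # None
```

In this sketch a request for /widget.html. (trailing period) is not matched at all, because the period is one of the allowed characters; handling that case would need a separate rule.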

Jim

Marcia




msg:3536883
 7:25 am on Dec 29, 2007 (gmt 0)

To be honest, I'd try to use some kind of contact form to let Ask know, in hope that it'll get forwarded to their crawl people. Sometimes it works.

I once had a problem on a site (server-side configuration error on the part of the host) that put Slurp into an endless loop; I wrote Yahoo and actually got a human reply back that was very nice, and they did pass it on. I was afraid of getting banned because of it, but they apparently dealt with it, no problems ensued.

If there's a problem, it never hurts to try to communicate.

Marcia




msg:3536892
 7:44 am on Dec 29, 2007 (gmt 0)

Incidentally, this type of thing also happens with Yahoo, but you'll see /widgets?A=D and /widgets/?N=M (and various other combinations of letters, so many you can't keep up with them) appended to URLs when you do a site: search. And it does appear to have an adverse effect on listings.

Jim, is it the same type of solution for that type of thing happening?

Frank_Rizzo




msg:3536995
 12:44 pm on Dec 29, 2007 (gmt 0)

Surely it is an Ask/Teoma problem rather than a message board problem, as other SEs are not having this problem.

Besides, this is happening for every page on the site and not just pages linked to via the message board.

I have filled in a feedback form but, like the OP, got no reply.

jdmorgan - does that piece of code stop all uses of % translations?

I have a database look up which translates spaces such as:

/search_users.php?username=J%20Doe&widget=light%20blue

would the script fail if I used your code?

[edited by: Frank_Rizzo at 12:47 pm (utc) on Dec. 29, 2007]

jdMorgan




msg:3537037
 3:23 pm on Dec 29, 2007 (gmt 0)

Frank:
The original query-string attached to the URL (if any) is retained.
It will only remove %nn tails on the URL-path part. A query string is not part of the URL-path, but rather data attached to the URL to be passed to the resource at that URL.
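
A quick way to see this is to run Frank's example request through the corrected pattern posted earlier in the thread. The sketch below uses Python's re module as a stand-in for Apache's regex engine, so treat it as an approximation; the helper name condition_fires is made up for illustration.

```python
import re

# Corrected RewriteCond pattern from earlier in the thread,
# with re.IGNORECASE standing in for [NC].
PATTERN = re.compile(
    r"^[A-Z]{3,9} /([/0-9a-z._\-]*)[^/0-9a-z._\-? ][^? ]*(\?[^ ]*)? HTTP/",
    re.IGNORECASE,
)


def condition_fires(request_line):
    """True if the first RewriteCond would match this request line."""
    return PATTERN.match(request_line) is not None


# %20 only inside the query string: the condition does not fire.
print(condition_fires(
    "GET /search_users.php?username=J%20Doe&widget=light%20blue HTTP/1.1"
))  # False

# %20 appended to the URL-path itself: the condition fires.
print(condition_fires(
    "GET /search_users.php%20?username=J%20Doe HTTP/1.1"
))  # True
```

The first group stops at the "?", and the character after it must not be "?" or a space, so an encoded space that appears only in the query string never triggers the rule.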

Marcia:
You can use a different code snippet to remove spurious query strings; it's been posted elsewhere on WebmasterWorld several times. But, as highlighted by Frank's question above, you do need to be sure that it won't break your site. :)

Jim

Drreggae




msg:3537097
 7:22 pm on Dec 29, 2007 (gmt 0)

Indeed, same here: it has been crawling the entire *site* like that
for a few weeks. There is no way all those 1000+ pages are linked from forums. Paranoid as I am, I have been thinking that someone somehow
managed to submit a spoofed sitemap URL to them... but that
would mean they screwed up big time... and if not, they also screwed up big time. It would be nice if the ask.com people read this forum.

vpathak




msg:3539757
 6:49 pm on Jan 3, 2008 (gmt 0)

Dear Drreggae

We did experience a data error which caused us to crawl badly-formed urls from a small number of sites. We identified the issue and corrected it on Dec 29th. Thanks for flagging and please let us know if you see any further problems.

Best regards,

Vivek Pathak
Infrastructure Product Manager
Ask.com

Drreggae




msg:3540262
 11:20 am on Jan 4, 2008 (gmt 0)

Thanks Vivek!
I just came back here to post that the crawler has indeed
been behaving normally again for about a week.
Now we hope to see the same traffic as before come back ;-)
(surely part of the drop is the holidays and the end of Q4 :-( )

regards!

Btw, I'm going to ask this in another subforum, but is the crawler at 78.137.163.133, coming from digiweb.ie with
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"

by any chance a crawler from ask.com? I wonder, because some
Ask crawlers do come from .ie if I am correct, and always
use Minefield.

This 78.137.163.133 is all over the web in visible access logs
etc. ( ip-78-137-163-133.dedi.digiweb.ie ).
Since it does not identify itself, I now block it via .htaccess,
but I surely would not be doing that if I knew it was from
ask.com.

[edited by: Drreggae at 11:27 am (utc) on Jan. 4, 2008]

incrediBILL




msg:3540723
 9:56 pm on Jan 4, 2008 (gmt 0)


It's not Ask but a dedicated server that's either scraping you or hosting a proxy which can be used to hijack your site in the SE's.

host 78.137.163.133
ip-78-137-163-133.dedi.digiweb.ie

whois -h whois.ripe.net 78.137.163.133
inetnum: 78.137.160.0 - 78.137.163.255
netname: DIGIWEB-HOSTING-NET
descr: Digiweb Hosting [3]
country: IE
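
The same reverse-lookup check can be scripted. The sketch below is a minimal Python version; it assumes that genuine Ask crawlers resolve to hostnames under ask.com (as crawler100.ask.com earlier in this thread suggests), and the function names are made up for illustration.

```python
import socket


def looks_like_ask_host(hostname):
    """True if a reverse-DNS name sits under ask.com (assumed crawler domain)."""
    name = hostname.rstrip(".").lower()
    return name == "ask.com" or name.endswith(".ask.com")


def reverse_lookup(ip_address):
    """Return the PTR hostname for an IP address, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return None


if __name__ == "__main__":
    # Hostnames quoted in this thread; no network access needed for these two.
    print(looks_like_ask_host("crawler100.ask.com"))                 # True
    print(looks_like_ask_host("ip-78-137-163-133.dedi.digiweb.ie"))  # False
```

A PTR record alone can be set to anything by whoever controls the IP block, so a stricter check would also resolve the returned hostname forward and confirm that it maps back to the same IP.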

incrediBILL




msg:3541721
 9:07 pm on Jan 6, 2008 (gmt 0)

While on the topic of ASK, they appear to be making screen shots via the Bloglines IP address so that little RSS feed reading IP is being used for double duty.

Also, the same IP has been asking for robots.txt with a blank user agent.

Very nice.

Not.

[edited by: incrediBILL at 9:09 pm (utc) on Jan. 6, 2008]

LunaC




msg:3574178
 8:29 pm on Feb 13, 2008 (gmt 0)

Careful with that .htaccess code: it works well, but it may redirect when you don't want it to (at least it did on my server, even after I removed all other rules to test).

For example, if you have both of these on your server:
http://example.com/red/
http://example.com/red-widget/

When you link to http://example.com/red-widget/ it gets 301'd to http://example.com/red/

If you don't have http://example.com/red/ it works as expected.
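
One way to investigate a surprise like this before deploying is to run known-good URLs through the pattern offline. The Python sketch below does that for /red-widget/. It is only an approximation of Apache's regex engine, and the HYPHEN_LOST variant is purely a guess at how a copied pattern might get damaged (the hyphen dropped from the character class), not a diagnosis of any particular server.

```python
import re


def build_pattern(allowed):
    """Build the corrected condition for a given character-class body."""
    return re.compile(
        r"^[A-Z]{3,9} /([" + allowed + r"]*)"
        r"[^" + allowed + r"? ][^? ]*(\?[^ ]*)? HTTP/",
        re.IGNORECASE,
    )


def truncated_path(pattern, request_line):
    """Return the path the rule would truncate to, or None if it does not fire."""
    match = pattern.match(request_line)
    return None if match is None else match.group(1)


AS_POSTED = build_pattern(r"/0-9a-z._\-")  # class as posted in this thread
HYPHEN_LOST = build_pattern(r"/0-9a-z._")  # hypothetical damaged copy

request = "GET /red-widget/ HTTP/1.1"
print(truncated_path(AS_POSTED, request))    # None
print(truncated_path(HYPHEN_LOST, request))  # red
```

With the hyphen missing from the class, the path is cut at "red", and the redirect would then fire only when a matching directory exists. That would be consistent with the behaviour described above, but it remains a guess.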
