Feedfetcher every 40 minutes for two days

feedfetcher persistently looking for non existent item

         

revrob

8:57 pm on Sep 17, 2011 (gmt 0)

10+ Year Member



Every forty minutes for the past two days I have had the same pattern of visits: a request from Google Feedfetcher followed by one from a Bell Canada/Sympatico IP address. Both sources change IP ranges as I block them. I now have the entire Sympatico range blocked and have blocked Feedfetcher by user agent in .htaccess, so every request is now receiving a 403.

My site is w w w.my-domain.org.uk

the GET request is for /%7Emy.domain/rss.xml which was getting a 404 Not found before I started blocking them.
The actual rss file is w w w.my-domain.org.uk/rss.xml (and has been for years)

Abuse reports have not yet curbed this.

It's like a tag team visiting every 40 minutes or so - first the google IP then Sympatico (in the BellCanada IP address range). They are asking for a file but using a malformed domain without the suffix. I can't find the malformed domain in google searches so I have no idea where the search string is coming from - presumably some broken hacking software?

Here's some from earlier today before I blocked the entire Sympatico range. Note the shift in IP and the move from 403 response to 404 not found when they tried a new IP range.

209.85.226.88 - - [17/Sep/2011:03:44:40 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=17806222463619821662)" "-"
174.95.145.80 - - [17/Sep/2011:03:44:41 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 404 2319 www.my-domain.org.uk "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)" "-"

209.85.226.88 - - [17/Sep/2011:04:24:41 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=17806222463619821662)" "-"
174.95.145.80 - - [17/Sep/2011:04:24:41 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 404 2319 www.my-domain.org.uk "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)" "-"

209.85.228.90 - - [17/Sep/2011:05:04:42 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=17806222463619821662)" "-"
174.95.145.80 - - [17/Sep/2011:05:04:42 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 404 2319 www.my-domain.org.uk "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)" "-"
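For anyone wanting to confirm the interval from the raw log rather than by eye, here is a minimal Python sketch. It assumes the field layout visible in the lines above (IP, timestamp, request line, status, size, then a hostname field); the regular expression and the names `LINE_RE` and `parse` are illustrative, and the user-agent strings are trimmed for brevity.

```python
import re
from datetime import datetime

# The three Feedfetcher lines from the post above, user-agent trimmed for brevity.
LOG_LINES = [
    '209.85.226.88 - - [17/Sep/2011:03:44:40 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google" "-"',
    '209.85.226.88 - - [17/Sep/2011:04:24:41 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google" "-"',
    '209.85.228.90 - - [17/Sep/2011:05:04:42 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google" "-"',
]

# Assumed layout: ip, two placeholder fields, [timestamp], "request line", status, size, host.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+) (?P<host>\S+)'
)

def parse(line):
    """Split one log line into named fields and add a parsed timestamp."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("line does not match the assumed log layout")
    record = match.groupdict()
    record["when"] = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return record

records = [parse(line) for line in LOG_LINES]

# Seconds between consecutive Feedfetcher requests.
gaps = [(b["when"] - a["when"]).total_seconds() for a, b in zip(records, records[1:])]
print(gaps)                                   # [2401.0, 2401.0], i.e. 40 minutes and 1 second
print(records[0]["path"], records[0]["host"])
```

Note that the hostname field and the request path are separate fields in these lines; whether the hostname is the client's Host header or the server's configured name depends on the log format, which is not shown here.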

Any advice as to who I can contact at google - they don't publish an abuse email on their WHOIS report - just a technical address at 'arin-contactATgoogle.com' - surprise surprise no reply.

I've had auto-responses from Sympatico abuse email address.

My robots.txt denies all user agents with specific limited access to main directory only for Bing, Slurp, Google and Ask Jeeves.

All constructive suggestions welcome!
Many thanks.

incrediBILL

11:34 pm on Sep 17, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Every 40 minutes is a good thing; the bad path, not so good.

Feedfetcher hits my site about as fast, but Google also indexes the content it finds about as quickly, so that's the upside if you can fix the path problem.

g1smd

11:43 pm on Sep 17, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Feedfetcher fetches two URLs: one requested with a plain & in it and another with &amp; inside.

I've tried a 301 redirect to the correct URL, returning 410 for the incorrect URL, and returning a one-post feed with a "stop using this URL, and subscribe to the other feed" message, each for more than a month at a time.

Still the incorrect feed is requested, again and again. Any ideas?

They are asking for a file but using a malformed domain without the suffix:
GET /%7Emy.domain/rss.xml
Err. No they are not.

They are requesting the path and file /~my.domain/rss.xml.

You'll probably find they are requesting it via your site's IP address or as a subdomain of your host's main URL, something like:
10.20.30.40/~my.domain/rss.xml
or
yourhostingcompany.com/~my.domain/rss.xml


At some point in the past, it is very likely that those alternative URLs were exposed on the web, and since Google never forgets a URL, it continues to request them.

revrob

6:06 am on Sep 18, 2011 (gmt 0)

10+ Year Member



Thank you for the replies.

They are asking for a file but using a malformed domain without the suffix:
GET /%7Emy.domain/rss.xml
Err. No they are not.

They are requesting the path and file /~my.domain/rss.xml.


My site is my-domain.org.uk/
They are asking for my.domain/

Surely that IS a malformed domain? They've turned the hyphen between my and domain, into a dot.

Secondly - I actually don't want feedfetcher calling every forty minutes and ignoring robots.txt which only allows named bots. I know it can't read robots.txt now because of my useragent ban but if I can work out how to get it to behave I'll let it read robots.txt in the hope it will go away. And I don't want a Canadian domain crawling every forty minutes for my content either. I've never seen anything quite this frequent or persistent - not even Baidu or Yandex.

g1smd

7:21 am on Sep 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My site is my-domain.org.uk/
They are asking for my.domain/

Surely that IS a malformed domain? They've turned the hyphen between my and domain, into a dot.
No. That is not what is happening. The GET request as logged shows only the path and file.

They are requesting the PATH /~my.domain/ and the FILE /rss.xml on an UNSPECIFIED DOMAIN.

You'll probably find they are requesting it via your site's IP address or as a subdomain of your host's main URL, something like:
10.20.30.40/~my.domain/rss.xml
or
yourhostingcompany.com/~my.domain/rss.xml


You can't directly control this access using a robots.txt file loaded in your filespace because that file will appear to be at:
10.20.30.40/~my.domain/robots.txt
or
yourhostingcompany.com/~my.domain/robots.txt

and the robots.txt file needs to be in the root of the requested domain to have any effect.

You might try this .htaccess code ahead of all your other code to see if redirecting those requests to your own URL space makes any difference (it should do).

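# Assumes mod_rewrite is available and "RewriteEngine On" appears earlier in the file.
# Raw request line starts with /~my.domain or /%7Emy.domain: 301 to the canonical host.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(~|%7E)my\.domain
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Host header is neither empty nor www.example.com (e.g. bare IP): 301 to the canonical host.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]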


Be sure to also read: [webmasterworld.com...]

[edited by: g1smd at 7:37 am (utc) on Sep 18, 2011]

lucy24

7:23 am on Sep 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Overlapping g1 because I type slow.

Two different things.

GET /%7Emy.domain/rss.xml

means

GET http://www.example.com/%7Emy.domain/rss.xml

That's what the leading slash means.

If they were asking for a malformed domain, it would never reach you. The user's browser would show one of those "can't identify blahblah" messages, meaning that they've tried every DNS on the planet and nobody has a clue. (Or, if seriously malformed, that they didn't bother to try.)*

%7E is the encoded version of ~ (freestanding tilde). These days you see it most often in quasi-domains, like personal pages within academic sites. Your host probably uses it somewhere in the upper ranges of your path, but not in any public addresses.


* My browser heroically auto-converted it to http://www.~my.domain/ in the address bar, although all I pasted in was %7Emy.domain. But the error message said "www.%7emy.domain could not be found"
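To see the encoding point concretely, here is a small Python sketch using only the standard library. It decodes the logged request path and splits it into segments; the variable names are illustrative.

```python
from urllib.parse import unquote

# The path exactly as it appears in the logged GET request.
requested = "/%7Emy.domain/rss.xml"

# %7E is the percent-encoded form of the tilde character.
decoded = unquote(requested)
print(decoded)  # /~my.domain/rss.xml

# Everything in the request line is path: a folder segment and a file name.
segments = decoded.split("/")
print(segments)  # ['', '~my.domain', 'rss.xml']
```

Nothing in the decoded string is a hostname; "~my.domain" is simply the first folder segment of the path.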

revrob

8:49 am on Sep 18, 2011 (gmt 0)

10+ Year Member



Sorry, I'm still not getting this at all.

Perhaps if I put it another way?

my site is called (pretend) interesting-place.org.uk (hyphenated)

But the request is in the form
GET /%7Einteresting.place/rss.xml

and they were getting the 404 (before I blocked them)

The hyphen I am talking about is between the two halves of my domain name, and it has been replaced by a dot.

interesting-place.org.uk is surely not the same as
interesting.place.org.uk or interesting.place ?

lucy24

11:06 am on Sep 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The part you're not getting is that there is a difference between the URL that a user sees (for present purposes, robots are users) and the path that a computer takes to get to a file. I don't get it either, but that is OK because g1 is going to explain it. Or possibly point to one of 8,000 earlier threads where he explained it, only it didn't sink in :(

Everything in your logs is expressed as

GET /{blahblah}

The part before the slash is your domain name, for example interesting-place.org.uk. You'll notice there is no part before the slash. That's because it is always the same, give or take a leading www.

The part after the slash is the variable part of the address. (If there is also nothing after the slash, they want your index.html or equivalent.) So the visitors are really asking for

http://interesting-place.org.uk/~interesting.place/rss.xml

which, of course, does not exist. So why do they think it does? Because at some time in the past someone got lost backstage and found a real, physical directory called ~interesting.place. You can probably find it yourself if you FTP or equivalent to your site and start clicking the Up or Back button where you would normally go Down or Forward.

I once found a vast list of directories I had no idea existed.* They all have something to do with my site but beyond that I have no idea what they're for. One of them is /home/ and that's where I live. My username, that is, containing assorted logs and domains.


* With subdirectories. ganglia. ghostscript. groff. awk. (You can say that again.) dovecot. lost+found. And apparently several trillion libraries.

revrob

12:12 pm on Sep 18, 2011 (gmt 0)

10+ Year Member



Thanks for trying.

Well, I can't see any folder that is called my.domain with a dot in it. I've just FTP'd into my web host and checked from my account root directory all through every folder of the two sites that I have hosted there.

Way way back I used to have an ISP hosted site that had that my.domain as the relevant email account, and part of the url, but that was many years ago - and I don't think I even had an rss.xml file on the site then - and anyway this problem has only just started recently.

I think I will try and focus on Sympatico (part of Bell Canada) and see if I can get them to deal with whatever machine on THEIR network is suddenly prompting these searches that don't work, every 40 minutes.

If I can't get any joy from their abuse desk, perhaps I can redirect the requests from Sympatico to a chunky slow loading file on their own network. Sorry - I know I shouldn't do that...

g1smd

4:08 pm on Sep 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll try to explain this again, this time using different words.

The folder with the dot in it is likely ABOVE *your* root folder. It cannot be seen from within *your* domain.

Imagine that your host is examplehost.com and they have 100 customers and own one server machine. The host allocates a folder for each site that is stored on their server machine.

Let's say that you have www.example-five.com as your domain. Your site will sit on the host's server in the folder /sites/example-five/ or somesuch.

They will map requests for your domain to this folder as your document root.

However, if the server is also online as examplehost.com then the folder for your site will also show up as examplehost.com/example-five/ and if the server can be reached by bare IP address then your site will also show up as 10.20.30.40/example-five/

This is what is happening here.

Log files do not show the requested domain name in the GET request; they show only the path and file that was requested. The fact that part of your domain name appears as the name of a requested folder means that the request is being made to a domain other than yours. Crucially, this other domain name still maps in some way to your folders and files.

This will be, as explained above, the overall domain name of your hosting company, or the IP address of the server.
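The mapping described above can be sketched as a toy model in Python. It is a deliberate simplification, not how any particular host is configured: the `/sites` server root, the `VHOST_DOCROOTS` table and the `resolve` function are all hypothetical, built from the example-five example in this post.

```python
from pathlib import PurePosixPath

# Hypothetical layout: every customer's site lives in a folder under one server root.
SERVER_ROOT = PurePosixPath("/sites")

# Hostnames that have their own document root. Any other hostname (the host's
# own domain, or the bare IP address) falls back to the server root.
VHOST_DOCROOTS = {
    "www.example-five.com": SERVER_ROOT / "example-five",
}

def resolve(host, url_path):
    """Return the file a request would map to in this toy model."""
    docroot = VHOST_DOCROOTS.get(host, SERVER_ROOT)
    return docroot / url_path.lstrip("/")

# The normal URL and the two "backstage" URLs all reach the same file.
print(resolve("www.example-five.com", "/rss.xml"))
print(resolve("examplehost.com", "/example-five/rss.xml"))
print(resolve("10.20.30.40", "/example-five/rss.xml"))

# Asking the customer's own domain for the folder-prefixed path points at a
# location that does not exist, which would be a 404.
print(resolve("www.example-five.com", "/example-five/rss.xml"))
```

In this model the first three calls give /sites/example-five/rss.xml, while the last gives /sites/example-five/example-five/rss.xml.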

revrob

4:45 pm on Sep 18, 2011 (gmt 0)

10+ Year Member



My two domains are hosted on my hosting account under a host account domain name that is completely different from both domains and contains no words from either. It's in the form digitsandnumbers.websitehome.co.uk (hosts are 1&1).

What I don't understand is that in your example above, you don't show example-five (with a hyphen in it) turning into example.five (with a dot in it). The various iterations you are quoting all retain the hyphen as "example-five". Yet somehow my hyphenated domain name "my-domain" has been turned into "my.domain" in the request.

I can be reasonably certain that there is no path, hidden or otherwise, that has taken my domain name, split it in two, removed the hyphen, replaced it with a dot, and created a folder called my.domain that is part of the path to the file.

What is puzzling is that searching on "my.domain" throws up lots of hits in google for "my-domain.org.uk" which is my site - but all the entries show the correct domain address of my-domain.org.uk

I think I need to ask my host if they can explain this. Thanks for trying. Meanwhile let's hope Sympatico/Bell Canada can act against the perpetrator.

g1smd

5:06 pm on Sep 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They can call the folder that your site is hosted in anything they want. It doesn't matter what the folder is called as it isn't meant to ever appear in a valid URL.

In the example, I showed the folder as being named /example-five/ but it could equally have been called /example.five/ or /customer32282727722727/ or even /nigel/ and it would make no difference whatsoever to the normal running of your site.

The named folder is not anywhere within your domain. It is the folder that *houses* your domain. As such it can only ever appear as a folder name when another domain that maps to the root of the server (that's maps to the root of the entire server, not just maps to the root of your domain) is requested.

Pfui

5:20 pm on Sep 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Backing upthread a bit...

don't want feedfetcher calling every forty minutes and ignoring robots.txt which only allows named bots. I know it can't read robots.txt now because of my useragent ban

Actually, "Feedfetcher-Google" doesn't read robots.txt because it doesn't ask for it. Yes, it's a bot. No, G doesn't regard it as a bot.

(Aside: I'm not sure how your "useragent ban" works but FWIW, the only file I allow everything at least once is robots.txt. But it's CGI-generated so only certain G, M, and Y hosts using certain UAs get details and everything else gets a simple full Disallow. What happens in my robots.txt stays in my robots.txt:)
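The script itself is not shown in this thread, so the following Python sketch is purely illustrative of the idea of a generated robots.txt. The token list, the rule text and the function name `robots_txt_for` are all assumptions, and it checks only the user-agent string, whereas the setup described above also checks the requesting host.

```python
# Hypothetical user-agent tokens that get the detailed rules.
ALLOWED_TOKENS = ("googlebot", "bingbot", "slurp")

# What everything else receives: a simple full Disallow.
FULL_DISALLOW = "User-agent: *\nDisallow: /\n"

# Hypothetical detailed rules for the recognised crawlers.
DETAILED_RULES = "User-agent: *\nDisallow: /private/\n"

def robots_txt_for(user_agent):
    """Choose the robots.txt body to serve for a given user-agent string."""
    ua = user_agent.lower()
    if any(token in ua for token in ALLOWED_TOKENS):
        return DETAILED_RULES
    return FULL_DISALLOW

print(robots_txt_for("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(robots_txt_for("SomeUnknownBot/1.0"))
```

A user-agent string is trivially spoofed, which is presumably why the host is checked as well in the real thing.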

revrob

7:26 am on Sep 20, 2011 (gmt 0)

10+ Year Member



The problem has finally stopped, and the last identified IP range involved was Kindsight, in an Ottawa Telecom Ltd. IP block. About four hours after I notified Ottawa Telecom, I had the last visit:

74.125.112.87 - - [19/Sep/2011:23:26:58 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my.domain.org.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=17806222463619821662)" "-"
174.89.128.53 - - [19/Sep/2011:23:26:58 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)" "-"

174.89.128.53 is Kindsight, a well-known data mining outfit.

And after that, nothing. Today's log is completely clear, which after many days with 40-minute repeats is a relief.

I suspect that Kindsight are involved in some DPI manoeuvre with Canadian ISPs, a bit like Vodafone/BlueCoat and TalkTalk/Huawei, for intercepting web requests to do "malware protection" checks or behavioural advertising scanning. Looks like they just whitelisted my website as soon as they were identified.

dstiles

8:14 pm on Sep 20, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



revrob - can I ask how you got kindsight from that? 174.89.128.xx is just broadband, as far as I can tell, running a valid (though slightly out of date) firefox browser..

revrob

9:09 am on Sep 21, 2011 (gmt 0)

10+ Year Member



Sorry - you're right - I quoted the wrong log entry from that day's log, and the wrong IP address. There are so many, and such a shifting pattern as they moved IP ranges (and even ISPs), that it got quite confusing. First Bell Canada/Sympatico, then Ottawa Telecom Ltd, then Kindsight - all paired with a Feedfetcher request. The one I gave above was Bell Canada/Sympatico, not Kindsight.

Incidentally the Technical enquiries address for Kindsight was the same as for Ottawa Telecom Ltd. No abuse reporting option provided. I sent abuse reports to that address anyway as the situation developed.

The IP I should have quoted was: 72.1.196.178

and a sample log entry (there were many) showing that was this one:

74.125.112.84 - - [19/Sep/2011:18:07:19 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=**********)" "-"
72.1.196.178 - - [19/Sep/2011:18:07:21 +0200] "GET /%7Emy.domain/rss.xml HTTP/1.1" 403 - www.my-domain.org.uk "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)" "-"

and the IP lookup info is here
[iptools.com...]

and identifies it as Kindsight within an Ottawa Telecom Ltd block.
American Registry for Internet Numbers NET72 (NET-72-0-0-0-0) 72.0.0.0 - 72.255.255.255
Telecom Ottawa Limited TOL-IPBLOCK-1 (NET-72-1-192-0-1) 72.1.192.0 - 72.1.223.255
Kindsight TOL-72-1-196-176-190 (NET-72-1-196-176-1) 72.1.196.176 - 72.1.196.191
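The range membership can be double-checked with Python's standard `ipaddress` module. This is a minimal sketch; the `RANGES` table simply restates the two narrower ranges from the lookup above, and the helper names are illustrative.

```python
import ipaddress

# Start and end addresses copied from the lookup output above.
RANGES = {
    "Telecom Ottawa Limited": ("72.1.192.0", "72.1.223.255"),
    "Kindsight": ("72.1.196.176", "72.1.196.191"),
}

def to_networks(first, last):
    """Convert a first-last address range into the CIDR blocks that cover it."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))

def owners(ip):
    """List the named ranges that contain the given address."""
    addr = ipaddress.ip_address(ip)
    return [name for name, (first, last) in RANGES.items()
            if any(addr in net for net in to_networks(first, last))]

print(to_networks("72.1.196.176", "72.1.196.191"))  # a single /28
print(owners("72.1.196.178"))   # ['Telecom Ottawa Limited', 'Kindsight']
print(owners("174.89.128.53"))  # [] : the address quoted in error earlier
```

The CIDR form (72.1.196.176/28 for Kindsight, 72.1.192.0/19 for the wider block) is also the form most blocking rules expect.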

As soon as I blocked the Kindsight ranges, life got quite interesting. Suddenly there were a large number of requests for many pages of my site from the Kindsight IP range (all getting 403), with the pattern of a human browser user (someone at the company responding to my abuse reports?). First came a Google search from the Kindsight IP address. Then I noticed a visit from the Wayback archive, sandwiched between Kindsight visits, followed by a visit from the Kindsight IP to the OLD location of the site on my ISP-hosted webspace (which they would have got from the Wayback archive). A few seconds later the same browser user agent responded to the 403s by switching to another IP in Kansas. That was all up to yesterday. But despite all that time-consuming research activity, there has been no reply to my abuse report emails.

Thanks for pointing out the error.

dstiles

9:22 pm on Sep 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wouldn't hold out much hope for replies to abuse reports. Either they fix them silently, copy the report to the perpetrator (who then knows "who" you are!) or quietly forget about them - assuming they even get the report in the first place.

Thanks for the IP. I hadn't come across that range before.

You seem to have stirred up vengeance, anyway. :)