parchmenthill

         

Pfui

1:21 pm on Aug 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



nym.theservergroup.net
Mozilla/4.0 (compatible;++see+http://www.parchmenthill.com/search.htm)

robots.txt? NO

Info page says they will download robots.txt "very shortly"...

Uh-huh.

FWIW:

Publisher or bot-runner? Bot-runner or publisher?

"Our goal is to publish books..."
[parchmenthill.com...]

"The goal [of our project] is to make it very easy for people to find websites..."
[parchmenthill.com...]

wilderness

2:40 pm on Aug 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This backbone has multiple ranges that are all pests.
66.148.64.0 - 66.148.127.255

caribguy

8:37 am on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Also seen from 76.73.37.nnn

ParchmentHill

11:42 pm on Aug 31, 2009 (gmt 0)

10+ Year Member


Hi,

Sorry about the robots.txt -- it really is in the works. Right now, the program only looks at the root index page for a given domain, and public access won't be allowed until after robots.txt support is implemented.
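For reference, that kind of check is straightforward with Python's standard urllib.robotparser -- a minimal sketch only, with placeholder domain and bot names (not ParchBot's actual code):

```python
# Minimal sketch: parse a robots.txt body and test URLs against it.
# The domain and bot name are placeholders, not the real crawler's.
from urllib import robotparser

rules = """\
User-agent: ExampleBot
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # rp.read() would fetch the file over HTTP instead

print(rp.can_fetch("ExampleBot", "http://example.com/"))           # True
print(rp.can_fetch("ExampleBot", "http://example.com/private/x"))  # False
```

A real crawler would fetch `/robots.txt` once per host and consult it before requesting any page, including the root index page.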

It is still a work in progress, so there are some other things still to be done (e.g. a valid PTR record; "theservergroup.net" isn't us). Parchment Hill is the "parent company", so to speak -- the search engine project still needs to be named (I'm sure there's at least one domain name out there that hasn't been registered yet!), at which point it will get its own website.

FWIW, the data that is retrieved is all indexed, and is designed to be used solely for the purpose of searching websites (specifically, finding a domain or website, rather than a webpage).

Any questions/problems, just let me know.
-Scott

jdMorgan

2:11 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Scott, and thanks for stopping in.

What the mob here is saying is basically, "No robots.txt compliance, no fetching allowed." Or to turn around the phrase you used above, "Robot access to our sites won't be allowed until after robots.txt compliance is implemented."

The common factor among many of the forum contributors here is that their sites have been harvested, scraped, nph-proxied, and copied far too many times. Many of us are small operators and can't afford a "legal team" to keep up with all the DMCA notices we'd need to file as a result of all this malicious activity.

Therefore, the trend here is that if it's not a current traffic-and-revenue-producing search engine, and it's not a fully-validated legitimate browser with a verified human behind it, then it's highly suspect.

Many here now shoot first and ask questions later. Whitelisting by user-agent and IP address range is getting more and more common.
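As an illustration of that whitelisting approach only (the user-agent tokens below are examples, not a recommended list), an Apache mod_rewrite fragment along these lines denies everything that doesn't match a known-good token:

```apache
# Example only: return 403 Forbidden to any client whose User-Agent
# doesn't contain one of the whitelisted tokens.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot|Mozilla) [NC]
RewriteRule .* - [F]
```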

So, the longer you put off robots.txt compliance, the worse the damage to your robots' "reputation" will be, and the fewer sites you'll be able to index.

We're not fooling ourselves here, though: You may not notice the small number of sites that are represented by individual Webmasters here and in other similar forums. However, some of the folks here have quite extensive influence -- extending to thousands or perhaps tens of thousands of sites. So that may rate at least a second thought.

Anyway, it's nothing personal: it's either a business decision based on return-on-investment and risk management or, for some smaller sites with limited bandwidth allocations, a matter of cost-effective survival. Because of rampant abuse, the Web is no longer truly 'open,' and robots.txt compliance is no longer optional.

Looking forward to that first robots.txt fetch. :)

Jim

ParchmentHill

11:24 am on Sep 1, 2009 (gmt 0)

10+ Year Member



That's certainly not a problem; I know people will block us. I, too, have spent countless hours poring over log files and blocking bogus traffic (I was one of the first to discover FasterFox's prefetching and Microsoft's evil 'InfoPath', and I've even had several viruses that would connect to an old website of mine). I've wrestled with the 'But what if it is valid traffic?' question many a time.

If this project does succeed, it will be a big help to a lot of smaller websites (it's designed to help people find the websites they're looking for, whether that's a local restaurant or a forum for Toyota owners). It tackles the search problem from a completely different angle than a normal search engine does (e.g. that local restaurant may not even have a single link into it yet).

But my reason for posting isn't just to convince people not to block the traffic (you're welcome to, especially if you have the type of website people have no trouble finding). The reason for posting is simply to let people know that some projects do have people behind them who are at least trying to be responsible. :)
-Scott

ParchmentHill

7:55 pm on Sep 14, 2009 (gmt 0)

10+ Year Member



Just a quick follow-up (for the benefit of those following this thread, and others that may be searching for information on the bot).

The bot now fully respects robots.txt files (no crawls have run without robots.txt support since my last post), and no public access has been provided to any information blocked by robots.txt.

Also, it turns out that the User-Agent: header was missing a name for the bot (oops!), so it now has a name ("ParchBot", as in "ParchBot/1.0"). We're also checking robots.txt for 'parchmenthill.com', since that's all we really had as an identifier previously.
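With a named token, sites that want to opt out can now target the bot directly in robots.txt -- e.g. the standard block-everything record:

```
User-agent: ParchBot
Disallow: /
```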
-Scott

Pfui

12:37 am on Sep 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thank you for reading and heeding robots.txt! Earlier today:

aquarius.theservergroup.net
Mozilla/5.0 (compatible;+ParchBot/1.0;++http://www.parchmenthill.com/search.htm)

robots.txt? YES

Will you always be crawling from .theservergroup.net?

ParchmentHill

12:57 am on Sep 15, 2009 (gmt 0)

10+ Year Member



The 'theservergroup.net' domain is not ours; apparently, it was a leftover reverse DNS entry from a previous user of our IP range. That should be changed shortly to search.parchmenthill.com. Until the search engine has its own name, any crawling should come from IPs whose reverse DNS is *.parchmenthill.com.
-Scott

jdMorgan

5:46 pm on Sep 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, Scott!

On the basis of robots.txt compliance and a valid user-agent, your 'bot is now Allowed in both my robots.txt files and filter files.

We (Webmasters) appreciate the quick remediation of this problem. I hope this thread will serve as an example to others who wish to deploy new, previously-unknown 'bots: it's easier to start out right with robots.txt compliance than to do "damage control" after reports of 'strange behaviour' have started to propagate across the Web -- especially if you have to present yourself in a "tough room" like this one... :)

Jim

Pfui

9:20 pm on Oct 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All is not well with the 'new' ParchBot. Below is current activity on one site where it asks for robots.txt way, waaay too often, and then flat-out ignores it anyway.

search.parchmenthill.com
Mozilla/5.0 (compatible;+ParchBot/1.0;++http://www.parchmenthill.com/search.htm)

09:35:21 /robots.txt
09:35:22 /
09:47:04 /robots.txt
09:47:05 /
09:57:12 /robots.txt
09:57:13 /
10:06:52 /robots.txt
11:26:00 /robots.txt
11:35:25 /robots.txt
11:47:59 /robots.txt
12:05:00 /robots.txt
12:20:01 /robots.txt

Not okay.
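For comparison, polite crawlers generally cache robots.txt per host and re-fetch it only after an interval on the order of hours to a day, not before every request. A minimal sketch of such a cache (the interval and names are illustrative, not any real crawler's code):

```python
import time

ROBOTS_TTL = 6 * 3600  # seconds between re-fetches; 6 hours is illustrative

_cache = {}  # host -> (fetched_at, parsed_rules)

def get_robots(host, fetch):
    """Return robots.txt rules for `host`, calling `fetch(host)` to
    download them only when the cached copy is older than ROBOTS_TTL."""
    now = time.time()
    entry = _cache.get(host)
    if entry is None or now - entry[0] > ROBOTS_TTL:
        _cache[host] = (now, fetch(host))
    return _cache[host][1]
```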

ParchmentHill

9:40 pm on Oct 5, 2009 (gmt 0)

10+ Year Member



> All is not well with the 'new' ParchBot.

Sorry about that. We were doing some testing today on a handful of domains, after we noticed two glitches in the robots.txt processing (where it would download pages despite the wishes of the robots.txt file). You'll see from the logs where the glitch that affected your domain was fixed (with the first three hits it wasn't honoring the robots.txt file, but after that it did).

If you have other domains, you should notice that others weren't hit during this time (unless you have a huge number of domains).

That's one of the drawbacks to robots.txt -- it isn't easy to parse, since it isn't completely standardized. And we found that a lot of people appear to be running software that automatically inserts their entire User-Agent: HTTP header value into robots.txt, rather than just the name of the user agent.

The good news, though, is that it appears that the parsing is working as intended now.
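One tolerant way to handle that (a sketch under my own assumptions, not ParchBot's actual parser) is to match the robots.txt User-agent value against the bot's short name as a case-insensitive substring in either direction:

```python
def record_applies(ua_line, bot_name):
    """Tolerant match between a robots.txt User-agent value and a bot's
    short name. Handles sites that paste a full HTTP User-Agent header
    (e.g. 'Mozilla/5.0 (compatible; ParchBot/1.0; ...)') instead of
    just the token. Illustrative only."""
    token = ua_line.strip().lower()
    name = bot_name.lower()
    # Substring match in either direction, ignoring any version suffix.
    return name in token or token.split("/")[0] in name

print(record_applies("ParchBot", "parchbot"))                                # True
print(record_applies("Mozilla/5.0 (compatible; ParchBot/1.0)", "parchbot"))  # True
print(record_applies("Googlebot", "parchbot"))                               # False
```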
-Scott

Pfui

12:34 am on Oct 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thank you for your prompt reply, Scott. Your bot has steered clear since this morning's hyperactivity.

Truth be told, I don't know whether to be amazed at or creeped out by the coincidence of one of my West Coast domains finding itself in a "handful" randomly tested by your East Coast company. If the odds were 1 in, oh, 20,000, I'm off to Vegas, baybee.