Forum Moderators: open
robots.txt? NO
Info page says they will download robots.txt "very shortly"...
Uh-huh.
FWIW:
Publisher or bot-runner? Bot-runner or publisher?
"Our goal is to publish books..."
[parchmenthill.com...]
"The goal [of our project] is to make it very easy for people to find websites..."
[parchmenthill.com...]
Sorry about the robots.txt -- it really is in the works. Right now, the program only looks at the root index page for a given domain, and public access won't be allowed until after the robots.txt file is implemented.
It is still a work in progress, so there are still some other things to be done (e.g. a valid PTR record; "theservergroup.net" isn't us). Parchment Hill is the "parent company", so to speak -- the search engine project still needs to be named (I'm sure there's at least one domain name out there that hasn't been registered yet!), at which point it will get its own website.
FWIW, the data that is retrieved is all indexed, and is designed to be used solely for the purpose of searching websites (specifically, finding a domain or website, rather than a webpage).
Any questions/problems, just let me know.
-Scott
What the mob here is saying is basically, "No robots.txt compliance, no fetching allowed." Or to turn around the phrase you used above, "Robot access to our sites won't be allowed until after robots.txt compliance is implemented."
The common factor among many of the forum contributors here is that their sites have been harvested, scraped, nph-proxied, and copied far too many times. Many of us are small operators, and can't afford a "legal team" to keep up with all the DMCA notices we'd need to file as a result of all this malicious activity.
Therefore, the trend showing here is that if it's not a current traffic-and-revenue-producing search engine, and it's not a fully-validated legitimate browser with a verified human behind it, then it's highly suspect.
Many here now shoot first and ask questions later. Whitelisting by user-agent and IP address range is getting more and more common.
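For illustration, the user-agent side of that whitelisting can be done with a deny-by-default robots.txt along these lines (the bot names here are just examples; IP-range filtering would be done separately at the server level):

```
# Hypothetical whitelist: name the crawlers you trust...
User-agent: Googlebot
Disallow:

# ...and deny everyone else by default
User-agent: *
Disallow: /
```

Note that an empty `Disallow:` line means "nothing is disallowed" for that named bot, while `Disallow: /` blocks the entire site for everyone not listed.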
So, the longer you put off robots.txt compliance, the worse the damage to your robots' "reputation" will be, and the fewer sites you'll be able to index.
We're not fooling ourselves here, though: You may not notice the small number of sites that are represented by individual Webmasters here and in other similar forums. However, some of the folks here have quite extensive influence -- extending to thousands or perhaps tens of thousands of sites. So that may rate at least a second thought.
Anyway, it's nothing personal, it's either just a business decision based on return-on-investment and risk management, or for some smaller sites with limited bandwidth allocations, a matter of their cost-effective survival. Because of rampant abuse, the Web is no longer truly 'open,' and robots.txt compliance is no longer optional.
Looking forward to that first robots.txt fetch. :)
Jim
If this project does succeed, it will be a big help to a lot of smaller websites (it's designed to help people find the websites they're looking for, whether it's that of a local restaurant or a forum for Toyota owners). It tackles the search issue from a completely different angle than a normal search engine (e.g. that local restaurant may not even have a single link into it yet).
But my reason for posting isn't just to convince people not to block the traffic (you're welcome to, especially if you have the type of website people may not have trouble finding). The reason for posting is simply to let people know that some projects do have people behind them that are at least trying somewhat to be responsible. :)
-Scott
The bot now fully respects robots.txt files (and no runs have occurred without the robots.txt support since my last post). No public access has been provided to information blocked by robots.txt.
Also, it turns out that the User-Agent: header was missing a name for the bot (oops!), so it now has a name ("ParchBot", as in "ParchBot/1.0"). We're also checking robots.txt for 'parchmenthill.com', since that's all we really had as an identifier previously.
-Scott
On the basis of robots.txt compliance and a valid user-agent, your 'bot is now Allowed in both my robots.txt files and filter files.
We (Webmasters) appreciate the quick remediation of this problem, and I hope that this thread will serve as an example to others who wish to deploy new, previously-unknown 'bots: it's easier to start out right with robots.txt compliance than to have to do "damage control" after reports of 'strange behaviour' have started to propagate across the Web -- and especially if you have to present yourself in a "tough room" like this one... :)
Jim
search.parchmenthill.com
Mozilla/5.0 (compatible;+ParchBot/1.0;++http://www.parchmenthill.com/search.htm)
09:35:21 /robots.txt
09:35:22 /
09:47:04 /robots.txt
09:47:05 /
09:57:12 /robots.txt
09:57:13 /
10:06:52 /robots.txt
11:26:00 /robots.txt
11:35:25 /robots.txt
11:47:59 /robots.txt
12:05:00 /robots.txt
12:20:01 /robots.txt
Not okay.
Sorry about that. We were doing some testing today on a handful of domains, after we noticed two glitches in the robots.txt processing (where it would download pages despite the wishes of the robots.txt file). You'll see from the logs where the glitch that affected your domain was fixed (with the first three hits it wasn't honoring the robots.txt file, but after that it did).
If you have other domains, you should notice that others weren't hit during this time (unless you have a huge number of domains).
That's one of the drawbacks to robots.txt -- it isn't easy to parse, since it isn't completely standardized. And, we found that a lot of people appear to be running software that automatically adds the entire User-Agent: HTTP header, rather than just the name of the user agent.
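As a sketch of the token-matching issue described here: Python's standard-library parser, for one, matches a robots.txt record against the token before the "/" in the product string, so a record naming just "ParchBot" still applies to a bot identifying as "ParchBot/1.0". The robots.txt content below is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one record for ParchBot, a catch-all for the rest
robots_txt = """\
User-agent: ParchBot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Token-based matching: "ParchBot/1.0" is matched by the "ParchBot" record
print(rp.can_fetch("ParchBot/1.0", "http://example.com/index.html"))  # True
print(rp.can_fetch("ParchBot/1.0", "http://example.com/private/x"))   # False
# An unlisted bot falls through to the "*" record and is blocked entirely
print(rp.can_fetch("Mozilla/5.0", "http://example.com/index.html"))   # False
```

A parser that instead compared the record name against the entire User-Agent: HTTP header would fail to match here, which is the kind of mismatch being described.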
The good news, though, is that it appears that the parsing is working as intended now.
-Scott
Truth be told, I don't know whether to be amazed at or creeped out by the coincidence of one of my West Coast domains finding itself in a "handful" randomly tested by your East Coast company. If the odds were 1 in, oh, 20,000, I'm off to Vegas, baybee.