pigmej, wykop/0.3

Pfui

7:12 am on Aug 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



wklej.to
pigmej, wykop/0.3

robots.txt? NO

A Polish something-or-other (Google Translate the Host name), on a TLD that looks like it's Tonga (.to), but per WHOIS is actually OVH in France.

GaryK

7:46 am on Aug 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep, it's Tonga. I know it didn't read robots.txt, but how well-behaved was it? What kinds of files did it try taking?

Pfui

8:27 am on Aug 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just /

pigmej

9:50 pm on Aug 6, 2009 (gmt 0)

10+ Year Member



Soo...

that crawler was visiting your site because your site was posted on [wykop.pl...] (a Polish Digg clone).

It fetches the same URL that was published on wykop. That's why it ignores robots.txt.

GaryK

1:34 am on Aug 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld.

It would be nice if you used a properly formatted user agent string so we wouldn't have to make guesses about what your bot is doing.

Something like this would be greatly appreciated, and might help to ensure your bot gets some respect:

wykop/0.3 (http://wykop.pl/bots.html; bots@wykop.pl)

The bots.html page should describe why your bot is hitting our site. And the e-mail address is in case we have questions about your bot.
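
Sending it takes only a line or two in most HTTP libraries. A minimal sketch in Python's stdlib (the target URL is just a placeholder):

import urllib.request

# A descriptive UA: bot name/version, an "about this bot" page, and a contact address.
UA = "wykop/0.3 (http://wykop.pl/bots.html; bots@wykop.pl)"

req = urllib.request.Request("http://example.com/", headers={"User-Agent": UA})
with urllib.request.urlopen(req) as resp:
    page = resp.read()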

Thank you. :)

pigmej

10:55 am on Aug 7, 2009 (gmt 0)

10+ Year Member



This bot isn't an official wykop.pl bot...

But no problem, I can do something like this ;)

And don't worry, the fetched pages will only be used for a search engine :)

GaryK

4:04 pm on Aug 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This bot isn't an official wykop.pl bot

Does this mean your bot is scraping Wykop for links and content?

pigmej

4:15 pm on Aug 7, 2009 (gmt 0)

10+ Year Member



Yep.

For making a search engine for wykop.pl (because the wykop one... doesn't work).

GaryK

4:43 pm on Aug 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wklej.to

A friend suggested you're using this domain name because in Polish it means something like Paste It. Is that right? Just trying to figure out why a Polish search engine would be using a ccTLD from Tonga.

Anyway, I'm just not sure I see the benefit in letting your bot hit my sites. None of them are ever going to be included in a site that's Digg-like.

But if you want other webmasters to seriously consider giving your bot access to their sites I think you need to make the changes I suggested earlier. Namely an "about your bot" page and a contact address.

Best of luck to you.

pigmej

7:37 pm on Aug 7, 2009 (gmt 0)

10+ Year Member



No...

The search engine is at a different address.

"wklej.to" is the biggest site on that server; that's why the host reads "wklej.to".

The search engine will be here: szukaj.tutaj.to/wykop (this means something like: search.here.this). Only "part" of the search is at that URL right now, and it's really at an alpha/beta stage.

I'm indexing the "body" of every site added to wykop (it should be fetched only once). I need a full-text search engine because I want to test some technology...

ps. wklej.to == paste.it

ps2. UA fixed ;)

Pfui

8:04 pm on Aug 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@pigmej:

The mere fact someone supposedly puts a link to one of my sites on a site you fetch/scrape (or on one of your sites) does NOT give you permission to scrape my site(s). Rather, permission to engage in such activity is specifically denied in my robots.txt instructions:

User-agent: *
Disallow: /

Bot-running is also specifically denied in my sites' Terms of Use, Copyright Notices and code.

Bottom Line: Your bot fetches/scrapes sites that do not belong to you and ignores robots.txt telling it not to. Regardless of your needs, your bot's conduct is not okay.

Solution? Program your bot to read and heed all sites' robots.txt files. If access is disallowed, it departs, fetching nothing else. Thank you!
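
The check is cheap, too. A minimal sketch of "read and heed" using Python's stdlib parser (the host name is a placeholder):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "http://example.com/some/page"
if rp.can_fetch("wykop/0.3", url):
    pass  # allowed: fetch the page
else:
    pass  # disallowed: depart, fetching nothing else (with Disallow: / this is every URL)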

pigmej

9:55 pm on Aug 8, 2009 (gmt 0)

10+ Year Member



Pfui,

Generally you're right... But it's just one hit (one single URL), and a site's content is removed from the database as soon as it has been parsed by the full-text search engine. That was the reason for making the spider ignore robots.txt... The search engine will be free, without ads (I'm doing it for free).

I understand your arguments. I can make the spider respect the rules in robots.txt.

enigma1

10:48 am on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It fetches the same URL that was published on wykop. That's why it ignores robots.txt.

Well, I've been saying this for some time now. All search engines will pick up and access a link posted elsewhere without checking robots.txt.

This can also be done with redirects, with Google for instance. Just set up a 301 or 302 redirect for the spider and watch it walk right into the link without reading robots.txt. I don't see why this crawler would act differently.
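
A spider can close that hole by re-checking robots.txt for the target of every redirect before following it. A rough sketch with Python's stdlib (robots.txt is fetched per redirect here, uncached; the UA is the one from this thread):

import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlsplit

BOT_UA = "wykop/0.3"

def robots_allows(ua, url):
    # Fetch and check robots.txt for the url's own host (uncached, for brevity).
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    rp.read()
    return rp.can_fetch(ua, url)

class RobotsAwareRedirects(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Re-check robots.txt for the redirect target before walking into it.
        target = urljoin(req.full_url, newurl)
        if not robots_allows(BOT_UA, target):
            return None  # refuse the hop; urlopen raises HTTPError instead
        return super().redirect_request(req, fp, code, msg, headers, newurl)

opener = urllib.request.build_opener(RobotsAwareRedirects)
# opener.open(url) then refuses any hop that the target site's robots.txt disallows.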

Spiders should be smarter, though it's a matter of efficiency. Ideally they should not only check robots.txt; when they see an external link, they should also begin crawling from the domain root, and only access and index the original link if they find it from the site's own pages, subject to the meta tags. That won't work too well for major hosts that offer free automated sites, for example:
example.com/joe
example.com/john
example.com/bill

That's a problem, and it can be solved by registering the site with a search engine, but... a process like this would eliminate lots of security holes that exist right now and reduce the visibility of sensitive information and exploits: RFIs, session IDs exposed via the URL, unprotected private/admin pages, private statistics that are not locked down, etc.

But given the competition, spiders try to index as many pages as they can, as fast as they can, regardless of content, which is bad. This is global, and it may only change if there is enough pressure on the popular search engines.

jdMorgan

3:46 pm on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> All search engines will pick and access a link posted elsewhere without checking robots.txt.

I have never seen Google fetch a page that was Disallowed in robots.txt -- at least not in many, many years. Yahoo and MSN have done so, but only during occasional "buggy" periods when they were using/testing newer versions of their spiders.

For the past year at least, there have been no exceptions in my access-control logic to allow G, Y, or MSN to violate robots.txt, and none of them have done so.

So I disagree that "all search engines will pick and access a link posted elsewhere without checking robots.txt." The only user-agents that I permit to access my site without checking robots.txt are those that belong to directories to which I have voluntarily submitted the page being fetched. In other words, if I asked to be put on their URL-fetch list it's OK that they don't check robots.txt because I have "opted in."

Jim

enigma1

4:59 pm on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So I disagree that "all search engines will pick and access a link posted elsewhere without checking robots.txt."

It's very easy to verify. Next time Google requests a page on one of your domains, send 301 or 302 redirect headers pointing into a restricted folder/page and watch it access it.
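
The trap itself is only a few lines. A rough sketch with Python's stdlib server (the UA check and paths are made up for illustration):

import http.server

class RedirectTrap(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Googlebot" in ua and self.path == "/bait.html":
            # Send the spider a 302 into a robots.txt-disallowed folder.
            self.send_response(302)
            self.send_header("Location", "/restricted/secret.html")
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html>ok</html>")

http.server.HTTPServer(("", 8080), RedirectTrap).serve_forever()

Then watch the access log to see whether the restricted URL gets requested.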

jdMorgan

6:02 pm on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



enigma1,

I can see how they might fetch a robots.txt-Disallowed URL if the redirect comes from the server of the originally-requested URL and the target is on that same domain, but really, even if this is an error, it is an edge case that is only loosely-related to the subject at hand.

If G, Y, or MSN-Live-Bing ever fetched a robots.txt-Disallowed URL on my site, it would get banned, and I'd see that in my "special" 403 log file. This does not happen, the fact that I don't ever redirect to Disallowed URLs notwithstanding. It does sound like they need to review the precedence of their redirection- and robots.txt-handling routines, though.

The question here is where we as Webmasters draw the line on permissible fetches: In a best-practices (and simplified) view, it is OK for the agent of a search engine to fetch a URL if that URL is not disallowed, and OK for the agent of an opt-in directory (or similar service) to unconditionally fetch a URL that has been submitted to it by the Webmaster who controls that URL. What we're discussing here is to which class an agent that gets its URL-list by public submissions ("diggs") belongs. In my opinion, it belongs in the "search engine" class because the Webmaster did not opt-in, so such agents should fetch and obey robots.txt.

---

pigmej,

By a process of self-selection, a majority of participants in this forum are Webmasters whose sites have been subjected to on-going daily abuse by hundreds of unknown or malicious user-agents. Each of us weighs the cost of allowing unknown and potentially-malicious user-agents to access our content. That cost is the risk that the user-agent is a scraper that will re-publish our pages (usually in order to support paid-advertising or perhaps as bait to attract visitors to a site whose purpose is to download malicious code to the client), plus the cost of the bandwidth and server resources, plus the work to add that user-agent to either a white-list or black-list of approved or denied user-agents. Balanced against that is the potential benefit of getting additional referral traffic if the user-agent is legitimate.

Many of us here get thousands of hits per day from unknown, unwelcome, or malicious user-agents; In some niche-site cases, there may be more of these hits than there are legitimate visitors. This can be a huge waste of server resources, as well as cluttering up the server logs with junk requests.

So although "I just fetch the home page" sounds OK to you, imagine (because it's likely true) that there are hundreds or even thousands of automated agents doing exactly the same thing. You can't then blame Webmasters for asking, "What benefit do I get for serving all of these requests every day?"

This is especially true if the target sites are hosted by companies that enforce bandwidth limits, or put hundreds of sites on the same low-performance shared name-based virtual servers; Handling all of these "unprofitable" requests can interfere with the operation and viability of the site.

Some Webmasters make a choice on the side of simplicity and say, "If it is not a major search engine and it is not a legitimate browser with a human at the keyboard, then it's denied." This may not be fair, but from the Webmaster's point of view it may represent the best return on time spent maintaining access control lists, and on server resources used. So, authors of crawlers like yours should respect that decision -- either on the principle of mutual respect or from a self-preservation standpoint; If you don't respect that decision, then your user-agent, IP address, IP range, hosting company, or even your entire country may be added to a widely-published "Deny-from" list, and you will see the "completeness" of your index suffer as a result. At least one of the participants here has a "reach" of many thousands of Web sites, so this is not as trivial as it may appear to be as one little discussion in "just one little forum."

I recommend fetching and checking robots.txt. And do be sure that your code correctly parses all valid robots.txt constructs as documented in the Standard -- Including the "allow some, deny the rest" construct and multiple-user-agent policy records. (And thanks!)
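
As a quick self-test for those two constructs, here's a sketch using Python's stdlib parser (the records are invented for the test):

import urllib.robotparser

rules = """\
User-agent: wykop
Allow: /public/
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# "Allow some, deny the rest" for the named agent:
print(rp.can_fetch("wykop", "/public/page.html"))  # True
print(rp.can_fetch("wykop", "/anything-else"))     # False

# The catch-all record applies to every other agent:
print(rp.can_fetch("otherbot", "/private/x"))         # False
print(rp.can_fetch("otherbot", "/public/page.html"))  # True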

Jim

enigma1

7:57 pm on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can see how they might fetch a robots.txt-Disallowed URL if the redirect comes from the server of the originally-requested URL and the target is on that same domain, but really, even if this is an error, it is an edge case that is only loosely-related to the subject at hand.

Actually, I've witnessed this with Google across different domains. In fact, it was sent from a page on domain A into a disallowed URL on domain B, which finally got the bot banned.

This in turn showed up in the search results a few days later, where the domain B pages were indexed with a ban message.

Now imagine competing sites: if one figures out that the other uses an IP trap or similar mechanism, he can set up an image or page of some sort to force the browser/spider to redirect while keeping the accesses on his own site (e.g. embedding a tiny image in his pages that points inside domain B's restricted URL). It has happened before, and it can easily be forced with this approach.

Take another example, this post here:
[webmasterworld.com...]
Say it lists a disallowed URL:
/forum/showthread.php?t=54321
An external site now posts, or forces via a redirect, a URL like this:
/forum/showthread.php?t=54321&block_me=now
What will happen is that the spider accesses the modified link, because it can't differentiate it even if it reads the target site's robots.txt in advance. So by injecting a simple parameter into the restricted link you can force the spider to follow it.

For these reasons I don't use robots.txt. Once restricted URLs are detected, they can be manipulated to serve your adversaries' interests.
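
For what it's worth, whether the injected parameter actually slips past a given spider depends on its matching rules: the original Standard specifies prefix matching, under which the variant still matches the Disallow line. A quick check with Python's stdlib parser, using the paths above:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /forum/showthread.php?t=54321
""".splitlines())

# Prefix matching: the injected parameter still matches the Disallow line.
print(rp.can_fetch("*", "/forum/showthread.php?t=54321"))               # False
print(rp.can_fetch("*", "/forum/showthread.php?t=54321&block_me=now"))  # False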

pigmej

10:05 pm on Aug 9, 2009 (gmt 0)

10+ Year Member



jdMorgan,

Only ONE hit, no more hits in the following days, etc. A fixed version is already online (0.38).