Forum Moderators: open


Facebook gets value of robots.txt

Not a Spider Report, just info


tangor

12:22 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others.

Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one developer for crawling, before adding new terms of service that barred scraping without the company's written permission. Some — including programmer and blogger Pete Warden, the man Facebook threatened to sue — had complained that the social networking site was breaking the rules of the interwebs. The site was allowing unfettered crawling, but the company's legal team was not.

[theregister.co.uk...]

If the mods allow, this post is offered as further proof of why those of us who actively support this forum are so interested in robots.txt, spiders, bots, and the rest. One has to wonder why it took Facebook so long to WHITELIST their robots.txt!
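For anyone who hasn't built one, a whitelist-style robots.txt takes roughly this shape. A rough sketch only; the crawler tokens are illustrative and this is not Facebook's actual file:

    # Named crawlers get in; an empty Disallow means "allow everything".
    User-agent: Googlebot
    Disallow:

    User-agent: bingbot
    Disallow:

    # Everyone not matched by a group above is shut out entirely.
    User-agent: *
    Disallow: /

A compliant crawler obeys the most specific group that matches its name, so the listed bots never see the catch-all block.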

mack

12:50 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In the case of Facebook this will probably add up to a real saving in bandwidth, and ultimately in the costs associated with that bandwidth.

Mack.

tangor

2:07 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



mack nailed it, I think, for most of us. It is ultimately the cost of bandwidth (which varies for each of us), and secondarily the scraping of content, the riding of coattails, etc. In the case of Facebook... that number has to be immense.

mack

5:39 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm kinda on the fence when it comes to whitelisting.

The good: If every site were to use a whitelist, the Internet as a whole would be faster.

The bad: If all sites were to use whitelists, it would be virtually impossible for a start-up search engine to get anywhere.

Mack.

Staffa

5:56 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> it would be virtually impossible for a start-up search engine to get anywhere.

Not really. Even though they are not yet on the list, their visit will leave a trace in the logs, and all they have to do is identify themselves in their UA and give a clear description of their goals on their site. Then we can check them out and decide whether or not to whitelist them as well.
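Spotting those traces can be as simple as pulling robots.txt fetches out of the access log and keeping the agents you haven't listed yet. A minimal sketch, assuming a combined-format access log named access.log; the whitelist tokens are illustrative:

    # Find crawlers that fetched robots.txt but are not yet whitelisted.
    import re
    from collections import Counter

    WHITELISTED = ("Googlebot", "bingbot", "YandexBot", "Baiduspider")
    UA_AT_END = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

    candidates = Counter()
    with open("access.log") as log:
        for line in log:
            # Well-behaved new bots announce themselves by fetching robots.txt.
            if "GET /robots.txt" not in line:
                continue
            match = UA_AT_END.search(line)
            if match and not any(tok in match.group(1) for tok in WHITELISTED):
                candidates[match.group(1)] += 1

    # Most frequent unlisted agents first: candidates to investigate.
    for ua, hits in candidates.most_common():
        print(f"{hits:6d}  {ua}")

Anything that turns up with a real crawler name and a URL in its UA string is worth checking out; anything anonymous probably isn't.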

tangor

6:13 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Amen

Status_203

8:39 am on Jul 2, 2010 (gmt 0)

10+ Year Member



>> Not really. Even though they are not yet on the list, their visit will leave a trace in the logs, and all they have to do is identify themselves in their UA and give a clear description of their goals on their site. Then we can check them out and decide whether or not to whitelist them as well.


...and obey robots.txt
...and be absolutely identifiable via round-trip DNS.
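That round-trip check (reverse-resolve the IP, verify the hostname belongs to the claimed domain, then forward-resolve it and make sure you get the same IP back) is usually called forward-confirmed reverse DNS. A minimal sketch in Python; the IP and domain suffix in the usage note are illustrative:

    # Forward-confirmed reverse DNS: the "round trip" bot check.
    import socket

    def verify_crawler(ip, claimed_suffix):
        """True if ip reverse-resolves into claimed_suffix AND the
        resulting hostname forward-resolves back to the same ip."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
            if not host.endswith(claimed_suffix):
                return False                             # wrong domain
            _, _, addrs = socket.gethostbyname_ex(host)  # forward (A) lookup
            return ip in addrs                           # must round-trip
        except (socket.herror, socket.gaierror):
            return False                                 # no PTR or no A record

    # Illustrative usage: a visitor claiming to be Googlebot should pass
    # verify_crawler("66.249.66.1", ".googlebot.com")

A UA string alone proves nothing, since anyone can send "Googlebot"; the DNS round trip is what makes the claim checkable.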