Forum Moderators: open


Facebook gets value of robots.txt

Not a Spider Report, just info


tangor

12:22 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others.

Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one developer for crawling, before adding new terms of service that barred scraping without the company's written permission. Some — including programmer and blogger Pete Warden, the man Facebook threatened to sue — had complained that the social networking site was breaking the rules of the interwebs. The site was allowing unfettered crawling, but the company's legal team was not.

[theregister.co.uk...]

If the mods allow, this post is offered as further proof of why those of us who actively support this forum are so interested in robots.txt, spiders, bots, and the rest. One has to wonder why it took Facebook so long to WHITELIST their robots.txt!
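For anyone who hasn't built one, a whitelist-style robots.txt takes roughly this shape. A rough sketch only; the crawler tokens are illustrative and this is not Facebook's actual file:

    # Named crawlers get in; an empty Disallow means "allow everything".
    User-agent: Googlebot
    Disallow:

    User-agent: bingbot
    Disallow:

    # Everyone not matched by a group above is shut out entirely.
    User-agent: *
    Disallow: /

A compliant crawler obeys the most specific group that matches its name, so the listed bots never see the catch-all block.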

mack

12:50 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In the case of Facebook this will probably add up to a real saving in bandwidth, and ultimately in the costs associated with that bandwidth.

Mack.

tangor

2:07 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



mack nailed it, I think, for most of us. It is ultimately the cost of bandwidth (which varies for each of us), and secondarily the scraping of content, the riding of coattails, etc. In the case of Facebook... that number has to be immense.

mack

5:39 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm kinda on the fence when it comes to whitelisting.

The good: If every site were to use a whitelist, the Internet as a whole would be faster.

The bad: If all sites were to use whitelists, it would be virtually impossible for a start-up search engine to get anywhere.

Mack.

Staffa

5:56 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> it would be virtually impossible for a start-up search engine to get anywhere.

Not really. Even though they are not yet on the list, their visit will leave a trace in the logs, and all they have to do is identify themselves in their UA and give a clear description of their goals on their site. Then we can check them out and decide whether or not to whitelist them as well.
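Spotting those traces can be as simple as pulling robots.txt fetches out of the access log and keeping the agents you haven't listed yet. A minimal sketch, assuming a combined-format access log named access.log; the whitelist tokens are illustrative:

    # Find crawlers that fetched robots.txt but are not yet whitelisted.
    import re
    from collections import Counter

    WHITELISTED = ("Googlebot", "bingbot", "YandexBot", "Baiduspider")
    UA_AT_END = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

    candidates = Counter()
    with open("access.log") as log:
        for line in log:
            # Well-behaved new bots announce themselves by fetching robots.txt.
            if "GET /robots.txt" not in line:
                continue
            match = UA_AT_END.search(line)
            if match and not any(tok in match.group(1) for tok in WHITELISTED):
                candidates[match.group(1)] += 1

    # Most frequent unlisted agents first: candidates to investigate.
    for ua, hits in candidates.most_common():
        print(f"{hits:6d}  {ua}")

Anything that turns up with a real crawler name and a URL in its UA string is worth checking out; anything anonymous probably isn't.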

tangor

6:13 am on Jul 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Amen

Status_203

8:39 am on Jul 2, 2010 (gmt 0)

10+ Year Member



>> Not really. Even though they are not yet on the list, their visit will leave a trace in the logs, and all they have to do is identify themselves in their UA and give a clear description of their goals on their site. Then we can check them out and decide whether or not to whitelist them as well.


...and obey robots.txt
...and be absolutely identifiable via round-trip DNS.
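That round-trip check (reverse-resolve the IP, verify the hostname belongs to the claimed domain, then forward-resolve it and make sure you get the same IP back) is usually called forward-confirmed reverse DNS. A minimal sketch in Python; the IP and domain suffix in the usage note are illustrative:

    # Forward-confirmed reverse DNS: the "round trip" bot check.
    import socket

    def verify_crawler(ip, claimed_suffix):
        """True if ip reverse-resolves into claimed_suffix AND the
        resulting hostname forward-resolves back to the same ip."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
            if not host.endswith(claimed_suffix):
                return False                             # wrong domain
            _, _, addrs = socket.gethostbyname_ex(host)  # forward (A) lookup
            return ip in addrs                           # must round-trip
        except (socket.herror, socket.gaierror):
            return False                                 # no PTR or no A record

    # Illustrative usage: a visitor claiming to be Googlebot should pass
    # verify_crawler("66.249.66.1", ".googlebot.com")

A UA string alone proves nothing, since anyone can send "Googlebot"; the DNS round trip is what makes the claim checkable.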