So far this is the most attended session. It was standing room only before it even started.
What do Facebook sysadmins have to support?
- 700 million minutes spent on fb each month
- 6 billion pieces of content updated
- 3 billion photos
- 1 million connect implementations
- 1/2 billion active users
- fb reached a limit on leasing datacenter space
- fb is building its own datacenter: http://www.facebook.com/prinevilledatacenter
- currently serving out of California and Virginia
Originally Facebook was a simple Apache and PHP site. When fb started hitting the limits of this setup, they started compiling PHP into C++ (HipHop for PHP).
FB claims to be the biggest memcached deployment in the world. They serve 300 terabytes of memcached data out of memory.
Among the MySQL improvements contributed back is flashcache.
- News Feed
- PHP – front end
- Erlang (chat room)
For Systems, what does fb have to worry about on a daily basis?
- Data management
- Core operating system updates
- Configuration Management
- CFengine for system management
- On Demand
- Web Push – new code gets deployed to fb at least once a day. It's a coordinated push: everyone is aware of it and the dev team is notified. Everyone sits on IRC during the push. The process is understood by engineers and the rest of the company
- push software built over on-demand control tools
- code distributed via internal BitTorrent swarm
- PHP gets compiled, and the resulting binary of a few hundred MB gets rapidly pushed via BitTorrent.
- it takes one minute to push across the entire network
- Backend Deployments – only Engineering and Operations. Engineers write, test and deploy
- Quickly make performance decisions
- Expose changes to subset of real traffic
- No ‘commit and quit’
- Deeply involved in moving services to production
- Ops ‘embedded’ into engineering teams
- Heavy change logging – pinpointing the code behind every push and change
- Ganglia – aggregated metrics
- nested grids & pools
- over 5 million monitored metrics
- Facebook in-house monitoring system
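The "expose changes to subset of real traffic" item above can be illustrated with a small percentage-rollout check. This is only a sketch of one common way to do it, not a description of fb's actual tooling: the function name `in_rollout`, the feature name, and the choice of SHA-256 bucketing are all assumptions made for illustration.

```python
import hashlib


def in_rollout(user_id: int, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hypothetical helper: hashes (feature, user_id) into one of 10,000
    buckets so the same user always gets the same answer for a feature.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable bucket number.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=1.5 admits buckets 0..149, i.e. 1.5% of all buckets.
    return bucket < percent * 100


if __name__ == "__main__":
    # Hypothetical feature name; count how many of 10,000 users see it at 5%.
    exposed = sum(in_rollout(uid, "new_feed_ranking", 5) for uid in range(10_000))
    print(f"{exposed} of 10000 users exposed")
```

Because the bucket depends only on the feature name and the user id, raising the percentage only ever adds users, so nobody flips back and forth as the exposure grows.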
To manage complexity and the sheer number of alarms and monitored systems, the fb team uses aggregation. Initially, alarms were managed by email.
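As a rough illustration of what alarm aggregation can look like, the sketch below rolls per-host alarms up into a single cluster-level alert once a large enough share of a cluster is failing. The talk did not show how fb does this; the function name `aggregate_alarms`, the 20% threshold, and the host and cluster names are all assumptions.

```python
from collections import defaultdict


def aggregate_alarms(failing, cluster_sizes, threshold=0.2):
    """Collapse per-host alarms into per-cluster alerts where possible.

    failing: iterable of (cluster, host) pairs for hosts currently alarming.
    cluster_sizes: dict mapping cluster name to its total host count.
    threshold: fraction of a cluster that must fail before the alarms
               are rolled up into one cluster-level alert (assumed value).
    """
    by_cluster = defaultdict(set)
    for cluster, host in failing:
        by_cluster[cluster].add(host)

    alerts = []
    for cluster in sorted(by_cluster):
        hosts = by_cluster[cluster]
        total = cluster_sizes[cluster]
        if len(hosts) / total >= threshold:
            # Many hosts failing together: report the cluster once.
            alerts.append(f"CLUSTER {cluster}: {len(hosts)}/{total} hosts failing")
        else:
            # Isolated failures: keep the individual host alerts.
            alerts.extend(f"HOST {cluster}/{h} failing" for h in sorted(hosts))
    return alerts


if __name__ == "__main__":
    failing = [("web", "web001"), ("web", "web002"), ("web", "web003"), ("db", "db007")]
    for line in aggregate_alarms(failing, {"web": 10, "db": 100}):
        print(line)
```

The point of the design is that one widespread problem produces one message instead of hundreds, which is what makes alarm-by-email stop scaling in the first place.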
Scribe – high-performance logging application. Initially fb used syslog. Hadoop and Hive are used as well.
How does it work and get done?
- clear delineation of dependencies and responsibilities
- Constant Failure
- Servers were the first line of defense, then started focusing on racks
- Now is focused on clusters. Logical delineation based on function (web, db, feed, etc)
- Next stage is datacenters – what to do if a natural disaster strikes?
- Constant Communication – information is shared constantly.
- lots of automated bots that get and set data
- internal news updates
- “Headers” on internal tools
- Change log/feeds
- Small teams