Tuning the Chef Server for Scale

In Chef’s Customer Engineering team we are frequently asked for advice on tuning the Chef Server for high-scale situations. Below is a summary of what we generally tell customers. Note that these tuning settings are specific to Chef Server 12, which is the recommended version for any customer who cares about the performance of their Chef server.

General Advice

Understand the OSS components that make up the Chef Server

A good way to think about the Chef server is as a collection of microservice components underpinned by OSS software:

  • Nginx (openresty)
  • PostgreSQL
  • Solr
  • RabbitMQ
  • Redis
  • Chef
  • Erlang/OTP
  • Ruby
  • Runit
  • The Linux Kernel
    • LVM
    • Storage subsystem
    • Network stack

It’s important to understand the performance characteristics, monitoring, and troubleshooting of these components, especially Postgres, Solr, RabbitMQ, Runit and Linux systems in general. It’s worth noting that the Chef server core is open source, and all of its code can be examined on GitHub.

Because these components are glued together using Chef, it’s highly recommended that you familiarize yourself with the cookbooks that configure the Chef server when you run chef-server-ctl reconfigure.

Have good monitoring in place

We don’t provide prescriptive monitoring guidance at this time, but here’s our advice:

  • Use existing open source software (Sensu, Nagios, etc.) to collect metrics and test the health of the OSS components. This should be fairly straightforward to set up.
  • Configure your monitoring systems and load balancers to query the health status endpoint of erchef (https://mychefserver/_status).
  • Run a Graphite server. erchef will send detailed statistics if you set the following in your chef-server.rb file:
    
    folsom_graphite['enabled'] = true
    folsom_graphite['host'] = 'graphite.mycompany.com'
    folsom_graphite['port'] = 2003
    
  • Use Splunk or Logstash to collect and analyze your Chef server logs.
    • You can collect and graph useful performance data from /var/log/opscode/opscode-erchef/requests.log.N, /var/log/opscode/oc_bifrost/requests.log.N and /var/log/opscode/opscode-reporting/requests.log.N
    • Each request line shows various performance counters (times in ms). For example: req_time=20; rdbms_time=2; rdbms_count=3; authz_time=5; authz_count=1; (a parsing sketch follows this list)
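
As a starting point, here is a minimal Ruby sketch that pulls those counters out of an erchef request log so they can be graphed. The log path and the exact set of fields are assumptions based on the format shown above, so adjust them for your install:

# Extract per-request timing counters from an erchef request log.
# Assumes the "key=value;" format shown above; the path is an example.
LOG = ARGV[0] || "/var/log/opscode/opscode-erchef/requests.log.1"

File.foreach(LOG) do |line|
  counters = line.scan(/(\w+)=(\d+);/).to_h   # e.g. {"req_time"=>"20", ...}
  next unless counters.key?("req_time")
  puts format("req_time=%sms rdbms_time=%sms authz_time=%sms",
              counters["req_time"],
              counters.fetch("rdbms_time", "0"),
              counters.fetch("authz_time", "0"))
end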

Think about API requests per second rather than node counts

A very common measurement for the size of Chef servers/clusters is the number of nodes they serve. However, this number is not terribly useful on its own, because other factors can cause very wide variation in load. Namely:

  • The interval and splay of Chef client runs
    • 1000 nodes converging every hour generate the same request load as 500 nodes converging every 30 minutes
    • Insufficient splay can cause a “stampede condition” on the Chef server. Splay should be equal to the interval in order to get maximum smoothness of request load (see the client.rb sketch after this list).
  • The number and complexity of search requests and databag fetches performed during each Chef run
  • The number of cookbooks depended on by each Chef run. More cookbooks add load to the depsolver and to the Bookshelf service, which serves cookbook files
  • The size of node data, which we’ve seen range from 32 KB to 5 MB (the default maximum is 1 MB but can be increased). Larger node objects add load to the indexing service (opscode-expander) as well as to Solr
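
For reference, here is what the interval/splay advice looks like in a chef-client client.rb. This is a minimal sketch; the 30-minute interval is just an example value:

# client.rb sketch: a 30-minute interval with splay equal to the interval,
# per the advice above, so check-ins are spread evenly across the window.
interval 1800   # seconds between chef-client runs (example value)
splay    1800   # random delay of 0..1800 seconds added before each run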

Although it’s not perfect, a good rule of thumb for examining active Chef servers is the number of API requests per second aggregated across the entire cluster. We’ve found that clusters sustaining higher than 125 API RPS started to experience occasional errors.
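
As a rough illustration, here is a back-of-the-envelope Ruby sketch for estimating aggregate API RPS. The node count, interval and per-run request count are made-up examples; the per-run figure in particular should be measured from your own erchef request logs:

# Back-of-the-envelope estimate of aggregate API requests per second.
# All three inputs are example values, not measurements.
nodes             = 5000      # nodes converging against the cluster
interval_seconds  = 1800      # chef-client interval (30 minutes)
api_calls_per_run = 15        # assumed average API requests per client run

estimated_rps = (nodes.to_f / interval_seconds) * api_calls_per_run
puts format("Estimated aggregate API RPS: %.1f", estimated_rps)   # => 41.7
puts "Above the ~125 RPS rule of thumb" if estimated_rps > 125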

DRBD: Don’t do it

In the field we’ve found that DRBD has a negative impact on performance and availability of Chef server clusters. Specifically:

  • Because DRBD uses synchronous replication, a block is not considered “committed to disk” until it has been confirmed by both nodes in the cluster. This adds significant latency to each IOP.
  • DRBD’s bandwidth is limited by the network throughput between the nodes. Dedicated cross-over links are not possible in all scenarios (for example VMs) which leads to low and inconsistent throughput.
  • DRBD resyncs can take a very long time and greatly impact performance while running.
  • Although DRBD protects against hardware failure, it does a very poor job of protecting against many classes of software failure. For example, a corrupt database is replicated whole to the other node, so failing over will not correct the system.

Beware LVM snapshots impact on performance

LVM is generally recommended for storing all Chef Server data (/var/opt/opscode in standalone/tier installs and /var/opt/opscode/drbd/data in HA installs) because it provides the ability to expand disks on the fly and create crash-consistent snapshots.

However, it’s important to know that LVM snapshots become very detrimental to performance as they grow in size.

Therefore it is recommended that snapshots be used to create consistent backups, but deleted as soon as they are no longer needed.

Chef Server tuning tips

Server sizing

Chef Server frontends:

  • Frontends run stateless services only (erchef, bifrost, reporting, manage) and can be scaled horizontally.
  • They are almost always CPU bound, and only suffer memory or disk pressure during fault scenarios (typically because of backend issues).
  • A good starting point for frontends is 4 CPU cores and 8 GB RAM. Disk on frontends does not matter.

Chef server backends:

  • Backends mix a number of disk-, memory- and CPU-bound services (Postgres, Solr, RabbitMQ, Expander)
  • A good starting point for backends is 8 CPU cores and 32 GB of RAM.
  • Flash-based storage is highly recommended, combined with the XFS filesystem and LVM.

chef-server.rb tuning settings

Database pooling:

In the Erlang/OTP process model, the number of workers is limited by the size of the database connection pool (default 20). Increasing the database pool allows for more workers, but puts added memory pressure on the database service.

To handle the greater number of connections, you must also increase the Postgres max_connections value. This value must account for the erchef, bifrost and reporting processes connecting from each frontend, plus an extra 20% for breathing room.

Suggested values for a high-performing cluster with 4-6 frontends:

postgresql['max_connections'] = 1024
opscode_erchef['db_pool_size'] = 40
oc_bifrost['db_pool_size'] = 40
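
As a sanity check on those numbers, the max_connections requirement can be worked out from the rule above. In this Ruby sketch the frontend count and the reporting pool size are assumptions, so substitute your own values:

# Rough max_connections check: (erchef + bifrost + reporting pools) per
# frontend, plus ~20% headroom. The reporting pool size is an assumption.
frontends      = 6
erchef_pool    = 40
bifrost_pool   = 40
reporting_pool = 20    # assumed; use your actual opscode-reporting pool size

required = ((erchef_pool + bifrost_pool + reporting_pool) * frontends * 1.2).ceil
puts "Need at least #{required} connections"          # => 720 for these inputs
puts "max_connections = 1024 leaves headroom" if required <= 1024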

Erchef-to-bifrost HTTP connection pool: erchef also maintains a pool of HTTP connections to bifrost, the authz service. It’s important to raise the initial and maximum number of connections in line with the database pool sizes.

oc_chef_authz['http_init_count'] = 100
oc_chef_authz['http_max_count'] = 100
oc_chef_authz['http_queue_max'] = 200

Erchef depsolver and keygen tuning: Two expensive computations that erchef must perform are dependency solving (handled by the depsolver, a Ruby process which resolves cookbook dependencies) and client key generation (which can be hit hard when large fleets of Chef nodes are provisioned). Note that Chef 12 clients default to client-side key generation, so you probably only need to adjust the keygen value if you still use Chef 11 clients.

Suggested values:

opscode_erchef['depsolver_worker_count'] = 4 # should equal the number of CPU cores
opscode_erchef['depsolver_timeout'] = 10000
opscode_erchef['keygen_cache_size'] = 1000

NEW IN CHEF SERVER 12.1.0: Bounded queueing for pooler. Several upstream services have their connections managed by pooler: sqerl (database connections), the depsolver workers and the authz pool (connections from erchef to bifrost). Currently, when any of erchef’s pools is exhausted, it throws a 500 error. Chef Server 12.1 added the ability to put a bounded queue in front of each pool, which greatly reduces error rates and also reduces the need for large connection pools (which are suboptimal for Postgres).

Queueing is disabled by default, but is enabled by setting the timeout value to > 0. When using queueing, it’s recommended to use a smaller pool size matched with a queue that is 1-2x the size of the pool.

# erchef database pooler queue
opscode_erchef['db_pool_queue_max'] = 40
opscode_erchef['db_pooler_timeout'] = 2000

# bifrost database pooler queue
oc_bifrost['db_pooler_timeout'] = 2000
oc_bifrost['db_pool_queue_max'] = 40

# erchef depsolver queue
opscode_erchef['depsolver_pool_queue_max'] = 10
opscode_erchef['depsolver_pooler_timeout'] = 100000

Nginx cookbook caching: A new feature in Chef Server 12.0.4 is Nginx cookbook caching. This takes load off of the backend Bookshelf service by having Nginx cache cookbook files.

Suggested values:

opscode_erchef['nginx_bookshelf_caching'] = ":on"
opscode_erchef['s3_url_expiry_window_size'] = "100%"

PostgreSQL tuning: We already tune PostgreSQL memory settings to sane values based on the backend’s physical RAM. For example, effective_cache_size is set to 50% of RAM, and shared_buffers to 25% of physical RAM.

To handle the heavy write load on large clusters, it is recommended to tune the checkpointer per https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server.

Finally, the log_min_duration_statement setting is super useful for the detection and postmortem analysis of performance issues. It is equivalent to the “slow query log” in MySQL. The setting below logs all queries that take longer than 1000 ms to complete.

Suggested values:

postgresql['checkpoint_segments'] = 64
postgresql['checkpoint_completion_target'] = 0.9
postgresql['log_min_duration_statement'] = 1000

Solr JVM tuning: By default we compute Solr’s JVM heap size to be either 25% of system memory or 1024 MB, whichever is smaller. Large Chef server clusters should increase this value to the smaller of 25% of system memory or 4096 MB. Extremely large and busy Chef clusters run successfully with an 8 GB Solr heap size.

Suggested values:

opscode_solr4['heap_size'] = 4096
opscode_solr4['new_size'] = 256
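
To make the heap-size rule above concrete, here is a tiny sketch of the calculation for a hypothetical 32 GB backend:

# Heap-size rule of thumb from above: 25% of system RAM, capped at 4096 MB
# for large clusters (the stock cap is 1024 MB). 32 GB of RAM is an example.
system_ram_mb = 32 * 1024
heap_mb = [system_ram_mb / 4, 4096].min
puts "opscode_solr4['heap_size'] = #{heap_mb}"   # => 4096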

WARNING: It is not recommended to use a JVM heap_size above 8 GB unless you have in-depth knowledge of JVM tuning combined with detailed JVM monitoring.
