[Varnish] varnishstat – Just another Sys Admin blog… wait really ?

varnishstat is a tool used to monitor the basic health of Varnish. Unlike the other Varnish tools, it doesn’t read log entries but displays statistics from a running varnishd instance. It can be used to determine request rate, memory usage and thread usage.

Output

Like top, the command output is updated in real time. Data is displayed as a table. The first column is the raw value of the counter. For example, for the ‘cache_hit’ counter, this is the total number of cache hits since varnishd was started. The second column is the counter change per second. The third is the average change per second since varnishd was started. The next three are also per-second averages, taken over 10, 100 and 1000 seconds respectively.

Note that you can use the -1 option for non-‘interactive’ use. In this case varnishstat lists all stats and exits immediately.
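If you want to feed these numbers to a script, the one-shot output can be turned into a dictionary. The sketch below assumes each line has whitespace-separated columns (name, value, rate, description) and that newer Varnish versions prefix names with a section such as MAIN., which is stripped so the names match the ones used in this post. The function name parse_varnishstat and the sample text are made up for illustration and do not come from a real server.

```python
def parse_varnishstat(text):
    """Parse `varnishstat -1` style output into {counter_name: int_value}.

    Assumes whitespace-separated columns: name, value, rate, description.
    A leading "MAIN." prefix is stripped so names match the ones used here.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, raw_value = fields[0], fields[1]
        if name.startswith("MAIN."):
            name = name[len("MAIN."):]
        try:
            counters[name] = int(raw_value)
        except ValueError:
            continue  # value column is not an integer; ignore the line
    return counters


# Made-up sample output, for illustration only.
SAMPLE = """\
MAIN.client_req          1200         4.00 Good client requests received
MAIN.cache_hit            900         3.00 Cache hits
MAIN.cache_miss           300         1.00 Cache misses
MAIN.sess_dropped           0         0.00 Sessions dropped for thread
"""

counters = parse_varnishstat(SAMPLE)
print(counters["cache_hit"], counters["cache_miss"])
```

On a live server the text would come from running varnishstat -1 yourself, for example with subprocess.run(["varnishstat", "-1"], capture_output=True, text=True).stdout.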

Interesting counters

There are a lot of counters to watch, but the key metrics can be divided into four categories:

Client: client connections and requests
Cache: cache hits, evictions
Thread: thread creation, failures, queues
Backend: success, failure and health

Client metrics

client_req: this counter displays the number of requests you’re receiving; its change-per-second column gives the request rate. Monitoring this metric can alert you to spikes in incoming web traffic, whether legitimate or nefarious.

sess_dropped: once Varnish is out of worker threads, each new request is queued up and the sess_queued counter is incremented. When the queue is full, new incoming requests are dropped and sess_dropped is incremented. If sess_dropped isn’t equal to zero, either your Varnish is overloaded or its thread pool is too small.
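Since these counters are cumulative, a monitoring script usually compares two snapshots. Here is a minimal sketch of that idea, assuming the snapshots are plain dictionaries of counter values. The function names, the max_req_rate threshold and the sample numbers are all hypothetical; pick values that make sense for your own traffic.

```python
def per_second_rate(previous, current, interval_seconds):
    """Average change per second of a cumulative counter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current - previous) / interval_seconds


def client_alerts(before, after, interval_seconds, max_req_rate):
    """Return a list of warning strings based on two counter snapshots.

    max_req_rate is a hypothetical threshold you would choose for your site.
    """
    alerts = []
    req_rate = per_second_rate(
        before["client_req"], after["client_req"], interval_seconds
    )
    if req_rate > max_req_rate:
        alerts.append(f"client_req spike: {req_rate:.1f} req/s")
    dropped = after["sess_dropped"] - before["sess_dropped"]
    if dropped > 0:
        alerts.append(
            f"{dropped} sessions dropped: overloaded or thread pool too small"
        )
    return alerts


# Made-up snapshots taken 10 seconds apart.
before = {"client_req": 10_000, "sess_dropped": 0}
after = {"client_req": 16_000, "sess_dropped": 3}
alerts = client_alerts(before, after, interval_seconds=10, max_req_rate=500)
for message in alerts:
    print(message)
```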

Cache performance metrics

Using the cache_hit and cache_miss counters, you can calculate the cache hit ratio:

ratio = cache_hit / (cache_hit + cache_miss)

This derived metric provides visibility into the effectiveness of the cache. The higher the ratio, the better. A ratio above 0.7 is considered ‘good’. If your ratio is ‘bad’, you should check which objects aren’t cached and why.
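The formula above is simple enough to script. A small sketch, assuming you already have the two counter values as integers; the function name cache_hit_ratio and the sample numbers are illustrative only.

```python
def cache_hit_ratio(cache_hit, cache_miss):
    """Return cache_hit / (cache_hit + cache_miss), or None with no lookups yet."""
    total = cache_hit + cache_miss
    if total == 0:
        return None  # avoid dividing by zero right after startup
    return cache_hit / total


# Made-up counter values.
ratio = cache_hit_ratio(900, 300)
print(f"hit ratio: {ratio:.2f}")
print("good" if ratio > 0.7 else "bad")
```

Keep in mind that the raw counters cover the whole time since varnishd was started, so this gives a lifetime average. To see how the cache is doing right now, apply the same formula to the difference between two snapshots.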

n_lru_nuked: the LRU (Least Recently Used) nuked objects counter should be watched closely. If the counter value increases a lot, that probably means Varnish is evicting objects at a faster rate than usual because of a memory shortage. In this case you may want to increase the cache size if possible.

Thread metrics

threads_failed: the number of times varnishd unsuccessfully tried to create a thread. A value greater than zero likely indicates that you have reached the server limits. It can also happen if you try to spawn a huge number of threads in a short time. The latter case usually occurs right after Varnish is started, and can be corrected by increasing the thread_pool_add_delay value.

threads_limited: the number of times a thread needed to be created but couldn’t be, because varnishd had already maxed out its capacity. If you have a value greater than zero and still have resources available, you should increase the thread_pool_max value.
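The two thread rules above can be turned into a quick check. This is only a sketch, assuming the counters are available as a dictionary; the function name thread_advice and the sample values are hypothetical, and the messages simply repeat the advice given in this post.

```python
def thread_advice(counters):
    """Map thread counters to the tuning hints described above."""
    advice = []
    if counters.get("threads_failed", 0) > 0:
        advice.append(
            "threads_failed > 0: check server limits, "
            "or increase thread_pool_add_delay"
        )
    if counters.get("threads_limited", 0) > 0:
        advice.append(
            "threads_limited > 0: increase thread_pool_max if resources allow"
        )
    return advice


# Made-up counter values.
sample = {"threads_failed": 0, "threads_limited": 12}
hints = thread_advice(sample)
for hint in hints:
    print(hint)
```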

Backend metrics

backend_fail: the number of backend connection failures. This counter should be very close to zero. If it isn’t, it could mean you have:

network issues
overloaded/laggy backend (time to first byte or between bytes exceeded)
unresponsive backend

Further Reading and sources

https://www.datadoghq.com/blog/top-varnish-performance-metrics/