Protocol Scalability

Adam Crain September 19, 2022

Building systems that scale is becoming critically important as the power grid undergoes a rapid transformation to meet the challenge of renewables and climate change. Software architecture is crucial to scaling communications solutions for emerging applications like distributed energy resources (DERs).

This blog post discusses how software architecture affects the scalability of communication protocols and provides performance data for our protocol offerings.

Parallelism vs Concurrency

The Internet experienced its largest growth in user base from 1995 to 2000. At the time, desktop and server computers typically had a single CPU. Operating systems like Linux and Windows allowed multiple applications to share that CPU using preemptive multitasking. Sharing a CPU in this manner is an example of concurrency: allowing multiple tasks to run in overlapping periods of time.

In the early 2000s, systems with more than one physical processor or core began to emerge. Operating systems could now run more than one task at the same time: systems were both concurrent and parallel.

As the laws of physics began to limit the increase in clock frequency of CPUs, it became increasingly important to design software that could actually make use of multi-core systems. In 2022, it’s not uncommon to see large servers with 64 or 128 cores.

Case Study: Apache vs Nginx

Software architecture has a massive impact on several aspects of performance, including:

  1. Throughput - How many transactions per second can the server handle?
  2. Concurrency - How does throughput change as the number of concurrent communication sessions increases?
  3. Memory usage - How does the memory usage of the application change as concurrency varies?

As the scale of the Web exploded around Y2K, people began to realize that existing webservers simply were not up to the task. Dan Kegel framed this as the C10K problem: how do we design webservers that gracefully handle 10,000 concurrent sessions? The Apache webserver used a process-per-connection concurrency model that fell apart as the number of connections increased. In short, dedicating an operating system process or thread to each connection was not the right way to achieve concurrency at this scale.

The answer to this problem was to move to asynchronous programming. Operating systems exposed readiness APIs like select and, later, the far more scalable epoll, which allowed network sockets to be used in a non-blocking manner. A single thread could now efficiently handle a large number of concurrent requests. Webservers like Nginx were designed around these asynchronous APIs and could gracefully scale concurrency without requiring additional OS threads.
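
To make this concrete, here is a minimal sketch of the pattern in Rust using Tokio (my illustration, not code from Nginx or our libraries): a single OS thread multiplexes many TCP connections, with the operating system's readiness API (epoll on Linux) doing the heavy lifting.

```rust
// Illustrative sketch only. Cargo.toml: tokio = { version = "1", features = ["full"] }
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

fn main() -> std::io::Result<()> {
    // A single-threaded runtime: one OS thread drives every connection,
    // with readiness delivered by the OS rather than by blocking one
    // thread per socket.
    let runtime = tokio::runtime::Builder::new_current_thread()
        .enable_io()
        .build()?;

    runtime.block_on(async {
        let listener = TcpListener::bind("127.0.0.1:8080").await?;
        loop {
            let (mut socket, _) = listener.accept().await?;
            // Each connection becomes a cheap task, not an OS thread.
            tokio::spawn(async move {
                let mut buf = [0u8; 1024];
                // Echo until the peer disconnects or an error occurs.
                while let Ok(n) = socket.read(&mut buf).await {
                    if n == 0 || socket.write_all(&buf[..n]).await.is_err() {
                        break;
                    }
                }
            });
        }
    })
}
```

The key point is that accepting a new connection adds a lightweight task to the event loop rather than a new OS thread, so per-connection overhead stays small as concurrency grows.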

What does this mean for power systems?

I’ve seen some truly atrocious performance numbers for utility SCADA systems over the course of my career. In 2009, I worked on a team that built an advanced application for a major investor-owned utility. The application implemented rolling blackouts if the utility were to ever lose a major generation resource.

The front-end processor (FEP) could only handle roughly 100-200 DNP3 TCP/IP communication sessions per server before its performance became unstable. The utility had thousands of substations across a multi-state region, so it needed a dozen server instances to cover all the primaries and backups.

This seemed silly to me at the time. After all, DNP3 is a lightweight protocol compared to HTTPS, and Nginx can handle thousands of simultaneous connections. Why did this SCADA platform scale so poorly? The answer was that it was built on a thread-per-connection architecture. As we build out larger and larger systems in electric power, we can't afford to throw more and more hardware at bad software architectures.

Benchmarking DNP3

Our protocol stacks are designed for multi-core systems. You can read more about how we approach multitasking and concurrent communication sessions in our documentation. The general idea is that tasks share threads, and we never use more threads than the native parallelism of the system, i.e., the number of cores. As we'll explore in this post, this makes our solutions scalable and efficient.
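
As a rough sketch of that principle (not our library's actual startup code), a Tokio runtime can be capped at the machine's core count, after which any number of sessions run as lightweight tasks on that fixed pool:

```rust
// Illustrative sketch only. Cargo.toml: tokio = { version = "1", features = ["full"] }
use std::thread;

fn main() {
    // Query the machine's native parallelism (number of logical cores).
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Cap the worker pool at the core count. Tokio's multi-threaded
    // runtime defaults to this anyway; making it explicit shows the idea.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(cores)
        .enable_all()
        .build()
        .expect("failed to build runtime");

    runtime.block_on(async {
        // Spawn as many communication sessions as needed here; they run
        // as lightweight tasks sharing the fixed pool of worker threads.
    });
}
```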

Our DNP3 library includes a performance benchmark that was used to produce all of the data in this post. The application has a number of variables that can be adjusted:

  • Number of communication sessions (concurrency)
  • Number of worker threads (parallelism)
  • Test duration, type of communication, etc.

The benchmarks in this post connected N master sessions to N outstation sessions. Each pair independently exchanged unsolicited measurement data, with each message confirmed by the master. The throughput of each test is the number of unsolicited messages exchanged per second.

[Diagram: N master sessions paired with N outstation sessions]
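
The overall shape of such a harness looks roughly like the following runnable sketch. The simulated exchange is a stand-in for real DNP3 I/O, and the structure is illustrative rather than the benchmark's actual code:

```rust
// Illustrative harness only; not the dnp3 benchmark's actual code.
// Cargo.toml: tokio = { version = "1", features = ["full"] }
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let sessions = 256; // concurrency: number of master/outstation pairs
    let exchanged = Arc::new(AtomicU64::new(0));

    for _ in 0..sessions {
        let exchanged = exchanged.clone();
        // Stand-in for one master/outstation pair exchanging unsolicited
        // measurement data; a real pair would perform DNP3 I/O here.
        tokio::spawn(async move {
            loop {
                tokio::task::yield_now().await; // simulate one exchange
                exchanged.fetch_add(1, Ordering::Relaxed);
            }
        });
    }

    // Run for a fixed duration, then report messages per second.
    let start = Instant::now();
    tokio::time::sleep(Duration::from_secs(10)).await;
    let msgs = exchanged.load(Ordering::Relaxed) as f64;
    println!("throughput: {:.0} messages/sec", msgs / start.elapsed().as_secs_f64());
}
```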

The benchmarks were performed on a capable workstation with 12 physical cores and 24 logical cores.

Throughput

In the first test, I allowed the library to initialize the default number of worker threads (24) and varied the number of concurrent communication sessions. Throughput continues to increase with concurrency, even beyond the number of logical cores, but eventually it starts to plateau.

[Chart: Throughput]

In the second test, I artificially limited the number of worker threads (parallelism) at two different levels of concurrency: 64 and 256 sessions. There is a distinct kink in each curve around 12 worker threads, where the slope decreases. This makes sense: the CPU has 12 physical cores, and hyper-threading provides additional parallelism beyond that point, but not at the same rate.

[Chart: MemoryPerSession]

The final test used Valgrind's heap profiler (Massif) to measure the peak heap usage of the application as I doubled the number of concurrent communication sessions. In my last post, I hinted that the memory per session would actually decrease as the number of communication sessions increased. This occurs because the sessions share the pool of worker threads and other Tokio resources.

A single master/outstation pair uses ~240 kB of memory, but as you add sessions, the memory per session converges to ~130 kB.
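
One way to read these numbers is with a simple model (my framing, not from the benchmark itself): a fixed pool of shared runtime resources with footprint F, plus a marginal cost c per session.

```latex
M(N) = F + cN
\quad\Longrightarrow\quad
\frac{M(N)}{N} = \frac{F}{N} + c \;\longrightarrow\; c \quad (N \to \infty)
```

With c ≈ 130 kB and M(1) ≈ 240 kB, the shared overhead F is roughly 110 kB, and it is amortized across more and more sessions as N grows.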

Summary

Our protocol libraries scale in terms of both concurrency and hardware parallelism. You don't need a giant server, however, to take advantage of this performance: you can run highly concurrent communication servers on single-core embedded Linux platforms using a single worker thread. This is why we say "scale up or down" in our marketing.