Engineering
Building a logging infrastructure that supports separation and isolation: The long journey toward log drains
Ann Guilinger
Product & Engineering
Centralized logging isn’t a novel feature, but that doesn’t make it easy to build. It’s complex, tends to require a lot of iteration, and early architecture decisions can have long-term consequences you didn’t anticipate.
At Aptible, implementing centralized logging was even more difficult because of a decision to build a logging system that could support separation and isolation – a feature that many companies needed but few providers offered (or few built in a way that met customer needs).
As a PaaS provider, one of Aptible’s differentiators is that we offer security and compliance as easily and as seamlessly as we offer the rest of our platform. But as any developer who’s worked with compliance issues in an infrastructure context can tell you, it often feels like you’re trying to mix oil and water.
The core problem is that infrastructure is typically shared across applications. Shared infrastructure tends to offer greater scalability and reliability for each application, but compliance often requires separation and isolation to safeguard end-user information.
HIPAA, for example, imposes extensive logging requirements covering log compilation, storage, and assessment. HIPAA-compliant companies also have to audit their logs and follow parallel compilation, storage, and assessment rules for the audit logs themselves.
Typically, companies bolt compliance features on after the fact – often when a business development leader eventually notices how many healthcare and financial companies depend on legacy services. But we knew that this after-the-fact approach wouldn’t work for logging because the separation and isolation our initial customers sought needed to be built in from the beginning as an architectural requirement and first principle.
This design choice came to a head with our work on log drains. Our fellow PaaS provider Heroku invented the original concept. Offering log drains allows PaaS customers to collect logs from their applications and forward them to log destinations like Elasticsearch and Datadog, where they can do further analysis and monitoring.
In this article, I’m going to walk through the process of building log drains at Aptible. Along the way, I’ll describe the feature’s evolution from barely functional to sending logs to multiple destinations and remaining performant without hogging resources. I’ll also explain why it took four versions and the expansion of our team from one to all of our engineers to achieve our vision.
Why we needed centralized logging with isolation and separation
By offering log drains, PaaS providers enable customers to route the logs they output to the logging destinations they select. There, customers can review the logs, analyze them, archive them, and monitor them with alerts.
Most providers set up a shared infrastructure, so they can centralize logging. This approach makes sense in most contexts because centralized logs are easier to manage and interpret, and a shared infrastructure makes centralization simpler.
Because we wanted to make our product compelling – not only for small teams who might have a passing interest in logs but also for enterprises and companies with complex regulatory requirements – we knew we needed to build log drains with isolation and separation as a first principle.
In the typical approach to logs and log drains, logging infrastructure is shared and one company's logs are processed alongside another's. This is fine for many companies, but we believed that current and future customers would not be okay with their logs sitting so close to one another.
To be clear, shared logging infrastructure is not supposed to contain any sensitive data. But when you’re trying to support enterprise privacy and security needs, you’re less interested in following the rules and more interested in assuring customers that the rules are built into the system itself.
We wanted customers to know we weren’t just ‘avoiding’ cross-pollination—we had made it impossible.
v0: Remote logging with syslog
In the early days of his journey with Aptible, our founder, Frank Macreery, built the first version of our logging system.
This early version was simple but functional: it made logging, a necessary and minimally viable feature, possible. The system used syslog to enable remote logging for our customers, but until we introduced the next version, customers could only configure one destination per account.
That meant all of an account's apps, however many there were, shared a single log destination, and customers could only use log management products that accepted syslog.
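To make that constraint concrete, here is a minimal sketch, not our actual implementation, of what forwarding application logs to a single remote syslog destination looks like with Python's standard-library SysLogHandler. The host and port are placeholders.

```python
import logging
from logging.handlers import SysLogHandler

# Hypothetical values: the one syslog destination configured for the whole account.
DRAIN_HOST, DRAIN_PORT = "logs.example.com", 514

handler = SysLogHandler(address=(DRAIN_HOST, DRAIN_PORT))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every app on the account funnels into the same destination.
logger.info("request handled in 42ms")
```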
It was good enough to get us started, but we knew a new version would soon be necessary.
v1: More customers required more destinations
It wasn’t long before we were suffering from success—a good problem for any startup to have. We were signing up lots of customers, and many of those customers wanted more destinations for their logs.
That’s when “joecool” arrived on the scene.
Joecool is what we called one of the two primary components of this next version of the system. It wrapped logstash-forwarder, which used lumberjack as the messaging protocol, and ran alongside each container (or set of containers).
Joecool was, well, cool, but joecool wasn't complete without "gentlemanjerry," the name we gave to the logstash aggregator. Joecool shipped logs to this logstash, which sometimes lived on a different host, and the logstash forwarded them to the destinations a customer had configured.
We offered one gentlemanjerry per destination and, generally, one joecool per service in our customers’ apps.
This setup supports Stacks – one of four primary concepts that structure how Aptible isolates, deploys, and allocates resources. When we deploy resources to a customer’s application, we do so via a Stack that contains the underlying virtualized infrastructure so that we can isolate resources at a network level.
Following this framework, we ensured that each gentlemanjerry ran isolated from every other customer's gentlemanjerry, and we also supported running multiple gentlemanjerrys for a customer when that customer had multiple distinct logging destinations.
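Put roughly in code, the wiring looked something like the sketch below. This is an illustrative model only, with hypothetical names and structure, not how our systems actually represent it: one joecool per service, one gentlemanjerry per destination, and nothing shared across stacks.

```python
from dataclasses import dataclass

# Illustrative model only; names and structure are hypothetical.
@dataclass
class Stack:
    name: str
    services: list[str]       # each service's containers get a joecool forwarder
    destinations: list[str]   # each log destination gets its own gentlemanjerry

    def wiring(self) -> dict[str, list[str]]:
        """Map each per-destination gentlemanjerry to the joecools feeding it."""
        return {
            f"gentlemanjerry[{self.name}/{dest}]": [
                f"joecool[{self.name}/{svc}]" for svc in self.services
            ]
            for dest in self.destinations
        }

stack = Stack("acme-prod", services=["web", "worker"],
              destinations=["elasticsearch", "datadog"])
for jerry, joecools in stack.wiring().items():
    print(jerry, "<-", joecools)
# Two gentlemanjerrys (one per destination), each fed by every service's joecool,
# and nothing here is shared with any other customer's stack.
```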
Now, customers could send logs to multiple destinations, such as their own Elasticsearch or Datadog instances.
Later, we added support for logging databases, endpoints, and ephemeral sessions.
All in all, it felt like a true v1. It took the original engineering team about a month to build, and the result was fairly robust, allowing us to support many more customers. We were able to keep growing, better support the customers we already had, and offer more compelling support to incoming customers.
Still, it was a v1, and there were known issues around efficiency and resource consumption.
Because gentlemanjerry ran logstash, it frequently hogged resources. We knew we could handle this for a while, but as our customer base continued to grow, resource consumption was going to get out of control.
Plus, this version consumed so much CPU by default that we couldn't tell when logstash was broken. Logstash was eventually and inevitably going to break, and worse, we could never repair it reliably and quickly because we had no way to monitor it and see when it had broken.
Eventually, it started to feel like a boat with a hole in the bottom. We were getting to our destinations, but we were bailing more and more water out along the way.
v2: Solving the logstash breakage
Logstash, our leaky boat, was breaking often enough that we were motivated to build a new version. Our goal was simple: to fix the leak.
Ashley Mathew, who leads engineering today, stepped up to take on this project. Within a month, she had finished v2. Along the way, she made two major changes:
She inserted redis as a buffer
She upgraded joecool so it could use filebeat
With the first change, we could use redis as the queuing layer. With the second, filebeat could write directly to redis (the old logstash-forwarder couldn't do this).
The combination allowed us to meet our main goal for v2: Now, we could look at the queue depth in redis to see when logstash was broken.
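That check was simple but valuable. Here's a minimal sketch of the kind of monitoring this unlocked, assuming filebeat writes into a redis list; the key name and threshold are made up for illustration.

```python
import redis

# Hypothetical values: adjust the key and threshold to your own pipeline.
QUEUE_KEY = "filebeat"     # the redis list filebeat writes to
ALERT_THRESHOLD = 10_000   # depth at which we assume logstash has stopped draining

r = redis.Redis(host="localhost", port=6379)

def logstash_looks_broken() -> bool:
    """If the buffer keeps growing, logstash isn't consuming it."""
    return r.llen(QUEUE_KEY) > ALERT_THRESHOLD

if logstash_looks_broken():
    print("queue depth exceeded threshold; logstash is likely stuck")
```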
At this point, our logging system wasn’t perfect, but it felt functional and maintainable in a way previous versions hadn’t. But even with the performance gains made possible by this version, we still ran into efficiency issues.
The biggest problem was how difficult it was to get logstash processing effectively. We found ourselves having to frequently restart the log drains and hope that a restart would free up some resources.
As we pulled on this thread, we quickly realized another version was necessary. This next version would be the biggest overhaul yet.
v3: Focusing on reliability
In v3, our primary goal was to resolve the reliability problems in gentlemanjerry. The diagnosis was simple, but the treatment was difficult. To make it happen, the entire engineering team worked for two months.
Gentlemanjerry relied on logstash, but the underlying logstash version was outdated. Given the way logstash was woven into our logging system, neither upgrading nor replacing it was going to be easy. Because logging and log drains remained an important feature for our customers, we also wanted to thoroughly evaluate which path would be best.
We did two forms of testing and performed each test in both ideal and real-world scenarios. The former could show us maximum throughput and the latter could show us the effects of latency.
First, we found all the open-source log collector tools that seemed potentially suitable. We put each through extensive performance tests to see which was most effective. We also got feedback from the team to see which features they preferred.
To run performance tests, we built an app with a pre-set delay that helped us simulate a few different network conditions. We wanted to account, as best we could, for situations where a destination might be in the same AWS region or only in the same half of the country.
In the ideal scenario test, we set the response time to 1ms and shoved as many logs through as we possibly could. We had to get to the point where adding CPU to gentlemanjerry no longer yielded any improved throughput.
As you can see in the graph below, we then measured the rate by monitoring (via Cloudwatch) the requests/min between gentlemanjerry and the endpoint we were shipping the logs to.
This test demonstrated a direct correlation between the CPU we allocated to gentlemanjerry and the maximum throughput we could reach. We tested CPU allocations at 50%, 100%, 200%, 400%, and 800%. At 400%, we could push logs at a rate of 160,000/min (or 230.4 million lines/day). Beyond that, there was essentially no ROI to adding more CPU usage.
The real-world scenario test took the same basic setup but, to demonstrate latency, we added about 40ms to the response time. Here, we saw a rate of about 33,000/min (or 47.52 million lines/day) and gentlemanjerry consumed less than 100% of the available CPU resources.
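To give a sense of the setup, here's a stripped-down sketch of the kind of fake destination we tested against. It's not the actual app, and the delay value is the knob we turned between the roughly 1ms "ideal" run and the roughly 40ms "real-world" run.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Simulated delay: ~0.001s for the "ideal" run, ~0.040s for the "real-world" run.
RESPONSE_DELAY_SECONDS = 0.040

class FakeLogDestination(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and discard the log batch, then respond after the configured delay.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        time.sleep(RESPONSE_DELAY_SECONDS)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the test server itself quiet

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), FakeLogDestination).serve_forever()
```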
By the end of our performance tests, we had two options:
Upgrading logstash: The latest version of logstash performed better in our tests but only in terms of the sheer amount of logs sent. It oversaturated each destination, so the effective rate was limited.
Replacing logstash: Our testing made Fluentd the favorite among the open-source log collector tools we tried. Fluentd could fully saturate the destinations, but it also offered well-supported, vendor-written plugins. Plus, it came with built-in monitoring and offered a range of pretty compelling security features, even in the open-source version.
After testing how best to collect the logs, we turned to the receiving side and measured the rates at which each of our natively supported destinations could accept logs. Our approach was to generate slightly more logs than we knew a given destination could handle; by pushing past what the destination could accept, we could use the generation rate to estimate the maximum transmission rate.
Knowing generation rates and max transmission rates showed us the externally imposed limitations on the log-collecting tools we had identified. We didn’t want to make the same kind of mistake car buyers often make when distracted by flashiness and style. Buyer’s remorse stings if you end up wasting your money on an expensive sports car that you can only use to putter around the suburbs.
We tested our log collector options against different log providers, including Sumologic, Datadog, and Mezmo (then called LogDNA). As we did this, we also measured the “backpressure” to verify that the logs were just barely queuing up. We also checked to make sure that CPU wasn’t a limiting factor.
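In rough terms, the idea looks like the sketch below: pace log generation slightly above the destination's expected ceiling and compare the generated rate to the acknowledged rate. The `send_batch` callable is a stand-in for whatever client a given provider exposes, and none of the numbers here are our real test parameters.

```python
import time

def measure_acceptance_rate(send_batch, lines_per_batch=1000,
                            duration_s=60, target_rate=50_000):
    """Generate logs a bit faster than the destination can take them and
    measure how many lines per minute it actually acknowledges."""
    sent = accepted = 0
    deadline = time.monotonic() + duration_s
    # Seconds per batch needed to hit target_rate (lines/min).
    interval = 60.0 * lines_per_batch / target_rate
    while time.monotonic() < deadline:
        batch_start = time.monotonic()
        sent += lines_per_batch
        if send_batch(lines_per_batch):  # True if the destination accepted the batch
            accepted += lines_per_batch
        # Backpressure shows up as send_batch taking longer than the interval,
        # leaving nothing to sleep off here.
        time.sleep(max(0.0, interval - (time.monotonic() - batch_start)))
    minutes = duration_s / 60
    return sent / minutes, accepted / minutes  # lines/min generated vs accepted
```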
In the end, fluentd won out against logstash and other log collector tools because of its effective rate (informed by our tests on the logging providers), its plugins, and its add-ons. The choice was even easier because fluentd was two replacements in one. Fluentd provided an internal buffer, so replacing logstash with fluentd also made redis redundant.
It was a major improvement, all in all, but we still had issues.
Figuring out how best to define outputs was hairy and came with a few different problems. The providers' plugins defined the outputs, which sometimes meant we were sending slightly different outputs than we had been sending before.
One provider, for example, had hard-coded rules based on what we had been sending, so they expected outputs in only that format. When we released this new version, we ran into the ironic problem of breaking their log formatting by sending outputs in the format the provider's own plugin required, instead of the format they had hard-coded.
And because every solution seems to introduce a new problem, we also ran into issues using both fluentd and filebeat. We were able to make it work, for the most part, but there were troublesome, persistent communication issues between the two.
Now, as proud as we were of this work, if you refer back to the main problem we identified in v2 and reread what we did in v3, you’ll see we didn’t fully address the inefficient log draining problem.
v3 offered an entirely different operational paradigm than v2. In v2, we could restart the log drains and free up resources as necessary. This wasn’t an ideal solution but it worked because restarting freed up some internal resources.
In v3, the buffer lived in memory within the drains, meaning we couldn't restart them without losing whatever was buffered. We could still send signals to the containers to control how fluentd handled the buffer, but we were painfully aware that we still lacked a complete solution (or at least a silver bullet).
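As one example of that kind of nudge: fluentd flushes its buffered messages when it receives a SIGUSR1, so poking a drain without restarting it can look roughly like this. The container name is a placeholder, and this is a sketch rather than our actual tooling.

```python
import subprocess

# Hypothetical container name; in practice this would be looked up per log drain.
DRAIN_CONTAINER = "gentlemanjerry-example"

# Ask fluentd to flush its in-memory buffer without restarting the container
# (a restart would drop anything still sitting in that buffer).
subprocess.run(["docker", "kill", "--signal=SIGUSR1", DRAIN_CONTAINER], check=True)
```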
v4: Cutting filebeat
v3 was our biggest overhaul to date, but v4 is when we added the final puzzle piece (for now).
In v3, the hand-off between filebeat and fluentd was one of our biggest sources of flakiness. As we worked on v4, we realized a simple solution was also the best one: We could avoid the flakiness by ditching filebeat and using fluentd all the way through.
We tested to ensure this setup could handle the load that the v3 research revealed, and we tweaked the configuration to ensure we were going to get the reliability and robustness we wanted. We were then able to optimize system performance and ensure the log drains were working efficiently, without hogging resources like they were before.
Once we had everything fine-tuned, we had v4.
One of the best advantages of v4 wasn't even the reduction in flakiness. Building v3 had given our team some muscle for supporting fluentd, but filebeat had added more complexity than it was worth. Eliminating it reduced the number of moving pieces involved, and fewer things to debug meant faster debugging.
With v4, we’re much closer to having a rote playbook, meaning we have the most effective system yet, even though it requires the least ongoing effort.
Focusing on functionality and effectiveness rather than comprehensiveness and power
You’ve likely heard this before, but sometimes the most important lessons are the ones you have to learn over and over again: The tech doesn’t matter until your customers are using it. Eric Ries said it, any number of Agile proponents have said it, and if you log onto Twitter, you can likely find a thread extolling the same basic idea.
But here’s the thing: It’s true.
Early in the process of developing our logging infrastructure, we (well, Frank) found that it was better to start with a feature that was functional rather than wait to build and deploy a feature that was comprehensive. In v0, customers could only have one destination, but it was worth building that initial version to give them that initial capability.
Later in the process, we learned the lesson again as we tested different log collector tools. Performance varied among the tools, but effective performance was ultimately more important than raw performance. An upgraded version of logstash initially appeared to perform better, but the logging destinations imposed limits that it simply oversaturated. Fluentd was effectively better (the only kind of "better" that counts) when we focused on actual, effective rates.
In both cases, refocusing from sheer technology concerns to customer concerns shifted our design decisions. For v0, that meant shipping something that was functional but incomplete, and for v3, that meant focusing on performance that would actually matter for our customers.
And that points to a meta-lesson of sorts: Some lessons are worth re-learning, and sometimes, the best work comes from re-applying ideas you thought you already knew. Ship, iterate, focus on customer value – the classics hold up.