Dashdive (YC W23) is an innovative cloud cost observability tool. While conventional tools focus on presenting cost and usage data at the resource level on a dashboard (e.g., "EC2 instance X cost $Y last month"), Dashdive takes a more granular approach.
Traditional cloud cost products display resource-level usage data, sourced straight from the underlying cloud providers. However, they fall short when it comes to addressing more detailed queries below the resource level. For example, if your application serves multiple users on the same infrastructure, these tools cannot provide insights into which customers and features are contributing to specific increases in usage.
Dashdive, on the other hand, enables users to comprehend multitenant costs through sub-resource insights. It achieves this by meticulously calculating the cloud costs associated with each user action within a product. This involves the collection of individual usage events, such as HTTP requests, object downloads, and database inserts, which are then tagged based on customer, resource, or responsible team.
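The tagging model described above can be sketched in a few lines of code. Everything here is illustrative: the event fields and customer names are hypothetical, not Dashdive's actual schema.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical shape of a sub-resource usage event; field names are
# illustrative, not Dashdive's actual schema.
@dataclass
class UsageEvent:
    event_type: str   # e.g. "http_request", "object_download", "db_insert"
    resource: str     # e.g. an instance ID or bucket name
    customer: str     # tenant the event is attributed to
    team: str         # team responsible for the code path
    cost_usd: float   # estimated cost of this single event

def cost_by_customer(events):
    """Aggregate per-event costs up to the customer level."""
    totals = defaultdict(float)
    for e in events:
        totals[e.customer] += e.cost_usd
    return dict(totals)

events = [
    UsageEvent("http_request", "i-abc123", "acme", "platform", 0.0001),
    UsageEvent("object_download", "bucket/obj", "acme", "data", 0.0020),
    UsageEvent("db_insert", "pg-main", "globex", "platform", 0.0005),
]
totals = cost_by_customer(events)
```

The same grouping could just as easily be done by `team` or `resource`, which is what makes per-event tagging more flexible than resource-level billing data.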
We interviewed Adam Shugar, Co-Founder and CTO of Dashdive, to see how Porter has helped Dashdive handle the thousands of requests per second that arise from their product’s heightened granularity.
Kubernetes from the start
Dashdive began by hosting on AWS ECS, pushing their Docker images directly to ECR. They were handling DevOps themselves but knew their infrastructure needed to be streamlined as they scaled.
As soon as they began onboarding customers, Adam realized that they needed to focus solely on writing code and delivering products to their customers and not burden their engineering bandwidth with infrastructure-related work. However, they also needed highly performant, flexible, and scalable infrastructure to handle the massive amount of events they were ingesting for their customers.
Not your traditional SaaS application
Most SaaS applications ingest a small volume of domain-specific data which they modify and use to provide the end user a service - it’s not overly complex. Dashdive’s architecture is more similar to an analytics platform like Datadog: on one end, a stream of data is continuously collected, constantly appended, and never deleted; on the other end, that data is being compressed, optimized, and presented on a real-time dashboard.
Dashdive wanted to present cloud usage breakdown as granularly as possible to their clients. Just telling their customers how much each instance cost wouldn’t be enough. Furthermore, most enterprise companies have multiple clusters, so even a tagging-based approach with the attributes of “Customer X” and “Feature Y” doesn’t work if the database is used by multiple tenants.
To allow their users to understand multi-tenant costs, Dashdive needed tooling that captured individual usage events. This required the collection of a large volume of events at high throughput, alongside a constantly updating business intelligence dashboard showing the real-time breakdown of usage.
This sort of architecture requires highly performant infrastructure. Adam knew that Kubernetes (K8s) was their best option, as it offered the best way for them to scale up to ingest hundreds of thousands of events per second at peak and scale back down when traffic was lower.
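The scale-up-and-down behavior described here is typically expressed in Kubernetes as a HorizontalPodAutoscaler. The sketch below is a minimal, hypothetical example - the deployment name, replica bounds, and CPU threshold are illustrative, not Dashdive's actual configuration:

```yaml
# Illustrative HPA: scale an ingest deployment between 2 and 20 replicas
# based on average CPU utilization. All names and numbers are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: event-ingest
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: event-ingest
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

With a spec like this, Kubernetes adds replicas as ingestion traffic pushes CPU above the target and removes them when traffic subsides.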
Developer experience on a PaaS
At one point, Dashdive was recommended a traditional Platform-as-a-Service (PaaS) named Render. The PaaS runs on Kubernetes, but does not actually give the end user access to the underlying K8s and doesn’t allow for any sort of granular configuration of their infrastructure. Even though the PaaS lets users stop worrying about managing infrastructure and focus on pushing code through its intuitive UI, the lack of configurability and flexibility made it a nonstarter.
Taking care of the undifferentiated heavy-lifting
Adam had a checklist of items to configure that were essential for Dashdive’s infrastructure on Kubernetes:
- Log ingestion and searching through Grafana and Loki
- Metrics collection through Prometheus and Grafana for graphs
- SSL certificate management
- NGINX Ingress Controller
- Continuous integration for automatically building new Docker images from GitHub Actions, uploading them to ECR, and auto-pulling and deploying them into Kubernetes through Argo CD
- Zero downtime re-deployments and health checks
- Slack integration to notify of crashes / failed health checks
- Tailscale for secure remote access
“This is the sort of tooling that would normally only be available to our engineering team if we were at a larger company. They all came out of the box with Porter, so we were ready to scale from the start without worrying about the undifferentiated heavy lifting on the infrastructure side.” - Adam Shugar, Co-Founder and CTO of Dashdive
When evaluating Porter, Adam found that all of this configuration that helped ensure their infrastructure’s uptime and reliability was built-in to every cluster provisioned on the platform. Porter provides all users with logs with 7-day retention and metrics (CPU, memory, and network usage) for up to 30 days, through Prometheus, as well as third-party add-ons such as DataDog and Grafana for more robust logging and metrics. Porter also manages all SSL certificate renewals for users, and every cluster provisioned by Porter comes with an NGINX Ingress Controller out of the box.
Porter takes care of CI/CD for users through GitHub Actions, so instead of having to investigate Argo CD and configure it themselves, Dashdive is able to push, build, and deploy their Docker images with a few clicks onto their cluster. To ensure the uptime of their applications, Dashdive also needed health checks (endpoints that indicate an application is healthy and ready to receive traffic with a ‘200’ status code). On Porter-managed clusters, traffic won’t switch from an old application instance to a new one until the new one is healthy, allowing for zero-downtime deployments. Furthermore, all application events - from deployment successes and failures to erroneous exits, crash loops, OOM errors, and failed health checks - are aggregated on the Porter dashboard and piped in as alert notifications over Slack (or email, depending on what users prefer). One of the add-ons Porter supports is Tailscale, allowing for secure remote access.
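In Kubernetes terms, the health check described above is usually wired up as a readiness probe on the pod spec: the pod only receives traffic once the endpoint returns a 2xx status. The fragment below is a generic sketch - the path, port, and timing values are hypothetical, not Porter's or Dashdive's actual settings:

```yaml
# Illustrative readiness probe: the pod is only added to the load balancer
# once GET /healthz on port 8080 succeeds. All values are hypothetical.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

During a rolling deployment, new pods that fail this probe never receive traffic, which is what makes zero-downtime re-deployments possible.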
Ingesting thousands of events per second
With Porter, Dashdive receives one of the other main benefits of K8s - autoscaling. As traffic increases, new instances are automatically spun up, and when traffic goes down, they’re automatically shut down. Dashdive also uses ClickHouse, a database management system, as part of their tooling. However, the massive volume of raw events they were ingesting meant that ClickHouse’s out-of-the-box query performance was not sufficient, so Dashdive instead runs cron jobs that perform pre-aggregations of their data to speed up query times.
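The idea behind those pre-aggregation jobs is straightforward: a scheduled job rolls raw events up into coarser buckets so that dashboard queries scan far fewer rows. Here is a minimal in-memory sketch of that rollup; the schema and data are hypothetical (Dashdive's actual jobs run against ClickHouse tables):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (timestamp, customer, cost). In ClickHouse
# these would live in a large raw table, and a cron job would roll them
# up into a smaller aggregate table; here we do the same rollup in memory.
raw_events = [
    (datetime(2024, 1, 1, 10, 15), "acme", 0.002),
    (datetime(2024, 1, 1, 10, 45), "acme", 0.003),
    (datetime(2024, 1, 1, 11, 5), "globex", 0.001),
]

def hourly_rollup(events):
    """Pre-aggregate per-event costs into (hour, customer) buckets."""
    buckets = defaultdict(float)
    for ts, customer, cost in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, customer)] += cost
    return dict(buckets)

rollup = hourly_rollup(raw_events)
```

A dashboard query against the rollup touches one row per customer per hour instead of one row per event, which is the entire point of pre-aggregating at billions of events per day.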
Since they have full access to their own infrastructure and can always go under the hood with Porter, they spun up more NGINX reverse proxy nodes with the Porter team’s guidance - in production, they have eight nodes that are purely responsible for accepting requests from clients, ingesting them, and forwarding them to an AWS Kafka cluster peered with their Porter-managed cluster.
After going through their data pipelines, the processed data returns to a Kafka Connect REST API housed in their Porter-managed production cluster. Furthermore, they created an admin box within their cluster so they can SSH into it to create new ClickHouse databases and handle other tasks, like adding or revoking API keys.
“Even for a single customer, the scale of events was up to a billion per day. This is very difficult to handle out of the gate, so we had to increase the number of NGINX reverse proxy nodes and the resources allocated to each. The Porter team quickly helped us make this change.” - Adam Shugar, Co-Founder and CTO of Dashdive
Dashdive has two clusters on Porter - one for staging and one for production. They test any code changes locally, but also keep the staging cluster so they can always sanity check prior to pushing to production, especially due to the complex nature of their architecture and the associated DevOps setup supporting it. Overall, Adam believes he’s gotten exactly what he was looking for from Porter:
- The convenience of a PaaS for their deployment process, where they just put in their Dockerfile and specify the root directory to deploy applications, with everything else taken care of by GitHub Actions.
- The flexibility of hosting within their own AWS VPC and EKS cluster which they can configure for any need as it arises.
“I’m very bullish on Porter. We didn’t want to manage Kubernetes ourselves - it’s a huge pain and timesink despite its obvious and often necessary benefits. And every other PaaS provider is too cookie-cutter to support our needs. Porter provides the best of both worlds, with the added bonus of letting us ramp up on K8s as we used the platform.” - Adam Shugar, Co-Founder and CTO of Dashdive