o11y - What is it? Why is it important? What are the tools you need? More importantly - how can you adopt an observability mindset? Habu Software Architect Siddharth Sharma reports from his session at super{summit} 2022.
This post recaps a session I co-presented with Peter Han and Tyler Bonilla at the super{summit} conference in August 2022.
It’s amusing to see how our industry flocks to the latest tech buzzword. It’s cool, hip, and trending, so why not jump on the bandwagon?
The new buzzword kid on the block is o11y, a.k.a. observability!
Observability has actually been a topic near and dear to my heart throughout the last 15 years of my software development journey. In this post, I'll discuss what observability is, why it is important, a few of its challenges, and how to overcome them using Metrics, Events, Logs, and Traces.
By the end, you should be up to speed on some of the concepts, patterns, tools, and cultural aspects of embracing observability, to the best of my knowledge. All references to open source and vendor tools are for illustration only, to guide you on the path to embracing observability. Weigh the cost and maintenance-overhead tradeoffs to choose the tool that best suits your business needs.
In control theory, observability is a measure of how well a system's internal state can be inferred from its external outputs. Applied to software, observability lets us understand a system from the outside by asking questions about it, without knowing its inner workings. That is what lets us deal with unknown unknowns.
In software parlance, observability is the ability to ask new questions about the health of your running services without deploying new code.
In a world of cloud-native loosely coupled services, polyglot persistence, and dynamic infrastructure, traditional metrics-based monitoring approaches are woefully inadequate when it comes to understanding system state, and triaging and diagnosing behavioural and performance issues.
Over the years, Kubernetes has become developers' platform of choice for designing and deploying scalable, distributed applications. However, Kubernetes is unaware of an application's internal state, and its dynamic nature has created new problems for platform engineers, who need to keep track of performance despite the pace of change.
Monitoring and observability are often used interchangeably; however, they serve different purposes. Monitoring lets you verify whether your infrastructure and applications are functioning as expected. Observability, on the other hand, provides comprehensive, actionable insights that help you improve performance and make your applications and the entire infrastructure more stable and resilient.
Managing networking in a monolithic architecture is a relatively simple task: the path from the client to the server generally runs through a finite collection of points.
In a distributed microservice architecture, the network becomes much more critical and complex: the path between client and application becomes far more winding and harder to reason about due to the
This makes root cause analysis and incident resolution potentially much harder, yet the same questions still need answering:
In the past few years, much has been said and written about the “three pillars of observability”: metrics, logs, and traces.
A Trace records the paths taken by requests as they propagate through microservices.
A Span represents a unit of work or operation. It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
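To make the trace/span relationship concrete, here is a minimal, library-free Python sketch of manual instrumentation, assuming a simplified span model (name, trace ID, span ID, parent ID, timings). It is not the OpenTelemetry API, and the operation names are hypothetical. Every span in one request shares a trace ID and each child points at its parent, which is what lets a tracing backend reconstruct the path a request took.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified span: one unit of work inside a trace."""
    name: str
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span
    start: float
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


FINISHED = []  # stand-in for an exporter that ships spans to a tracing backend
_current_span = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name: str):
    parent = _current_span.get()
    s = Span(
        name=name,
        # Only a root span starts a new trace; children inherit the trace ID.
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        start=time.perf_counter(),
    )
    token = _current_span.set(s)
    try:
        yield s
    finally:
        s.end = time.perf_counter()
        _current_span.reset(token)
        FINISHED.append(s)


# One hypothetical checkout request fanning out to two dependencies.
with span("POST /checkout"):
    with span("inventory.reserve"):
        time.sleep(0.01)
    with span("payments.charge"):
        time.sleep(0.02)

for s in FINISHED:
    print(f"{s.name:<18} trace={s.trace_id[:8]} parent={s.parent_id} {s.duration_ms:.1f} ms")
```

Children finish before their parent, so they are recorded first; a backend reassembles the tree from the parent IDs rather than from arrival order.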
Metrics are numeric measurements of your infrastructure or application, aggregated over a period of time.
A log is a record of events that happened over time: each entry is a snapshot of something, with an associated timestamp.
However, as Charity Majors has pointed out, these three pillars are not exhaustive and can be complemented with others, namely Events and Profiles.
Although nearly every signal is an event of some kind, Events in this specific sense are occurrences external to the observed system that cause changes in it. The most common examples are deployments of application code, configuration changes, experiments, and auto-scaling events.
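To show why such change events are useful, here is a small sketch that records them as structured records and, during an incident, lists the ones that happened shortly before it. The ChangeEvent schema, the 30-minute lookback, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ChangeEvent:
    """Hypothetical schema: what changed, where, and when."""
    kind: str          # e.g. "deployment", "config_change", "autoscale"
    service: str
    timestamp: datetime
    details: dict


def suspects(events, incident_at, lookback=timedelta(minutes=30)):
    """Return change events in the window before an incident, most recent first."""
    window = [e for e in events if incident_at - lookback <= e.timestamp <= incident_at]
    return sorted(window, key=lambda e: e.timestamp, reverse=True)


t0 = datetime(2022, 8, 1, 12, 0, tzinfo=timezone.utc)
events = [
    ChangeEvent("deployment", "checkout", t0 - timedelta(hours=2), {"version": "1.4.0"}),
    ChangeEvent("config_change", "checkout", t0 - timedelta(minutes=20), {"key": "db.pool_size", "new": 5}),
    ChangeEvent("autoscale", "checkout", t0 - timedelta(minutes=5), {"replicas": 6}),
]

for e in suspects(events, incident_at=t0):
    print(e.timestamp.isoformat(), e.kind, e.details)
```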
Events are analogous to structured logs; however, they differ from logs in the following traits:
Profiling is the act of measuring a program’s behavior using data we gather as our code executes (for example, frequency and duration of function calls, CPU time or memory usage, and more).
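Python's standard-library cProfile is one readily available way to gather exactly this kind of data (call counts and time per function) in-process. Production profilers typically sample the stack instead of hooking every call, so treat this as an illustration of the data, not of how a continuous profiler runs; the fib workload is a made-up example.

```python
import cProfile
import pstats


def fib(n: int) -> int:
    # Deliberately inefficient recursion so the call counts are easy to see.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


profiler = cProfile.Profile()
profiler.enable()
result = fib(15)
profiler.disable()

stats = pstats.Stats(profiler)
# Show the five most expensive functions by cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```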
Profiling is a new addition to the observability stack, and the profiling SIG (special interest group) has only just kicked off. I will be keeping a close eye on it as it evolves and gains support across open source and vendor tools.
Below are some best practices for implementing observability in your stack. They can be used to build common shim layers or shared libraries for cross-cutting concerns. Developing such generic layers in your programming language of choice avoids code duplication, helps apply standard conventions across the stack, and simplifies maintenance and upgrades.
Ingest, index, and visualize all logs from various sources in a centralized store, using either open source (FluentBit, Loki, ELK) or vendor tools (DataDog, Honeycomb, NewRelic).
Use JSON structured logging as an alternative to traditional logging. Logs written in JSON are easily readable by both humans and machines, and structured JSON logs are easily tabularized to enable filtering and queries.
Include meaningful information about the event that triggered the log, as well as additional context that helps you understand what happened, find correlations with other events, and diagnose issues that require further investigation. A few example fields:
— User Request Identifiers (X-Request-Id)
— Unique Identifiers (X-User-Id, X-Tenant-Id)
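The practices above (JSON output plus request, user, and tenant identifiers) can be sketched with Python's standard logging module. The JsonFormatter class and its field names are illustrative assumptions rather than a standard schema; in production you may prefer an established structured-logging library.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Hypothetical contextual fields, mirroring X-Request-Id, X-User-Id, X-Tenant-Id.
    CONTEXT_FIELDS = ("request_id", "user_id", "tenant_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include only the context fields that were actually supplied via `extra=`.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)  # DEBUG records are dropped at this level

log.info("order placed", extra={"request_id": "req-123", "tenant_id": "acme"})
```

Because every line is a self-describing JSON object, a log store can index request_id or tenant_id directly instead of parsing free text.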
Avoid logging sensitive data and personally identifiable information (PII) that may be covered by data privacy and security regulations or standards like the European GDPR, HIPAA, or PCI DSS.
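One defensive layer is a logging filter that scrubs obvious PII before any handler writes it out. This is a sketch under loose assumptions: the two regular expressions (email addresses and 13 to 16 digit card-like numbers) are illustrative only, and real PII detection needs far more than pattern matching; not logging the data in the first place remains the primary control.

```python
import logging
import re

# Hypothetical patterns: email addresses and 13-16 digit card-like numbers.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[REDACTED_CARD]"),
]


def redact(text: str) -> str:
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Scrub matching PII from the fully rendered message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already rendered, so drop the args
        return True


handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())  # on the handler, so propagated records are covered too
log = logging.getLogger("billing")
log.addHandler(handler)
log.warning("charge failed for %s using card %s", "jane@example.com", "4111 1111 1111 1111")
```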
Use interceptors wherever applicable to log gRPC/REST requests and responses, and prefer language constructs (MDC in Java, Context in Go) to inject context into every log that is part of the same request/response lifecycle.
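In Python, the standard contextvars module plays a similar role to MDC or Go's Context for this purpose. The sketch below assumes a hypothetical handle_request function standing in for an interceptor: it sets the request ID once, and a logging filter stamps it onto every record emitted while that request is being handled.

```python
import contextvars
import logging
import uuid

# Holds the request ID for whatever request the current thread/task is serving.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestContextFilter(logging.Filter):
    """Stamp every record with the request ID active in the current context."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


handler = logging.StreamHandler()
handler.addFilter(RequestContextFilter())
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)


def handle_request(headers: dict) -> None:
    """Hypothetical interceptor: reuse the caller's X-Request-Id or mint a new one."""
    token = request_id_var.set(headers.get("X-Request-Id") or str(uuid.uuid4()))
    try:
        # Business code logs normally; the filter adds the request ID.
        log.info("request received")
        log.info("request completed")
    finally:
        request_id_var.reset(token)


handle_request({"X-Request-Id": "req-42"})
handle_request({})  # no incoming header, so a fresh UUID is used
```

Context variables are isolated per thread and per asyncio task, so concurrent requests do not see each other's IDs.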
Use appropriate log levels (INFO, WARN, ERROR) to avoid logging non-essential information that doesn’t help with diagnostics or root cause analysis; such noise increases time-to-insight, data volumes, and costs.
Set different retention policies (for example, transitioning to S3 Standard-IA or Glacier) for different types of logs, depending on your cost and compliance needs.
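Retention like this can be expressed as S3 lifecycle rules. The sketch below builds a lifecycle configuration in the shape accepted by S3's PutBucketLifecycleConfiguration API (Rules with Filter, Transitions, and Expiration). The prefixes and day counts are hypothetical and should come from your own cost and compliance requirements.

```python
import json

# Hypothetical log categories (S3 key prefixes) and retention periods in days.
RETENTION = {
    "logs/debug/": {"expire_after": 14},
    "logs/app/": {"ia_after": 30, "expire_after": 90},
    "logs/audit/": {"ia_after": 30, "glacier_after": 90, "expire_after": 2555},
}


def lifecycle_rules(retention: dict) -> dict:
    """Build an S3 lifecycle configuration with one rule per log prefix."""
    rules = []
    for prefix, policy in retention.items():
        transitions = []
        if "ia_after" in policy:
            transitions.append({"Days": policy["ia_after"], "StorageClass": "STANDARD_IA"})
        if "glacier_after" in policy:
            transitions.append({"Days": policy["glacier_after"], "StorageClass": "GLACIER"})
        rule = {
            "ID": "retain-" + prefix.strip("/").replace("/", "-"),
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": policy["expire_after"]},
        }
        if transitions:
            rule["Transitions"] = transitions
        rules.append(rule)
    return {"Rules": rules}


print(json.dumps(lifecycle_rules(RETENTION), indent=2))
```

With boto3, the resulting dict could be passed as the LifecycleConfiguration argument of the S3 client's put_bucket_lifecycle_configuration call.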
More metrics are always better if you have the right tools. Hence, gather all the infrastructure golden signals using either open source (Prometheus with Thanos or Cortex) or vendor tools (DataDog, Honeycomb).
Build language-specific libraries (for example, on top of Micrometer or the Prometheus client libraries) to instrument application code so it emits the key business metrics that need alerting (count of failed custom jobs, count of in-flight messages in a queue).
Depending on the use case, use the appropriate metric type to capture data points, which can then be aggregated into Sums, Gauges, or Histograms.
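To illustrate the semantics of the three metric types, here is a library-free sketch; it deliberately does not mirror the Micrometer or Prometheus client APIs, and the bucket boundaries and metric names are assumptions. A counter (sum) only goes up, a gauge records a current value that can move either way, and a histogram counts observations into buckets. Prometheus exposes buckets cumulatively; this sketch keeps per-bucket counts for readability.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic sum, e.g. count of failed jobs."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down, e.g. in-flight queue messages."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets bounded above by `bounds` (seconds here)."""
    bounds: list = field(default_factory=lambda: [0.05, 0.1, 0.25, 0.5, 1.0])
    counts: list = field(init=False, default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)  # extra last bucket is +Inf

    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value (a "less than or equal" bucket).
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


failed_jobs = Counter()
queue_depth = Gauge()
latency = Histogram()

failed_jobs.inc()
queue_depth.set(42)
for seconds in (0.03, 0.07, 0.3, 2.0):
    latency.observe(seconds)
print(failed_jobs.value, queue_depth.value, latency.counts)
```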
Like logging, use interceptors wherever applicable to capture gRPC/REST request/response metrics and prefer using language constructs (MDC in Java or Go Context) to inject context across all metrics that are part of the same request/response lifecycle.
RED (Rate, Errors, Duration)
Request-scoped: for every service, check the request rate, errors, and duration.
USE (Utilization, Saturation, Errors)
Resource-scoped: for every resource, check utilization, saturation, and errors.
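The interceptor idea and the RED signals can be combined in a short sketch: a decorator wraps a handler and records request count, errors, and duration per route. The in-memory dictionaries stand in for a real metrics backend, the route and handler are hypothetical, and the p95 is a crude nearest-rank estimate; USE metrics would come from resource-level collectors instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store; a real setup would export these to a metrics backend.
requests_total = defaultdict(int)
errors_total = defaultdict(int)
durations = defaultdict(list)


def red_metrics(route: str):
    """Decorator acting as a simple interceptor that records Rate, Errors, Duration."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            requests_total[route] += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                errors_total[route] += 1
                raise
            finally:
                durations[route].append(time.perf_counter() - start)
        return wrapper
    return decorator


def red_summary(route: str, window_seconds: float) -> dict:
    n = requests_total[route]
    ordered = sorted(durations[route])
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))] if ordered else 0.0
    return {
        "rate_per_s": n / window_seconds,
        "error_ratio": errors_total[route] / n if n else 0.0,
        "p95_s": p95,
    }


@red_metrics("GET /cart")
def get_cart(cart_id: str) -> dict:
    if cart_id == "missing":
        raise KeyError(cart_id)
    return {"cart_id": cart_id, "items": []}


for cid in ["c1", "c2", "missing", "c3"]:
    try:
        get_cart(cid)
    except KeyError:
        pass

print(red_summary("GET /cart", window_seconds=60))
```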
Once the golden signals are collected, they can be used collectively for alerting, troubleshooting or tuning and capacity planning.
Cardinality is the number of unique combinations of metric names and dimension (label) values. Choose the dimensions you attach to your metrics based on the meaningful information you want to extract from your telemetry data. Immutable infrastructure on Kubernetes and containers can lead to a cardinality explosion: because a resource is never updated once created, only replaced, every new pod or container brings new identifiers, and each identifier used as a dimension creates new time series.
In addition to the span context (trace ID, span ID, and trace flags) propagated via the W3C Trace Context standard, code can be instrumented to annotate a Span with custom key-value attributes that carry information about the operation it tracks.
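A quick calculation shows how fast this grows. The sketch computes an upper bound on the series count for one metric as the product of distinct values per label; the labels and the rollout numbers are hypothetical.

```python
def series_count(label_values: dict) -> int:
    """Upper bound on time series for one metric name: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total


# Hypothetical labels on a request-counter metric.
stable = {
    "service": ["cart", "checkout", "search"],
    "method": ["GET", "POST"],
    "status": ["2xx", "4xx", "5xx"],
}
print(series_count(stable))  # 18 (3 * 2 * 3)

# Add a pod-name label. Each rollout replaces the pods rather than updating them,
# so every rollout contributes brand-new label values.
pod_names = [f"cart-v{rollout}-{replica}" for rollout in range(10) for replica in range(5)]
with_pods = {**stable, "pod": pod_names}
print(series_count(with_pods))  # 900 (18 * 50)
```

In practice not every combination occurs (a cart pod never serves checkout traffic), so real counts are lower, but the bound keeps climbing with every rollout, which is why short-lived identifiers make poor metric dimensions.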
For example, if a span tracks an operation that adds an item to a user’s shopping cart in an eCommerce system, you can capture the user’s ID, the ID of the item to add to the cart, and the cart ID.
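The sketch below shows both halves with the standard library only: it parses an incoming W3C traceparent header (version 00 layout: version, trace ID, parent ID, flags), continues the trace with a new span ID, attaches the shopping-cart attributes, and builds the header to send downstream. In real code an OpenTelemetry SDK's propagators and span.set_attribute would handle this; the attribute keys and IDs here are hypothetical, and the sketch skips some spec checks such as rejecting all-zero IDs.

```python
import re
import secrets

# Version 00 layout from the W3C Trace Context spec: version-traceid-parentid-flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_span(traceparent: str, name: str, attributes: dict) -> dict:
    """Continue an incoming trace: same trace ID, new span ID, custom attributes."""
    match = TRACEPARENT.match(traceparent)
    if not match:
        raise ValueError(f"invalid traceparent: {traceparent!r}")
    trace_id, parent_id, flags = match.groups()
    span_id = secrets.token_hex(8)
    return {
        "name": name,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "span_id": span_id,
        "attributes": attributes,
        # Header to send to the next downstream service.
        "outgoing_traceparent": f"00-{trace_id}-{span_id}-{flags}",
    }


span = child_span(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "cart.add_item",
    # Hypothetical attribute keys for the shopping-cart example.
    {"app.user.id": "u-123", "app.item.id": "sku-987", "app.cart.id": "cart-555"},
)
print(span["outgoing_traceparent"])
```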
An application can be instrumented to emit traces either automatically or manually.
The monitoring market used to be dominated by proprietary vendors. Each vendor had its own share of pros and cons related to cost and feature support. In response, various free and open-source software projects started or were spun out of tech companies. Early examples include Prometheus for metrics and Zipkin and Jaeger for tracing. In the logging space, the “ELK stack” (Elasticsearch, Logstash, and Kibana) gained market share and became popular.
The market has hit an inflection point: cloud-native architectures are much larger in scale, more distributed, and more interdependent. Developers need the flexibility to choose and control the data they collect and analyze.
One key milestone was the merger of the OpenTracing and OpenCensus projects to form OpenTelemetry, a major project within the Cloud Native Computing Foundation (CNCF).
The configuration below should help you get started locally by running Docker containers of Prometheus, Jaeger and OpenTelemetry Collector.
You can then instrument your applications either automatically, using the OpenTelemetry agents, or manually, using the OpenTelemetry SDKs.
In both approaches, metrics and traces from your applications are sent to the OpenTelemetry Collector over OTLP, using either gRPC or HTTP. The Collector can then be configured with pipelines that filter or enrich telemetry data before exporting it to tools like Jaeger, Prometheus, DataDog, or NewRelic.
This multiplexing gateway approach avoids first having to select the tool and then instrument your application using the selected tool’s SDKs or Agents. You can seamlessly swap out tools without changing your application code.
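Conceptually, each Collector pipeline passes telemetry through an ordered chain of processors and then fans it out to one or more exporters. The Python sketch below models only that idea; the real Collector is configured declaratively (receivers, processors, exporters, pipelines) rather than programmed like this, and the processors, attribute values, and list-based "backends" are hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional

Record = dict
Processor = Callable[[Record], Optional[Record]]  # return None to drop the record
Exporter = Callable[[Record], None]


def drop_health_checks(record: Record) -> Optional[Record]:
    return None if record.get("http.route") == "/healthz" else record


def add_environment(record: Record) -> Record:
    return {**record, "deployment.environment": "staging"}


def run_pipeline(records: Iterable[Record], processors: list, exporters: list) -> int:
    """Apply processors in order, then fan each surviving record out to every exporter."""
    exported = 0
    for record in records:
        for process in processors:
            record = process(record)
            if record is None:
                break
        if record is None:
            continue
        for export in exporters:
            export(record)
        exported += 1
    return exported


# Stand-ins for two backends; swapping one out does not touch application code.
backend_a, backend_b = [], []
spans = [
    {"name": "GET /cart", "http.route": "/cart"},
    {"name": "GET /healthz", "http.route": "/healthz"},
]
count = run_pipeline(spans, [drop_health_checks, add_environment], [backend_a.append, backend_b.append])
print(count, backend_a)
```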
As with DevOps, you cannot buy observability off the shelf. Tooling is part of the equation (you’ll need a platform that can ingest, correlate, and analyze data), but tools alone aren’t the key to observability. Observability is more than deploying certain tools or adopting certain workflows: it has to be embraced by your engineering culture. That culture must be supported and backed by engineering leadership and applied to every aspect of the software development lifecycle.
Across the board, super{set} companies are embracing observability by adopting some of the above patterns and tools as part of their engineering culture, and using them to improve the developer and customer experience.