Tracing the real-world challenges of back-end engineers at Spotify
Overview
Spotify is the world’s largest streaming music provider. Its outstanding user experience relies on a highly responsive platform.
To optimise this performance, Spotify uses a technology called distributed tracing (DT) for tracking requests as they move across its complex ecosystem of back-end services. Spotify recently engaged Elsewhen on a research project to better understand the effectiveness and value of DT across the organisation.
Highlights
- Conducting a 9-week research project to help Spotify’s leadership understand the usage and effectiveness of distributed tracing.
- Providing insights on current challenges, and outlining options to deliver more value for engineers and the organisation.
- The challenge
- Gaining insight into the usage and value of Distributed Tracing
- Qualitative research
- User interviews
- User survey analysis
- Insights database
- Data analysis
- Process mapping
- Stakeholder workshops
- Report findings and recommendations
Spotify uses Distributed Tracing to identify and diagnose latency problems in its systems, and to visualise the connections and flow of requests between services. But the third-party tool that Spotify used for DT was proving expensive and had low uptake among its 1,000+ back-end engineers.
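To make the idea concrete, here is a minimal sketch of the kind of data a trace holds and how it supports the two uses described above: finding where latency comes from, and seeing which services call which. It is an illustration only, not Spotify's tool or any vendor's API. The `Span` fields, the helper functions, and the service names are all hypothetical, and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work done by one service while handling a request (hypothetical shape)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def service_edges(spans):
    """Return the set of (caller, callee) service pairs seen in the trace."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is not None and parent.service != s.service:
            edges.add((parent.service, s.service))
    return edges


def self_time_by_service(spans):
    """Time each service spent on its own work, excluding time spent in child spans.

    Assumes children of a span do not overlap in time.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    totals = {}
    for s in spans:
        own = s.duration_ms - child_total.get(s.span_id, 0)
        totals[s.service] = totals.get(s.service, 0) + own
    return totals


# A made-up trace for a single request; service names are illustrative only.
trace = [
    Span("t1", "a", None, "api-gateway", 0, 480),
    Span("t1", "b", "a", "playlist", 10, 460),
    Span("t1", "c", "b", "metadata", 20, 60),
    Span("t1", "d", "b", "recommendations", 70, 440),
]

totals = self_time_by_service(trace)
slowest = max(totals, key=totals.get)
print(slowest, totals[slowest])
print(sorted(service_edges(trace)))
```

In this made-up trace the request takes 480 ms end to end, but most of that time is spent inside a single downstream service. That is the kind of question reactive troubleshooting with DT tries to answer.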
Spotify engaged Elsewhen to research the use and cost-effectiveness of DT. It also wanted a better understanding of engineers’ needs in order to shape user-centric requirements for possible alternative DT tools.
Questions that Spotify asked the Elsewhen research team included:
What are engineers trying to achieve when they use DT?
How, when and why do they use the current DT tool in their day-to-day work?
What are users’ pain points with DT today?
- Strategic approach
- Understanding the real experiences and opinions of users and stakeholders
Elsewhen ran a 9-week project to research the status and value of DT at Spotify. Our team interviewed 13 back-end engineers, and also analysed data from a survey of 48 engineers. We then synthesised our findings and categorised the insights using Notion, FigJam, and other tools.
The team identified that the primary use case for Distributed Tracing at Spotify was reactive troubleshooting and diagnosing latency issues. Secondary use cases included proactive performance management and analysing ecosystem connections. We helped leadership visualise the process flow and benefits of DT in these use cases.
However, the team also found that just 16% of back-end engineers at Spotify were regularly using the existing DT tool. This reflected low levels of satisfaction with the tool and low awareness of DT benefits.
- The solution
- Identifying the issues and blockers that engineers face
“Great work! Really clear deck and presentation of findings.”
To help Spotify leadership understand the issues restricting the usage and effectiveness of its existing DT tool, the research team identified four key themes: data sampling, service coverage, replaceability, and user engagement with the tool.
Overall, while the potential value of DT was high, these issues were driving a vicious circle of limited utility and usage for the existing DT tool.
- The outcomes
- Providing ways forward to maximise value and effectiveness
“This is pure gold. Thanks Elsewhen team!”
The research team determined which DT strengths are most important to engineers. The priorities included ensuring 90%+ service coverage, having a clear DT strategy and implementation policy, building DT awareness and understanding, and enabling flexible control over data sampling. Finally, the team mapped potential paths forward:
- Stick with the existing DT tool and try to resolve issues
- Switch to another DT provider
- Build a custom DT tool
- Sunset the current DT tool without a replacement
For each option, the team outlined opportunities and risks, and how Elsewhen could help with implementation. As a result of the project, Spotify was equipped with the evidence, insights and strategic options it required to make the right decisions.