#OpenDataSavesLives Innovation Session: Reflections on OpenSAFELY
We were delighted with the turnout for our Innovation Session, where we brought together analysts and data scientists from the DHSC, our sponsors and the Open Innovations team to ask some questions of open data around the theme of removing unwarranted variation in quality of healthcare. Joining us from the Mental Health and Shielded Patient Analysis and Data Exploitation teams at the DHSC, our attendees brought a focus on mental health when exploring questions such as:
- Do poor Brits die younger than poor Americans?
- Do Brits have better physical access to healthcare?
- How does background relate to outcome in healthcare? Is BAME/non-BAME an appropriate cultural indicator?
- What is the relationship between access, patient experience and clinical outcomes?
One of our aims for this session was to look at OpenSAFELY, the open-source software platform for analysis of electronic health records data, to explore data governance and the use of synthetic data in trusted research environments (TREs). We were not committed to a specific outcome; our goal was to build and innovate with data, documenting our process and learning as we went. We approached OpenSAFELY as new users, exploring the feasibility of employing the platform to ask these sorts of questions of datasets that are normally restricted to use within TREs. In this blog post, we share what we learned about the OpenSAFELY platform over the two days.
What did we do?
The OpenSAFELY documentation provides plenty of guidance for setting up the platform, so we started there. There are two options for getting started: installing OpenSAFELY and its dependencies locally, or hosting a development environment in the browser using Gitpod. Some of our attendees told us that they are restricted in what they can install on a government machine, so we followed the Gitpod route, which caters for this use case. Local installation is more time-consuming if its dependencies (Python, Anaconda and Docker) are not already installed.
A handy Getting Started Guide walks through setting up a first study and producing some basic charts, then directs you to other sections for more detail. The study definition specifies the data to be extracted from the OpenSAFELY database. It also defines the expected distribution of the dummy data that is generated to test your code. Running the study definition in the test environment produces a CSV file containing dummy data with the specified patient population, variables and expected distribution. In a live environment, the framework uses the same study definition to query the database, runs the analysis script against securely held data and returns an output. OpenSAFELY is implemented inside the servers where the datasets are stored, which means that data is never moved from its original location and raw pseudonymized data is never seen by the user.
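To make the idea of dummy data concrete, here is a minimal sketch of how a declared set of variables and expected distributions could be turned into a CSV of synthetic patients. It uses only the Python standard library and is not OpenSAFELY's own API: the `STUDY_SPEC` format, the variable names and the `generate_dummy_csv` helper are all hypothetical, invented for illustration.

```python
import csv
import io
import random

# Hypothetical, simplified stand-in for a study definition: each variable
# declares the distribution its dummy values should follow. This is NOT the
# OpenSAFELY study definition format.
STUDY_SPEC = {
    "age": {"type": "int", "min": 18, "max": 90},
    "sex": {"type": "category", "ratios": {"F": 0.5, "M": 0.5}},
    "has_diagnosis": {"type": "bool", "incidence": 0.2},
}


def dummy_value(spec, rng):
    """Draw one dummy value that follows the declared expectation."""
    if spec["type"] == "int":
        return rng.randint(spec["min"], spec["max"])
    if spec["type"] == "category":
        labels = list(spec["ratios"])
        weights = list(spec["ratios"].values())
        return rng.choices(labels, weights=weights, k=1)[0]
    if spec["type"] == "bool":
        return int(rng.random() < spec["incidence"])
    raise ValueError(f"unknown variable type: {spec['type']}")


def generate_dummy_csv(study_spec, n_patients, seed=0):
    """Return CSV text with one row of dummy data per synthetic patient."""
    rng = random.Random(seed)  # seeded so repeated runs give the same file
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["patient_id", *study_spec])
    writer.writeheader()
    for patient_id in range(1, n_patients + 1):
        row = {"patient_id": patient_id}
        for name, spec in study_spec.items():
            row[name] = dummy_value(spec, rng)
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(generate_dummy_csv(STUDY_SPEC, n_patients=5))
```

Because no row is derived from a real record, analysis code can be written and debugged against a file like this before it is ever run where the real data lives.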
We wanted to see whether we could create a study definition that more closely resembled the questions that we wanted to ask. We had only the test environment available to us, so running our analysis against real data was outside the scope of the event. We did, however, want to learn about the onboarding process and what would be required to deploy the OpenSAFELY platform onto a new dataset. Currently, the core data used within OpenSAFELY is based on primary care records that are collected and securely stored within EHR systems provided by TPP and EMIS. The dataset contains all the highly sensitive, pseudonymized but potentially identifiable information that is recorded and accessed by GPs to deliver healthcare services. Could OpenSAFELY be deployed onto other data centres, for example the Kent Integrated Dataset (KID) and KeRNEL, to deliver new insights on patient experience and clinical outcomes? And if so, how might we achieve this?
What did we learn?
Joining us on the day, Seb Bacon, CTO of the DataLab and technical lead on OpenSAFELY, and Chris Bates of TPP, offered some insights as to the future of OpenSAFELY and addressed some of the challenges we came up against during the event. They told us about their robust onboarding process, and how new users of the platform are given full guidance and support on using the OpenSAFELY cohort extractor, OpenCodelists and OpenSAFELY jobs runner to deliver analyses on real data.
When asked about future directions of the platform, they told us that the first wave of pilot users is underway and that they intend to increase their user base, cautiously opening up the platform to other research groups, subject to approval and review by NHS England. Currently, users need to have substantial data science skills, good epidemiological knowledge and strong experience working with primary care datasets. Support is delivered through a series of intensive support windows, in which training is given over one or more weeks to introduce the platform and start producing high quality research. Over time they hope to produce tools that will support a wider range of users with a reduced need for substantial engagement with the team. While the platform is in its early stages, its adopters make up a collaborative community of users and co-contributors to its documentation, codelists, informative content and co-development of the platform itself.
Importantly, OpenSAFELY was created to deliver urgent results during the Covid-19 pandemic, and since its development it has enabled an unprecedented scale of secure Covid-19-related analyses. Priority is still given to Covid-19 research, but its focus could broaden over time as adoption of the platform increases and the pandemic evolves. It may therefore be possible to deploy OpenSAFELY onto new data centres in the future; however, this would require complex integration and intensive support from the team. Currently only the TPP backend is fully supported, and EMIS support is still under development.
The platform ensures transparency by publishing its source code, projects, analysis logs and updates openly on GitHub. It is tightly built around git: a project is started by cloning the research template, and its analysis code is published as a git repository. Every time researchers change their code, the change is logged in version control, and any time their code is run against real data, the event is logged and published via the OpenSAFELY jobs runner. Each project has its own project page, which links to the source repository and a record of its version history. All code that is run against live datasets is therefore published in the open for review and re-use by other users. In contrast to working within TREs, there is no concern that this code will contain disclosive information, as it is not generated whilst working with actual data. A log of activity is recorded within TREs; however, this is not published, as the analysis code itself may contain identifying information about individual patients.
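As a rough illustration of why such a log is safe to publish, the sketch below records each run as metadata only: which project ran, a fingerprint of the exact code, the backend and the time. It is not the jobs runner's actual format; `record_run`, the field names and the use of a SHA-256 hash in place of a git commit identifier are assumptions made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone


def code_fingerprint(source_code):
    """Hash the analysis code so a log entry pins the exact version that ran."""
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()


def record_run(public_log, project, source_code, backend, when=None):
    """Append a publishable entry describing one run against real data.

    The entry holds only metadata about the code and the run, never any
    patient-level data or results. All field names here are hypothetical.
    """
    when = when or datetime.now(timezone.utc)
    entry = {
        "project": project,
        "code_sha256": code_fingerprint(source_code),
        "backend": backend,
        "run_at": when.isoformat(),
    }
    public_log.append(entry)
    return entry


if __name__ == "__main__":
    log = []
    record_run(log, "example-study", "print('analysis v1')", "TPP")
    record_run(log, "example-study", "print('analysis v2')", "TPP")
    print(json.dumps(log, indent=2))
```

Any change to the code yields a different fingerprint, so a reader of the log can tell which published version produced a given run.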
Take home messages
Despite not being heavily publicized, our Innovation Session attracted interest from across the DHSC, whose staff were very keen to get involved and learn about OpenSAFELY and its future. Through conversations with the Open Innovations team, others have demonstrated that there is a growing number of potential use cases for which OpenSAFELY could provide a secure platform for transparent handling of sensitive data. Its open, collaborative format could facilitate greater steps forward than teams working in silos, with heavy information governance processes impeding progress. Both the need for such a platform and its success have been demonstrated during the pandemic by the sheer volume of Covid-19 research that has been conducted via the OpenSAFELY framework. An increasing number of use cases seem to be emerging in industry, with independent analytics services working in partnership with the NHS to improve healthcare outcomes. Preserving privacy is therefore critical as the number of data services increases along with the development and use of TREs for health data analysis.
Get in touch
Alongside our investigation of OpenSAFELY, we are thrilled to present some of the innovation work that came out of the event. Take a look at the event hub page for our other blog posts in the series, in which we’ll share the prototypes that came out of our two-day hackathon along with an upcoming story and technical blog that documents our process.
If you would like to learn more about OpenSAFELY, our #ODSL Innovation Session or the other events we host in the Open Innovations space, then get in touch via firstname.lastname@example.org.
As always, a huge thank you to our sponsors, TPP and NHS SCW, and to the Health Foundation for supporting the event series.