How Nightingale’s 40 TB Datasets are Increasing the Utility of AI for the Medical Community
Data is a crucial resource for most medical and health research. Recently, Nightingale Open Science, an open-source platform hosting de-identified medical datasets, made 40 terabytes of medical data freely available to researchers. The data could help train AI to predict medical conditions earlier and save lives, benefiting patients and making AI more useful to clinicians.
What do the datasets include?
The datasets comprise 40 TB of medical imagery, such as electrocardiogram waveforms, X-rays, and pathology specimens, from patients with conditions and outcomes including sudden cardiac arrest, high-risk breast cancer, Covid-19, and maternal mortality. The images are labeled with medical outcomes, such as the severity of Covid-19 disease and whether the patient needed a ventilator. The datasets were built over two years in the U.S. and Taiwan and will soon be expanded to Kenya and Lebanon to cover a more diverse patient population.
How are these datasets different and more valuable?
According to Ziad Obermeyer, the physician and machine learning scientist who launched Nightingale Open Science, what sets these datasets apart is that they are labeled not with a doctor’s opinion but with what actually happened to the patient.
For example, cardiac arrest ECG datasets are labeled not by whether a cardiologist detected something wrong with the patient but by whether the patient eventually suffered a cardiac arrest. This lets researchers learn from the ground truth: the actual patient outcomes.
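The difference between the two labeling schemes can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `EcgRecord` type, its field names, and the three sample records are invented for illustration and do not reflect Nightingale's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class EcgRecord:
    record_id: str
    cardiologist_flagged: bool  # the physician's opinion at the time
    had_cardiac_arrest: bool    # what actually happened to the patient


# Invented sample records, for illustration only.
records = [
    EcgRecord("ecg-001", cardiologist_flagged=True, had_cardiac_arrest=True),
    EcgRecord("ecg-002", cardiologist_flagged=False, had_cardiac_arrest=True),
    EcgRecord("ecg-003", cardiologist_flagged=True, had_cardiac_arrest=False),
]


def opinion_labels(items):
    """Labels a model would learn from if trained on physician judgment."""
    return {r.record_id: int(r.cardiologist_flagged) for r in items}


def outcome_labels(items):
    """Labels based on the ground truth: the eventual patient outcome."""
    return {r.record_id: int(r.had_cardiac_arrest) for r in items}


def disagreements(items):
    """Records where the physician's opinion and the outcome differ."""
    return [r.record_id for r in items
            if r.cardiologist_flagged != r.had_cardiac_arrest]
```

A model trained on `opinion_labels` can at best imitate the cardiologist, including the missed case `ecg-002`, while a model trained on `outcome_labels` has a chance to learn signals the cardiologist overlooked.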
The project has already started garnering attention from tech giants like Google and hundreds of researchers from institutions like MIT, Berkeley, Cornell, Stanford, and the University of Chicago.
Problems with the current medical data landscape
Medical data today is often siloed. Some of the carefully examined, high-quality data is owned by medical institutions that reserve access for their own researchers. The rest is owned by big tech companies that can afford to pay large amounts for it. But what about people and enterprises that cannot pay large sums for medical data? They can spend anywhere from several days to several months negotiating for access, which often includes working their way into a hospital system just to reach the data.
How does Nightingale Open Science make things easier for researchers?
Nightingale Open Science is an effort to help researchers, startups, and anyone else who otherwise has to work extremely hard to access high-quality medical data. It shares de-identified clinical datasets with researchers free of cost. It also aims to accelerate artificial intelligence in the health sciences by collecting, vetting, and cleaning the data so that it is of high quality.
How does Nightingale fill gaps in existing medical datasets?
It does so through linkage, which is performed before the data is de-identified. Nightingale’s partner health institutions give it access to raw patient data. Records from different sources, including cancer registry information, electronic health records (such as vital signs, lab results, height, and weight), and Social Security data, are then linked and merged to chart each patient’s health progression. This allows researchers to establish what actually happened to the patient rather than relying on physicians’ judgments, which can be biased. After linkage, the data is de-identified to remove patient names and other protected categories of identifiers, including Social Security numbers. Finally, the de-identified data is moved to Nightingale’s cloud platform, where researchers can access it for free.
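The order of operations described here, link first and de-identify afterwards, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layouts, field names, and the `link_records` and `deidentify` helpers are hypothetical, not Nightingale's actual pipeline, and real de-identification covers many more identifier categories than the two removed here.

```python
# Hypothetical raw records from three sources, all keyed by the same identifier.
ehr_records = [
    {"ssn": "000-00-0001", "name": "Patient One", "height_cm": 170, "weight_kg": 68},
    {"ssn": "000-00-0002", "name": "Patient Two", "height_cm": 158, "weight_kg": 74},
]
cancer_registry = [{"ssn": "000-00-0001", "cancer_stage": "II"}]
death_records = [{"ssn": "000-00-0002", "deceased": True}]

# Assumption: only these two fields are direct identifiers in this toy example.
DIRECT_IDENTIFIERS = {"ssn", "name"}


def link_records(*sources, key="ssn"):
    """Merge rows from every source that share the same key into one record."""
    linked = {}
    for source in sources:
        for row in source:
            linked.setdefault(row[key], {}).update(row)
    return list(linked.values())


def deidentify(records):
    """Drop direct identifiers and assign an arbitrary study ID instead."""
    cleaned = []
    for index, record in enumerate(records, start=1):
        safe = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        safe["study_id"] = f"P{index:04d}"
        cleaned.append(safe)
    return cleaned


# Linkage must run first: once the shared key is removed, rows can no longer be matched.
linked = link_records(ehr_records, cancer_registry, death_records)
released = deidentify(linked)
```

The sketch shows why the ordering matters: the identifier that makes linkage possible is exactly the field that de-identification strips out, so reversing the two steps would leave the sources impossible to join.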
Nightingale’s impact on the accuracy of algorithms
Until now, for lack of meaningful, well-curated data, healthcare algorithms have often amplified existing health disparities. Nightingale’s curated and processed datasets are intended to reflect the diversity of the population, which should give algorithms trained on them broader applicability and greater accuracy. Labeling with outcomes rather than opinions also helps root out biases that AI would otherwise inherit. As a result, AI can be harnessed to understand the root causes of medical problems rather than to reproduce what physicians think.
Wrap Up
With the proper training, AI can become capable of picking up details that we could not otherwise have known, but for that to happen, suitable datasets are crucial. The Nightingale initiative gives hope for better datasets in the future and, with them, a change in the relationship AI currently has with the medical industry.