This article is part of a VB special issue. Read the full series: AI and the future of health care

The health care industry produces an enormous amount of data. An IDC study estimates the volume of health data created annually, which hit over 2,000 exabytes in 2020, will continue to grow at a 48% rate year over year. Accelerated by the passage of the U.S. Patient Protection and Affordable Care Act, which mandated that health care practitioners adopt electronic records, there’s now a wealth of digital information about patients, practices, and procedures where before there was none.

The trend has enabled significant advances in AI and machine learning, which rely on large datasets to make predictions ranging from hospital bed capacity to the presence of malignant tumors in MRIs. But unlike other domains to which AI has been applied, the sensitivity and scale of health care data make collecting and leveraging it a formidable challenge. Tellingly, although 91% of respondents to a recent KPMG survey predicted that AI could increase patient access to care, 75% believed AI could threaten patient data privacy. Moreover, a growing number of academics point to imbalances in health data that could exacerbate existing inequalities.


Tech companies and health systems have trained AI to perform remarkable feats using health data. Startups like K Health source from databases containing hundreds to millions of electronic health records (EHRs) to build patient profiles and personalize automated chatbots’ responses. IBM, Pfizer, Salesforce, and Google, among others, are attempting to use health records to predict the onset of conditions like Alzheimer’s, diabetes, diabetic retinopathy, breast cancer, and schizophrenia. And at least one startup offers a product that remotely monitors patients suffering from heart failure by collecting recordings via a mobile device and analyzing them with an AI algorithm.

The datasets used to train these systems come from a range of sources, but in many cases, patients aren’t fully aware their information is included among them. Emboldened by the broad language in the Health Insurance Portability and Accountability Act (HIPAA), which enables companies and care providers to use patient records to carry out “healthcare functions” and share information with businesses without first asking patients, companies have tapped into the trove of health data collected by providers in pursuit of competitive advantages.

In 2019, The Wall Street Journal reported details of Project Nightingale, Google’s partnership with Ascension, the nation’s second-largest health system, under which Google is collecting the personal health data of tens of millions of patients to develop AI-based services for medical providers. Separately, Google maintains a 10-year research partnership with the Mayo Clinic that grants the company limited access to anonymized data it can use to train algorithms.

Regulators have castigated Google for its health data practices in the past. A U.K. regulator concluded that The Royal Free London NHS Foundation Trust, a division of the U.K.’s National Health Service based in London, provided Google parent company Alphabet’s DeepMind with data on 1.6 million patients without their consent. And in 2019, Google and the University of Chicago Medical Center were sued for allegedly failing to scrub timestamps from anonymized medical records. (A judge tossed the suit in September.)

But crackdowns and outcries are exceptions to the norm. K Health claims to have trained its algorithms on a 20-year database of millions of health records and billions of “health events” supplied partially by Maccabi, Israel’s second-largest health fund, but it’s unclear how many of the patients represented in the datasets were informed that their data would be used commercially. Other firms including IBM have historically drawn on data from research like the Framingham Heart Study for experiments unrelated to the original purpose (albeit in some cases with approval from institutional review boards).

Startups are generally loath to disclose the source of their AI systems’ training data for competitive reasons. Health Data Analytics Institute says only that its predictive health outcome models were trained on data from “over 100 million people in the U.S.” and over 20 years of follow-up records. For its part, Vara, which is developing algorithms to screen for breast cancer, says it uses a dataset of 2.5 million breast cancer images for training, validation, and testing.

In a recent paper published in the New England Journal of Medicine, researchers described an ethical framework for how academic centers should use patient data. They endorse the view that the standard consent form patients typically sign at the point of care isn’t sufficient to justify the use of their data for commercial purposes, even in anonymized form. These documents, which typically ask patients to consent to the reuse of their data to support medical research, are often vague about what form that medical research might take.

“Regulations give substantial discretion to individual organizations when it comes to sharing deidentified data and specimens with outside entities,” the coauthors wrote. “Because of important privacy concerns that have been raised after recent revelations regarding such agreements, and because we know that most participants don’t want their data to be commercialized in this way, we [advocate prohibiting] the sharing of data under these circumstances.”


From 2009 to 2016, the U.S. government commissioned researchers to find the best way to improve and promote the use of electronic health records (EHR). One outcome was a list of 140 data elements that should be collected from every patient on each visit to a physician, which the developers of EHR systems were incentivized to incorporate into their products through a series of federal stimulus packages.

Unfortunately, the implementation of these elements tended to be haphazard. Experts estimate that as many as half of records are mismatched when data is transferred between health care systems. In a 2018 survey by Stanford Medicine in California, 59% of clinicians said they felt that their electronic medical record (EMR) systems needed a complete overhaul.

The nonprofit MITRE Corporation has proposed what it calls the Standard Health Record (SHR), an attempt at establishing a high-quality, computable source of patient information. The open source specification, which draws on existing medical record models like Health Level Seven International’s Fast Healthcare Interoperability Resources (FHIR), contains information critical to patient identification, emergency care, and primary care as well as areas related to social determinants of health. Plans for future iterations of SHR call for incorporating emerging treatment paradigms such as genomics, microbiomics, and precision medicine.

However, given that implementing an EMR system could cost a single physician over $160,000, specs like SHR seem unlikely to gain currency anytime soon.

Errors and biases

Errors and biases don’t stem solely from the standardization problem, but they are symptoms of it.

Tracking by the Pennsylvania Patient Safety Authority in Harrisburg found that from January 2016 to December 2017, EHR systems were responsible for 775 problems during laboratory testing in the state, with human-computer interactions responsible for 54.7% of events and the remaining 45.3% caused by a computer. Furthermore, a draft U.S. government report issued in 2018 found that clinicians are inundated with (and not uncommonly miss) alerts that range from minor issues about drug interactions to those that pose considerable risks.

Mistakes and missed alerts contribute to another growing problem in health data: bias. Partly due to a reluctance to release code, datasets, and techniques, much of the data used today to train AI algorithms for diagnosing diseases might perpetuate inequalities.

A team of U.K. scientists found that almost all eye disease datasets come from patients in North America, Europe, and China, meaning eye disease-diagnosing algorithms are less certain to work well for racial groups from underrepresented countries. In another study, Stanford University researchers claimed that most of the U.S. data for studies involving medical uses of AI come from California, New York, and Massachusetts. A study of a UnitedHealth Group algorithm determined that it could underestimate by half the number of Black patients in need of greater care. Researchers from the University of Toronto, the Vector Institute, and MIT showed that widely used chest X-ray datasets encode racial, gender, and socioeconomic bias. And a growing body of work suggests that skin cancer-detecting algorithms tend to be less precise when used on Black patients, in part because AI models are trained mostly on images of light-skinned patients.


Even in the absence of bias, errors, and other confounders, health systems must remain vigilant for signs of cyber intrusion. Malicious actors are increasingly holding data hostage in exchange for ransom, often to the tune of millions of dollars.

In September, employees at Universal Health Services, a Fortune 500 owner of a nationwide network of hospitals, reported widespread outages that resulted in delayed lab results, a fallback to pen and paper, and patients being diverted to other hospitals. Earlier that month, a ransomware attack at the Düsseldorf University Hospital in Germany resulted in emergency-room diversions to other hospitals.

Over 37% of health care IT professionals responding to a Netwrix survey said their organization experienced a phishing incident. Just over 32% said their organization experienced a ransomware attack during the novel coronavirus pandemic’s first few months, and 37% reported an improper data sharing incident at their organization.


Solutions to challenges in managing health care data necessarily entail a combination of techniques, approaches, and novel paradigms. Securing data requires data-loss prevention, policy and identity management, and encryption technologies, including those that allow organizations to track actions that affect their data. As for standardizing it, both incumbents like Google and Amazon and startups like Human API offer tools designed to consolidate disparate records.

On the privacy front, experts agree that transparency is the best policy. Stakeholder consent must be clearly given to avoid violating the will of those being treated. And deidentification capabilities that remove or obfuscate personal information are table stakes for health systems, as are privacy-preserving methods like differential privacy, federated learning, and homomorphic encryption.
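
To make one of those privacy-preserving methods concrete, here is a minimal sketch of differential privacy applied to a simple counting query. It assumes the textbook Laplace mechanism; the patient records, the `private_count` helper, and the epsilon value are all hypothetical and chosen only for illustration, not drawn from any system named in this article.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    # expovariate takes a rate, so rate = 1 / scale gives a mean of `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one patient changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records, invented for this example.
patients = [
    {"age": 71, "diabetic": True},
    {"age": 34, "diabetic": False},
    {"age": 58, "diabetic": True},
    {"age": 45, "diabetic": False},
    {"age": 63, "diabetic": True},
]

rng = random.Random()
noisy = private_count(patients, lambda r: r["diabetic"], epsilon=1.0, rng=rng)
print(f"Noisy count of diabetic patients: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy. A production system would also need to track the total privacy budget spent across queries, which this sketch omits.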

“I think [federated learning] is really exciting research, especially in the space of patient privacy and an individual’s personally identifiable information,” Andre Esteva, head of medical AI at Salesforce Research, told VentureBeat in a phone interview. “Federated learning has a lot of untapped potential … [it’s] yet another layer of protection by preventing the physical removal of data from [hospitals] and doing something to provide access to AI that’s inaccessible today for a lot of reasons.”
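
The idea Esteva describes, training without moving patient data out of the hospital, can be sketched with a toy version of federated averaging. This is an illustration under simplifying assumptions, not a description of any vendor’s system: the model is a single weight fit to y ≈ w * x, and the two hospital datasets, the function names, and the hyperparameters are hypothetical.

```python
def local_update(weight, data, lr=0.01, epochs=20):
    """Run gradient descent on one hospital's own data for y ≈ weight * x."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_average(global_weight, hospital_datasets, rounds=10):
    """Each round, hospitals train locally and share only their weights.

    The server combines the weights, weighting each hospital by how many
    records it holds. Raw (x, y) records never leave local_update.
    """
    for _ in range(rounds):
        updates = [
            (local_update(global_weight, data), len(data))
            for data in hospital_datasets
        ]
        total = sum(count for _, count in updates)
        global_weight = sum(w * count for w, count in updates) / total
    return global_weight


# Hypothetical local datasets, roughly following y = 2x.
hospital_a = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
hospital_b = [(1.5, 3.0), (2.5, 5.2)]

model = federated_average(0.0, [hospital_a, hospital_b])
print(f"Shared model weight: {model:.3f}")
```

Only model weights cross the hospital boundary here. Real deployments typically layer further protections on top, such as secure aggregation or differential privacy, because shared weights can still leak information about the training data.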

Biases and errors are harder problems to solve, but the coauthors of one recent study recommend that health care practitioners apply “rigorous” fairness analyses prior to deployment as one solution. They also suggest that clear disclaimers about the dataset collection process and the potential resulting bias could improve assessments for clinical use.
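
One simple form such a fairness analysis might take is comparing how often a model catches true cases in each patient group. The sketch below assumes that approach; it is not the method of the study cited above, and the groups, records, and function names are hypothetical.

```python
from collections import defaultdict


def true_positive_rate_by_group(records):
    """Compute, per group, the share of actual positive cases the model flagged.

    `records` is an iterable of (group, actual, predicted) tuples, where
    `actual` and `predicted` are booleans.
    """
    positives = defaultdict(int)
    caught = defaultdict(int)
    for group, actual, predicted in records:
        if actual:
            positives[group] += 1
            if predicted:
                caught[group] += 1
    return {group: caught[group] / positives[group] for group in positives}


def largest_gap(rates):
    """Return the difference between the best- and worst-served groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical evaluation results, invented for this example.
results = [
    ("group_a", True, True), ("group_a", True, True),
    ("group_a", True, True), ("group_a", True, False),
    ("group_a", False, False),
    ("group_b", True, True), ("group_b", True, True),
    ("group_b", True, False), ("group_b", True, False),
    ("group_b", False, True),
]

rates = true_positive_rate_by_group(results)
print(rates)
print(f"Largest gap in true positive rate: {largest_gap(rates):.2f}")
```

A large gap suggests the model misses more true cases in one group than another. A fuller analysis would also compare false positive rates and calibration, and would need enough samples per group for the differences to be meaningful.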

“Machine learning really is a powerful tool, if designed correctly — if problems are correctly formalized and methods are identified to really provide new insights for understanding these diseases,” Dr. Mihaela van der Schaar, a Turing Fellow and professor of ML, AI, and health at the University of Cambridge and UCLA, said during a keynote at the ICLR 2020 conference in May. “Of course, we are at the beginning of this revolution, and there is a long way to go. But it’s an exciting time. And it’s an important time to focus on such technologies. I really believe that machine learning can open clinicians and medical researchers [to new possibilities] and provide them with powerful new tools to better [care for] patients.”
