What Will Healthcare AI Developers Do When and If the Data Well Runs Dry?

By John McCormack

Artificial intelligence (AI) tools are built on, and improved with, data. However, a study recently released by Epoch AI predicts that this data might soon be in short supply. That outcome would make it difficult to develop AI tools such as ChatGPT and others that depend on large language models.

More specifically, the study projects that tech companies will exhaust the supply of publicly available training data for AI language models sometime between 2026 and 2032. As that supply is depleted, the AI field is likely to face challenges as it strives to keep making progress.

This raises the question: Will this potential data shortage have an impact on the development of AI tools in healthcare? Christina Silcox, PhD, research director for digital health and an adjunct assistant professor at the Duke-Margolis Institute for Health Policy in Washington, DC, recently offered her perspective.

AHIMA: In general, what do the findings of the Epoch AI study mean for the healthcare industry?

Silcox: When we talk about systems like ChatGPT running out of publicly available training data, we are talking specifically about large language models. In healthcare, however, there is AI built using other methods that does not depend on large language foundation models. Many clinical AI tools try to predict what will happen in the future. For example, an AI tool might look at eye images and predict conditions like diabetic retinopathy and heart disease. Other tools use electronic health record (EHR) and monitoring data in the hospital to detect sepsis. These types of tools are not based on large language models.

The professionals who develop these tools still have to contend with the fact that there is just not as much health data as there are other types of data. That’s not new, though. They have always had to deal with that.

So, the availability of large volumes of healthcare data has always been a challenge, and these models have always been built with much less data than large language models. For example, an AI tool that leverages data from all the MRIs in the world is still working with far less data than a large language model, as there is only a finite number of MRIs that can be used for training.

So working with limited data sets is nothing new for healthcare AI developers. For example, there's a lot of interest in using AI for rare diseases. With many rare diseases, we're talking about fewer than 200,000 people in the United States. That's a vastly smaller volume of data than what is available to large language models. While developers need to use different machine learning techniques, they are still able to produce effective tools. Indeed, we have seen very good results with this limited data.

AHIMA: Are there any areas in healthcare where this pending limitation with large language models might have an effect?

Silcox: Some of the tools that people are most excited about in healthcare right now are those that would relieve administrative burden by helping with charting, documentation, and prior authorization letters. These types of tools are built, or will be built, with large language models as their foundation. If large language models run out of training data, that could put a ceiling on how much those tools can improve. However, developers will most likely figure out ways to work around that.

AHIMA: Are data privacy regulations limiting data availability – and, therefore, hampering AI development in healthcare?

Silcox: Yes. For example, protected health information (PHI) collected within EHRs or other health system software is covered by HIPAA (the Health Insurance Portability and Accountability Act of 1996) and other laws. Even when the data are de-identified, there can be a lot of hesitation about sharing health data for training or testing AI tools. This contributes to the current limitations of training data for healthcare-specific AI tools. However, privacy-protecting data-sharing techniques, such as federated learning and synthetic data generation, are being explored and may help with this challenge.

Data privacy and security is something that chief information officers and chief privacy officers are thinking about deeply as they deploy (and in some cases develop) AI tools in their organizations. There needs to be a clear understanding of where patient data will be moved and how it will be used when an AI tool is in operation. This should be an important part of any decision-making regarding AI tool selection and use.

AHIMA: How can healthcare organizations start to prepare for the potential lack of data when using large language models to develop AI tools?

Silcox: There’s a lot of room for improvement in healthcare tools before the limitations of a large language model would inherently cap a tool’s ability to continue to improve. It would be concerning if what we have right now is the best we can do. Indeed, I think there are a lot of smart people who can work with what we have to continue to make better and better tools. So I don’t think we will be stuck with [what we have regarding] tool performance even if the data runs out.

John McCormack is a Riverside, IL-based freelance writer who specializes in healthcare IT, clinical, and policy issues.
