Minutes of the Data Sharing and Informatics Subcommittee Call, 2026-03-16

Present (in BOLD):

NASA/JPL: Dan Crichton, Sean Kelly, Heather Kincaid, Ashish Mahabal
Arizona State University: Ji Qiu
Boston University: Jennifer Beane
EVMS/Old Dominion: Julius Nyalwidhe
DMCC: Jackie Dahlgren, Chad He, Royce Malnik, Stephanie Page-Lester, Suzanna Reid
Johns Hopkins: Zhen Zhang
Moffitt: Matt Schabath, John Heine, Yoga Balagurunathan, Radka Stoyanova
NCI: Sidney Fu, Guillermo Marquez, Christos Patriotis, Juan Miguel Villanueva
PNNL: Tao Liu
University of California: William Hsu
University of Washington: Savannah Partridge

Current Action Items:

ONGOING: PIs are asked to review their data in LabCAS and let JPL know of any issues. Update: Dan Crichton reminded PIs to let JPL know about any new data they want to send and to make sure their data is up to date in LabCAS.
ONGOING: Discuss roadmap for additional hackathons and workshops.
Federated Learning: Matt Schabath and Eugene Koay may present on this topic at a future meeting. Heather Kincaid is working on scheduling this presentation.

Agenda/Discussion:

The main focus of this meeting was to develop a summary of the panel discussion that occurred at the 45th EDRN Steering Committee Meeting. Below is the summary provided by Jennifer Beane:

The panel session on Data Science, Digital Infrastructure, and Biomarkers focused on how EDRN can better position its data assets to support biomarker discovery, validation, and broader scientific use. Moderated by Jennifer Beane and Nicholas Hodges, the session emphasized that EDRN data are increasingly complex, spanning clinical data, imaging, histopathology, and multiple molecular modalities, often collected longitudinally and across institutions. Against this backdrop, the panel framed the central challenge as building a data ecosystem that is both FAIR—findable, accessible, interoperable, and reusable—and AI-ready, so that data can be more readily analyzed, combined, and reused for downstream research and model development.

The scientific intent of the session was to discuss how EDRN can ensure that its data resources are not only well managed, but also decision-relevant for biomarker research. The desired output was a prioritized roadmap for improving data standards, AI capabilities, and digital dashboards that support biomarker discovery and validation. Panelists brought expertise from imaging, biostatistics, machine learning, informatics, and data infrastructure, and the discussion was organized around several core themes: usable data sharing, harmonization across sites, AI applications for EDRN, dashboard functionality, workflow and communication, and overall prioritization of next steps.

A major focus of the discussion was the challenge of making shared data scientifically useful rather than merely available. Panelists noted that planning for sharing and harmonization must begin early in a study, rather than after data generation is complete. Common pain points include inconsistent metadata, incomplete documentation, losses introduced during de-identification, and the lack of standardized approaches across data types and sites. The group highlighted the need for data-type-specific working groups, for example around CT imaging, to establish practical standards for de-identification and release. There was also strong recognition that EDRN should anticipate emerging needs around digitized pathology and spatial transcriptomics, where both imaging and molecular data must be linked in ways that remain usable for downstream analysis. In parallel, the panel emphasized the importance of defining minimum metadata requirements and quality control expectations so that datasets can be combined across institutions and assessed by secondary users.

The session also examined where AI and digital infrastructure could deliver near-term value. One theme was that “AI-ready” data require more than raw files; they also need structured labels, metadata, provenance, and documentation that make model development reproducible. Panelists discussed the potential use of large language models to help researchers query EDRN resources more easily and determine whether available data are relevant to their questions. Additional possibilities included AI tools for assessing whether data are adequately de-identified and for mapping free-text fields to standardized schemas. On the infrastructure side, participants suggested that dashboards should evolve from simple inventories into decision-support tools that allow users to filter datasets by tissue type and key clinical variables, identify biospecimens profiled across multiple modalities, and even perform basic exploratory queries on genes or proteins of interest. The discussion also noted that APIs already exist, but stronger use cases and user-centered design are needed to make them more broadly useful.

Several clear messages emerged from the final discussion and summary slides. First, the panel repeatedly returned to the need to clarify the primary purpose and audience of EDRN data infrastructure—whether it is intended mainly for internal NIH/EDRN use, for public-facing dissemination, or for broader extramural reuse. That question influences how data should be organized, described, and prioritized. Second, there was strong support for expanding beyond internal EDRN use cases and enabling secondary analysis by outside investigators. Third, the panel emphasized that successful progress will require deliberate standards for de-identification, common data elements, and tiered or “leveled” standards tailored to specific data types. Practical examples raised during the discussion included adopting or adapting existing imaging standards such as ACRIN/TCIA approaches, repeating hackathon-style exercises to expose gaps in data readiness, piloting federated learning for sensitive datasets that cannot be widely distributed, and using working groups to define the minimum patient, sample, and assay metadata needed for integration.

In terms of next steps, the session suggests a pragmatic roadmap for EDRN. The most immediate actions would be to define 1–2 top priorities in each of the major areas—data sharing, harmonization, AI, and dashboards/infrastructure—and then test those priorities on a small set of candidate datasets. This could include convening focused working groups for current high-value data types, standardizing minimum metadata and de-identification practices, and selecting exemplar datasets to evaluate how well current infrastructure supports reuse, querying, and AI applications. The discussion also pointed toward value in broader visibility efforts, including better advertising of EDRN data resources and creating more interactive, informative dashboards that lower the barrier to discovery and reuse. Overall, the panel moved the conversation from general aspirations about FAIR and AI-ready data toward concrete operational priorities that could strengthen EDRN’s ability to support both biomarker science and broader community access.

Next Call: Monday, April 20, 2026 at 1 pm Eastern/10 am Pacific