Oak Ridge National Laboratory Scientists Develop Neural Network to Extract Cancer Data in Record Time

Every year, more than 17 million people around the world are diagnosed with cancer.

To better identify trends in cancer diagnoses and treatment responses, scientists at Oak Ridge National Laboratory (ORNL) developed an AI-based natural language processing tool to improve information extraction from textual pathology reports.

The work has the potential to help better guide research dollars and public resources to fight cancer.

“Manually extracting information is costly, time-consuming, and error-prone, so we are developing an AI-based tool,” said Mohammed Alawad, research scientist in the ORNL Computing and Computational Sciences Directorate. He is the lead author of the paper “Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks,” published in the Journal of the American Medical Informatics Association.

What makes pathology report extraction challenging for humans, let alone a machine, is that the reports are often ungrammatical, fragmented, and marred by typos, and that different pathologists describe the same cancer characteristics in linguistically different ways.

The deep learning-based model represents the first time a neural network has been used to analyze cancer pathology reports.

“Population-level cancer surveillance is critical for monitoring the effectiveness of public health initiatives aimed at preventing, detecting, and treating cancer,” said Gina Tourassi, director of the Health Data Sciences Institute and the National Center for Computational Sciences at ORNL.

The team trained and tested two multitask convolutional neural network (MTCNN) models on real health data: 95,000 pathology reports from the Louisiana Tumor Registry. To train the networks, the team used NVIDIA V100 GPUs on the Summit supercomputer with the cuDNN-accelerated TensorFlow deep learning framework.

“We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes),” the researchers stated in their paper.

The results show that the deep learning-based models outperform previous machine learning models on both micro- and macro-averaged scores across all five extraction tasks, the researchers explained.
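To make the distinction between micro and macro scores concrete, here is a minimal sketch in Python. It assumes the scores are F1 scores, which this article does not specify, and the function name `micro_macro_f1` and the toy labels are invented for illustration. It is not the researchers' evaluation code. Micro averaging pools every prediction before scoring, so common classes dominate; macro averaging scores each class separately and then takes an unweighted mean, so rare classes count as much as common ones.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    labels = set(y_true) | set(y_pred)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels (invented for illustration): one common class and one rare class.
y_true = ["breast"] * 8 + ["lung"] * 2
y_pred = ["breast"] * 8 + ["breast", "lung"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")
```

On this toy data the micro F1 is 0.90 while the macro F1 is about 0.80, because the single missed "lung" report barely moves the pooled score but halves that rare class's recall. Reporting both guards against a model that looks strong only because it handles the most frequent cancer sites well.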

For inference, the team used an NVIDIA P100 GPU.

*The image visualizes how the team’s multitask convolutional neural network classifies primary cancer sites. Image source: Hong-Jun Yoon/ORNL*

“During testing, [the researchers] found that the hard parameter sharing multitask model outperformed the four other models (including the cross-stitch multitask model) and increased efficiency by reducing computing time and energy consumption,” the organization wrote in a press release titled “ORNL researchers develop ‘multitasking’ AI tool to extract cancer data in record time.”
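Hard parameter sharing generally means that all tasks share the same lower layers and differ only in their final, task-specific output layers. The sketch below illustrates that idea in plain Python with random, untrained weights. The five task names and class counts come from the paper excerpt quoted earlier; everything else (the vocabulary size, embedding size, filter count, kernel width, and all function names) is a hypothetical choice made to keep the example small, not the team's actual TensorFlow architecture.

```python
import random

# Task names and class counts as quoted from the paper.
TASKS = {"site": 65, "laterality": 4, "behavior": 3, "histology": 63, "grade": 5}

# Hypothetical sizes, chosen only to keep the sketch small.
VOCAB_SIZE, EMBED_DIM, NUM_FILTERS, KERNEL = 50, 8, 6, 3


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(seed=0):
    rng = random.Random(seed)
    return {
        # Shared by every task: word embeddings and convolution filters.
        "embed": rand_matrix(VOCAB_SIZE, EMBED_DIM, rng),
        "filters": rand_matrix(NUM_FILTERS, KERNEL * EMBED_DIM, rng),
        # Task-specific: one linear output layer per task.
        "heads": {t: rand_matrix(n, NUM_FILTERS, rng) for t, n in TASKS.items()},
    }


def shared_features(params, token_ids):
    """Embed tokens, apply a 1D convolution with ReLU, then max-pool over time."""
    vectors = [params["embed"][t] for t in token_ids]
    features = []
    for filt in params["filters"]:
        best = 0.0  # starting at zero folds the ReLU into the max-pooling
        for start in range(len(vectors) - KERNEL + 1):
            window = [x for vec in vectors[start:start + KERNEL] for x in vec]
            best = max(best, sum(w * x for w, x in zip(filt, window)))
        features.append(best)
    return features


def predict(params, token_ids):
    """Run the shared layers once, then let each task head pick a class index."""
    feats = shared_features(params, token_ids)
    predictions = {}
    for task, weights in params["heads"].items():
        scores = [sum(w * f for w, f in zip(row, feats)) for row in weights]
        predictions[task] = scores.index(max(scores))
    return predictions


params = init_params()
report_tokens = [4, 17, 23, 8, 42, 15]  # made-up token ids standing in for a report
print(predict(params, report_tokens))
```

The point of the design is visible in `predict`: the comparatively expensive shared feature extraction runs once per report, and only the small per-task heads run five times. That reuse is consistent with the efficiency gain the press release describes, although the sketch makes no claim about the size of that gain.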

The scientists say their next step is to launch a large-scale user study in which the technology is deployed across cancer registries to identify the most effective ways to integrate it into the registries’ workflows.

“The goal is not to replace the human but rather augment the human,” Tourassi said.