Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning

Takuma Oda, Shih Wei Chiu, Takuhiro Yamaguchi

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: This study aimed to develop a semi-automated process for converting legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.

Materials and Methods: Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to a distributed representation to make them usable as features. The following methods were used for this purpose: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. Five algorithms were examined to identify the one that could generate the best prediction model: decision tree, random forest, gradient boosting, neural network, and an ensemble that combines the four algorithms.

Results: The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also showed a separation between correct and incorrect predictions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.

Conclusion: By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.
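As an illustration of the Gestalt pattern matching step named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes Python's standard-library `difflib.SequenceMatcher`, whose `ratio()` is based on the Ratcliff/Obershelp ("gestalt") approach. The `SDTM_TARGETS` table, the `normalize` rule, and the function names are hypothetical and chosen only for this example; the sketch turns one legacy variable label into similarity scores that a supervised classifier could take as numeric features.

```python
from difflib import SequenceMatcher

# Hypothetical lookup of SDTM target variables (name -> label).
# Illustrative only; a real mapping would cover the full SDTM domains.
SDTM_TARGETS = {
    "USUBJID": "Unique Subject Identifier",
    "BRTHDTC": "Date/Time of Birth",
    "AGE": "Age",
    "SEX": "Sex",
}


def normalize(text):
    """Assumed normalization: lowercase, underscores to spaces, collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def gestalt_similarity(a, b):
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def similarity_features(legacy_label):
    """One similarity score per candidate SDTM variable, usable as ML features."""
    return {
        name: gestalt_similarity(legacy_label, label)
        for name, label in SDTM_TARGETS.items()
    }


features = similarity_features("Date of birth")
best = max(features, key=features.get)  # candidate with the highest similarity
print(best, round(features[best], 3))
```

In a pipeline like the one described, such scores would only be inputs to the prediction model; the final mapping would still be subject to the human verification step.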

Original language: English
Pages (from-to): 49-61
Number of pages: 13
Journal: Methods of Information in Medicine
Volume: 60
Issue number: 1-2
Publication status: Published - 2021 May 1

Keywords

  • clinical trial
  • data conversion
  • database
  • supervised machine learning

ASJC Scopus subject areas

  • Health Informatics
  • Advanced and Specialised Nursing
  • Health Information Management
