Large-scale, sometimes nationwide, prospective genomic cohorts biobanking rich biological specimens such as blood, urine and tissues, have been established and released their vast amount of data in several countries. These genetic and epidemiological resources are expected to allow investigators to disentangle genetic and environmental components conferring common complex diseases. There are, however, two major challenges to statistical genetics for this goal: small sample size—high dimensionality and multilayered—heterogenous endophenotypes. Rather counterintuitively, biobank data generally have small sample size relative to their data dimensionality consisting of genomic variation, lifestyle questionnaire, and sometimes their interaction. This is a widely acknowledged difficulty in data analysis, so-called “p»n problem” in statistics or “curse of dimensionality” in machine-learning field. On the other hand, we have too many measurements of individual health status, which are endophenotypes, such as health check-up data, images, psychological test scores in addition to metabolomics and proteomics data. These endophenotypes are rich but not so tractable because of their worsen dimensionality, and substantial correlation, sometimes confusing causation among them. We have tried to overcome the problems inherent to biobank data, using statistical machine-learning and deep-learning technologies.
ASJC Scopus subject areas