Bioinformatics-Biostatistics Journal Club and Research Seminar: Feature Selection Bias in Assessing the Predictivity of SNPs for Alzheimer's Disease



June 03, 2019 - 10:00 - 11:00
Team: Events, Training & News
Posted on May 24, 2019


Description


When identifying SNPs related to a phenotype of interest, we consider how to assess the predictivity of SNPs selected by a GWAS. Two kinds of cross-validation can be used. In internal cross-validation (ICV), a subset of SNPs is pre-selected using all samples, and cross-validation is then applied to the selected subset. In external cross-validation (ECV), features are re-selected using only the training samples in each fold of cross-validation. The feature selection bias of ICV has not received sufficient attention in phenotype prediction with SNP data. Using Alzheimer's disease as an example, we demonstrate that ICV can lead to severe false discoveries. We use a real SNP dataset related to late-onset Alzheimer's disease (LOAD) and two synthetic datasets. For prediction, we compare the performance of three regularized logistic regression methods. For the LOAD dataset, ECV shows that no other SNPs improve the prediction of LOAD beyond APOE alone. However, the predictivity estimate of the selected SNPs given by ICV can reach an R^2 of 80%. The results for the synthetic datasets are similar to those for the real dataset. Furthermore, we have found that the predictivity estimate given by ICV can be significantly higher than the oracle predictivity based on the truly related SNPs. We have also found that Hyper-LASSO performs better than LASSO and elastic net. We recommend that ICV not be used to measure the predictivity of selected SNPs.
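
The difference between ICV and ECV described above can be illustrated with a minimal sketch on pure-noise data, where no SNP is truly related to the phenotype. Everything in it is an assumption made for illustration and is not taken from the talk: the synthetic genotypes, the sample and SNP counts, the mean-difference ranking used for selection, and the nearest-centroid classifier, which stands in for the regularized logistic regression methods compared in the study. The function names are hypothetical.

```python
import random

def select_top_k(X, y, rows, k):
    """Rank SNPs by |mean genotype in cases - mean in controls|, using only `rows`."""
    cases = [i for i in rows if y[i] == 1]
    controls = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_case = sum(X[i][j] for i in cases) / len(cases)
        mean_ctrl = sum(X[i][j] for i in controls) / len(controls)
        scores.append((abs(mean_case - mean_ctrl), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def predict(X, y, train, test_row, feats):
    """Nearest-centroid prediction (illustrative stand-in for a regularized model)."""
    cases = [i for i in train if y[i] == 1]
    controls = [i for i in train if y[i] == 0]
    d_case = d_ctrl = 0.0
    for j in feats:
        mean_case = sum(X[i][j] for i in cases) / len(cases)
        mean_ctrl = sum(X[i][j] for i in controls) / len(controls)
        d_case += (X[test_row][j] - mean_case) ** 2
        d_ctrl += (X[test_row][j] - mean_ctrl) ** 2
    return 1 if d_case < d_ctrl else 0

def cv_accuracy(X, y, k, n_folds, external):
    """Cross-validated accuracy; `external=True` re-selects SNPs inside each fold (ECV)."""
    rows = list(range(len(y)))
    # ICV: select once using ALL samples, including those later used for testing.
    fixed = None if external else select_top_k(X, y, rows, k)
    correct = 0
    for f in range(n_folds):
        test = rows[f::n_folds]
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        # ECV: select using only the training samples of this fold.
        feats = select_top_k(X, y, train, k) if external else fixed
        correct += sum(predict(X, y, train, i, feats) == y[i] for i in test)
    return correct / len(y)

# Assumed toy setting: 50 samples, 2000 SNPs coded 0/1/2, labels independent of genotypes.
rng = random.Random(0)
n_samples, n_snps = 50, 2000
X = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_snps)]
     for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

icv_acc = cv_accuracy(X, y, k=10, n_folds=5, external=False)
ecv_acc = cv_accuracy(X, y, k=10, n_folds=5, external=True)
print("ICV accuracy:", icv_acc)
print("ECV accuracy:", ecv_acc)
```

Because the labels are independent of the genotypes, an honest estimate should sit near chance. The ICV estimate is expected to be well above that, since the selected SNPs were chosen with the test samples in view, while the ECV estimate is not affected by this leak.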

Featuring:

Dr. Longhai Li

Professor
Department of Mathematics and Statistics, University of Saskatchewan


The Bioinformatics-Biostatistics Journal Club (BBJC) is hosted by the Data Science platform at the George & Fay Yee Centre for Healthcare Innovation. The BBJC aims to stimulate discussion of the development and application of bioinformatic and statistical methods for complex health data in translational and personalized medicine. All students, faculty, and staff are welcome to join this monthly event. We are looking for faculty members and trainees to lead future discussions.

For more information, contact Dr. Pingzhao Hu at pingzhao.hu@umanitoba.ca or Dr. Robert Balshaw at Robert.Balshaw@umanitoba.ca.
