Protein networks tomography: Targeting cancer and associated morbidities
Networks represent powerful inference tools for the analysis of complex biological systems. Inference is especially relevant when associations between network nodes are established by focusing on modularity. The problem of first identifying, and then validating, modules in networks has received substantial attention, and many approaches have been proposed. An important goal is functional validation of the identified modules, based on existing database resources. The quality and performance of algorithms can be assessed by evaluating the matching rate between retrieved and well-annotated modules, in addition to newly established associations. Owing to the variety of algorithms, the concept of a module resolution spectrum has become central to this research field. In general, coarse-resolution modules reflect global network regulation patterns operating at the gene level or at the protein pathway scale. Fine-resolution modules localize dense regions, uncovering details of the variety of constitutive connectivity patterns. The resolution limit problem is affected by uncertainty factors such as experimental accuracy and the detection power of inference methods, and impacts the quality and accuracy of functional annotation. Our proposed approach works at the systems level; it aims to dissect networks and examine modularity breadth-first, followed by in-depth analysis. In particular, “slicing” the protein interactome under examination yields a sort of tomography scan, implemented by eigendecomposition of network affinity matrices. Such affinity matrices can be designed ad hoc, characterized by topological attributes, and analyzed with spectral methods. Consequently, a selected interactome data set allows the exploration of the modularity of disease protein maps through selected eigenmodes that are informative of both direct (protein-centric) and indirect (protein-neighbor-centric) connectivity patterns of cancer targets and associated morbidities.
The network tomography approach is thus recommended for inference about disease-induced multiscale modularity.
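The eigendecomposition step can be illustrated with a minimal numpy sketch. The six-node affinity matrix below is a synthetic toy, not data from the study: two dense triangles joined by one bridge edge stand in for two protein modules, and the sign pattern of the second eigenvector of the normalized Laplacian recovers the coarse two-module split, i.e., one coarse-resolution "slice" of the network.

```python
import numpy as np

# Toy affinity matrix for a six-protein network: two dense triangles
# (nodes 0-2 and 3-5) joined by a single bridge edge (2-3).
# Purely illustrative -- not data from the study.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# np.linalg.eigh returns eigenvalues in ascending order; for a connected
# graph the first Laplacian eigenvalue is 0, and the second eigenvector
# (the Fiedler-like eigenmode) encodes the coarsest bipartition.
vals, vecs = np.linalg.eigh(L)
modules = vecs[:, 1] > 0   # sign pattern assigns each protein to a module
```

Finer-resolution "slices" would use the subsequent eigenmodes in the same way; the study's affinity matrices are designed from topological attributes rather than raw adjacency, so this block shows only the spectral mechanics.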
Classification of lung adenocarcinoma and squamous cell carcinoma samples based on their gene expression profile in the sbv IMPROVER Diagnostic Signature Challenge
Barriers, such as the lack of confidence in the robustness of disease signatures based on gene expression measurements, still hinder progress toward personalized medicine. It is therefore important that, once derived, a signature is verified via an unbiased process. The IMPROVER initiative was set up to establish an impartial view of methods and results for the classification of patients based on molecular profiles of disease-relevant or surrogate tissues. Here, the focus is on the Lung Cancer Signature Challenge, in which participants were asked to classify lung tumor gene expression profiles into 4 classes: adenocarcinoma (AC) and squamous cell carcinoma (SCC), each at either stage 1 or 2. The method reported here was the best-performing method in the 4-way classification. The original method is presented, as well as an algorithmic approach to replace the empirical (non-computational) steps used in the challenge. In the discussion, the difficulty of classifying tumor stages, as compared with the relatively good classification of subtypes, is examined. Hypotheses are made concerning possible reasons for the erroneous classification of some of the samples, in view of additional information on the test samples that was not made available to challenge participants.
Hierarchical-TGDR: Combining biological hierarchy with a regularization method for multi-class classification of lung cancer samples via high-throughput gene-expression data
Regularization methods that simultaneously select a small set of the most relevant features and build a classifier using the selected features have gained much attention recently in problems of classification of “omics” data. In many multi-class classification problems of practical importance, the classes are naturally endowed with a hierarchical structure. However, this natural hierarchical structure is often ignored. Here, we use an existing regularization algorithm, Threshold Gradient Descent Regularization (TGDR), in a hierarchical fashion that takes advantage of natural biological structure to specifically tackle multi-class classification of microarray data. We apply this approach to one of the tasks presented by the sbv IMPROVER Diagnostic Signature Challenge: the Lung Cancer Sub-Challenge. Gene expression data from non-small cell lung carcinoma were used to classify tumors into adenocarcinoma (AC) and squamous cell carcinoma (SCC) subtypes, and their clinical stages (I and II). Genetic and transcriptomic differences between AC and SCC have been reported, indicating potentially different pathological mechanisms of differentiation and invasion. The results from this analysis show that hierarchical-TGDR outperforms pairwise TGDRs in terms of predictive performance, and is substantially more parsimonious. In conclusion, the hierarchical-TGDR approach trains classifiers in a top-down fashion by considering the naturally existing structure within the data, reducing the number of pairwise TGDRs to be trained. It also highlights different mechanisms of “invasion” in the two subtypes. This work suggests that incorporating known biological information, such as data hierarchies, into classification algorithms can improve the discriminative performance and biological interpretation of the resulting classifier.
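A minimal sketch of the two ingredients, on synthetic data (the "genes", class sizes, and tuning values below are illustrative, not the challenge data): a TGDR-style logistic fit, where each step moves only the coefficients whose gradient magnitude is within a factor tau of the largest, and a top-down wrapper that first predicts the subtype and then the stage within the predicted subtype, so only three binary classifiers are trained instead of all six pairwise ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def tgdr_fit(X, y, tau=0.7, lr=0.1, steps=500):
    """TGDR-style logistic fit: at each step, only coefficients whose
    gradient magnitude is within a factor tau of the largest gradient
    are updated, which yields sparse models (tau=0 is plain gradient
    descent; tau near 1 is highly selective)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        pr = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        g = X.T @ (y - pr) / n                # ascent direction
        mask = np.abs(g) >= tau * np.abs(g).max()
        w += lr * g * mask
    return w

# Synthetic 4-class problem: class = (subtype, stage). Gene 0 carries the
# subtype signal, gene 1 the stage signal; the remaining genes are noise.
def make_class(subtype, stage, n=40, p=20):
    X = rng.normal(size=(n, p))
    X[:, 0] += 3.0 * (2 * subtype - 1)
    X[:, 1] += 3.0 * (2 * stage - 1)
    return X

classes = [(s, t) for s in (0, 1) for t in (0, 1)]
Xs = {c: make_class(*c) for c in classes}

# Top of the hierarchy: subtype 0 vs subtype 1, pooling both stages.
X_top = np.vstack([Xs[c] for c in classes])
y_top = np.concatenate([np.full(40, c[0]) for c in classes])
w_top = tgdr_fit(X_top, y_top)

# Second level: stage I vs II, trained separately within each subtype.
w_stage = {}
for s in (0, 1):
    X_s = np.vstack([Xs[(s, 0)], Xs[(s, 1)]])
    y_s = np.concatenate([np.zeros(40), np.ones(40)])
    w_stage[s] = tgdr_fit(X_s, y_s)

def classify(x):
    s = int(x @ w_top > 0)          # first decide the subtype...
    t = int(x @ w_stage[s] > 0)     # ...then the stage within it
    return (s, t)

acc = np.mean([classify(x) == c for c in classes for x in Xs[c]])
```

Because the gradient mask suppresses the stage gene in the pooled subtype problem and vice versa, each classifier in the hierarchy concentrates on its own signal, which is the parsimony effect the abstract describes.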
Relapsing-remitting multiple sclerosis classification using elastic net logistic regression on gene expression data
As part of the first Industrial Methodology for Process Verification in Research (IMPROVER) Challenge, the aim of the MS Diagnostic sub-challenge was to identify a robust diagnostic signature for relapsing-remitting multiple sclerosis (RRMS) from gene expression data. In this regard, we built a classifier that discriminates samples into two phenotype groups, either RRMS or controls, using the transcriptome of peripheral blood mononuclear cells. For our classifier, we used logistic regression with elastic net regularization as implemented in the glmnet package in R. We selected the values of the regularization hyper-parameters using cross-validation performance on the provided training data, the number of non-zero parameters in our model, and the distribution of output values when the input vectors for the test data were used with our classifier. We analyzed our classifier's performance with two different strategies for feature extraction, using either genes alone or additional features constructed from gene pathway data. The two strategies produced little difference in performance when comparing 10-fold cross-validation on the training data and prediction on the test data. Our final submission for the sub-challenge used only genes as features and identified a diagnostic signature consisting of 58 genes, which was ranked second out of a total of 39 submissions.
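The elastic net mechanism that produces such sparse gene signatures can be sketched as follows. The authors used glmnet in R (coordinate descent, with cross-validated hyper-parameters); the block below is a numpy proximal-gradient stand-in with fixed, illustrative hyper-parameters on synthetic data, shown only to make the sparsity mechanism concrete: the L2 term joins the smooth gradient, while the L1 term is applied through a soft-threshold step that drives uninformative coefficients exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def elastic_net_logistic(X, y, alpha=0.1, l1_ratio=0.5, lr=0.1, steps=3000):
    """Proximal gradient descent for logistic loss plus the elastic net
    penalty alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||^2).
    The L2 part is folded into the smooth gradient; the L1 part is the
    soft-threshold (proximal) step, which yields exact zeros."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        pr = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        grad = X.T @ (pr - y) / n + alpha * (1.0 - l1_ratio) * w
        w = soft_threshold(w - lr * grad, lr * alpha * l1_ratio)
    return w

# Synthetic cohort: 120 samples x 50 genes; only genes 0-2 separate the
# two phenotype groups (all values purely illustrative).
X = rng.normal(size=(120, 50))
y = rng.integers(0, 2, 120)
X[:, :3] += 2.0 * (2 * y[:, None] - 1)

w = elastic_net_logistic(X, y)
selected = np.flatnonzero(w)              # the sparse "signature" genes
acc = np.mean((X @ w > 0) == (y == 1))    # training accuracy
```

In glmnet the same trade-off is steered by `alpha` (the L1/L2 mix) and `lambda` (overall penalty strength), with `cv.glmnet` typically used for the cross-validated selection the abstract describes.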
Kernel-based method for feature selection and disease diagnosis using transcriptomics data
Global transcriptome profiling is the foundation of systems biology and has been extensively used in biomarker discovery. Tools have been developed to extract meaningful biological information and useful gene features from transcriptomics data. However, there is no commonly accepted method for such purposes. The first IMPROVER (Industrial Methodology for Process Verification in Research) challenge was launched to assess and verify classification methods using transcriptomics data from clinical samples. We established a computational approach that combined a kernel Fisher discriminant classifier and a feature selection scheme, which used scaled alignment selection and recursive feature elimination methods. A simple and reliable batch effect correction approach was also used. With this approach, a set of informative genes, i.e., biomarker candidates, could be identified for disease diagnosis and classification. We applied this approach to the sbv IMPROVER Challenge and achieved the highest rank in the psoriasis sub-challenge. Here, we describe our methodology and results for the sub-challenge.
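The recursive feature elimination loop at the heart of such a selection scheme can be sketched on synthetic data. This is a simplification of the authors' pipeline (their classifier is a kernel Fisher discriminant, and they additionally apply scaled alignment selection and batch correction): here, the per-gene Fisher discriminant ratio serves as the illustrative ranking criterion, and the lowest-scoring half of the surviving genes is discarded on each pass.

```python
import numpy as np

rng = np.random.default_rng(2)

def fisher_score(X, y):
    """Per-gene Fisher discriminant ratio: (m1 - m0)^2 / (v1 + v0)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X1.mean(0) - X0.mean(0)) ** 2 / (X1.var(0) + X0.var(0) + 1e-12)

def rfe(X, y, n_keep=5, drop_frac=0.5):
    """Recursive feature elimination: rescore the surviving genes on each
    pass and discard the lowest-scoring fraction until n_keep remain."""
    idx = np.arange(X.shape[1])
    while len(idx) > n_keep:
        scores = fisher_score(X[:, idx], y)
        order = np.argsort(scores)[::-1]                  # best first
        keep = max(n_keep, int(len(idx) * (1 - drop_frac)))
        idx = idx[order[:keep]]
    return np.sort(idx)

# Synthetic data: 100 samples x 200 genes; only genes 0-4 are informative
# for the two-class phenotype (values purely illustrative).
X = rng.normal(size=(100, 200))
y = rng.integers(0, 2, 100)
X[:, :5] += 1.5 * (2 * y[:, None] - 1)

signature = rfe(X, y, n_keep=5)
```

The point of rescoring at every pass, rather than ranking once, is that a gene's apparent relevance can change once correlated competitors have been removed; the authors' kernelized version follows the same recursive pattern.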
Predicting COPD status with a random generalized linear model
Sample classification, especially disease status prediction, is an important area of investigation for gene expression studies. Many machine learning methods have been developed to tackle this problem. To evaluate different prediction methods, the IMPROVER Challenge made several data sets available. Here we focus on one sub-challenge: chronic obstructive pulmonary disease (COPD). We outlined critical preprocessing steps to make training and test data comparable. We compared our recently introduced random generalized linear model (RGLM) predictor with Leo Breiman’s random forest (RF) predictor on the COPD data set. We discussed potential reasons for the superior performance of the RGLM predictor in this sub-challenge. Interestingly, we found that although several genes were highly predictive of COPD status, none were necessary to achieve accurate prediction when the demographic features smoking status and age were used. In conclusion, RGLM achieved superior predictive accuracy for predicting COPD status with smoking status and age as mandatory features. Future cohort studies could evaluate whether the resulting predictor has clinical utility.
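The RGLM idea can be caricatured in a few lines on synthetic data. The published predictor additionally performs variable selection within each bag; the sketch below keeps only the bagging ingredients, namely bootstrap samples, random feature subsets, and a majority vote over per-bag GLMs, with features 0 and 1 playing the role of the dominant demographic predictors. All names and values are illustrative, not the authors' method or data.

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic_fit(X, y, lr=0.2, steps=400):
    """Plain logistic GLM fit by gradient ascent (a stand-in for glm())."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        pr = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w += lr * X.T @ (y - pr) / len(y)
    return w

def rglm_fit(X, y, n_bags=50, n_feat=4):
    """RGLM in miniature: each bag fits a GLM on a bootstrap sample
    restricted to a random feature subset."""
    models = []
    for _ in range(n_bags):
        rows = rng.integers(0, len(y), len(y))             # bootstrap rows
        cols = rng.choice(X.shape[1], n_feat, replace=False)
        models.append((cols, logistic_fit(X[rows][:, cols], y[rows])))
    return models

def rglm_predict(models, X):
    """Majority vote over the per-bag GLM predictions."""
    votes = np.mean([X[:, cols] @ w > 0 for cols, w in models], axis=0)
    return (votes > 0.5).astype(int)

# Synthetic cohort: 200 samples x 10 features; features 0 and 1 stand in
# for the strong "smoking status / age" predictors (illustrative only).
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, 200)
X[:, :2] += 2.0 * (2 * y[:, None] - 1)

models = rglm_fit(X, y)
acc = np.mean(rglm_predict(models, X) == y)
```

Because each bag is a small, interpretable GLM, the ensemble retains readable per-model coefficients while gaining the variance reduction of bagging, which is one commonly cited reason for RGLM's competitive accuracy against random forests.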