
Mutation profile for top-k search exploiting gene function relationship and matrix factorization

Korean title (translated): A mutation profiling method based on gene function information and matrix factorization for top-K queries

Abstract

Given a large quantity of genome mutation data collected from clinics, how can we search for similar patients? Similarity search based on patient mutation profiles can support various translational bioinformatics tasks, including prognostics and treatment efficacy prediction, and can improve clinical decision making by drawing on the sheer volume of data. However, this is a challenging problem because mutation data are heterogeneous, sparse, and high-dimensional. To tackle this problem, we suggest a compact representation and search strategy based on Gene Ontology (GO) and orthogonal non-negative matrix factorization (ONMF). The statistical significance of the relationship between the identified cancer subtypes and their clinical features is computed for validation; the results show that our method can identify and characterize clinically meaningful tumor subtypes better than the recently introduced Network Based Stratification (NBS) method while enabling real-time search. To the best of our knowledge, this is the first attempt to simultaneously characterize and represent somatic mutation data for efficient search. As a next step, to obtain a more accurate mutation profile for similarity search, we propose a new mutation profile called the Multi-Latent Semantic Analysis Mutation Profile (MLSA-MP). MLSA-MP is inspired by the fact that genes can have complex relationships within each gene set, where a gene set contains genes that are biologically related to each other. Accordingly, the same pair of patients can have different proximities depending on the gene set. To build MLSA-MP, given mutation data and a number of pre-defined gene sets, we first generate a collection of sub-profiles of the mutation data. For each sub-profile, a set of latent representations is constructed by repeatedly applying Latent Semantic Analysis (LSA). Finally, the MLSA-MP is built by concatenating the set of latent representations.
According to the experimental results, MLSA-MP retrieves clinically similar patients more accurately than both NBS and ONMF-MP. In terms of the predictive power of the identified cancer subtypes, the comparison shows that MLSA-MP also identifies and characterizes clinically meaningful tumor subtypes better than both ONMF-MP and NBS.
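
The construction outlined in the abstract (sub-profiles per gene set, a latent representation for each, then concatenation) might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: it applies a single truncated SVD per gene set as the LSA step, since the abstract does not specify how LSA is repeated, and it uses cosine similarity for the top-k search, which the abstract also does not name. The function names, the `rank` parameter, the toy mutation matrix, and the gene sets are all hypothetical.

```python
import numpy as np

def lsa_embed(sub_profile, rank):
    """Truncated-SVD (LSA) embedding of a patients x genes sub-profile."""
    u, s, _ = np.linalg.svd(sub_profile, full_matrices=False)
    r = min(rank, len(s))
    # Scale the leading left singular vectors by their singular values.
    return u[:, :r] * s[:r]

def build_profile(mutations, gene_sets, rank=2):
    """Concatenate one LSA embedding per gene set (assumed simplification)."""
    parts = [lsa_embed(mutations[:, cols], rank) for cols in gene_sets]
    return np.hstack(parts)

def top_k(profile, query_idx, k):
    """Indices of the k patients most cosine-similar to the query patient."""
    norms = np.linalg.norm(profile, axis=1)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    unit = profile / norms[:, None]
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf        # exclude the query itself
    return np.argsort(-sims)[:k]

# Hypothetical toy data: 5 patients x 6 genes, binary mutation indicators.
mutations = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

# Hypothetical gene sets, given as column indices; sets may overlap.
gene_sets = [[0, 1, 2], [2, 3, 4], [4, 5]]

profile = build_profile(mutations, gene_sets, rank=2)
neighbors = top_k(profile, query_idx=0, k=2)
print(profile.shape, neighbors)
```

In this toy example, patients 0 and 1 carry identical mutations, so patient 1 is the nearest neighbor of patient 0. Because each gene set contributes its own block of columns, the same pair of patients can be close in one block and distant in another, which is the intuition the abstract describes.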


Table of Contents

1 Introduction
1.1 Somatic mutations and associated challenges
1.2 Gene Ontology and Orthogonal Non-negative Matrix Factorization based mutation profile
1.3 Multi-Latent Semantic Analysis based mutation profile
1.4 Organization
2 Related works
2.1 Cancer genomics
2.1.1 Gene sequencing analysis of tumor samples
2.1.2 Identifying and characterizing cancer subtypes
2.2 Background knowledge
2.2.1 Non-negative Matrix Factorization
2.2.2 Somatic mutation of cancer
2.2.3 External information of gene functional interaction: Gene Ontology and Molecular Signature Database
3 Preliminaries
3.1 Somatic mutation data and sparsity problem
3.2 Similarity measure for raw mutation profile
3.2.1 Mutation profiles with different similarity measures
3.2.2 Non-metric multi-dimensional scaling
3.3 Visualization results
4 Gene-Ontology and Orthogonal-NMF based Mutation Profile
4.1 Materials and Methods
4.1.1 Overview of Patient Profile Construction and Validation
4.1.2 Dataset
4.1.3 Constructing Mutation Profiles
4.1.4 Representation and Stratification with ONMF
4.1.5 Identifying Significant GO terms
4.1.6 Search Performance Validation
4.2 Results
4.2.1 Search Accuracy and Speed
4.2.2 Validation of Cancer Stratification
4.2.3 Empirical Analysis of ONMF-MP Results Based on GO-term Propagation
5 Multi-LSA based mutation profile
5.1 Materials and Methods
5.1.1 Latent semantic analysis
5.1.2 External collections of gene sets
5.1.3 Multi-LSA based mutation profile
5.2 Experimental results
5.2.1 Datasets
5.2.2 Accuracy of top-k search
5.2.3 Stratification analysis in statistical significance
6 Conclusion
