Mutation profile for top-k search exploiting gene function relationship and matrix factorization
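Korean title (translated): Mutation profiling technique based on gene function information and matrix factorization for top-K queries
- Publisher: Pohang University of Science and Technology (POSTECH), Graduate School
- Advisor: Hwanjo Yu (유환조)
- Publication year: 2015
- Degree conferred: August 2015
- Degree: Doctorate
- Department and major: Graduate School, Department of Computer Science and Engineering
- URI: http://www.dcollection.net/handler/postech/000002062204
- Language of text: English
- Copyright: Theses of Pohang University of Science and Technology are protected by copyright.
Abstract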
Given a large quantity of genome mutation data collected from clinics, how can we search for similar patients? Similarity search based on patient mutation profiles can support various translational bioinformatics tasks, including prognostics and treatment efficacy prediction, and so improve clinical decision making through the sheer volume of data. However, this is a challenging problem because mutation data are heterogeneous, sparse, and high-dimensional. To tackle this problem, we suggest a compact representation and search strategy based on Gene Ontology (GO) and orthogonal non-negative matrix factorization (ONMF). For validation, we compute the statistical significance of the relationship between the identified cancer subtypes and their clinical features; the results show that our method identifies and characterizes clinically meaningful tumor subtypes better than the recently introduced Network Based Stratification (NBS) method while enabling real-time search. To the best of our knowledge, this is the first attempt to simultaneously characterize and represent somatic mutation data for efficient search.
As a next step, to obtain a more accurate mutation profile for similarity search, we propose a new mutation profile called the Multi-Latent Semantic Analysis Mutation Profile (MLSA-MP). MLSA-MP is motivated by the fact that genes can have complex relationships within each gene set, where a gene set contains genes that are biologically related to each other. Accordingly, the same pair of patients can have different proximities depending on the gene set. To build MLSA-MP, given mutation data and a number of pre-defined gene sets, we first generate a collection of sub-profiles of the mutation data. For each sub-profile, a set of latent representations is constructed by repeatedly applying Latent Semantic Analysis (LSA). Finally, MLSA-MP is built by concatenating these latent representations. According to the experimental results, MLSA-MP retrieves clinically similar patients more accurately than both NBS and the ONMF-based mutation profile (ONMF-MP). In terms of the predictive power of the identified cancer subtypes, the comparison shows that MLSA-MP also identifies and characterizes clinically meaningful tumor subtypes better than both ONMF-MP and NBS.
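The MLSA-MP construction steps described in the abstract (sub-profiles per gene set, LSA on each sub-profile, concatenation, then top-k search) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy data, the rank, and the use of cosine similarity for ranking are all assumptions made for illustration, and LSA is applied only once per sub-profile here, whereas the abstract speaks of applying it repeatedly.

```python
import numpy as np

def lsa_embed(sub_profile, rank):
    """Truncated SVD (LSA) of a patients x genes matrix; returns patient embeddings."""
    u, s, _ = np.linalg.svd(sub_profile, full_matrices=False)
    r = min(rank, len(s))
    return u[:, :r] * s[:r]

def build_mlsa_profile(mutations, gene_sets, rank=2):
    """Hypothetical helper: one LSA embedding per gene set, concatenated column-wise."""
    blocks = []
    for gene_idx in gene_sets:
        sub = mutations[:, gene_idx]        # sub-profile restricted to one gene set
        blocks.append(lsa_embed(sub, rank))
    return np.hstack(blocks)

def top_k(profile, query_row, k):
    """Rank patients by cosine similarity to the query patient (assumed measure)."""
    norms = np.linalg.norm(profile, axis=1)
    norms[norms == 0] = 1.0                 # avoid division by zero for empty profiles
    unit = profile / norms[:, None]
    sims = unit @ unit[query_row]
    sims[query_row] = -np.inf               # exclude the query itself
    return np.argsort(-sims)[:k]

# Toy binary mutation matrix: 6 patients x 8 genes (made-up values).
mutations = np.array([
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1],
], dtype=float)

# Two made-up gene sets, given as column indices.
gene_sets = [[0, 1, 2, 3], [4, 5, 6, 7]]

profile = build_mlsa_profile(mutations, gene_sets, rank=2)
neighbors = top_k(profile, query_row=0, k=3)
```

Because each gene set contributes its own block of latent dimensions, two patients can be close in one block and far apart in another, which is the behaviour the abstract attributes to MLSA-MP.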
Table of contents
1 Introduction
1.1 Somatic mutations and associated challenges
1.2 Gene Ontology and Orthogonal Non-negative Matrix Factorization based mutation profile
1.3 Multi-Latent Semantic Analysis based mutation profile
1.4 Organization
2 Related works
2.1 Cancer genomics
2.1.1 Gene sequencing analysis of tumor samples
2.1.2 Identifying and characterizing cancer subtypes
2.2 Background knowledge
2.2.1 Non-negative Matrix Factorization
2.2.2 Somatic mutation of cancer
2.2.3 External information of gene functional interaction: Gene Ontology and Molecular Signature Database
3 Preliminaries
3.1 Somatic mutation data and sparsity problem
3.2 Similarity measure for raw mutation profile
3.2.1 Mutation profiles with different similarity measures
3.2.2 Non-metric multi-dimensional scaling
3.3 Visualization results
4 Gene-Ontology and Orthogonal-NMF based Mutation Profile
4.1 Materials and Methods
4.1.1 Overview of Patient Profile Construction and Validation
4.1.2 Dataset
4.1.3 Constructing Mutation Profiles
4.1.4 Representation and Stratification with ONMF
4.1.5 Identifying Significant GO terms
4.1.6 Search Performance Validation
4.2 Results
4.2.1 Search Accuracy and Speed
4.2.2 Validation of Cancer Stratification
4.2.3 Empirical Analysis of ONMF-MP Results Based on GO-term Propagation
5 Multi-LSA based mutation profile
5.1 Materials and Methods
5.1.1 Latent semantic analysis
5.1.2 External collections of gene sets
5.1.3 Multi-LSA based mutation profile
5.2 Experimental results
5.2.1 Datasets
5.2.2 Accuracy of top-k search
5.2.3 Stratification analysis in statistical significance
6 Conclusion