Go to the content. | Move to the navigation | Go to the site search | Go to the menu | Contacts | Accessibility

| Create Account

Tran, Van Dinh (2018) Kernel methods for large-scale graph-based heterogeneous biological data integration. [Ph.D. thesis]

Full text disponibile come:

[img]
Preview
PDF Document (PhD Thesis) - Accepted Version
Available under License Creative Commons Attribution.

4Mb

Abstract (english)

The last decade has experienced a rapid growth in volume and diversity of biological data, thanks to the development of high-throughput technologies related to web services and embeded systems. It is common that information related to a given biological phenomenon is encoded in multiple data sources. On the one hand, this provides a great opportunity for biologists and data scientists to have more unified views about phenomenon of interest. On the other hand, this presents challenges for scientists to find optimal ways in order to wisely extract knowledge from such huge amount of data which normally cannot be done without the help of automated learning systems. Therefore, there is a high need of developing smart learning systems, whose input as set of multiple sources, to support experts to form and assess hypotheses in biology and medicine. In these systems, the problem of combining multiple data sources or data integration needs to be efficiently solved to achieve high performances. Biological data can naturally be represented as graphs. By taking graphs for data representation, we can take advantages from the access to a solid and principled mathematical framework for graphs, and the problem of data integration becomes graph-based integration. In recent years, the machine learning community has witnessed the tremendous growth in the development of kernel-based learning algorithms. Kernel methods whose kernel functions allow to separate between the representation of the data and the general learning algorithm. Interestingly, kernel representation can be applied to any type of data, including trees, graphs, vectors, etc.
For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, there is a number of challenges for graph-based systems using kernel methods need to be effectively solved, including definition of node similarity measure, graph sparsity, scalability, efficiency, complementary property exploitation, integration methods.
The contributions of the thesis aim at investigating to propose solutions that overcome the challenges faced when constructing graph-based data integration learning systems.

The first contribution is the definition of a decompositional graph node kernel, named Conjunctive Disjunctive Node Kernel (CDNK), which intends to measure the similarities between nodes of graphs. Differently of existing graph node kernels that only exploit the topologies of graphs, the proposed kernel also utilizes the available information on the graph nodes. In CDNK, first, the graph is transformed into a set of linked connected components in which we distinguish between “conjunctive” links whose endpoints are in the same connected components and “disjunctive” links that connect nodes located in different connected components. Then the similarity between any couple of nodes is measured by employing a particular graph kernel on two neighborhood subgraphs rooted as each node. Next, it integrates the side information by applying convolution of the discrete information with the real valued vectors associated to graph nodes. Empirical evaluation shows that the kernel presents better performance compared to state-of-the-art graph node kernels.
The second contribution aims at dealing with the graph sparsity problem. When working with sparse graphs, i.e graphs with a high number of missing links, the available information is not efficient to learn effectively. An idea to overcome this problem is to use link enrichment to enrich information for graphs. However, the performance of a link enrichment strongly depends on the adopted link prediction method. Therefore, we propose an effective link prediction method (JNSL). In this method, first, each link is represented as a joint neighborhood subgraphs. Then link prediction is considered as a binary classification. We empirically show that the proposed link prediction outperforms various other methods. Besides, we also present a method to boost the performance of diffusion-based kernels, which are most popularly used, by coupling kernel methods with link enrichment. Experimental results prove that the performances of diffusion-based graph node kernels are considerably improved by using link enrichment.

The last contribution proposes a general kernel-based framework for graph integration that we name Graph-one. Graph-one is designed to overcome the challenges when handling with graph integration. In particular, it is a scalable and efficient framework. Besides, it is able to deal with unbanlanced settings where the number of positive and negative instances are much different. Numerous variations of Graph-one are evaluated in disease gene prioritization context. The results from experiments illustrate the power of the proposed framework. Precisely, Graph-one shows better performance than various methods. Moreover, Graph-one with data integration gets higher results than it with any single data source. It presents the effectiveness of Graph-one in exploiting the complementary property of graph integration.

Abstract (italian)

The last decade has experienced a rapid growth in volume and diversity of biological data, thanks to the development of high-throughput technologies related to web services and embeded systems. It is common that information related to a given biological phenomenon is encoded in multiple data sources. On the one hand, this provides a great opportunity for biologists and data scientists to have more unified views about phenomenon of interest. On the other hand, this presents challenges for scientists to find optimal ways in order to wisely extract knowledge from such huge amount of data which normally cannot be done without the help of automated learning systems. Therefore, there is a high need of developing smart learning systems, whose input as set of multiple sources, to support experts to form and assess hypotheses in biology and medicine. In these systems, the problem of combining multiple data sources or data integration needs to be efficiently solved to achieve high performances. Biological data can naturally be represented as graphs. By taking graphs for data representation, we can take advantages from the access to a solid and principled mathematical framework for graphs, and the problem of data integration becomes graph-based integration. In recent years, the machine learning community has witnessed the tremendous growth in the development of kernel-based learning algorithms. Kernel methods whose kernel functions allow to separate between the representation of the data and the general learning algorithm. Interestingly, kernel representation can be applied to any type of data, including trees, graphs, vectors, etc.
For this reason, kernel methods are a reasonable and logical choice for graph-based inference systems. However, there is a number of challenges for graph-based systems using kernel methods need to be effectively solved, including definition of node similarity measure, graph sparsity, scalability, efficiency, complementary property exploitation, integration methods.
The contributions of the thesis aim at investigating to propose solutions that overcome the challenges faced when constructing graph-based data integration learning systems.

The first contribution is the definition of a decompositional graph node kernel, named Conjunctive Disjunctive Node Kernel (CDNK), which intends to measure the similarities between nodes of graphs. Differently of existing graph node kernels that only exploit the topologies of graphs, the proposed kernel also utilizes the available information on the graph nodes. In CDNK, first, the graph is transformed into a set of linked connected components in which we distinguish between “conjunctive” links whose endpoints are in the same connected components and “disjunctive” links that connect nodes located in different connected components. Then the similarity between any couple of nodes is measured by employing a particular graph kernel on two neighborhood subgraphs rooted as each node. Next, it integrates the side information by applying convolution of the discrete information with the real valued vectors associated to graph nodes. Empirical evaluation shows that the kernel presents better performance compared to state-of-the-art graph node kernels.
The second contribution aims at dealing with the graph sparsity problem. When working with sparse graphs, i.e graphs with a high number of missing links, the available information is not efficient to learn effectively. An idea to overcome this problem is to use link enrichment to enrich information for graphs. However, the performance of a link enrichment strongly depends on the adopted link prediction method. Therefore, we propose an effective link prediction method (JNSL). In this method, first, each link is represented as a joint neighborhood subgraphs. Then link prediction is considered as a binary classification. We empirically show that the proposed link prediction outperforms various other methods. Besides, we also present a method to boost the performance of diffusion-based kernels, which are most popularly used, by coupling kernel methods with link enrichment. Experimental results prove that the performances of diffusion-based graph node kernels are considerably improved by using link enrichment.

The last contribution proposes a general kernel-based framework for graph integration that we name Graph-one. Graph-one is designed to overcome the challenges when handling with graph integration. In particular, it is a scalable and efficient framework. Besides, it is able to deal with unbanlanced settings where the number of positive and negative instances are much different. Numerous variations of Graph-one are evaluated in disease gene prioritization context. The results from experiments illustrate the power of the proposed framework. Precisely, Graph-one shows better performance than various methods. Moreover, Graph-one with data integration gets higher results than it with any single data source. It presents the effectiveness of Graph-one in exploiting the complementary property of graph integration.

Statistiche Download
EPrint type:Ph.D. thesis
Tutor:Sperduti, Alessandro
Supervisor:Sperduti, Alessandro
Ph.D. course:Ciclo 30 > Corsi 30 > BRAIN, MIND AND COMPUTER SCIENCE
Data di deposito della tesi:03 April 2018
Anno di Pubblicazione:03 April 2018
Key Words:graph kernels, data integration, disease gene prioritization
Settori scientifico-disciplinari MIUR:Area 01 - Scienze matematiche e informatiche > INF/01 Informatica
Struttura di riferimento:Dipartimenti > Dipartimento di Matematica
Codice ID:11217
Depositato il:25 Oct 2018 16:41
Simple Metadata
Full Metadata
EndNote Format

Download statistics

Solo per lo Staff dell Archivio: Modifica questo record