Thesis Details

Platform for Biological Sequence Analysis Using Machine Learning

Bachelor's Thesis
Student
Lacko Dávid
Academic Year
2021/2022
Supervisor
Martínek Tomáš, doc. Ing., Ph.D.
Czech title
Platforma pro analýzu biologických sekvencí s využitím strojového učení
Language
English
Abstract

Protein characterisation is an active area of machine learning: experimental annotation is usually costly and time-consuming, and many datasets suitable for training predictors are now being published. One recent method, innov'SAR, combines the Fourier transform with partial least squares (PLS) regression and has been used in several protein engineering applications. However, the code for the method is not freely available, and the method itself has not been statistically verified. The goal of this thesis is to address these limitations by implementing and extending the method in Python as an easy-to-use platform for training and testing models. The extensions include parallelization, a Spearman scoring function, and support for aligned sequence input. Statistical significance testing is also performed to verify the dependencies found between the input sequences and the properties of the proteins. The method proved to be statistically significant, with strong dependencies found between inputs and outputs. Models trained on two newly collected haloalkane dehalogenase datasets achieve cross-validation scores of Q2 = 0.54 and Q2 = 0.77, an almost twofold improvement over the baseline models. The resulting models allow large sequence databases to be filtered and scanned for potential improvements in protein properties.
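
The two numerical steps named in the abstract, the Fourier transform of an encoded sequence and the cross-validated Q2 score, can be illustrated with a minimal sketch. This is not the thesis code: the function names, the toy residue index and its values are assumptions made for illustration, the regression step is omitted, and Q2 is assumed to follow the commonly used definition 1 - PRESS / TSS.

```python
import cmath

# Hypothetical per-residue numerical index (illustrative values only);
# "-" stands for an alignment gap in an aligned sequence input.
TOY_INDEX = {"A": 1.8, "G": -0.4, "L": 3.8, "K": -3.9, "-": 0.0}


def encode(seq, index=TOY_INDEX):
    """Map each residue (or gap) of a sequence to a number."""
    return [index[residue] for residue in seq]


def dft_magnitudes(signal):
    """Return the magnitudes of the discrete Fourier transform of a real signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def q2(y_true, y_cv_pred):
    """Cross-validated Q2 = 1 - PRESS / TSS (assumed definition).

    y_cv_pred holds predictions made for each sample while that sample
    was held out of training.
    """
    mean = sum(y_true) / len(y_true)
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv_pred))
    tss = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - press / tss


# Features for one aligned toy sequence: one magnitude per frequency.
features = dft_magnitudes(encode("AG-LK"))
```

In a full pipeline, such spectra would serve as the input features of a regression model, and q2 would be evaluated on its held-out predictions. A model that only ever predicts the mean of the measured values scores Q2 = 0, and perfect held-out predictions score Q2 = 1.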

Keywords

machine learning, protein engineering, bioinformatics, PLS, haloalkane dehalogenases

Department
Degree Programme
Information Technology
Status
defended, grade A
Date
14 June 2022
Reviewer
Committee
Sekanina Lukáš, prof. Ing., Ph.D. (DCSY FIT BUT), chair
Hradiš Michal, Ing., Ph.D. (DCGM FIT BUT), member
Jaroš Jiří, doc. Ing., Ph.D. (DCSY FIT BUT), member
Křivka Zbyněk, Ing., Ph.D. (DIFS FIT BUT), member
Lengál Ondřej, Ing., Ph.D. (DITS FIT BUT), member
Citation
LACKO, Dávid. Platform for Biological Sequence Analysis Using Machine Learning. Brno, 2022. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2022-06-14. Supervised by Martínek Tomáš. Available from: https://www.fit.vut.cz/study/thesis/25037/
BibTeX
@bachelorsthesis{FITBT25037,
    author = "D\'{a}vid Lacko",
    type = "Bachelor's thesis",
    title = "Platform for Biological Sequence Analysis Using Machine Learning",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2022,
    location = "Brno, CZ",
    language = "english",
    url = "https://www.fit.vut.cz/study/thesis/25037/"
}