Thesis Details
Platform for Biological Sequence Analysis Using Machine Learning
Protein characterisation is an active area of machine learning: experimental annotation is usually costly and time-consuming, and many datasets suitable for training predictors are now being published. One recent method, called innov'SAR, combines the Fourier transform with partial least squares (PLS) regression and has been used in several protein engineering applications. However, the code for the method is not freely available, and the method itself has not been statistically verified. The goal of this thesis is to address these limitations by implementing and extending the method in Python as an easy-to-use platform for training and testing models. The extensions include parallelization, a Spearman scoring function, and aligned sequence input. Statistical significance testing is also performed to verify the dependencies found between the input sequences and the properties of the proteins. The method proved to be statistically significant, with strong dependencies found between inputs and outputs. Models trained on two newly collected haloalkane dehalogenase datasets achieved cross-validation scores of Q² = 0.54 and Q² = 0.77, an almost twofold improvement over the baseline models. The resulting models allow filtering of large sequence databases and scanning for potential improvements in protein properties.
machine learning, protein engineering, bioinformatics, PLS, haloalkane dehalogenases
Hradiš Michal, Ing., Ph.D. (DCGM FIT BUT), member
Jaroš Jiří, doc. Ing., Ph.D. (DCSY FIT BUT), member
Křivka Zbyněk, Ing., Ph.D. (DIFS FIT BUT), member
Lengál Ondřej, Ing., Ph.D. (DITS FIT BUT), member
@bachelorsthesis{FITBT25037,
  author = "D\'{a}vid Lacko",
  type = "Bachelor's thesis",
  title = "Platform for Biological Sequence Analysis Using Machine Learning",
  school = "Brno University of Technology, Faculty of Information Technology",
  year = 2022,
  location = "Brno, CZ",
  language = "english",
  url = "https://www.fit.vut.cz/study/thesis/25037/"
}