Thesis Details

Recognition of Repeating SMS Patterns

Bachelor's Thesis
Student: Kočalka Jakub
Academic Year: 2020/2021
Supervisor: Holík Lukáš, doc. Mgr., Ph.D.
Czech title
Rozpoznávání opakujících se vzorů SMS zpráv
Language
English
Abstract

With advances in e-mail spam recognition and growing user awareness, spammers are moving towards less researched media. One of these is the Short Message Service (SMS), which boasts high availability and open rates. These characteristics are also attractive to legitimate businesses that need to send short bulk messages to their clients. However, while such messages might be solicited by the end user, they can represent a loss for the SMS service provider, as these businesses often misuse unlimited SMS plans meant for regular customers to avoid paying for the more expensive solutions designated for them. It is therefore desirable to be able to recognize both unsolicited and solicited bulk messages. Bulk messages are generally generated from a template. The goal of this work is to design a clustering algorithm that treats a message as a sequence of lexical units (words), and to evaluate its effectiveness compared to a locality-sensitive hashing method that treats the message as a string of symbols. The work evaluates the suitability of the Smith-Waterman alignment algorithm for this task. It details why Smith-Waterman (and other local alignment techniques) is unsuitable, and how it can be replaced by Needleman-Wunsch (global alignment) to produce much better results. The resulting algorithm is able to cluster real messages into campaigns satisfactorily, and performs well even in situations where the benchmark locality-sensitive hashing method fragments campaigns.
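To illustrate the idea of aligning messages as word sequences, the following is a minimal sketch of Needleman-Wunsch global alignment scoring over whitespace-separated tokens. It is not the thesis's implementation: the function names `nw_score` and `word_similarity`, the scoring values (match +1, mismatch -1, gap -1), the whitespace tokenisation, and the normalisation by the longer message length are all assumptions chosen for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two token sequences.

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, tok_a in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, tok_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if tok_a == tok_b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def word_similarity(msg_a, msg_b):
    """Alignment score divided by the longer token count (hypothetical normalisation)."""
    a, b = msg_a.split(), msg_b.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return nw_score(a, b) / longest


if __name__ == "__main__":
    # Two messages from the same made-up template differ only in one word.
    print(word_similarity("Your code is 1234 valid for 5 minutes",
                          "Your code is 9876 valid for 5 minutes"))
```

Under these assumptions, two messages generated from the same template score close to 1, while unrelated messages score well below it, so a threshold on this value could serve as a simple clustering criterion.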

Keywords

Smith-Waterman, Needleman-Wunsch, SMS, spam, sequence alignment, string clustering

Department
Degree Programme
Information Technology
Status
defended, grade D
Date
24 August 2021
Reviewer
Committee
Hruška Tomáš, prof. Ing., CSc. (DIFS FIT BUT), chair
Bidlo Michal, doc. Ing., Ph.D. (DCSY FIT BUT), member
Grézl František, Ing., Ph.D. (DCGM FIT BUT), member
Herout Adam, prof. Ing., Ph.D. (DCGM FIT BUT), member
Smrčka Aleš, Ing., Ph.D. (DITS FIT BUT), member
Citation
KOČALKA, Jakub. Recognition of Repeating SMS Patterns. Brno, 2021. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-08-24. Supervised by Holík Lukáš. Available from: https://www.fit.vut.cz/study/thesis/24169/
BibTeX
@bachelorsthesis{FITBT24169,
    author = "Jakub Ko\v{c}alka",
    type = "Bachelor's thesis",
    title = "Recognition of Repeating SMS Patterns",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "english",
    url = "https://www.fit.vut.cz/study/thesis/24169/"
}