Journal of Sensors

Research Article

Effective Preprocessing and Normalization Techniques for COVID-19 Twitter Streams with POS Tagging via Lightweight Hidden Markov Model

Normalization Algorithm.

Input: Tweet Dataset
Output: Normalized Tweet
BEGIN
  Normalized Tweet ← {}
  FOREACH token IN Tweet DO
    IF token NOT EQUAL TO Noun THEN
      IF token IN OOV word THEN
        Token ← OOV word
      ELSE
        Token ← not OOV word
      IF token IN BROWN Cluster THEN
        Cand-Token ← Fetch the candidate words from the BROWN Cluster
                     and compute the Levenshtein and Metaphone edit distances
        Normalized Tweet ← Append the Cand-Token with the highest frequency score
      ELSE [Token not in BROWN Cluster]
        Sug-Token ← Retrieve the suggestions for the token using the PyEnchant dictionary
        FOREACH Suggestion FROM Sug-Token DO
          Score = (Prob(Prev-token + Suggestion) + Prob(Suggestion + Next-token)) / 2
                  [Using Microsoft Web N-Gram]
        Normalized Tweet ← Append the Suggestion with the highest score
  RETURN Normalized Tweet
END
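
The control flow of the normalization algorithm above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: only non-noun, out-of-vocabulary tokens are rewritten; the Brown clusters are represented by a hypothetical dictionary mapping a token to (candidate, frequency) pairs; candidates are filtered with a Levenshtein threshold (`max_dist`, an illustrative parameter) and the Metaphone comparison is omitted; the PyEnchant suggestions and the Microsoft Web N-Gram probabilities are replaced by a caller-supplied `suggest` function and a toy bigram table. All names and numbers below are invented for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def context_score(prev_tok: str, cand: str, next_tok: str,
                  bigram_prob: Callable[[str, str], float]) -> float:
    """Score = (Prob(prev + cand) + Prob(cand + next)) / 2."""
    return (bigram_prob(prev_tok, cand) + bigram_prob(cand, next_tok)) / 2


def normalize(tagged: List[Tuple[str, str]],
              vocab: Set[str],
              clusters: Dict[str, List[Tuple[str, int]]],
              suggest: Callable[[str], List[str]],
              bigram_prob: Callable[[str, str], float],
              max_dist: int = 2) -> List[str]:
    """Normalize one POS-tagged tweet; nouns and in-vocabulary tokens pass through."""
    out: List[str] = []
    for i, (tok, tag) in enumerate(tagged):
        if tag == "NOUN" or tok in vocab:
            out.append(tok)
            continue
        if tok in clusters:
            # Keep cluster members close to the token, then take the most frequent.
            cands = [(w, f) for w, f in clusters[tok]
                     if levenshtein(tok, w) <= max_dist]
            out.append(max(cands, key=lambda wf: wf[1])[0] if cands else tok)
        else:
            # Rank dictionary suggestions by their fit with the neighbouring tokens.
            prev_tok = tagged[i - 1][0] if i > 0 else "<s>"
            next_tok = tagged[i + 1][0] if i + 1 < len(tagged) else "</s>"
            options = suggest(tok)
            best = max(options,
                       key=lambda s: context_score(prev_tok, s, next_tok, bigram_prob),
                       default=tok)
            out.append(best)
    return out


# Toy stand-ins (assumptions, not real resources or real statistics).
VOCAB = {"i", "will", "you", "see", "tomorrow"}
CLUSTERS = {"tomoro": [("tomorrow", 120), ("tomorow", 8)]}
SUGGESTIONS = {"c": ["see", "sea", "can"]}
BIGRAMS = {("will", "see"): 0.2, ("see", "you"): 0.3,
           ("will", "sea"): 0.001, ("sea", "you"): 0.001,
           ("can", "you"): 0.25}

tweet = [("i", "PRON"), ("will", "VERB"), ("c", "VERB"),
         ("you", "PRON"), ("tomoro", "ADV"), ("covid", "NOUN")]

normalized = normalize(
    tweet, VOCAB, CLUSTERS,
    suggest=lambda t: SUGGESTIONS.get(t, []),
    bigram_prob=lambda a, b: BIGRAMS.get((a, b), 0.0),
)
print(normalized)
```

In this toy run, "c" is resolved through the suggestion branch because "see" has the highest averaged bigram score with its neighbours, "tomoro" is resolved through the cluster branch by frequency, and the noun "covid" is left untouched. In practice the `suggest` callable could wrap a spell-checking dictionary and `bigram_prob` could wrap any available n-gram language model.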