Research Article

Effective Preprocessing and Normalization Techniques for COVID-19 Twitter Streams with POS Tagging via Lightweight Hidden Markov Model

Algorithm 1: Normalization algorithm.
Input: Tweet Dataset
Output: Normalized Tweet
BEGIN
    Normalized Tweet ← {}
    FOREACH token IN Tweet DO
        IF token NOT EQUAL TO Noun THEN
            IF token IN OOV word THEN
                [Token is an OOV word]
                IF token IN BROWN Cluster THEN
                    Cand-token ← Fetch the candidate words from the BROWN Cluster
                                 and compute the Levenshtein and Metaphone edit distances
                    Normalized Tweet ← Append the Cand-token with the highest frequency score
                ELSE
                    [Token is not in the BROWN Cluster]
                    Sug-token ← Retrieve the suggestions for the token using the PyEnchant dictionary
                    FOREACH Suggestion IN Sug-token DO
                        Score = (Prob(Prev-token + Suggestion) + Prob(Suggestion + Next-token)) / 2
                                [Using Microsoft Web N-Gram]
                    Normalized Tweet ← Append the Suggestion with the highest score
            ELSE
                [Token is not an OOV word]
    RETURN Normalized Tweet
END
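
The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every lookup table below (VOCAB, TOKEN_CLUSTER, CLUSTER_FREQ, SUGGESTIONS, BIGRAM_PROB) is a hypothetical toy stand-in: the Brown clusters, the PyEnchant dictionary, and the Microsoft Web N-Gram probabilities are replaced by small in-memory dictionaries so that the example runs on its own. The Metaphone comparison is omitted and only the Levenshtein distance is computed. The sketch further assumes that nouns and in-vocabulary tokens are copied unchanged, that candidates are filtered by a maximum edit distance of 2, and that a token with no candidate or suggestion is kept as is; none of these details is stated in the listing.

```python
from typing import Dict, List, Tuple

# Hypothetical toy resources standing in for the real ones.
VOCAB = {"please", "wear", "a", "mask", "it", "is", "really", "important"}
TOKEN_CLUSTER: Dict[str, str] = {"realy": "0110"}           # token -> cluster id
CLUSTER_FREQ: Dict[str, Dict[str, int]] = {                 # cluster id -> word frequencies
    "0110": {"really": 90, "very": 70, "rly": 5},
}
SUGGESTIONS: Dict[str, List[str]] = {"wera": ["wear", "were", "era"]}  # dictionary suggestions
BIGRAM_PROB: Dict[Tuple[str, str], float] = {               # stand-in for an n-gram service
    ("please", "wear"): 0.02, ("wear", "a"): 0.05,
    ("please", "were"): 0.001, ("were", "a"): 0.01,
}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bigram_prob(w1: str, w2: str) -> float:
    """Look up a toy bigram probability; unseen pairs get 0.0."""
    return BIGRAM_PROB.get((w1, w2), 0.0)


def normalize(tokens: List[str], tags: List[str], max_dist: int = 2) -> List[str]:
    """Normalize one tweet; tags[i] == "NOUN" marks tokens that are skipped."""
    out: List[str] = []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        # Assumption: nouns and in-vocabulary tokens are copied unchanged.
        if tag == "NOUN" or tok in VOCAB:
            out.append(tok)
            continue
        if tok in TOKEN_CLUSTER:
            # Cluster branch: keep close candidates, pick the most frequent one.
            freqs = CLUSTER_FREQ[TOKEN_CLUSTER[tok]]
            cands = [w for w in freqs if levenshtein(tok, w) <= max_dist]
            out.append(max(cands, key=lambda w: freqs[w]) if cands else tok)
        else:
            # Suggestion branch: average the left and right bigram probabilities.
            prev_tok = tokens[i - 1] if i > 0 else "<s>"
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            sugg = SUGGESTIONS.get(tok, [])

            def score(s: str) -> float:
                return (bigram_prob(prev_tok, s) + bigram_prob(s, next_tok)) / 2

            out.append(max(sugg, key=score) if sugg else tok)
    return out


tweet = ["please", "wera", "a", "mask", "it", "is", "realy", "important"]
pos = ["O", "O", "O", "NOUN", "O", "O", "O", "O"]
print(normalize(tweet, pos))
```

In this toy run, "realy" is resolved through the cluster branch and "wera" through the suggestion branch, where "wear" obtains the highest averaged bigram score. With real resources, the suggestion list would come from the PyEnchant dictionary and the probabilities from an n-gram language model.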