Abstract

The major focus of this research work is to refine the basic preprocessing steps for unstructured text content and to retrieve the potential conceptual features for further enhancement processes such as semantic enrichment and named entity recognition. Although preprocessing techniques such as text tokenization, normalization, and Part-of-Speech (POS) tagging work exceedingly well on formal text, they do not perform well when applied to informal text such as tweets and short messages. Hence, we present enhanced text normalization techniques that reduce the complexity persisting over Twitter streams and eliminate overfitting issues such as text anomalies and irregular boundaries while fixing the grammar of the text. A hidden Markov model (HMM) is used to extract the core lexical features from the Twitter dataset, and external documents are suitably adapted to supplement the extraction techniques and complement the tweet context. In this Markov process, the POS tags are treated as the hidden states and the words as the observed outputs of the model. As this stage is crucial for the subsequent entity extraction and classification, effective handling of informal text is essential, and we therefore propose a hybrid approach to deal with these issues appropriately.

1. Introduction

In recent years, social media sites have grown rapidly and gained huge popularity among users. In particular, Twitter has gained strong momentum and provides an open platform for information exchange across a variety of events and situations, such as political crises, natural calamities, disasters, and celebrations. Recently, tweets related to COVID-19 have been pervasive and have prompted government agencies to take immediate action. Information pertaining to the coronavirus has also been used by travelers and business people to adjust their plans. The information posted on Twitter needs to be organized and classified according to its credibility score, which further paves the way for segregating it into primary and secondary information. Normally, the secondary information in a tweet is a retweet [1]. Nowadays, users prefer social media platforms such as Twitter for getting the latest news, and there is a high chance of drawing wrong conclusions from false news. Hence, there is a demand for a credible system that is capable of identifying correct news and classifying it into the right emotions, thereby providing the right information for decision-making processes.

Therefore, automatic detection of entities such as people, organizations, locations, and other named entities from unstructured content is challenging, and existing systems have shown poor performance due to its unorthodox nature [2]. Many named entity recognition (NER) research works have recently been carried out on Twitter streams, such as [3, 4]. These works have largely aimed at augmenting the capability to extract potential named entities from tweets and at improving state-of-the-art methods for detecting Out-Of-Vocabulary (OOV) words. However, due to the lack of context and the noisy structure of tweets, detecting potential named entities in tweets remains a great challenge and makes it difficult to annotate tweets with the necessary POS tags. Figure 1 illustrates the open challenges in handling COVID-19 Twitter streams. Table 1 provides detailed information about the publicly available named-entity-annotated tweets.

Besides, tweet tokenization has been a great challenge for many NER systems, and existing methods such as the Penn Treebank (PTB) tokenizer, TweetMotif, the Twokenizer tool [5], and the TwitIE tokenizer [6] have failed to address these issues effectively. Therefore, we provide mechanisms to improve the fundamental preprocessing steps of tokenization, normalization, and POS tagging of tweets. These normalization processes reduce the complexity persisting over the given datasets and categorically address the overfitting issues on informal text. As this process is crucial for the next stage of entity extraction and classification, effective handling of informal text is essential, and we therefore propose a hybrid approach to deal with these issues appropriately. In addition, informal text poses certain common open challenges, which are listed in Figure 1. This research work handles these open challenges effectively and is able to outperform existing results with good precision.

The major contributions of this paper are given below:
(i) A detailed discussion on text normalization techniques, delineating the difficulties of converting Out-Of-Vocabulary (OOV) words into In-Vocabulary (IV) words
(ii) An enhanced tweet tokenization technique proposed as an alternative to the Stanford tokenizer and the Penn Treebank tokenizer
(iii) A lightweight hidden Markov model (HMM) utilized to filter the correct lexical features generated through POS tagging in order to extract the appropriate named entities from unstructured text

1.1. Paper Structure

The rest of the paper is organized as follows: Section 2 discusses the related work rendered by researchers in the fields of tweet normalization and preprocessing, and briefly reviews the prominent tools and methods used for tweet tokenization and normalization. Section 3 gives a comprehensive view of our proposed methods and techniques for tweet normalization, including the Out-Of-Vocabulary conversion methods and the evaluation metrics used to assess OOV detection. Section 4 highlights the procedures for tweet tokenization and segmentation and delineates how tweets are tokenized with standard nomenclature. In Section 5, the tweet preprocessing and normalization approach is presented with a novel algorithm, and the existing dictionaries for detecting potential OOV words are analyzed. Section 6 discusses POS tagging for tweets and introduces the lightweight hidden Markov model for extracting the lexical features generated from standard POS tagging. Section 7 gives the empirical evidence for tweet normalization and effective preprocessing, along with an error analysis of our proposed methods.

2. Related Work

As most people use social media sites to post messages daily, the amount of information stored on these websites increases exponentially, and the messages are informal in nature due to their limited space constraints. Text normalization plays a seminal role in the process of detecting noisy text (i.e., tweets in this research) and converting it into standard words. Therefore, it has gained huge research attention in recent years and increasingly attracted many researchers to this domain. Besides, many academic conferences and workshops [7–9] have been conducted to gather data related to informal texts. The Association for Computational Linguistics (ACL) [10] and the North American Chapter of the Association for Computational Linguistics (NAACL) [11] have been encouraging researchers and students to actively participate in their conferences and workshops to gain knowledge on both formal and informal text. Recently, the Text Retrieval Conference (TREC) has created a new web page related to informal language used on social media sites and has also conducted relevant workshops [12]. In this research work, we analyze the works demonstrated by various researchers and discuss their findings and shortcomings in detail.

Earlier, researchers [13, 14] focused only on normalizing the spelling errors produced from web sources. They used an n-gram model to assess the probability of each word within its context and to estimate the relative frequency of the word in the given sequence. Their n-gram model mapped words using a many-to-one (n-to-1) cardinality, and real-word substitutions, such as word usage in context and the grammatical structure of the sentence, were detected and converted successfully. However, their work faced serious lapses in dealing with unstructured content and failed to retain the accuracy rates attributed to it by many prominent researchers [15–17].

Meanwhile, [18, 19] demonstrated research on microtexts such as Twitter and SMS for detecting phonetic misspellings, standard acronyms, and contractions. Generally, misspelled words were detected by Natural Language Processing (NLP) systems using multi-channel models that effectively find the lexical variance based on factors such as the contextual surrounding of the word, phonetic similarity, orthographic factors, and the expansion of acronyms using a standard dictionary. As suggested by previous researchers [20–23], they utilized the Aspell spell corrector to detect misspellings on both the Twitter and SMS datasets.

Later, [24] developed a spelling corrector that uses a Google-style spell correction approach; it finds the proximity of a word and recognizes the correct spelling for the given word. The algorithm works on the conditional probability of the word based on an edit distance measure and chooses the word with the smallest edit distance (i.e., the fewest deletions, replacements, insertions, or transpositions needed to convert the word into its correct form). They set the threshold for the edit distance at less than or equal to 2. However, due to the textual sparseness of informal text, most misspelled words in informal text require more edits and demand more comparisons concerning the context of the words. Again, the Norvig system [25] works exceptionally well on standard orthodox text but fails to achieve the same precision on informal text.
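
To make this approach concrete, the sketch below reproduces the outline of a Norvig-style corrector (candidate words within at most two edits, ranked by corpus frequency). It is only an illustration of the technique described in [24, 25], not the evaluated implementation, and the corpus file name is a placeholder.

import re
from collections import Counter

# Hypothetical corpus file; any large plain-text English corpus works for this sketch.
WORD_COUNTS = Counter(re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return {w for w in candidates if w in WORD_COUNTS}

def correct(word):
    """Prefer known words within one edit, then two edits, ranked by corpus frequency."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORD_COUNTS[w])

For example, correct("pandemc") would typically return "pandemic" if that word appears in the corpus, whereas heavily distorted tweet tokens such as "c u agn" fall outside the two-edit threshold, which is exactly the limitation noted above.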

Finding Out-Of-Vocabulary words is very challenging on social media sites, and OOV words are particularly prevalent in Twitter streams. An OOV word is defined as an unorthodox word that is not present in the standard dictionary used for reference. To tackle this issue, many research works have been carried out [26–30] and have attained considerable accuracy rates with respect to the BLEU score. The researchers in [31] used a classifier to detect OOV words as ill-formed words based on similarity measures such as phoneme and grapheme scores and converted the ill-formed words into standard English words. In their approach, they used a dictionary, contextual support for the word, and similarity measures to predict the correct form of the OOV words and finally attained an F-score of 68.30%. Even though the result is considerably good in some respects, the approach did not perform well on noisy tweets and yielded poor results when there was no contextual support for the ill-formed words.

Later, the authors of [32–34] used a hybrid approach to deal with OOV words present on social media sites and prepared heuristics such as a string similarity measure, an edit-distance function, and a subsequence overlap function to detect OOV words and convert them into their appropriate In-Vocabulary words. The correct candidate word was selected based on an n-gram model, and a confusion matrix was used to find the proximity score of the form most likely to be correct English. This approach reduced the burden of the previous work and yielded a good accuracy rate of up to 72.15%.

3. Proposed Method

In this research work, we have downloaded the Twitter datasets related to COVID-19 from the 6th Workshop on Noisy User-generated Text (W-NUT) [35] for our analysis. This workshop manually annotated almost 10,000 tweets related to COVID-19 and built a corpus called COVIDKB, a well-structured knowledge base that supports SPARQL queries. To extract structured knowledge from the tweets, our primary task in this preprocessing step is to remove the usernames, special symbols, retweets, hashtags, and emoticons from the tweets and take only the original tweet text for the next level of processing.

3.1. Problem Definition

“Given the tweet corpus T, eliminate the tweets which do not convey much information regarding the event E, and remove or replace noisy tokens in the tweets to produce normalized tweets.”

The basic regular expression followed to remove the special symbols, retweets, and other emoticons is given below:

import re

def process_text():
    """Remove emoticons, usernames, retweets, etc. and return the cleaned tweets as one string."""
    # pull_tweets() is assumed to return a list of raw tweet strings.
    data = pull_tweets()
    regex_remove = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^RT|http.+?"
    stripped_text = [re.sub(regex_remove, "", tweet).strip() for tweet in data]
    return ". ".join(stripped_text)

Once the tweets are cleaned using the above regular expression snippet, we need to perform tokenization of the tweets to fix the proper tagging of words and identify the proper nouns and pronouns for effective entity extraction and disambiguation. Each of the following methods helps to resolve the ambiguity that persists over the tweets and identifies the entities with proper references in external document sources. The three basic components of this research, i.e., tokenization, normalization, and POS tagging, are considered noncore components, but they are crucial in this research because the informal nature of tweets condenses the words and leaves room for ambiguity. Hence, we propose a novel method to deal with these issues and remove the ambiguity with the support of external documents.

Before we go into the next phase of the preprocessing pipeline shown in Figure 2, we have taken two types of dictionaries to correct the misspelled words in the tweets and fix the correct word forms. First, we benchmarked several standard online spell-correction dictionaries for our analysis: Norvig’s Spell Corrector (https://norvig.com/spell-correct.html), BK-Tree (https://issues.apache.org/jira/browse/LUCENE-2230) (Burkhard–Keller Tree), SymSpell (https://symspellpy.readthedocs.io/en/latest/) (Symmetric Delete spell correction algorithm), LinSpell (https://github.com/wolfgarbe/LinSpell) (Linear Search Spell Correction), and the PyEnchant dictionary (https://pyenchant.github.io/pyenchant/tutorial.html), as given in Table 2. From this analysis, we observed that the PyEnchant dictionary is suitable for informal text processing and compatible with all the programming environments. Besides, PyEnchant is faster than the other four algorithms, and its indexing method is found to be very effective for searching words. The PyEnchant dictionary is also updated frequently every year, enriching its gazetteer words. Hence, we have used the PyEnchant dictionary to correct the misspelled words and to identify any OOV words in the tweets.
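
A minimal sketch of this dictionary lookup is shown below; the sample tokens are illustrative, and the suggestion quality depends on the installed en_US word list.

import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
for token in ["quarantine", "quarntine", "covid"]:
    if d.check(token):
        print(token, "-> in-vocabulary (IV)")
    else:
        # suggest() returns ranked correction candidates for an OOV token
        print(token, "-> OOV, suggestions:", d.suggest(token)[:3])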

Second, we created our own slang dictionary for converting slang words into their correct English word forms and fixing the correct meaning for each slang word. In this context, we searched online slang dictionary sites such as the NoSlang Dictionary (https://www.noslang.com/), Urban Dictionary (https://www.urbandictionary.com/), Translit (https://www.translit.ie/), and a few more web sources. We then manually extracted slang terms such as contractions, abbreviations, slang words, unorthodox word forms, and canonical words from the above sources and gathered their equivalent English meanings. Next, we listed all the slang terms with their corresponding English meanings in separate files for abbreviations, slang words, contractions, and emoticons. Later, we formatted each of these files and removed any duplicate entries (Table 3). Eventually, we checked the exact meaning of each token in each file and ordered the entries alphabetically for easy searching. The first column in each file contains the slang word, abbreviation, or contraction, and the second column gives the corresponding meaning or expansion of that token.
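
The lookup over these files can be sketched as follows; the file name and the tab-separated two-column layout are assumptions based on the description above.

def load_slang(path="slang_words.txt"):
    """Load a two-column (slang<TAB>meaning) file into a dictionary."""
    mapping = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t", 1)
            if len(parts) == 2:
                mapping[parts[0].lower()] = parts[1]
    return mapping

def expand_slang(tokens, slang):
    # Replace any token found in the slang dictionary with its English expansion.
    return [slang.get(tok.lower(), tok) for tok in tokens]

slang = load_slang()
print(expand_slang(["thx", "u", "r", "awesome"], slang))
# e.g. ['thanks', 'you', 'are', 'awesome'] if those entries exist in the file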

3.2. Tweet Normalization

The first step in the tweet normalization approach is tokenization. This is the basic preprocessing step followed in all natural language processing and information extraction projects. The proposed approach for tweet normalization is given in Figure 2.

Mainly, the process of tweet normalization involves three critical analyses: (i) detect the candidate tokens by mutual comparison against standard vocabulary sets; (ii) identify the symbols, emoticons, and OOV words in the tweets with respect to word contortions such as spelling mistakes and displacement of the grammatical structure of words; and (iii) discard the OOV words from the tweets using the standard corpus. All the above steps are completely language independent and work exceedingly well with language-specific resources (see Figure 3). For the OOV words, the process relies largely on standard abbreviations and acronyms to filter out the In-Vocabulary (IV) words and produce the candidate list of IV words for POS tagging.

The main challenge in processing informal text such as tweets is the difficulty of dividing the tweets into multiple tokens and categorically identifying the potential named entities among those tokens. The major task of classifying the tokens into IV words and OOV words has serious implications for the tweet normalization process. The standard dictionary (in this case, the PyEnchant dictionary used for word comparison and dictionary lookup) is sufficient to identify the IV words in the tweets and categorize them into one of the predefined category sets based on the POS tags assigned to them, but the remaining nonstandard tokens (i.e., OOV words) need to be compared against an appropriate candidate list to find the correct word match.

3.2.1. Statistical Rules

So far, we have discussed handling OOV words and the problem of choosing appropriate candidate words for a given nonstandard token in the tweets/sentences. We identified some implicit traits of the nonstandard tokens (i.e., OOV words) after running through the large tweet corpus downloaded from COVIDKB [35] and defined the following basic procedure to tackle any OOV words present in user-generated content.

Here are the following examples:

Type 1: the social users may have omitted spaces, either knowingly or unknowingly, merging two or more standard words into a single OOV token. Example: stywith (stay with) and cometogether (come together)

Type 2: the OOV words are framed upon the sound of the words rather than the lexical structure of the words. Example: c u agn (see you again)

Type 3: the OOV words are constructed based on the first letters of the standard words or the phonetic positioning of the words. Example: u r (you are) and thx (thanks)

For these types of errors in the OOV words, we constructed the slang dictionary (Table 3) mentioned in Section 3 and detect the possible candidate set of words to disambiguate the OOV words, as sketched below. We also find the exact matching word for the given token based on the context given by the language model. We have reduced the 1-to-n mapping of the candidate list for the OOV words to a 1-to-1 mapping and thereby increased the efficiency of the tweet normalization.
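
The sketch below illustrates how the three OOV types can be handled: Type 2 and Type 3 tokens are resolved through the slang dictionary, while Type 1 run-together tokens are split by a simple dictionary-driven segmentation. The tiny slang mapping and the helper functions are illustrative only; the full system uses the slang files of Table 3 and the language-model context described above.

import enchant

d = enchant.Dict("en_US")
SLANG = {"c": "see", "u": "you", "agn": "again", "thx": "thanks", "r": "are"}

def segment(token):
    """Split a run-together token into dictionary words (fewest pieces wins)."""
    if d.check(token):
        return [token]
    best = None
    for i in range(2, len(token)):
        left, right = token[:i], token[i:]
        if d.check(left):
            rest = segment(right)
            if all(len(p) > 1 and d.check(p) for p in rest):
                cand = [left] + rest
                if best is None or len(cand) < len(best):
                    best = cand
    return best or [token]

def normalize_token(token):
    if token.lower() in SLANG:               # Types 2 and 3
        return [SLANG[token.lower()]]
    if not d.check(token):                   # Type 1 candidate
        return segment(token.lower())
    return [token]

print(normalize_token("cometogether"))       # ['come', 'together'] with a typical word list
print(normalize_token("thx"))                # ['thanks']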

3.2.2. Multiple Character Reduction

OOV words occur in many places in the tweets and disturb the process of transforming them into standard English words. A major problem in handling OOV words is that they contain many nonword tokens and repetitions of characters used to express emotion to the reader, as also explained in [36]. Characters that occur more than once in a row are reduced to a single character, and the PyEnchant dictionary is then consulted to prevent further mistakes caused by this reduction. For example, words such as speed, speech, and breed contain the same character appearing more than once and still carry the correct English meaning; if we blindly reduced those occurrences to a single character, we would introduce spelling mistakes and cause bad normalization. To get the correct form of the word, we propose a method that takes the utmost care in tokenizing the OOV words and splits them based on designed patterns. Besides, multiple punctuation symbols pose similar difficulties, and we reduce multiple punctuation marks to a single punctuation mark using a defined regular expression.
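
A minimal sketch of this reduction-plus-dictionary check is given below; the regular expressions and the two-stage fallback are our illustration of the idea rather than the exact implementation.

import re
import enchant

d = enchant.Dict("en_US")

def reduce_lengthening(word):
    # First collapse any run of three or more identical characters to exactly two,
    # so that legitimate double letters (speed, speech, breed) are preserved.
    collapsed = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if d.check(collapsed):
        return collapsed                      # e.g. "speeed" -> "speed"
    # If still OOV, try collapsing every repeated run to a single character.
    single = re.sub(r"(.)\1+", r"\1", collapsed)
    return single if d.check(single) else collapsed

for w in ["goooood", "speeed", "coooool", "speech"]:
    print(w, "->", reduce_lengthening(w))     # good, speed, cool, speech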

3.3. Evaluation Metrics

To assess the quality of the results for the OOV words, the following metrics have been used to evaluate the performance of the system.

3.3.1. Miss Rate (MR)

It measures the number of OOV words that were missed relative to the reference OOV words.

3.3.2. False Alarm Rate (FAR)

It measures the number of IV words that had been falsely reported as OOV words.

3.3.3. Word Error Rate (WER)

It measures the number of errors that occurred during the substitution, deletion, or insertion of characters by the proposed system.

3.3.4. Precision

It measures the proportion of OOV words detected by the proposed system that are correct.

3.3.5. Recall

It measures the number of OOV words detected with respect to the OOV references.

3.3.6. F1

It combines the precision and recall of the OOV word detection as their harmonic mean.
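
For clarity, standard formulations of these metrics are sketched below; the exact counting conventions used by the system may differ slightly. Here $S$, $D$, and $I$ denote substitutions, deletions, and insertions, and $N$ is the number of reference words.

\begin{align*}
\mathrm{MR}  &= \frac{\#\ \text{reference OOV words missed}}{\#\ \text{reference OOV words}}, &
\mathrm{FAR} &= \frac{\#\ \text{IV words flagged as OOV}}{\#\ \text{IV words}}, \\
\mathrm{WER} &= \frac{S + D + I}{N}, &
P &= \frac{\#\ \text{correctly detected OOV words}}{\#\ \text{detected OOV words}}, \\
R &= \frac{\#\ \text{correctly detected OOV words}}{\#\ \text{reference OOV words}}, &
F_1 &= \frac{2PR}{P + R}.
\end{align*}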

3.4. Experimental Analysis

We manually analyzed the tweet normalization for COVIDKB, since there is no gold-standard dataset available to assess this language model, and assessed the performance on the COVIDKB corpus of 10,000 tweets. For each tweet, we considered and validated all the modifications made by the proposed system during normalization. The four major tweet normalization operations, namely insertion, deletion, substitution, and tokenization, were monitored manually, and the correctness of the results was measured through the F1-score produced by the proposed system. Table 4 shows the number of OOV words detected, the pronunciation accuracy, and the identical pronunciation score of the system.

Based on Table 4, the precision, recall, and F1 score for the OOV words have been measured, and it has been given in Table 5. Figure 4 illustrates the ROC curve and statistical analysis of investigating the OOV words in tweets.

This ROC curve depicts the accuracy rate and sensitivity of the OOV words present in the tweets and identifies the percentage of missed Part-of-Speech tags for the tokenized tweets.

In addition to finding the OOV words present in the tweets, there are other factors to be considered for effective normalization, such as stemming, lemmatization, stop word removal, and emoticon detection [37–39]. Extra supervision is required to handle these preprocessing methods, and they further help to provide contextual support for sentiment analysis, word clustering, information extraction, entity detection, and more (see Table 6). As we are concerned with finding the potential named entities in the tweets, these features help to resolve the ambiguity that persists over the text and largely support the contextual score for the proposed system.

The performance metrics and the combined scores of the preprocessing methods are given in Table 7 for further comprehension and accuracy enhancement. Figure 5 portrays the ROC curve for text normalization and a statistical summary of the preprocessing steps. It is evident that the blend of emoticon handling, lemmatization, and stopword removal performs marginally better and outperforms the other integrated approaches.

4. Tweet Tokenization and Segmentation

As tokenization is the first step in the pipeline, the major aim of tweet tokenization is to split the tweet into meaningful chunks (i.e., semantic tokens) that can be words, word phrases, or cardinals. Due to the informal nature of tweets, tweet tokenization is difficult and comparatively more challenging than formal text processing operations [40]. Hence, it requires sophisticated techniques to solve these issues and perform tokenization effectively. In this connection, we analyzed some of the techniques followed by earlier research for tokenizing normal text content. Since formal text is supported by well-structured context and language grammar [41], these techniques performed well on all grounds, and the major approach followed by researchers was Stanford tokenization. The Stanford tokenizer utilizes the JFlex lexical analyzer [42] to tokenize sentences and produces results for the given formal text. In some cases, researchers have used the Penn Treebank tokenizer [43], which tokenizes content using a specific regular expression written as a SED script. These tokenizers are commonly used for most formal text processing and yield good accuracy rates in all instances. But when the tweet is informal in nature and mostly unorthodox, these tokenizers are a bad choice and produce inappropriate results.

4.1. Proposed Approach for Tokenization

Like the procedure followed for formal text tokenization, some researchers have followed the same approach for informal text tokenization [44]. In formal text tokenization methods, a token is split into chunks whenever spaces or specific delimiters are encountered in the sentence. This method results in poor POS tagging performance and complicates entity detection for informal text such as tweets. Therefore, we consider key phrases of up to four tokens (since almost all named entities fall within that range) and split the tokens based on the following patterns (a sketch follows this list):
(1) (Noun)+: for the given tweet, the tokenizer finds more than one continuous noun that can be clubbed into one key phrase and considered a single token. Example: Samsung Galaxy Phone
(2) (Adjective)+(Noun)+: if the noun is preceded by one or more adjectives, the whole phrase is again considered a single token, and the division is made accordingly by the tokenizer. Example: Fantastic Donald Trump and Digital Camera
(3) (Noun)+[CD]: one or more nouns followed by a cardinal. Example: James Bond 007 and iPhone 8i
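
The sketch below expresses these three patterns as an NLTK chunk grammar; it is only our illustration of the patterns, not the rules actually infused into ARKTweetNLP (that integration is described next).

import nltk

grammar = r"""
  KEY: {<NN.*>+<CD>}      # (Noun)+ [CD], e.g. "James Bond 007"
       {<JJ>+<NN.*>+}     # (Adjective)+(Noun)+, e.g. "Digital Camera"
       {<NN.*>+}          # (Noun)+, e.g. "Samsung Galaxy Phone"
"""
chunker = nltk.RegexpParser(grammar)

tagged = [("Samsung", "NNP"), ("Galaxy", "NNP"), ("Phone", "NNP"),
          ("is", "VBZ"), ("a", "DT"), ("Digital", "JJ"), ("Camera", "NN")]
print(chunker.parse(tagged))   # noun runs and adjective+noun runs become KEY chunks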

To support the above patterns and filter the tokens from the tweets, we have adopted the tokenizer ARKTweetNLP [45], an open-source module available for download, and infused our patterns into this package to effectively filter out the key phrases for the next phase of POS tagging. The main reason for choosing this module over other tokenizers is that it was designed by [46] with Twitter-specific regular expressions in mind, covers a wide range of emoticons, and achieves good performance on tweet tokenization. Table 8 gives some examples showing that the proposed tokenizer successfully splits the tweets into meaningful chunks.

5. Tweet Normalization and Processing

The next step in our pipeline is the normalization process, which identifies whether a token is Out-Of-Vocabulary and, if so, converts it into its standard English word or word phrase. Normalization is very important for text processing because it helps to resolve the ambiguity that persists in any token of the text [47, 48]. As explained earlier, many researchers have used statistical machine translation, phrase-based statistical models, character-level edit distance, dependency parsers, and even purpose-built parallel corpora to train models that generate candidate corrections for Out-Of-Vocabulary (OOV) words. Some have also tried language models and phonemic edit distance measures to deal with the problem differently, but normalization is challenging in itself and poses great difficulty for informal text. For instance, abbreviations and slang words are very difficult to map with existing spell correctors.

5.1. Proposed Approach for Normalization

We have used a supervised normalization technique in the proposed system, together with online resources such as the Brown Corpus clusters, the PyEnchant dictionary, and the Microsoft Web N-gram model. We use the Brown clustering for text normalization because similar words occur in similar contexts; that is, similar words have the same distribution of surrounding words on either side, i.e., left and right. With that assumption, we clustered the corpus and trained our supervised classifier to normalize any OOV words it encounters. As the Brown clustering has already covered 47 million tweets and produced more than 1,200 word clusters (each cluster grouping tweets with similar context), we effectively utilize these clusters in the normalization process and transform the OOV words successfully, as depicted in Figure 6.

Besides, we use the PyEnchant dictionary to map the OOV words against its word list and apply two string similarity measures, Levenshtein and metaphone edit distance. We use both measures for candidate word selection because some words have the same phonetic sound, for example, know and no, or pork and fork. To disambiguate this, we take both measures and choose the best candidate using the Microsoft Web N-gram model. Out of all the suggestions listed from both the Brown clusters and the PyEnchant dictionary, the Microsoft Web N-gram model chooses the word with the highest score based on the conditional probabilities of the candidate words in context (i.e., the word combinations before and after the word, as suggested by the Brown clusters and the string similarity measures). The word with the highest score is returned as output by the system. The algorithm for the whole normalization process, showing how the system returns the normalized output, is given below.

Input: Tweet Dataset
Output: Normalized Tweet
BEGIN
    Normalized Tweet ← {}
    FOREACH token IN Tweet DO
        IF token NOT EQUAL TO Noun THEN
            IF token IN OOV word THEN
                Token ← OOV word
            ELSE
                Token ← not OOV word
            IF token IN BROWN Cluster THEN
                Cand-Token ← Fetch the candidate words from the BROWN Cluster
                              and perform the Levenshtein and Metaphone edit distance
                Normalized Tweet ← Append the Cand-Token with the highest frequency score
            ELSE
                Token ← Not in BROWN Cluster
                Sug-token ← Retrieve the suggestions for the token using the PyEnchant dictionary
                FOREACH Suggestion IN Sug-token DO
                    Score = (Prob(Prev-token + Suggestion) + Prob(Suggestion + Next-token))/2
                            [Using Microsoft Web N-Gram]
                Normalized Tweet ← Append the Suggestion with the highest score
    RETURN Normalized Tweet
END
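
The candidate-selection step of this algorithm can be sketched in Python as follows. The Microsoft Web N-gram service is no longer publicly available, so a tiny in-memory bigram table stands in for it here, and the jellyfish library supplies the Levenshtein and metaphone comparisons; the sketch therefore mirrors the algorithm only in outline.

import enchant
import jellyfish  # pip install jellyfish

d = enchant.Dict("en_US")
# Illustrative bigram frequencies standing in for the Microsoft Web N-gram model.
BIGRAM_FREQ = {("stay", "home"): 120, ("home", "please"): 40}

def context_score(prev_tok, cand, next_tok):
    # Average of the two surrounding bigram frequencies, as in the Score step above.
    return (BIGRAM_FREQ.get((prev_tok, cand), 0) +
            BIGRAM_FREQ.get((cand, next_tok), 0)) / 2

def normalize(prev_tok, token, next_tok):
    if d.check(token):
        return token
    # Keep dictionary suggestions that are close in spelling or in sound.
    close = [s for s in d.suggest(token)
             if jellyfish.levenshtein_distance(token, s.lower()) <= 2
             or jellyfish.metaphone(token) == jellyfish.metaphone(s)]
    if not close:
        return token
    return max(close, key=lambda s: context_score(prev_tok, s.lower(), next_tok))

print(normalize("stay", "hme", "please"))   # likely "home" with these frequencies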

6. Tweet Part-of-Speech Tagging

After normalizing the tweets using the hybrid approach, we need to perform Part-of-Speech (POS) tagging, a process that is crucial for entity extraction and classification. Entity extraction is performed on the tweets based on the POS tags, extracting the entities that are attributed as nouns, proper nouns, pronouns, or noun objects [49]. POS tagging of tweets therefore determines the grammatical structure and category of each token segmented from the normalized tweets. Many tokens in tweets, such as hashtags, URLs, emoticons, and @mentions, carry no syntactic features, and a dependency parser spends unnecessary processing time on them. Instead of relying on standard annotators, a simple rule-based filter can extract and annotate the #hashtags, @mentions, punctuation, and retweet tokens effectively, as sketched below. Next, for multiword expressions, we consider two approaches. First, proper nouns are compounded together for information extraction and assigned a single tag for the compound word. Second, lexical idioms, such as “stuck in the crowd” and “hay in the stack,” are handled with shallow parsing and clubbed into a single token for tagging. The same approach is followed for idiomatic relationships, and the internal analysis of multiword tokens is performed with dependency parsers.
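
A rule-based filter of this kind can be sketched as follows; the tag names and patterns are illustrative, not the exact rules used by the system.

import re

RULES = [
    (re.compile(r"^#\w+$"), "HASHTAG"),
    (re.compile(r"^@\w+$"), "USERMENTION"),
    (re.compile(r"^(https?://\S+|www\.\S+)$"), "URL"),
    (re.compile(r"^RT$"), "RETWEET"),
    (re.compile(r"^[^\w\s]+$"), "PUNCT"),
]

def pre_tag(tokens):
    """Tag Twitter-specific tokens directly; leave the rest for the POS tagger."""
    tagged = []
    for tok in tokens:
        for pattern, tag in RULES:
            if pattern.match(tok):
                tagged.append((tok, tag))
                break
        else:
            tagged.append((tok, None))   # handled later by the HMM tagger
    return tagged

print(pre_tag(["RT", "@WHO", "stay", "home", "#COVID19", "!"]))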

6.1. Proposed Approach on Part-of-Speech (POS) Tagging

To effectively assign tags to every token in the given tweets, we implemented a supervised learning approach to train the model and tagged the tokens using the linguistic features provided by natural language processing toolkits. Many features are considered when assigning a tag to a segmented token, such as capitalization, surrounding words, the tags of the surrounding words, and the presence of any cardinal in the word. After scrutinizing these model features, the appropriate tag is assigned to the token. Since tweets lack grammatical structure and offer little context around the words, assigning appropriate tags to the segmented tokens becomes critical, and the correct tag is sometimes missed. Also, as tweets contain many slang words, OOV words, spelling errors, and abbreviations, assigning the appropriate POS tags becomes challenging and needs extra supervision when picking features for the proposed model. In particular, capitalization is not considered a good feature in informal text processing because many social media users do not follow proper capitalization. In addition, users tend to use more adjectives to extend their greetings and emphasize their thoughts, which makes POS tagging cumbersome (see Table 9). Hence, we use the lightweight hidden Markov model to predict the context and assign the correct tags to the tokens.

6.2. Lightweight Hidden Markov Model–POS Tagging

The lightweight hidden Markov model uses the supervised word clusters trained from the Brown clustering and extracts the other lexical features generated from Stanford POS tagging. These word clusters are used to train on the unlabeled tweets and to filter the word clusters generated again from the newly labeled datasets. The tagging features include the conventional ones, such as the word, surrounding words, surrounding tags, and the use of cardinals, as well as Twitter-specific features such as hashtags, usernames, and emoticons. This distributional word similarity feature proves useful for Twitter streams and outperforms the Stanford tagger on the given datasets. The word-tag probability is modeled as a stochastic process in which the tagger is treated as a Markov process with unobservable states that yields observable outputs. In simpler terms, the POS tags are the hidden states of the Markov process, and the words are the observed outputs of the model. The joint probability of a tag sequence and a word sequence under the lightweight hidden Markov model is given below:

\[ P(t_1,\ldots,t_n, w_1,\ldots,w_n) = \pi(t_1)\, b(w_1 \mid t_1) \prod_{i=2}^{n} a(t_i \mid t_{i-1})\, b(w_i \mid t_i) \]

The POS tagger comprises the following components:
(i) $\pi(t)$: the probability of the sequence beginning in tag $t$
(ii) $a(t_i \mid t_{i-1})$: the probability of the sequence changing from tag $t_{i-1}$ to tag $t_i$
(iii) $b(w \mid t)$: the probability of the sequence generating word $w$ in tag $t$

The tagger makes two straightforward assumptions:
(i) The likelihood of a word depends only on its tag; i.e., given its tag, it is independent of the other words and the other tags
(ii) The likelihood of a tag depends only on its previous tag; i.e., given the previous tag, it is independent of the subsequent tags and of the tags before it
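
A minimal sketch of such a supervised HMM tagger, with tags as hidden states and words as emissions, is given below using NLTK's HiddenMarkovModelTrainer as a stand-in for the lightweight model; the two training sentences are illustrative only, and a real run would train on the annotated tweet corpus.

from nltk.tag import hmm

train_data = [
    [("stay", "VB"), ("home", "NN"), ("#COVID19", "HT")],
    [("I", "PRP"), ("am", "VBP"), ("at", "IN"), ("Bicycle", "NNP"), ("Ranch", "NNP")],
]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
# Tags the sentence using the learned transition and emission probabilities.
print(tagger.tag(["I", "am", "at", "Bicycle", "Ranch"]))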

In noun phrases, a noun acts as a subject or an object of a verb or an adjective. To create a noun phrase chunker, we define a chunk grammar that indicates how the tweet is to be chunked.

For example, consider the tweet “I am at Bicycle Ranch in Scottsdale, AZ”. Its POS tagging is (I/PRP, am/VBP, at/IN, Bicycle/NNP, Ranch/NNP, in/IN, Scottsdale/NNP, AZ/NNP).

7. Experimental Evaluation and Analysis

For our experimental analysis, we downloaded the COVIDKB Twitter datasets provided by [35]. The COVIDKB Twitter datasets consist of 10,000 annotated tweets with ground-truth values. For our experiments, however, we took 1,000 tweets as the test set, given the capacity of our system and to save computational time (see Table 10). We chose the COVIDKB Twitter datasets because they were manually annotated by research students and contain a wide range of entities such as person, location, travel, and contacts. Therefore, we benchmarked this dataset for our experimental analysis and observed the performance of the proposed system on it.

7.1. Result Analysis

We compared our proposed model with existing preprocessing techniques and found that it outperformed all the conventional models, yielding better performance and accuracy rates. For the analysis of the results, we considered the basic normalizing features such as repeated characters, abbreviations, and misspelled words. As these three components are crucial in tweet normalization, the model was compared with a regular-expression-and-replace approach using WordNet, the NLTK library, and a replacement CSV file. Our proposed model converted the informal words into correct English words with a much higher accuracy rate, giving an increase of almost 20-25% in the precision score. Since noisy tokens hinder the performance of POS tagging and further block entity extraction, the proposed approach paves a better way for POS tagging at this level and helps to solve entity detection and recognition at the next level of processing, as witnessed in Figure 7. A detailed comparison is given in Table 11. Moreover, Table 12 presents the accuracy of the proposed method on POS tagging of tweets.

The experiments were evaluated using the BLEU and WER scores. While analyzing the output against the other comparison models, it became evident that the proposed method effectively removes the repeated characters and misspelled words and produces the appropriate normalized text forms for the given tweets. The values calculated for BLEU and WER were below 0.005 at the 95% confidence level, at 0.0029 and 0.0028, respectively. Therefore, the proposed method invariably showed consistent precision; in no case did its BLEU score fall below 85% or its WER score rise above 8%. Figure 8 depicts the ROC curve for the evaluated POS tagging of tweets.

This ROC curve was obtained after the comparative analysis over the techniques mentioned in Table 12. It is very apparent that the proposed model yields a good accuracy rate, as measured with the BLEU and WER techniques.
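
The BLEU and WER computations themselves can be reproduced with standard libraries, as sketched below; nltk and jiwer are stand-ins for whatever tooling produced the scores above, and the reference/hypothesis pair is made up.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from jiwer import wer  # pip install jiwer

reference = "stay home and stay safe during the lockdown".split()
hypothesis = "stay home and be safe during the lockdown".split()

# Sentence-level BLEU with smoothing (short sentences have sparse n-gram counts).
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
# Word error rate: substitutions, deletions, and insertions over reference length.
error_rate = wer(" ".join(reference), " ".join(hypothesis))
print(f"BLEU: {bleu:.3f}  WER: {error_rate:.3f}")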

The overall analysis of the tweet normalization, preprocessing, and assignment of POS tags to the tokenized tweets is given in Table 13. The accuracy and sensitivity of the ROC analysis gradually improved after employing the proposed method. Eventually, the fitted ROC accuracy increased considerably, proving that the proposed method outperforms the other preprocessing pipelines applied to the unstructured datasets.

8. Conclusions

Natural language processing (NLP) allows digital gadgets and devices to comprehend the semantics of languages. NLP work can be broadly characterized into two categories: data preprocessing and model development [50–57]. Several text normalization strategies have been proposed by eminent researchers to solve the impending issues and reduce the error rate considerably [50–53]. However, they have certain limitations and still do not accomplish great outcomes when it comes to informal text processing. Rather than normalizing one kind of ill-formed word, we considered a wide range of ill-formed words found in the tweet datasets and cleaned them under three primary classifications: incorrectly spelled words, contractions, and repeated letters. The primary motivation for sorting these unorthodox words in this way is to guarantee that all subcategories of these three fundamental issues are standardized into their correct forms by the most appropriate procedures. Hence, the target of this exploration is to locate the best standardization approach in order to productively and precisely clean tweets containing incorrect spellings, shortened forms, and repeated characters. This research would add tremendous value if resourcefully utilized for natural disasters and the imminent threat of future disease outbreaks. The major limitation of this work is that it is restricted to Twitter streams in general and deals particularly with the preprocessing techniques of tweet tokenization and POS tagging. Further, fine-grained datasets for identifying OOV words will be needed if domain-specific research is carried out in the future.

Data Availability

The article’s original contributions generated for this study are included; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.