Research Article

Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

Table 1

Input and output data types for all tasks.

| Input data \ Output data | File | StringBuffer | SynsetSequence | TokenSequence | FeatureVector |
| --- | --- | --- | --- | --- | --- |
| File | GDFF, SFE, TAFP | F2SB | | | |
| StringBuffer | | AFSB, CPFSB, CPTFSB, CFSB, FEjISB, FEtISB, FHISB, FUISB, FUNISB, GLFSB, IFSB, MLFSB, NFSB, SBTLC, SFSB, SHFSB, SWFSB, TCFSB | SB2SS | SB2TS | |
| SynsetSequence | | | | | SS2FV |
| TokenSequence | | | | TSPS, TSSI | TS2FV |
| FeatureVector | | | | | TCFFV, TDFFV |

Available tasks, grouped by input data type:

Input data File: GuessDateFromFilePipe (GDFF), StoreFileExtensionPipe (SFE), TargetAssigningFromPathPipe (TAFP), File2StringBufferPipe (F2SB).

Input data StringBuffer: AbbreviationFromStringBufferPipe (AFSB), ComputePolarityFromStringBufferPipe (CPFSB), ComputePolarityTBWSFromStringBufferPipe (CPTFSB), ContractionsFromStringBufferPipe (CFSB), FindEmojiInStringBufferPipe (FEjISB), FindEmoticonInStringBufferPipe (FEtISB), FindHashtagInStringBufferPipe (FHISB), FindUrlInStringBufferPipe (FUISB), FindUserNameInStringBufferPipe (FUNISB), GuessLanguageFromStringBufferPipe (GLFSB), InterjectionFromStringBufferPipe (IFSB), MeasureLengthFromStringBufferPipe (MLFSB), NERFromStringBufferPipe (NFSB), StringBufferToLowerCasePipe (SBTLC), SlangFromStringBufferPipe (SFSB), StripHTMLFromStringBufferPipe (SHFSB), StopWordFromStringBufferPipe (SWFSB), TeeCSVFromStringBufferPipe (TCFSB), StringBuffer2SynsetSequencePipe (SB2SS), StringBuffer2TokenSequencePipe (SB2TS).

Input data SynsetSequence: SynsetSequence2FeatureVectorPipe (SS2FV).

Input data TokenSequence: TokenSequencePorterStemmerPipe (TSPS), TokenSequenceStemIrregularPipe (TSSI), TokenSequence2FeatureVectorPipe (TS2FV).

Input data FeatureVector: TeeCSVFromFeatureVectorPipe (TCFFV), TeeDatasetFromFeatureVectorPipe (TDFFV).
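
The input and output types in Table 1 imply a simple chaining rule: a task can follow another only if it accepts the data type the previous task produces. The Python sketch below illustrates that rule using the abbreviations from Table 1. It is an illustration only and not part of NLPA; the names `TASK_TYPES` and `check_pipeline` are hypothetical, and the assumption that a pipeline starts from a File is ours.

```python
from typing import Dict, List, Tuple

# Tasks that take a StringBuffer and return a StringBuffer (Table 1).
SB_TO_SB = [
    "AFSB", "CPFSB", "CPTFSB", "CFSB", "FEjISB", "FEtISB", "FHISB",
    "FUISB", "FUNISB", "GLFSB", "IFSB", "MLFSB", "NFSB", "SBTLC",
    "SFSB", "SHFSB", "SWFSB", "TCFSB",
]

# Hypothetical lookup: task abbreviation -> (input type, output type),
# transcribed from Table 1.
TASK_TYPES: Dict[str, Tuple[str, str]] = {
    "GDFF": ("File", "File"),
    "SFE": ("File", "File"),
    "TAFP": ("File", "File"),
    "F2SB": ("File", "StringBuffer"),
    "SB2SS": ("StringBuffer", "SynsetSequence"),
    "SB2TS": ("StringBuffer", "TokenSequence"),
    "SS2FV": ("SynsetSequence", "FeatureVector"),
    "TSPS": ("TokenSequence", "TokenSequence"),
    "TSSI": ("TokenSequence", "TokenSequence"),
    "TS2FV": ("TokenSequence", "FeatureVector"),
    "TCFFV": ("FeatureVector", "FeatureVector"),
    "TDFFV": ("FeatureVector", "FeatureVector"),
}
for abbreviation in SB_TO_SB:
    TASK_TYPES[abbreviation] = ("StringBuffer", "StringBuffer")


def check_pipeline(tasks: List[str], start_type: str = "File") -> str:
    """Return the final data type of a task chain.

    Raises ValueError at the first task whose input type does not match
    the type produced by the preceding task (or start_type for the first).
    """
    current = start_type
    for task in tasks:
        if task not in TASK_TYPES:
            raise KeyError(f"unknown task abbreviation: {task}")
        expected, produced = TASK_TYPES[task]
        if expected != current:
            raise ValueError(
                f"{task} expects {expected} but would receive {current}"
            )
        current = produced
    return current


if __name__ == "__main__":
    # File -> StringBuffer -> TokenSequence -> FeatureVector
    chain = ["TAFP", "F2SB", "SBTLC", "SB2TS", "TSPS", "TS2FV", "TDFFV"]
    print(check_pipeline(chain))
```

A chain such as F2SB followed directly by TS2FV is rejected, because TS2FV takes a TokenSequence while F2SB produces a StringBuffer; inserting SB2TS between them makes the chain valid.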