| Input initial seed keywords from the literature |
| Stage 1: BERT word vector similarity selection |
(1) | Initialize empty similar words vocabulary |
(2) | For each seed keyword do |
(3) | Collect the corresponding Baidu Baike text
(4) | Construct a potential keyword vocabulary based on JIEBA segmentation
(5) | Vectorize the seed keyword and the potential keywords in the vocabulary with BERT word vectors
(6) | For each keyword in the potential keyword vocabulary do
(7) | Calculate the cosine similarity score between the seed keyword vector and the potential keyword vector
(8) | If the similarity score exceeds the threshold then
(9) | Add the potential keyword to the similar words vocabulary
(10) | End for |
(11) | End for |
(12) | Output similar words vocabulary |
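A minimal Python sketch of Stage 1 follows. It assumes the Hugging Face `transformers` checkpoint `bert-base-chinese` as a stand-in for the paper's BERTvec embeddings; the 0.8 threshold, the mean-pooling of token embeddings, and the single-character filter are illustrative assumptions, not values from the paper.

```python
import torch
import jieba
from transformers import BertTokenizer, BertModel

# Assumption: "bert-base-chinese" stands in for the paper's BERT word vectors.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(word: str) -> torch.Tensor:
    """Mean-pool the last-layer token embeddings of a word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def similar_words(seed: str, baike_text: str, threshold: float = 0.8) -> set:
    """Steps (3)-(9): segment the Baidu Baike text with JIEBA and keep
    tokens whose BERT vector is cosine-close to the seed keyword's."""
    candidates = {w for w in jieba.cut(baike_text) if len(w) > 1}
    seed_vec = embed(seed)
    vocab = set()
    for cand in candidates:
        score = torch.cosine_similarity(seed_vec, embed(cand), dim=0).item()
        if score >= threshold:  # step (8): threshold value is an assumption
            vocab.add(cand)
    return vocab
```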
| Stage 2: NEZHA word importance selection |
(13) | Initialize empty similar & important vocabulary |
(14) | Collect data from the CLUE data set in the form of (keywords, text) pairs
(15) | Randomly select words from each text as pseudo-keywords at a 1:1 ratio with the true keywords
(16) | Build the fine-tuning data set of (keyword/pseudo-keyword, text, label) triples
(17) | Split the data set into a training set and a development set
(18) | Fine-tune BERT-TensorFlow, BERT-MindSpore, and NEZHA-MindSpore on the training set
(19) | Select the best-performing model (NEZHA-MindSpore) by precision on the development set
(20) | For each keyword in similar words vocabulary do |
(21) | Calculate the context importance score of the keyword based on the selected model
(22) | Add the keyword and its importance score to the similar & important vocabulary
(23) | End for |
(24) | Keep the 100 words with the highest importance scores in the vocabulary
| Output similar & important vocabulary
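Stage 2's data construction and importance scoring can be sketched as below. The paper fine-tunes BERT/NEZHA under TensorFlow and MindSpore, so the generic Hugging Face pair classifier here is only a stand-in; `clue_samples` (a list of (keywords, text) pairs from CLUE) and `context_of` (mapping each word to its Baidu Baike text) are hypothetical inputs.

```python
import random
import jieba
import torch
from transformers import BertTokenizer, BertForSequenceClassification

def build_finetune_set(clue_samples):
    """Steps (14)-(16): label true keywords 1 and randomly drawn
    pseudo-keywords 0, at a 1:1 ratio per text."""
    rows = []
    for keywords, text in clue_samples:
        tokens = [w for w in jieba.cut(text) if w not in keywords]
        pseudo = random.sample(tokens, k=min(len(keywords), len(tokens)))
        rows += [(kw, text, 1) for kw in keywords]
        rows += [(pw, text, 0) for pw in pseudo]
    return rows

# Assumption: a generic pair classifier stands in for NEZHA-MindSpore;
# it must first be fine-tuned on the rows built above (steps (17)-(19)).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
classifier = BertForSequenceClassification.from_pretrained("bert-base-chinese")
classifier.eval()

def importance(word: str, context: str) -> float:
    """Step (21): P(word is a true keyword of context) as importance score."""
    inputs = tokenizer(word, context, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def top_important(similar_vocab, context_of, k=100):
    """Steps (20)-(24): score every similar word and keep the top k."""
    scored = {w: importance(w, context_of[w]) for w in similar_vocab}
    return dict(sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k])
```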
| Stage 3: LSTM stock index forecast |
(25) | For each keyword in the similar & important vocabulary do
(26) | For each lag from 1 to 10 do
(27) | Calculate the lagged search index time series
(28) | End for |
(29) | Use the Pearson correlation coefficient to select the most correlated lag
(30) | End for |
(31) | Train LSTM to forecast the CSI300 stock index on the 2215-day training data set
(32) | Calculate and compare model RMSE on the 243-day test data set
| Output model RMSE |
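Stage 3's lag selection and LSTM evaluation, sketched with pandas and Keras under stated assumptions: `search_index` (keyword → daily search index Series) and `csi300` (daily index level Series, aligned on trading days) are hypothetical inputs, and the single-timestep LSTM layout and its hyperparameters are illustrative; only the split sizes (2215 training days, 243 test days) come from the pseudocode.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def best_lag(series: pd.Series, target: pd.Series, max_lag: int = 10) -> int:
    """Steps (26)-(29): pick the lag in 1..10 whose shifted search index
    correlates most strongly (in absolute value) with the CSI300 series."""
    corrs = {}
    for lag in range(1, max_lag + 1):
        lagged = series.shift(lag)
        mask = lagged.notna() & target.notna()
        corrs[lag] = abs(pearsonr(lagged[mask], target[mask])[0])
    return max(corrs, key=corrs.get)

def make_features(search_index: dict, csi300: pd.Series) -> pd.DataFrame:
    """Shift every keyword's search index by its best lag (steps (25)-(30))."""
    cols = {kw: s.shift(best_lag(s, csi300)) for kw, s in search_index.items()}
    return pd.DataFrame(cols).dropna()

def fit_and_evaluate(X: pd.DataFrame, y: pd.Series,
                     train_days: int = 2215, test_days: int = 243) -> float:
    """Steps (31)-(32): train an LSTM on the first 2215 days and report
    RMSE on the following 243 days. Architecture is an assumed sketch."""
    Xa = X.to_numpy()[:, None, :]            # (samples, timesteps=1, features)
    ya = y.loc[X.index].to_numpy()
    X_tr, X_te = Xa[:train_days], Xa[train_days:train_days + test_days]
    y_tr, y_te = ya[:train_days], ya[train_days:train_days + test_days]
    model = Sequential([LSTM(32, input_shape=(1, Xa.shape[2])), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_tr, y_tr, epochs=50, batch_size=32, verbose=0)
    pred = model.predict(X_te, verbose=0).ravel()
    return float(np.sqrt(np.mean((pred - y_te) ** 2)))  # test-set RMSE
```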