I have a character vector that looks like this:

[1] "What can we learn from the Mahabharata "
[2] "What are the most iconic songs associated with the Vietnam War "
[3] "What are some major social faux pas to avoid when visiting Malta "
[4] "Will Ready Boost technology contribute to CFD software usage "
[5] "Who is Jon Snow "
...
and a data frame that assigns a score to each word:
word  score
the      11
to        9
What      9
I         7
a         6
are       6
I want to assign to each string the sum of the scores of the words it contains. My solution is the following function:
score_fun <- function(x) {
    # obtain the list of words
    z <- unlist(strsplit(x, ' '))
    # return the sum of the words' scores
    return(sum(word_scores$score[word_scores$word %in% z]))
}

# use sapply() in conjunction with the function
scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

# the output will look like
scores
[1] 20 26 24  9  0  0 38 32 30  0
The problem I have is performance: with about 500k strings containing over a million words in total, running this function takes more than an hour on my i7, 16 GB machine.
Besides, the solution just feels inelegant and clunky...
Is there a better (more efficient) solution?
Data to reproduce:
my_strings <- c("What can we learn from the Mahabharata ",
                "What are the most iconic songs associated with the Vietnam War ",
                "What are some major social faux pas to avoid when visiting Malta ",
                "Will Ready Boost technology contribute to CFD software usage ",
                "Who is Jon Snow ",
                "Do weighing scales measure mass or weight ",
                "What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ",
                "Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ",
                "What are some really good positive comments to say on a cricket field to your teammates ",
                "Is Donald Trump fact free ")

word_scores <- data.frame(word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
                          score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
                          stringsAsFactors = FALSE)

You can tokenize the strings into words with tidytext::unnest_tokens, then join and aggregate:
library(tidyverse)
library(tidytext)

data_frame(string = my_strings, id = seq_along(string)) %>%
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
    distinct() %>%
    left_join(word_scores) %>%
    group_by(id) %>%
    summarise(score = sum(score, na.rm = TRUE))
#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0
Keep the original strings, if you like, or rejoin them by ID at the end, as sketched below.
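A minimal sketch of that rejoin, assuming the scored tibble from above has been stored in a variable named scores_df (a name introduced here for illustration):

# hypothetical name: scores_df holds the id/score result from the pipeline above
strings_df <- data_frame(string = my_strings, id = seq_along(my_strings))

scores_df %>%
    left_join(strings_df, by = "id")  # reattach each original string to its score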
It's much slower on small data, but it gets faster at scale, e.g. when my_strings is resampled to a length of 10,000:
Unit: milliseconds
     expr        min         lq      mean     median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
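For reference, a sketch of how timings like these could be produced with the microbenchmark package. The Reduce entry comes from another approach not shown in this post, so only the two methods above are included, and the resampling setup (seed, sample with replacement) is an assumption about the original benchmark:

library(microbenchmark)
library(tidyverse)
library(tidytext)

set.seed(47)  # assumed; any seed works for illustration
big_strings <- sample(my_strings, 10000, replace = TRUE)  # resample to 10,000 strings

microbenchmark(
    sapply = sapply(big_strings, score_fun, USE.NAMES = FALSE),
    tidytext = data_frame(string = big_strings, id = seq_along(string)) %>%
        unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
        distinct() %>%
        left_join(word_scores, by = "word") %>%
        group_by(id) %>%
        summarise(score = sum(score, na.rm = TRUE)),
    times = 100
)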