Improving performance when computing the sum of word scores over a large vector of strings?

I have a vector of strings that looks like this:

[1] "What can we learn from the Mahabharata "                                                                
 [2] "What are the most iconic songs associated with the Vietnam War "                                        
 [3] "What are some major social faux pas to avoid when visiting Malta "                                      
 [4] "Will Ready Boost technology contribute to CFD software usage "                                          
 [5] "Who is Jon Snow " ...

and a data frame that assigns a score to each word:

word score
   the    11
    to     9
  What     9
     I     7
     a     6
   are     6

I want to assign to each string the sum of the scores of the words it contains. My solution is the following function:

score_fun <- function(x) {
  # obtain the list of words
  z <- unlist(strsplit(x, ' '))
  # return the sum of the words' scores
  return(sum(word_scores$score[word_scores$word %in% z]))
}

# apply the function to each string with sapply()
scores <- sapply(my_strings, score_fun, USE.NAMES = FALSE)

# the output looks like
scores
[1] 20 26 24  9  0  0 38 32 30  0

The problem I am running into is performance: I have about 500k strings containing over a million words in total, and applying the function takes more than an hour on my i7, 16 GB machine.
Besides, the solution just feels inelegant and clunky.

Is there a better (more efficient) solution?

Data to reproduce:

my_strings <- c("What can we learn from the Mahabharata ", "What are the most iconic songs associated with the Vietnam War ", 
"What are some major social faux pas to avoid when visiting Malta ", 
"Will Ready Boost technology contribute to CFD software usage ", 
"Who is Jon Snow ", "Do weighing scales measure mass or weight ", 
"What will happen to the money in foreign banks after demonetizing 500 and 1000 rupee notes ", 
"Is it mandatory to stay for 11 months in a rented house if the rental agreement was made for 11 months ", 
"What are some really good positive comments to say on a cricket field to your teammates ", 
"Is Donald Trump fact free ")


word_scores <- data.frame(word = c("the", "to", "What", "I", "a", "are", "in", "of", "and", "do"),
                          score = c(11L, 9L, 9L, 7L, 6L, 6L, 6L, 6L, 3L, 3L),
                          stringsAsFactors = FALSE)

You can tokenize the strings into words with tidytext::unnest_tokens, then join the scores and aggregate:

library(tidyverse)
library(tidytext)

data_frame(string = my_strings, id = seq_along(string)) %>% 
    unnest_tokens(word, string, 'words', to_lower = FALSE) %>% 
    distinct() %>%
    left_join(word_scores) %>% 
    group_by(id) %>%
    summarise(score = sum(score, na.rm = TRUE))

#> # A tibble: 10 × 2
#>       id score
#>    <int> <int>
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0

If you like, keep the original strings alongside, or join them back in by id at the end.
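
For example, a minimal sketch of that final re-join, assuming the result of the pipeline above was stored in a variable named scored (a name introduced here for illustration, not from the answer). The distinct() step in the pipeline keeps the semantics of the original %in% approach, counting each matched word once per string:

# re-attach each score to its original string by id
data_frame(string = my_strings, id = seq_along(string)) %>%
    left_join(scored, by = "id")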

On small data this is a lot slower, but it becomes faster at scale, e.g. when my_strings is resampled up to a length of 10,000:

Unit: milliseconds
     expr        min         lq      mean    median        uq       max neval
   Reduce 5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502   100
   sapply  460.75930  486.94336  511.2762  503.4932  532.2363  746.8376   100
 tidytext   86.92182   94.65745  101.7064  100.1487  107.3289  134.7276   100
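
For reference, here is a sketch of how such timings could be reproduced with the microbenchmark package. The exact harness is not shown in the answer, so the resampling step and the expression bodies below are assumptions; the Reduce entry in the table comes from another approach that is not reproduced here:

library(microbenchmark)

# resample the ten example strings up to 10,000 (assumed setup)
big_strings <- sample(my_strings, 1e4, replace = TRUE)

microbenchmark(
    sapply = sapply(big_strings, score_fun, USE.NAMES = FALSE),
    tidytext = data_frame(string = big_strings, id = seq_along(string)) %>%
        unnest_tokens(word, string, 'words', to_lower = FALSE) %>%
        distinct() %>%
        left_join(word_scores, by = "word") %>%
        group_by(id) %>%
        summarise(score = sum(score, na.rm = TRUE)),
    times = 100
)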