当前位置 : 主页 > 网络安全 > 测试自动化 >

水库采样的性能与获取列表的长度和挑选随机元素的关系

来源:互联网 收集:自由互联 发布时间:2021-06-22
我写了两个函数来从未知长度的列表中选择一个随机元素.第一个使用储层采样(具有大小为1的储层),第二个使用列表的长度来选择随机索引并将其返回.出于某种原因,前者要快得多. 第一
我写了两个函数来从未知长度的列表中选择一个随机元素.第一个使用储层采样(具有大小为1的储层),第二个使用列表的长度来选择随机索引并将其返回.出于某种原因,前者要快得多.

第一个函数使用单个遍历并以概率(1 / i)选择每个元素,其中i是列表中元素的索引.它导致挑选每个元素的概率相等.

pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
  stdgen <- newStdGen
  return (pickRandom' xs x 1 stdgen)

-- Pick a random number using reservoir sampling
pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x:xs) xi n gen =
  let (rand, gen') = randomR (0, n) gen in
  if (rand == 0) then
    pickRandom' xs x (n + 1) gen' -- Update value
  else
    pickRandom' xs xi (n + 1) gen' -- Keep previous value

第二个版本遍历列表一次以获得其长度,然后选择0和输入列表(-1)的长度之间的索引以获得元素之一,同样具有相同的概率.列表1.5的预期遍历次数:

-- Traverses the list twice
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
  gen <- newStdGen
  (e, _) <- return $randomR (0, (length xs) - 1) gen
  return $xs !! e

以下是我用于对这两个函数进行基准测试的代码:

main :: IO ()
main = do
  gen <- newStdGen
  let size = 2097152
      inputList = getRandList gen size
  defaultMain [ bench "Using length" (pickRandomWithLen inputList)
              , bench "Using reservoir" (pickRandom inputList)
              ]

这是一个剥离的输出:

benchmarking Using reservoir
mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950

benchmarking Using length
mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950

换句话说,第一个功能比第二个功能快约200倍.我预计运行时主要受随机数生成和列表遍历数(1对1.5)的影响.还有哪些其他因素可以解释如此巨大的差异?

您的基准测试操作实际上并未评估结果,

pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
  stdgen <- newStdGen
  return (pickRandom' xs x 1 stdgen)

只获得一个新的StdGen并返回一个thunk.这很直接.

pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
  gen <- newStdGen
  (e, _) <- return $randomR (0, (length xs) - 1) gen
  return $xs !! e

计算列表的长度,然后返回一个thunk,当然要慢得多.

强制两者评估结果,

return $! ...

使用版本的速度更快,

benchmarking Using length
mean: 14.65655 ms, lb 14.14580 ms, ub 15.16942 ms, ci 0.950
std dev: 2.631668 ms, lb 2.378186 ms, ub 2.937339 ms, ci 0.950
variance introduced by outliers: 92.581%
variance is severely inflated by outliers

benchmarking Using reservoir
collecting 100 samples, 1 iterations each, in estimated 47.00930 s
mean: 451.5571 ms, lb 448.4355 ms, ub 455.7812 ms, ci 0.950
std dev: 18.50427 ms, lb 14.45557 ms, ub 24.74350 ms, ci 0.950
found 4 outliers among 100 samples (4.0%)
  2 (2.0%) high mild
  2 (2.0%) high severe
variance introduced by outliers: 38.511%
variance is moderately inflated by outliers

(在强制输入列表之前通过打印其总和来评估之后),因为它只需要一次调用PRNG,而水库采样使用长度列表–1次调用.

如果使用比StdGen更快的PRNG,则差异可能会更小.

实际上,使用System.Random.Mersenne而不是StdGen(要求pickRandom’具有IO结果类型,并且因为它在特定范围内不提供生成但是仅默认范围稍微扭曲了所选元素的分布,但是因为我们只对伪随机数生成所需的时间,这并不重要),储层采样的时间下降到

mean: 51.83185 ms, lb 51.77620 ms, ub 51.91259 ms, ci 0.950
std dev: 482.4712 us, lb 368.4433 us, ub 649.1758 us, ci 0.950

(当然,pickRandomWithLen时间不会发生显着变化,因为它只使用了一代).大约九倍的加速,表明伪随机生成是主导因素.

网友评论