I have about 30 text files with the following structure:
    wordleft1|wordright1
    wordleft2|wordright2
    wordleft3|wordright3
    ...
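The files total about 1 GB and contain roughly 32 million lines of word pairs.
I have tried a few ways to load them as quickly as possible and store the pairs in a hash: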
$hash{$wordleft} = $wordright
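A minimal sketch of such a loading loop, assuming one `left|right` pair per line. The helper name `load_pairs` and the sample data are illustrative and do not come from the original code:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample input standing in for one of the ~30 files.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "wordleft1|wordright1\nwordleft2|wordright2\nwordleft3|wordright3\n";
close $out or die "close failed: $!";

# load_pairs is an illustrative helper name, not from the original post.
sub load_pairs {
    my ($hash, @files) = @_;
    for my $file (@files) {
        open my $in, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$in>) {
            chomp $line;
            # Split on the first '|' only, so the right side may contain '|'.
            my ($wordleft, $wordright) = split /\|/, $line, 2;
            next unless defined $wordright;    # skip malformed lines
            $hash->{$wordleft} = $wordright;
        }
        close $in;
    }
    return scalar keys %$hash;
}

my %hash;
my $count = load_pairs(\%hash, $path);
print "Loaded $count pairs; wordleft2 => $hash{wordleft2}\n";
```

If a left word occurs more than once, this loop keeps the last value seen, which matches plain hash assignment.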
Opening the files one at a time and reading them line by line takes about 42 seconds. I then stored the hash with the Storable module:
store \%hash, $filename
Loading the data again with
$hashref = retrieve $filename
brings the time down to about 28 seconds. I am using a fast SSD and a fast CPU, and I have enough RAM to hold all the data (it takes about 7 GB).
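For reference, the Storable round trip can be sketched as a self-contained script. The 1000-element hash and the file name `hash.storable` are stand-ins chosen for illustration; timings on this toy data say nothing about the 32-million-pair case:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);
use Time::HiRes qw(gettimeofday tv_interval);

# Small stand-in hash; the real data has about 32 million pairs.
my %hash = map { ("wordleft$_" => "wordright$_") } 1 .. 1000;

my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/hash.storable";    # hypothetical file name

# Serialize the whole hash to disk in one call.
store(\%hash, $filename) or die "store failed for $filename\n";

# Time only the part that would be repeated on every start-up.
my $t0      = [gettimeofday];
my $hashref = retrieve($filename);
my $elapsed = tv_interval($t0);

printf "Retrieved %d pairs in %.6f seconds\n", scalar(keys %$hashref), $elapsed;
```

Note that `retrieve` still has to rebuild every key and value in memory, so its cost grows with the number of pairs.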
I am looking for a faster way to load this data into RAM (for various reasons, I cannot keep it resident there).
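You could try Dan Bernstein's CDB file format through a tied hash, which would require minimal code changes. You may need to install CDB_File. On my laptop the cdb file opens very quickly, and I can do about 200-250k lookups per second. Here is a sample script that creates, uses, and benchmarks a cdb:

test_cdb.pl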
    #!/usr/bin/env perl

    use warnings;
    use strict;

    use Benchmark qw(:all) ;
    use CDB_File 'create';
    use Time::HiRes qw( gettimeofday tv_interval );

    scalar @ARGV or die "usage: $0 number_of_keys seconds_to_benchmark\n";
    my ($size)    = $ARGV[0] || 1000;
    my ($seconds) = $ARGV[1] || 10;

    my $t0;
    tic();

    # Create CDB
    my ($file, %data);
    %data = map { $_ => 'something' } (1..$size);
    print "Created $size element hash in memory\n";
    toc();

    $file = 'data.cdb';
    create %data, $file, "$file.$$";
    my $bytes = -s $file;
    print "Created data.cdb [ $size keys and values, $bytes bytes]\n";
    toc();

    # Read from CDB
    my $c = tie my %h, 'CDB_File', 'data.cdb' or die "tie failed: $!\n";
    print "Opened data.cdb as a tied hash.\n";
    toc();

    timethese( -1 * $seconds, {
        'Pick Random Key'    => sub { int rand $size },
        'Fetch Random Value' => sub { $h{ int rand $size }; },
    });

    tic();
    print "Fetching Every Value\n";
    for (0..$size) {
        no warnings; # Useless use of hash element
        $h{ $_ };
    }
    toc();

    # Start (or restart) the timer
    sub tic {
        $t0 = [gettimeofday];
    }

    # Print the time elapsed since the last tic/toc and restart the timer
    sub toc {
        my $t1 = [gettimeofday];
        my $elapsed = tv_interval ( $t0, $t1);
        $t0 = $t1;
        print "==> took $elapsed seconds\n";
    }
Output (1 million keys, benchmarked for 10 seconds):
    ./test_cdb.pl 1000000 10
    Created 1000000 element hash in memory
    ==> took 2.882813 seconds
    Created data.cdb [ 1000000 keys and values, 38890944 bytes]
    ==> took 2.333624 seconds
    Opened data.cdb as a tied hash.
    ==> took 0.00015 seconds
    Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
    Fetch Random Value: 10 wallclock secs (10.46 usr + 0.01 sys = 10.47 CPU) @ 236984.72/s (n=2481230)
    Pick Random Key:  9 wallclock secs (10.11 usr + 0.02 sys = 10.13 CPU) @ 3117208.98/s (n=31577327)
    Fetching Every Value
    ==> took 3.514183 seconds
Output (10 million keys, benchmarked for 10 seconds):
    ./test_cdb.pl 10000000 10
    Created 10000000 element hash in memory
    ==> took 44.72331 seconds
    Created data.cdb [ 10000000 keys and values, 398890945 bytes]
    ==> took 25.729652 seconds
    Opened data.cdb as a tied hash.
    ==> took 0.000222 seconds
    Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
    Fetch Random Value: 14 wallclock secs ( 9.65 usr + 0.35 sys = 10.00 CPU) @ 209811.20/s (n=2098112)
    Pick Random Key: 12 wallclock secs (10.40 usr + 0.02 sys = 10.42 CPU) @ 2865335.22/s (n=29856793)
    Fetching Every Value
    ==> took 38.274356 seconds
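To apply this to the files from the question, the cdb could be built once, directly from the input, without first holding every pair in a Perl hash. The following is only a sketch under assumptions: it needs the CDB_File module installed, it assumes one `left|right` pair per line, and the file names `pairs.txt` and `pairs.cdb` are made up for the example:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use CDB_File;
use File::Temp qw(tempdir);

my $dir   = tempdir(CLEANUP => 1);
my $input = "$dir/pairs.txt";    # hypothetical input file
my $cdb   = "$dir/pairs.cdb";    # hypothetical output file

# Write a tiny sample in the assumed one-pair-per-line format.
open my $out, '>', $input or die "Cannot write $input: $!";
print {$out} "wordleft$_|wordright$_\n" for 1 .. 3;
close $out or die "close failed: $!";

# Stream pairs into a new cdb; finish() renames the temp file into place.
my $maker = CDB_File->new($cdb, "$cdb.$$") or die "Cannot create $cdb: $!";
open my $in, '<', $input or die "Cannot open $input: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($wordleft, $wordright) = split /\|/, $line, 2;
    next unless defined $wordright;    # skip malformed lines
    $maker->insert($wordleft, $wordright);
}
close $in;
$maker->finish;

# Later runs only need the tie; lookups then read from the file on demand.
tie my %hash, 'CDB_File', $cdb or die "tie failed: $!";
my $value = $hash{wordleft2};
print "wordleft2 => $value\n";
untie %hash;
```

One difference to keep in mind: a cdb can hold several records under the same key, and a tied lookup returns the first one, whereas plain hash assignment keeps the last. If the input contains duplicate left words, the two approaches may therefore disagree.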