The main function is compute-space; its output is a nested map of the form

    {"w1" {"w11" 10, "w12" 31, ...} "w2" {"w21" 14, "w22" 1, ...} ... }

meaning that "w1" co-occurred with "w11" 10 times, and so on.
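It takes a collection of documents (sentences) and a collection of target words. It iterates over both and applies a context-fn (such as a sliding window) to extract the context words. More concretely, I pass in a closure over sliding-window:

    (compute-space docs (fn [target doc] (sliding-window target doc 5)) targets)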
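To make concrete what the context-fn hands back, here is a small sketch that runs sliding-window on one toy sentence. The function body is copied from the question and only reformatted; the sentence, the window size of 2, and the name `doc` are made up for illustration.

```clojure
;; sliding-window as posted in the question, reformatted so the example runs on its own.
(defn sliding-window [target s n]
  (loop [todo s, seen [], acc []]
    (let [curr (first todo)]
      (cond
        (= curr target) (recur (rest todo)
                               (cons curr seen)
                               (concat acc (take n seen) (take n (rest todo))))
        (empty? todo)   acc
        :else           (recur (rest todo) (cons curr seen) acc)))))

;; Illustrative input, not taken from the real corpus.
(def doc ["the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"])

;; The closure passed to compute-space has this shape (window of 2 here).
(def context-fn (fn [target doc] (sliding-window target doc 2)))

(context-fn "fox" doc)
;; => ("brown" "quick" "jumps" "over")

(context-fn "the" doc)
;; => ("quick" "brown" "over" "jumps" "lazy" "dog")
```

Note that the left-hand context comes back nearest word first, because `seen` is built with `cons`. That does not matter for counting. When the target is absent the result is an empty vector, which is truthy, so the `when-let` in compute-space still enters its body for every (doc, target) pair.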
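I have tested it with around 50 million words (about 3 million sentences) and ca. 20,000 targets. This version takes more than a day to complete. I also wrote a pmap-based parallel function (pcompute-space) that brings the computation time down to about 10 hours, but I still feel it should be faster. I don't have other code to compare against, but my intuition says it should be faster.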
    (defn compute-space
      ([docs context-fn targets]
       (let [space (atom {})]
         ;; Visit every (doc, target) pair and count each context word.
         (doseq [doc docs
                 target targets]
           (when-let [contexts (context-fn target doc)]
             (doseq [w contexts]
               (if (get-in @space [target w])
                 (swap! space update-in [target w] (partial inc))
                 (swap! space assoc-in [target w] 1)))))
         @space)))

    (defn sliding-window [target s n]
      ;; Collects up to n words on each side of every occurrence of target in s.
      (loop [todo s
             seen []
             acc  []]
        (let [curr (first todo)]
          (cond
            (= curr target) (recur (rest todo)
                                   (cons curr seen)
                                   (concat acc (take n seen) (take n (rest todo))))
            (empty? todo)   acc
            :else           (recur (rest todo) (cons curr seen) acc)))))

    ;; tick and deep-merge-with are not defined in this snippet.
    (defn pcompute-space [docs step context-fn targets]
      (reduce #(deep-merge-with + %1 %2)
              (pmap (fn [chunk]
                      (do (tick))
                      (compute-space chunk context-fn targets))
                    (partition-all step docs))))
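As a side note on the counting step, the read-then-branch around `swap!` could probably be collapsed into a single update with `fnil`, which supplies 0 when the key is missing. This is only a sketch of that one step, not a claim about where the time goes; the `[target word]` pairs below are invented for illustration.

```clojure
;; Sketch: one swap! per context word, with no separate lookup on @space.
(def space (atom {}))

;; Illustrative (target, context-word) pairs, not real data.
(doseq [[target w] [["fox" "brown"] ["fox" "jumps"] ["fox" "brown"]]]
  ;; (fnil inc 0) treats a missing count as 0 before incrementing.
  (swap! space update-in [target w] (fnil inc 0)))

@space
;; => {"fox" {"brown" 2, "jumps" 1}}
```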
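I profiled the application with jvisualvm and found that clojure.lang.Cons, clojure.lang.ChunkedCons and clojure.lang.ArrayChunk are dominating the process (see figure). This surely has to do with the fact that I am using this double doseq loop (earlier experiments showed this way to be faster than using map, reduce and the like, although I was using time to benchmark the functions).
I would be very grateful for any insight you can offer, and for suggestions on refactoring the code to make it run faster. I guess reducers might be of some help here, but I am not sure how and/or why.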
Specs
MacPro 2010 2.4 GHz Intel Core 2 Duo 4 GB RAM
Clojure 1.6.0
Java 1.7.0_51 Java HotSpot(TM) 64-Bit Server VM
Test data
GithubGist with the entire code
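The test data was:
> A lazy seq of 42 strings (targets).
> A lazy seq of 105,040 lazy seqs (documents).
> Each lazy seq in documents was a lazy seq of strings. The total number of strings contained in documents was 1,146,190.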
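That is quite a bit smaller than your workload. Criterium was used to collect the timings. Criterium evaluates the expression multiple times, first warming up the JIT and then collecting averaged data.
With my test data and your code, compute-space took 22 seconds: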
    WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
    Evaluation count : 60 in 60 samples of 1 calls.
                 Execution time mean : 21.989189 sec
        Execution time std-deviation : 471.199127 ms
       Execution time lower quantile : 21.540155 sec ( 2.5%)
       Execution time upper quantile : 23.226352 sec (97.5%)
                       Overhead used : 13.353852 ns

    Found 2 outliers in 60 samples (3.3333 %)
        low-severe   2 (3.3333 %)
     Variance from outliers : 9.4329 % Variance is slightly inflated by outliers
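First optimization
Updated to use frequencies to go from a vector of words to a map of each word and its number of occurrences.
To help me understand the structure of the computation, I wrote a separate function that takes the collection of documents, context-fn, and a single target, and returns the map of context words to counts, that is, the inner map for one target of what compute-space returns. It is written with built-in Clojure functions instead of updating counts by hand.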
    (defn compute-context-map-f [documents context-fn target]
      (frequencies (mapcat #(context-fn target %) documents)))
compute-space written in terms of compute-context-map-f, named compute-space-f, is fairly short:
    (defn compute-space-f [docs context-fn targets]
      (into {}
            (map #(vector % (compute-context-map-f docs context-fn %))
                 targets)))
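A small worked run may help show the shape of the result. The two functions are the ones above; the context-fn `whole-doc-context` and the three tiny documents are hypothetical stand-ins chosen so the counts can be checked by hand.

```clojure
(defn compute-context-map-f [documents context-fn target]
  (frequencies (mapcat #(context-fn target %) documents)))

(defn compute-space-f [docs context-fn targets]
  (into {}
        (map #(vector % (compute-context-map-f docs context-fn %))
             targets)))

;; Hypothetical context-fn: if the doc contains the target, every other
;; word in the doc counts as context; otherwise nil (no context).
(defn whole-doc-context [target doc]
  (when (some #{target} doc)
    (remove #{target} doc)))

;; Illustrative documents, not the benchmark data.
(def docs [["a" "b" "c"] ["a" "b"] ["c" "d"]])

(compute-space-f docs whole-doc-context ["a" "c"])
;; => {"a" {"b" 2, "c" 1}, "c" {"a" 1, "b" 1, "d" 1}}
```

Each target gets its own inner map built in one pass by `frequencies`, so no shared mutable state is needed.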
Timing with the same data as above is 65% of the original version:
    WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
    Evaluation count : 60 in 60 samples of 1 calls.
                 Execution time mean : 14.274344 sec
        Execution time std-deviation : 345.240183 ms
       Execution time lower quantile : 13.981537 sec ( 2.5%)
       Execution time upper quantile : 15.088521 sec (97.5%)
                       Overhead used : 13.353852 ns

    Found 3 outliers in 60 samples (5.0000 %)
        low-severe   1 (1.6667 %)
        low-mild     2 (3.3333 %)
     Variance from outliers : 12.5419 % Variance is moderately inflated by outliers
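Parallelizing the first optimization
I chose to chunk by target instead of by document, so that merging the maps together does not require modifying any target's {context-word count, ...} map.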
    (defn pcompute-space-f [docs step context-fn targets]
      (into {}
            (pmap #(compute-space-f docs context-fn %)
                  (partition-all step targets))))
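The reason a plain `into {}` is enough here can be sketched with two hand-written partial results. When chunks are split by target, their top-level keys are disjoint; when they are split by document, the same target shows up in several chunks and the inner maps have to be summed. The maps below are invented for illustration.

```clojure
;; Partial results when chunking by target: top-level keys never collide.
(def by-target-chunks [{"a" {"b" 2, "c" 1}}
                       {"c" {"a" 1, "d" 1}}])

(into {} by-target-chunks)
;; => {"a" {"b" 2, "c" 1}, "c" {"a" 1, "d" 1}}

;; Partial results when chunking by document: "a" appears in both chunks.
(def by-doc-chunks [{"a" {"b" 1}}
                    {"a" {"b" 1, "c" 1}}])

;; into {} would silently keep only the last inner map for "a".
(into {} by-doc-chunks)
;; => {"a" {"b" 1, "c" 1}}

;; A two-level merge is needed to add the counts together.
(apply merge-with (partial merge-with +) by-doc-chunks)
;; => {"a" {"b" 2, "c" 1}}
```

The trade-off is that every chunk of targets walks the full document collection, so this approach assumes the documents can be traversed repeatedly at reasonable cost.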
Timing with the same data as above is 16% of the original version:
    user> (criterium.core/bench (pcompute-space-f documents 4 #(sliding-window %1 %2 5) keywords))
    WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
    Evaluation count : 60 in 60 samples of 1 calls.
                 Execution time mean : 3.623018 sec
        Execution time std-deviation : 83.780996 ms
       Execution time lower quantile : 3.486419 sec ( 2.5%)
       Execution time upper quantile : 3.788714 sec (97.5%)
                       Overhead used : 13.353852 ns

    Found 1 outliers in 60 samples (1.6667 %)
        low-severe   1 (1.6667 %)
     Variance from outliers : 11.0038 % Variance is moderately inflated by outliers
Specs
> Mac Pro 2009 2.66 GHz Quad-Core Intel Xeon, 48 GB RAM.
> Clojure 1.6.0.
> Java 1.8.0_40 Java HotSpot(TM) 64-Bit Server VM.
TBD
Further optimizations.
Describe the test data.