1.10.2.5.1. 比较两个大文件的重复数据的算法

思路1：使用bloom filter算法。

如果不考虑100%准确率，那么bloom filter将是很好的选择

思路2：hash

hash的原理就不说了，这里主要使用“大而化小，分而治之”的策略。

遍历a（假如a是较大的文件），对a的每一行做hash运算，根据hash值将该行数据映射到一个小文件a1-a100文件中；
此时遍历b，做同样的hash算法，映射到b1-b100小文件中；（注意：两个字符串如果相同，那么他们经过同一hash算法得到的必然也是相等的）
逐个比较文件对，此时数据量够小，可以装载到hashmap中进行比对，最后得到结果。

参考
https://www.zhihu.com/question/21827402/answer/387830719
https://my.oschina.net/vdroid/blog/373439
https://blog.csdn.net/qingdujun/article/details/82343756
https://blog.csdn.net/tiankong_/article/details/77234726

results matching ""

No results matching ""