前言
最近在网上看到一个问题:10亿QQ号如何去重?
我觉得挺有意思的。
今天这篇文章跟大家一起分享一下常见的解决方案,希望对你会有所帮助。
更多项目实战在我的技术网站:http://www.susan.net.cn/project
一、技术难点
1.1 数据规模分析
<ul>原始数据:10亿×8字节 = 8GB
HashSet去重:至少16GB内存(Java对象开销)
理想方案:> 3) + 1]; // 每byte存储8个数字 } public void add(int num) { int arrayIndex = num >> 3; // num/8 int position = num & 0x07; // num%8 bits[arrayIndex] |= 1 > 3; int position = num & 0x07; return (bits[arrayIndex] & (1 > 3); int position = (int)(num & 7); bits[arrayIndex] |= 1 { try { r.close(); } catch (IOException e) {}}); } } class MergeEntry implements Comparable { String value; BufferedReader reader; public MergeEntry(String value, BufferedReader reader) { this.value = value; this.reader = reader; } @Override public int compareTo(MergeEntry o) { return this.value.compareTo(o.value); } } [/code]五、分布式解决方案
5.1 分片策略设计
5.2 Spark实现方案
- public class BitMap {
- private final byte[] bits;
-
- public BitMap(int maxNum) {
- this.bits = new byte[(maxNum >> 3) + 1]; // 每byte存储8个数字
- }
-
- public void add(int num) {
- int arrayIndex = num >> 3; // num/8
- int position = num & 0x07; // num%8
- bits[arrayIndex] |= 1 << position;
- }
-
- public boolean contains(int num) {
- int arrayIndex = num >> 3;
- int position = num & 0x07;
- return (bits[arrayIndex] & (1 << position)) != 0;
- }
- }
复制代码 5.3 内存优化:RoaringBitmap
存储优势对比:- (10^10 - 10^4) / 8 / 1024/1024 ≈ 1.16GB
复制代码 六、生产级架构:Lambda架构
6.1 系统架构图
6.2 各层技术选型
架构层技术栈处理目标批处理层Spark + HDFS全量数据去重速度层Flink + Redis实时增量去重服务层Spring Boot + HBase统一查询接口6.3 实时去重实现
- // 偏移量优化:存储(qq - 10000)
- public void add(long qq) {
- long num = qq - 10000;
- int arrayIndex = (int)(num >> 3);
- int position = (int)(num & 7);
- bits[arrayIndex] |= 1 << position;
- }
复制代码 七、终极方案:分层位图索引
7.1 架构设计
7.2 存储计算
QQ号范围:10000 - 9999999999(约100亿)
分层设计:
- 第一层分片:100个区间(每区间1亿)
- 第二层分片:100个子区间(每区间100万)
- 第三层存储:RoaringBitmap(每分区1.2MB)
总内存需求:- public class BloomFilter {
- private final BitSet bitset;
- private final int size;
- private final int[] seeds;
-
- public BloomFilter(int size, int hashCount) {
- this.bitset = new BitSet(size);
- this.size = size;
- this.seeds = new int[hashCount];
- for (int i = 0; i < hashCount; i++) {
- seeds[i] = i * 31;
- }
- }
-
- public void add(String qq) {
- for (int seed : seeds) {
- int hash = hash(qq, seed);
- bitset.set(Math.abs(hash % size), true);
- }
- }
-
- public boolean contains(String qq) {
- for (int seed : seeds) {
- int hash = hash(qq, seed);
- if (!bitset.get(Math.abs(hash % size))) {
- return false;
- }
- }
- return true;
- }
-
- private int hash(String value, int seed) {
- // MurmurHash 实现
- int result = 0;
- for (char c : value.toCharArray()) {
- result = seed * result + c;
- }
- return result;
- }
- }
复制代码 7.3 Java实现
- // 外部排序
- public void externalSort(String input, String output) throws IOException {
- List<File> chunks = splitAndSort(input, 100_000_000); // 每个文件1千万
- mergeFiles(chunks, output);
- }
- // 多路归并
- void mergeFiles(List<File> files, String output) {
- PriorityQueue<MergeEntry> queue = new PriorityQueue<>();
- List<BufferedReader> readers = new ArrayList<>();
- // 初始化堆
- for (File file : files) {
- BufferedReader reader = new BufferedReader(new FileReader(file));
- readers.add(reader);
- String line = reader.readLine();
- if (line != null) {
- queue.add(new MergeEntry(line, reader));
- }
- }
- try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
- long last = -1;
- while (!queue.isEmpty()) {
- MergeEntry entry = queue.poll();
- long qq = Long.parseLong(entry.value);
- // 去重:只写入不重复的QQ号
- if (qq != last) {
- writer.write(entry.value);
- writer.newLine();
- last = qq;
- }
- // 读取下一行
- String next = entry.reader.readLine();
- if (next != null) {
- queue.add(new MergeEntry(next, entry.reader));
- }
- }
- } finally {
- readers.forEach(r -> { try { r.close(); } catch (IOException e) {}});
- }
- }
- class MergeEntry implements Comparable<MergeEntry> {
- String value;
- BufferedReader reader;
- public MergeEntry(String value, BufferedReader reader) {
- this.value = value;
- this.reader = reader;
- }
- @Override
- public int compareTo(MergeEntry o) {
- return this.value.compareTo(o.value);
- }
- }
复制代码 八、方案对比与选型建议
[table][tr]方案适用场景内存/存储时间复杂度精度[/tr][tr][td]单机位图[/td][td] |