Engineering at DueDil

Efficient broadcast joins in Spark, using Bloom filters

22 November 2018 Mohamed Abdelbary

Broadcast joins are a nice way to avoid a shuffle operation in Spark. However, Spark’s collect operation for the broadcast set can introduce memory pressure on the driver. Bloom filters can provide a neat solution to this problem.
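To make the idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes one common form of the technique: build a compact Bloom filter from the keys of the smaller dataset, ship that filter around instead of the full dataset, and use it to discard rows of the larger dataset that cannot possibly match before doing the real join. The class name `BloomFilter`, its parameters, and the sample data are all hypothetical and chosen for illustration; none of this is Spark's API or necessarily the exact approach taken in the post.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2.
        n = max(1, expected_items)
        self.num_bits = max(
            8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two 64-bit halves of a single digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical data: a small keyed dataset and a much larger one.
small_side = {"c42": "Acme Ltd", "c77": "Globex plc"}
large_side = [("c%d" % i, "filing-%d" % i) for i in range(1000)]

# Build the filter from the small side's keys only.
bloom = BloomFilter(expected_items=len(small_side))
for key in small_side:
    bloom.add(key)

# Pre-filter the large side. A few false positives may survive this step,
# but no true match is ever dropped.
candidates = [row for row in large_side if bloom.might_contain(row[0])]

# The exact join on the reduced data removes any remaining false positives.
joined = [(k, v, small_side[k]) for k, v in candidates if k in small_side]
print(len(large_side), len(candidates), joined)
```

The filter occupies a handful of bits per key regardless of how wide the rows are, which is why passing it around is cheaper than collecting the full dataset. The trade-off is that the filter only answers "possibly present", so an exact join on the surviving rows is still needed.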