Bloom Filter example

  • Be sure to checkout “Tips & Tricks”
    Dear Guest Visitor → Once you register and log-in please checkout the “Tips & Tricks” page for some very handy tips!

    /Steve.
  • BootAble – FreeDOS boot testing freeware

    To obtain direct, low-level access to a system's mass storage drives, SpinRite runs under a GRC-customized version of FreeDOS which has been modified to add compatibility with all file systems. In order to run SpinRite it must first be possible to boot FreeDOS.

    GRC's “BootAble” freeware allows anyone to easily create BIOS-bootable media in order to workout and confirm the details of getting a machine to boot FreeDOS through a BIOS. Once the means of doing that has been determined, the media created by SpinRite can be booted and run in the same way.

    The participants here, who have taken the time to share their knowledge and experience, their successes and some frustrations with booting their computers into FreeDOS, have created a valuable knowledgebase which will benefit everyone who follows.

    You may click on the image to the right to obtain your own copy of BootAble. Then use the knowledge and experience documented here to boot your computer(s) into FreeDOS. And please do not hesitate to ask questions – nowhere else can better answers be found.

    (You may permanently close this reminder with the 'X' in the upper right.)

PHolder

Well-known member
Sep 16, 2020
1,518
1
611
Ontario, Canada
Security Now episode 989 (https://www.grc.com/sn/sn-989-notes.pdf) is a "propeller head" episode that discusses Bloom Filters. This inspired me to code up an example of one (in Java), which I will present here for those interested. First, here's an old example I coded up ages ago, that uses Google Guava's implementation of a Bloom Filter. This is to show you that you probably don't need to roll your own, but you may enjoy learning from doing so.

Java:
import org.eclipse.jdt.annotation.Nullable;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;

final class BloomFilterExample
{
    private static final int MAX_NUMBER_OF_ENTRIES = 10_000;
    private static final int EXTRA_ROOM_FOR_ENTRIES_FOR_FALSE_POSITIVE_PREVENTION = 0;
    // The default value for false positive probability is 0.03 (3%)
    private static final double FALSE_POSITIVE_PROBABILITY = 0.01;
    private static final int STARTING_PRIME = 7883;  // 7883 is one of the first one thousand primes
    
    private final BloomFilter<BigInteger> bloomFilter = BloomFilter.create(BigIntegerFunnel.INSTANCE,
            MAX_NUMBER_OF_ENTRIES + EXTRA_ROOM_FOR_ENTRIES_FOR_FALSE_POSITIVE_PREVENTION,
            FALSE_POSITIVE_PROBABILITY);
    private final List<BigInteger> listOfBigIntsPutIntoBloomFilter = new ArrayList<>(MAX_NUMBER_OF_ENTRIES);

    public void doExample()
    {
        // Find a series of [probable] prime numbers
        BigInteger curPrime = BigInteger.valueOf(STARTING_PRIME);
        for (int i=0; i<MAX_NUMBER_OF_ENTRIES; i++)
        {
            curPrime = curPrime.nextProbablePrime();
            bloomFilter.put(curPrime);
            listOfBigIntsPutIntoBloomFilter.add(curPrime);
        }
        
        // Test the bloom filter, tracking the results
        // Note that the testing range includes more than just primes, so the number of tests will
        // be significant larger than the number of entries we installed into the bloom filter
        // We do, however, skip testing the even numbers, because if this bloom filter was to actually
        // be used for primality indication, no good programmer would bother testing even numbers.
        int numberOfValuesTested = 0;
        int numberOfFalsePositives = 0;
        for (int i=STARTING_PRIME; i<curPrime.intValue(); i+=2) 
        {
            final BigInteger soughtValue = BigInteger.valueOf(i);
            numberOfValuesTested++;
            if (bloomFilter.mightContain(soughtValue))
            {
                if (listOfBigIntsPutIntoBloomFilter.contains(soughtValue))
                {
                    //System.out.println(soughtValue + " was in both");
                }
                else
                {
                    //System.out.println(soughtValue + " was a false positive");
                    numberOfFalsePositives++;
                }
            }
            else
            {
                //System.out.println(soughtValue + " not in bloom filter");
            }
        }
        System.out.println("When " + numberOfValuesTested + " different values were tested, there were " + numberOfFalsePositives + " false positives.");
        final double falsePositivePercentage = ((double)(int)((double)(numberOfFalsePositives*10000) / numberOfValuesTested))/100;
        System.out.println("The actual false positive rate was " + falsePositivePercentage + "%, while the requested rate was " + (FALSE_POSITIVE_PROBABILITY * 100) + "%.");
    }

    private static enum BigIntegerFunnel implements Funnel<BigInteger>
    {
        // The Guava docs recommend making the Funnel an enum to help with
        // [de]serialization
        INSTANCE;

        @Override
        // This is quite probably @NonNull but it's painful to override and this check is safe and easy
        public void funnel(final BigInteger from, final @Nullable PrimitiveSink into)
        {
          Objects.requireNonNull(into);
          into.putBytes(from.toByteArray());       
        }   
    }
}
 
And now for my example. First let's define some Interfaces of what a general Bloom Filter might look like. You need to be able to add items. And then you can confirm whether an item MIGHT be present, or is DEFINITELY NOT present. T in this example is a "generic" type. My further code will be based on Strings, but you don't need to hard specify that to define the interface.

Java:
public interface BloomFilter<T>
{
  void addItem(T itemToAdd);
  boolean isDefinitelyNotPreset(T itemToCheckPresenceOf);
  boolean isPotentiallyPresent(T itemToCheckPresenceOf);
}

A Bloom Filter needs to use some sort of set of bits (aka a BitSet) and also some sort of a hashing function.

Java:
public interface BloomFilterBitset
{
  void setBitnumber(int bitNumberToSet);
  void setBitNumbers(int[] bitNumbersToSet);
  boolean isBitNumberSet(int bitNumberToCheck);
  boolean areAllBitNumbersSet(int[] bitNumbersToCheck);
}

Java:
public interface BloomFilterHash<T>
{
  int convertInputToBitNumber(T inputToConvert);
  int[] convertInputToBitNumbers(T inputToConvert);
}
 
Last edited:
Now that we know how the implementations will work, here is an implementation of a "Basic" Bloom Filter, which will rely on the implementations of the bit set and hashes.

Java:
public class BasicBloomFilter<T> implements BloomFilter<T>
{
  private final BloomFilterHash<T> bloomFilterHash;
  private final BloomFilterBitset bloomFilterBitset;
 
  public BasicBloomFilter(final BloomFilterHash<T> bloomFilterHashToUse, final BloomFilterBitset bloomFilterBitsetToUse)
  {
    bloomFilterHash   = bloomFilterHashToUse;
    bloomFilterBitset = bloomFilterBitsetToUse;
  }

  @Override
  public void addItem(T itemToAdd)
  {
    final int[] bits = bloomFilterHash.convertInputToBitNumbers(itemToAdd);
    bloomFilterBitset.setBitNumbers(bits);
  }

  @Override
  public boolean isDefinitelyNotPreset(final T itemToCheckPresenceOf)
  {
    final int[] bits = bloomFilterHash.convertInputToBitNumbers(itemToCheckPresenceOf);
    return !bloomFilterBitset.areAllBitNumbersSet(bits);
  }

  @Override
  public boolean isPotentiallyPresent(final T itemToCheckPresenceOf)
  {
    final int[] bits = bloomFilterHash.convertInputToBitNumbers(itemToCheckPresenceOf);
    return bloomFilterBitset.areAllBitNumbersSet(bits);
  }
}

For the bit set, we'll do something somewhat inefficient, but expedient, and rely on the features of a BigInteger which already can access the bits by index.

Java:
import java.math.BigInteger;

public class BigIntegerBitset implements BloomFilterBitset
{
  private final int maxNumberOfBits;
  private BigInteger bitset;

  public BigIntegerBitset(final int maxNumberOfBits)
  {
    this.maxNumberOfBits = maxNumberOfBits;
    bitset = BigInteger.ZERO;
  }

  @Override
  public void setBitnumber(final int bitNumberToSet)
  {
    rangeCheckBitNumber(bitNumberToSet);
    bitset = bitset.setBit(bitNumberToSet);
  }

  private void rangeCheckBitNumber(final int bitNumberToRangeCheck)
  {
    if (bitNumberToRangeCheck >= maxNumberOfBits)
      throw new IllegalArgumentException("Request to access bit number " + bitNumberToRangeCheck + " when maxNumberOfBits is " + maxNumberOfBits);
  }

  @Override
  public void setBitNumbers(final int[] bitNumbersToSet)
  {
    for (int i=0; i<bitNumbersToSet.length; i++)
    {
      setBitnumber(bitNumbersToSet[i]);
    }
  }

  @Override
  public boolean isBitNumberSet(int bitNumberToCheck)
  {
    rangeCheckBitNumber(bitNumberToCheck);
    return bitset.testBit(bitNumberToCheck);
  }

  @Override
  public boolean areAllBitNumbersSet(int[] bitNumbersToCheck)
  {
    for (int i=0; i<bitNumbersToCheck.length; i++)
    {
      if (!isBitNumberSet(bitNumbersToCheck[i])) return false;
    }
    return true;
  }
}

For the hashing, we're going to use one of the least recommended hashes, but again it will be expedient. In Java, all objects that are comparable use a supplied hash function to save effort. (If the hashes don't equate, then it's assured the objects cannot equate without looking at them deeper.)

Note: In Java, all integers are signed, so the & 0x7fff_ffff just removes the sign bit to remove negative offsets.

Java:
import java.util.Arrays;

// Example only, probably not a great hash to actually use for this purpose
public class StringObjectHash implements BloomFilterHash<String>
{
  private final int numberOfHashPrefixes;
  private final String[] hashPrefixes;
  private final int numberOfBitsInBitset;
 
  public StringObjectHash(final String[] hashPrefixes, final int numberOfBitsInBitset)
  {
    numberOfHashPrefixes = hashPrefixes.length;
    this.hashPrefixes = Arrays.copyOf(hashPrefixes, numberOfHashPrefixes);
    this.numberOfBitsInBitset = numberOfBitsInBitset;
  }

  @Override
  public int convertInputToBitNumber(final String inputToConvert)
  {
    return ((inputToConvert.hashCode()) & 0x7fff_ffff) % numberOfBitsInBitset;
  }

  @Override
  public int[] convertInputToBitNumbers(final String inputToConvert)
  {
    final int[] result = new int[numberOfHashPrefixes];
    for (int i=0; i<numberOfHashPrefixes; i++)
    {
      result[i] = (((hashPrefixes + inputToConvert).hashCode()) & 0x7fff_ffff) % numberOfBitsInBitset;
    }
    return result;
  }
}
 
Last edited:
And now that we have implementations, we can make use of them in a simple example:

Java:
public class BloomFilterExampleMain
{
  private static final int MAX_NUMBER_OF_BITS = 1_000;

  public static void main(final String[] args)
  {
    final String[] hashPrefixes = {"One", "Two", "Three"};
    final BloomFilterHash<String> bfhs = new StringObjectHash(hashPrefixes, MAX_NUMBER_OF_BITS);
    final BloomFilterBitset bfbs = new BigIntegerBitset(MAX_NUMBER_OF_BITS);
    final BloomFilter<String> stringBloomFilterObjectHashBigIntegerBitSet = new BasicBloomFilter<>(bfhs, bfbs);
   
    final String[] testSet01 = {"Test", "Test123", "Testing"};
    addStrings(stringBloomFilterObjectHashBigIntegerBitSet, testSet01);
    final int numPotentiallyPresent = countNumberPotentiallyPresent(stringBloomFilterObjectHashBigIntegerBitSet, testSet01);
    if (numPotentiallyPresent == testSet01.length)
    {
      System.out.println("Expected first result");
    }
    else
    {
      System.out.println("Bad first result! " + numPotentiallyPresent);
    }

    final String[] testSet02 = {"TestA", "TestA123", "TestingA", "TestingB"};
    final int numNotPresent = countNumberNotPresent(stringBloomFilterObjectHashBigIntegerBitSet, testSet02);
    if (numNotPresent == testSet02.length)
    {
      System.out.println("Expected second result");
    }
    else
    {
      System.out.println("Bad second result! " + numNotPresent);
    }
  }

  private static <T> void addStrings(final BloomFilter<T> bloomFilterToAddTo, T[] itemsToAdd)
  {
    for (int i=0; i<itemsToAdd.length; i++)
    {
      bloomFilterToAddTo.addItem(itemsToAdd[i]);
    }
  }

  private static <T> int countNumberPotentiallyPresent(final BloomFilter<T> bloomFilterToCheck, T[] itemsToCheck)
  {
    int count = 0;
    for (int i=0; i<itemsToCheck.length; i++)
    {
      if (bloomFilterToCheck.isPotentiallyPresent(itemsToCheck[i]))  count++;
    }
    return count;
  }

  private static <T> int countNumberNotPresent(BloomFilter<T> bloomFilterToCheck, T[] itemsToCheck)
  {
    int count = 0;
    for (int i=0; i<itemsToCheck.length; i++)
    {
      if (bloomFilterToCheck.isDefinitelyNotPreset(itemsToCheck[i]))  count++;
    }
    return count;
  }
}
 
Last edited:
And there we have it. A complete working example of a very simple Bloom Filter using three different hashes. It's not a recommended implementation for real world use, but it should provide the basics to learn from and to improve upon. You can definitely use the interfaces provided here to produce your own implementations that are less basic.
 
  • Like
Reactions: Badrod