tech.v3.datatype.sampling

Implementation of reservoir sampling designed to be used in other systems. Provides a low-level sampler object and a double-reservoir that implements DoubleConsumer and derefs to a buffer of doubles.
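
For readers new to the technique, the idea behind reservoir sampling can be sketched in plain Clojure. This is the classic "Algorithm R", shown only as an illustration. It is not taken from this namespace, and the library may well use a different and more efficient scheme. The name algorithm-r is hypothetical.

```clojure
;; Illustrative sketch only (hypothetical helper, not part of this library).
;; Keeps a uniform random sample of at most k items from coll in one pass.
(defn algorithm-r
  [k coll ^java.util.Random rng]
  (let [[reservoir _]
        (reduce (fn [[res seen] item]
                  (let [seen (inc seen)]
                    (if (<= seen k)
                      ;; Fill phase: the first k items always enter the reservoir.
                      [(conj res item) seen]
                      ;; Replace phase: item number `seen` is kept with probability k/seen.
                      (let [j (.nextInt rng (int seen))]
                        [(if (< j k) (assoc res j item) res) seen]))))
                [[] 0]
                coll)]
    reservoir))

(algorithm-r 10 (range 200) (java.util.Random. 42))
```

The sample is uniform over the input regardless of its length, which is why the reservoir can be maintained over a stream of unknown size.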

->random

(->random {:keys [random seed algorithm], :or {algorithm :lcg}})
(->random)

Given an options map, returns an implementation of java.util.Random.

Options:

  • :algorithm - Either :lcg or :mersenne-twister. Defaults to :lcg, which is the default Java implementation.
  • :seed - Long integer seed.
  • :random - User-provided instance of java.util.Random. If present, it overrides all other options.
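
A minimal sketch of how these options might be used, assuming the namespace is on the classpath and aliased as sampling (the alias and the var names are illustrative). It relies only on the options listed above and on the standard java.util.Random methods.

```clojure
(require '[tech.v3.datatype.sampling :as sampling])

;; Two generators built from the same seed are expected to yield the same draws.
(def rng-a (sampling/->random {:seed 42}))
(def rng-b (sampling/->random {:seed 42}))

(def draws-a (vec (repeatedly 5 #(.nextLong ^java.util.Random rng-a))))
(def draws-b (vec (repeatedly 5 #(.nextLong ^java.util.Random rng-b))))

;; Selecting the Mersenne Twister algorithm with a seed.
(def rng-mt (sampling/->random {:algorithm :mersenne-twister :seed 42}))

;; A user-supplied generator passed via :random takes precedence over other options.
(def rng-c (sampling/->random {:random (java.util.Random. 7) :seed 42}))
```

Fixing :seed is mainly useful for reproducible tests. Omitting it leaves seeding to the underlying generator.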

reservoir-sampler

(reservoir-sampler reservoir-size)
(reservoir-sampler reservoir-size options)

Returns a ham-fisted (hamf) parallel reducer that accepts objects and whose final value is the reservoir of sampled data.

Merging consists of adding elements from the second distribution into the first.

Accepts the same options as ->random. The options map is also passed unchanged to make-container:

  • :algorithm - Either :lcg or :mersenne-twister. Defaults to :lcg, which is the default Java implementation.
  • :seed - Long integer seed.
  • :random - User-provided instance of java.util.Random. If present, it overrides all other options.
  • :datatype - Specify container's datatype. Defaults to :object

Examples:

tech.v3.datatype.sampling> (hamf/reduce-reducer (reservoir-sampler 10) (range 200))
[189 15 49 128 167 157 170 7 182 162]
tech.v3.datatype.sampling> (hamf/reduce-reducer (reservoir-sampler 10 {:datatype :float32}) (range 200))
[0.0 117.0 37.0 3.0 190.0 186.0 27.0 89.0 63.0 108.0]
tech.v3.datatype.sampling> (hamf/preduce-reducer (reservoir-sampler 10 {:datatype :float32}) (range 200000))
[5750.0
 128996.0
 146881.0
 174104.0
 101110.0
 24560.0
 25344.0
 170374.0
 158145.0
 124138.0]

reservoir-sampler-supplier

(reservoir-sampler-supplier reservoir-size options)
(reservoir-sampler-supplier reservoir-size)

Creates a java.util.function.LongSupplier that generates an infinite sequence of longs. A value of -1 means the new item should be skipped; any other value is the reservoir index to overwrite with the new item. The sampler expects the reservoir to be full before the first call to .getAsLong.

Dereferencing the supplier exposes the internal state of the sampler.

Same options as ->random:

  • :algorithm - Either :lcg or :mersenne-twister. Defaults to :lcg, which is the default Java implementation.
  • :seed - Long integer seed.
  • :random - User-provided instance of java.util.Random. If present, it overrides all other options.
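
The protocol described above (fill the reservoir first, then ask the supplier once per new item) could be driven roughly as follows. This is a sketch under stated assumptions: sample-with-supplier is a hypothetical helper, not part of the library, and it assumes exactly one call to .getAsLong is made for each item seen after the reservoir is full.

```clojure
(require '[tech.v3.datatype.sampling :as sampling])
(import '[java.util.function LongSupplier])

;; Hypothetical helper: sample up to n items from coll using the supplier.
(defn sample-with-supplier
  [n coll options]
  (let [^LongSupplier supplier (sampling/reservoir-sampler-supplier n options)
        ;; The reservoir must be full before the first call to .getAsLong,
        ;; so the first n items are copied in directly.
        reservoir (object-array (take n coll))]
    (doseq [item (drop n coll)]
      (let [idx (.getAsLong supplier)]
        ;; -1 means "skip this item"; otherwise idx is the slot to overwrite.
        (when-not (== idx -1)
          (aset reservoir (int idx) item))))
    (vec reservoir)))

(sample-with-supplier 10 (range 200) {:seed 1})
```

If coll holds fewer than n items, the helper simply returns all of them and never consults the supplier.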