tech.v3.datatype.char-input

deprecated

This namespace has been completely superseded by Charred, will receive no further updates, and is subject to removal at any time in the future.

Efficient ways to read files via the java.io.Reader interface. You can read a file into an iterator of (fixed, rotating) character buffers, create a new and much faster reader-like interface from those character buffers, and parse a csv/tsv type file with an interface that is mostly compatible with, but far faster than, clojure.data.csv.

Files are by default read by a separate thread into character arrays and those arrays are then processed. For details around the threading system see tech.v3.parallel.queue-iter.

CSV parsing is broken up into two parts. The first reads a file and creates an iterator of char[] buffers. The second parses that iterator of char[] buffers. As mentioned earlier, the file->char buffer step is additionally moved onto an offline thread.

This design is meant to be easy to experiment with. It could be that an mmap-based pathway allows us to read the file faster, or that parsing into blocks of rows, as opposed to char[]s, in an offline thread is faster, and so on.

Overall, parsing many CSVs in parallel will be a better strategy than using many threads to make parsing a single CSV incrementally faster, so the current design keeps the memory requirements for a single CSV quite low while still gaining the majority of the meaningful performance benefits.
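The "many CSVs in parallel" strategy can be sketched as follows. This is only an illustration: `parse-csv-text` and `parse-many` are hypothetical helper names, the inputs are in-memory strings wrapped in a `StringReader`, and `pmap` is just one convenient way to run one parse per input. Each parse turns off `:async?` so that the parallelism comes from working on several inputs at once rather than from a reader thread per input.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])
(import '[java.io StringReader])

(defn parse-csv-text
  "Hypothetical helper: parse one CSV string into a vector of row vectors."
  [^String text]
  ;; read-csv returns a closeable row iterator, so with-open releases it.
  (with-open [rows (char-input/read-csv (StringReader. text) {:async? false})]
    ;; Each row is a java.util.List; vec gives persistent-vector semantics.
    (mapv vec (iterator-seq rows))))

(defn parse-many
  "Hypothetical helper: parse several CSV strings, one parse per input."
  [texts]
  ;; pmap spreads the inputs over worker threads; doall forces every parse.
  (doall (pmap parse-csv-text texts)))

(def results (parse-many ["a,b\n1,2\n" "c,d\n3,4\n"]))
```

Each input is still parsed by a single thread, so the memory held per CSV stays small.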

Supporting Java classes:

  • CharBuffer.java - StringBuilder-like class that implements whitespace trimming, clear, and nil empty strings.
  • CharReader.java - A java.io.Reader-like class that only correctly implements single-character unread but allows access to the underlying buffer.
  • CSVReader.java - All the tight loops necessary to parse a CSV file quickly.
  • JSONReader.java - All the tight loops necessary to parse a JSON file. Fairly customizable.

char-ary-cls

json-reader-fn

deprecated

(json-reader-fn options)

parse-json-fn

deprecated

(parse-json-fn & [options])

Return a function from input->json. The function reuses the parse context, so when parsing many small JSON inputs, each of which is expected to yield one and only one JSON object, this pathway is a bit more efficient than read-json.

Same options as read-json-fn.
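A minimal usage sketch, assuming the options are passed as a single map (as the `& [options]` signature suggests) and that the inputs are strings. The names `parse-json` and `parsed` are only for illustration.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])

;; Build the parse function once so that its parse context is reused
;; across inputs. :key-fn is one of the read-json-fn options.
(def parse-json (char-input/parse-json-fn {:key-fn keyword}))

;; Each input is expected to contain exactly one JSON object.
(def parsed (mapv parse-json ["{\"id\": 1}" "{\"id\": 2}"]))
```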

read-csv

deprecated

(read-csv input & [options])

Read a csv into a row iterator. The parse algorithm is the same as clojure.data.csv, although this returns an iterator and each row is an ArrayList as opposed to a persistent vector. To convert a java.util.List into something with the same equals and hash semantics as a persistent vector, use either tech.v3.datatype.ListPersistentVector or vec. To convert an iterator to a sequence, use iterator-seq.

The returned iterator derives from AutoCloseable. Calling (.close iter) terminates the iteration, closes the underlying iterator, and joins the async thread.

For a drop-in but much faster replacement for clojure.data.csv use read-csv-compat.

Options:

  • :async? - Defaults to true - read the file into buffers in an offline thread. This speeds up reading larger files (1MB+) by about 30%.
  • :separator - Field separator - defaults to ,.
  • :quote - Quote specifier - defaults to \" (the double-quote character).
  • :close-reader? - Close the reader when iteration is finished - defaults to true.
  • :column-whitelist - Sequence of allowed column names.
  • :column-blacklist - Sequence of disallowed column names. When this conflicts with :column-whitelist, :column-whitelist wins.
  • :trim-leading-whitespace? - When true, leading spaces and tabs are ignored. Defaults to true.
  • :trim-trailing-whitespace? - When true, trailing spaces and tabs are ignored. Defaults to true.
  • :nil-empty-values? - When true, empty strings are elided entirely and returned as nil values. Defaults to true.
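A short usage sketch of the pieces described above, assuming a `java.io.Reader` is an acceptable input and that the options are passed as a single map. `csv-rows` is a hypothetical helper name, and `:async?` is disabled only because the input is tiny.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])
(import '[java.io StringReader])

(defn csv-rows
  "Hypothetical helper: eagerly read CSV text into a vector of row vectors."
  [^String csv-text]
  ;; The row iterator is closeable, so with-open closes it even when the
  ;; body throws before iteration has finished.
  (with-open [rows (char-input/read-csv (StringReader. csv-text)
                                        {:async? false
                                         :separator \,})]
    ;; iterator-seq turns the iterator into a sequence; vec converts each
    ;; java.util.List row into a persistent vector.
    (mapv vec (iterator-seq rows))))

(def rows (csv-rows "id,name\n1,alice\n2,bob\n"))
```

The rows are realized inside `with-open` because the iterator cannot be consumed after it has been closed.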

read-csv-compat

(read-csv-compat input & options)

Read a csv returning a clojure.data.csv-compatible sequence. For options see read-csv.

read-json

deprecated

(read-json input & args)

Drop in replacement for clojure.data.json/read and clojure.data.json/read-str. For options see read-json-fn.

read-json-fn

deprecated

(read-json-fn input & [options])

Read one or more JSON objects. Returns an auto-closeable function that, when called after the read pathway is finished, throws an exception by default. Input may be a character array or string (most efficient) or something convertible to a reader. Options for conversion to a reader are described in reader->char-reader, although for the JSON case :async? defaults to false, as most JSON is too small to benefit from async reading of the input.

Options:

  • :bigdec - When true use bigdecimals for floating point numbers. Defaults to false.
  • :double-fn - If :bigdec isn't provided, use this function to parse double values.
  • :profile - Which performance profile to use. This simply provides defaults for :array-iface and :obj-iface. The default :immutable value produces persistent datastructures and supports :value-fn and :key-fn. :mutable produces object arrays and java.util.HashMaps - this is about 30% faster. :raw produces ArrayLists for arrays and a JSONReader$JSONObj type with a public data member that is an ArrayList for objects.
  • :key-fn - Function called on each string map key.
  • :value-fn - Function called on each map value. Function is passed the key and val so it takes 2 arguments. If this function returns :tech.v3.datatype.char-input/elided then the key-val pair will be elided from the result.
  • :array-iface - Implementation of JSONReader$ArrayReader called on the object array of values for a javascript array.
  • :obj-iface - Implementation of JSONReader$ObjReader called for each javascript object. Note that providing this overrides key-fn and value-fn.
  • :eof-error? - Defaults to true - when EOF is encountered while attempting to read an object, throw an EOF error. Otherwise return a special EOF value.
  • :eof-value - EOF value. Defaults to
  • :eof-fn - Function called if readObject is going to return EOF. Defaults to throwing an EOFException.
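As an illustration of :key-fn and :value-fn, here is a sketch that uses read-json, which takes the same options. It assumes read-json accepts them as keyword arguments in the style of clojure.data.json/read-str, which is what its `& args` signature and drop-in description suggest. The sample JSON and the name `parsed` are made up for the example.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])

(def parsed
  (char-input/read-json
   "{\"name\": \"alice\", \"nickname\": null}"
   ;; Called on each string map key.
   :key-fn keyword
   ;; Called with the key and the value; returning the elided marker
   ;; drops the key-value pair from the result.
   :value-fn (fn [_k v]
               (if (nil? v)
                 :tech.v3.datatype.char-input/elided
                 v))))
```

Both functions apply under the default :immutable profile. Supplying :obj-iface would override them.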

reader->char-buf-fn

deprecated

(reader->char-buf-fn rdr & [options])

Given a reader, return a Clojure fn that, when called, reads the next buffer of the reader. This function iterates through a fixed number of buffers under the covers, so you need to be cognizant of the number of actual buffers that you want to have present in memory. This fn also implements AutoCloseable, and closing it will close the underlying reader.

Options:

  • :n-buffers - Number of buffers to use. Defaults to 8 - if this number is too small then buffers in flight will get overwritten.
  • :bufsize - Size of each buffer - defaults to 2048. Small improvements are sometimes seen with larger or smaller buffers.
  • :async? - defaults to false. When true data is read in an async thread.
  • :close-reader? - When true, close input reader when finished. Defaults to true.

reader->char-reader

deprecated

(reader->char-reader rdr options)
(reader->char-reader rdr)

Given a reader, return a CharReader, which presents some of the same interface as a java.io.PushbackReader but is only capable of pushing back 1 character.

Options:

Options are passed through mainly unchanged to queue-iter and to reader->char-buf-iter.

  • :async? - defaults to true - reads the reader in an offline thread into character buffers.