charred.api

Efficient pathways to read/write csv-based formats and json. Many of these functions have fast pathways for constructing the parser/writer in order to help with the case where you want to rapidly encode/decode a stream of small objects. For general use, the simply named read-XXX and write-XXX functions are designed to be drop-in but far more efficient replacements for their clojure.data.csv and clojure.data.json equivalents.

This is based on an underlying char[]-based parsing system that makes it easy to build new parsers and allows tight loops that iterate through loaded character arrays and are thus easily optimized by HotSpot.

  • CharBuffer.java - More efficient, simpler, and more general than StringBuilder.
  • CharReader.java - PushbackReader-like abstraction only capable of pushing back 1 character. Allows access to the underlying buffer and relative offset.

On top of these abstractions you have reader/writer abstractions for java and csv.

Many of these abstractions return a CloseableSupplier so you can simply use them with with-open and the underlying stream/reader will be closed when the control leaves the block. If you read all the data out of the supplier then the supplier itself will close the input when finished.
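
For example, a minimal sketch of the with-open pattern (assuming charred.api is required under the charred alias and a local file data.csv exists):

(require '[charred.api :as charred]
         '[clojure.java.io :as io])

;; The supplier is AutoCloseable, so with-open guarantees the underlying
;; reader is closed even if iteration stops early.
(with-open [rows (charred/read-csv-supplier (io/reader "data.csv"))]
  (doseq [row rows]
    (println row)))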

json-reader-fn

(json-reader-fn options)

Given options, return a function that when called constructs a json reader from exactly those options. This avoids the work of unpacking/analyzing the options when constructing many json readers for a sequence of small inputs.

json-writer-fn

(json-writer-fn options)

Return a function that, when called, efficiently constructs a JSONWriter from the given options. Same arguments as write-json.

parse-json-fn

(parse-json-fn & [options])

Return a function from input->json. The options are parsed once, so when parsing many small JSON inputs where you intend to get one and only one JSON object from each, this pathway is a bit more efficient than read-json.

Same options as read-json-supplier.
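
As a minimal sketch (assuming the charred alias for charred.api):

(require '[charred.api :as charred])

;; The options map is analyzed once, up front.
(def parse-fn (charred/parse-json-fn {:key-fn keyword}))

(mapv parse-fn ["{\"a\": 1}" "{\"b\": 2}"])
;; => [{:a 1} {:b 2}]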

PToJSON

protocol

Protocol to extend support for converting items to a json-supported datastructure. These can be a number, a string, an implementation of java.util.List or an implementation of java.util.Map.

members

->json-data

(->json-data item)

Automatic conversion of some subset of types to something acceptable to json. Defaults to toString for types that aren't representable in json.
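
As a sketch of extending the protocol (java.time.LocalDate is just an illustrative choice; by default it would fall through to toString):

(require '[charred.api :as charred])

(extend-protocol charred/PToJSON
  java.time.LocalDate
  (->json-data [item]
    ;; Return one of the json-supported shapes - here a map.
    {"year"  (.getYear item)
     "month" (.getMonthValue item)
     "day"   (.getDayOfMonth item)}))

(charred/write-json-str {"date" (java.time.LocalDate/of 2024 1 15)})
;; => "{\"date\":{\"year\":2024,\"month\":1,\"day\":15}}"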

read-csv

(read-csv input & {:as args})

Read a csv returning a clojure.data.csv-compatible sequence. For options see read-csv-supplier.

An important note is that :comment-char is disabled by default during read-csv for backward compatibility, while it is not disabled by default during read-csv-supplier. Also, :close-reader? defaults to false to match the behavior of data.csv.
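
For example (a minimal sketch assuming the charred alias; result shown indicatively):

(require '[charred.api :as charred])

(charred/read-csv "a,b\n1,2\n3,4\n")
;; => (["a" "b"] ["1" "2"] ["3" "4"])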

read-csv-supplier

(read-csv-supplier input & [options])

Read a csv into a row supplier. The parse algorithm is the same as clojure.data.csv, although this returns a java.util.function.Supplier which also implements AutoCloseable as well as clojure.lang.Seqable and clojure.lang.IReduce.

The supplier returned derives from AutoCloseable and it will terminate the reading and close the underlying read mechanism (and join the async thread) if (.close supp) is called.

For a drop-in but much faster replacement to clojure.data.csv use read-csv.

Options:

In addition to these options, see options for reader->char-buf-supplier.

  • :async? - Defaults to true - read the file into buffers in an offline thread. This speeds up reading larger files (1MB+) by about 30%.
  • :separator - Field separator - defaults to ,.
  • :quote - Quote specifier - defaults to ".
  • :escape - Escape character - defaults to disabled.
  • :close-reader? - Close the reader when iteration is finished - defaults to true.
  • :column-allowlist - Sequence of allowed column names or indexes. :column-whitelist still works but isn't preferred.
  • :column-blocklist - Sequence of disallowed column names or indexes. When it conflicts with :column-allowlist, :column-allowlist wins. :column-blacklist still works but isn't preferred.
  • :comment-char - Defaults to #. Rows beginning with this character are discarded with no further processing. Setting the comment-char to nil or (char 0) disables comment lines.
  • :trim-leading-whitespace? - When true, leading spaces and tabs are ignored. Defaults to true.
  • :trim-trailing-whitespace? - When true, trailing spaces and tabs are ignored. Defaults to true.
  • :nil-empty-values? - When true, empty strings are elided entirely and returned as nil values. Defaults to false.
  • :profile - Either :immutable or :mutable. :immutable returns persistent vectors while :mutable returns ArrayLists.
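
For example, reducing over a file while keeping only two columns (a sketch; the file name and column names are hypothetical):

(require '[charred.api :as charred]
         '[clojure.java.io :as io])

;; The supplier implements IReduce, so rows can be consumed
;; without realizing the whole file in memory.
(with-open [rows (charred/read-csv-supplier (io/reader "wide.csv")
                                            {:column-allowlist ["id" "score"]
                                             :nil-empty-values? true})]
  (reduce (fn [n _row] (inc n)) 0 rows))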

read-json

(read-json input & {:as args})

Drop-in replacement for clojure.data.json/read and clojure.data.json/read-str. For options see read-json-supplier.
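
For example (a sketch mirroring clojure.data.json/read-str usage; result shown indicatively):

(require '[charred.api :as charred])

(charred/read-json "{\"a\": [1, 2.5]}" :key-fn keyword)
;; => {:a [1 2.5]}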

read-json-supplier

(read-json-supplier input & [options])

Read one or more JSON objects. Returns an auto-closeable supplier that when called by default throws an exception if the read pathway is finished. Input may be a character array or string (most efficient) or something convertible to a reader. Options for conversion to reader are described in reader->char-reader, although for the json case we default :async? to false as most json is just too small to benefit from async reading of the input.

  • For input streams - unlike csv - :async? defaults to false, as most JSON files are relatively small - in the 10-100K range where async loading doesn't make much of a difference. On a larger file, however, setting :async? to true definitely can make a large difference.

Map keys are canonicalized using an instance of charred.StringCanonicalizer. This results in less memory usage and faster performance, as Java strings cache their hash codes. You can supply the string canonicalizer, potentially pre-initialized, via the parser-fn option. For an example of using the parser-fn option see fjson.clj.

Options:

In addition to the options below, see options for reader->char-reader.

  • :bigdec - When true use bigdecimals for floating point numbers. Defaults to false.
  • :double-fn - If :bigdec isn't provided, use this function to parse double values.
  • :profile - Which performance profile to use. This simply provides defaults for :array-iface and :obj-iface. The default :immutable value produces persistent datastructures and supports value-fn and key-fn. :mutable produces object arrays and java.util.HashMaps - this is about 30% faster. :raw produces ArrayLists for arrays and a JSONReader$JSONObj type with a public data member that is an ArrayList for objects.
  • :key-fn - Function called on each string map key.
  • :value-fn - Function called on each map value. Function is passed the key and val so it takes 2 arguments. If this function returns :charred.api/elided then the key-val pair will be elided from the result.
  • :array-iface - Implementation of JSONReader$ArrayReader called on the object array of values for a javascript array.
  • :obj-iface - Implementation of JSONReader$ObjReader called for each javascript object. Note that providing this overrides key-fn and value-fn.
  • :eof-error? - Defaults to true - when eof is encountered when attempting to read an object throw an EOF error. Else returns a special EOF value, controlled by the :eof-value option.
  • :eof-value - EOF value. Defaults to the keyword :eof.
  • :eof-fn - Function called if readObject is going to return EOF. Defaults to throwing an EOFException.
  • :parser-fn - Function that overrides the array-iface and obj-iface parameters - this is called each time the parser is created and must return a map with at least array-iface, obj-iface and finalize-fn keys. It may also optionally have a :string-canonicalizer key which, if present, must be an instance of charred.StringCanonicalizer. Thus you can share string tables between parser invocations or create a context-dependent set of array and object interface specifications.
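
As a sketch of streaming several objects from one input (assuming the charred alias; passing :eof-error? false with a nil :eof-value lets the sequence terminate cleanly at end of input):

(require '[charred.api :as charred])

(with-open [supp (charred/read-json-supplier "{\"a\": 1} {\"a\": 2}"
                                             {:key-fn keyword
                                              :eof-error? false
                                              :eof-value nil})]
  ;; The supplier is Seqable/IReduce, so normal collection fns apply.
  (vec supp))
;; => [{:a 1} {:a 2}]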

reader->char-buf-supplier

(reader->char-buf-supplier rdr & [options])

Given a reader, return a supplier that when called reads the next buffer of the reader. When n-buffers is >= 0, this function iterates through a fixed number of buffers under the covers, so you need to be cognizant of the number of actual buffers that you want to have present in memory. This fn also implements AutoCloseable and closing it will close the underlying reader.

Options:

  • :n-buffers - Number of buffers to use. Defaults to 6 as the queue size defaults to 4 - if this number is positive but too small then buffers in flight will get overwritten. If n-buffers is <= 0 then buffers are allocated as needed and not reused - this is the safest option but also can make async loading much slower than it would be otherwise. This must be at least 2 larger than queue-depth.
  • :queue-depth - Defaults to 4. See comments on :n-buffers.
  • :bufsize - Size of each buffer - defaults to (* 64 1024). Small improvements are sometimes seen with larger or smaller buffers.
  • :async? - Defaults to true if the number of processors is more than one. When true, data is read in an async thread.
  • :close-reader? - When true, close input reader when finished. Defaults to true.
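
As a rough sketch of direct use (this assumes the returned supplier yields buffers via .get and nil once the reader is exhausted):

(require '[charred.api :as charred]
         '[clojure.java.io :as io])

(with-open [supp (charred/reader->char-buf-supplier (io/reader "data.csv")
                                                    {:async? false})]
  ;; Count how many buffers the file decomposes into.
  (loop [buf (.get supp)
         n   0]
    (if buf
      (recur (.get supp) (inc n))
      n)))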

reader->char-reader

(reader->char-reader rdr options)(reader->char-reader rdr)

Given a reader, return a CharReader which presents some of the same interface as a PushbackReader but is only capable of pushing back 1 character. It is extremely quick to instantiate this object from a string or character array.

Options:

See options for reader->char-buf-supplier.

write-csv

(write-csv w data & {:as options})

Writes data to writer in CSV format. See also write-csv-rf.

Options:

  • :separator - Default ,
  • :quote - Default "
  • :quote? - A predicate function which determines if a string should be quoted. Defaults to quoting only when necessary. May also be the value 'true', in which case every field is quoted.
  • :newline - :lf (default) or :cr+lf
  • :close-writer? - defaults to false unless w is a string. When true, close writer when finished.
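
For example (a minimal sketch; out.csv is a hypothetical path):

(require '[charred.api :as charred])

;; With a string path, the writer is opened and closed for you
;; (:close-writer? defaults to true in that case).
(charred/write-csv "out.csv"
                   [["name" "city"] ["Ada" "London"]]
                   :newline :cr+lf)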

write-csv-rf

(write-csv-rf w)(write-csv-rf w options)

Returns a transduce-compatible rf that will write a csv. See options for write-csv.

This rf must be finalized (rf last-reduced-value) and will return the number of rows written in that case.

Example:

user> (transduce (map identity) (charred/write-csv-rf "test.csv") [[:a :b :c][1 2 3]])
2
user> (slurp "test.csv")
":a,:b,:c
1,2,3
"

write-json

(write-json output data & {:as argmap})

Write json to output. You can extend the writer to new datatypes by implementing the ->json-data function of the protocol PToJSON. This function need only return json-acceptable datastructures, which are numbers, booleans, nil, lists, arrays, and maps. The default type coercion will in general simply call .toString on the object.

Options:

  • :escape-unicode - If true (default) non-ASCII characters are escaped as \uXXXX.
  • :escape-js-separators - If true (default) the Unicode characters U+2028 and U+2029 will be escaped as \u2028 and \u2029 even if :escape-unicode is false. (These two characters are valid in pure JSON but are not valid in JavaScript strings.)
  • :escape-slash - If true (default) the slash / is escaped as \/.
  • :indent-str - When nil (default) json is printed raw with no indent or whitespace. For two spaces of indent per level of nesting, choose "  ".
  • :obj-fn - Function called on each non-primitive object - it is passed the JSONWriter and the object. The default iterates maps, lists, and arrays, converting anything that is not a json primitive or a map, list, or array to a json primitive via str. java.sql.Date classes get special treatment and are converted to instants, which are then converted to json primitive objects via the PToJSON protocol fn ->json-data, which defaults to toString. This is the most general override mechanism, where you will need to manually call the JSONWriter's methods. The simpler but slightly less general pathway is to override the protocol method ->json-data.

write-json-fn

(write-json-fn argmap)

Return a function of two arguments, (output,data), that efficiently constructs a json writer and writes the data. This is the most efficient pathway when writing a bunch of small json objects as it avoids the cost associated with unpacking the argument map. Same arguments as write-json.
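
As a minimal sketch (assuming the charred alias; result shown indicatively):

(require '[charred.api :as charred])

;; The argument map is unpacked once; the returned fn is cheap to
;; call per object.
(def write-json! (charred/write-json-fn {:escape-unicode false}))

(let [sw (java.io.StringWriter.)]
  (write-json! sw {"id" 1})
  (str sw))
;; => "{\"id\":1}"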

write-json-str

(write-json-str data & {:as args})

Write json to a string. See options for write-json.
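
For example (a minimal sketch; output is compact because :indent-str defaults to nil):

(require '[charred.api :as charred])

(charred/write-json-str {"a" [1 2 3]})
;; => "{\"a\":[1,2,3]}"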