This namespace has been completely superseded by Charred, will receive no further updates, and is subject to removal at any time in the future.
Efficient ways to read files via the java.io.Reader interface. You can read a file into an iterator of (fixed rotating) character buffers, create a new and much faster reader-like interface from the character buffers, and parse a csv/tsv type file with an interface that is mostly compatible with, but far faster than, clojure.data.csv.
Files are by default read by a separate thread into character arrays and those arrays are then processed. For details around the threading system see tech.v3.parallel.queue-iter.
CSV parsing is broken up into two parts. The first is reading a file and creating an iterator of char buffers. The second is parsing an iterator of char buffers. As mentioned earlier, the file->char buffer step is further moved onto an offline thread.
This design is meant to be easy to experiment with. It could be that an mmap-based pathway allows us to read the file faster, it could be that parsing into blocks of rows as opposed to chars in an offline thread is faster, etc.
Overall, parsing many CSVs in parallel will be a better strategy than using many threads to make parsing a single CSV incrementally faster, so the current design keeps the memory requirements for a single CSV quite low while still gaining the majority of meaningful performance benefits.
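The two-stage design described above can be illustrated with a minimal sketch. This is not the library's implementation (which uses tech.v3.parallel.queue-iter and a fixed set of rotating buffers). The names `char-buffer-queue` and `count-chars` are hypothetical, and the sketch allocates a fresh char array per read for simplicity.

```clojure
(import '[java.io Reader StringReader]
        '[java.util Arrays]
        '[java.util.concurrent ArrayBlockingQueue])

(defn char-buffer-queue
  "Hypothetical helper: starts a daemon thread that reads `rdr` into char
  arrays of at most `bufsize` chars and puts them on a bounded queue.
  The keyword ::done marks the end of input."
  [^Reader rdr bufsize queue-depth]
  (let [queue (ArrayBlockingQueue. (int queue-depth))
        bufsize (int bufsize)
        reader-fn (fn []
                    (try
                      (loop []
                        (let [buf (char-array bufsize)
                              n (.read rdr buf 0 bufsize)]
                          ;; .read returns -1 at end of input
                          (when (pos? n)
                            ;; trim a partially filled buffer to its real length
                            (.put queue (if (== n bufsize)
                                          buf
                                          (Arrays/copyOf ^chars buf (int n))))
                            (recur))))
                      (finally
                        (.close rdr)
                        (.put queue ::done))))]
    (doto (Thread. ^Runnable reader-fn)
      (.setDaemon true)
      (.start))
    queue))

(defn count-chars
  "Hypothetical consumer: stands in for the parse stage by summing buffer lengths."
  [^ArrayBlockingQueue queue]
  (loop [total 0]
    (let [item (.take queue)]
      (if (identical? item ::done)
        total
        (recur (+ total (alength ^chars item)))))))

(count-chars (char-buffer-queue (StringReader. "a,b\n1,2\n") 4 2))
;; => 8
```

The bounded queue is what keeps memory use low: the reading thread blocks once `queue-depth` buffers are waiting to be consumed.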
Supporting Java classes:
- CharBuffer.java - StringBuilder-like class that implements whitespace trimming, clear, and nil empty strings.
- CharReader.java - A java.io.Reader-like class that only correctly implements single-character unread but allows access to the underlying buffer.
- CSVReader.java - All the tight loops necessary to parse a CSV file quickly.
- JSONReader.java - All the tight loops necessary to parse a JSON file. Fairly customizable.
(parse-json-fn & [options])
Return a function from input->json. Reuses the parse context and thus when parsing many small JSON inputs where you intend to get one and only one JSON object from them this pathway is a bit more efficient than read-json.
Same options as read-json-fn.
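As a rough usage sketch, the returned function can be created once and applied to many small inputs. The alias `char-input` and the name `parse-json` are illustrative choices, and the use of `:key-fn keyword` assumes the default profile, which the read-json-fn options describe as supporting key-fn.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])

;; Build the parse function once so the parse context is reused.
(def parse-json (char-input/parse-json-fn {:key-fn keyword}))

;; Each input is expected to hold exactly one JSON object.
(def parsed (mapv parse-json ["{\"a\": 1}" "{\"a\": 2}"]))
```

Because the parse context is reused between calls, it seems safest to keep each returned function on a single thread. That is an assumption; the documentation does not state the threading contract.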
(read-csv input & [options])
Read a csv into a row iterator. Parse algorithm the same as clojure.data.csv although
this returns an iterator and each row is an ArrayList as opposed to a persistent
vector. To convert a java.util.List into something with the same equal and hash semantics
as a persistent vector, one option is vec. To
convert an iterator to a sequence use iterator-seq.
The iterator returned derives from AutoCloseable and it will terminate the iteration and close the underlying iterator (and join the async thread) if (.close iter) is called.
For a drop-in but much faster replacement to clojure.data.csv use read-csv-compat.
:async? - Defaults to true - read the file into buffers in an offline thread. This speeds up reading larger files (1MB+) by about 30%.
:separator - Field separator - defaults to ,.
:quote - Quote specifier - defaults to \" (the double-quote character).
:close-reader? - Close the reader when iteration is finished - defaults to true.
:column-whitelist - Sequence of allowed column names.
:column-blacklist - Sequence of disallowed column names. When conflicts with
:trim-leading-whitespace? - When true, leading spaces and tabs are ignored. Defaults to true.
:trim-trailing-whitespace? - When true, trailing spaces and tabs are ignored. Defaults to true.
:nil-empty-values? - When true, empty strings are elided entirely and returned as nil values. Defaults to true.
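A minimal sketch of read-csv in use follows. It assumes a java.io.Reader is an acceptable `input` (a StringReader is used so the example needs no file) and that the options are passed as a single map, as the `& [options]` signature suggests. The helper name `csv-rows` is hypothetical.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])
(import '[java.io StringReader])

(defn csv-rows
  "Hypothetical helper: parse CSV text into a vector of persistent vectors."
  [^String csv-text]
  ;; The returned iterator is AutoCloseable, so with-open closes it
  ;; (and the underlying reader) when the body exits.
  (with-open [rows (char-input/read-csv (StringReader. csv-text)
                                        {:async? false})]
    ;; mapv is eager: every row is converted before the iterator is closed.
    (mapv vec (iterator-seq rows))))

(csv-rows "a,b\n1,2")
```

Realizing the rows eagerly inside `with-open` matters here. A lazy sequence that escapes the `with-open` body would be consumed after the iterator has been closed.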
(read-csv-compat input & options)
Read a csv returning a clojure.data.csv-compatible sequence. For options see read-csv.
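The following sketch shows read-csv-compat as a stand-in for clojure.data.csv/read-csv. Passing options as alternating keyword/value arguments is an assumption based on the `& options` signature and on the calling convention of clojure.data.csv; the names `rows` and `semicolon-rows` are illustrative.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])
(import '[java.io StringReader])

;; No options: comma-separated input.
(def rows
  (vec (char-input/read-csv-compat (StringReader. "a,b\n1,2"))))

;; Assumed calling convention: keyword/value pairs, as in clojure.data.csv.
(def semicolon-rows
  (vec (char-input/read-csv-compat (StringReader. "a;b\n1;2")
                                   :separator \;)))
```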
(read-json input & args)
Drop-in replacement for clojure.data.json/read and clojure.data.json/read-str. For options see read-json-fn.
(read-json-fn input & [options])
Read one or more JSON objects.
Returns an auto-closeable function that, when called, by default throws an exception
if the read pathway is finished. Input may be a character array or string (most efficient)
or something convertible to a reader. Options for conversion to reader are described in
reader->char-reader, although for the json case we default :async? to false as
most json is just too small to benefit from async reading of the input.
:bigdec - When true use bigdecimals for floating point numbers. Defaults to false.
:double-fn - If :bigdec isn't provided, use this function to parse double values.
:profile - Which performance profile to use. This simply provides defaults to :obj-iface. The default :immutable value produces persistent datastructures and supports value-fn and key-fn. :mutable produces object arrays and java.util.HashMaps - this is about 30% faster. :raw produces ArrayLists for arrays and a JSONReader$JSONObj type with a public data member that is an ArrayList for objects.
:key-fn - Function called on each string map key.
:value-fn - Function called on each map value. Function is passed the key and val so it takes 2 arguments. If this function returns :tech.v3.datatype.char-input/elided then the key-val pair will be elided from the result.
:eof-error? - Defaults to true - when eof is encountered when attempting to read an object throw an EOF error. Else returns a special EOF value.
:eof-value - EOF value. Defaults to
:eof-fn - Function called if readObject is going to return EOF. Defaults to throwing an EOFException.
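Two of the options above can be sketched as follows. The first form drops null-valued entries via :value-fn and the elided keyword; it assumes read-json takes keyword/value pairs like clojure.data.json/read-str. The second reads several objects from one string with an explicit :eof-value; it assumes options are passed as a map, per the `& [options]` signature. The names `without-nulls`, `all-objects`, and `::eof` are illustrative.

```clojure
(require '[tech.v3.datatype.char-input :as char-input])

;; Drop entries whose value is null by returning the elided marker.
(def without-nulls
  (char-input/read-json
   "{\"a\": 1, \"b\": null}"
   :key-fn keyword
   :value-fn (fn [_k v]
               (if (nil? v)
                 :tech.v3.datatype.char-input/elided
                 v))))

;; Read every object in the input, stopping at a caller-chosen EOF value
;; instead of an exception.
(def all-objects
  (with-open [read-next (char-input/read-json-fn
                         "{\"a\": 1} {\"a\": 2}"
                         {:eof-error? false
                          :eof-value ::eof})]
    (loop [acc []]
      (let [v (read-next)]
        (if (= v ::eof)
          acc
          (recur (conj acc v)))))))
```

Choosing a namespaced keyword for the EOF value avoids confusing end of input with a legitimate JSON value such as null.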
(reader->char-buf-fn rdr & [options])
Given a reader, return a clojure fn that when called reads the next buffer of the reader.
This function iterates through a fixed number of buffers under the covers so you need to
be cognizant of the number of actual buffers that you want to have present in memory.
This fn also implements AutoCloseable and closing it will close the underlying reader.
:n-buffers - Number of buffers to use. Defaults to 8 - if this number is too small then buffers in flight will get overwritten.
:bufsize - Size of each buffer - defaults to 2048. Small improvements are sometimes seen with larger or smaller buffers.
:async? - Defaults to false. When true data is read in an async thread.
:close-reader? - When true, close input reader when finished. Defaults to true.
(reader->char-reader rdr options)
Given a reader, return a CharReader which presents some of the same interface as a java.io.PushbackReader but is only capable of pushing back 1 character.
Options are passed through mainly unchanged to queue-iter and to reader->char-buf-iter.
:async? - Defaults to true - reads the reader in an offline thread into character buffers.