tech.v3.dataset

Dataframe (map of columns) data processing system for ClojureScript. This API is a simplified version of the jvm version's API.

Datasets are maps of columns so assoc will add a new column and dissoc can remove a column. In addition they allow very fast subrect selection, filtering, sorting, concatenation and grouping (group-by). The columnwise analogues are always a lot faster than the general analogues so for instance sort-by-column is much faster than sort-by.
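
The map-like behaviour and the columnwise/row-wise distinction described above can be sketched as follows. This is a minimal illustration, not taken from the library's own examples: the `people` dataset and its column names are made up, and it assumes `assoc` accepts a plain vector whose length matches the row count.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data used only for this illustration.
(def people
  (ds/->dataset {:name ["ann" "bob" "cy"]
                 :age  [31 25 40]}))

;; Assumption: assoc accepts a plain vector with one value per row.
(def with-city (assoc people :city ["oslo" "rome" "lima"]))
(def without-age (dissoc with-city :age))

;; Columnwise sort: the ordering is computed from the :age column.
(def by-age (ds/sort-by-column people :age))

;; Row-wise sort: the key fn is passed each row as a map.
(def by-age-rowwise (ds/sort-by people (fn [row] (get row :age))))
```

Both sorts should produce the same row order; the columnwise form is the one the text above recommends for speed.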

Datasets serialize and deserialize to transit (or anything else) much faster than a sequence of maps and they take up less memory overall.

cljs.user> (require '[tech.v3.dataset :as ds])
nil

cljs.user> (-> (ds/->dataset {:a (range 100)
                              :b (take 100 (cycle ["hey" "you" "goonies"]))})
               (ds/head))
#dataset[unnamed [5 2]
| :a |      :b |
|---:|---------|
|  0 |     hey |
|  1 |     you |
|  2 | goonies |
|  3 |     hey |
|  4 |     you |]

->>dataset

(->>dataset options data)
(->>dataset data)

data-last analogue of ->dataset for use in ->> macros.
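
A small sketch of the data-last form at the end of a ->> pipeline. The input maps are invented for the example; the :parser-fn option is the one shown in the ->dataset examples.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical input: keep only the maps with an odd :a, then build a dataset.
(def odd-rows
  (->> [{:a 1 :b "x"} {:a 2 :b "y"} {:a 3 :b "z"}]
       (filter #(odd? (:a %)))
       (ds/->>dataset {:parser-fn {:a :int8}})))
```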

->dataset

(->dataset data options)
(->dataset data)
(->dataset)

Convert either a sequence of maps or a map of columns into a dataset. Options are similar to those of the jvm version of tech.v3.dataset, in particular :parser-fn.

Examples:

cljs.user> (->> (ds/->dataset {:a (range 100)
                               :b (take 100 (cycle ["hey" "you" "goonies"]))})
                (vals)
                (map (comp :datatype meta)))
(:float64 :string)

cljs.user> (->> (ds/->dataset {:a (range 100)
                               :b (take 100 (cycle ["hey" "you" "goonies"]))}
                              {:parser-fn {:a :int8}})
                (vals)
                (map (comp :datatype meta)))
(:int8 :string)

column

(column ds k)

Return the column at key k. Failing to find the column is an error.

column->data

(column->data col)

Transform a column into raw data safe for passing to transit or edn.

column-count

(column-count ds)

Integer column count of the dataset.

column-map

(column-map dataset result-colname map-fn res-dtype-or-opts filter-fn-or-ds)
(column-map dataset result-colname map-fn filter-fn-or-ds)
(column-map dataset result-colname map-fn)

Produce a new (or updated) column as the result of mapping a fn over columns.

  • dataset - dataset.
  • result-colname - Name of new (or existing) column.
  • map-fn - function to map over columns. Same rules as tech.v3.datatype/emap.
  • res-dtype-or-opts - If not given result is scanned to infer missing and datatype. If using an option map, options are described below.
  • filter-fn-or-ds - A dataset, a sequence of columns, or a tech.v3.dataset.column-filters column filter function. Defaults to all the columns of the existing dataset.

Returns a new dataset with a new or updated column.

Options:

  • :datatype - Set the datatype of the result column. If not given, the result is scanned to infer the result datatype and missing set.
  • :missing-fn - If given, columns are first passed to missing-fn as a sequence and this dictates the missing set. Otherwise the missing set is found by scanning the results during the inference process. See tech.v3.dataset.column/union-missing-sets and tech.v3.dataset.column/intersect-missing-sets for example functions to pass in here.

Examples:


  ;;From the tests --

  (let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
    ;;result scanned for both datatype and missing set
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
    ;;result scanned for missing set only.  Result used in-place.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %))
                               {:datatype :float64} [:b]))))
    ;;Nothing scanned at all.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(inc %)
                               {:datatype :float64
                                :missing-fn ds-col/union-missing-sets} [:b]))))
    ;;Missing set scanning causes NPE at inc.
    (is (thrown? Throwable
                 (ds/column-map testds :b2 #(inc %)
                                {:datatype :float64}
                                [:b]))))

  ;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |



user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}

column-names

(column-names ds)

Return the column names as a sequence.

columns

(columns ds)

Return the columns, in order, of the dataset.

concat

(concat ds & args)
(concat)

This is a copying concatenation so the result will be realized. Missing columns will be filled in with missing values.

data->column

(data->column {:keys [metadata missing data]})

Transform data produced via column->data into a column

data->dataset

(data->dataset ds-data)

Given data produced via dataset->data create a new dataset.

dataset->data

(dataset->data ds)

Convert a dataset into a pure data datastructure safe for transit or direct json serialization. Uses base64 encoding of numeric data.
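
A minimal round-trip sketch pairing dataset->data with data->dataset. The sample dataset is invented; the shape of the intermediate data is treated as opaque here.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset used only for the round trip.
(def original (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Plain data suitable for serialization.
(def as-data (ds/dataset->data original))

;; Rebuild a dataset from that data.
(def restored (ds/data->dataset as-data))
```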

dataset->transit-str

(dataset->transit-str ds & [format handlers])

Write a transit string adding in the dataset write handler

dataset-parser

(dataset-parser options)

Create a dataset parser that implements PDatasetParser, ICounted and IIndexed (nth). Here nth returns a row, and deref'ing the parser yields the dataset. The parser also implements reduce, which yields the current rows.

dataset?

(dataset? ds)

Return true if this is a dataset.

filter

(filter ds pred)

Filter the dataset. Pred gets passed each row as a map.

filter-column

(filter-column ds colname & [pred])

Filter the dataset by column colname. If pred isn't passed in, the column's values themselves are treated as truthy or falsy.
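
The two filtering styles side by side, as a hedged sketch. The `scores` dataset and the threshold are invented for illustration; pred is passed here as a plain function.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def scores
  (ds/->dataset {:name  ["ann" "bob" "cy" "di"]
                 :score [82 47 91 60]}))

;; Row-wise: pred receives each row as a map.
(def passed-rowwise
  (ds/filter scores (fn [row] (>= (get row :score) 60))))

;; Columnwise: pred receives each value of the :score column.
(def passed-colwise
  (ds/filter-column scores :score #(>= % 60)))
```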

filter-dataset

(filter-dataset dataset filter-fn-or-ds)

Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • If filter-fn-or-ds is a dataset, it is returned.
  • If filter-fn-or-ds is sequential, then select-columns is called.
  • If filter-fn-or-ds is :all, all columns are returned
  • If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.

group-by

(group-by ds f)

Group the dataset by the values returned from passing f over each row, represented as a map, of the dataset.

group-by-column

(group-by-column ds colname)

Group the dataset by column colname

head

(head ds n)
(head ds)

Return the first n rows of the dataset.

intersect-missing-sets

(intersect-missing-sets col-seq)

Intersect the missing sets of the columns

mapseq-parser

(mapseq-parser options)
(mapseq-parser)

Return a clojure function that when called with one arg that arg must be the next map to add to the dataset. When called with no args returns the current dataset. This can be used to efficiently transform a stream of maps into a dataset while getting intermediate datasets during the parse operation.

Options are the same as for ->dataset.

cljs.user> (def pfn (ds/mapseq-parser nil))
#'cljs.user/pfn
cljs.user> (pfn {:a 1 :b 2})
nil
cljs.user> (pfn {:a 1 :b 2})
nil
cljs.user> (pfn {:a 2 :c 3})
nil
cljs.user> (pfn)
#dataset[unnamed [3 3]
| :a |  :b |  :c |
|---:|----:|----:|
|  1 |   2 | NaN |
|  1 |   2 | NaN |
|  2 | NaN |   3 |]
cljs.user> (pfn {:a 3 :d 4})
nil
cljs.user> (pfn {:a 5 :c 6})
nil
cljs.user> (pfn)
#dataset[unnamed [5 4]
| :a |  :b |  :c |  :d |
|---:|----:|----:|----:|
|  1 |   2 | NaN | NaN |
|  1 |   2 | NaN | NaN |
|  2 | NaN |   3 | NaN |
|  3 | NaN | NaN |   4 |
|  5 | NaN |   6 | NaN |]

mapseq-parser-rf

(mapseq-parser-rf options)

Return a transduce-compatible sequence-of-maps parser. For example of use see definition of row-map.

merge-by-column

(merge-by-column lhs rhs colname)

Merge rows, assuming lhs and rhs have the same columns. Rows from the left are taken first, then any rows from the right that do not appear in the left are appended. This is far less general but much faster than a join operation; it is useful for merging timeseries data.

missing

(missing ds-or-col)

Return the missing set as a clojure set. The underlying protocol returns missing sets as js sets as those have superior performance when using numbers.

remove-columns

(remove-columns ds colnames)

Remove these columns from the dataset.

remove-missing

(remove-missing ds)
(remove-missing ds colname)

Remove missing rows from a dataset or column
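
A short sketch of inspecting and removing missing rows. The input mirrors the sparse sequence of maps used in the column-map examples; it assumes the one-argument form of remove-missing drops every row with a missing value in any column.

```clojure
(require '[tech.v3.dataset :as ds])

;; Row 0 lacks :b and row 1 lacks :a.
(def sparse (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))

;; Missing row indexes for a single column, as a clojure set.
(def a-missing (ds/missing (ds/column sparse :a)))

;; Assumption: with no colname, rows missing a value in any column are dropped.
(def complete (ds/remove-missing sparse))

;; Only rows missing a value in :a are dropped.
(def a-present (ds/remove-missing sparse :a))
```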

remove-rows

(remove-rows ds rowidxs)

Remove these row indexes out of the dataset.

rename-columns

(rename-columns ds rename-map)

Given a map of old-name->new-name, rename some subset of columns without changing their column order.
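
For instance, renaming one column of a small, made-up dataset:

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical two-column dataset; only :a is renamed.
(def renamed
  (ds/rename-columns (ds/->dataset {:a [1 2] :b [3 4]})
                     {:a :x}))
```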

replace-missing

(replace-missing ds colnames & [replace-cmd])

Replace missing values in dataset.

  • colnames one or more columns to run replace cmd
  • replace-cmd - one of :first :last :lerp [:value val] ifn

If replace-cmd is an ifn it will be given the column datatype, the first and last values bounding the missing span, and the number of missing elements. Either the first or last may be nil if the missing span is at the beginning or end. In the case where all values are missing both arguments may be nil.

reverse-rows

(reverse-rows ds)

Reverse the order of the rows of a dataset or a column

row-at

(row-at ds idx)

Get row as a map at index idx. Negative indexes index from the end.
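
A brief sketch of positive and negative indexing with row-at, using an invented dataset:

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def scores
  (ds/->dataset {:name  ["ann" "bob" "cy" "di"]
                 :score [82 47 91 60]}))

(def first-row (ds/row-at scores 0))   ;; row for "ann"
(def last-row  (ds/row-at scores -1))  ;; row for "di"
```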

row-count

(row-count ds-or-col)

Integer row count of the dataset.

row-map

(row-map ds map-fn & [options])

Map a function across the rows of the dataset producing a new dataset that is merged back into the original potentially replacing existing columns. Options are passed into the ->dataset function so you can control the resulting column types by the usual dataset parsing options described there.

Examples:

cljs.user> (def stocks (ds/transit-file->dataset "test/data/stocks.transit-json"))
#'cljs.user/stocks
cljs.user> (ds/head stocks)
#dataset[https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 3]
| :symbol |      :date | :price |
|---------|------------|-------:|
|    MSFT | 2000-01-01 |  39.81 |
|    MSFT | 2000-02-01 |  36.35 |
|    MSFT | 2000-03-01 |  43.22 |
|    MSFT | 2000-04-01 |  28.37 |
|    MSFT | 2000-05-01 |  25.45 |]
cljs.user> (ds/head (ds/row-map stocks (fn [row]
                                    {:symbol (keyword (row :symbol))
                                     :price2 (* (row :price)(row :price))})))
#dataset[https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 4]
| :symbol |      :date | :price |       :price2 |
|---------|------------|-------:|--------------:|
|   :MSFT | 2000-01-01 |  39.81 | 1584.83610000 |
|   :MSFT | 2000-02-01 |  36.35 | 1321.32250000 |
|   :MSFT | 2000-03-01 |  43.22 | 1867.96840000 |
|   :MSFT | 2000-04-01 |  28.37 |  804.85690000 |
|   :MSFT | 2000-05-01 |  25.45 |  647.70250000 |]

rows

(rows ds)

Get a sequence of maps from a dataset

rowvec-at

(rowvec-at ds idx)

Get row as a vec of values at index idx. Negative indexes index from the end.

rowvecs

(rowvecs ds)

Get a sequence of persistent vectors from a dataset

select

(select ds cols rows)

Select a subrect of the dataset.

select-columns

(select-columns ds colnames)

Select these columns in this order. This can be used both to select specific columns and to set the order of columns. Columns not found are errors.

select-missing

(select-missing ds)

Select the missing rows from a dataset or a column

select-rows

(select-rows ds rowidxs)

Select these row indexes out of the dataset.

soft-select-columns

(soft-select-columns ds colnames)

Select these columns in this order. Columns not found are quietly ignored. To get errors for missing columns see select-columns.
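
The difference between the strict and the soft selection can be sketched like this. The dataset and the absent column name :not-there are made up for the example.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical three-column dataset.
(def wide (ds/->dataset {:a [1 2] :b [3 4] :c [5 6]}))

;; Strict: every requested column must exist; the result follows the requested order.
(def reordered (ds/select-columns wide [:c :a]))

;; Soft: :not-there does not exist and is quietly skipped.
(def tolerant (ds/soft-select-columns wide [:c :not-there]))
```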

sort-by

(sort-by ds keyfn & [comp options])

Sort dataset by keyfn. Keyfn is passed each row as a map.

sort-by-column

(sort-by-column ds colname & [sort-op options])

Sort the dataset by column colname. For sort options and the interaction between sort-op and the options see tech.v3.datatype.argops/argsort.

  • sort-op - a boolean binary predicate comparison operation such as < or >.

Options:

  • :nan-strategy - where to place missing values in numeric columns. One of :first, :last, :exception. Defaults to :last.
  • :comparator - pass in a custom comparator - a function returning -1,0, or 1. If no sort-op is passed in this defaults to compare.
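
A minimal sketch of passing a sort-op, here > for a descending sort. The dataset is invented; the options map is left out.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data.
(def scores
  (ds/->dataset {:name  ["ann" "bob" "cy" "di"]
                 :score [82 47 91 60]}))

;; Default ordering (ascending).
(def ascending (ds/sort-by-column scores :score))

;; Descending, using > as the binary predicate.
(def descending (ds/sort-by-column scores :score >))
```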

tail

(tail ds n)
(tail ds)

Return the last n rows of the dataset.

transit-read-handler-map

(transit-read-handler-map)

Return a map mapping the dataset tag to a transit read handler.

transit-str->dataset

(transit-str->dataset json-data & [format handlers])

Parse a transit string adding in the dataset read handler
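
A hedged round-trip sketch combining dataset->transit-str with transit-str->dataset, relying on the default format and handlers. The dataset is invented for the example.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset to serialize.
(def original (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Write to a transit string, then read it back.
(def transit-str (ds/dataset->transit-str original))
(def restored (ds/transit-str->dataset transit-str))
```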

transit-write-handler-map

(transit-write-handler-map)

Return a map mapping the dataset type to a transit writer handler.

union-missing-sets

(union-missing-sets col-seq)

Union the missing sets of the columns

unique-by

(unique-by ds f)

Keep only the first row for each distinct value of f.

unique-by-column

(unique-by-column ds colname)

Keep only the first row for each distinct value in column colname.

update

(update lhs-ds filter-fn-or-ds update-fn & args)

Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into original dataset.

This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • filter-fn-or-ds is a generalized parameter. May be a function, a dataset or a sequence of column names.
  • update-fn must take the dataset as the first argument and must return a dataset.

Example:

(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))