tech.v3.dataset
Dataframe (map of columns) data processing system for ClojureScript. This API is a simplified version of the JVM version's API.
Datasets are maps of columns, so assoc will add a new column and dissoc can remove a column. In addition they allow very fast subrect selection, filtering, sorting, concatenation and grouping (group-by). The columnwise analogues are always a lot faster than the general analogues, so for instance sort-by-column is much faster than sort-by.
Datasets serialize and deserialize to transit (or anything else) much faster than a sequence of maps and they take up less memory overall.
cljs.user> (require '[tech.v3.dataset :as ds])
nil
cljs.user> (-> (ds/->dataset {:a (range 100)
                              :b (take 100 (cycle ["hey" "you" "goonies"]))})
               (ds/head))
#dataset[unnamed [5 2]
| :a | :b |
|---:|---------|
| 0 | hey |
| 1 | you |
| 2 | goonies |
| 3 | hey |
| 4 | you |]
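Because a dataset behaves like a map of columns, the core map functions apply to it directly. The following is a minimal sketch, assuming the data passed to assoc has one value per row; the names small, with-c and without-b are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds])

(def small (ds/->dataset {:a (range 5)
                          :b ["hey" "you" "goonies" "hey" "you"]}))

;; assoc adds a column under the given name.
(def with-c (assoc small :c [0 2 4 6 8]))

;; dissoc removes a column by name.
(def without-b (dissoc with-c :b))

;; keys lists the column names that remain.
(keys without-b)
```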
->>dataset
(->>dataset options data)
(->>dataset data)
Data-last analogue of ->dataset for use in ->> macros.
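As a small sketch of the data-last form, the options map comes first and the data is threaded in as the final argument. The name typed is illustrative, and the :parser-fn option is the one shown for ->dataset.

```clojure
(require '[tech.v3.dataset :as ds])

;; The data flows in as the last argument, so the options map comes first.
(def typed
  (->> {:a (range 10)
        :b (take 10 (cycle ["hey" "you" "goonies"]))}
       (ds/->>dataset {:parser-fn {:a :int8}})))

;; Column datatypes are stored in the column metadata.
(map (comp :datatype meta) (vals typed))
```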
->dataset
(->dataset data options)
(->dataset data)
(->dataset)
Convert either a sequence of maps or a map of columns into a dataset. Options are similar to the JVM version of tech.v3.dataset in terms of parser-fn.
Examples:
cljs.user> (->> (ds/->dataset {:a (range 100)
                               :b (take 100 (cycle ["hey" "you" "goonies"]))})
                (vals)
                (map (comp :datatype meta)))
(:float64 :string)
cljs.user> (->> (ds/->dataset {:a (range 100)
                               :b (take 100 (cycle ["hey" "you" "goonies"]))}
                              {:parser-fn {:a :int8}})
                (vals)
                (map (comp :datatype meta)))
(:int8 :string)
column
(column ds k)
Return the column at position k. Failing to find the column is an error.
column->data
(column->data col)
Transform a column into raw data safe for passing to transit or edn.
column-map
(column-map dataset result-colname map-fn res-dtype-or-opts filter-fn-or-ds)
(column-map dataset result-colname map-fn filter-fn-or-ds)
(column-map dataset result-colname map-fn)
Produce a new (or updated) column as the result of mapping a fn over columns.
- dataset - dataset.
- result-colname - Name of new (or existing) column.
- map-fn - function to map over columns. Same rules as tech.v3.datatype/emap.
- res-dtype-or-opts - If not given, the result is scanned to infer missing and datatype. If using an option map, options are described below.
- filter-fn-or-ds - A dataset, a sequence of columns, or a tech.v3.dataset.column-filters column filter function. Defaults to all the columns of the existing dataset.
Returns a new dataset with a new or updated column.
Options:
- :datatype - Set the datatype of the result column. If not given, the result is scanned to infer the result datatype and missing set.
- :missing-fn - If given, columns are first passed to missing-fn as a sequence and this dictates the missing set. Otherwise the missing set is found by scanning the results during the inference process. See tech.v3.dataset.column/union-missing-sets and tech.v3.dataset.column/intersect-missing-sets for example functions to pass in here.
Examples:
;;From the tests --
(let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
  ;;result scanned for both datatype and missing set
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
  ;;result scanned for missing set only. Result used in-place.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %))
                             {:datatype :float64} [:b]))))
  ;;Nothing scanned at all.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(inc %)
                             {:datatype :float64
                              :missing-fn ds-col/union-missing-sets} [:b]))))
  ;;Missing set scanning causes NPE at inc.
  (is (thrown? Throwable
               (ds/column-map testds :b2 #(inc %)
                              {:datatype :float64}
                              [:b]))))
;;Ad-hoc repl --
user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------|------------|-------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:
| symbol | date | price | price^2 |
|--------|------------|-------|-----------|
| MSFT | 2000-01-01 | 39.81 | 1584.8361 |
| MSFT | 2000-02-01 | 36.35 | 1321.3225 |
| MSFT | 2000-03-01 | 43.22 | 1867.9684 |
| MSFT | 2000-04-01 | 28.37 | 804.8569 |
| MSFT | 2000-05-01 | 25.45 | 647.7025 |
user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:
| :b | :a |
|----:|---:|
| | 1 |
| 2.0 | |
| 3.0 | 2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:
| :b | :a | :c |
|----:|---:|----:|
| | 1 | |
| 2.0 | | |
| 3.0 | 2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}
concat
(concat ds & args)
(concat)
This is a copying concatenation so the result will be realized. Missing columns will be filled in with missing values.
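A short sketch of the fill-in behavior described above: the second dataset has no :b column, so its rows should be reported as missing for :b in the result. The names lhs, rhs and both are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds])

(def lhs (ds/->dataset {:a [1 2] :b ["x" "y"]}))
(def rhs (ds/->dataset {:a [3 4]}))

;; Copying concatenation; rhs contributes two rows that have no :b value.
(def both (ds/concat lhs rhs))

;; Row indexes 2 and 3 come from rhs, so they should be missing for :b.
(ds/missing (:b both))
```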
data->column
(data->column {:keys [metadata missing data]})
Transform data produced via column->data into a column
data->dataset
(data->dataset ds-data)
Given data produced via dataset->data create a new dataset.
dataset->data
(dataset->data ds)
Convert a dataset into a pure data datastructure safe for transit or direct json serialization. Uses base64 encoding of numeric data.
dataset->transit-str
(dataset->transit-str ds & [format handlers])
Write a transit string adding in the dataset write handler
dataset-parser
(dataset-parser options)
Create a dataset parser that implements PDatasetParser, ICounted and IIndexed (nth). (nth) in this case returns a row. deref'ing the parser yields the dataset. The parser also implements reduce, which will yield the current rows.
filter-column
(filter-column ds colname & [pred])
Filter the dataset by column colname. If pred isn't passed in the column's values are treated as truthy.
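As an illustration, the sketch below keeps only the rows whose :a value passes a predicate. The dataset and the names nums and big are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds])

(def nums (ds/->dataset {:a (range 10)
                         :b (take 10 (cycle ["hey" "you" "goonies"]))}))

;; Keep rows where the value in column :a is greater than 6.
(def big (ds/filter-column nums :a #(> % 6)))

;; Inspect the first and last surviving rows.
[(ds/row-at big 0) (ds/row-at big -1)]
```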
filter-dataset
(filter-dataset dataset filter-fn-or-ds)
Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.
- If filter-fn-or-ds is a dataset, it is returned.
- If filter-fn-or-ds is sequential, then select-columns is called.
- If filter-fn-or-ds is :all, all columns are returned.
- If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.
group-by
(group-by ds f)
Group the dataset by the values returned from passing f over each row, represented as a map, of the dataset.
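A minimal sketch of grouping on a row-level function. It assumes the result is a map from each distinct return value of f to a dataset holding the matching rows; that return shape is an assumption, not something stated above.

```clojure
(require '[tech.v3.dataset :as ds])

(def words (ds/->dataset {:a (range 9)
                          :b (take 9 (cycle ["hey" "you" "goonies"]))}))

;; f receives each row as a map; here rows are grouped by their :b value.
(def groups (ds/group-by words (fn [row] (:b row))))

;; Assumed: one entry per distinct :b value.
(keys groups)
```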
intersect-missing-sets
(intersect-missing-sets col-seq)
Intersect the missing sets of the columns
mapseq-parser
(mapseq-parser options)
(mapseq-parser)
Return a Clojure function that, when called with one argument, treats that argument as the next map to add to the dataset. When called with no arguments it returns the current dataset. This can be used to efficiently transform a stream of maps into a dataset while getting intermediate datasets during the parse operation.
Options are the same as for ->dataset.
cljs.user> (def pfn (ds/mapseq-parser nil))
#'cljs.user/pfn
cljs.user> (pfn {:a 1 :b 2})
nil
cljs.user> (pfn {:a 1 :b 2})
nil
cljs.user> (pfn {:a 2 :c 3})
nil
cljs.user> (pfn)
#dataset[unnamed [3 3]
| :a | :b | :c |
|---:|----:|----:|
| 1 | 2 | NaN |
| 1 | 2 | NaN |
| 2 | NaN | 3 |]
cljs.user> (pfn {:a 3 :d 4})
nil
cljs.user> (pfn {:a 5 :c 6})
nil
cljs.user> (pfn)
#dataset[unnamed [5 4]
| :a | :b | :c | :d |
|---:|----:|----:|----:|
| 1 | 2 | NaN | NaN |
| 1 | 2 | NaN | NaN |
| 2 | NaN | 3 | NaN |
| 3 | NaN | NaN | 4 |
| 5 | NaN | 6 | NaN |]
mapseq-parser-rf
(mapseq-parser-rf options)
Return a transduce-compatible sequence-of-maps parser. For an example of use, see the definition of row-map.
merge-by-column
(merge-by-column lhs rhs colname)
Merge rows assuming left and right have the same columns. Rows from left are taken first, then any rows from right that do not appear in left are appended. This is far less general but much faster than a join operation; it is useful for merging timeseries data.
missing
(missing ds-or-col)
Return the missing set as a clojure set. The underlying protocol returns missing sets as js sets as those have superior performance when using numbers.
remove-missing
(remove-missing ds)
(remove-missing ds colname)
Remove missing rows from a dataset or column
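For example, the sketch below builds a dataset with one incomplete row and drops it. The names gaps and complete are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds])

;; The second row has no :b value, so :b is missing at row index 1.
(def gaps (ds/->dataset [{:a 1 :b 1.0} {:a 2} {:a 3 :b 3.0}]))

(ds/missing gaps)

;; Drop every row that has a missing value in any column.
(def complete (ds/remove-missing gaps))

(ds/row-at complete 1)
```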
rename-columns
(rename-columns ds rename-map)
Given a map of old-name->new-name, rename some subset of columns without changing their column order.
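A brief sketch of renaming one column while leaving the others in place. The column names are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds])

(def original (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Only :a is renamed; :b keeps its name and its position.
(def renamed (ds/rename-columns original {:a :x}))

(keys renamed)
```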
replace-missing
(replace-missing ds colnames & [replace-cmd])
Replace missing values in dataset.
- colnames - one or more columns on which to run replace-cmd.
- replace-cmd - one of :first, :last, :lerp, [:value val], or an ifn.
If replace-cmd is an ifn it will be given the column datatype, the first and last arguments in the missing span, and the number of missing elements. Either the first or last may be nil if the missing span is at the beginning or end. In the case where all values are missing both arguments may be nil.
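A small sketch of the [:value val] form, assuming colnames may be given as a vector of column names. The names sparse and filled are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds])

;; :b is missing in the middle row.
(def sparse (ds/->dataset [{:a 1 :b 1.0} {:a 2} {:a 3 :b 3.0}]))

;; Replace every missing :b value with the constant 0.0.
(def filled (ds/replace-missing sparse [:b] [:value 0.0]))

;; After replacement the :b column should report no missing indexes.
(ds/missing (:b filled))
```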
row-at
(row-at ds idx)
Get row as a map at index idx. Negative indexes index from the end.
row-map
(row-map ds map-fn & [options])
Map a function across the rows of the dataset producing a new dataset that is merged back into the original potentially replacing existing columns. Options are passed into the ->dataset function so you can control the resulting column types by the usual dataset parsing options described there.
Examples:
cljs.user> (def stocks (ds/transit-file->dataset "test/data/stocks.transit-json"))
#'cljs.user/stocks
cljs.user> (ds/head stocks)
#dataset[https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 3]
| :symbol | :date | :price |
|---------|------------|-------:|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |]
cljs.user> (ds/head (ds/row-map stocks (fn [row]
                                         {:symbol (keyword (row :symbol))
                                          :price2 (* (row :price) (row :price))})))
#dataset[https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv [5 4]
| :symbol | :date | :price | :price2 |
|---------|------------|-------:|--------------:|
| :MSFT | 2000-01-01 | 39.81 | 1584.83610000 |
| :MSFT | 2000-02-01 | 36.35 | 1321.32250000 |
| :MSFT | 2000-03-01 | 43.22 | 1867.96840000 |
| :MSFT | 2000-04-01 | 28.37 | 804.85690000 |
| :MSFT | 2000-05-01 | 25.45 | 647.70250000 |]
rowvec-at
(rowvec-at ds idx)
Get row as a vec of values at index idx. Negative indexes index from the end.
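To illustrate the two row accessors side by side, the sketch below reads the first and last rows of a small hypothetical dataset, once as a map via row-at and once as a vec of values via rowvec-at.

```clojure
(require '[tech.v3.dataset :as ds])

(def letters (ds/->dataset {:a [10 20 30] :b ["x" "y" "z"]}))

;; Row as a map from column name to value; -1 addresses the last row.
(ds/row-at letters -1)

;; Row as a vec of values in column order.
(ds/rowvec-at letters 0)
```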
select-columns
(select-columns ds colnames)
Select these columns in this order. This can be used both to select specific columns and to set the order of columns. Columns not found are errors.
soft-select-columns
(soft-select-columns ds colnames)
Select these columns in this order. Columns not found are quietly ignored. To get errors for missing columns see select-columns.
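The difference between the strict and the soft variant can be sketched as follows. The column :not-there is deliberately absent, and all names are illustrative.

```clojure
(require '[tech.v3.dataset :as ds])

(def wide (ds/->dataset {:a [1 2] :b ["x" "y"] :c [true false]}))

;; Strict selection also sets the column order.
(def reordered (ds/select-columns wide [:c :a]))

;; Soft selection skips names that do not exist instead of throwing.
(def soft (ds/soft-select-columns wide [:a :not-there]))

[(keys reordered) (keys soft)]
```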
sort-by
(sort-by ds keyfn & [comp options])
Sort dataset by keyfn. Keyfn is passed each row as a map.
sort-by-column
(sort-by-column ds colname & [sort-op options])
Sort the dataset by column colname. For sort options and the interaction between sort-op and the options see tech.v3.datatype.argops/argsort.
- sort-op - a boolean binary predicate comparison operation such as < or >.
Options:
- :nan-strategy - defaults to :last - for numeric columns where to place missing values. Options are :first, :last, :exception.
- :comparator - pass in a custom comparator - a function returning -1, 0, or 1. If no sort-op is passed in this defaults to compare.
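A minimal sketch of ascending and descending column sorts on a hypothetical three-row dataset. With no sort-op the default comparison is used; passing > reverses the order.

```clojure
(require '[tech.v3.dataset :as ds])

(def unsorted (ds/->dataset {:a [3 1 2] :b ["c" "a" "b"]}))

;; Default ordering on column :a.
(def ascending (ds/sort-by-column unsorted :a))

;; A boolean predicate such as > gives a descending sort.
(def descending (ds/sort-by-column unsorted :a >))

[(ds/row-at ascending 0) (ds/row-at descending 0)]
```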
transit-read-handler-map
(transit-read-handler-map)
Return a map mapping the dataset tag to a transit read handler.
transit-str->dataset
(transit-str->dataset json-data & [format handlers])
Parse a transit string adding in the dataset read handler
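A round trip through transit can be sketched as below, assuming the default format is acceptable when format and handlers are omitted. The names payload and restored are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds])

(def source (ds/->dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Serialize to a transit string, then parse it back into a dataset.
(def payload (ds/dataset->transit-str source))
(def restored (ds/transit-str->dataset payload))

;; The restored dataset should carry the same columns and row values.
[(keys restored) (ds/row-at restored 0)]
```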
transit-write-handler-map
(transit-write-handler-map)
Return a map mapping the dataset type to a transit writer handler.
update
(update lhs-ds filter-fn-or-ds update-fn & args)
Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into the original dataset.
This pathway is designed to work with the tech.v3.dataset.column-filters namespace.
- filter-fn-or-ds is a generalized parameter. May be a function, a dataset or a sequence of column names.
- update-fn must take the dataset as the first argument and must return a dataset.
(ds/bind-> (ds/->dataset dataset) ds
  (ds/remove-column "Id")
  (ds/update cf/string ds/replace-missing-value "NA")
  (ds/update-elemwise cf/string #(get {"" "NA"} % %))
  (ds/update cf/numeric ds/replace-missing-value 0)
  (ds/update cf/boolean ds/replace-missing-value false)
  (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                        #(dtype/elemwise-cast % :float64)))