Published at: 2015-05-08 06:13
Tags: clojure
I'm happy to announce the release of semantic-csv, a humble little CSV library for higher-level CSV parsing features.
Existing CSV parsing libraries for Clojure (clojure/data.csv and clojure-csv) only concern themselves with the grammar of CSV. Given a string or file with CSV data, they produce a collection of row vectors, a minimal data representation of the structure of CSV.
However, CSV data typically means something. Sometimes the items of the row vectors refer to numbers or dates. Sometimes the first row is a header. Sometimes a row has been "commented" out. These are all things which are concerned not so much with the grammar of CSV data as with its semantics.
In the spirit of composability, semantic-csv lives solely at the level of semantic processing. By not concerning itself with the details of grammar, it frees itself to be interoperable with any underlying grammar parsing approach (that is, it's completely compatible with both clojure/data.csv and clojure-csv).
Semantic-csv emphasizes a set of composable collection processing functions. Each of these functions takes the collection of data rows as the final argument, facilitating interoperability with the standard collection processing functions (map, filter, reduce, etc.) within the context of the ->> threading macro. However, as a convenience, several functions are provided which wrap these individual processing functions with some opinionated but configurable defaults (see process, slurp-csv and spit-csv).
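To make the "rows as the final argument" convention concrete, here is a minimal sketch that uses only clojure.core. The drop-short-rows helper is hypothetical and written purely for illustration; it is not part of semantic-csv.

```clojure
;; Hypothetical helper (not part of semantic-csv): options come first and
;; the collection of rows comes last, so it slots into a ->> pipeline.
(defn drop-short-rows
  [min-width rows]
  (filter #(>= (count %) min-width) rows))

(def rows
  [["this" "that" "more"]
   ["1" "2" "stuff"]
   ["oops"]
   ["2" "3" "other yeah"]])

(def result
  (->> rows
       (drop-short-rows 3)   ; custom step, rows passed as the last argument
       rest                  ; skip the header row
       (map #(nth % 2))      ; standard collection function
       (into [])))
;; result is ["stuff" "other yeah"]
```

Because every step agrees on where the rows go, custom steps and the standard sequence functions can be mixed freely in one pipeline.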
Specific things semantic-csv can do:
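Absorb a header row and return the remaining rows as maps of column-name -> row-val
Cast values by column name
Remove commented-out lines (those starting with #)
In a future release, I also plan on adding an optional "sniffer" that reads in N lines and uses them to guess column types. There will likely be some other smaller features offered along the way as well.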
Start by adding [semantic-csv "0.1.0"] (the latest release as of this writing) to your project.clj. Once you have that, you can fire up a REPL and start typing away.
We'll start by requiring the main API namespace, and also that of clojure/data.csv.
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[semantic-csv.core :as sc :refer :all])
Now, let's take the simple example CSV file from within the test directory of the semantic-csv source:
this,that,more
# some comment lines...
1,2,stuff
2,3,"other yeah"
First, let's see what this looks like when we load it using the vanilla csv/read-csv:
=> (with-open [in-file (io/reader "test.csv")]
     (doall (csv/read-csv in-file)))
(["this" "that" "more"]
 ["# some comment lines..."]
 ["1" "2" "stuff"]
 ["2" "3" "other yeah"])
(Notice that since csv/read-csv is lazy, we must consume the file before leaving the with-open scope, or the lazy sequence will error out when we try to get "the next element" only to find the file handle has closed. The doall function accomplishes this by forcing realization of the lazy sequence.)
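To see this failure mode for yourself, here is a small sketch that uses line-seq over an in-memory reader instead of csv/read-csv. That substitution is an assumption made so the example runs without any files or dependencies; the same reasoning applies to any lazy sequence backed by a reader.

```clojure
(def csv-text "this,that,more\n1,2,stuff\n2,3,\"other yeah\"\n")

(defn open-reader
  "Illustrative helper: a fresh BufferedReader over the in-memory text."
  []
  (java.io.BufferedReader. (java.io.StringReader. csv-text)))

;; Realized inside with-open: doall forces every line while the reader is open.
(def realized
  (with-open [rdr (open-reader)]
    (doall (line-seq rdr))))

;; Not realized: the lazy sequence escapes with-open, so the reader is
;; already closed by the time anything reads past the first line.
(def escaped
  (with-open [rdr (open-reader)]
    (line-seq rdr)))

(def escaped-failed?
  (try
    (doall escaped)
    false
    (catch Exception _ true)))
```

Here realized holds all three lines, while forcing escaped throws because the underlying reader has been closed.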
We now have a minimal translation of the raw CSV text into data. However, a number of things are not yet represented in this data: the first row is a header, the second row is a comment, and the values in the first two columns are numbers.
First, let's take care of that pesky comment line. We can do this using the remove-comments function.
=> (with-open [in-file (io/reader "test.csv")]
     (->> (csv/read-csv in-file)
          remove-comments
          doall))
(["this" "that" "more"]
["1" "2" "stuff"]
["2" "3" "other yeah"])
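For intuition about what a comment-removal step involves, here is a rough sketch in plain Clojure. The drop-commented-rows function is a hypothetical stand-in, and it assumes a row counts as a comment when its first cell starts with #; the library's remove-comments may behave differently in its details.

```clojure
;; Hypothetical stand-in for a comment-removal step; not the library's code.
;; Assumption: a row is a comment when its first cell starts with "#".
(defn drop-commented-rows
  [rows]
  (remove #(re-find #"^#" (str (first %))) rows))

(def uncommented
  (->> [["this" "that" "more"]
        ["# some comment lines..."]
        ["1" "2" "stuff"]
        ["2" "3" "other yeah"]]
       drop-commented-rows
       doall))
```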
Now, let's use the mappify function to consume the header row and return the remaining rows as maps.
=> (with-open [in-file (io/reader "test.csv")]
     (->> (csv/read-csv in-file)
          remove-comments
          mappify
          doall))
({:this "1" :that "2" :more "stuff"}
{:this "2" :that "3" :more "other yeah"})
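The header-to-map step is easy to picture in plain Clojure. The rows->maps function below is a hypothetical sketch of the idea (keywordize the header, then zip it with each remaining row); it is not the library's implementation of mappify, which may offer more options.

```clojure
;; Hypothetical sketch of the idea behind a header-to-map step.
(defn rows->maps
  [rows]
  (let [header (map keyword (first rows))]   ; "this" -> :this, etc.
    (map #(zipmap header %) (rest rows))))   ; pair each cell with its column

(def maps
  (rows->maps [["this" "that" "more"]
               ["1" "2" "stuff"]
               ["2" "3" "other yeah"]]))
;; maps is ({:this "1" :that "2" :more "stuff"}
;;          {:this "2" :that "3" :more "other yeah"})
```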
Now we should really transform the :this and :that entries into numbers, since we'll probably want to be treating them as such. We can do this using the cast-with function, which takes as its first argument either a function to be applied to all values, or a map of column names to casting functions (see the docs for everything it can do).
=> (with-open [in-file (io/reader "test.csv")]
     (->> (csv/read-csv in-file)
          remove-comments
          mappify
          (cast-with {:this ->double :that ->int})
          doall))
({:this 1.0 :that 2 :more "stuff"}
{:this 2.0 :that 3 :more "other yeah"})
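Per-column casting over a sequence of maps can be sketched in a few lines. The cast-columns function below is hypothetical and only illustrates the "map of column names to casting functions" form; it uses plain Java parsers rather than the library's casting helpers, and it assumes every named column is present in every row.

```clojure
;; Hypothetical sketch of per-column casting; not the library's cast-with.
;; Assumption: each column named in `casts` exists in every row.
(defn cast-columns
  [casts rows]
  (map (fn [row]
         (reduce-kv (fn [r col f] (assoc r col (f (get r col))))
                    row
                    casts))
       rows))

(def casted
  (->> [{:this "1" :that "2" :more "stuff"}
        {:this "2" :that "3" :more "other yeah"}]
       (cast-columns {:this #(Double/parseDouble %)
                      :that #(Long/parseLong %)})
       doall))
;; casted is ({:this 1.0 :that 2 :more "stuff"}
;;            {:this 2.0 :that 3 :more "other yeah"})
```

Columns without an entry in the casting map, such as :more here, pass through untouched.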
Et voilà! A simple, composable tool for parsing the semantics of CSV (and other tabular) data.
Prior to working on pol.is, I was a Systems Analyst Programmer at Fred Hutchinson Cancer Research Center in the Computational Biology department. CSV data was something I worked with all the time, and I'd typically use python and R to work with this data. There were a number of things I came to both love and hate about the way these languages handled CSV data.
For the most part, I really liked python's standard csv library and its simple approach to CSV data. One of the things I liked most about it was that I could absorb a CSV header and return subsequent rows as dictionaries using csv.DictReader, and write from dictionaries using csv.DictWriter. For anything more than a couple of columns I knew would never change, this was almost always what I went for. However, the one thing I frequently wished python's csv library had was simple column casting.
R's CSV reading/writing had a very different flavor. The basic CSV reading functionality from R is eager, but conveniently slurps data into a data frame, not only taking in the first row as column names (by default), but also guessing at the various types of each column. Of course, not being lazy is annoying for larger data, and sometimes the casting would bork things up, leading to subtle bugs. Further, string columns were by default read in as factors, which can cause LOTS of subtle bugs where values get interpreted as their numeric values instead of the strings they represent. In short, while I often found R's heavy opinionatedness convenient, it could also be very frustrating.
When I got to Clojure, I found the simplicity of the tools refreshing, but also found myself missing some of the higher level features I'd liked in python and R. Realizing there was a hole in Clojure's library offerings, I began to reflect on my previous experiences, and consider what an optimal solution might look like.
Based on the tools I had previously dealt with, my natural inclination was to design an opinionated function (or two; read/write) with overridable defaults. But as I thought more carefully about the most Clojuric approach to the problem, I realized there was a better way: by breaking things into pieces, you could have something simpler and more flexible. And so semantic-csv was born.
I welcome you to try out semantic-csv! So far I've enjoyed using it, and hope you do too.
For the latest version:
Docs: http://metasoarous.github.io/semantic-csv
Source: https://github.com/metasoarous/semantic-csv
Chat/contact: https://gitter.im/metasoarous/semantic-csv
Content Copyright 2019, Christopher T. Small; Site generated by Oz