
Presenting semantic-csv: High-level CSV processing for Clojure

Published at: 2015-05-08 06:13

Tags: clojure

I'm happy to announce the release of semantic-csv, a humble little library providing higher-level CSV parsing features.

The semantic niche

Existing CSV parsing libraries for Clojure (clojure/data.csv and clojure-csv) only concern themselves with the grammar of CSV. Given a string or file with CSV data, they produce a collection of row vectors, a minimal data representation of the structure of CSV.

However, CSV data typically means something. Sometimes the items of the row vectors refer to numbers or dates. Sometimes the first row is a header. Sometimes a row has been "commented" out. These are all things which are concerned not so much with the grammar of CSV data as with its semantics.

In the spirit of composability, semantic-csv lives solely at the level of semantic processing. By not concerning itself with the details of grammar, it frees itself to be interoperable with any underlying grammar parsing approach (that is, it's completely compatible with both clojure/data.csv and clojure-csv).

API design

Semantic-csv emphasizes a set of composable collection processing functions. Each of these functions takes the collection of data rows as the final argument, facilitating interoperability with the standard collection processing functions (map, filter, reduce, etc.) within the context of the ->> threading macro. However, as a convenience, several functions are provided which wrap these individual processing functions with some opinionated but configurable defaults (see process, slurp-csv and spit-csv).
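To illustrate the rows-last convention, here is a minimal sketch in plain Clojure. The helpers drop-hash-rows and rows->maps are hypothetical stand-ins written for this example, not semantic-csv's own functions; the point is only that anything taking the row collection as its final argument slots into ->> alongside filter and map.

```clojure
;; Hypothetical stand-ins for illustration only (not the semantic-csv source).
(defn drop-hash-rows
  "Removes rows whose first cell starts with #. Rows are the last argument."
  [rows]
  (remove #(.startsWith ^String (str (first %)) "#") rows))

(defn rows->maps
  "Treats the first row as a header and returns the remaining rows as maps."
  [rows]
  (let [header (map keyword (first rows))]
    (map #(zipmap header %) (rest rows))))

(def raw-rows
  [["this" "that" "more"]
   ["# some comment lines..."]
   ["1" "2" "stuff"]
   ["2" "3" "other yeah"]])

;; Because every step takes the rows as its last argument, library-style
;; functions and core functions mix freely inside ->>.
(def result
  (->> raw-rows
       drop-hash-rows
       rows->maps
       (filter #(= "2" (:this %)))
       (map :more)))
```

Since each step is an ordinary function over a sequence, any of them can be dropped, reordered, or replaced with your own without touching the others.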

Features

Specific things semantic-csv can do:

Absorb header row as a vector of column names, and return remaining rows as maps of column-name -> row-val
Write from a collection of maps, given a header
Apply casting/formatting functions by column name, while reading or writing
Remove commented out lines (by default, those starting with #)
Compatible with any CSV parsing library returning/writing a sequence of row vectors

In a future release, I also plan on adding an optional "sniffer" that reads in N lines, and uses them to guess column types. There will likely be some other smaller features offered along the way as well.
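The demo further down only covers reading, so here is a rough sketch of the write direction listed above: turning a collection of maps back into row vectors given a header, with optional per-column formatting. The function maps->rows and its format-fns argument are hypothetical names made up for this illustration, not the library's API; the resulting vectors have the shape a writer such as clojure.data.csv's write-csv expects.

```clojure
;; Hypothetical helper for illustration (not part of semantic-csv).
(defn maps->rows
  "Returns a header row followed by one vector per map, with cells in
  header order. format-fns maps column keys to formatting functions;
  columns without an entry fall back to str."
  [header format-fns maps]
  (cons (mapv name header)
        (map (fn [m]
               (mapv (fn [k] ((get format-fns k str) (get m k))) header))
             maps)))

(def data
  [{:this 1.0 :that 2 :more "stuff"}
   {:this 2.0 :that 3 :more "other yeah"}])

;; Format :this as a whole number; every other column just goes through str.
(def out-rows
  (maps->rows [:this :that :more]
              {:this #(str (long %))}
              data))
```

Passing the header explicitly fixes the column order, which plain maps do not guarantee on their own.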

A little demo

Start by adding [semantic-csv "0.1.0"] (the latest release as of this writing) to the dependencies in your project.clj. Once you have that, you can fire up a REPL and start typing away.

We'll start by requiring the main API namespace, and also that of clojure/data.csv.

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[semantic-csv.core :as sc :refer :all])

Now, let's take the simple example CSV file from within the test directory of the semantic-csv source:

this,that,more
# some comment lines...
1,2,stuff
2,3,"other yeah"

First, let's see what this looks like when we load it using the vanilla csv/read-csv:

=> (with-open [in-file (io/reader "test.csv")]
     (doall (csv/read-csv in-file)))
(["this" "that" "more"]
 ["# some comment lines..."]
 ["1" "2" "stuff"]
 ["2" "3" "other yeah"])

(Notice that since csv/read-csv is lazy, we must consume the file before leaving the with-open scope, or the lazy sequence will error out when we try to get "the next element" only to find the file handle has closed. The doall function accomplishes this by forcing realization of the lazy sequence.)
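This failure mode is easy to reproduce without any CSV library. The sketch below uses line-seq over an in-memory reader as a stand-in for csv/read-csv (an assumption made so the example is self-contained); lines-lazy and lines-eager are hypothetical names.

```clojure
(import '[java.io BufferedReader StringReader])

(def text "this,that,more\n1,2,stuff\n2,3,other yeah\n")

(defn lines-lazy
  "Returns the lazy seq without forcing it; the reader is closed on exit."
  []
  (with-open [r (BufferedReader. (StringReader. text))]
    (line-seq r)))

(defn lines-eager
  "Forces the whole seq with doall while the reader is still open."
  []
  (with-open [r (BufferedReader. (StringReader. text))]
    (doall (line-seq r))))

;; Realizing the lazy version after with-open has closed the reader
;; throws, because the remaining lines can no longer be read.
(def lazy-outcome
  (try
    (doall (lines-lazy))
    (catch Exception _ :reader-closed)))

;; The eager version was fully realized inside with-open, so it is safe.
(def eager-outcome (lines-eager))
```

The same reasoning applies to every pipeline in this post: whatever steps are threaded together, the final doall has to run inside the with-open.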

We now have a minimal translation of the raw CSV text into data. However, you'll see there are a number of things not currently represented in this data:

It's quite clear the first row is supposed to be a header row of column names, and to be treated differently than the remaining rows.
The second row is intended to be a comment row, and should clearly be removed. While you can gripe about whether this should be allowed in CSV files, it happens, so it's good to have a way of dealing with it...
The remaining rows contain what should clearly be numeric values, but are still strings.

First, let's take care of that pesky comment line. We can do this using the remove-comments function.

=> (with-open [in-file (io/reader "test.csv")]
     (->> (csv/read-csv in-file)
          remove-comments
          doall))
(["this" "that" "more"]
 ["1" "2" "stuff"]
 ["2" "3" "other yeah"])

Now, let's use the mappify function to consume the header row and return the remaining rows as maps.

=> (with-open [in-file (io/reader "test.csv")]
     (->> (csv/read-csv in-file)
          remove-comments
          mappify
          doall))
({:this "1" :that "2" :more "stuff"}
 {:this "2" :that "3" :more "other yeah"})

Now we should really transform the :this and :that entries into numbers, since we'll probably want to be treating them as such. We can do this using the cast-with function, which takes as its first argument either a function to be applied to all values, or a map of column names to casting functions (see the docs for everything it can do).

=> (with-open [in-file (io/reader "test.csv")]
     (->> (csv/read-csv in-file)
          remove-comments
          mappify
          (cast-with {:this ->float :that ->int})
          doall))
({:this 1.0 :that 2 :more "stuff"}
 {:this 2.0 :that 3 :more "other yeah"})
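For intuition about the map form, column-wise casting boils down to applying a function to selected keys of each row map. Below is a rough plain-Clojure sketch of that idea; cast-columns, parse-int* and parse-double* are hypothetical helpers for this example, and this is not a claim about how cast-with is actually implemented.

```clojure
;; Hypothetical helpers for illustration (not the semantic-csv source).
(defn parse-int* [^String s] (Long/parseLong s))
(defn parse-double* [^String s] (Double/parseDouble s))

(defn cast-columns
  "Applies each function in cast-fns to the matching column of every row
  map, leaving other columns untouched. Rows come last for use with ->>."
  [cast-fns rows]
  (map (fn [row]
         (reduce-kv (fn [m k f] (update-in m [k] f)) row cast-fns))
       rows))

(def casted
  (->> [{:this "1" :that "2" :more "stuff"}
        {:this "2" :that "3" :more "other yeah"}]
       (cast-columns {:this parse-double* :that parse-int*})))
```

Any one-argument function can serve as a caster here, so date parsing or keyword conversion fits the same pattern.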

Et voilà! A simple, composable tool for parsing the semantics of CSV (and other tabular) data.

How my previous experience with CSV informed this design

Prior to working on pol.is, I was a Systems Analyst Programmer at Fred Hutchinson Cancer Research Center in the Computational Biology department. CSV data was something I worked with all the time, and I'd typically use python and R to work with this data. There were a number of things I came to both love and hate about the way these languages handled CSV data.

For the most part, I really liked python's standard csv library and its simple approach to CSV data. One of the things I liked most about it was that I could absorb a CSV header and return subsequent rows as dictionaries using csv.DictReader, and write from dictionaries using csv.DictWriter. For anything more than a couple of columns I knew would never change, this was almost always what I went for. However, the one thing I frequently wished python's csv library had was simple column casting.

R's CSV reading/writing had a very different flavor. The basic CSV reading functionality from R is eager, but conveniently slurps data into a data frame, not only taking in the first row as column names (by default), but also guessing at the various types of each column. Of course, not being lazy is annoying for larger data, and sometimes the casting would bork things up, leading to subtle bugs. Further, string columns were by default read in as factors, which can cause LOTS of subtle bugs where values get interpreted as their numeric values instead of the strings they represent. In short, while I often found R's heavy opinionatedness convenient, it could also be very frustrating.

When I got to Clojure, I found the simplicity of the tools refreshing, but also found myself missing some of the higher level features I'd liked in python and R. Realizing there was a hole in Clojure's library offerings, I began to reflect on my previous experiences, and consider what an optimal solution might look like.

Based on the tools I had previously dealt with, my natural inclination was to design an opinionated function (or two; read/write) with overridable defaults. But as I thought more carefully about the most Clojuric approach to the problem, I realized there was a better way: by breaking things into pieces, you could have something simpler and more flexible. And so semantic-csv was born.

Try it out!

I welcome you to try out semantic-csv! So far I've enjoyed using it, and hope you do too.

For the latest version:

Docs: http://metasoarous.github.io/semantic-csv
Source: https://github.com/metasoarous/semantic-csv
Chat/contact: https://gitter.im/metasoarous/semantic-csv
