Overview
This series of posts is my attempt to familiarize myself with some of the
databases that originated in the Clojure ecosystem over the last several years. The first will be Datomic.
The dataset I'll be using for investigation is the daily energy offers published by the Electricity Authority in New Zealand. In short, it records the offers that sellers of energy and reserves make to supply the NZ wholesale market.
The process I followed was to first load the data into a local table-based relational store that I can use as a point of comparison. After that, I built up a Datomic model, loaded the same data, and tried to produce the same query results as I could with SQL.
Datomic also describes itself as a relational store, just not a table-based one. It stores data in Entity, Attribute, Value (EAV) triples.
An entity represents something you describe in the database—a person, a product, an event, etc.
An attribute is a property or characteristic of the entity. Attributes describe specific facts about the entity, such as "name," "age," or "location."
The value is the specific piece of data associated with the attribute for that entity. It can be any data type, such as strings, numbers, dates, or even references to other entities.
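To make that concrete, each fact Datomic stores can be read as a single triple. The entity ids and attribute names below are made up purely for illustration:

;; Three facts about one entity, read as [entity attribute value].
;; Entity ids and attributes are illustrative only.
[17592186045418 :person/name "Alice"]
[17592186045418 :person/age  42]
[17592186045418 :person/city 17592186045430]   ; a reference to another entity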
The Offers CSV Data Set
For this example we'll use the daily dataset from 2024-06-10; it's 41 megs and 271k lines.
I prefer to use DuckDB for my local SQL-based analysis; it's easy to create and load a table straight from a CSV file.
CREATE TABLE offers AS SELECT * FROM '/home/jonesn/dev/arachnid/energynz/resources/20240610_Offers.csv';
DuckDB loads and interprets the dataset as defined by the CREATE TABLE statement below, which is a pretty good effort for a one-liner. The ramp-up attributes should be numeric as well, but this is enough to execute the queries we'll want to compare with Datomic.
CREATE TABLE offers
(
    TradingDate                            DATE,
    TradingPeriod                          BIGINT,
    ParticipantCode                        VARCHAR,
    PointOfConnection                      VARCHAR,
    Unit                                   VARCHAR,
    ProductType                            VARCHAR,
    ProductClass                           VARCHAR,
    ReserveType                            VARCHAR,
    ProductDescription                     VARCHAR,
    UTCSubmissionDate                      DATE,
    UTCSubmissionTime                      TIME,
    SubmissionOrder                        BIGINT,
    IsLatestYesNo                          VARCHAR,
    Tranche                                BIGINT,
    MaximumRampUpMegawattsPerHour          VARCHAR,
    MaximumRampDownMegawattsPerHour        VARCHAR,
    PartiallyLoadedSpinningReservePercent  VARCHAR,
    MaximumOutputMegawatts                 VARCHAR,
    ForecastOfGenerationPotentialMegawatts VARCHAR,
    Megawatts                              DOUBLE,
    DollarsPerMegawattHour                 DOUBLE
);
Datomic Setup
I added this section because I downloaded the pro bundle instead of using the embedded local option described in the official doco. The main reason was access to the console app that ships with the pro bundle, which I found useful for getting some insight into Datomic's structure when starting from zero knowledge.
The tmux config I use to start a transactor and the console is listed below.
The Schema
Datomic requires a schema before data is loaded. I enjoyed this step, as I frequently find I'd like the structure of my domain defined somewhere when I'm using Clojure. I've tried the code-only models like Spec, but this feels more scalable, with a natural delineation between the formal data model in the database and the malleable operations on that model in code.
One small issue I have is being forced to serialize through the legacy Java date API (:db.type/instant). I'm always going to want to convert to the java.time variants so every project will require conversion helper functions.
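Something like this minimal sketch is what I have in mind (the namespace and the zone choice are my own, purely illustrative):

(ns energynz.time
  (:import (java.time ZoneId)
           (java.util Date)))

;; Datomic hands :db.type/instant values back as java.util.Date,
;; so convert at the edges. Illustrative helpers only.
(def nz-zone (ZoneId/of "Pacific/Auckland"))

(defn date->instant [^Date d]
  (.toInstant d))

(defn date->local-date [^Date d]
  (-> d .toInstant (.atZone nz-zone) .toLocalDate))

(defn local-date->date [^java.time.LocalDate ld]
  (Date/from (-> ld (.atStartOfDay nz-zone) .toInstant)))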
So the DuckDB table above maps to the Datomic "offers" entity sketched below.
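A trimmed sketch of that schema follows; only a handful of the attributes are shown and the exact names are illustrative, but the full version just repeats the pattern for every CSV column.

;; A subset of the offers schema - one map per attribute.
;; d is datomic.api, conn is a connection to the offers database.
(def offers-schema
  [{:db/ident       :offers/trading-date
    :db/valueType   :db.type/instant
    :db/cardinality :db.cardinality/one
    :db/doc         "Trading date the offer applies to"}
   {:db/ident       :offers/trading-period
    :db/valueType   :db.type/long
    :db/cardinality :db.cardinality/one}
   {:db/ident       :offers/participant-code
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one}
   {:db/ident       :offers/product-type
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one}
   {:db/ident       :offers/megawatts
    :db/valueType   :db.type/double
    :db/cardinality :db.cardinality/one}
   {:db/ident       :offers/dollars-per-megawatt-hour
    :db/valueType   :db.type/double
    :db/cardinality :db.cardinality/one}])

;; Loading the schema is just a transaction.
@(d/transact conn offers-schema)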
Once the schema is loaded, it can be viewed via the Datomic console in the schemas window (localhost:8080/browse).
Unique Keys
I wanted to define what makes a row unique, and this ended up being a large composite key. To enforce the uniqueness I tried to use the tuple type. This was working quite well, but once the key went past eight attributes I got the error below saying that eight is the maximum a tuple can hold. At this point I stopped and decided the dataset wasn't well suited to a composite key; I wouldn't use one this large in SQL either. This reflects more on how the data is presented in the CSV file than any limitation of Datomic.
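For reference, the composite-key attempt looked roughly like this (attribute names illustrative); Datomic's composite tuples combine at most eight attributes and this candidate key needed more:

;; A composite tuple attribute used as a uniqueness constraint.
;; Nine attributes is one more than the eight Datomic allows,
;; which triggers the error mentioned above.
{:db/ident       :offers/key
 :db/valueType   :db.type/tuple
 :db/tupleAttrs  [:offers/trading-date
                  :offers/trading-period
                  :offers/participant-code
                  :offers/point-of-connection
                  :offers/unit
                  :offers/product-type
                  :offers/reserve-type
                  :offers/submission-order
                  :offers/tranche]
 :db/cardinality :db.cardinality/one
 :db/unique      :db.unique/identity}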
Populating the Data
As mentioned above, the CREATE TABLE statement in DuckDB both created the table and loaded the CSV data. So to catch up, the next step is to load the same data into Datomic and then compare the record counts with SQL.
Most of the work is converting from a CSV record to a Datomic fact matching our schema.
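A sketch of that conversion for a handful of the columns, assuming ISO dates in the CSV and reusing the local-date->date helper from earlier (names are illustrative):

;; Turn one parsed CSV row (a map of column name -> string) into a
;; Datomic entity map. Only a few columns shown.
(defn row->offer [row]
  {:offers/trading-date              (local-date->date
                                       (java.time.LocalDate/parse (get row "TradingDate")))
   :offers/trading-period            (parse-long (get row "TradingPeriod"))
   :offers/participant-code          (get row "ParticipantCode")
   :offers/product-type              (get row "ProductType")
   :offers/megawatts                 (parse-double (get row "Megawatts"))
   :offers/dollars-per-megawatt-hour (parse-double (get row "DollarsPerMegawattHour"))})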
As a single transaction, the 271k facts were too much for the dev-local storage and the transaction timed out. I've collected the error messages here for reference. To work around this I created a helper function that would chunk the data into blocks of 10k facts.
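The helper itself boils down to partition-all plus one transaction per block; roughly this (d is datomic.api, rows is the parsed CSV):

;; Transact the offers in blocks of 10k entity maps rather than
;; one 271k-fact transaction that times out.
(defn transact-in-chunks! [conn facts chunk-size]
  (doseq [chunk (partition-all chunk-size facts)]
    @(d/transact conn (vec chunk))))

(transact-in-chunks! conn (map row->offer rows) 10000)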
Equivalent Queries
This section is a set of comparisons between SQL and the equivalent Datalog queries. This proved more fiddly than I was expecting; the AI gods struggled with conversions from SQL to Datalog, to the point where it wasn't worth using them.
Two points I want to call out when reading the queries below:
The SQL "O.ReserveType IS NULL" filter is replaced by the Datomic filter condition "[?e :offers/product-type "ENERGY"]". This is due to an enrichment at load time in the Datomic version of the loader, because Datomic doesn't allow NULL values.
The other is the importance of the ":with" clause in the aggregate query. It forces Datomic to consider every entity (think rows) rather than just the distinct set of values being aggregated.
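To illustrate the second point, here is a hedged sketch of the kind of aggregate where :with matters; without it, duplicate megawatt values collapse into a set before the sum runs and the total comes out low. db is (d/db conn) and the column choice is illustrative:

;; SQL equivalent: SELECT SUM(Megawatts) FROM offers WHERE ProductType = 'ENERGY';
(d/q '[:find (sum ?mw)
       :with ?e                               ; keep one value per entity
       :where
       [?e :offers/product-type "ENERGY"]
       [?e :offers/megawatts ?mw]]
     db)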
Inequivalent Queries
As I was exploring Datomic and building some of the queries, I noticed it is set orientated by default, presumably enabled by the indexes it keeps on each element of its triples. I will demonstrate with a couple of queries, but in short: in SQL a SELECT a will return the value of a for every row in the table, while Datomic behaves like SELECT DISTINCT a.
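A rough illustration; the queries are real Datalog shapes, but the results are described rather than copied from a run:

;; SQL: SELECT ParticipantCode FROM offers;          -> one value per row, duplicates kept
;; SQL: SELECT DISTINCT ParticipantCode FROM offers; -> one value per participant
;; Datalog behaves like the DISTINCT form by default:
(d/q '[:find ?pc
       :where [_ :offers/participant-code ?pc]]
     db)
;; => a set of tuples, one per distinct participant code

;; Bringing the entity back in with :with restores row-per-row behaviour:
(d/q '[:find (count ?pc)
       :with ?e
       :where [?e :offers/participant-code ?pc]]
     db)
;; => the total number of offer rows, duplicates included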
Summary
It was a long-winded process to get up and running with Datomic, but I appreciated some of what it offered. It is not a quick-hack tool; it feels orientated towards well-thought-out, long-life data storage, which I guess is what you want a database for :-).
Pros
I like that transactions are first class. I guess they are in traditional relational databases but it's really in your face here.
I like that it has an upfront schema; you can't just toss in whatever you like.
No nulls.
The EAV triple model seems interesting but I don't fully grasp the implications yet.
Being able to query the database as at a point in time (a quick sketch below).
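A minimal sketch of that last point, using d/as-of against an illustrative timestamp:

;; Query the database as it stood at an earlier point in time.
(def past-db (d/as-of (d/db conn) #inst "2024-06-11T00:00:00Z"))

(d/q '[:find (count ?e)
       :where [?e :offers/trading-period]]
     past-db)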
Cons
The tooling is rough compared to traditional relational stores.
Still blessed with some opaque Clojure error messages, especially when parsing queries.
Building queries with sets and lists takes some getting used to. I guess that makes it nice for building dynamic queries, but I don't find that's something I need to do often.