Matthew Burke

Mathematician, Software Developer

Keeping Your Code DRY Through Data Abstraction and Metaprogramming

Recently I was working on a project that required keeping audit logs of changes to tables in the database. For each table that had to be logged, I attached INSERT, UPDATE and DELETE triggers to record all changes. The approach looked roughly like the following (using MySQL as the database):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
CREATE TABLE members (
  member_id       INTEGER PRIMARY KEY AUTO_INCREMENT,
  member_name     VARCHAR(128) NOT NULL,
  member_state    CHAR(2) NOT NULL,
  INDEX state_idx (member_state),
  FOREIGN KEY (member_state) REFERENCES postal_abbreviations(abbreviation)
) ENGINE=INNODB;

CREATE TABLE members_journal (
  journal_id    INTEGER PRIMARY KEY AUTO_INCREMENT,
  operation     VARCHAR(16) NOT NULL,
  member_id     INTEGER NOT NULL,
  member_name   VARCHAR(128) NULL,
  member_state  CHAR(2) NULL
) ENGINE=INNODB;

DROP TRIGGER IF EXISTS members_insert_trigger;

DELIMITER |
CREATE TRIGGER members_insert_trigger AFTER INSERT ON members
FOR EACH ROW
BEGIN
  INSERT INTO members_journal (operation, member_id, member_name)
     VALUES ('insert', NEW.member_id, NEW.member_name, NEW.member_state);
END;
|
DELIMITER ;

-- more trigger definitions here. . .

Briefly: for each primary table, such as members, you create a table to record the changes, members_journal. The journal table should include all the columns from the primary table, without constraints, as well as its own primary key and a column to include the operation type (insert, update, delete). Typically, you would want to include a timestamp in the journal table, but I have not shown it, as the example source code is already fairly lengthy. Then you create triggers to insert information into the journal table each time the primary table is modified.

There are some things which you need to be concerned with if you use this technique. The first that comes to mind is performance. But in this particular case, it’s not too much of a concern because once we went into production, the content would change infrequently and only a handful of users would have permission to change the content.

More pressing for us was the fact that a) we were making fairly frequent and extensive schema changes during development and b) unlike the example above, our tables had lots of columns.

You can probably see the implication of the above. We would update our code and recreate the database only to find that the column lists in the table definitions and the column lists in the trigger definitions were out of sync. Keep in mind that for each table being audited we had four column lists to keep consistent: one table and three triggers.

Now let’s change gears for a moment and consider software development principles. DRY is an acronym formed from the coding aphorism Don’t Repeat Yourself (also known as Single Source of Truth). The essence of the principle is that, according to Hunt and Thomas in The Pragmatic Programmer, “[e]very piece of knowledge must have a single, unambiguous, authoritative representation within a system.”

Why keep your code DRY (i.e. follow this principle)? Because if you do not, dire things happen! You find yourself having to fix the same bug in multiple places (and always, always, always forgetting to correct one of the instances), or introducing inconsistencies in your software, or not seeing opportunities to simplify your implementation because you’re blinded by lots of excess code.

Now back to my database problem. Clearly, if you’re not DRY, then you must be WET (write everything twice). Except in my case, I’m beyond WET because I’m writing everything four times!

I needed a mechanism that allowed me to avoid this duplication. There are several approaches I could have taken including the following:

  1. Write the SQL statements to create the tables then parse these SQL statements to auto-generate the trigger definitions.
  2. Define the tables and then pull the necessary information to generate the triggers from the database’s information schema.
  3. Describe the table columns using something more abstract than SQL code and auto-generate both the table definitions and the trigger definitions from this same source.

I chose the third option. There remained the decision of how to represent the column definitions and how to do the auto-generation. An obvious choice would be to use XML as the data representation and use XSLT to do the transformation. I leave it as an exercise for the reader to research this option.

Instead, I chose to use the language Lua. Lua is an elegant, powerful scripting language that began as a data description language in the early 1990s. Although it has evolved significantly since then, it has never lost its powerful data description capabilities. I’ll discuss Lua in more detail in a future blog post.

Using Lua, I created a data structure to describe each of the primary tables. Using the example schema from above, the data structure looks as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
schema = {
  members = {
     fields = {
        { 'member_id', 'integer', 'auto_increment', },
        { 'member_name', 'varchar(128)', 'not null', },
        { 'member_state', 'char(2)', 'not null', },
     },

     primary_key = {
        'member_id',
     },

     foreign_keys = {
        { 'state', 'postal_abbreviations', 'abbrev', },
     },

     indices = {
        'state',
     },
   }

   some_other_table = {
                    . . .
      . . .
   }

A few points about the above code: in Lua, the primary data structure is the table, which can be treated as an array, an associative array (hash), or both at the same time. The schema table includes entries for each primary table in my database. Each table entry describes the fields and auxiliary information such as table constraints, keys, and indices to be created. As with many scripting languages, if you attempt to access a key in a table that doesn’t exist, you get the special value, nil, returned. This means that your data definitions can be sparse.

With the above data structure in hand, we can now easily write different scripts to process the schema in different ways. For example, below is the start of a script to generate the table definition code.

1
2
3
4
5
6
for tableName, tableDefinition in pairs(schema) do
  print(string.format("CREATE TABLE %s(", tableName)
  for _, fieldDefinition in pairs(tableDefinition['fields']) do
    . . .
    . . .
end

Using the above, I have moved from being all WET to DRY. I use the same information to generate the tables, the journal tables, and the triggers and so I can be assured that everything is properly in sync. There is one small, remaining item. After I make a change to the field definitions, I need to remember to run all the generators. But that can be handled with proper use of build automation and will have to be left to another blog post!