Finding answers inside of Redshift JSON columns

December 11, 2019

Accessing responder data from a Postgres-like database.

Understanding database columns and JSON

Conceptually, you can think of a table in a columnar relational database such as Redshift as similar to a spreadsheet.

The table's schema determines which columns are present, and what types the values in those columns should take.

Date              Name              Age
--------------------------------------------------

A database can enforce that dates are formatted as dates, and much more.

When data is written into the table, rows are created, which you can think of as new rows in a spreadsheet.

Date              Name              Age
--------------------------------------------------
12/09/2019        Olivia            4
12/09/2019        Stubby            7

While Formsort form flows also have schemas, those schemas can change over time, as new questions are added in order to collect new answers. To prevent needing to update the Redshift database columns every time a new answer is collected, answers are written into a single column, answers, in a data format known as JSON.

JSON allows us to express objects that contain multiple values, which are referenced by their keys. So instead of needing multiple columns for the above examples, the single answers column would contain the values all together.

date              answers
--------------------------------------------------------------
12/09/2019        {
                    "name": "Olivia",
                    "age": 4
                  }

12/10/2019        {
                    "name": "Stubby",
                    "age": 7
                  }             

Note that while every row will have the same columns, not every row might have the same set of keys and values. For example, if we start asking users for their birthday, the answer to that question will only be present in the dataset from that point on.

date              answers
--------------------------------------------------------------
12/09/2019        {
                    "name": "Olivia",
                    "age": 4
                  }

12/10/2019        {
                    "name": "Stubby",
                    "age": 7
                  }

12/11/2019        {
                    "name": "Rosie",
                    "age": 10,
                    "birthday": "02/13/2009"
                  }         

Querying JSON columns to get values

To get answers out of the JSON column, you'd use the standard SELECT statement that you may be familiar with.

SELECT
  answers
FROM
  formsort.answer_set

But if you merely select answers, you'll get a blob of text containing the JSON object, which isn't easy to use for direct analysis:

{
  "name": "Olivia",
  "age": 4
}

{
  "name": "Stubby",
  "age": 7
}

{
  "name": "Rosie",
  "age": 10,
  "birthday": "02/13/2009"
}

To reach into answers you'll want to use a function called json_extract_path_text instead, telling the database which key we want from answers using the second parameter, in this case, 'name'.

SELECT
  json_extract_path_text(answers, 'name') AS name
FROM
  formsort.answer_set
name
----------
Olivia
Stubby
Rosie

You can also select multiple answers out at once by repeating the fields you'd like to select out.

SELECT
  json_extract_path_text(answers, 'name') AS name,
  json_extract_path_text(answers, 'age') AS age
FROM
  formsort.answer_set
name          age
-------------------
Olivia        4
Stubby        7
Rosie         10

Omitting rows without particular answer values

As mentioned earlier, when new answers are collected in a flow, they will only appear in the answers column from that point on. The behavior of json_extract_path_text is to return the empty string when a value is missing, so we can use that in a WHERE clause to filter out rows where a particular answer is not set.

SELECT
  json_extract_path_text(answers, 'birthday') AS name,
FROM
  formsort.answer_set
WHERE json_extract_path_text(answers, 'birthday') != ''
birthday
------------
02/13/2009

For more detail, see the Redshift documentation on JSON_EXTRACT_PATH_TEXT