Finding answers inside of Redshift JSON columns

Accessing responder data from a Postgres-like database.

December 11, 2019


Understanding database columns and JSON

Conceptually, you can think of a table in a columnar relational database such as Redshift as similar to a spreadsheet.

The table's schema determines which columns are present, and what the types the values in those columns should take.

DateNameAge

A database can enforce that dates are formatted as dates, and much more.

When data is written into the table, rows are created, which you can think of as new rows in a spreadsheet.

datenameage
12/09/2019Olivia4
12/09/2019Stubby7

While Formsort flows also have schemas, those schemas can change over time, as new questions are added in order to collect new answers. To prevent needing to update the Redshift database columns every time a new answer is collected, answers are written into a single column, answers, in a data format known as JSON.

JSON allows us to express objects that contain multiple values, which are referenced by their keys. So instead of needing multiple columns for the above examples, the single answers column would contain the values all together.

dateanswers
12/09/2019

            {
              "name": "Olivia",
              "age": 4
            }
            
12/10/2019

            {
              "name": "Stubby",
              "age": 7
            }
            

Note that while every row will have the same columns, not every row might have the same set of keys and values. For example, if we start asking users for their birthday, the answer to that question will only be present in the dataset from that point on.

dateanswers
12/09/2019

            {
              "name": "Olivia",
              "age": 4
            }
            
12/10/2019

            {
              "name": "Stubby",
              "age": 7
            }
            
12/11/2019

            {
              "name": "Rosie",
              "age": 10,
              "birthday": "02/13/2009"
            }
            

Querying JSON columns to get values

To get answers out of the JSON column, you'd use the standard SELECT statement that you may be familiar with.

SELECT
  answers
FROM
  formsort.answer_set

But if you merely select answers, you'll get a blob of text containing the JSON object, which isn't easy to use for direct analysis:

answers
{
  "name": "Olivia",
  "age": 4
}
{
  "name": "Stubby",
  "age": 7
}
{
  "name": "Rosie",
  "age": 10,
  "birthday": "02/13/2009"
}

To reach into answers you'll want to use a function called json_extract_path_text instead, telling the database which key we want from answers using the second parameter, in this case, 'name'.

SELECT
  json_extract_path_text(answers, 'name') AS name
FROM
  formsort.answer_set
name
Olivia
Stubby
Rosie

You can also select multiple answers out at once by repeating the fields you'd like to select out.

SELECT
  json_extract_path_text(answers, 'name') AS name,
  json_extract_path_text(answers, 'age') AS age
FROM
  formsort.answer_set
nameage
Olivia4
Stubby7
Rosie10

Omitting rows without particular answer values

As mentioned earlier, when new answers are collected in a flow, they will only appear in the answers column from that point on. The behavior of json_extract_path_text is to return the empty string when a value is missing, so we can use that in a WHERE clause to filter out rows where a particular answer is not set.

SELECT
  json_extract_path_text(answers, 'birthday') AS name,
FROM
  formsort.answer_set
WHERE json_extract_path_text(answers, 'birthday') != ''
birthday
02/13/2009

For more detail, see the Redshift documentation on JSON_EXTRACT_PATH_TEXT

Contact
  • hello@formsort.com
  • Jobs
  • Security
Resources
  • Formsort Inc
    © 2020