Finding answers inside of Redshift JSON columns
Accessing responder data from a Postgres-like database.
Understanding database columns and JSON
Conceptually, you can think of a table in a columnar relational database such as Redshift as similar to a spreadsheet.
The table's schema determines which columns are present, and what types the values in those columns should take.
Date Name Age
--------------------------------------------------
A database can enforce that dates are formatted as dates, and much more.
When data is written into the table, rows are created, which you can think of as new rows in a spreadsheet.
Date Name Age
--------------------------------------------------
12/09/2019 Olivia 4
12/09/2019 Stubby 7
While Formsort form flows also have schemas, those schemas can change over time, as new questions are added in order to collect new answers. To prevent needing to update the Redshift database columns every time a new answer is collected, answers are written into a single column, answers, in a data format known as JSON.
JSON allows us to express objects that contain multiple values, which are referenced by their keys. So instead of needing multiple columns for the above examples, the single answers column would contain the values all together.
date answers
--------------------------------------------------------------
12/09/2019 {
"name": "Olivia",
"age": 4
}
12/10/2019 {
"name": "Stubby",
"age": 7
}
Note that while every row will have the same columns, not every row might have the same set of keys and values. For example, if we start asking users for their birthday, the answer to that question will only be present in the dataset from that point on.
date answers
--------------------------------------------------------------
12/09/2019 {
"name": "Olivia",
"age": 4
}
12/10/2019 {
"name": "Stubby",
"age": 7
}
12/11/2019 {
"name": "Rosie",
"age": 10,
"birthday": "02/13/2009"
}
Querying JSON columns to get values
To get answers out of the JSON column, you'd use the standard SELECT statement that you may be familiar with.
SELECT
answers
FROM
formsort.answer_set
But if you merely select answers, you'll get a blob of text containing the JSON object, which isn't easy to use for direct analysis:
{
"name": "Olivia",
"age": 4
}
{
"name": "Stubby",
"age": 7
}
{
"name": "Rosie",
"age": 10,
"birthday": "02/13/2009"
}
To reach into answers you'll want to use a function called json_extract_path_text instead, telling the database which key we want from answers using the second parameter, in this case, 'name'.
SELECT
json_extract_path_text(answers, 'name') AS name
FROM
formsort.answer_set
name
----------
Olivia
Stubby
Rosie
You can also select multiple answers out at once by repeating the fields you'd like to select out.
SELECT
json_extract_path_text(answers, 'name') AS name,
json_extract_path_text(answers, 'age') AS age
FROM
formsort.answer_set
name age
-------------------
Olivia 4
Stubby 7
Rosie 10
Omitting rows without particular answer values
As mentioned earlier, when new answers are collected in a flow, they will only appear in the answers column from that point on. The behavior of json_extract_path_text is to return the empty string when a value is missing, so we can use that in a WHERE clause to filter out rows where a particular answer is not set.
SELECT
json_extract_path_text(answers, 'birthday') AS name,
FROM
formsort.answer_set
WHERE json_extract_path_text(answers, 'birthday') != ''
birthday
------------
02/13/2009
For more detail, see the Redshift documentation on JSON_EXTRACT_PATH_TEXT