A quick dive into the importance of pivot tables in data analysis and how to achieve it in PostgreSQL

Pivot tables are a critical tool in data analysis as they provide a way to easily summarize, organize and present large amounts of data in a meaningful and understandable way. By rearranging rows into columns and aggregating data, pivot tables allow data analysts to quickly identify patterns, trends, and relationships in the data.

In PostgreSQL, the crosstab function can be used to create pivot tables and perform same data summarization tasks. With the crosstab function, analysts can easily manipulate the structure of the data to understand better the information it contains thereby greatly enhancing the efficiency and effectiveness of data analysis.

What is Pivot in SQL and why do we use it?

A pivot in SQL allows one to take data that is organized in one way, with rows and columns, and re-organize it into a different structure where some of the rows become columns and some of the columns become rows. This can be useful for creating summary or cross-tabulation reports from data stored in a relational database, allowing one to view the data in a desired format.

What is crosstab function?

The crosstab function in PostgreSQL is a tool that can be used to create pivot tables from data stored in a relational database. It can be used to rearrange or reshape your data in a meaningful way so that it’s easier to analyze or understand. It takes one or two SQL queries as input and returns a table with one column for each unique value in a specified column and one row for each unique combination of values in a specified row and column.

How to enable the crosstab function?

The crosstab function is part of the tablefunc extension in PostgreSQL. To use the crosstab function in your SQL queries, you must first enable the tablefunc extension by executing the following SQL command:

CREATE EXTENSION IF NOT EXISTS tablefunc;

This command will check if the tablefunc extension is already installed, and if not, it will install it. Once the tablefunc extension is installed, you can start using the crosstab() function in your SQL queries. It’s important to remember that the crosstab function is not available by default in PostgreSQL and requires the tablefunc extension to be installed and enabled.

How does the crosstab function work?

As discussed above, the crosstab function in PostgreSQL works by taking one or two SQL SELECT commands as parameters. This blog covers the one which takes two SELECT commands as parameters.

crosstab(text source_sql, text category_sql)

The first parameter source_sql must return at least 3 columns.

  1. The first column is used as the identifier for each row in the pivot table.
  2. The second column represents the categories in the pivot table, and it’s important to note that the values of this column will expand into many columns in the pivot table.
  3. The third column represents the value to be assigned to each cell of the pivot table.

The second parameter category_sql should return all the distinct values of the second column in the source_sql in the same order they are going to display as columns in the pivot table.

Let’s see the working of the crosstab function based on a problem statement.

Problem Statement

Transform the following input table

into the following output table. (Need to enclose negative values in round brackets)

**Note:** The PostgreSQL client that I am using here is the Postgres terminal itself.
          So the commands to set up the database and the table will be based on the same.

Setting up Database

--Create a database named sales---
CREATE DATABASE sales;

--Once the database is created, run the following command to access the database--
\\c sales

Setting up a table and inserting data

Save the below SQL script in a .sql file. Let’s call the file crosstab.sql.

--Drop, if any table exists with the name sales_data--
drop table sales_data;

--Create the table named sales_data containing the monthly sales data of different customers--
create table sales_data
    (
        sales_date      date,
        customer_id     varchar(30),
        amount          varchar(30)
    );

--Insert data into the table--
insert into sales_data values (to_date('03-Jan-2021','dd-mon-yyyy'), 'Cust-1', '50$');
insert into sales_data values (to_date('04-Jan-2021','dd-mon-yyyy'), 'Cust-1', '150$');
insert into sales_data values (to_date('05-Jan-2021','dd-mon-yyyy'), 'Cust-1', '50$');
insert into sales_data values (to_date('20-Dec-2021','dd-mon-yyyy'), 'Cust-1', '225$');
insert into sales_data values (to_date('01-Jan-2021','dd-mon-yyyy'), 'Cust-2', '200$');
insert into sales_data values (to_date('02-Jan-2021','dd-mon-yyyy'), 'Cust-2', '100$');
insert into sales_data values (to_date('03-Jan-2021','dd-mon-yyyy'), 'Cust-2', '95$');
insert into sales_data values (to_date('06-Feb-2021','dd-mon-yyyy'), 'Cust-2', '-150$');
insert into sales_data values (to_date('07-Feb-2021','dd-mon-yyyy'), 'Cust-2', '-100$');
insert into sales_data values (to_date('09-Feb-2021','dd-mon-yyyy'), 'Cust-2', '-50$');
insert into sales_data values (to_date('11-Mar-2021','dd-mon-yyyy'), 'Cust-3', '80$');
insert into sales_data values (to_date('10-Apr-2021','dd-mon-yyyy'), 'Cust-3', '100$');
insert into sales_data values (to_date('04-May-2021','dd-mon-yyyy'), 'Cust-3', '160$');
insert into sales_data values (to_date('08-Jun-2021','dd-mon-yyyy'), 'Cust-3', '30$');
insert into sales_data values (to_date('02-Jul-2021','dd-mon-yyyy'), 'Cust-3', '-10$');
insert into sales_data values (to_date('14-Dec-2021','dd-mon-yyyy'), 'Cust-3', '170$');
insert into sales_data values (to_date('07-Aug-2021','dd-mon-yyyy'), 'Cust-4', '-3$');
insert into sales_data values (to_date('01-Sep-2021','dd-mon-yyyy'), 'Cust-4', '-15$');
insert into sales_data values (to_date('01-Oct-2021','dd-mon-yyyy'), 'Cust-4', '-1$');
insert into sales_data values (to_date('10-Mar-2021','dd-mon-yyyy'), 'Cust-5', '45$');
insert into sales_data values (to_date('14-Nov-2021','dd-mon-yyyy'), 'Cust-5', '-20$');
insert into sales_data values (to_date('20-Dec-2021','dd-mon-yyyy'), 'Cust-5', '100$');

Let’s go back to the PostgreSQL terminal and after accessing the database, run the following command to run the crosstab.sql script to create the table sales_data and insert data into it.

\\i 'crosstab.sql'

Note: For running the script, make sure you use the absolute path of the.sql file.

Now let’s display the table sales_data

-- Display all the data from the sales_data table--
SELECT * FROM sales_data;

The table obtained will be the input table.

 sales_date | customer_id | amount 
------------+-------------+--------
 2021-01-03 | Cust-1      | 50$
 2021-01-04 | Cust-1      | 150$
 2021-01-05 | Cust-1      | 50$
 2021-12-20 | Cust-1      | 225$
 2021-01-01 | Cust-2      | 200$
 2021-01-02 | Cust-2      | 100$
 2021-01-03 | Cust-2      | 95$
 2021-02-06 | Cust-2      | -150$
 2021-02-07 | Cust-2      | -100$
 2021-02-09 | Cust-2      | -50$
 2021-03-11 | Cust-3      | 80$
 2021-04-10 | Cust-3      | 100$
 2021-05-04 | Cust-3      | 160$
 2021-06-08 | Cust-3      | 30$
 2021-07-02 | Cust-3      | -10$
 2021-12-14 | Cust-3      | 170$
 2021-08-07 | Cust-4      | -3$
 2021-09-01 | Cust-4      | -15$
 2021-10-01 | Cust-4      | -1$
 2021-03-10 | Cust-5      | 45$
 2021-11-14 | Cust-5      | -20$
 2021-12-20 | Cust-5      | 100$

Now let’s create the output table in a step-by-step manner.

Step 1 – Generate a table that displays the monthly amount of sales for each customer

source_sql query

SELECT customer_id, to_char(sales_date, 'MM-YYYY') AS sales_month, 
SUM(CAST(REPLACE(amount,'$', '') as integer)) as amount 
FROM sales_data GROUP BY customer_id, sales_month ORDER BY 1,2;

In the source_sql, I have changed the sales_date format to ‘MM-YYYY’ with a new alias sales_month. The reason for doing so is to sort the same in ascending order of month and year. Also removed the ‘$’ symbol from the amount and type cast to the integer type in order to find the monthly amount of each customer.

Here the first column customer_id is used as the identifier for each row in the pivot table, the second column sales_month represents the categories in the pivot table and the third column amount represents the value to be assigned to each cell of the pivot table.

It’s mandatory that the source_sql always include order by 1,2 clause so that the same row names customer_id with their corresponding categories sales_month (sorted in ascending order) are grouped together.

 customer_id | sales_month | amount 
-------------+-------------+--------
 Cust-1      | 01-2021     |    250
 Cust-1      | 12-2021     |    225
 Cust-2      | 01-2021     |    395
 Cust-2      | 02-2021     |   -300
 Cust-3      | 03-2021     |     80
 Cust-3      | 04-2021     |    100
 Cust-3      | 05-2021     |    160
 Cust-3      | 06-2021     |     30
 Cust-3      | 07-2021     |    -10
 Cust-3      | 12-2021     |    170
 Cust-4      | 08-2021     |     -3
 Cust-4      | 09-2021     |    -15
 Cust-4      | 10-2021     |     -1
 Cust-5      | 03-2021     |     45
 Cust-5      | 11-2021     |    -20
 Cust-5      | 12-2021     |    100

Step 2 – Display all the distinct months with year in the same order they display on the pivot table

category_sql query

SELECT DISTINCT to_char(sales_date, 'MM-YYYY') sales_month FROM sales_data ORDER BY 1;

There is a condition that while using  category_sql in the crosstab function, they should be ordered in a specific way as we display categories in the pivot table.

So here the above query ****returns the list of sales_month in ascending order (from Jan-2021 to Dec-2021) in the same order as we are planning to display month and year in the pivot table to achieve the given output.

sales_month 
-------------
 01-2021
 02-2021
 03-2021
 04-2021
 05-2021
 06-2021
 07-2021
 08-2021
 09-2021
 10-2021
 11-2021
 12-2021

Step 3 – Generate the pivot table using crosstab function

SQL query using crosstab()

SELECT * FROM crosstab(
 'SELECT customer_id, 
        to_char(sales_date, ''MM-YYYY'') AS sales_month, 
        SUM(CAST(REPLACE(amount,''$'', '''') as integer)) as amount 
  FROM sales_data 
  GROUP BY customer_id, sales_month 
  ORDER BY 1,2', 
 'SELECT DISTINCT to_char(sales_date, ''MM-YYYY'') sales_month 
  FROM sales_data 
  ORDER BY sales_month ASC') 
 AS ("Customer" VARCHAR, 
     "Jan-2021" BIGINT, "Feb-2021" BIGINT, "Mar-2021" BIGINT, "Apr-2021" BIGINT, 
     "May-2021" BIGINT, "Jun-2021" BIGINT, "Jul-2021" BIGINT, "Aug-2021" BIGINT, 
     "Sep-2021" BIGINT, "Oct-2021" BIGINT, "Nov-2021" BIGINT, "Dec-2021" BIGINT);

Here the crosstab function is invoked in the FROM clause of the SELECT statement. Also, we need to mention the column names with their data types that will go into the pivot table.

In the above query, I have mapped the column customer_id to the column Customer and each distinct value of sales_month to the column name with the format ‘Mon-YYYY’ like ’01-2021’ to ‘Jan-2021’, ‘02-2021’ to ‘Feb-2021’, and so on.

The pivot table will be

 Customer | Jan-2021 | Feb-2021 | Mar-2021 | Apr-2021 | May-2021 | Jun-2021 | Jul-2021 | Aug-2021 | Sep-2021 | Oct-2021 | Nov-2021 | Dec-2021 
----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------
 Cust-1   |      250 |          |          |          |          |          |          |          |          |          |          |      225
 Cust-2   |      395 |     -300 |          |          |          |          |          |          |          |          |          |         
 Cust-3   |          |          |       80 |      100 |      160 |       30 |      -10 |          |          |          |          |      170
 Cust-4   |          |          |          |          |          |          |          |       -3 |      -15 |       -1 |          |         
 Cust-5   |          |          |       45 |          |          |          |          |          |          |          |      -20 |      100

In the above table, in each row, the Customer acts as a unique identifier with 12 separate columns to represent 12 months where each column holds the monthly amount of each customer.

Step 4 – Add 0 to columns having NULL values

WITH 
monthly_sales AS (
SELECT * FROM crosstab(
 'SELECT customer_id, 
        to_char(sales_date, ''MM-YYYY'') AS sales_month, 
        SUM(CAST(REPLACE(amount,''$'', '''') as integer)) as amount 
  FROM sales_data 
  GROUP BY customer_id, sales_month 
  ORDER BY 1,2', 
 'SELECT DISTINCT to_char(sales_date, ''MM-YYYY'') sales_month 
  FROM sales_data 
  ORDER BY sales_month ASC') 
 AS ("Customer" VARCHAR, 
     "Jan-2021" BIGINT, "Feb-2021" BIGINT, "Mar-2021" BIGINT, "Apr-2021" BIGINT, 
     "May-2021" BIGINT, "Jun-2021" BIGINT, "Jul-2021" BIGINT, "Aug-2021" BIGINT, 
     "Sep-2021" BIGINT, "Oct-2021" BIGINT, "Nov-2021" BIGINT, "Dec-2021" BIGINT)
),
final_data AS (
SELECT "Customer", 
        coalesce("Jan-2021",0) AS "Jan-2021", 
        coalesce("Feb-2021",0) AS "Feb-2021", 
        coalesce("Mar-2021",0) AS "Mar-2021", 
        coalesce("Apr-2021",0) AS "Apr-2021", 
        coalesce("May-2021",0) AS "May-2021", 
        coalesce("Jun-2021",0) AS "Jun-2021", 
        coalesce("Jul-2021",0) AS "Jul-2021", 
        coalesce("Aug-2021",0) AS "Aug-2021", 
        coalesce("Sep-2021",0) AS "Sep-2021", 
        coalesce("Oct-2021",0) AS "Oct-2021", 
        coalesce("Nov-2021",0) AS "Nov-2021", 
        coalesce("Dec-2021",0) AS "Dec-2021" 
FROM monthly_sales)
SELECT * FROM final_data;

Here I have added the crosstab function SQL command within a CTE to apply further changes over the pivot table giving an alias monthly_sales. In the table monthly_sales, I have applied COALESCE function to each month column to add a 0 value if any of them hold a NULL value and named the resulting table as final_data.

 Customer | Jan-2021 | Feb-2021 | Mar-2021 | Apr-2021 | May-2021 | Jun-2021 | Jul-2021 | Aug-2021 | Sep-2021 | Oct-2021 | Nov-2021 | Dec-2021 
----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------
 Cust-1   |      250 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |      225
 Cust-2   |      395 |     -300 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0
 Cust-3   |        0 |        0 |       80 |      100 |      160 |       30 |      -10 |        0 |        0 |        0 |        0 |      170
 Cust-4   |        0 |        0 |        0 |        0 |        0 |        0 |        0 |       -3 |      -15 |       -1 |        0 |        0
 Cust-5   |        0 |        0 |       45 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |      -20 |      100

Step 5 – Adding the $ symbol to all the amounts by enclosing negative values in round brackets ()

WITH 
monthly_sales AS (
SELECT * FROM crosstab(
 'SELECT customer_id, 
        to_char(sales_date, ''MM-YYYY'') AS sales_month, 
        SUM(CAST(REPLACE(amount,''$'', '''') as integer)) as amount 
  FROM sales_data 
  GROUP BY customer_id, sales_month 
  ORDER BY 1,2', 
 'SELECT DISTINCT to_char(sales_date, ''MM-YYYY'') sales_month 
  FROM sales_data 
  ORDER BY sales_month ASC') 
 AS ("Customer" VARCHAR, 
     "Jan-2021" BIGINT, "Feb-2021" BIGINT, "Mar-2021" BIGINT, "Apr-2021" BIGINT, 
     "May-2021" BIGINT, "Jun-2021" BIGINT, "Jul-2021" BIGINT, "Aug-2021" BIGINT, 
     "Sep-2021" BIGINT, "Oct-2021" BIGINT, "Nov-2021" BIGINT, "Dec-2021" BIGINT)
),
final_data AS (
SELECT "Customer", 
        coalesce("Jan-2021",0) AS "Jan-2021", 
        coalesce("Feb-2021",0) AS "Feb-2021", 
        coalesce("Mar-2021",0) AS "Mar-2021", 
        coalesce("Apr-2021",0) AS "Apr-2021", 
        coalesce("May-2021",0) AS "May-2021", 
        coalesce("Jun-2021",0) AS "Jun-2021", 
        coalesce("Jul-2021",0) AS "Jul-2021", 
        coalesce("Aug-2021",0) AS "Aug-2021", 
        coalesce("Sep-2021",0) AS "Sep-2021", 
        coalesce("Oct-2021",0) AS "Oct-2021", 
        coalesce("Nov-2021",0) AS "Nov-2021", 
        coalesce("Dec-2021",0) AS "Dec-2021" 
FROM monthly_sales)
SELECT "Customer", 
CASE WHEN "Jan-2021" < 0 THEN CONCAT('(', "Jan-2021" * -1, ')', '$') ELSE CONCAT("Jan-2021",'$') END "Jan-2021",
CASE WHEN "Feb-2021" < 0 THEN CONCAT('(', "Feb-2021" * -1, ')', '$') ELSE CONCAT("Feb-2021",'$') END "Feb-2021",                                                        
CASE WHEN "Mar-2021" < 0 THEN CONCAT('(', "Mar-2021" * -1, ')', '$') ELSE CONCAT("Mar-2021",'$') END "Mar-2021",                                                        
CASE WHEN "Apr-2021" < 0 THEN CONCAT('(', "Apr-2021" * -1, ')', '$') ELSE CONCAT("Apr-2021",'$') END "Apr-2021",                                                        
CASE WHEN "May-2021" < 0 THEN CONCAT('(', "May-2021" * -1, ')', '$') ELSE CONCAT("May-2021",'$') END "May-2021",                                                        
CASE WHEN "Jun-2021" < 0 THEN CONCAT('(', "Jun-2021" * -1, ')', '$') ELSE CONCAT("Jun-2021",'$') END "Jun-2021",                                                        
CASE WHEN "Jul-2021" < 0 THEN CONCAT('(', "Jul-2021" * -1, ')', '$') ELSE CONCAT("Jul-2021",'$') END "Jul-2021",                                                        
CASE WHEN "Aug-2021" < 0 THEN CONCAT('(', "Aug-2021" * -1, ')', '$') ELSE CONCAT("Aug-2021",'$') END "Aug-2021",                                                        
CASE WHEN "Sep-2021" < 0 THEN CONCAT('(', "Sep-2021" * -1, ')', '$') ELSE CONCAT("Sep-2021",'$') END "Sep-2021",                                                        
CASE WHEN "Oct-2021" < 0 THEN CONCAT('(', "Oct-2021" * -1, ')', '$') ELSE CONCAT("Oct-2021",'$') END "Oct-2021",                                                        
CASE WHEN "Nov-2021" < 0 THEN CONCAT('(', "Nov-2021" * -1, ')', '$') ELSE CONCAT("Nov-2021",'$') END "Nov-2021",                                                        
CASE WHEN "Dec-2021" < 0 THEN CONCAT('(', "Dec-2021" * -1, ')', '$') ELSE CONCAT("Dec-2021",'$') END "Dec-2021"
FROM final_data;

In the above SQL query, for each month column I have applied a CASE statement to find negative values, enclose them in round brackets and finally attach the ‘$’ symbol to all the values using the CONCAT function. To remove the negative sign, multiplied each negative value by -1. Finally, we got the table as per the output displayed above.

 Customer | Jan-2021 | Feb-2021 | Mar-2021 | Apr-2021 | May-2021 | Jun-2021 | Jul-2021 | Aug-2021 | Sep-2021 | Oct-2021 | Nov-2021 | Dec-2021 
----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------
 Cust-1   | 250$     | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 225$
 Cust-2   | 395$     | (300)$   | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$
 Cust-3   | 0$       | 0$       | 80$      | 100$     | 160$     | 30$      | (10)$    | 0$       | 0$       | 0$       | 0$       | 170$
 Cust-4   | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | (3)$     | (15)$    | (1)$     | 0$       | 0$
 Cust-5   | 0$       | 0$       | 45$      | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | 0$       | (20)$    | 100$

Advantages:

  1. Easy to use: The crosstab function is easy to use and understand.
  2. Dynamic columns: Crosstab function allows you to dynamically generate columns based on the data, which makes it easy to create pivot tables with a variable number of columns.
  3. Consistent data: With the crosstab function, it’s easy to maintain consistency in data, by specifying the values in the order you want them to appear in the pivot table.
  4. Speed: The crosstab function is relatively fast as it doesn’t require any complex JOINs or subqueries, which makes it suitable for large datasets.

Disadvantages:

  1. Limited flexibility: The crosstab function is limited in terms of the types of calculations and aggregations that can be performed.
  2. Not suitable for a large number of columns: Crosstab function is not suitable when the number of columns is large or not known in advance.
  3. Limited compatibility: The crosstab function is only available in PostgreSQL, and not all other databases support it.
  4. Error-prone: Since the crosstab function is based on the order of values specified in the query, it’s easy to make mistakes and get unexpected results if the order is not specified correctly.

Conclusion

In summary, the crosstab function in PostgreSQL is an efficient tool for creating pivot tables and cross-tabulation reports, which allow users to re-arrange data in a more meaningful way, making it easier to identify patterns and trends. To utilize the function, tablefunc extension must be installed and enabled, and a good understanding of SQL and the underlying data is required as it can be more complex than a standard pivot function. Using the crosstab function, users can gain a deeper insight into their data and make better decisions.

Thanks for reading and have fun with SQL!