Performance Testing of PostgreSQL with JSON Data: 1 to 1000 Concurrency with Data up to 50 Million Rows

Chalindu Kodikara
9 min readJun 2, 2024

--

In an internal project at WSO2 Choreo [1], we conducted a performance test to evaluate the effectiveness of using a JSON data store. Our goal was to determine if PostgreSQL performs well with JSON data.

We conducted performance tests ranging from a concurrency of 1 to 1000 for both SELECT and INSERT queries. Additionally, we compared the results using different index types: GIN jsonb_path_ops, GIN jsonb_ops, and BTree. This article aims to present our findings to benefit others working with similar setups.

1. Introduction

The goal of this test was to evaluate the query performance of a database with a large data volume of approximately 50 million records and to identify its limitations. Specifically, we aimed to determine the maximum data volume the database can handle while keeping search query performance within a 1-second threshold. Additionally, we tested the database under concurrent access scenarios, scaling up to 1000 concurrent users.

2. Test Environment

# Azure Virtual Machine was used to connect to the database server and do the performance testing.

- Hardware:

  • 4 cores, 16 GiB RAM
  • Region — East US 2

- Software:

  • OS — Linux
  • PostgreSQL CLI (used for insert/query data for evaluation)

# PostgreSQL Azure Instance (The DB instance is chosen to be equivalent to a production database.)

- Hardware:

  • 8 cores, 32 GiB RAM
  • Region — East US 2

- Software:

  • Version 16.1

3. Test Plan

Our test plan focused on evaluating how different indexing strategies impact query efficiency and data-handling capabilities with a high volume of data. We created separate databases for each index type: non-indexed, GIN (Generalized Inverted Index), and B-tree. This setup enabled us to incrementally populate the data and assess the timings for both queries and inserts.

3.1 Data Generation/Insertion

  1. We created four databases within the same DB server instance to store the generated data, each using a different indexing type. The indexing was applied only to the metadata section in the JSON structure, which is a single-level nested structure (Appendix A).
  2. Create a DB table (called kind)
  • Only 1 column (called ‘value’) in each database
  • Column type — JSONB
  • No primary key

create table public.kind (

value JSONB

);

Initial data generation and insertion to the DB was done through a Python program.

Python was selected for the following reasons;

1. Easy to setup

2. Availability of data generation/mocking frameworks

3. Less time for development

4. Won’t measure the time taken for the initial insertions

The experiments will encompass varying row counts, including 1 million, 10 million, 20 million, and 50 million records. This comprehensive range will enable a thorough evaluation of the database’s performance under different data volumes and indexing strategies.

3.2 PostgreSQL Indexing Types for JSONB Data

  1. GIN (Generalized Inverted Index) [2]

GIN indexing in PostgreSQL is tailored for indexing composite types like arrays and JSONB data. When applied to JSONB columns, GIN indexes facilitate efficient querying based on criteria such as containment, existence, and specific key-value pairs. However, GIN indexes do not support comparisons like greater than (>) or less than (<). There are two types of GIN indexes.

I. Jsonb_ops
This type creates independent index items for each key and value in the data.

CREATE INDEX gin_with_jsonb_ops ON kind USING GIN ((value->'metadata') jsonb_ops); (‘value’ is the column name.)

II. jsonb_path_ops
This type creates index items solely for each value in the data.

CREATE INDEX gin_with_jsonb_path_ops ON kind USING GIN ((value->’metadata’) jsonb_path_ops);

Disadvantages of jsonb_path_ops over jsonb_ops:

  • Key existence check is not possible.

Advantages of jsonb_path_ops over jsonb_ops:

  • The index is typically smaller.
  • Search operations are generally faster.

Example: JSON — {“foo”: {“bar”: “baz”}},

  • In jsonb_path_ops, a single index item would be generated, combining all three elements: foo, bar, and baz, into the hash value. Consequently, a containment query searching for this structure would result in a highly specific index search. However, there’s no means to determine whether foo appears as a key.
  • On the contrary, a jsonb_ops index would create three distinct index items representing foo, bar, and baz individually. Subsequently, for a containment query, it would seek rows containing all three of these items.

2. BTree [3]

B-tree indexes in PostgreSQL are a type of index structure ideal for ordered data and range queries. While traditionally used for scalar data types such as integers and strings, they can also effectively index JSONB data. Unlike GIN indexes, B-tree indexes support comparisons like <, <=, >, >=, enhancing their versatility for a wider range of query types.

CREATE INDEX btree_proj ON kind USING BTREE ((value->’metadata’->>’projectName’));

CREATE INDEX btree_name ON kind USING BTREE ((value->’metadata’->>’name’));

CREATE INDEX btree_comp ON kind USING BTREE ((value->’metadata’->>’componentName’));

CREATE INDEX btree_version ON kind USING BTREE ((value->’metadata’->>’deploymentTrackId’));

3.3 Data Querying

  • During querying, only the metadata section of the JSON was accessed. Refer to the appendix for a detailed description of the JSON structure.
  • For each experiment, a sample count of 20 was utilized. The results were then analyzed to present the minimum, maximum, average, and percentile values of query execution time.

It’s crucial to employ the correct syntax for querying with both GIN and B-tree indexes, as the query structure differs between them.

BTree

SELECT * FROM kind WHERE value->’metadata’->>’name’=’wso2_chalindu’ and value->’metadata’->>’projectName’=’choreo’;

GIN

SELECT * FROM kind WHERE (value->’metadata’) @> {“name”: “wso2_chalindu”, “projectName”: “choreo”}’;

4. Results

Important: The reported timings reflect only the time taken on the PostgreSQL database side. Any network latency was disregarded. Exact execution times were obtained using ‘\timing’ in the PostgreSQL client.

For the tests, we employed the following queries:

Query 1: This retrieves more than one JSON resource (i.e., more than one row in the database).

SELECT * FROM kind WHERE value->’metadata’->>’projectName’=’choreo_project’ and value->>’kind’=’Component’;

Query 2: This retrieves exactly one JSON resource (only one row in the database).

SELECT * FROM kind WHERE value->’metadata’->>’name’=’component_name’ and value->’metadata’->>’projectName’=’choreo’;

4.1 Concurrency 1

Figure 4.1.1: Results comparison for select query 1
Figure 4.1.2: Results comparison for select query 2

It’s evident that the execution time for select query 1 is slightly higher compared to select query 2. This variance could be attributed to the fact that select query 1 retrieves multiple JSON files from the database, whereas select query 2 only retrieves a single JSON file.

Conclusion 1: The select query execution times are reasonable, consistently below 100 ms for both GIN jsonb_path_ops and B-tree indexing types. Both indexing methods demonstrate commendable performance. Notably, B-tree indexing outperforms GIN jsonb_path_ops in our comparison.

Figure 4.1.3: Results comparison for insert query

Conclusion 2: The performance of insertion queries is excellent, with execution times consistently below 20 ms, even with indexing implemented. This indicates that PostgreSQL is a robust choice for storing JSON data, with no significant concerns regarding query performance.

4.1.1 Observation

We noted a consistent pattern of increased execution time for a specific select query with new query values. Initially, the execution time was relatively high; however, with subsequent executions using the same query, the time decreased noticeably.

1. 10 million data with an 8-core node.

  • New query execution with GIN (jsonb_path_ops) indexing
    - Initial execution ~= 25 ms
    - Afterwards ~= 19 ms
  • New query execution with BTree indexing
    - Initial execution ~= 25 ms
    - Afterwards ~= 14 ms

2. 20 million data with an 8-core node.

  • New query execution with GIN (jsonb_path_ops) indexing
    - Initial execution ~= 100 ms
    - Afterwards ~= 30 ms
  • New query execution with BTree indexing
    - Initial execution ~= 160 ms
    - Afterwards ~= 25 ms

3. 50 million data with an 8-core node.

  • New query execution with GIN (jsonb_path_ops) indexing
    - Initial execution ~= 5787 ms
    - Afterwards ~= 86 ms
  • New query execution with BTree indexing
    - Initial execution ~= 13329 ms
    - Afterwards ~= 33 ms

Note: The observed latency discrepancy, particularly on the 8-core node, was resolved upon upgrading to a higher-tier node. This suggests that the significant difference in execution times between initial and subsequent executions was likely due to a bottleneck in the lower-tier node. Therefore, it’s crucial to monitor execution times for individual queries in future work. It’s worth noting that this observation was not apparent when using a bash script to obtain results with a sample count of 20, hence it is not reflected in the above results.

4.2 Storage

  • Table size (without indexing)
    - 1 million row count ~= 700 MB
  • One JSON file size ~= 800 Bytes
Figure 4.1.2.1: Index Size Comparison without Data

While B-tree indexing outperforms GIN jsonb_path_ops in SELECT queries, it’s worth noting that its index size is larger. However, this discrepancy is negligible (~1.5 GB for 50 million rows), and therefore not a significant concern.

4.3 Load Test Results (50–1000 concurrency)

We conducted load tests using JMeter following the guidelines outlined in [4]. Each load test ran for 10 minutes, and results were recorded from the JMeter report, capturing client-side latency.

We used Azure standard_D_v3 series from 8–64 core nodes for this test.

Figure 4.3.1: Different Azure VM types

For select queries, we utilized Query 2 as mentioned previously. Additionally, we provided the query parameter (project name) as a CSV config parameter to JMeter to introduce more unique queries.

4.3.1 SELECT Query Performance

Figure 4.3.2: 8 cores vs 64 cores for 20 million rows
Figure 4.3.3: 64 cores node performance for 50 million rows
Figure 4.3.4: BTree performance for different nodes

Conclusion: B-tree indexing outperforms GIN jsonb_path_ops, particularly evident under heavy concurrency. Even at 1000 concurrency with 50 million data, the median execution time remains below 500 ms for the B-tree indexing type, while it increases to 2000 ms for GIN jsonb_path_ops.

4.3.2 INSERT Query Performance

Figure 4.3.5: Insert query performance for 1000 concurrency

Conclusion: Insertion performance remains robust even under heavy concurrency. For instance, even with 1000 concurrent operations and a dataset of 50 million records, insertion times consistently stay below 150 ms.

5. Summary

Overall, the comprehensive evaluation of PostgreSQL’s performance with JSONB data reveals its robustness as a storage solution. Particularly noteworthy is the exemplary performance demonstrated by the B-tree indexing type, underscoring PostgreSQL’s suitability for handling JSON data-intensive workloads. This empirical evidence positions PostgreSQL as a compelling alternative to MongoDB for organizations seeking a reliable, feature-rich database management system.

For those interested in replicating or further exploring our findings, the scripts utilized for inserting the initial dataset and conducting performance tests are available in the following GitHub repository: https://github.com/chalindukodikara/choreo-api-perf-test. These resources offer a valuable starting point for evaluating PostgreSQL’s capabilities in diverse environments and scenarios.

Appendix

A. Example JSON file (~800 Bytes)

B. Important Queries

  1. Get Table Size

SELECT tablename, pg_size_pretty(pg_total_relation_size(tablename)) AS total, pg_size_pretty(pg_relation_size(tablename)) AS table, pg_size_pretty(pg_indexes_size(tablename)) AS index FROM (SELECT (‘“‘ || table_schema || ‘“.”’ || table_name || ‘“‘) AS tablename FROM information_schema.tables WHERE table_name = ‘kind’) AS subquery;

2. Get Index Details

SELECT indexname AS “Index Name”, indexdef AS “Index Definition”

FROM pg_indexes

WHERE tablename = ‘kind’;

3. Check Table Existence

SELECT 1 FROM information_schema.tables WHERE table_name = ‘kind’;

References

[1] https://wso2.com/choreo/

[2] https://www.postgresql.org/docs/12/gin-intro.html

[3] https://medium.com/@thegalang/indexing-in-postgresql-and-applying-it-to-jsonb-c99ecf50a443

[4] https://chalindu.medium.com/run-jmeter-test-on-a-linux-vm-on-azure-step-by-step-guide-with-linux-commands-part-1-setup-82ef42d5ed66

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Your feedback and comments are highly valued, so please don’t hesitate to share your thoughts. I’m open to suggestions for improvement and eager to engage in further discussions on this topic.

For more insights and to connect with me, you can find me on LinkedIn at Chalindu Kodikara and on X at ChalinduSL.

Let’s stay connected and drive innovation together!

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

--

--

Chalindu Kodikara

Computer Science and Engineering Graduate, Software Engineer