Cool and interesting Postgres features that you might not know

Published on

Here is a list of interesting features about Postgres that I think are good to know about...

Table of Contents

Use TABLESAMPLE to get a random sample of your database

Since PostgreSQL 9.5 we've had access to TABLESAMPLE. This is used with SELECT queries to return a randomised subset of the results. This is really useful if you want to have a glance at a small randomised selection of data from a huge dataset.

There are two ways to randomly select the data - SYSTEM (faster, returns random pages from disk) or BERNOULLI ( better random distribution between rows, does a sequential scan on whole table and randomly returns rows).

Example of using TABLESAMPLE in postgres:

-- return approx 20% of rows
SELECT *
FROM your_table TABLESAMPLE SYSTEM(20)

-- return approx 5% of rows
SELECT *
FROM your_table TABLESAMPLE BERNOULLI(5)

If you need to seed the randomness, so it is repeatable with the same data you can use the REPEATABLE argument.

-- seed with '123', so this will always return same rows 
-- (Assuming no data was changed in the db)
-- returns 1% of rows
SELECT *
FROM your_table TABLESAMPLE SYSTEM (1) REPEATABLE (123);

Generate ranges of values with generate_series

Sometimes you need to generate ranges of values, such as [1, 2, 3, 4]. This can be useful for generating fake data.

SELECT *
FROM generate_series(1, 3); -- returns 1, 2, 3

Postgres will strip out unused joins with its join pruning feature

This isn't really a feature you should be looking to use, but it is interesting to know about it.

If you add a join to a query that isn't used (and would have no effect if you removed it from the query), then the query optimiser will silently strip it out.

For example in the next query we join table_b, but do not use it anywhere, so it will get pruned and if you inspected it with EXPLAIN you would see table_b is not read from.

SELECT A.*
FROM table_a A
         LEFT JOIN table_b B on A.id = B.fk

LATERAL JOINs

Since PostgreSQL 9.3 we have had the LATERAL JOIN type. Its a little awkward to explain...

They are used when you have subqueries or joins.

The subquery/join can reference columns from the list of columns defined before the FROM.

(A subquery (not lateral) looks like this: ... FROM (SELECT some_col FROM another_table) subquery...).

You do it by adding the word LATERAL before the subquery/join.

The example from the docs gives a trivial example.

-- Without lateral join:
SELECT *
FROM foo,
     bar
WHERE bar.id = foo.bar_id;
-- With ltaeral join:
SELECT *
FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;

That example should be quite clear, but it doesn't look very useful.

Here are a couple of more complex example (copy/pasted from official docs):

SELECT p1.id, p2.id, v1, v2
FROM polygons p1, polygons p2,
     LATERAL vertices(p1.poly) v1,
     LATERAL vertices(p2.poly) v2
WHERE (v1 <-> v2) < 10 AND p1.id != p2.id;

Example of a lateral join:

SELECT m.name
FROM manufacturers m
         LEFT JOIN LATERAL get_product_names(m.id) pname
ON true
WHERE pname IS NULL;

If you want to see it in action in a video: https://www.youtube.com/watch?v=icpPE5lKQIA

Know how to deal with dates and times easily in Postgres

Postgres has many options when it comes to working with dates, times and durations.

I'll show these with some examples - they are all pretty simple to understand.

Get all rows that were updated X days ago, with now() and subtracting ::interval:

SELECT *
FROM your_table
WHERE updated_at >= now() - '1 week'::interval

add and subtract from dates

You can use + or - to change dates.

For example date '2022-01-01' + integer '1' would return date '2022-01-02'.

Or you can specify the duration (which will return a timestamp here) date '2022-01-01' + interval '1 hour' returns timestamp '2022-01-01 01:00:00'.

Useful date/time functions to know

  • age()
  • now()
  • transaction_timestamp()
  • clock_timestamp() (the time right now - useful if used in transaction and want the time right now as opposed to when the transaction was created)
  • date_part(part, source), such as date_part('day', TIMESTAMP '2021-01-02 03:04:05') would return 02 (the day of month part).
  • current_date (a function, sometimes called a constant (as i did on first version of this article...))
  • current_time (with tz)
  • current_timestamp (with tz)
  • localtime (no tz)

You can store JSON data easily

You can store JSON data in Postgres. Here is how you can create a JSON column in a postgres table:

CREATE TABLE your_table
(
    id        serial NOT NULL PRIMARY KEY,
    meta_info json   NOT NULL
);

And here is how you insert JSON data:

INSERT INTO your_table (meta_info)
VALUES ('{ "hello": "this is a welcome message" }');

And you can query it like this:

SELECT meta_info - > 'hello' AS hello_msg
FROM your_table;

There are a few operators used with JSON data:

  • -> access an array item or object element
    • with an int, such as -> 0: get the first item in an array
    • with a string, such as -> 'hello': get an item from an object
  • ->> access an array item or object element, returning it as a string
    • with an int, such as ->> 0: get the first item in an array (returns as a string)
    • with a string, such as ->> 'hello': get an item from an object (returns as a string)
  • #> get JSON object at a path
    • For example '{"a": {"b":{"c": "foo"}}}'::json#>'{a,b}' will return {"c": "foo"}
  • #>> get JSON object at a path, returning it as text

JSONB (explained below) has some more operators such as ? (does key exist), - (delete key/value), || (concat two values into new jsonb)

You can use helper functions such as to_json()/to_jsonb(), array_to_json(), row_to_json() etc.

Difference between JSONB and JSON in PostgreSQL

  • JSON stores the data as text, exactly as you provide it in an INSERT query.
  • JSONB parses the JSON you provide, and stores it as binary data. This is the preferred way to store it in the database, and you should only store as JSON if you have specific requirements to do so.

JSONB strips all whitespace (it is stored as binary data), and the ordering of keys in objects is likely to change.

Ignoring the whitespace (which takes up a minimal amount of space), JSONB can take up more disk space, but the trade off is its flexibility and speed.

You can also add indexes to JSONB data.

JSONB is quicker to use in SELECT queries (the reason the keys change order is to increase access speed). But it is slightly slower to store as it has to be parsed.

Postgres Cluster Index

The Postgres CLUSTER command can be used to sort a table (where it is stored on disk) based on an index, to make accessing and reading that data quicker.

This increases performance if you are trying to access a range of values (or multiple rows that have the same indexed value), as they will be ordered on disk to be quicker to read and access.

The command is CLUSTER your_table_table USING your_index_name.

This will create a temporary table (which will mean it will take up more disk space - including space for all indexes as well as the index the cluster is built on).

Read more: https://www.postgresql.org/docs/current/sql-cluster.html

Different types of index

There are multiple types of indexes that you can use.

B-tree indexes

The default index - self balancing tree. You can do comparisons with this (>, >=, =, <, <=, BETWEEN, IN , IS NULL, IS NOT NULL and LIKE).

Unless you have learned about different indexes that you can use with Postgres, you have almost certainly mostly been using B-tree indexes.

You can't use a b-tree index on complex column data types - such as JSONB or array types.

CREATE INDEX your_hash_index 
    ON your_table_name;

GIN indexes

GIN indexes in Postgresql stands for Generalized Inverted iNdex. They were added in Postgres 8.2. If you can't index something with a b-tree then you will probably look at GIN indexes as they can index more complex data types.

GIN indexes are also used for full text search.

You can use comparison operators such as <@, @>, =, &&.

CREATE INDEX your_gin_index ON your_table_name
    USING gin(your_column_name);

Here is an example using to_tsvector function to create a GIN index with full text search:

CREATE INDEX your_gin_index_2 
    ON your_table_name
    USING GIN (to_tsvector('english', your_column_name));

Hash indexes

Hash indexes are quick and somewhat simple. They can only be compared with =.

Note: using a hash index will not reduce the size of your index. Its advantage is for speed.

How to create a hash index:

CREATE INDEX your_hash_index 
    ON your_table_name 
    USING HASH (your_column_name);

BRIN indexes

BRIN stands for Block Range INdexes.

BRIN indexes are smaller in size and quicker to update than B-Tree indexes.

They are useful when the indexed values are linear, such as dates. Using a BRIN index means you can easily select items close to each other in a range, as it stores the data in the same/close pages on disk.

It stores the min/max values for a range. This can have a significant reduction in the size of the index.

How to create a BRIN index:

CREATE INDEX your_brin_index ON your_table USING brin(updated_at_timestamp);

GiST Indexes

GiST stands for Generalized Search Tree. It is a lossy index (may return false positives in the index - but postgres always double checks the real row data before returning data). GiST indexes can be usef for full text (like GIN)

GiST indexes can have more comparisons such as <<, >>, ~=, <@, <<|, |>>.

Bloom index

A bloom index is a very space efficient way to index data. It can be easily and quickly used to confirm if an element is in the index (it does return false positives, but a very small %)

Example from the postgres docs, explaining that "The index is created with a signature length of 80 bits, with attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could have omitted the length, col1, and col2 specifications since those have the default values.":

CREATE INDEX your_bloom_index 
    ON your_Table 
    USING bloom (i1,i2,i3)
    WITH (length =80, col1=2, col2=2, col3=4);

Use ROW_NUMBER to add a the row/position counter to each result

ROW_NUMBER() is a window function that can be used to add the row number.

You use it with PARTITION BY to split up each rows (there can be cases where there are multiple rows with the same row number), and ORDER BY to set the order.

You can use it without partition by, to get the row id (notice there are 4 items, and the 4th item has a row_number of 4)

SELECT id,
       title,
       ROW_NUMBER() OVER (ORDER BY id)
FROM your_table;

Will return something like:

id,title,row_number
1,some-title,1
2,another-title,2
3,one-more,3
5,notice-the-gap,4

You can also use it to generate row numbers grouped (or partitioned) by a column.

The following example shows this - the row number is partitioned by the category_id, so each category_id will have its own row numbering.

SELECT id,
       title,
       category_id,
       ROW_NUMBER() OVER (
          PARTITION BY category_id
          ORDER BY
          title
        ) AS category_row_num
FROM your_table;
id,title,category_id,category_row_num
1,abc,1,1
4,def,1,2
3,aaa,2,1 // row num restarts at 1 for this category_id of 2
2,zzz,2,2

Change the border style in postgres command line output

There are different styles to the borders, such as:

\pset linestyle ascii
\pset border 0
id name    created     
-- ------- ----------
 1 Qwerty  2022-01-01
 2 Asdf    2022-02-02

or with borders

\pset linestyle ascii
\pset border 2
+----+--------+------------+
| id | name   | created    |
+----+--------+------------+
| 1  | Qwerty | 2022-01-01 |
| 2  | Asdf   | 2022-02-02 |
+----+--------+------------+

If you want to find out more, see https://database.guide/how-to-change-the-table-border-style-in-the-psql-query-results/

The different column data types

There is a full list of data types here: https://www.postgresql.org/docs/current/datatype.html

But I'll list some common ones

  • boolean - normally true or false
    • when storing, 1, yes, y, t are converted to true, 0, no, n, f are converted to false
    • when querying, t is converted to true, f is converted to false, and space ( ) is converted to null
  • text and strings
    • You have a few options, such as:
    • CHAR(n)
      • Pads out any strings to length n by appending spaces
      • This should never be used unless you really know why you need it
    • VARCHAR(n)
      • 'variable length char'.
      • Does not pad out strings smaller than n length
      • Unlikely to need this often - just use TEXT
    • TEXT
      • accepts any length
    • There is no advantage in 99% of cases to using anything but TEXT
  • numbers
    • integers, including:
      • SMALLINT, 2 bytes signed, range of -32,768 to 32,767
      • INT, 4 byte signed, range of -2,147,483,648 to 2,147,483,647
    • floats, including:
      • FLOAT(n) is a float with precision of n. Max of 8 bytes
      • REAL or FLOAT8 float with 4 byte floating point
      • NUMERIC(P,S) where P is the precision, S is the scale
        • precision is total number of significant digits (this can include both sides of the decimal point). scale is the num of decimal digits. For example 12.3456 has a precision of 6, and a scale of 4.
        • Integers can be imagined as a NUMERIC with a scale of 0
        • you can also do NUMERIC(P), and it defaults to a S (scale) of 0.
        • And you can do just NUMERIC (no P or S values) which sets P (precision) creates a "unconstrained numeric", with no set limits (up to what postgres can support)
        • You can also store +Infinity, -Infinity and NaN in numeric data columns. Set it like UPDATE your_table SET a_column = '-Infinity'
        • More info can be found here https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-numeric/
  • date/times
    • DATE - date only
    • TIME - time only
    • TIMESTAMP - date and time
    • TIMESTAMPTZ - date and time, with timezone
    • INTERVAL - durations of time
  • and many others
    • JSON, JSONB
    • MACADDR - for MAC addresses
    • INET for ipv4 addresses
    • UUID
    • And you can store any column as an array. You can do this by creating a table with columns such as some_column TEXT [] to create a column called 'some_column' that holds an array of strings

What the WAL is

If you watch any conference talk or maybe listen to Postgres podcasts, you will hear about the WAL all the time. It is the write-ahead-log. Once your postgres database has been in use for a while and starts building up a lot of data, you will probably encounter some issues with configuring it (and related things like teh VACUUM command).

The WAL (write ahead log) is a log of every update that happens in your database. If your database crashes, it can use this to recover data.

I won't go into it in any more depth here - I want to write a separate article about the WAL later on. You can read the official docs here: https://www.postgresql.org/docs/current/wal-intro.html

Every query is wrapped in a transaction

You might imagine that if you do a query such as SELECT * FROM your_table that there is no DB transaction involved. But really it gets an implicit one

Postgres will automatically add a BEGIN command at the start of the query, and a COMMIT at the end.

Recursive queries

You can do recursive queries. Here is an example from the Postgres docs that will sum numbers from 1 to 100.

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

See more at https://www.postgresql.org/docs/current/queries-with.html