Here is a list of interesting features about Postgres that I think are good to know about...
Table of Contents
- Use TABLESAMPLE to get a random sample of your database
- Generate ranges of values with generate_series
- Postgres will strip out unused joins with its join pruning feature
- LATERAL JOINs
- Know how to deal with dates and times easily in Postgres
- You can store JSON data easily
- Difference between JSONB and JSON in PostgreSQL
- Postgres Cluster Index
- Different types of index
- B-tree indexes
- GIN indexes
- Hash indexes
- BRIN indexes
- GiST Indexes
- Bloom index
- Use ROW_NUMBER to add a row/position counter to each result
- Change the border style in postgres command line output
- The different column data types
- What the WAL is
- Every query is wrapped in a transaction
- Recursive queries
Use TABLESAMPLE to get a random sample of your database
Since PostgreSQL 9.5 we've had access to TABLESAMPLE. This is used with SELECT queries to return a randomised subset of the results. This is really useful if you want to glance at a small randomised selection of data from a huge dataset.
There are two ways to randomly select the data - SYSTEM (faster, returns random pages from disk) or BERNOULLI (better random distribution between rows, does a sequential scan over the whole table and randomly returns rows).
Example of using TABLESAMPLE in Postgres:
-- return approx 20% of rows
SELECT *
FROM your_table TABLESAMPLE SYSTEM (20);
-- return approx 5% of rows
SELECT *
FROM your_table TABLESAMPLE BERNOULLI (5);
If you need to seed the randomness, so it is repeatable with the same data, you can use the REPEATABLE argument.
-- seed with '123', so this will always return same rows
-- (Assuming no data was changed in the db)
-- returns 1% of rows
SELECT *
FROM your_table TABLESAMPLE SYSTEM (1) REPEATABLE (123);
Generate ranges of values with generate_series
Sometimes you need to generate ranges of values, such as [1, 2, 3, 4]. This can be useful for generating fake data.
SELECT *
FROM generate_series(1, 3); -- returns 1, 2, 3
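generate_series also works with timestamps and an interval step, which is handy when generating fake date-based data. A minimal sketch (the alias day is just for illustration):
-- one row per day in January 2022
SELECT day::date
FROM generate_series(
        '2022-01-01'::timestamp,
        '2022-01-31'::timestamp,
        '1 day'::interval
    ) AS day;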
Postgres will strip out unused joins with its join pruning feature
This isn't really a feature you should be looking to use, but it is interesting to know about.
If you add a join to a query that isn't used (and would have no effect if you removed it from the query), then the query optimiser will silently strip it out. In practice Postgres can only prove the join is safe to remove for LEFT JOINs where the joined column is unique (e.g. the other table's primary key or a column with a unique constraint).
For example, in the next query we join table_b but do not use it anywhere, so it will get pruned, and if you inspected it with EXPLAIN you would see table_b is not read from.
SELECT A.*
FROM table_a A
         LEFT JOIN table_b B ON A.table_b_id = B.id -- B.id is unique and B is never referenced, so the join is removed
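If you want to verify it yourself, a quick sketch with EXPLAIN (assuming the same tables as above):
EXPLAIN
SELECT A.*
FROM table_a A
         LEFT JOIN table_b B ON A.table_b_id = B.id;
-- the resulting plan should only contain a scan of table_a; table_b never appears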
LATERAL JOINs
Since PostgreSQL 9.3 we have had the LATERAL JOIN type. It's a little awkward to explain...
They are used with subqueries and joins. A LATERAL subquery/join can reference columns from items that appear earlier in the FROM list.
(A normal, non-lateral subquery looks like this: ... FROM (SELECT some_col FROM another_table) subquery ...).
You use it by adding the word LATERAL before the subquery/join.
The docs give a trivial example:
-- Without lateral join:
SELECT *
FROM foo,
bar
WHERE bar.id = foo.bar_id;
-- With lateral join:
SELECT *
FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
That example should be quite clear, but it doesn't look very useful.
Here are a couple of more complex examples (copy/pasted from the official docs):
SELECT p1.id, p2.id, v1, v2
FROM polygons p1, polygons p2,
LATERAL vertices(p1.poly) v1,
LATERAL vertices(p2.poly) v2
WHERE (v1 <-> v2) < 10 AND p1.id != p2.id;
And an example of a LEFT JOIN LATERAL (also from the docs):
SELECT m.name
FROM manufacturers m
LEFT JOIN LATERAL get_product_names(m.id) pname
ON true
WHERE pname IS NULL;
If you want to see it in action in a video: https://www.youtube.com/watch?v=icpPE5lKQIA
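For a more practical example, here is a sketch of a "top N per group" query - the kind of thing LATERAL is usually reached for. It assumes categories and products tables (those names and columns are just for illustration):
-- for each category, fetch its 3 most expensive products
SELECT c.name AS category, p.name AS product, p.price
FROM categories c,
     LATERAL (SELECT name, price
              FROM products
              WHERE products.category_id = c.id
              ORDER BY price DESC
              LIMIT 3) p;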
Know how to deal with dates and times easily in Postgres
Postgres has many options when it comes to working with dates, times and durations.
I'll show these with some examples - they are all pretty simple to understand.
Get all rows that were updated in the last X days/weeks, using now() and subtracting an ::interval:
SELECT *
FROM your_table
WHERE updated_at >= now() - '1 week'::interval;
Add and subtract from dates
You can use + or - to change dates.
For example date '2022-01-01' + integer '1' would return date '2022-01-02'.
Or you can add a duration (which returns a timestamp here): date '2022-01-01' + interval '1 hour' returns timestamp '2022-01-01 01:00:00'.
Useful date/time functions to know
- age()
- now()
- transaction_timestamp()
- clock_timestamp() (the time right now - useful inside a transaction when you want the current time, as opposed to the time the transaction started)
- date_part(part, source), such as date_part('day', TIMESTAMP '2021-01-02 03:04:05'), which would return 2 (the day-of-month part)
- current_date (a function, sometimes called a constant (as I did in the first version of this article...))
- current_time (with tz)
- current_timestamp (with tz)
- localtime (no tz)
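Here is a quick sketch running a few of these together in one query:
SELECT now(),                                             -- transaction start time
       clock_timestamp(),                                 -- the time right now, even mid-transaction
       age(timestamp '2000-01-01'),                       -- interval between current_date and the given timestamp
       date_part('day', timestamp '2021-01-02 03:04:05'), -- 2
       current_date,
       localtime;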
You can store JSON data easily
You can store JSON data in Postgres. Here is how you can create a JSON column in a postgres table:
CREATE TABLE your_table
(
id serial NOT NULL PRIMARY KEY,
meta_info json NOT NULL
);
And here is how you insert JSON data:
INSERT INTO your_table (meta_info)
VALUES ('{ "hello": "this is a welcome message" }');
And you can query it like this:
SELECT meta_info -> 'hello' AS hello_msg
FROM your_table;
There are a few operators used with JSON data:
- -> accesses an array item or object element
  - with an int, such as -> 0: get the first item in an array
  - with a string, such as -> 'hello': get an item from an object
- ->> accesses an array item or object element, returning it as text
  - with an int, such as ->> 0: get the first item in an array (returned as text)
  - with a string, such as ->> 'hello': get an item from an object (returned as text)
- #> gets the JSON object at a path
  - For example '{"a": {"b":{"c": "foo"}}}'::json #> '{a,b}' will return {"c": "foo"}
- #>> gets the JSON object at a path, returning it as text
JSONB (explained below) has some more operators, such as ? (does a key exist), - (delete a key/value), and || (concatenate two values into a new jsonb value).
You can use helper functions such as to_json()/to_jsonb(), array_to_json(), row_to_json(), etc.
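A quick sketch of a few of the JSONB operators and helpers in one query (the column aliases are just for illustration):
SELECT '{"a": 1, "b": 2}'::jsonb ? 'b'        AS key_exists,   -- true
       '{"a": 1, "b": 2}'::jsonb - 'b'        AS key_removed,  -- {"a": 1}
       '{"a": 1}'::jsonb || '{"c": 3}'::jsonb AS concatenated, -- {"a": 1, "c": 3}
       to_jsonb(42)                           AS number_as_jsonb,
       row_to_json(row(1, 'hello'))           AS row_as_json;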
Difference between JSONB and JSON in PostgreSQL
- JSON stores the data as text, exactly as you provide it in an INSERT query.
- JSONB parses the JSON you provide and stores it as binary data. This is the preferred way to store it in the database, and you should only store it as JSON if you have specific requirements to do so.
JSONB strips all whitespace (it is stored as binary data), and the ordering of keys in objects is likely to change.
Ignoring the whitespace savings (which are minimal), JSONB can take up slightly more disk space than JSON, but the trade-off is its flexibility and speed.
You can also add indexes to JSONB data.
JSONB is quicker to use in SELECT queries (the reason the keys change order is to increase access speed). But it is slightly slower to store as it has to be parsed.
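Since JSONB can be indexed, here is a sketch of adding a GIN index to a JSONB column - assuming a meta_info column like the earlier example, but stored as jsonb rather than json:
CREATE INDEX your_table_meta_info_gin_idx
    ON your_table
    USING GIN (meta_info);
-- this supports containment queries such as:
-- SELECT * FROM your_table WHERE meta_info @> '{"hello": "this is a welcome message"}';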
Postgres Cluster Index
The Postgres CLUSTER command can be used to sort a table (where it is stored on disk) based on an index, to make accessing and reading that data quicker.
This increases performance if you are trying to access a range of values (or multiple rows that have the same indexed value), as they will be ordered on disk to be quicker to read and access.
The command is CLUSTER your_table_name USING your_index_name.
CLUSTER rewrites the table into a new file, so while it runs you temporarily need extra disk space - roughly the size of the table plus its indexes.
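A minimal sketch, assuming a table with a B-tree index on an updated_at column (the names are just for illustration):
CREATE INDEX your_table_updated_at_idx ON your_table (updated_at);
CLUSTER your_table USING your_table_updated_at_idx;
-- CLUSTER is a one-off sort; it isn't maintained as rows change, so re-run it periodically if needed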
Read more: https://www.postgresql.org/docs/current/sql-cluster.html
Different types of index
There are multiple types of indexes that you can use.
B-tree indexes
The default index type - a self-balancing tree. You can do comparisons with it (>, >=, =, <, <=, BETWEEN, IN, IS NULL, IS NOT NULL and LIKE).
Unless you have learned about different indexes that you can use with Postgres, you have almost certainly mostly been using B-tree indexes.
You can't use a b-tree index on complex column data types - such as JSONB or array types.
CREATE INDEX your_btree_index
    ON your_table_name (your_column_name);
GIN indexes
GIN stands for Generalized Inverted Index. GIN indexes were added in Postgres 8.2. If you can't index something with a B-tree then you will probably look at GIN indexes, as they can index more complex data types.
GIN indexes are also used for full text search.
You can use comparison operators such as <@, @>, =, and &&.
CREATE INDEX your_gin_index ON your_table_name
USING gin(your_column_name);
Here is an example using the to_tsvector function to create a GIN index for full text search:
CREATE INDEX your_gin_index_2
ON your_table_name
USING GIN (to_tsvector('english', your_column_name));
Hash indexes
Hash indexes are quick and somewhat simple. They can only be used for equality comparisons (=).
Note: a hash index is not about saving space; its advantage is lookup speed for equality checks.
How to create a hash index:
CREATE INDEX your_hash_index
ON your_table_name
USING HASH (your_column_name);
BRIN indexes
BRIN stands for Block Range INdexes.
BRIN indexes are smaller in size and quicker to update than B-Tree indexes.
They are useful when the indexed values correlate with where the rows sit on disk - for example ever-increasing dates in an append-only table. A BRIN index makes it cheap to select items that are close to each other in a range, because rows with similar values live in the same or nearby pages on disk.
It stores only the min/max values for each block range, which is why the index can be dramatically smaller.
How to create a BRIN index:
CREATE INDEX your_brin_index ON your_table USING brin(updated_at_timestamp);
GiST Indexes
GiST stands for Generalized Search Tree. It is a lossy index (it may return false positives, but Postgres always re-checks the real row data before returning results). GiST indexes can also be used for full text search (like GIN).
GiST indexes support more comparison operators, such as <<, >>, ~=, <@, <<|, and |>>.
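Creating one looks much like the other index types. A sketch, assuming a column of a type with a GiST operator class (such as a geometric or range type):
CREATE INDEX your_gist_index
    ON your_table_name
    USING GIST (your_geometry_column);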
Bloom index
A bloom index is a very space-efficient way to index data. It can quickly tell you that a row definitely does not match, or that it probably does (it can return false positives, but only a very small percentage).
Example from the postgres docs, explaining that "The index is created with a signature length of 80 bits, with attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could have omitted the length, col1, and col2 specifications since those have the default values.":
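-- note: the bloom index type is provided by the "bloom" contrib extension,
-- so it needs to be enabled first:
CREATE EXTENSION IF NOT EXISTS bloom;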
CREATE INDEX your_bloom_index
    ON your_table
    USING bloom (i1, i2, i3)
    WITH (length = 80, col1 = 2, col2 = 2, col3 = 4);
Use ROW_NUMBER to add a row/position counter to each result
ROW_NUMBER() is a window function that can be used to add a row number to each result.
You use it with PARTITION BY to split the rows into groups (so the same row number can appear more than once, once per group), and ORDER BY to set the order within each group.
You can use it without PARTITION BY to get a simple row counter (notice below there are 4 rows, and the 4th row has a row_number of 4 even though its id is 5):
SELECT id,
title,
ROW_NUMBER() OVER (ORDER BY id)
FROM your_table;
Will return something like:
id,title,row_number
1,some-title,1
2,another-title,2
3,one-more,3
5,notice-the-gap,4
You can also use it to generate row numbers grouped (or partitioned) by a column.
The following example shows this - the row number is partitioned by the category_id, so each category_id will have its own row numbering.
SELECT id,
title,
category_id,
ROW_NUMBER() OVER (
PARTITION BY category_id
ORDER BY
title
) AS category_row_num
FROM your_table;
id,title,category_id,category_row_num
1,abc,1,1
4,def,1,2
3,aaa,2,1 // row num restarts at 1 for this category_id of 2
2,zzz,2,2
Change the border style in postgres command line output
There are different styles to the borders, such as:
\pset linestyle ascii
\pset border 0
id name created
-- ------- ----------
1 Qwerty 2022-01-01
2 Asdf 2022-02-02
or with borders
\pset linestyle ascii
\pset border 2
+----+--------+------------+
| id | name | created |
+----+--------+------------+
| 1 | Qwerty | 2022-01-01 |
| 2 | Asdf | 2022-02-02 |
+----+--------+------------+
If you want to find out more, see https://database.guide/how-to-change-the-table-border-style-in-the-psql-query-results/
The different column data types
There is a full list of data types here: https://www.postgresql.org/docs/current/datatype.html
But I'll list some common ones
- boolean - normally true or false
  - when storing, 1, yes, y, t are converted to true; 0, no, n, f are converted to false
  - when querying, t is converted to true, f is converted to false, and a space is converted to null
- text and strings - you have a few options, such as:
  - CHAR(n) - pads out any string shorter than length n by appending spaces. This should never be used unless you really know why you need it.
  - VARCHAR(n) - 'variable length char'. Does not pad out strings shorter than n. Unlikely to need this often - just use TEXT.
  - TEXT - accepts any length. There is no advantage in 99% of cases to using anything but TEXT.
- numbers
  - integers, including:
    - SMALLINT, 2 bytes signed, range of -32,768 to 32,767
    - INT, 4 bytes signed, range of -2,147,483,648 to 2,147,483,647
  - floats, including:
    - FLOAT(n), a float with a precision of n, taking up to 8 bytes
    - REAL (aka FLOAT4), a 4 byte floating point number, and DOUBLE PRECISION (aka FLOAT8), an 8 byte floating point number
    - NUMERIC(P,S), where P is the precision and S is the scale
      - precision is the total number of significant digits (on both sides of the decimal point)
      - scale is the number of decimal digits. For example 12.3456 has a precision of 6 and a scale of 4.
      - integers can be imagined as a NUMERIC with a scale of 0
      - you can also do NUMERIC(P), which defaults the scale (S) to 0
      - and just NUMERIC (no P or S values) creates an "unconstrained numeric", with no set limits (up to what Postgres can support)
      - you can also store +Infinity, -Infinity and NaN in numeric columns, e.g. UPDATE your_table SET a_column = '-Infinity'
      - more info can be found here: https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-numeric/
- date/times
  - DATE - date only
  - TIME - time only
  - TIMESTAMP - date and time
  - TIMESTAMPTZ - date and time, with timezone
  - INTERVAL - durations of time
  - and many others
- JSON, JSONB
- MACADDR - for MAC addresses
- INET - for IPv4 and IPv6 addresses
- UUID
- and you can store any column as an array. You can do this by creating a table with columns such as some_column TEXT[] to create a column called 'some_column' that holds an array of strings.
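A sketch pulling a few of these together into one table definition (the table and column names are just for illustration):
CREATE TABLE example_types
(
    id         serial PRIMARY KEY,
    is_active  boolean     NOT NULL DEFAULT true,
    title      text        NOT NULL,
    price      numeric(10, 2),
    ip_address inet,
    tags       text[],
    created_at timestamptz NOT NULL DEFAULT now()
);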
What the WAL is
If you watch any conference talk or listen to Postgres podcasts, you will hear about the WAL all the time. It is the write-ahead log. Once your Postgres database has been in use for a while and starts building up a lot of data, you will probably encounter some issues with configuring it (and related things like the VACUUM command).
The WAL (write ahead log) is a log of every update that happens in your database. If your database crashes, it can use this to recover data.
I won't go into it in any more depth here - I want to write a separate article about the WAL later on. You can read the official docs here: https://www.postgresql.org/docs/current/wal-intro.html
Every query is wrapped in a transaction
You might imagine that if you run a query such as SELECT * FROM your_table there is no DB transaction involved. But really it gets an implicit one.
Postgres will automatically add a BEGIN command at the start of the query, and a COMMIT at the end.
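In other words, running the query on its own behaves roughly like this explicit version:
BEGIN;
SELECT * FROM your_table;
COMMIT;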
Recursive queries
You can do recursive queries. Here is an example from the Postgres docs that will sum numbers from 1 to 100.
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
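A more typical real-world use is walking a hierarchy, such as a table of categories where each row points at its parent. A sketch (the categories table and its columns are just for illustration):
WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id
    FROM categories
    WHERE parent_id IS NULL -- start at the root categories
    UNION ALL
    SELECT c.id, c.name, c.parent_id
    FROM categories c
             JOIN category_tree ct ON c.parent_id = ct.id -- then repeatedly join children onto what we have so far
)
SELECT *
FROM category_tree;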
See more at https://www.postgresql.org/docs/current/queries-with.html