I realize it’s been a long time since I last posted, and I wanted to start the new year off right with a good post based on an experience I encountered. I’ll try to post sort of regularly going forward. Hope this post will yield useful insight!
When working with databases, it’s a common need to repeatedly access data
through complex SELECT
queries. Most RDBMS offer convenient ways to accomplish this:
views and materialized views. Let’s go over the differences between the two,
and when to use them.
A view is essentially a stored query that, for all intents and purposes, can be treated as a “regular” table. You, as the developer, define a query that gets executed every time the view is accessed. This means that the data that comes back is always fresh and reflects changes made to the tables within. In practice, especially when operating on large datasets, you will learn that this is quite expensive since the underlying query gets executed every single time.
Imagine a table called t_orders
that tracks the sales of an ecommerce site:
id | customer_id | order_date | total_amount |
---|---|---|---|
1 | 1 | 2024-11-01 | $150.00 |
2 | 2 | 2024-11-02 | $200.00 |
3 | 3 | 2024-11-03 | $50.00 |
4 | 1 | 2024-11-03 | $300.00 |
5 | 2 | 2024-11-04 | $100.00 |
We could create a standard view based on this data called vw_customer_orders
that
calculates the total price of all orders a customer has made:
DROP VIEW IF EXISTS vw_customer_orders;
CREATE OR REPLACE VIEW vw_customer_orders AS
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_spent
FROM
t_orders
GROUP BY
customer_id;
Be mindful when replacing views as it may break dependent entities such as stored procedures or other views.
This is where materialized views come in handy. The database does not need to recalculate the results during each query since the query results are instead stored on disk, making them much faster to query. However, there is one caveat — the data in these views, as compared to standard views, don’t automatically update even if the underlying table data changes. That means the materialized view can become outdated, forcing the user to manually refresh the view. Manual refreshes are not feasible for any system.
DROP VIEW IF EXISTS mv_customer_orders;
CREATE MATERIALIZED VIEW mv_customer_orders AS
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_spent
FROM
t_orders
GROUP BY
customer_id;
So, how do you keep materialized views, such as the one above, up to date? You can use the
REFRESH MATERIALIZED VIEW
command to recalculate and reload the data. Rather than executing
this command manually, a scheduler that runs every m minutes could potentially do this for you.
Below is a snippet of what this looks like in a Spring Boot application.
public class MVRefreshScheduler {
@Autowired
private UtilRepository utilRepository;
@Scheduled(cron = "*/m * * * *")
public void refreshMV() {
if (!schedulerEnabled) {
log.warn("MVRefreshScheduler not enabled");
return;
}
try {
utilRepository.callRefresh();
} catch (Exception e) {
log.error(e);
}
}
}
However, we still run into a potential roadblock — this operation locks the materialized view for the duration of the refresh, which means users can’t access anything that depends on that view. I’ve encountered a scenario where clients accessing a certain screen triggers a deadlock as the data that populates that screen depends on the materialized view. This spells trouble for high-availability systems, but rest asssured, we will explore in this post what we can do to avoid this particular pitfall.
Remember that a refresh can occur once for a materialized view at any given time.
It would probably be useful to identify the locks from the DB itself. Below is a nifty little query that does exactly that:
SELECT EXISTS (
SELECT
1
FROM
pg_stat_activity ps
JOIN pg_locks pl on ps.pid = pl.pid
WHERE
pl.relation = 'schema.my_table'::regclass
AND NOT pl.granted
AND ps.datname = current_database();
);
The query alone obviously isn’t helpful, so we can “hack” a solution - something that I quickly cobbled up together to patch this issue. All that needs to be done is to duplicate the view and treat the duplicate as a backup in case the primary view fails. Now here me out — this is far from the ideal solution — but it works to a good extent. At the time, we settled for this work-around as upgrades on our servers were interfering with our judgement. Hence, we settled for the most basic approach.
With this approach, one need simply check if there exists a lock on the view. The implementation of
the doesViewContainLock()
method is given above.
public class MVRefreshScheduler {
@Autowired
private UtilRepository utilRepository;
@Scheduled(cron = "*/m * * * *")
public void refreshMV() {
if (!schedulerEnabled) {
log.warn("MVRefreshScheduler not enabled");
return;
}
try {
utilRepository.callRefresh();
if (utilRepository.doesViewContainLock("schema.mv_primary")) {
utilRepository.callRefreshBackup();
} else {
utilRepository.callRefreshPrimary();
}
} catch (Exception e) {
log.error(e);
}
}
}
What if I told you there was a way to access the materialized view while it was still being
refreshed? Postgres allows the REFRESH MATERIALIZED VIEW [CONCURRENTLY]
option whose sole purpose
is to allow allow the materialized view to remain accessible for reads during the refresh process.
The single prerequisite is that the materialized view requires a UNIQUE
index on it. You could even
schedule these refreshes just as we did the standard refresh.
Concurrent refreshes will be faster than standard refreshes for a smaller record size but is in general slow since it requires more system resources to maintain consistency.
PostgreSQL’s materialized views can really boost performance in applications that handle a
lot of data. But to get the most out of them, it’s important to use the REFRESH MATERIALIZED VIEW [CONCURRENTLY]
command effectively. Even with automated solutions as described, (1) there
is still some overhead in maintaining the scheduler job, and (2) there will be stale data
between refreshes.