最近在数据处理中用到了窗函数, 把使用方法记录一下, 暂时只有分组排序和滑动时间窗口的例子, 以后再逐步添加
场景在SQL查询时, 会遇到有两类需要分组统计的场景, 在之前的SQL语法中是不方便实现的
- 场景1: 顾客维修设备的记录表, 每次维修产生一条记录, 每个记录包含时间, 顾客ID和维修金额, 要取出每个顾客的维修次数和最后一次维修时的金额
- 场景2: 还是上面的维修记录表, 要取出每个顾客的每次维修之间的时间间隔
- 场景3: 一个用户账户的交易流水表, 要求每个小时的交易笔数和平均收支金额, 这个平均数的统计范围是两个小时(整点时间的前后一个小时)
使用窗函数直接SQL中使用窗函数就能解决这些问题, 否则需要使用临时表, 函数或存储过程进行处理.
窗函数PostgreSQL 从2010年的版本8开始就支持窗函数了.
文档详细说明建议查看官方文档 https://www.postgresql.org/docs/current/tutorial-window.html
函数说明窗函数(window function)的计算方式与传统的单行和聚合不同
- 窗函数是在当前表中, 基于当前行的相关行的计算, 注意是基于多行的计算
- 属于一种聚合计算, 可以使用聚合类型的函数(aggregate function)
- 使用窗函数并不会导致结果的聚合, 也就是结果依然是当前的行结构
所以综合的说, 窗口函数就是在行的基础上, 允许对多行数据进行计算. 下面是一个简单的窗函数例子, 将每个员工的薪资与其所在的部门的平均薪资进行比较
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
关键词
使用窗函数时会用到的一些关键词
- OVER 前面的查询基于后面的窗口
- PARTITION BY 类似于 GROUP BY 的语义, 专用于窗口的分组
- ORDER BY 窗内的排序依据, 依据的字段决定了 RANGE 的类型
- RANGE ... PRECEDING 在当前值之前的范围, 基准是当前记录这个 ORDER BY 字段的值
- RANGE ... FOLLOWING 在当前值之后的范围, 基准是当前记录这个 ORDER BY 字段的值
- RANGE BETWEEN ... PRECEDING AND ... FOLLOWING 前后范围的组合
- WINDOW 将窗口命名为变量, 可以在 SELECT 中重复使用
功能: 将数据按指定的字段分组, 再按另一个字段排列, 给每个分组里的数据打上序号.
这是一个常用技巧, 例如要计算各组内记录之间的时间间隔, 但是用时间不方便join, 打完序号后就可以用序号join了
SELECT
ROW_NUMBER() OVER w1 AS rn,
sample_01.*
FROM
sample_01
WINDOW
w1 AS (PARTITION BY field_name ORDER BY created_at ASC);
简单时间窗口统计
功能: 将数据表按指定字段(日期类型)进行排序, 然后基于每个记录的这个字段创建一个固定宽度的时间窗口, 对窗口内的多个记录进行统计
统计单个字段, 可以直接写在select中
SELECT
MAX(amount) OVER (ORDER BY traded_at RANGE '30 minutes' PRECEDING) AS amount_max,
*
FROM sample_01
WHERE card_num = '6210812500006111111'
基于时间窗口变量进行多字段统计
功能: 和前一个功能一样, 但是要进行多个不同的统计, 要重复用到这个窗口函数
如果要统计多个字段, 可以抽出单独的WINDOW
SELECT
MAX(rn) OVER w1 AS rn_max,
MAX(amount) OVER w1 AS amount_max,
AVG(amount) OVER w1 AS amount_avg,
*
FROM sample_01_diff
WINDOW
-- w1 AS (ORDER BY traded_at RANGE '30 minutes' PRECEDING)
w1 AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING)
ORDER BY
rn ASC
在这个例子中
- 先依据 card_num 这个字段进行分区,
- 然后按 traded_at 这个字段进行排序,
- 对每个记录的 traded_at 值, 开启一个 RANGE, 包含前面的30分钟和后面的30分钟, RANGE 中能用的类型和 ORDER BY 的字段类型是相关的
- SELECT中的 MAX, MIN 等聚合函数, 是基于上面的 RANGE 进行的
多个窗口多个字段同时统计In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column. For numeric ordering columns it is typically of the same type as the ordering column, but for datetime ordering columns it is an interval. For example, if the ordering column is of type date or timestamp, one could write
RANGE BETWEEN '1 day' PRECEDING AND '10 days'
FOLLOWING. The offset is still required to be non-null and non-negative, though the meaning of “non-negative” depends on its data type.
功能: 在前面的功能基础上, 同时存在多个时间窗口
SELECT
-- 1 hour
SUM(amount_in) OVER w1h AS h1_amount_in_sum,
SUM(
CASE
WHEN amount_in = 0 THEN 0
ELSE 1
END
) OVER w1h AS h1_amount_in_count,
SUM(amount_out) OVER w1h AS h1_amount_out_sum,
SUM(
CASE
WHEN amount_out = 0 THEN 0
ELSE 1
END
) OVER w1h AS h1_amount_out_count,
SUM(amount) OVER w1h AS h1_amount_sum,
COUNT(amount) OVER w1h AS h1_amount_count,
ROUND(AVG(amount) OVER w1h, 2) AS h1_amount_avg,
FIRST_VALUE(amount) OVER w1h AS h1_amount_first,
LAST_VALUE(amount) OVER w1h AS h1_amount_last,
MAX(amount) OVER w1h AS h1_amount_max,
MIN(amount) OVER w1h AS h1_amount_min,
-- 3 hour
SUM(amount_in) OVER w3h AS h3_amount_in_sum,
SUM(
CASE
WHEN amount_in = 0 THEN 0
ELSE 1
END
) OVER w3h AS h3_amount_in_count,
SUM(amount_out) OVER w3h AS h3_amount_out_sum,
SUM(
CASE
WHEN amount_out = 0 THEN 0
ELSE 1
END
) OVER w3h AS h3_amount_out_count,
SUM(amount) OVER w3h AS h3_amount_sum,
COUNT(amount) OVER w3h AS h3_amount_count,
ROUND(AVG(amount) OVER w3h, 2) AS h3_amount_avg,
FIRST_VALUE(amount) OVER w3h AS h3_amount_first,
LAST_VALUE(amount) OVER w3h AS h3_amount_last,
MAX(amount) OVER w3h AS h3_amount_max,
MIN(amount) OVER w3h AS h3_amount_min,
*
FROM sample_01
WINDOW
w1h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING),
w3h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '90 minutes' PRECEDING AND '90 minutes' FOLLOWING)
;
参考
- https://www.postgresql.org/docs/current/tutorial-window.html
- https://stackoverflow.com/questions/14989927/sql-sliding-window-finding-max-value-over-interval
- https://www.postgresqltutorial.com/postgresql-window-function/
- https://www.citusdata.com/blog/2018/06/01/fun-with-sql-window-functions-in-postgresql/
- https://tapoueh.org/blog/2013/08/understanding-window-functions/