Windowing Functions - Example with SQL and PySpark dataframe code
- Shantanu Sharma
- Dec 5, 2021
- 6 min read
Updated: Apr 11, 2022

Window functions
Window functions are a special group of functions that scan multiple input rows to compute each output value. They are usually used with an OVER clause, optionally combined with PARTITION BY, ORDER BY, and a windowing specification. Unlike regular aggregate functions, which are used with the GROUP BY clause and are limited to one result value per group, window functions operate on windows of input rows that are ordered and grouped using flexible conditions expressed through the OVER and PARTITION BY clauses. Window functions return aggregate results, but they do not collapse the result set: the group value is repeated with each record. This gives window functions far greater flexibility and functionality compared to the regular GROUP BY clause.
There are 3 different categories of window functions:
1. Aggregate functions
Count
Sum
Min
Max
Avg
2. Sort functions
row_number
rank
dense_rank
percent_rank
ntile
3. Analytics functions
cume_dist
lead
lag
first_value
last_value
The syntax for a window function is as follows:
Function (arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [<window_expression>])
Function (arg1, ..., argn) can be any function in the window function categories:
Aggregate functions: regular aggregate functions (e.g., sum, max)
Sort functions: functions for sorting data (e.g., rank, row_number)
Analytics functions: functions for statistics and comparisons (e.g., lead, lag, first_value)
The OVER [PARTITION BY <...>] clause is similar to the GROUP BY clause. It divides the rows into groups containing identical values of one or more partition-by columns. These logical groups are known as partitions, which is not the same term as the partitions of partitioned tables. Omitting the PARTITION BY statement applies the operation to all the rows in the table.
The [ORDER BY <...>] clause is the same as the regular ORDER BY clause. It ensures that the rows produced by the PARTITION BY clause are ordered as specified, for example in ascending or descending order.

Let's play around with the data and code.
from pyspark.sql.functions import *
from pyspark.sql.window import *
# data
product_data = [
["tablet", "mobile and laptop", 30000],
["laptop", "mobile and laptop", 50000],
["cell phone 2", "mobile and laptop", 10000],
["cell phone 1", "mobile and laptop", 10000],
["washing machine", "home appliance", 20000],
["microwave", "home appliance", 15000],
["fridge", "home appliance", 18000],
["tv", "home appliance", 22000],
["vacuum cleaner", "home appliance", 12000],
["food processor", "home appliance", 14000],
["air fryer", "home appliance", 12000],
["office table", "furniture", 9000],
["dining table", "furniture", 35000],
["study table", "furniture", 7500],
["sofa bed", "furniture", 25000],
["study chair", "furniture", 3500],
["kids chair", "furniture", 3500],
]
# schema
product_schema = "product_name STRING, category_name STRING, unit_price INT"
# creating a dataframe
product_dataframe = spark.createDataFrame(product_data, product_schema)
# creating table from dataframe
product_dataframe.createOrReplaceTempView("product_table")
Next, we'll learn more details of each category of window functions through examples.
1. Aggregate functions
Window aggregate functions provide aggregation on top of windows, such as getting the count, min, max, or avg of rows in the ordered set. The most commonly used aggregate functions are as follows:
Count
Sum
Min
Max
Avg
The result would be equivalent to doing a GROUP BY on the partition column, but aggregate functions become especially useful when we combine them with a window expression (row and range type) in the last section of this article.

2. Window sort functions
Window sort functions provide sorting information, such as row number and rank, within specific groups as part of the data returned. The most commonly used sort functions are as follows:
row_number: Assigns a unique sequence number starting from 1 to each row, according to the partition and order specification.
rank: Ranks items in a group, such as finding the top N rows for specific conditions.
dense_rank: Similar to rank, but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using dense_rank and have two players tied for second place, we would see that the two players were both in second place and that the next person is ranked third. However, the rank function would rank two people in second place, but the next person would be in fourth place.
percent_rank: Uses rank values rather than row counts in its numerator as (current rank - 1)/(total number of rows - 1). Therefore, it returns the percentage rank of a value relative to a group of values.
ntile: Divides an ordered dataset into a number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.

We can also use aggregate functions in the OVER clause as follows:
https://gist.github.com/shan-sharma/e3aff3a64743317dae2a50bca9c001dc#file-windowing_sort_agg_func-py

3. Window analytics functions
Window analytics functions provide extended data analytics, such as getting lag, lead, last, or first rows in the ordered set. The most commonly used analytics functions are as follows:
cume_dist: Computes the cumulative distribution of a value within a group: the number of rows whose value is smaller than or equal to the current row's value, divided by the total number of rows, that is, (number of rows ≤ current row)/(total number of rows).
lead: This function, lead(value_expr[, offset[, default]]), is used to return data from a following row. The number of rows to look ahead (offset) can optionally be specified; it defaults to one. When the lead for the current row extends beyond the end of the window, the function returns default, or NULL if no default is specified.
lag: This function, lag(value_expr[, offset[, default]]), is used to access data from a previous row. The number of rows to look back (offset) can optionally be specified; it defaults to one. When the lag for the current row extends beyond the beginning of the window, the function returns default, or NULL if no default is specified.
first_value: It returns the first result from an ordered set.
last_value: It returns the last result from an ordered set.
#order data
order_data = [
["cell phone 1", "2021-01-01", 100],
["cell phone 1", "2021-01-02", 120],
["cell phone 1", "2021-01-03", 140],
["cell phone 1", "2021-01-04", 160],
["cell phone 1", "2021-01-05", 180],
["tablet", "2021-01-01", 50],
["tablet", "2021-01-02", 60],
["tablet", "2021-01-03", 70],
["tablet", "2021-01-04", 80],
["tablet", "2021-01-05", 90],
["laptop", "2021-01-01", 10],
["laptop", "2021-01-02", 15],
["laptop", "2021-01-03", 20],
["laptop", "2021-01-04", 25],
["laptop", "2021-01-05", 30],
]
#schema
order_schema = "product_name STRING, sale_date STRING, quantity_sold INT"
#creating dataframe
order_dataframe = spark.createDataFrame(order_data, order_schema)
#converting sale_date to timestamp
order_dataframe = order_dataframe.withColumn(
"sale_date",
to_date(unix_timestamp(col("sale_date"), 'yyyy-MM-dd').cast("timestamp")),
)
#creating table from dataframe
order_dataframe.createOrReplaceTempView("order_table")

Window expression
[<window_expression>] is used to further sub-partition the result and apply the window functions. There are two types of windows:
Row Type
Range Type
For row type windows, the definition is in terms of row numbers before or after the current row. The general syntax of the row window clause is as follows:
ROWS BETWEEN <start_expr> AND <end_expr>
<start_expr> can be any one of the following:
UNBOUNDED PRECEDING
CURRENT ROW
N PRECEDING or FOLLOWING
<end_expr> can be any one of the following:
UNBOUNDED FOLLOWING
CURRENT ROW
N PRECEDING or FOLLOWING
The following covers more details about using window expressions and their combinations:
BETWEEN ... AND: Use it to specify the start point and end point of the window. The first expression (before AND) defines the start point and the second expression (after AND) defines the end point. If we omit BETWEEN ... AND (as in ROWS N PRECEDING or ROWS UNBOUNDED PRECEDING), Spark, following Hive semantics, treats the expression as the start point, and the end point defaults to the current row (see the win6 and win7 columns in the following examples).
N PRECEDING or FOLLOWING: This indicates N rows before or after the current row.
UNBOUNDED PRECEDING: This indicates the window starts at the first row of the partition. This is the start point specification and cannot be used as an endpoint specification.
UNBOUNDED FOLLOWING: This indicates the window ends at the last row of the partition. This is the endpoint specification and cannot be used as a start point specification.
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This indicates the first and last row for every row, meaning all rows in the partition (see the win14 column in the upcoming examples).
CURRENT ROW: As a start point, CURRENT ROW specifies that the window begins at the current row or value, depending on whether we have specified ROWS or RANGE (RANGE is introduced later in this article). In this case, the end point cannot be N PRECEDING. As an end point, CURRENT ROW specifies that the window ends at the current row or value, depending on whether we have specified ROWS or RANGE. In this case, the start point cannot be N FOLLOWING.

The following examples implement the window expressions in row type:
Preceding and Following (ROWS BETWEEN N PRECEDING AND N FOLLOWING)
Preceding and Current (ROWS BETWEEN N PRECEDING AND CURRENT ROW)
Current and Following (ROWS BETWEEN CURRENT ROW AND N FOLLOWING)
Current and Current (ROWS BETWEEN CURRENT ROW AND CURRENT ROW), which considers only the current row.

Compared to row type windows, which are defined in terms of rows, range type windows are defined in terms of the values within the window expression's specified range. For example, the max(quantity_sold) RANGE BETWEEN 50 PRECEDING AND 100 FOLLOWING statement will calculate max(quantity_sold) within the partition over rows whose value lies between the current row's value - 50 and the current row's value + 100. If the current row's quantity_sold is 100, this max(quantity_sold) will include rows whose quantity_sold ranges from 50 to 200 within each product_name partition.

If we omit the window expression clause but keep ORDER BY, the default window specification is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. When both the ORDER BY and window expression clauses are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.