I’d like to count the number of unique values when combining several columns at once. My idea so far was to use pl.struct(...).n_unique(), which works fine when I treat missing values as a unique value of their own:
import polars as pl
df = pl.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": [1, 1, 2, None],
})
df.with_columns(foo=pl.struct("x", "y").n_unique())
shape: (4, 3)
┌─────┬──────┬─────┐
│ x ┆ y ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═════╪══════╪═════╡
│ a ┆ 1 ┆ 3 │
│ a ┆ 1 ┆ 3 │
│ b ┆ 2 ┆ 3 │
│ b ┆ null ┆ 3 │
└─────┴──────┴─────┘
However, sometimes I want to exclude a combination from the count if it contains any missing values. In the example above, I’d like foo to be 2, but calling .drop_nulls() before counting doesn’t work and produces the same output as above.
df.with_columns(foo=pl.struct("x", "y").drop_nulls().n_unique())
Is there a way to do this using only Polars expressions?
pl.Expr.drop_nulls does not drop the row, because the struct as a whole is not null; only its y field is.
To still achieve the desired result, you can filter out all rows that contain a null value in any of the columns of interest using pl.Expr.filter.
(
    df
    .with_columns(
        foo=pl.struct("x", "y").filter(
            ~pl.any_horizontal(pl.col("x", "y").is_null())
        ).n_unique()
    )
)
shape: (4, 3)
┌─────┬──────┬─────┐
│ x ┆ y ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═════╪══════╪═════╡
│ a ┆ 1 ┆ 2 │
│ a ┆ 1 ┆ 2 │
│ b ┆ 2 ┆ 2 │
│ b ┆ null ┆ 2 │
└─────┴──────┴─────┘