Compute the number of unique combinations while excluding those containing missing values

  Kiến thức lập trình

I’d like to count the number of unique values when combining several columns at once. My idea so far was to use pl.struct(...).n_unique(), which works fine when I consider missing values as a unique value:

import polars as pl

df = pl.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": [1, 1, 2, None],
})

df.with_columns(foo=pl.struct("x", "y").n_unique())
shape: (4, 3)
┌─────┬──────┬─────┐
│ x   ┆ y    ┆ foo │
│ --- ┆ ---  ┆ --- │
│ str ┆ i64  ┆ u32 │
╞═════╪══════╪═════╡
│ a   ┆ 1    ┆ 3   │
│ a   ┆ 1    ┆ 3   │
│ b   ┆ 2    ┆ 3   │
│ b   ┆ null ┆ 3   │
└─────┴──────┴─────┘

However, sometimes I want to exclude a combination from the count if it contains any number of missing values. In the example above, I’d like foo to be 2. However, using .drop_nulls() before counting doesn’t work and produces the same output as above.

df.with_columns(foo=pl.struct("x", "y").drop_nulls().n_unique())

Is there a way to do this using only Polars expressions?

pl.Expr.drop_nulls does not drop the row as the entirety of the struct is indeed not null.

To still achieve the desired result, you can filter out all rows which contain a null values in any of the columns of interest using pl.Expr.filter.

(
    df
    .with_columns(
        foo=pl.struct("x", "y").filter(
            ~pl.any_horizontal(pl.col("x", "y").is_null())
        ).n_unique()
    )
)
shape: (4, 3)
┌─────┬──────┬─────┐
│ x   ┆ y    ┆ foo │
│ --- ┆ ---  ┆ --- │
│ str ┆ i64  ┆ u32 │
╞═════╪══════╪═════╡
│ a   ┆ 1    ┆ 2   │
│ a   ┆ 1    ┆ 2   │
│ b   ┆ 2    ┆ 2   │
│ b   ┆ null ┆ 2   │
└─────┴──────┴─────┘

Theme wordpress giá rẻ Theme wordpress giá rẻ Thiết kế website

LEAVE A COMMENT