plotnine geom_histogram wrong bin placement

I’m trying to set the bins of my histogram explicitly so that each one is exactly 10 units wide.

Here is an example. I defined a list of numbers. The list contains 10 numbers with 1 digit, and then 50 numbers between 50 and 59, 60 numbers between 60 and 69, and so on.

rand_numbers = (
    ([0]*5   + [9]*5) +
    ([50]*20 + [59]*30) +
    ([60]*30 + [69]*30) +
    ([70]*35 + [79]*35) +
    ([80]*40 + [89]*40) +
    ([90]*45 + [99]*45)
)

Then I create a data frame where I “classified” the numbers so that numbers up to 69 get one color, numbers in the 70s get another color, and all numbers from 80 upward get a third color:

import pandas as pd

df = pd.DataFrame({
    'c1': rand_numbers,
    'c2': ['foo'] * 120 + ['bar'] * 70 + ['baz'] * 170
})

To make the histogram, I’m doing:

import plotnine as p9

# The outer parentheses let the expression span several lines
p = (
    p9.ggplot(df, p9.aes(x='c1', fill='c2'))
    + p9.scale_x_continuous(breaks=range(0, 120, 10))
    + p9.geom_histogram(size=0.5, colour='black', breaks=range(0, 120, 10))
)

In the resulting plot, the bins are “spilling” onto one another. Here is more or less what I expected:

That is, I expected a histogram with exactly 10 elements in the first bin, exactly 50 elements in the next bin (between 50 and 59), then exactly 60 elements in the next one. All of the aforementioned bins should be completely blue. Then, a red bin with exactly 70 elements, and then two green bins with exactly 80 and 90 elements.

I’m using the approach suggested here and here for predefining the bins in geom_histogram(), but it didn’t work the way I expected.

In attempting to solve this problem, I found:

  • This question from 2016 reporting a bug — but it seems to be in the R implementation of ggplot… I don’t know whether this would have anything to do with plotnine

EDIT: I noticed that, if I do the following, it “works”. Still, I’m not sure if this is a trustworthy solution (?).

geom_histogram(size=0.5, colour='black',
     breaks=range(-1, 120, 10))   # <------ here, starting at -1 instead of 0

By default, geom_histogram (through stat_bin) uses closed="right" (see here), i.e. each bin includes its right edge and excludes its left edge. With breaks at 0, 10, 20, ..., a value of 60 is therefore counted in the bin (50, 60] rather than in [60, 70), which is why the bars appear to spill into their neighbours. To get the result you want, set closed="left":

import plotnine as p9
import pandas as pd

rand_numbers = (
    ([0]*5   + [9]*5) +
    ([50]*20 + [59]*30) +
    ([60]*30 + [69]*30) +
    ([70]*35 + [79]*35) +
    ([80]*40 + [89]*40) +
    ([90]*45 + [99]*45)
)

df = pd.DataFrame({
    'c1': rand_numbers,
    'c2': ['foo'] * 120 + ['bar'] * 70 + ['baz'] * 170
})

# The outer parentheses let the expression span several lines
(
    p9.ggplot(df, p9.aes(x='c1', fill='c2'))
    + p9.scale_x_continuous(breaks=range(0, 120, 10))
    + p9.geom_histogram(
        size=0.5, colour='black',
        # closed="left" makes each bin [a, b): left edge included, right edge excluded
        breaks=range(0, 120, 10), closed="left"
    )
)
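
To check the counts without plotting, here is a minimal sketch in plain Python of how the two conventions bin the same data. The helper bin_counts is hypothetical (it is not part of plotnine), and it assumes that a value sitting exactly on the outermost break is kept in the end bin; I believe plotnine does the same, but treat that as an assumption.

```python
from bisect import bisect_left, bisect_right


def bin_counts(values, breaks, closed="right"):
    """Count values per bin (hypothetical helper, not a plotnine function).

    closed="right" -> bins are (a, b]; closed="left" -> bins are [a, b).
    Assumption: a value on the outermost break stays in the end bin.
    """
    edges = list(breaks)
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # outside every bin
        if closed == "right":
            i = bisect_left(edges, v) - 1   # v == b goes to the bin ending at b
        else:
            i = bisect_right(edges, v) - 1  # v == a goes to the bin starting at a
        i = min(max(i, 0), len(counts) - 1)  # keep outermost edges in the end bins
        counts[i] += 1
    return counts


rand_numbers = (
    ([0]*5   + [9]*5) +
    ([50]*20 + [59]*30) +
    ([60]*30 + [69]*30) +
    ([70]*35 + [79]*35) +
    ([80]*40 + [89]*40) +
    ([90]*45 + [99]*45)
)

right_counts = bin_counts(rand_numbers, range(0, 120, 10), closed="right")
left_counts = bin_counts(rand_numbers, range(0, 120, 10), closed="left")
# The workaround from the question: right-closed bins shifted to start at -1
shifted_counts = bin_counts(rand_numbers, range(-1, 120, 10), closed="right")

print(right_counts)    # [10, 0, 0, 0, 20, 60, 65, 75, 85, 45, 0]
print(left_counts)     # [10, 0, 0, 0, 0, 50, 60, 70, 80, 90, 0]
print(shifted_counts)  # [10, 0, 0, 0, 0, 50, 60, 70, 80, 90, 0, 0]
```

The last line also suggests why starting the breaks at -1 “works”: for integer data, (-1, 9] holds the same values as [0, 10). The match would break as soon as the data contain non-integers (9.5 would land in (9, 19] instead of [0, 10)), so closed="left" is the more robust choice.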
