Example: Location Estimates of Population and Murder Rates

In [1]:

from scipy.stats import trim_mean
import pandas as pd
import numpy as np
import wquantiles

In [2]:

state = pd.read_csv('../data/state.csv')
state.sort_values("Population", ascending=False).head(10)

Out[2]:

	State	Population	Murder.Rate	Abbreviation
4	California	37253956	4.4	CA
42	Texas	25145561	4.4	TX
31	New York	19378102	3.1	NY
8	Florida	18801310	5.8	FL
12	Illinois	12830632	5.3	IL
37	Pennsylvania	12702379	4.8	PA
34	Ohio	11536504	4.0	OH
21	Michigan	9883640	5.4	MI
9	Georgia	9687653	5.7	GA
32	North Carolina	9535483	5.1	NC

In [3]:

trim_mean(state["Population"], 0.1)

Out[3]:

np.float64(4783697.125)

In [4]:

state["Population"].median()

Out[4]:

np.float64(4436369.5)

In [5]:

np.average(state["Murder.Rate"], weights=state["Population"])

Out[5]:

np.float64(4.445833981123393)

By using the weighted mean, we are effectively answering the question, “If we picked a US resident at random, what would the murder rate be in their state?”
If we had taken only the plain mean of state["Murder.Rate"], for example, we would have treated Alaska and California as if they had the same population.
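
To make the difference concrete, here is a minimal sketch comparing the plain mean with the population-weighted mean. It assumes only three rows (California, New York, Florida) copied from the Out[2] table rather than the full dataset, so the numbers will not match Out[5]; the variable names are illustrative.

```python
import numpy as np

# Three rows copied from the Out[2] table (an illustrative subset, not the full data).
population = np.array([37253956, 19378102, 18801310])  # CA, NY, FL
murder_rate = np.array([4.4, 3.1, 5.8])

# Plain mean: every state counts equally, regardless of size.
unweighted = murder_rate.mean()

# Weighted mean written out by hand: sum(w_i * x_i) / sum(w_i).
manual_weighted = (murder_rate * population).sum() / population.sum()

# The same quantity via np.average, as used in the cell above.
numpy_weighted = np.average(murder_rate, weights=population)

print(unweighted, manual_weighted, numpy_weighted)
```

In this subset the weighted result is pulled toward California's rate, because California contributes roughly half of the total weight.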

In [6]:

wquantiles.median(state["Murder.Rate"], weights=state["Population"])

Out[6]:

np.float64(4.4)

In this context, the weighted median means that about half of Americans live in states with murder rates at or below this value. Because we used the Population column as the weight, the result accounts for state size and avoids the mistake mentioned above.
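
The idea behind a weighted median can be sketched by hand: sort the values, accumulate the weights, and take the first value at which the running weight reaches half of the total. The helper `weighted_median` below is hypothetical and uses this simple, non-interpolating convention; `wquantiles.median` may follow a different convention (for example, interpolating between values), so the two need not agree exactly on other data. The input is four rows copied from the Out[2] table, not the full dataset.

```python
import numpy as np

def weighted_median(values, weights):
    """Hypothetical helper: smallest value whose cumulative weight
    reaches half of the total weight (no interpolation)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)               # sort values ascending
    sorted_values = values[order]
    cum_weights = np.cumsum(weights[order])  # running total of the weights
    half = 0.5 * cum_weights[-1]
    idx = np.searchsorted(cum_weights, half) # first position reaching half
    return sorted_values[idx]

# Four rows copied from the Out[2] table (an illustrative subset).
population = [37253956, 25145561, 19378102, 18801310]  # CA, TX, NY, FL
murder_rate = [4.4, 4.4, 3.1, 5.8]

result = weighted_median(murder_rate, population)
print(result)
```

With equal weights this reduces to an ordinary median for an odd number of values; with unequal weights, heavily populated states move the cut point toward their own rates.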