Data Science Essentials
Final Project for Data Science Essentials Course
TN Med Helper is a fictional company whose mission is to ensure access to healthcare for all Tennesseans. TN Med Helper has approached your data science consultancy for help identifying communities in Tennessee that need the most help in expanding access to healthcare.
In this project, we will use the Medicare Disparities data as a starting point for identifying such communities. Specifically, you will be provided with datasets containing the percent of Medicare beneficiaries who had an annual wellness visit (annual_wellness.csv), the number of all-cause hospitalizations per 1000 beneficiaries (hospitalizations.csv), and the number of emergency department visits per 1000 beneficiaries (emergency_department.csv). Over the next 8 weeks, you will work towards addressing the following three objectives.
First, TN Med Helper is concerned about communities either lacking access to healthcare or losing access to healthcare. They are looking to expand telehealth technologies into vulnerable communities, and need your help to prioritize the areas most in need of attention. Your first objective is to identify which counties in Tennessee have the most severe lack of access to healthcare (whether due to a lack of hospitals, physicians, or both). Once you have identified these counties, see if you can find any common demographic or economic characteristics for these areas.
Second, TN Med Helper is interested in reducing the number of potentially preventable hospitalizations. Do areas that lack access to healthcare tend to have higher rates of emergency department visits or hospitalizations? Is there an association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits?
Finally, TN Med Helper is trying to identify specific subpopulations to focus more attention on. Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year. Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year. Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?
Over the course of this class, you will build up your data analysis skills and work towards answering these questions. At the end of the project, teams will present their findings to TN Med Helper.
Looking at the data
- Annual wellness visits are reported as a percentage of beneficiaries.
- Emergency department visits are reported per 1000 beneficiaries.
- All-cause hospitalizations are reported per 1000 beneficiaries.
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
hospitalizations = pd.read_csv("../data/Medicare_Disparities_by_Population/hospitalizations.csv")
hospitalizations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3222 entries, 0 to 3221
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 3222 non-null int64
1 geography 3222 non-null object
2 measure 3222 non-null object
3 adjustment 3222 non-null object
4 analysis 3222 non-null object
5 domain 3222 non-null object
6 condition 3222 non-null object
7 primary_sex 3222 non-null object
8 primary_age 3222 non-null object
9 primary_dual 3222 non-null object
10 fips 3222 non-null int64
11 county 3205 non-null object
12 state 3222 non-null object
13 urban 3222 non-null object
14 primary_race 3222 non-null object
15 primary_eligibility 3222 non-null object
16 primary_denominator 3222 non-null object
17 analysis_value 3222 non-null int64
dtypes: int64(3), object(15)
memory usage: 453.2+ KB
annual_wellness = pd.read_csv("../data/Medicare_Disparities_by_Population/annual_wellness.csv")
annual_wellness.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 3220 non-null int64
1 geography 3220 non-null object
2 measure 3220 non-null object
3 adjustment 3220 non-null object
4 analysis 3220 non-null object
5 domain 3220 non-null object
6 condition 3220 non-null object
7 primary_sex 3220 non-null object
8 primary_age 3220 non-null object
9 primary_dual 3220 non-null object
10 fips 3220 non-null int64
11 county 3205 non-null object
12 state 3220 non-null object
13 urban 3220 non-null object
14 primary_race 3220 non-null object
15 primary_eligibility 3220 non-null object
16 primary_denominator 3220 non-null object
17 analysis_value 3220 non-null int64
dtypes: int64(3), object(15)
memory usage: 452.9+ KB
emergency_department = pd.read_csv("../data/Medicare_Disparities_by_Population/emergency_department.csv")
emergency_department.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 3220 non-null int64
1 geography 3220 non-null object
2 measure 3220 non-null object
3 adjustment 3220 non-null object
4 analysis 3220 non-null object
5 domain 3220 non-null object
6 condition 3220 non-null object
7 primary_sex 3220 non-null object
8 primary_age 3220 non-null object
9 primary_dual 3220 non-null object
10 fips 3220 non-null int64
11 county 3205 non-null object
12 state 3220 non-null object
13 urban 3220 non-null object
14 primary_race 3220 non-null object
15 primary_eligibility 3220 non-null object
16 primary_denominator 3220 non-null object
17 analysis_value 3220 non-null int64
dtypes: int64(3), object(15)
memory usage: 452.9+ KB
All three dataframes have the same structure and, unfortunately, the same column names. I will merge them and rename the columns that I want to keep from each.
emergency_department.iloc[1] == hospitalizations.iloc[1]
year True
geography True
measure False
adjustment True
analysis True
domain True
condition False
primary_sex True
primary_age True
primary_dual True
fips True
county True
state True
urban True
primary_race True
primary_eligibility True
primary_denominator False
analysis_value False
Name: 1, dtype: bool
emer_dict = {}
for i in emergency_department:
    emer_dict[i] = set(emergency_department[i])
emer_dict['fips'] = []
emer_dict['county'] = []
emer_dict['state'] = []
emer_dict['analysis_value'] = []
emer_dict
{'year': {2019},
'geography': {'County'},
'measure': {'Emergency department visit rate'},
'adjustment': {'Unsmoothed actual'},
'analysis': {'Base measure'},
'domain': {'Primary chronic conditions'},
'condition': {'All Emergency Department Visits'},
'primary_sex': {'All'},
'primary_age': {'All'},
'primary_dual': {'Dual & non-dual'},
'fips': [],
'county': [],
'state': [],
'urban': {'Rural', 'Urban'},
'primary_race': {'All'},
'primary_eligibility': {'All'},
'primary_denominator': {'undefined'},
'analysis_value': []}
The comparison above shows which columns differ between the dataframes: measure, condition, primary_denominator, and analysis_value. The dictionary of unique values confirms that every other column except urban takes only a single value.
hospitalizations = hospitalizations[['fips','measure','primary_denominator', 'analysis_value']]
emergency_department = emergency_department[['fips','measure','primary_denominator', 'analysis_value']]
hospitalizations.columns = ['fips','measure_hosp','primary_denominator_hosp', 'analysis_value_hosp']
emergency_department.columns = ['fips','measure_emer','primary_denominator_emer', 'analysis_value_emer']
annual_wellness.columns = annual_wellness.columns.str.replace("analysis_value","analysis_value_well")
annual_wellness.columns = annual_wellness.columns.str.replace("measure","measure_well")
annual_wellness.columns = annual_wellness.columns.str.replace("primary_denominator","primary_denominator_well")
disparities = pd.merge(left = annual_wellness, left_on = 'fips',
right = hospitalizations, right_on = 'fips')
disparities = pd.merge(left = disparities, left_on = 'fips',
right = emergency_department, right_on = 'fips')
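For reference, pandas can also disambiguate clashing column names during the merge itself via the `suffixes` parameter, which would avoid some of the manual renaming above. A minimal sketch with toy data (the toy frames only mimic the shared `analysis_value` column):

```python
import pandas as pd

# Toy stand-ins for two of the files, which share column names.
well = pd.DataFrame({"fips": [47001, 47003], "analysis_value": [35, 46]})
hosp = pd.DataFrame({"fips": [47001, 47003], "analysis_value": [267, 204]})

# suffixes= renames only the overlapping columns, per merge.
merged = pd.merge(well, hosp, on="fips", suffixes=("_well", "_hosp"))
print(merged.columns.tolist())
# ['fips', 'analysis_value_well', 'analysis_value_hosp']
```

Columns that only appear in one frame (like `fips` here, used as the key) are left untouched; only the collisions get suffixed.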
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220 entries, 0 to 3219
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 3220 non-null int64
1 geography 3220 non-null object
2 measure_well 3220 non-null object
3 adjustment 3220 non-null object
4 analysis 3220 non-null object
5 domain 3220 non-null object
6 condition 3220 non-null object
7 primary_sex 3220 non-null object
8 primary_age 3220 non-null object
9 primary_dual 3220 non-null object
10 fips 3220 non-null int64
11 county 3205 non-null object
12 state 3220 non-null object
13 urban 3220 non-null object
14 primary_race 3220 non-null object
15 primary_eligibility 3220 non-null object
16 primary_denominator_well 3220 non-null object
17 analysis_value_well 3220 non-null int64
18 measure_hosp 3220 non-null object
19 primary_denominator_hosp 3220 non-null object
20 analysis_value_hosp 3220 non-null int64
21 measure_emer 3220 non-null object
22 primary_denominator_emer 3220 non-null object
23 analysis_value_emer 3220 non-null int64
dtypes: int64(5), object(19)
memory usage: 628.9+ KB
print(set(disparities.adjustment),set(disparities.analysis),set(disparities.domain),set(disparities.condition))
{'Unsmoothed actual'} {'Base measure'} {'Preventive Services'} {'Annual Wellness Visit'}
Each of the columns above has only one value, so I am going to remove them.
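Single-valued columns can also be detected programmatically with `nunique`, instead of printing sets by hand. A sketch on toy data (toy values, mirroring two of the constant columns):

```python
import pandas as pd

df = pd.DataFrame({
    "adjustment": ["Unsmoothed actual"] * 3,   # constant
    "analysis": ["Base measure"] * 3,          # constant
    "fips": [47001, 47003, 47005],             # varies
})

# Columns where the number of distinct values is exactly 1.
constant_cols = df.columns[df.nunique() == 1]
df = df.drop(columns=constant_cols)
print(df.columns.tolist())  # ['fips']
```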
disparities = disparities.drop(['adjustment','analysis','domain','condition'], axis = 1)
disparities.head(2)
year | geography | measure_well | primary_sex | primary_age | primary_dual | fips | county | state | urban | primary_race | primary_eligibility | primary_denominator_well | analysis_value_well | measure_hosp | primary_denominator_hosp | analysis_value_hosp | measure_emer | primary_denominator_emer | analysis_value_emer | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019 | County | Preventive Services | All | All | Dual & non-dual | 1001 | Autauga County | ALABAMA | Urban | All | All | 1,000-4,999 | 29 | Hospitalization | 5,000-9,999 | 299 | Emergency department visit rate | undefined | 723 |
1 | 2019 | County | Preventive Services | All | All | Dual & non-dual | 1003 | Baldwin County | ALABAMA | Rural | All | All | 10,000+ | 29 | Hospitalization | 10,000+ | 268 | Emergency department visit rate | undefined | 601 |
Lastly, I will narrow down the list to only counties in TN.
disparities = disparities.loc[disparities.state == 'TENNESSEE']
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 2428 to 2522
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 95 non-null int64
1 geography 95 non-null object
2 measure_well 95 non-null object
3 primary_sex 95 non-null object
4 primary_age 95 non-null object
5 primary_dual 95 non-null object
6 fips 95 non-null int64
7 county 95 non-null object
8 state 95 non-null object
9 urban 95 non-null object
10 primary_race 95 non-null object
11 primary_eligibility 95 non-null object
12 primary_denominator_well 95 non-null object
13 analysis_value_well 95 non-null int64
14 measure_hosp 95 non-null object
15 primary_denominator_hosp 95 non-null object
16 analysis_value_hosp 95 non-null int64
17 measure_emer 95 non-null object
18 primary_denominator_emer 95 non-null object
19 analysis_value_emer 95 non-null int64
dtypes: int64(5), object(15)
memory usage: 15.6+ KB
As before, we have all 95 counties in TN.
Question 1: Identify which counties in TN have the most severe lack of access to healthcare.
Which counties have low numbers of physicians?
physicians = pd.read_csv('../data/primary_care_physicians.csv')
physicians = physicians.loc[physicians.state == 'Tennessee']
physicians.shape
(95, 4)
population = pd.read_csv('../data/population_by_county.csv')
population.head()
FIPS | population | county | state | urban | |
---|---|---|---|---|---|
0 | 17051 | 21565 | Fayette County | ILLINOIS | Rural |
1 | 17107 | 29003 | Logan County | ILLINOIS | Rural |
2 | 17165 | 23994 | Saline County | ILLINOIS | Rural |
3 | 17097 | 701473 | Lake County | ILLINOIS | Urban |
4 | 17127 | 14219 | Massac County | ILLINOIS | Rural |
physicians = pd.merge(left = physicians, right = population[['FIPS','population','urban']], left_on='FIPS', right_on = 'FIPS')
physicians['urban'].value_counts()
Rural 57
Urban 38
Name: urban, dtype: int64
# Note: using >= on the upper rules so counties at exactly 1500 or 3500
# residents per physician are not left uncategorized.
physicians.loc[physicians.population/physicians.primary_care_physicians < 1500, 'supply'] = 'adequate'
physicians.loc[(physicians.population/physicians.primary_care_physicians >= 1500) &
               (physicians.population/physicians.primary_care_physicians < 3500), 'supply'] = 'moderately adequate'
physicians.loc[physicians.population/physicians.primary_care_physicians >= 3500, 'supply'] = 'low inadequate'
physicians = physicians.rename({'supply':'shadac_category'}, axis = 'columns')
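The three threshold rules above can also be expressed with a single `pd.cut`, which bins every value with no gaps at the boundaries. A sketch with three toy rows (toy populations and physician counts):

```python
import pandas as pd

physicians = pd.DataFrame({
    "population": [76061, 16140, 129927],
    "primary_care_physicians": [39.0, 3.0, 90.0],
})

residents_per_pcp = physicians["population"] / physicians["primary_care_physicians"]
# Bins are right-inclusive: (0, 1500], (1500, 3500], (3500, inf).
physicians["shadac_category"] = pd.cut(
    residents_per_pcp,
    bins=[0, 1500, 3500, float("inf")],
    labels=["adequate", "moderately adequate", "low inadequate"],
)
print(physicians["shadac_category"].tolist())
# ['moderately adequate', 'low inadequate', 'adequate']
```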
import matplotlib.pyplot as plt
pd.crosstab(physicians.urban, physicians.shadac_category, normalize='index').plot(kind = 'bar', stacked = True)
plt.xticks(rotation = 0)
plt.legend(bbox_to_anchor = (1, 0.8), loc = 'upper left'); # move the legend to the right side of the plot
plt.title("SHADAC Category by Type of County")
Text(0.5, 1.0, 'SHADAC Category by Type of County')

physicians
FIPS | state | county | primary_care_physicians | population | urban | shadac_category | |
---|---|---|---|---|---|---|---|
0 | 47001 | Tennessee | Anderson | 39.0 | 76061 | Urban | moderately adequate |
1 | 47003 | Tennessee | Bedford | 15.0 | 48292 | Rural | moderately adequate |
2 | 47005 | Tennessee | Benton | 3.0 | 16140 | Rural | low inadequate |
3 | 47007 | Tennessee | Bledsoe | 1.0 | 14836 | Rural | low inadequate |
4 | 47009 | Tennessee | Blount | 90.0 | 129927 | Urban | adequate |
... | ... | ... | ... | ... | ... | ... | ... |
90 | 47181 | Tennessee | Wayne | 5.0 | 16693 | Rural | moderately adequate |
91 | 47183 | Tennessee | Weakley | 18.0 | 33510 | Rural | moderately adequate |
92 | 47185 | Tennessee | White | 9.0 | 26800 | Rural | moderately adequate |
93 | 47187 | Tennessee | Williamson | 338.0 | 225389 | Urban | adequate |
94 | 47189 | Tennessee | Wilson | 43.0 | 136666 | Urban | moderately adequate |
95 rows × 7 columns
unemployment = pd.read_csv('../data/tn_unemployment.csv')
unemployment.head()
laus_code | State | County | Name | Period | LF | Employed | Unemployed | unemployment_rate | |
---|---|---|---|---|---|---|---|---|---|
0 | CN4700100000000 | 47 | 1 | Anderson County, TN | Mar-21 | 34704 | 33010 | 1694 | 4.9 |
1 | CN4700300000000 | 47 | 3 | Bedford County, TN | Mar-21 | 20623 | 19550 | 1073 | 5.2 |
2 | CN4700500000000 | 47 | 5 | Benton County, TN | Mar-21 | 6723 | 6305 | 418 | 6.2 |
3 | CN4700700000000 | 47 | 7 | Bledsoe County, TN | Mar-21 | 4252 | 3947 | 305 | 7.2 |
4 | CN4700900000000 | 47 | 9 | Blount County, TN | Mar-21 | 64098 | 61119 | 2979 | 4.6 |
unemployment.Name = unemployment.Name.str.split(' County, TN', expand = True)[0]
physicians = pd.merge(left=unemployment[['laus_code','Name','Employed','Unemployed','unemployment_rate']], left_on='Name',
right = physicians[['FIPS','county','primary_care_physicians','population','urban','shadac_category']], right_on='county')
physicians
laus_code | Name | Employed | Unemployed | unemployment_rate | FIPS | county | primary_care_physicians | population | urban | shadac_category | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | CN4700100000000 | Anderson | 33010 | 1694 | 4.9 | 47001 | Anderson | 39.0 | 76061 | Urban | moderately adequate |
1 | CN4700300000000 | Bedford | 19550 | 1073 | 5.2 | 47003 | Bedford | 15.0 | 48292 | Rural | moderately adequate |
2 | CN4700500000000 | Benton | 6305 | 418 | 6.2 | 47005 | Benton | 3.0 | 16140 | Rural | low inadequate |
3 | CN4700700000000 | Bledsoe | 3947 | 305 | 7.2 | 47007 | Bledsoe | 1.0 | 14836 | Rural | low inadequate |
4 | CN4700900000000 | Blount | 61119 | 2979 | 4.6 | 47009 | Blount | 90.0 | 129927 | Urban | adequate |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
90 | CN4718100000000 | Wayne | 6074 | 342 | 5.3 | 47181 | Wayne | 5.0 | 16693 | Rural | moderately adequate |
91 | CN4718300000000 | Weakley | 14783 | 711 | 4.6 | 47183 | Weakley | 18.0 | 33510 | Rural | moderately adequate |
92 | CN4718500000000 | White | 11484 | 601 | 5.0 | 47185 | White | 9.0 | 26800 | Rural | moderately adequate |
93 | CN4718700000000 | Williamson | 125213 | 4271 | 3.3 | 47187 | Williamson | 338.0 | 225389 | Urban | adequate |
94 | CN4718900000000 | Wilson | 74347 | 3079 | 4.0 | 47189 | Wilson | 43.0 | 136666 | Urban | moderately adequate |
95 rows × 11 columns
import seaborn as sns
sns.boxplot(x = physicians.urban, y = physicians.unemployment_rate)
plt.xticks(rotation = 0)
plt.title("Unemployment by Type of County")
Text(0.5, 1.0, 'Unemployment by Type of County')

physicians['pcp_per_100k'] = physicians.primary_care_physicians/physicians.population*100000
sns.lmplot(data = physicians, y = 'pcp_per_100k', x = 'unemployment_rate')
plt.ylabel("PCP per 100k residents")
plt.xlabel("Unemployment Rate")

physicians[['unemployment_rate', 'pcp_per_100k']].corr()
unemployment_rate | pcp_per_100k | |
---|---|---|
unemployment_rate | 1.000000 | -0.335333 |
pcp_per_100k | -0.335333 | 1.000000 |
disparities = pd.merge(left = disparities, left_on = 'fips',
right = physicians, right_on = 'FIPS')
Now get rid of repetitive columns:
disparities.head(2)
year | geography | measure_well | primary_sex | primary_age | primary_dual | fips | county_x | state | urban_x | ... | Employed | Unemployed | unemployment_rate | FIPS | county_y | primary_care_physicians | population | urban_y | shadac_category | pcp_per_100k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019 | County | Preventive Services | All | All | Dual & non-dual | 47001 | Anderson County | TENNESSEE | Urban | ... | 33010 | 1694 | 4.9 | 47001 | Anderson | 39.0 | 76061 | Urban | moderately adequate | 51.274635 |
1 | 2019 | County | Preventive Services | All | All | Dual & non-dual | 47003 | Bedford County | TENNESSEE | Rural | ... | 19550 | 1073 | 5.2 | 47003 | Bedford | 15.0 | 48292 | Rural | moderately adequate | 31.061045 |
2 rows × 32 columns
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 94
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 95 non-null int64
1 geography 95 non-null object
2 measure_well 95 non-null object
3 primary_sex 95 non-null object
4 primary_age 95 non-null object
5 primary_dual 95 non-null object
6 fips 95 non-null int64
7 county_x 95 non-null object
8 state 95 non-null object
9 urban_x 95 non-null object
10 primary_race 95 non-null object
11 primary_eligibility 95 non-null object
12 primary_denominator_well 95 non-null object
13 analysis_value_well 95 non-null int64
14 measure_hosp 95 non-null object
15 primary_denominator_hosp 95 non-null object
16 analysis_value_hosp 95 non-null int64
17 measure_emer 95 non-null object
18 primary_denominator_emer 95 non-null object
19 analysis_value_emer 95 non-null int64
20 laus_code 95 non-null object
21 Name 95 non-null object
22 Employed 95 non-null int64
23 Unemployed 95 non-null int64
24 unemployment_rate 95 non-null float64
25 FIPS 95 non-null int64
26 county_y 95 non-null object
27 primary_care_physicians 95 non-null float64
28 population 95 non-null int64
29 urban_y 95 non-null object
30 shadac_category 95 non-null object
31 pcp_per_100k 95 non-null float64
dtypes: float64(3), int64(9), object(20)
memory usage: 24.5+ KB
disparities = disparities.drop(['geography','county_x','urban_x', 'fips'], axis = 1)
disparities.index = disparities.FIPS
disparities.columns = disparities.columns.str.replace('county_y', 'county')
disparities.columns = disparities.columns.str.replace('urban_y', 'urban')
print(set(disparities.measure_well),set(disparities.measure_emer),set(disparities.measure_hosp))
{'Preventive Services'} {'Emergency department visit rate'} {'Hospitalization'}
disparities = disparities.drop(['measure_well','measure_emer','measure_hosp'], axis = 1)
disparities.iloc[0]
year 2019
primary_sex All
primary_age All
primary_dual Dual & non-dual
state TENNESSEE
primary_race All
primary_eligibility All
primary_denominator_well 5,000-9,999
analysis_value_well 35
primary_denominator_hosp 10,000+
analysis_value_hosp 267
primary_denominator_emer undefined
analysis_value_emer 617
laus_code CN4700100000000
Name Anderson
Employed 33010
Unemployed 1694
unemployment_rate 4.9
FIPS 47001
county Anderson
primary_care_physicians 39.0
population 76061
urban Urban
shadac_category moderately adequate
pcp_per_100k 51.274635
Name: 47001, dtype: object
disparities.to_csv("../data/disparities.csv")
Analyzing disparities in access: physicians
disparities = pd.read_csv('../data/disparities.csv', index_col = 0)
hospitals = pd.read_csv('../data/Hospitals.csv')
hospitals = hospitals.loc[hospitals.STATE =='TN']
hospitals.COUNTYFIPS = pd.to_numeric(hospitals['COUNTYFIPS'])
disparities.corr().style.background_gradient('viridis')
/Users/smgroves/Documents/anaconda3/envs/geo_dse/lib/python3.9/site-packages/pandas/io/formats/style.py:2813: RuntimeWarning: All-NaN slice encountered
smin = np.nanmin(gmap) if vmin is None else vmin
/Users/smgroves/Documents/anaconda3/envs/geo_dse/lib/python3.9/site-packages/pandas/io/formats/style.py:2814: RuntimeWarning: All-NaN slice encountered
smax = np.nanmax(gmap) if vmax is None else vmax
year | analysis_value_well | analysis_value_hosp | analysis_value_emer | Employed | Unemployed | unemployment_rate | FIPS.1 | primary_care_physicians | population | pcp_per_100k | |
---|---|---|---|---|---|---|---|---|---|---|---|
year | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
analysis_value_well | nan | 1.000000 | -0.480732 | -0.480607 | 0.288357 | 0.221078 | -0.325228 | 0.032489 | 0.319609 | 0.281995 | 0.488941 |
analysis_value_hosp | nan | -0.480732 | 1.000000 | 0.599412 | -0.093001 | -0.049513 | 0.275980 | 0.085773 | -0.121397 | -0.091323 | -0.308934 |
analysis_value_emer | nan | -0.480607 | 0.599412 | 1.000000 | -0.179994 | -0.121924 | 0.484027 | -0.044881 | -0.202044 | -0.172820 | -0.396181 |
Employed | nan | 0.288357 | -0.093001 | -0.179994 | 1.000000 | 0.965032 | -0.196023 | 0.089982 | 0.961056 | 0.992966 | 0.502287 |
Unemployed | nan | 0.221078 | -0.049513 | -0.121924 | 0.965032 | 1.000000 | -0.077655 | 0.091267 | 0.934610 | 0.985295 | 0.432351 |
unemployment_rate | nan | -0.325228 | 0.275980 | 0.484027 | -0.196023 | -0.077655 | 1.000000 | -0.104688 | -0.159979 | -0.156259 | -0.335333 |
FIPS.1 | nan | 0.032489 | 0.085773 | -0.044881 | 0.089982 | 0.091267 | -0.104688 | 1.000000 | 0.093030 | 0.091808 | 0.103152 |
primary_care_physicians | nan | 0.319609 | -0.121397 | -0.202044 | 0.961056 | 0.934610 | -0.159979 | 0.093030 | 1.000000 | 0.964170 | 0.624601 |
population | nan | 0.281995 | -0.091323 | -0.172820 | 0.992966 | 0.985295 | -0.156259 | 0.091808 | 0.964170 | 1.000000 | 0.501609 |
pcp_per_100k | nan | 0.488941 | -0.308934 | -0.396181 | 0.502287 | 0.432351 | -0.335333 | 0.103152 | 0.624601 | 0.501609 | 1.000000 |
disparities.columns
Index(['year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
'primary_race', 'primary_eligibility', 'primary_denominator_well',
'analysis_value_well', 'primary_denominator_hosp',
'analysis_value_hosp', 'primary_denominator_emer',
'analysis_value_emer', 'laus_code', 'Name', 'Employed', 'Unemployed',
'unemployment_rate', 'FIPS.1', 'county', 'primary_care_physicians',
'population', 'urban', 'shadac_category', 'pcp_per_100k'],
dtype='object')
import seaborn as sns
sns.pairplot(disparities[['analysis_value_well',
'analysis_value_hosp',
'analysis_value_emer',
'unemployment_rate',
'population',
'pcp_per_100k','urban'
]], hue = 'urban')
<seaborn.axisgrid.PairGrid at 0x19f0b01c0>

First, we will look at primary care physicians by county. We already did this, so we will summarize in plots below.
sns.boxplot(x = disparities.shadac_category, y = disparities.pcp_per_100k)
<AxesSubplot:xlabel='shadac_category', ylabel='pcp_per_100k'>

disparities.shadac_category.value_counts()
moderately adequate 50
low inadequate 31
adequate 14
Name: shadac_category, dtype: int64
There are 31 counties considered low inadequate (>3500 residents per physician).
low_shadac = disparities.loc[disparities.shadac_category == 'low inadequate']
low_shadac.sort_values('pcp_per_100k').head(10)
year | primary_sex | primary_age | primary_dual | state | primary_race | primary_eligibility | primary_denominator_well | analysis_value_well | primary_denominator_hosp | ... | Employed | Unemployed | unemployment_rate | FIPS.1 | county | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47175 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 31 | 500-999 | ... | 1884 | 137 | 6.8 | 47175 | Van Buren | 0.0 | 5760 | Rural | low inadequate | 0.000000 |
47033 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 25 | 1,000-4,999 | ... | 6494 | 344 | 5.0 | 47033 | Crockett | 0.0 | 14399 | Rural | low inadequate | 0.000000 |
47061 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 34 | 1,000-4,999 | ... | 4655 | 322 | 6.5 | 47061 | Grundy | 0.0 | 13344 | Rural | low inadequate | 0.000000 |
47095 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 20 | 1,000-4,999 | ... | 1537 | 157 | 9.3 | 47095 | Lake | 0.0 | 7401 | Rural | low inadequate | 0.000000 |
47007 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 19 | 1,000-4,999 | ... | 3947 | 305 | 7.2 | 47007 | Bledsoe | 1.0 | 14836 | Rural | low inadequate | 6.740361 |
47161 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 39 | 1,000-4,999 | ... | 5146 | 305 | 5.6 | 47161 | Stewart | 1.0 | 13427 | Urban | low inadequate | 7.447680 |
47097 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 36 | 1,000-4,999 | ... | 9075 | 658 | 6.8 | 47097 | Lauderdale | 2.0 | 25989 | Rural | low inadequate | 7.695564 |
47129 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 23 | 1,000-4,999 | ... | 7327 | 455 | 5.8 | 47129 | Morgan | 3.0 | 21545 | Rural | low inadequate | 13.924344 |
47117 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 20 | 1,000-4,999 | ... | 14422 | 735 | 4.8 | 47117 | Marshall | 5.0 | 32965 | Rural | low inadequate | 15.167602 |
47067 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 27 | 500-999 | ... | 1997 | 144 | 6.7 | 47067 | Hancock | 1.0 | 6587 | Rural | low inadequate | 15.181418 |
10 rows × 25 columns
There are 31 counties with low access to physicians by SHADAC category. In particular, four counties have 0 physicians: Van Buren, Crockett, Grundy, and Lake. All four are rural, though Crockett and Grundy each have more than 13,000 residents.
Analyzing disparities in access: hospitals
We would like to add hospitals per 100k residents for each county to disparities. To do this, we need hospitals.COUNTYFIPS in the same numeric format as the FIPS codes that index disparities, and then we can tally the number of hospitals (plus associated variables) in each county.
hospitals.iloc[0]
X -9409026.0792
Y 4210878.1252
OBJECTID 517
ID 14737331
NAME STARR REGIONAL MEDICAL CENTER ETOWAH
ADDRESS 886 HIGHWAY 411 NORTH
CITY ETOWAH
STATE TN
ZIP 37331
ZIP4 NOT AVAILABLE
TELEPHONE (423) 263-3600
TYPE GENERAL ACUTE CARE
STATUS OPEN
POPULATION 72
COUNTY MCMINN
COUNTYFIPS 47107
COUNTRY USA
LATITUDE 35.345099
LONGITUDE -84.522719
NAICS_CODE 622110
NAICS_DESC GENERAL MEDICAL AND SURGICAL HOSPITALS
SOURCE https://apps.health.tn.gov/facilitylistings
SOURCEDATE 2019/08/10 00:00:00
VAL_METHOD IMAGERY
VAL_DATE 2014/02/10 00:00:00
WEBSITE http://www.starrregional.com/
STATE_ID NOT AVAILABLE
ALT_NAME WOODS MEMORIAL HOSPITAL
ST_FIPS 47
OWNER NON-PROFIT
TTL_STAFF -999
BEDS 72
TRAUMA LEVEL III
HELIPAD Y
Name: 515, dtype: object
hospitals.COUNTYFIPS.value_counts()
47157 23
47037 15
47065 13
47093 11
47179 7
..
47019 1
47177 1
47027 1
47159 1
47181 1
Name: COUNTYFIPS, Length: 80, dtype: int64
for i in range(len(hospitals.COUNTYFIPS.value_counts())):
    ind = hospitals.COUNTYFIPS.value_counts().index[i]
    count = hospitals.COUNTYFIPS.value_counts()[ind]
    disparities.loc[ind, 'num_hospitals'] = count
However, some of these hospitals are closed.
set(hospitals.STATUS)
{'CLOSED', 'OPEN'}
hospitals.groupby('COUNTYFIPS').STATUS.value_counts()
COUNTYFIPS STATUS
47001 OPEN 1
47003 CLOSED 1
OPEN 1
47005 OPEN 1
47007 OPEN 1
..
47181 OPEN 1
47183 OPEN 2
47185 OPEN 1
47187 OPEN 2
47189 OPEN 2
Name: STATUS, Length: 89, dtype: int64
for i in range(len(hospitals.loc[hospitals.STATUS == 'OPEN'].COUNTYFIPS.value_counts())):
    ind = hospitals.loc[hospitals.STATUS == 'OPEN'].COUNTYFIPS.value_counts().index[i]
    count = hospitals.loc[hospitals.STATUS == 'OPEN'].COUNTYFIPS.value_counts()[ind]
    disparities.loc[ind, 'num_hospitals_open'] = count
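Both counting loops can be replaced with a vectorized `value_counts` plus a `reindex` onto the county index, so counties with no hospitals get 0 instead of NaN. A sketch with toy FIPS codes and statuses (toy values, not the real counts):

```python
import pandas as pd

# Hypothetical hospitals table: four hospitals across three counties.
hospitals = pd.DataFrame({
    "COUNTYFIPS": [47001, 47003, 47003, 47005],
    "STATUS": ["OPEN", "OPEN", "CLOSED", "OPEN"],
})
# Hypothetical disparities frame indexed by FIPS; 47007 has no hospitals.
disparities = pd.DataFrame(index=[47001, 47003, 47005, 47007])

# Count hospitals per county, then reindex onto the full county list.
counts = hospitals["COUNTYFIPS"].value_counts()
open_counts = hospitals.loc[hospitals.STATUS == "OPEN", "COUNTYFIPS"].value_counts()
disparities["num_hospitals"] = counts.reindex(disparities.index, fill_value=0)
disparities["num_hospitals_open"] = open_counts.reindex(disparities.index, fill_value=0)
print(disparities["num_hospitals_open"].tolist())  # [1, 1, 1, 0]
```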
disparities
year | primary_sex | primary_age | primary_dual | state | primary_race | primary_eligibility | primary_denominator_well | analysis_value_well | primary_denominator_hosp | ... | unemployment_rate | FIPS.1 | county | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | num_hospitals | num_hospitals_open | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47001 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 5,000-9,999 | 35 | 10,000+ | ... | 4.9 | 47001 | Anderson | 39.0 | 76061 | Urban | moderately adequate | 51.274635 | 1.0 | 1.0 |
47003 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 46 | 5,000-9,999 | ... | 5.2 | 47003 | Bedford | 15.0 | 48292 | Rural | moderately adequate | 31.061045 | 2.0 | 1.0 |
47005 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 30 | 1,000-4,999 | ... | 6.2 | 47005 | Benton | 3.0 | 16140 | Rural | low inadequate | 18.587361 | 1.0 | 1.0 |
47007 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 19 | 1,000-4,999 | ... | 7.2 | 47007 | Bledsoe | 1.0 | 14836 | Rural | low inadequate | 6.740361 | 1.0 | 1.0 |
47009 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 43 | 10,000+ | ... | 4.6 | 47009 | Blount | 90.0 | 129927 | Urban | adequate | 69.269667 | 2.0 | 2.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47181 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 15 | 1,000-4,999 | ... | 5.3 | 47181 | Wayne | 5.0 | 16693 | Rural | moderately adequate | 29.952675 | 1.0 | 1.0 |
47183 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 30 | 5,000-9,999 | ... | 4.6 | 47183 | Weakley | 18.0 | 33510 | Rural | moderately adequate | 53.715309 | 2.0 | 2.0 |
47185 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 35 | 5,000-9,999 | ... | 5.0 | 47185 | White | 9.0 | 26800 | Rural | moderately adequate | 33.582090 | 1.0 | 1.0 |
47187 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 48 | 10,000+ | ... | 3.3 | 47187 | Williamson | 338.0 | 225389 | Urban | adequate | 149.962953 | 2.0 | 2.0 |
47189 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 45 | 10,000+ | ... | 4.0 | 47189 | Wilson | 43.0 | 136666 | Urban | moderately adequate | 31.463568 | 2.0 | 2.0 |
95 rows × 27 columns
sns.scatterplot(data = disparities, x = 'pcp_per_100k', y = 'num_hospitals_open', hue = 'urban')
<AxesSubplot:xlabel='pcp_per_100k', ylabel='num_hospitals_open'>

disparities = disparities.fillna(0)
disparities.loc[(disparities.pcp_per_100k < 15) &
                (disparities.num_hospitals_open < 2)]['county']
FIPS
47007 Bledsoe
47033 Crockett
47061 Grundy
47095 Lake
47097 Lauderdale
47129 Morgan
47161 Stewart
47175 Van Buren
Name: county, dtype: object
The counties above are those with fewer than 2 open hospitals and fewer than 15 primary care physicians per 100,000 residents.
Lastly, we can take a look at the number of beds. The BEDS column uses -999 as a placeholder for missing values, so we replace it with 0 so it doesn’t skew the counts.
hospitals.BEDS = hospitals.BEDS.replace(-999,0)
open_hospitals = hospitals.loc[hospitals.STATUS == 'OPEN']
# Mean number of beds per open hospital in each county
for ind in open_hospitals.COUNTYFIPS.value_counts().index:
    disparities.loc[ind, 'mean_beds'] = open_hospitals.groupby('COUNTYFIPS').BEDS.mean()[ind]
# Total number of beds across open hospitals in each county
for ind in open_hospitals.COUNTYFIPS.value_counts().index:
    disparities.loc[ind, 'num_beds'] = open_hospitals.groupby('COUNTYFIPS').BEDS.sum()[ind]
disparities.num_beds = disparities.num_beds.fillna(0)
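The two loops above can also be collapsed into a single groupby aggregation. A minimal sketch of that approach on made-up toy rows (the column names mirror the notebook's, but the values are invented):

```python
import pandas as pd

# Toy stand-ins for the notebook's hospitals and disparities frames
hospitals = pd.DataFrame({
    'COUNTYFIPS': [47001, 47001, 47003, 47003],
    'STATUS': ['OPEN', 'OPEN', 'OPEN', 'CLOSED'],
    'BEDS': [100, 50, 30, 20],
})
disparities = pd.DataFrame({'county': ['Anderson', 'Bedford', 'Benton']},
                           index=[47001, 47003, 47005])

# One groupby gives both the mean and total beds per county with open hospitals
beds = (hospitals.loc[hospitals.STATUS == 'OPEN']
        .groupby('COUNTYFIPS').BEDS.agg(['mean', 'sum']))

# Assignment aligns on the FIPS index; counties without open hospitals get 0 beds
disparities['mean_beds'] = beds['mean']
disparities['num_beds'] = beds['sum'].reindex(disparities.index, fill_value=0)
print(disparities)
```

This avoids recomputing the groupby inside a loop and handles counties with no open hospitals in one step.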
sns.scatterplot(data = disparities.loc[disparities.urban == 'Rural'], x = 'num_hospitals_open', y = 'num_beds')#, hue = 'urban')
<AxesSubplot:xlabel='num_hospitals_open', ylabel='num_beds'>

sns.scatterplot(data = disparities, x = 'pcp_per_100k', y = 'num_beds', hue = 'urban')
<AxesSubplot:xlabel='pcp_per_100k', ylabel='num_beds'>

disparities.to_csv('../data/disparities.csv')
Plotting disparities
median_income = pd.read_excel('../data/est18all.xls',
                              sheet_name = 'est18ALL',
                              header = 3,
                              usecols = 'C,D,W')
median_income.head(2)
Postal Code | Name | Median Household Income | |
---|---|---|---|
0 | US | United States | 61937 |
1 | AL | Alabama | 49881 |
For this map, we only need the counties located in Tennessee.
median_income = median_income.loc[median_income['Postal Code'] == 'TN']
median_income.head(2)
Postal Code | Name | Median Household Income | |
---|---|---|---|
2471 | TN | Tennessee | 52366 |
2472 | TN | Anderson County | 50672 |
We can remove the first row.
median_income = median_income.iloc[1:]
median_income.head(2)
Postal Code | Name | Median Household Income | |
---|---|---|---|
2472 | TN | Anderson County | 50672 |
2473 | TN | Bedford County | 49860 |
Now, let’s read in our county shapefile. This one was obtained from http://www.tngis.org/administrative-boundaries.htm
This creates a geopandas DataFrame, which is like a pandas DataFrame, but has geometry associated with it.
counties = gpd.read_file('../data/county/tncounty.shp')
counties.head()
OBJECTID | NAME | KEY | SHAPE_AREA | SHAPE_LEN | geometry | |
---|---|---|---|---|---|---|
0 | 76 | Chester | 47023 | 8.049024e+09 | 520461.080124 | POLYGON ((1137985.762 344601.643, 1137965.070 ... |
1 | 77 | Wayne | 47181 | 2.050741e+10 | 666520.678598 | POLYGON ((1365052.057 391716.806, 1365746.554 ... |
2 | 78 | Tipton | 47167 | 1.319125e+10 | 865093.887634 | MULTIPOLYGON (((886814.330 400456.525, 886774.... |
3 | 79 | Hamilton | 47065 | 1.604776e+10 | 652926.001078 | POLYGON ((2274954.438 239788.911, 2274090.610 ... |
4 | 80 | Stewart | 47161 | 1.375003e+10 | 490090.336180 | POLYGON ((1382472.783 743972.302, 1382445.171 ... |
median_income['NAME'] = median_income['Name'].str[:-7]
counties = pd.merge(left = counties,
                    right = median_income[['NAME', 'Median Household Income']])
counties['Median Household Income'] = pd.to_numeric(counties['Median Household Income'])
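Slicing with `str[:-7]` works here because every Tennessee entry ends in " County", but an explicit replace is sturdier if the file ever contains other endings or stray whitespace. A small sketch with toy rows (real column names, invented values):

```python
import pandas as pd

median_income = pd.DataFrame({
    'Name': ['Anderson County', 'Bedford County'],
    'Median Household Income': ['50672', '49860'],
})

# Remove the ' County' suffix explicitly rather than slicing a fixed width
median_income['NAME'] = (median_income['Name']
                         .str.replace(' County', '', regex=False)
                         .str.strip())
median_income['Median Household Income'] = pd.to_numeric(
    median_income['Median Household Income'])
print(median_income[['NAME', 'Median Household Income']])
```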
from matplotlib.lines import Line2D
fig, ax = plt.subplots(figsize=(16,4))
counties.plot(column = 'Median Household Income',
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme = "NaturalBreaks",
              ax = ax)
leg = ax.get_legend()
# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "\${:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "\${:,.0f}".format(lower)
    else:
        new_text = "\${:,.0f}".format(lower + 1) + " - " + "\${:,.0f}".format(upper)
    labels.append(new_text)
# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0], [0], marker = 'o',
                    markersize = line.get_markersize(),
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)
# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
plt.title('Median Household Income by County, 2018', fontsize = 18)
ax.axis('off');

counties = pd.merge(left = counties, left_on = 'NAME',
                    right = disparities, right_on = 'county')
counties
OBJECTID | NAME | KEY | SHAPE_AREA | SHAPE_LEN | geometry | Median Household Income | year | primary_sex | primary_age | ... | county | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | num_hospitals | num_hospitals_open | mean_beds | num_beds | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 76 | Chester | 47023 | 8.049024e+09 | 520461.080124 | POLYGON ((1137985.762 344601.643, 1137965.070 ... | 47508 | 2019 | All | All | ... | Chester | 4.0 | 17190 | Urban | low inadequate | 23.269343 | 0.0 | 0.0 | NaN | 0.0 |
1 | 77 | Wayne | 47181 | 2.050741e+10 | 666520.678598 | POLYGON ((1365052.057 391716.806, 1365746.554 ... | 38879 | 2019 | All | All | ... | Wayne | 5.0 | 16693 | Rural | moderately adequate | 29.952675 | 1.0 | 1.0 | 80.0 | 80.0 |
2 | 78 | Tipton | 47167 | 1.319125e+10 | 865093.887634 | MULTIPOLYGON (((886814.330 400456.525, 886774.... | 60874 | 2019 | All | All | ... | Tipton | 17.0 | 61447 | Urban | low inadequate | 27.666119 | 1.0 | 1.0 | 100.0 | 100.0 |
3 | 79 | Hamilton | 47065 | 1.604776e+10 | 652926.001078 | POLYGON ((2274954.438 239788.911, 2274090.610 ... | 57196 | 2019 | All | All | ... | Hamilton | 403.0 | 360919 | Urban | adequate | 111.659403 | 13.0 | 12.0 | 160.0 | 1920.0 |
4 | 80 | Stewart | 47161 | 1.375003e+10 | 490090.336180 | POLYGON ((1382472.783 743972.302, 1382445.171 ... | 46565 | 2019 | All | All | ... | Stewart | 1.0 | 13427 | Urban | low inadequate | 7.447680 | 0.0 | 0.0 | NaN | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
90 | 91 | McNairy | 47109 | 1.563586e+10 | 566369.132062 | POLYGON ((1137985.762 344601.643, 1139350.519 ... | 39859 | 2019 | All | All | ... | McNairy | 9.0 | 25844 | Rural | moderately adequate | 34.824331 | 0.0 | 0.0 | NaN | 0.0 |
91 | 92 | Franklin | 47051 | 1.605093e+10 | 621176.096919 | POLYGON ((1873015.265 239618.144, 1872957.848 ... | 50201 | 2019 | All | All | ... | Franklin | 27.0 | 41725 | Rural | moderately adequate | 64.709407 | 2.0 | 2.0 | 86.5 | 173.0 |
92 | 93 | Bradley | 47011 | 9.241234e+09 | 457372.233476 | POLYGON ((2274954.438 239788.911, 2275552.803 ... | 50427 | 2019 | All | All | ... | Bradley | 55.0 | 105749 | Urban | moderately adequate | 52.009948 | 2.0 | 2.0 | 174.0 | 348.0 |
93 | 94 | Marion | 47115 | 1.428734e+10 | 529431.591556 | POLYGON ((2126056.390 236919.771, 2122873.509 ... | 50819 | 2019 | All | All | ... | Marion | 12.0 | 28538 | Urban | moderately adequate | 42.049198 | 1.0 | 1.0 | 68.0 | 68.0 |
94 | 95 | Polk | 47139 | 1.233228e+10 | 479994.126988 | POLYGON ((2355580.184 332970.851, 2355673.384 ... | 41968 | 2019 | All | All | ... | Polk | 8.0 | 16814 | Urban | moderately adequate | 47.579398 | 1.0 | 0.0 | NaN | 0.0 |
95 rows × 36 columns
counties
OBJECTID | NAME | KEY | SHAPE_AREA | SHAPE_LEN | geometry | Median Household Income | year | primary_sex | primary_age | ... | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | num_hospitals | num_hospitals_open | mean_beds | num_beds | num_hospitals_per_100k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 76 | Chester | 47023 | 8.049024e+09 | 520461.080124 | POLYGON ((1137985.762 344601.643, 1137965.070 ... | 47508 | 2019 | All | All | ... | 4.0 | 17190 | Urban | low inadequate | 23.269343 | 0.0 | 0.0 | NaN | 0.0 | 0.000000 |
1 | 77 | Wayne | 47181 | 2.050741e+10 | 666520.678598 | POLYGON ((1365052.057 391716.806, 1365746.554 ... | 38879 | 2019 | All | All | ... | 5.0 | 16693 | Rural | moderately adequate | 29.952675 | 1.0 | 1.0 | 80.0 | 80.0 | 5.990535 |
2 | 78 | Tipton | 47167 | 1.319125e+10 | 865093.887634 | MULTIPOLYGON (((886814.330 400456.525, 886774.... | 60874 | 2019 | All | All | ... | 17.0 | 61447 | Urban | low inadequate | 27.666119 | 1.0 | 1.0 | 100.0 | 100.0 | 1.627419 |
3 | 79 | Hamilton | 47065 | 1.604776e+10 | 652926.001078 | POLYGON ((2274954.438 239788.911, 2274090.610 ... | 57196 | 2019 | All | All | ... | 403.0 | 360919 | Urban | adequate | 111.659403 | 13.0 | 12.0 | 160.0 | 1920.0 | 3.324846 |
4 | 80 | Stewart | 47161 | 1.375003e+10 | 490090.336180 | POLYGON ((1382472.783 743972.302, 1382445.171 ... | 46565 | 2019 | All | All | ... | 1.0 | 13427 | Urban | low inadequate | 7.447680 | 0.0 | 0.0 | NaN | 0.0 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
90 | 91 | McNairy | 47109 | 1.563586e+10 | 566369.132062 | POLYGON ((1137985.762 344601.643, 1139350.519 ... | 39859 | 2019 | All | All | ... | 9.0 | 25844 | Rural | moderately adequate | 34.824331 | 0.0 | 0.0 | NaN | 0.0 | 0.000000 |
91 | 92 | Franklin | 47051 | 1.605093e+10 | 621176.096919 | POLYGON ((1873015.265 239618.144, 1872957.848 ... | 50201 | 2019 | All | All | ... | 27.0 | 41725 | Rural | moderately adequate | 64.709407 | 2.0 | 2.0 | 86.5 | 173.0 | 4.793289 |
92 | 93 | Bradley | 47011 | 9.241234e+09 | 457372.233476 | POLYGON ((2274954.438 239788.911, 2275552.803 ... | 50427 | 2019 | All | All | ... | 55.0 | 105749 | Urban | moderately adequate | 52.009948 | 2.0 | 2.0 | 174.0 | 348.0 | 1.891271 |
93 | 94 | Marion | 47115 | 1.428734e+10 | 529431.591556 | POLYGON ((2126056.390 236919.771, 2122873.509 ... | 50819 | 2019 | All | All | ... | 12.0 | 28538 | Urban | moderately adequate | 42.049198 | 1.0 | 1.0 | 68.0 | 68.0 | 3.504100 |
94 | 95 | Polk | 47139 | 1.233228e+10 | 479994.126988 | POLYGON ((2355580.184 332970.851, 2355673.384 ... | 41968 | 2019 | All | All | ... | 8.0 | 16814 | Urban | moderately adequate | 47.579398 | 1.0 | 0.0 | NaN | 0.0 | 0.000000 |
95 rows × 37 columns
from matplotlib.lines import Line2D
fig, ax = plt.subplots(figsize=(16,4))
counties.plot(column = 'num_hospitals_per_100k',
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme = "Quantiles",
              ax = ax,
              # missing_kwds = {"color": "lightgrey",
              #                 "edgecolor": "grey",
              #                 "label": "Missing values"},
              )
leg = ax.get_legend()
# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    if label_text == 'Missing values':
        labels.append(label_text)
        continue
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "{:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "{:,.0f}".format(lower)
    else:
        new_text = "{:,.0f}".format(lower + 1) + " - " + "{:,.0f}".format(upper)
    labels.append(new_text)
# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0], [0], marker = 'o',
                    markersize = line.get_markersize(),
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)
# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
plt.title('Number of Hospitals per 100K Residents by County', fontsize = 18)
ax.axis('off');

from matplotlib.lines import Line2D
fig, ax = plt.subplots(figsize=(16,4))
counties.plot(column = 'pcp_per_100k',
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme = "Quantiles",
              ax = ax,
              # missing_kwds = {"color": "lightgrey",
              #                 "edgecolor": "grey",
              #                 "label": "Missing values"},
              )
leg = ax.get_legend()
# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    if label_text == 'Missing values':
        labels.append(label_text)
        continue
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "{:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "{:,.0f}".format(lower)
    else:
        new_text = "{:,.0f}".format(lower + 1) + " - " + "{:,.0f}".format(upper)
    labels.append(new_text)
# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0], [0], marker = 'o',
                    markersize = line.get_markersize(),
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)
# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
plt.title('Number of PCP per 100K Residents by County', fontsize = 18)
ax.axis('off');

sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_beds', hue = 'shadac_category')
<AxesSubplot:xlabel='Median Household Income', ylabel='num_beds'>

counties['num_hospitals_per_100k']=counties.num_hospitals_open/counties.population*100000
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_hospitals_open', hue = 'shadac_category')
<AxesSubplot:xlabel='Median Household Income', ylabel='num_hospitals_open'>

sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_hospitals_per_100k', hue = 'urban')
plt.ylabel("Number Open Hospitals per 100K Residents")
plt.savefig('income_vs_hosp_urban.pdf')

disparities.sort_values(['pcp_per_100k','population'], ascending = [True,False])[['county', 'num_beds','num_hospitals','pcp_per_100k','population']][0:25]
county | num_beds | num_hospitals | pcp_per_100k | population | |
---|---|---|---|---|---|
FIPS | |||||
47033 | Crockett | 0.0 | 0.0 | 0.000000 | 14399 |
47061 | Grundy | 0.0 | 0.0 | 0.000000 | 13344 |
47095 | Lake | 0.0 | 0.0 | 0.000000 | 7401 |
47175 | Van Buren | 0.0 | 0.0 | 0.000000 | 5760 |
47007 | Bledsoe | 25.0 | 1.0 | 6.740361 | 14836 |
47161 | Stewart | 0.0 | 0.0 | 7.447680 | 13427 |
47097 | Lauderdale | 25.0 | 1.0 | 7.695564 | 25989 |
47129 | Morgan | 0.0 | 0.0 | 13.924344 | 21545 |
47117 | Marshall | 25.0 | 1.0 | 15.167602 | 32965 |
47067 | Hancock | 10.0 | 1.0 | 15.181418 | 6587 |
47173 | Union | 0.0 | 0.0 | 15.394089 | 19488 |
47127 | Moore | 0.0 | 0.0 | 15.678896 | 6378 |
47069 | Hardeman | 25.0 | 1.0 | 15.721416 | 25443 |
47101 | Lewis | 0.0 | 0.0 | 16.629251 | 12027 |
47111 | Macon | 25.0 | 1.0 | 16.777116 | 23842 |
47087 | Jackson | 0.0 | 1.0 | 17.120356 | 11682 |
47005 | Benton | 25.0 | 1.0 | 18.587361 | 16140 |
47123 | Monroe | 59.0 | 1.0 | 19.538034 | 46064 |
47169 | Trousdale | 25.0 | 1.0 | 19.548431 | 10231 |
47081 | Hickman | 25.0 | 1.0 | 20.150727 | 24813 |
47015 | Cannon | 60.0 | 1.0 | 21.159543 | 14178 |
47057 | Grainger | 0.0 | 0.0 | 21.644085 | 23101 |
47023 | Chester | 0.0 | 0.0 | 23.269343 | 17190 |
47083 | Houston | 25.0 | 1.0 | 24.497795 | 8164 |
47077 | Henderson | 45.0 | 1.0 | 25.020553 | 27977 |
What happens if we add a few hospitals in TN?
sns.kdeplot(disparities.num_hospitals_open, hue = disparities.urban)
<AxesSubplot:xlabel='num_hospitals_open', ylabel='Density'>

predict = disparities.copy()
disparities.sort_values(['num_hospitals_open','pcp_per_100k'])[0:1]
year | primary_sex | primary_age | primary_dual | state | primary_race | primary_eligibility | primary_denominator_well | analysis_value_well | primary_denominator_hosp | ... | county | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | num_hospitals | num_hospitals_open | mean_beds | num_beds | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47033 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 25 | 1,000-4,999 | ... | Crockett | 0.0 | 14399 | Rural | low inadequate | 0.0 | 0.0 | 0.0 | NaN | 0.0 |
1 rows × 29 columns
import numpy as np
def add_hospitals_weak_link(data, category, num_hospitals = 5, return_mean = False, print_bool = True):
    df = data.copy()
    sorted_df = df.sort_values([category, 'population'], ascending = [True, False])
    # Put all of the new hospitals in the single worst-off county
    worst = sorted_df.index[0]
    df.loc[worst, 'num_hospitals_open'] = num_hospitals
    if print_bool:
        print("Baseline mean number of open hospitals per 100k before adding hospitals: " + str(np.round(data['num_hospitals_open'].mean(), 4)))
        print(f"Mean number of open hospitals per 100k after adding {num_hospitals} hospitals to {df.loc[worst, 'county']} County: "
              + str(np.round(df['num_hospitals_open'].mean(), 4)))
    return df['num_hospitals_open'].mean() if return_mean else None
add_hospitals_weak_link(disparities, category = 'pcp_per_100k')
Baseline mean number of open hospitals per 100k before adding hospitals: 1.6421
Mean number of open hospitals per 100k after adding 5 hospitals to Crockett County: 1.6947
import numpy as np
def add_hospitals_equitably(data, category, num_hospitals = 5, num_counties = 5, copy = False, print_bool = True, return_mean = False):
    df = data.copy()
    sorted_df = df.sort_values([category, 'population'], ascending = [True, False])
    # Spread the new hospitals evenly across the worst-off counties
    worst = sorted_df.index[0:num_counties]
    for i in worst:
        df.loc[i, 'num_hospitals_open'] = num_hospitals / num_counties
    if print_bool:
        print("Baseline mean number of open hospitals per 100k before adding hospitals: " + str(np.round(data['num_hospitals_open'].mean(), 4)))
        print(f"Mean number of open hospitals per 100k after adding {num_hospitals} hospitals to {df.loc[worst, 'county'].values} Counties: "
              + str(np.round(df['num_hospitals_open'].mean(), 4)))
    return df['num_hospitals_open'].mean() if return_mean else None
add_hospitals_equitably(disparities, category = 'pcp_per_100k')
Baseline mean number of open hospitals per 100k before adding hospitals: 1.6421
Mean number of open hospitals per 100k after adding 5 hospitals to ['Crockett' 'Grundy' 'Lake' 'Van Buren' 'Bledsoe'] Counties: 1.6842
add_hospitals_weak_link(disparities, category = 'pcp_per_100k', num_hospitals=10)
add_hospitals_equitably(disparities, category = 'pcp_per_100k', num_hospitals=10)
Baseline mean number of open hospitals per 100k before adding hospitals: 1.6421
Mean number of open hospitals per 100k after adding 10 hospitals to Crockett County: 1.7474
Baseline mean number of open hospitals per 100k before adding hospitals: 1.6421
Mean number of open hospitals per 100k after adding 10 hospitals to ['Crockett' 'Grundy' 'Lake' 'Van Buren' 'Bledsoe'] Counties: 1.7368
means_weak = []
means_equit = []
for i in range(20):
    means_weak.append(add_hospitals_weak_link(disparities, category = 'pcp_per_100k', num_hospitals = i, print_bool = False, return_mean = True))
    means_equit.append(add_hospitals_equitably(disparities, category = 'pcp_per_100k', num_hospitals = i, num_counties = 4, print_bool = False, return_mean = True))
print(means_weak)
print(means_equit)
[1.6421052631578947, 1.6526315789473685, 1.6631578947368422, 1.6736842105263159, 1.6842105263157894, 1.694736842105263, 1.7052631578947368, 1.7157894736842105, 1.7263157894736842, 1.736842105263158, 1.7473684210526317, 1.7578947368421052, 1.768421052631579, 1.7789473684210526, 1.7894736842105263, 1.8, 1.8105263157894738, 1.8210526315789475, 1.831578947368421, 1.8421052631578947]
[1.6421052631578947, 1.6526315789473685, 1.6631578947368422, 1.6736842105263159, 1.6842105263157894, 1.694736842105263, 1.7052631578947368, 1.7157894736842105, 1.7263157894736842, 1.736842105263158, 1.7473684210526317, 1.7578947368421052, 1.768421052631579, 1.7789473684210526, 1.7894736842105263, 1.8, 1.8105263157894738, 1.8210526315789475, 1.831578947368421, 1.8421052631578947]
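The two lists are identical because, with `num_counties=4`, the four lowest-access counties all start with zero open hospitals, so both strategies add the same total number of hospitals, and the statewide mean depends only on that total. A quick arithmetic check (toy counts, not the notebook's data):

```python
import numpy as np

counts = np.zeros(95)  # pretend 95 counties, none with open hospitals

# Weak-link: all 10 new hospitals go to the single worst county
weak = counts.copy()
weak[0] = 10

# Equitable: the same 10 hospitals spread over 4 counties
equit = counts.copy()
equit[:4] = 10 / 4

# Both shift the mean by 10 / 95, so the curves coincide
assert np.isclose(weak.mean(), equit.mean())
print(weak.mean())
```

The strategies would diverge only if some targeted county already had hospitals (as Bledsoe does in the five-county runs above), since the functions set rather than add to `num_hospitals_open`.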
Question 2: What is the relationship between areas without access and ER visits?
Do areas that lack access to healthcare tend to have higher rates of emergency department visits or hospitalizations? Is there an association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits?
Reordering disparities in order of least access:
disparities = disparities.sort_values(['pcp_per_100k','population'], ascending = [True,False]).reset_index()
disparities.columns
Index(['FIPS', 'year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
'primary_race', 'primary_eligibility', 'primary_denominator_well',
'analysis_value_well', 'primary_denominator_hosp',
'analysis_value_hosp', 'primary_denominator_emer',
'analysis_value_emer', 'laus_code', 'Name', 'Employed', 'Unemployed',
'unemployment_rate', 'FIPS.1', 'county', 'primary_care_physicians',
'population', 'urban', 'shadac_category', 'pcp_per_100k',
'num_hospitals', 'num_hospitals_open', 'mean_beds', 'num_beds'],
dtype='object')
from importlib import reload
reload(plt)
plt.figure()
sns.scatterplot(x = disparities.pcp_per_100k, y = disparities.analysis_value_emer, hue = disparities.shadac_category)
plt.ylabel("Emergency department visits per 1K")
plt.xlabel("PCP per 100k")
Text(0.5, 0, 'PCP per 100k')

plt.figure()
sns.scatterplot(x = counties['Median Household Income'], y = counties.analysis_value_emer, hue = counties.shadac_category)
plt.ylabel("Emergency Visits per 1K")
plt.xlabel("Median Household Income")
Text(0.5, 0, 'Median Household Income')

sns.boxplot(x = counties.urban, y = counties.analysis_value_emer)
plt.ylabel("Emergency Department Visits per 1K")
Text(0, 0.5, 'Emergency Department Visits per 1K')

Association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits
plt.figure()
sns.lmplot(data = disparities, y = "analysis_value_hosp", x = "analysis_value_well")
plt.ylabel("Hospitalizations per 1K")
plt.xlabel("Percentage annual wellness visits")
Text(0.5, 21.70625000000002, 'Percentage annual wellness visits')
<Figure size 432x288 with 0 Axes>

# Build the comparison frame from counties so income, shadac, and the rate
# columns stay row-aligned (disparities was re-sorted above, so pasting its
# columns next to counties['Median Household Income'] would scramble the rows)
rates = counties[['analysis_value_emer', 'analysis_value_hosp', 'analysis_value_well',
                  'pcp_per_100k', 'shadac_category', 'Median Household Income']].copy()
rates.columns = ['Emergency Visits', 'Hospitalizations', 'Wellness Visits',
                 'pcp', 'shadac', 'Income']
sns.pairplot(rates, hue = 'shadac', corner = True)
<seaborn.axisgrid.PairGrid at 0x1a5eb4190>

rates[['Emergency Visits', 'Hospitalizations', 'Wellness Visits', 'pcp']] = rates[['Emergency Visits', 'Hospitalizations', 'Wellness Visits', 'pcp']].astype(float)
rates.drop('shadac', axis = 1)
Emergency Visits | Hospitalizations | Wellness Visits | pcp | Income | |
---|---|---|---|---|---|
0 | 622 | 289 | 25 | 0.0 | 47508 |
1 | 651 | 285 | 34 | 0.0 | 38879 |
2 | 1000 | 401 | 20 | 0.0 | 60874 |
3 | 759 | 304 | 31 | 0.0 | 57196 |
4 | 750 | 313 | 19 | 6.740361 | 46565 |
... | ... | ... | ... | ... | ... |
90 | 559 | 234 | 56 | 112.772823 | 39859 |
91 | 584 | 261 | 52 | 126.711238 | 50201 |
92 | 666 | 255 | 43 | 132.138284 | 50427 |
93 | 469 | 201 | 48 | 149.962953 | 50819 |
94 | 504 | 256 | 57 | 176.831892 | 41968 |
95 rows × 5 columns
rates.drop('shadac', axis = 1).astype(float).corr().style.background_gradient('coolwarm')
Emergency Visits | Hospitalizations | Wellness Visits | pcp | Income | |
---|---|---|---|---|---|
Emergency Visits | 1.000000 | 0.599412 | -0.480607 | -0.396181 | 0.153085 |
Hospitalizations | 0.599412 | 1.000000 | -0.480732 | -0.308934 | 0.126632 |
Wellness Visits | -0.480607 | -0.480732 | 1.000000 | 0.488941 | -0.073377 |
pcp | -0.396181 | -0.308934 | 0.488941 | 1.000000 | -0.080277 |
Income | 0.153085 | 0.126632 | -0.073377 | -0.080277 | 1.000000 |
counties.to_csv('../data/counties_full.csv')
disparities.to_csv('../data/disparities.csv')
Question 3: Which subpopulations need more attention?
Finally, TN Med Helper is trying to identify specific subpopulations to focus more attention on. Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year. Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year. Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
disparities = pd.read_csv('../data/disparities.csv',header = 0, index_col = 0)
disparities.head()
FIPS | year | primary_sex | primary_age | primary_dual | state | primary_race | primary_eligibility | primary_denominator_well | analysis_value_well | ... | county | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | num_hospitals | num_hospitals_open | mean_beds | num_beds | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 47033 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 25 | ... | Crockett | 0.0 | 14399 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
1 | 47061 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 34 | ... | Grundy | 0.0 | 13344 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
2 | 47095 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 20 | ... | Lake | 0.0 | 7401 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
3 | 47175 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 31 | ... | Van Buren | 0.0 | 5760 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
4 | 47007 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 19 | ... | Bledsoe | 1.0 | 14836 | Rural | low inadequate | 6.740361 | 1.0 | 1.0 | 25.0 | 25.0 |
5 rows × 30 columns
The file brfss.csv contains a subset of the responses and variables from the 2019 Behavioral Risk Factor Surveillance System (BRFSS). This dataset can be downloaded using this link: https://drive.google.com/file/d/1acJKmT2aFf2nZl_VYLE897yx0LPNajoY/view?usp=sharing.
A detailed Codebook can be found here.
Our target variable is the CHECKUP1 column, which contains the responses to the question “About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]” Specifically, we want to try and predict if someone gives an answer besides “Within past year (anytime less than 12 months ago)”.
brfss = pd.read_csv('../data/brfss.csv', header = 0)
brfss.head()
GENHLTH | HLTHPLN1 | PERSDOC2 | MEDCOST | CHECKUP1 | _RFHYPE5 | TOLDHI2 | CVDINFR4 | CVDCRHD4 | CVDSTRK3 | ... | EXERANY2 | _METSTAT | _URBSTAT | _IMPRACE | _RFBMI5 | _RFSMOK3 | _RFBING5 | _RFDRHV7 | _TOTINDA | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Good | Yes | Yes, only one | No | Within past year (anytime less than 12 months ... | Yes | Yes | No | No | No | ... | No | Metropolitan counties (_URBNRRL = 1,2,3,4) | Urban counties (_URBNRRL = 1,2,3,4,5) | Black, Non-Hispanic | Yes | No | No | No | No physical activity or exercise in last 30 days | True |
1 | Fair | Yes | Yes, only one | No | Within past year (anytime less than 12 months ... | No | No | No | No | No | ... | Yes | Metropolitan counties (_URBNRRL = 1,2,3,4) | Urban counties (_URBNRRL = 1,2,3,4,5) | White, Non-Hispanic | No | No | No | No | Had physical activity or exercise | True |
2 | Good | Yes | More than one | No | Within past year (anytime less than 12 months ... | Yes | No | No | No | No | ... | Yes | Metropolitan counties (_URBNRRL = 1,2,3,4) | Urban counties (_URBNRRL = 1,2,3,4,5) | Black, Non-Hispanic | Yes | No | No | No | Had physical activity or exercise | True |
3 | Very good | Yes | Yes, only one | No | Within past year (anytime less than 12 months ... | No | No | No | No | No | ... | Yes | Nonmetropolitan counties (_URBNRRL = 5,6) | Rural counties (_URBNRRL = 6) | White, Non-Hispanic | Yes | Yes | No | No | Had physical activity or exercise | True |
4 | Poor | Yes | Yes, only one | No | Within past year (anytime less than 12 months ... | No | Yes | No | No | No | ... | No | Metropolitan counties (_URBNRRL = 1,2,3,4) | Urban counties (_URBNRRL = 1,2,3,4,5) | White, Non-Hispanic | No | No | No | No | No physical activity or exercise in last 30 days | True |
5 rows × 40 columns
First, create a new column, “target”, by converting this to a binary outcome. After you do this, drop the CHECKUP1 column from your dataframe so that you don’t accidentally make predictions based off of it.
set(brfss.CHECKUP1)
{'5 or more years ago',
'Never',
'Within past 2 years (1 year but less than 2 years ago)',
'Within past 5 years (2 years but less than 5 years ago)',
'Within past year (anytime less than 12 months ago)'}
target = []
for x in brfss.CHECKUP1:
    if x == 'Within past year (anytime less than 12 months ago)':
        target.append(True)
    else:
        target.append(False)
brfss['target'] = target
brfss = brfss.drop('CHECKUP1', axis = 1)
brfss.target.value_counts()
True 215875
False 46174
Name: target, dtype: int64
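The loop above reduces to a single vectorized comparison, and the value counts show a heavy class imbalance (roughly 82% answered "within past year"), which sets the majority-class baseline any model must beat. A minimal sketch on made-up toy rows (the response wording matches the BRFSS codebook):

```python
import pandas as pd

# Toy stand-in for the BRFSS responses
brfss = pd.DataFrame({'CHECKUP1': [
    'Within past year (anytime less than 12 months ago)',
    'Never',
    '5 or more years ago',
    'Within past year (anytime less than 12 months ago)',
]})

# One comparison replaces the explicit loop
brfss['target'] = brfss.CHECKUP1 == 'Within past year (anytime less than 12 months ago)'
brfss = brfss.drop('CHECKUP1', axis = 1)
print(brfss.target.value_counts())
```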
Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year.
Then, experiment with making a logistic regression model to predict the target variable using one or more of the other columns. Note that you will need to convert the predictor columns into dummy variables prior to fitting a model. What do you find?
- GENHLTH: Would you say that in general your health is
- HLTHPLN1: Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
- PERSDOC2: Do you have one person you think of as your personal doctor or health care provider?
- MEDCOST: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
- _RFHYPE5: Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional
- TOLDHI2: Have you ever been told by a doctor, nurse or other health professional that your blood cholesterol is high?
- CVDINFR4: (Ever told) you had a heart attack, also called a myocardial infarction?
- CVDCRHD4: (Ever told) you had angina or coronary heart disease?
- CVDSTRK3: (Ever told) you had a stroke.
- ASTHMA3: (Ever told) you had asthma?
- CHCSCNCR: (Ever told) you had skin cancer?
- CHCOCNCR: (Ever told) you had other cancer?
- CHCCOPD2: (Ever told) you had COPD?
- ADDEPEV3: (Ever told) (you had) a depressive disorder
- CHCKDNY2: were you ever told you had kidney disease?
- DIABETE4: (Ever told) diabetes?
- HAVARTH4: Arthritis
- MARITAL: Marital status
- EDUCA: What is the highest grade or year of school you completed?
- RENTHOM1: Own or rent home
- VETERAN3: Veteran?
- EMPLOY1: Employment
- INCOME2: Income categories
- DEAF
- BLIND
- DECIDE: Trouble concentrating/deciding
- DIFFWALK: Difficulty walking
- DIFFDRES: Difficulty dressing
- DIFFALON: Difficulty errands alone
- EXERANY2: Exercise in past month
- _METSTAT: Metropolitan Status
- _URBSTAT: Urban status
- _IMPRACE: Imputed race/ethnicity value
- _RFBMI5: Adults who have a body mass index greater than 25.00 (Overweight or Obese)
- _RFSMOK3: Adults who are current smokers
- _RFBING5: Binge drinkers
- _RFDRHV7: Heavy drinkers
- _TOTINDA: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
- target
We only want categorical variables for which we have county-level information, so we need to look at the County Health Rankings data to figure out which of these variables we have percentages for. For example, we can use _IMPRACE because we have percentages of races by TN county.
In the County Health Rankings Data:
- GENHLTH == “Poor or fair health” (percentage)
- HLTHPLN1 == “% Uninsured”
- PERSDOC2 == ratio primary care physicians?
- DIABETES4 == “diabetes prevalence”
- _RFBING5 & _RFDRHV7 == Binge and heavy drinkers == “Excessive drinking”
- RENTHOM1 == “Homeownership”
- _IMPRACE = race %
- _URBSTAT = ‘rural’
- MARITAL ~ % single-parent households
- EDUCA == “High school graduation rate”, “% some college”
- EMPLOY1 = % unemployed
- INCOME2 = household income
- _TOTINDA ~ % physically inactive
- _RFBMI5 BMI > 25, % adults with obesity == BMI > 30…
- _RFSMOK3 == “% smokers”
- ADDEPEV3 ~ frequent mental distress
categorical_variables = ['GENHLTH', 'HLTHPLN1', 'PERSDOC2', 'DIABETE4','_RFBING5','_RFDRHV7','RENTHOM1','_IMPRACE','_URBSTAT','EDUCA','EMPLOY1','INCOME2','_TOTINDA','_RFSMOK3']
predict = pd.get_dummies(brfss, columns = categorical_variables)
predict.head()
MEDCOST | _RFHYPE5 | TOLDHI2 | CVDINFR4 | CVDCRHD4 | CVDSTRK3 | ASTHMA3 | CHCSCNCR | CHCOCNCR | CHCCOPD2 | ... | INCOME2_Less than $15,000 ($10,000 to less than $15,000) | INCOME2_Less than $20,000 ($15,000 to less than $20,000) | INCOME2_Less than $25,000 ($20,000 to less than $25,000) | INCOME2_Less than $35,000 ($25,000 to less than $35,000) | INCOME2_Less than $50,000 ($35,000 to less than $50,000) | INCOME2_Less than $75,000 ($50,000 to less than $75,000) | _TOTINDA_Had physical activity or exercise | _TOTINDA_No physical activity or exercise in last 30 days | _RFSMOK3_No | _RFSMOK3_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | No | Yes | Yes | No | No | No | No | No | No | No | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | No | No | No | No | No | No | No | No | No | No | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | No | Yes | No | No | No | No | No | No | No | No | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
3 | No | No | No | No | No | No | Yes | No | No | Yes | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
4 | No | No | Yes | No | No | No | Yes | No | No | Yes | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
5 rows × 80 columns
predict = predict.drop(predict.columns[0:24], axis = 1)
First, we need to get these Boolean variables to match up with our county info, so that when we use the model later we can use the same variables.
- GENHLTH:
- Change GENHLTH columns to “GENHLTH_Poor_or_Fair”
- Matched county % poor or fair health in Ranked Measure data
- HLTHPLN1
- Matches inverse of county % uninsured in Additional Measure data
- DIABETE4
- Change DIABETE4 columns to combine Yes’s and No’s
- Matches % adults with diabetes in Additional Measure data
- _RFBING5 and _RFDRHV7
- Combine so binge and heavy drinking are one category
- Matches excessive drinking in Ranked Measure data
- RENTHOM1
- Own == 1, else == 0
- Matches % homeowners
- _IMPRACE
- Should be able to keep the same
- Matches demographic %s in Additional Measure Data
- _URBSTAT
- replace with Boolean for rural
- matches inverse of % rural
- EDUCA
- “Grade 12 or GED (High school graduate)”
- matches “High school graduation rate” in ranked measure data
- “College 1 year to 3 years (Some college or technical school)” or “College 4 years or more (College graduate)” == % some college
- EMPLOY1
- “Out of work for 1 year or more” or “Out of work for less than 1 year” or “Unable to work” == % unemployment rate in Ranked measure data
- INCOME2:
- group “Household Income (AIAN)” into same groups as above
- _TOTINDA
- only “No physical activity…”
- % physically inactive
- _RFSMOK3_Yes
- % smokers
X = predict.copy()
Rural
X['Rural'] = [{ "Urban counties (_URBNRRL = 1,2,3,4,5)":0,"Rural counties (_URBNRRL = 6)":1}[i] for i in brfss._URBSTAT]
X = X.drop([i for i in X.columns if "URBSTAT" in str(i)], axis = 1)
Diabetes
X.loc[(X['DIABETE4_Yes, but female told only during pregnancy'] == 1) |
(X["DIABETE4_Yes"]==1), 'Diabetes'] = 1
X.loc[(X['DIABETE4_Yes, but female told only during pregnancy'] == 0) &
(X["DIABETE4_Yes"]==0), 'Diabetes'] = 0
X = X.drop([i for i in X.columns if "DIABETE4" in str(i)], axis = 1)
GENHLTH
X.loc[(X['GENHLTH_Fair'] == 1) |
(X["GENHLTH_Poor"]==1), 'General_health'] = 1
X.loc[(X['GENHLTH_Fair'] == 0) &
(X["GENHLTH_Poor"]==0), 'General_health'] = 0
X = X.drop([i for i in X.columns if "GENHLTH" in str(i)], axis = 1)
HLTHPLN1
X['Health_plan'] = X.HLTHPLN1_Yes
X = X.drop([i for i in X.columns if "HLTHPLN1" in str(i)], axis = 1)
_RFBING5 and _RFDRHV7
X.loc[(X._RFDRHV7_Yes == 1) |
(X._RFBING5_Yes==1), 'Excessive_drinking'] = 1
X.loc[(X['_RFDRHV7_Yes'] == 0) &
(X["_RFBING5_Yes"]==0), 'Excessive_drinking'] = 0
X = X.drop([i for i in X.columns if "_RFDRHV" in str(i)], axis = 1)
X = X.drop([i for i in X.columns if "_RFBING" in str(i)], axis = 1)
RENTHOM1
X['Owns_home'] = X['RENTHOM1_Own']
X = X.drop([i for i in X.columns if "RENTHOM1" in str(i)], axis = 1)
EDUCA
X['High_school_grad'] = X['EDUCA_Grade 12 or GED (High school graduate)']
X.loc[(X['EDUCA_College 1 year to 3 years (Some college or technical school)'] == 1) |
(X['EDUCA_College 4 years or more (College graduate)']==1), 'College'] = 1
X.loc[(X['EDUCA_College 1 year to 3 years (Some college or technical school)'] == 0) &
(X["EDUCA_College 4 years or more (College graduate)"]==0), 'College'] = 0
X = X.drop([i for i in X.columns if "EDUCA" in str(i)], axis = 1)
EMPLOY1
X.loc[(X['EMPLOY1_Out of work for 1 year or more'] == 1) |
(X['EMPLOY1_Out of work for less than 1 year']==1) |
(X['EMPLOY1_Unable to work']==1), 'Unemployed'] = 1
X.loc[(X['EMPLOY1_Out of work for 1 year or more'] == 0) &
(X['EMPLOY1_Out of work for less than 1 year']==0) &
(X['EMPLOY1_Unable to work']==0), 'Unemployed'] = 0
X = X.drop([i for i in X.columns if "EMPLOY1" in str(i)], axis = 1)
_TOTINDA
X['Inactivity'] = X['_TOTINDA_No physical activity or exercise in last 30 days']
X = X.drop([i for i in X.columns if "_TOTINDA" in str(i)], axis = 1)
_RFSMOK3_Yes
X['Smoker'] = X._RFSMOK3_Yes
X = X.drop([i for i in X.columns if "_RFSMOK3" in str(i)], axis = 1)
X = X.drop([i for i in X.columns if "PERSDOC" in str(i)], axis = 1)
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262049 entries, 0 to 262048
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 target 262049 non-null bool
1 _IMPRACE_American Indian/Alaskan Native, Non-Hispanic 262049 non-null uint8
2 _IMPRACE_Asian, Non-Hispanic 262049 non-null uint8
3 _IMPRACE_Black, Non-Hispanic 262049 non-null uint8
4 _IMPRACE_Hispanic 262049 non-null uint8
5 _IMPRACE_Other race, Non-Hispanic 262049 non-null uint8
6 _IMPRACE_White, Non-Hispanic 262049 non-null uint8
7 INCOME2_$75,000 or more 262049 non-null uint8
8 INCOME2_Less than $10,000 262049 non-null uint8
9 INCOME2_Less than $15,000 ($10,000 to less than $15,000) 262049 non-null uint8
10 INCOME2_Less than $20,000 ($15,000 to less than $20,000) 262049 non-null uint8
11 INCOME2_Less than $25,000 ($20,000 to less than $25,000) 262049 non-null uint8
12 INCOME2_Less than $35,000 ($25,000 to less than $35,000) 262049 non-null uint8
13 INCOME2_Less than $50,000 ($35,000 to less than $50,000) 262049 non-null uint8
14 INCOME2_Less than $75,000 ($50,000 to less than $75,000) 262049 non-null uint8
15 Rural 262049 non-null int64
16 Diabetes 262049 non-null float64
17 General_health 262049 non-null float64
18 Health_plan 262049 non-null uint8
19 Excessive_drinking 262049 non-null float64
20 Owns_home 262049 non-null uint8
21 High_school_grad 262049 non-null uint8
22 College 262049 non-null float64
23 Unemployed 262049 non-null float64
24 Inactivity 262049 non-null uint8
25 Smoker 262049 non-null uint8
dtypes: bool(1), float64(5), int64(1), uint8(19)
memory usage: 17.0 MB
X.to_csv("../data/predictors.csv")
y = X.pop('target')
The main objective is to have a model which makes good predictions on unseen data. Therefore, in order to evaluate how good a model is, it is necessary to set aside some data as a test set for evaluation purposes. This can be accomplished using the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify = y, # Keep the same proportions of the target in the training and test data
test_size = 0.25,
random_state = 321)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression() # Create a logistic regression model
logreg.fit(X_train, y_train) # Fit it to the training data
LogisticRegression()
To understand the model, we can look at its coefficients.
logreg.intercept_
array([-0.06956167])
logreg.coef_
array([[-0.15000703, -0.12631886, 0.53903011, -0.1734734 , -0.12965132,
-0.02906713, -0.17897394, 0.02512163, 0.13639016, 0.09111363,
0.06724727, -0.03598943, -0.04823376, -0.12616319, -0.09815453,
1.19410076, 0.44202482, 1.4508977 , -0.35476091, 0.4756881 ,
0.01472026, -0.00793255, 0.15622678, -0.04213603, -0.36442622]])
y_pred_prob = logreg.predict_proba(X_test)[:,1]
i = 120
print('Patient Information:\n{}'.format(X_test.iloc[i]))
print('---------------------------------')
print('Predicted Probability of Checkup in Last Year: {}'.format(y_pred_prob[i]))
print('Actual: {}'.format(y_test.iloc[i]))
Patient Information:
_IMPRACE_American Indian/Alaskan Native, Non-Hispanic 0.0
_IMPRACE_Asian, Non-Hispanic 0.0
_IMPRACE_Black, Non-Hispanic 0.0
_IMPRACE_Hispanic 0.0
_IMPRACE_Other race, Non-Hispanic 0.0
_IMPRACE_White, Non-Hispanic 1.0
INCOME2_$75,000 or more 1.0
INCOME2_Less than $10,000 0.0
INCOME2_Less than $15,000 ($10,000 to less than $15,000) 0.0
INCOME2_Less than $20,000 ($15,000 to less than $20,000) 0.0
INCOME2_Less than $25,000 ($20,000 to less than $25,000) 0.0
INCOME2_Less than $35,000 ($25,000 to less than $35,000) 0.0
INCOME2_Less than $50,000 ($35,000 to less than $50,000) 0.0
INCOME2_Less than $75,000 ($50,000 to less than $75,000) 0.0
Rural 0.0
Diabetes 0.0
General_health 0.0
Health_plan 1.0
Excessive_drinking 0.0
Owns_home 1.0
High_school_grad 0.0
College 1.0
Unemployed 1.0
Inactivity 0.0
Smoker 0.0
Name: 232982, dtype: float64
---------------------------------
Predicted Probability of Checkup in Last Year: 0.857817177035579
Actual: True
One way to analyze your model is to look at the receiver operating characteristic (ROC) curve. This shows how the true positive rate and false positive rate change as the prediction threshold changes.
The area under this curve (AUC) can be interpreted as how likely the model is to assign a higher probability to a positive observation than to a negative one.
from sklearn.metrics import roc_auc_score, roc_curve
fp_rate, tp_rate, thresholds = roc_curve(y_test == 1, y_pred_prob)
plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Checkup Prediction')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

AUC - percentage of the ROC plot that is under the curve
A perfect model would have AUC = 1.
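This pairwise interpretation can be checked on a toy example (made-up labels and scores, not our model’s output):

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities
y_true = [1, 1, 0, 0, 1]
y_score = [0.9, 0.6, 0.4, 0.7, 0.8]

# AUC equals the fraction of (positive, negative) pairs where the
# positive observation gets the higher score (ties count as half)
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairwise = sum((p > n) + 0.5 * (p == n)
               for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(pairwise, roc_auc_score(y_true, y_score))  # both ≈ 0.833
```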
roc_auc_score(y_test, y_pred_prob)
0.6789666785879891
Analyzing Predictions
To generate predictions, you can use the predict
method of your model.
from sklearn.metrics import confusion_matrix, classification_report
from cm import plot_confusion_matrix
y_pred = logreg.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[ 1052, 10492],
[ 772, 53197]])
print(classification_report(y_test, y_pred))
precision recall f1-score support
False 0.58 0.09 0.16 11544
True 0.84 0.99 0.90 53969
accuracy 0.83 65513
macro avg 0.71 0.54 0.53 65513
weighted avg 0.79 0.83 0.77 65513
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'])

There are a number of metrics you can use to evaluate your model.
Accuracy: the total proportion of predictions which are correct.
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

Sensitivity/True Positive Rate: The proportion of true positives (in our case, people who had a checkup in the last year) that are identified as such.
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'sensitivity')

Specificity/True Negative Rate: The proportion of true negatives (in our case, people who did not have a checkup in the last year) that are identified as such.
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'specificity')

Precision: The proportion of predicted positives that are actually positive (had a checkup in the last year).
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'precision')

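For reference, all four metrics can be computed by hand from the confusion matrix counts printed above (rows are actual, columns are predicted, ordered False then True):

```python
# Counts from the confusion matrix above
# (rows = actual [False, True], columns = predicted [False, True])
tn, fp = 1052, 10492
fn, tp = 772, 53197

accuracy = (tp + tn) / (tp + tn + fp + fn)   # proportion correct overall
sensitivity = tp / (tp + fn)                 # true positive rate / recall
specificity = tn / (tn + fp)                 # true negative rate
precision = tp / (tp + fp)                   # correct among predicted positives

print(round(accuracy, 2), round(sensitivity, 2),
      round(specificity, 2), round(precision, 2))  # 0.83 0.99 0.09 0.84
```

These match the classification_report output: the model almost always predicts True, so sensitivity is high but specificity is very low.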
Our model uses a large number of predictors, so it can take longer for the coefficients to converge. Let’s refit with an increased maximum number of iterations.
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(X_train, y_train)
LogisticRegression(max_iter=10000)
y_test
149338 True
201150 True
63901 True
17995 False
192704 True
...
218617 True
181851 True
54798 True
177920 False
22276 True
Name: target, Length: 65513, dtype: bool
Let’s take a look at the coefficients.
Caution: Our variables are on vastly different scales, so do not interpret the value of a coefficient as the importance.
coefficients = pd.DataFrame({
'variable': X.columns,
'coefficient': logreg.coef_[0]
})
importances = pd.DataFrame({'variable': coefficients.variable,
'importance': coefficients.coefficient}).sort_values('importance', ascending = False)
fig, ax = plt.subplots(figsize = (7,8))
sns.barplot(data = importances,
x = 'importance', y = 'variable', ax = ax, edgecolor = 'black')
plt.title('Logistic Regression Coefficients');

fig, ax = plt.subplots(figsize = (7,5))
sns.barplot(data = coefficients,
x = 'coefficient',
y = 'variable',
ax = ax,
edgecolor = 'black')
plt.title('Logistic Regression Coefficients')
ymin, ymax = plt.ylim()
plt.vlines(x = 0, ymin = ymin, ymax = ymax);

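If you do want comparable coefficient magnitudes, one option is to standardize the predictors before fitting, for example with scikit-learn’s StandardScaler. A minimal sketch on synthetic data (not the BRFSS variables):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(321)
X_demo = rng.normal(size=(500, 3)) * [1, 100, 0.01]   # wildly different scales
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 100 > 0).astype(int)

# After standardizing, each coefficient reflects a one-standard-deviation
# change in its predictor, so magnitudes become comparable
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(max_iter = 1000).fit(X_scaled, y_demo)
print(model.coef_[0])
```

On the standardized data, the two informative features get large coefficients while the noise feature stays near zero, regardless of the original scales.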
Try using regularization or a tree-based model to improve your model’s performance on the BRFSS dataset.
from sklearn.tree import DecisionTreeClassifier
A decision tree model is fit in much the same way as a logistic regression model.
tree = DecisionTreeClassifier(random_state = 321).fit(X_train, y_train)
First, let’s see how it does on the training data.
plot_confusion_matrix(y_train, tree.predict(X_train), labels = ['No', 'Yes'], metric = 'accuracy')

y_train.value_counts()
True 161906
False 34630
Name: target, dtype: int64
Since decision trees are so flexible, they will often perfectly predict the training data. Remember though, that what we care about is how well the model generalizes.
y_pred = tree.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

roc_auc_score(y_test, tree.predict_proba(X_test)[:,1])
0.6345258214120431
We do see a drop in model performance on the test data. This is a sign of overfitting.
To correct for this problem, we can take an ensemble approach, which means that we will build many decision trees on subsets of the features and data and then average the predictions of all of the trees. This will force our model to try and find more general patterns that will work on the test set.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state = 321)
forest.fit(X_train, y_train)
RandomForestClassifier(random_state=321)
y_pred = forest.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

roc_auc_score(y_test, forest.predict_proba(X_test)[:,1])
0.6501432873215749
We see some improvement over using a single tree, but we could do better. Random forests have a lot of hyperparameters that can be tuned to improve our model. Here are a few of these parameters:
- n_estimators: Number of decision trees to train. The default is 100 in current versions of scikit-learn. More trees = less variance, but slower to train and predict.
- max_depth: Maximum depth (number of splits). By default, there is no max depth.
- min_samples_leaf: Minimum number of samples per leaf. Setting this higher keeps the decision trees from paying too much attention to any single data point.
These parameters can be tuned to try to improve the model that you get, and there are ways to tune them automatically. See, for example, sklearn’s GridSearchCV or RandomizedSearchCV.
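A minimal sketch of what such a grid search might look like, using a small synthetic dataset so it runs quickly (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples = 300, random_state = 321)

# Try every combination of these values with 3-fold cross-validation
param_grid = {'n_estimators': [50, 100],
              'max_depth': [3, 5],
              'min_samples_leaf': [1, 5]}
search = GridSearchCV(RandomForestClassifier(random_state = 321),
                      param_grid, cv = 3, scoring = 'roc_auc')
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```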
forest = RandomForestClassifier(n_estimators = 1000, max_depth = 5, min_samples_leaf = 5, random_state = 321)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, min_samples_leaf=5, n_estimators=1000,
random_state=321)
y_pred = forest.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'sensitivity')

Specificity/True Negative Rate: The proportion of true negatives (in our case, people who did not have a checkup in the last year) that are identified as such.
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'specificity')

roc_auc_score(y_test, forest.predict_proba(X_test)[:,1])
0.678213761501158
fp_rate, tp_rate, thresholds = roc_curve(y_test == 1, forest.predict_proba(X_test)[:,1])
plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Checkup Prediction')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

A nice perk of using random forest models is that we can see which features are the most important in making predictions.
Caution: Feature importances tell which features the model is using, but not how it is using those features.
importances = pd.DataFrame({'variable': X.columns,
'importance': forest.feature_importances_}).sort_values('importance', ascending = False)
fig, ax = plt.subplots(figsize = (7,8))
sns.barplot(data = importances,
x = 'importance', y = 'variable', ax = ax, edgecolor = 'black')
plt.title('Random Forest Feature Importance');

Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year.
df1 = pd.read_csv('../data/County_Health_Rankings/Ranked_measure_data_TN.csv', header = 0, index_col = 0)
df1.head()
State | County | Deaths | Years of Potential Life Lost Rate | 95% CI - Low | 95% CI - High | Quartile | YPLL Rate (AIAN) | YPLL Rate (AIAN) 95% CI - Low | YPLL Rate (AIAN) 95% CI - High | ... | % Drive Alone (Hispanic) 95% CI - Low | % Drive Alone (Hispanic) 95% CI - High | % Drive Alone (White) | % Drive Alone (White) 95% CI - Low | % Drive Alone (White) 95% CI - High | # Workers who Drive Alone | % Long Commute - Drives Alone | 95% CI - Low.18 | 95% CI - High.18 | Quartile.34 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47000 | Tennessee | NaN | 106691 | 9285 | 9202 | 9368 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 3003226 | 35 | 34 | 35 | NaN |
47001 | Tennessee | Anderson | 1369 | 10009 | 9172 | 10846 | 2.0 | NaN | NaN | NaN | ... | 63.0 | 96.0 | 90.0 | 89.0 | 92.0 | 32206 | 36 | 32 | 39 | 2.0 |
47003 | Tennessee | Bedford | 859 | 10294 | 9268 | 11319 | 2.0 | NaN | NaN | NaN | ... | 54.0 | 74.0 | 80.0 | 76.0 | 84.0 | 21250 | 35 | 31 | 38 | 2.0 |
47005 | Tennessee | Benton | 428 | 14479 | 12233 | 16726 | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 5448 | 28 | 23 | 33 | 1.0 |
47007 | Tennessee | Bledsoe | 229 | 8126 | 6427 | 9824 | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 5603 | 53 | 43 | 62 | 4.0 |
5 rows × 239 columns
df2 = pd.read_csv('../data/County_Health_Rankings/Addition_measure_data_TN.csv', header = 0, index_col = 0)
df2.head()
State | County | Life Expectancy | 95% CI - Low | 95% CI - High | Life Expectancy (AIAN) | Life Expectancy (AIAN) 95% CI - Low | Life Expectancy (AIAN) 95% CI - High | Life Expectancy (Asian) | Life Expectancy (Asian) 95% CI - Low | ... | % Hispanic | # Non-Hispanic White | % Non-Hispanic White | # Not Proficient in English | % Not Proficient in English | 95% CI - Low.19 | 95% CI - High.19 | % Female | # Rural | % Rural | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47000 | Tennessee | NaN | 76.0 | 75.9 | 76.1 | NaN | NaN | NaN | NaN | NaN | ... | 5.6 | 4989758 | 73.7 | 94554 | 2 | 1.0 | 2.0 | 51.2 | 2132860 | 33.6 |
47001 | Tennessee | Anderson | 75.5 | 74.9 | 76.2 | NaN | NaN | NaN | NaN | NaN | ... | 3.1 | 68116 | 89.1 | 619 | 1 | 0.0 | 1.0 | 51.3 | 26041 | 34.7 |
47003 | Tennessee | Bedford | 74.3 | 73.5 | 75.1 | NaN | NaN | NaN | NaN | NaN | ... | 12.6 | 37556 | 76.6 | 1474 | 3 | 2.0 | 4.0 | 50.9 | 25053 | 55.6 |
47005 | Tennessee | Benton | 71.7 | 70.2 | 73.2 | NaN | NaN | NaN | NaN | NaN | ... | 2.5 | 14970 | 92.5 | 25 | 0 | 0.0 | 1.0 | 51.0 | 12937 | 78.5 |
47007 | Tennessee | Bledsoe | 78.0 | 76.4 | 79.6 | NaN | NaN | NaN | NaN | NaN | ... | 2.5 | 13038 | 88.4 | 245 | 2 | 0.0 | 3.0 | 41.1 | 12876 | 100.0 |
5 rows × 269 columns
First, we need to get these Boolean variables to match up with our county info, so that when we use the model later we can use the same variables.
- GENHLTH:
- Matched county % poor or fair health in Ranked Measure data
- HLTHPLN1
- Matches inverse of county % uninsured in Additional Measure data
- DIABETE4
- Matches % adults with diabetes in Additional Measure data
- _RFBING5 and _RFDRHV7
- Matches excessive drinking in Ranked Measure data
- RENTHOM1
- Matches % homeowners in Additional
- _IMPRACE
- Matches demographic %s in Additional Measure Data
- _URBSTAT
- matches inverse of % rural in additional
- EDUCA
- “Grade 12 or GED (High school graduate)”
- matches “High school graduation rate” in ranked measure data
- “College 1 year to 3 years (Some college or technical school)” or “College 4 years or more (College graduate)” == % some college
- EMPLOY1
- “Out of work for 1 year or more” or “Out of work for less than 1 year” or “Unable to work” == % unemployment rate in Ranked measure data
- INCOME2:
- group “Household Income (AIAN)” into same groups as above in additional
- _TOTINDA
- % physically inactive in ranked data
- _RFSMOK3_Yes
- % smokers in ranked
# [print(i) for i in df2 if "%" in str(i)]
df1 = df1[['% Fair or Poor Health', "% Excessive Drinking","High School Graduation Rate","% Some College","% Unemployed","% Physically Inactive","% Smokers"]]
df2 = df2[["% Uninsured","% Adults with Diabetes","% Homeowners","% Rural","Median Household Income",
"% Non-Hispanic White", "% Black","% American Indian & Alaska Native","% Asian","% Native Hawaiian/Other Pacific Islander","% Hispanic"]]
We now need to rename these columns to match the ones in X.
county_demo = df1.join(df2)
#everything is a percentage except Income, so divide by 100 and then fix income column
county_demo = county_demo/100
county_demo['Median Household Income'] *=100
county_demo
% Fair or Poor Health | % Excessive Drinking | High School Graduation Rate | % Some College | % Unemployed | % Physically Inactive | % Smokers | % Uninsured | % Adults with Diabetes | % Homeowners | % Rural | Median Household Income | % Non-Hispanic White | % Black | % American Indian & Alaska Native | % Asian | % Native Hawaiian/Other Pacific Islander | % Hispanic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | ||||||||||||||||||
47000 | 0.20 | 0.14 | 0.90 | 0.61 | 0.035 | 0.27 | 0.23 | 0.14 | 0.13 | 0.66 | 0.336 | 52366.0 | 0.737 | 0.167 | 0.005 | 0.019 | 0.001 | 0.056 |
47001 | 0.19 | 0.14 | 0.92 | 0.58 | 0.038 | 0.26 | 0.21 | 0.13 | 0.11 | 0.68 | 0.347 | 50672.0 | 0.891 | 0.039 | 0.005 | 0.015 | 0.001 | 0.031 |
47003 | 0.22 | 0.16 | 0.91 | 0.41 | 0.037 | 0.32 | 0.23 | 0.18 | 0.11 | 0.68 | 0.556 | 49860.0 | 0.766 | 0.076 | 0.011 | 0.011 | 0.002 | 0.126 |
47005 | 0.23 | 0.12 | 0.96 | 0.47 | 0.049 | 0.29 | 0.24 | 0.15 | 0.14 | 0.76 | 0.785 | 38940.0 | 0.925 | 0.024 | 0.006 | 0.006 | 0.000 | 0.025 |
47007 | 0.23 | 0.15 | 0.85 | 0.38 | 0.058 | 0.26 | 0.25 | 0.18 | 0.10 | 0.74 | 1.000 | 40195.0 | 0.884 | 0.071 | 0.006 | 0.003 | 0.000 | 0.025 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47181 | 0.24 | 0.14 | 0.96 | 0.40 | 0.049 | 0.41 | 0.25 | 0.14 | 0.15 | 0.81 | 1.000 | 38879.0 | 0.899 | 0.063 | 0.004 | 0.003 | 0.000 | 0.021 |
47183 | 0.22 | 0.14 | 0.92 | 0.56 | 0.042 | 0.34 | 0.22 | 0.12 | 0.11 | 0.66 | 0.670 | 42351.0 | 0.868 | 0.075 | 0.004 | 0.012 | 0.000 | 0.027 |
47185 | 0.25 | 0.13 | 0.94 | 0.49 | 0.037 | 0.23 | 0.23 | 0.14 | 0.22 | 0.81 | 0.782 | 42603.0 | 0.930 | 0.018 | 0.005 | 0.004 | 0.001 | 0.028 |
47187 | 0.14 | 0.16 | 0.96 | 0.87 | 0.025 | 0.18 | 0.15 | 0.08 | 0.08 | 0.81 | 0.194 | 115930.0 | 0.843 | 0.043 | 0.003 | 0.047 | 0.001 | 0.049 |
47189 | 0.17 | 0.17 | 0.96 | 0.69 | 0.028 | 0.21 | 0.18 | 0.10 | 0.11 | 0.77 | 0.385 | 76507.0 | 0.847 | 0.071 | 0.005 | 0.017 | 0.001 | 0.045 |
96 rows × 18 columns
county_demo.columns
Index(['% Fair or Poor Health', '% Excessive Drinking',
'High School Graduation Rate', '% Some College', '% Unemployed',
'% Physically Inactive', '% Smokers', '% Uninsured',
'% Adults with Diabetes', '% Homeowners', '% Rural',
'Median Household Income', '% Non-Hispanic White', '% Black',
'% American Indian & Alaska Native', '% Asian',
'% Native Hawaiian/Other Pacific Islander', '% Hispanic'],
dtype='object')
X.columns
Index(['_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
'_IMPRACE_Asian, Non-Hispanic', '_IMPRACE_Black, Non-Hispanic',
'_IMPRACE_Hispanic', '_IMPRACE_Other race, Non-Hispanic',
'_IMPRACE_White, Non-Hispanic', 'INCOME2_$75,000 or more',
'INCOME2_Less than $10,000',
'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
'INCOME2_Less than $75,000 ($50,000 to less than $75,000)', 'Rural',
'Diabetes', 'General_health', 'Health_plan', 'Excessive_drinking',
'Owns_home', 'High_school_grad', 'College', 'Unemployed', 'Inactivity',
'Smoker'],
dtype='object')
column_dict = {'% Fair or Poor Health':"General_health", '% Excessive Drinking':"Excessive_drinking",
'High School Graduation Rate':"High_school_grad", '% Some College':"College", '% Unemployed':"Unemployed",
'% Physically Inactive':"Inactivity", '% Smokers':"Smoker", '% Uninsured':"Health_plan",
'% Adults with Diabetes':"Diabetes", '% Homeowners':"Owns_home", '% Rural':"Rural",
'% Non-Hispanic White':"_IMPRACE_White, Non-Hispanic",
'% Black':'_IMPRACE_Black, Non-Hispanic',
'% American Indian & Alaska Native':'_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
'% Asian':"_IMPRACE_Asian, Non-Hispanic",
'% Native Hawaiian/Other Pacific Islander':"_IMPRACE_Other race, Non-Hispanic",
'% Hispanic':"_IMPRACE_Hispanic"}
for i in column_dict:
county_demo.columns = county_demo.columns.str.replace(i,column_dict[i])
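Since str.replace matches substrings, a somewhat safer alternative is DataFrame.rename, which only changes columns whose names match the dictionary keys exactly. A toy sketch:

```python
import pandas as pd

demo = pd.DataFrame({'% Smokers': [0.23], '% Rural': [0.34]})

# rename only touches exact column-name matches, never substrings
demo = demo.rename(columns = {'% Smokers': 'Smoker', '% Rural': 'Rural'})
print(list(demo.columns))  # → ['Smoker', 'Rural']
```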
county_demo
General_health | Excessive_drinking | High_school_grad | College | Unemployed | Inactivity | Smoker | Health_plan | Diabetes | Owns_home | Rural | Median Household Income | _IMPRACE_White, Non-Hispanic | _IMPRACE_Black, Non-Hispanic | _IMPRACE_American Indian/Alaskan Native, Non-Hispanic | _IMPRACE_Asian, Non-Hispanic | _IMPRACE_Other race, Non-Hispanic | _IMPRACE_Hispanic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | ||||||||||||||||||
47000 | 0.20 | 0.14 | 0.90 | 0.61 | 0.035 | 0.27 | 0.23 | 0.14 | 0.13 | 0.66 | 0.336 | 52366.0 | 0.737 | 0.167 | 0.005 | 0.019 | 0.001 | 0.056 |
47001 | 0.19 | 0.14 | 0.92 | 0.58 | 0.038 | 0.26 | 0.21 | 0.13 | 0.11 | 0.68 | 0.347 | 50672.0 | 0.891 | 0.039 | 0.005 | 0.015 | 0.001 | 0.031 |
47003 | 0.22 | 0.16 | 0.91 | 0.41 | 0.037 | 0.32 | 0.23 | 0.18 | 0.11 | 0.68 | 0.556 | 49860.0 | 0.766 | 0.076 | 0.011 | 0.011 | 0.002 | 0.126 |
47005 | 0.23 | 0.12 | 0.96 | 0.47 | 0.049 | 0.29 | 0.24 | 0.15 | 0.14 | 0.76 | 0.785 | 38940.0 | 0.925 | 0.024 | 0.006 | 0.006 | 0.000 | 0.025 |
47007 | 0.23 | 0.15 | 0.85 | 0.38 | 0.058 | 0.26 | 0.25 | 0.18 | 0.10 | 0.74 | 1.000 | 40195.0 | 0.884 | 0.071 | 0.006 | 0.003 | 0.000 | 0.025 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47181 | 0.24 | 0.14 | 0.96 | 0.40 | 0.049 | 0.41 | 0.25 | 0.14 | 0.15 | 0.81 | 1.000 | 38879.0 | 0.899 | 0.063 | 0.004 | 0.003 | 0.000 | 0.021 |
47183 | 0.22 | 0.14 | 0.92 | 0.56 | 0.042 | 0.34 | 0.22 | 0.12 | 0.11 | 0.66 | 0.670 | 42351.0 | 0.868 | 0.075 | 0.004 | 0.012 | 0.000 | 0.027 |
47185 | 0.25 | 0.13 | 0.94 | 0.49 | 0.037 | 0.23 | 0.23 | 0.14 | 0.22 | 0.81 | 0.782 | 42603.0 | 0.930 | 0.018 | 0.005 | 0.004 | 0.001 | 0.028 |
47187 | 0.14 | 0.16 | 0.96 | 0.87 | 0.025 | 0.18 | 0.15 | 0.08 | 0.08 | 0.81 | 0.194 | 115930.0 | 0.843 | 0.043 | 0.003 | 0.047 | 0.001 | 0.049 |
47189 | 0.17 | 0.17 | 0.96 | 0.69 | 0.028 | 0.21 | 0.18 | 0.10 | 0.11 | 0.77 | 0.385 | 76507.0 | 0.847 | 0.071 | 0.005 | 0.017 | 0.001 | 0.045 |
96 rows × 18 columns
Don’t forget, we need 1 − Health_plan, since % Uninsured is the inverse of having a health plan. All other “negative” variables, such as physical inactivity and unemployment, should already be accounted for; i.e., TRUE for Unemployed truly means unemployed.
county_demo['Health_plan'] = 1- county_demo['Health_plan']
Now we work on income categories.
income_categories = []
for i in county_demo['Median Household Income']:
if float(i) < 10000:
income_categories.append('INCOME2_Less than $10,000')
elif (i >= 10000) & (i < 15000):
income_categories.append('INCOME2_Less than $15,000 ($10,000 to less than $15,000)')
elif (i >= 15000) & (i < 20000):
income_categories.append('INCOME2_Less than $20,000 ($15,000 to less than $20,000)')
elif (i >= 20000) & (i < 25000):
income_categories.append('INCOME2_Less than $25,000 ($20,000 to less than $25,000)')
elif (i >= 25000) & (i < 35000):
income_categories.append('INCOME2_Less than $35,000 ($25,000 to less than $35,000)')
elif (i >= 35000) & (i < 50000):
income_categories.append('INCOME2_Less than $50,000 ($35,000 to less than $50,000)')
elif (i >= 50000) & (i < 75000):
income_categories.append('INCOME2_Less than $75,000 ($50,000 to less than $75,000)')
elif (i >= 75000):
income_categories.append('INCOME2_$75,000 or more')
county_demo['income_categories'] = income_categories
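The if/elif chain above can also be written with pd.cut, which bins values in one call. A sketch with made-up incomes and shortened stand-in labels (the real code keeps the full INCOME2 category names):

```python
import pandas as pd

incomes = pd.Series([38940.0, 52366.0, 115930.0])

# Bin edges mirror the INCOME2 brackets; labels here are shortened stand-ins
bins = [0, 10000, 15000, 20000, 25000, 35000, 50000, 75000, float('inf')]
labels = ['<$10k', '$10-15k', '$15-20k', '$20-25k',
          '$25-35k', '$35-50k', '$50-75k', '$75k+']
categories = pd.cut(incomes, bins = bins, labels = labels, right = False)
print(categories.tolist())  # → ['$35-50k', '$50-75k', '$75k+']
```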
county_demo = pd.get_dummies(county_demo, columns = ['income_categories'])
county_demo = county_demo.drop('Median Household Income', axis = 1)
county_demo['INCOME2_Less than $25,000 ($20,000 to less than $25,000)'] = 0
county_demo['INCOME2_Less than $20,000 ($15,000 to less than $20,000)'] = 0
county_demo['INCOME2_Less than $15,000 ($10,000 to less than $15,000)'] = 0
county_demo['INCOME2_Less than $10,000'] = 0
county_demo.columns = county_demo.columns.str.replace("income_categories_", "")
county_demo
General_health | Excessive_drinking | High_school_grad | College | Unemployed | Inactivity | Smoker | Health_plan | Diabetes | Owns_home | ... | _IMPRACE_Other race, Non-Hispanic | _IMPRACE_Hispanic | INCOME2_$75,000 or more | INCOME2_Less than $35,000 ($25,000 to less than $35,000) | INCOME2_Less than $50,000 ($35,000 to less than $50,000) | INCOME2_Less than $75,000 ($50,000 to less than $75,000) | INCOME2_Less than $25,000 ($20,000 to less than $25,000) | INCOME2_Less than $20,000 ($15,000 to less than $20,000) | INCOME2_Less than $15,000 ($10,000 to less than $15,000) | INCOME2_Less than $10,000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47000 | 0.20 | 0.14 | 0.90 | 0.61 | 0.035 | 0.27 | 0.23 | 0.86 | 0.13 | 0.66 | ... | 0.001 | 0.056 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
47001 | 0.19 | 0.14 | 0.92 | 0.58 | 0.038 | 0.26 | 0.21 | 0.87 | 0.11 | 0.68 | ... | 0.001 | 0.031 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
47003 | 0.22 | 0.16 | 0.91 | 0.41 | 0.037 | 0.32 | 0.23 | 0.82 | 0.11 | 0.68 | ... | 0.002 | 0.126 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
47005 | 0.23 | 0.12 | 0.96 | 0.47 | 0.049 | 0.29 | 0.24 | 0.85 | 0.14 | 0.76 | ... | 0.000 | 0.025 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
47007 | 0.23 | 0.15 | 0.85 | 0.38 | 0.058 | 0.26 | 0.25 | 0.82 | 0.10 | 0.74 | ... | 0.000 | 0.025 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47181 | 0.24 | 0.14 | 0.96 | 0.40 | 0.049 | 0.41 | 0.25 | 0.86 | 0.15 | 0.81 | ... | 0.000 | 0.021 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
47183 | 0.22 | 0.14 | 0.92 | 0.56 | 0.042 | 0.34 | 0.22 | 0.88 | 0.11 | 0.66 | ... | 0.000 | 0.027 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
47185 | 0.25 | 0.13 | 0.94 | 0.49 | 0.037 | 0.23 | 0.23 | 0.86 | 0.22 | 0.81 | ... | 0.001 | 0.028 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
47187 | 0.14 | 0.16 | 0.96 | 0.87 | 0.025 | 0.18 | 0.15 | 0.92 | 0.08 | 0.81 | ... | 0.001 | 0.049 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
47189 | 0.17 | 0.17 | 0.96 | 0.69 | 0.028 | 0.21 | 0.18 | 0.90 | 0.11 | 0.77 | ... | 0.001 | 0.045 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
96 rows × 25 columns
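The long `elif` ladder that produced these income bins can be collapsed into a single `pd.cut` call. A minimal sketch, assuming the same eight BRFSS `INCOME2` cut points (the sample incomes are made up):

```python
import pandas as pd

# BRFSS INCOME2 cut points (upper-exclusive) and matching labels
bins = [0, 10000, 15000, 20000, 25000, 35000, 50000, 75000, float('inf')]
labels = [
    'INCOME2_Less than $10,000',
    'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
    'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
    'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
    'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
    'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
    'INCOME2_Less than $75,000 ($50,000 to less than $75,000)',
    'INCOME2_$75,000 or more',
]

incomes = pd.Series([9500, 42000, 80000])  # hypothetical median incomes
categories = pd.cut(incomes, bins=bins, labels=labels, right=False)
```

`right=False` makes each bin upper-exclusive, matching the `<` comparisons in the original chain.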
# Score each county with the fitted random forest; predict_proba[:, 1] gives
# the probability of the positive class
county_pred = forest.predict(county_demo)
county_pred_prob = forest.predict_proba(county_demo)[:, 1]
county_demo['predict_prob'] = county_pred_prob
county_demo
General_health | Excessive_drinking | High_school_grad | College | Unemployed | Inactivity | Smoker | Health_plan | Diabetes | Owns_home | ... | _IMPRACE_Hispanic | INCOME2_$75,000 or more | INCOME2_Less than $35,000 ($25,000 to less than $35,000) | INCOME2_Less than $50,000 ($35,000 to less than $50,000) | INCOME2_Less than $75,000 ($50,000 to less than $75,000) | INCOME2_Less than $25,000 ($20,000 to less than $25,000) | INCOME2_Less than $20,000 ($15,000 to less than $20,000) | INCOME2_Less than $15,000 ($10,000 to less than $15,000) | INCOME2_Less than $10,000 | predict_prob | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47000 | 0.20 | 0.14 | 0.90 | 0.61 | 0.035 | 0.27 | 0.23 | 0.86 | 0.13 | 0.66 | ... | 0.056 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.640059 |
47001 | 0.19 | 0.14 | 0.92 | 0.58 | 0.038 | 0.26 | 0.21 | 0.87 | 0.11 | 0.68 | ... | 0.031 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.640059 |
47003 | 0.22 | 0.16 | 0.91 | 0.41 | 0.037 | 0.32 | 0.23 | 0.82 | 0.11 | 0.68 | ... | 0.126 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.687877 |
47005 | 0.23 | 0.12 | 0.96 | 0.47 | 0.049 | 0.29 | 0.24 | 0.85 | 0.14 | 0.76 | ... | 0.025 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.687877 |
47007 | 0.23 | 0.15 | 0.85 | 0.38 | 0.058 | 0.26 | 0.25 | 0.82 | 0.10 | 0.74 | ... | 0.025 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.687877 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47181 | 0.24 | 0.14 | 0.96 | 0.40 | 0.049 | 0.41 | 0.25 | 0.86 | 0.15 | 0.81 | ... | 0.021 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.687877 |
47183 | 0.22 | 0.14 | 0.92 | 0.56 | 0.042 | 0.34 | 0.22 | 0.88 | 0.11 | 0.66 | ... | 0.027 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.667665 |
47185 | 0.25 | 0.13 | 0.94 | 0.49 | 0.037 | 0.23 | 0.23 | 0.86 | 0.22 | 0.81 | ... | 0.028 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.687877 |
47187 | 0.14 | 0.16 | 0.96 | 0.87 | 0.025 | 0.18 | 0.15 | 0.92 | 0.08 | 0.81 | ... | 0.049 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.812499 |
47189 | 0.17 | 0.17 | 0.96 | 0.69 | 0.028 | 0.21 | 0.18 | 0.90 | 0.11 | 0.77 | ... | 0.045 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.812499 |
96 rows × 26 columns
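`predict_proba` assumes `county_demo`'s columns appear in the same order as the features the model was trained on; scikit-learn estimators fitted on a DataFrame expose that order as `feature_names_in_`. One way to guard against drift is to reindex the scoring frame to the training order. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical training-time column order (e.g. forest.feature_names_in_)
train_cols = ['General_health', 'Smoker', 'Diabetes']

# New data with the same columns in a different order, plus an extra one
new_data = pd.DataFrame({'Diabetes': [0.1], 'General_health': [0.2],
                         'Smoker': [0.3], 'extra': [9]})

# Align to the training order; extras are dropped, missing columns become NaN
aligned = new_data.reindex(columns=train_cols)
```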
disparities = pd.read_csv('../data/disparities.csv', header = 0, index_col = 0)
disparities.index = disparities.FIPS
disparities
FIPS | year | primary_sex | primary_age | primary_dual | state | primary_race | primary_eligibility | primary_denominator_well | analysis_value_well | ... | county | primary_care_physicians | population | urban | shadac_category | pcp_per_100k | num_hospitals | num_hospitals_open | mean_beds | num_beds | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FIPS | |||||||||||||||||||||
47033 | 47033 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 25 | ... | Crockett | 0.0 | 14399 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
47061 | 47061 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 34 | ... | Grundy | 0.0 | 13344 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
47095 | 47095 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 20 | ... | Lake | 0.0 | 7401 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
47175 | 47175 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 500-999 | 31 | ... | Van Buren | 0.0 | 5760 | Rural | low inadequate | 0.000000 | 0.0 | 0.0 | NaN | 0.0 |
47007 | 47007 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 1,000-4,999 | 19 | ... | Bledsoe | 1.0 | 14836 | Rural | low inadequate | 6.740361 | 1.0 | 1.0 | 25.000000 | 25.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
47093 | 47093 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 56 | ... | Knox | 520.0 | 461104 | Urban | adequate | 112.772823 | 11.0 | 10.0 | 306.600000 | 3066.0 |
47163 | 47163 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 52 | ... | Sullivan | 199.0 | 157050 | Urban | adequate | 126.711238 | 5.0 | 5.0 | 235.000000 | 1175.0 |
47113 | 47113 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 43 | ... | Madison | 129.0 | 97625 | Urban | adequate | 132.138284 | 3.0 | 3.0 | 330.666667 | 992.0 |
47187 | 47187 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 48 | ... | Williamson | 338.0 | 225389 | Urban | adequate | 149.962953 | 2.0 | 2.0 | 112.500000 | 225.0 |
47179 | 47179 | 2019 | All | All | Dual & non-dual | TENNESSEE | All | All | 10,000+ | 57 | ... | Washington | 226.0 | 127805 | Urban | adequate | 176.831892 | 7.0 | 6.0 | 167.000000 | 1002.0 |
95 rows × 30 columns
full_df = disparities.join(county_demo, lsuffix = 'disp', rsuffix = 'demo')
full_df.sort_values('predict_prob')[['county','predict_prob']]
county | predict_prob | |
---|---|---|
FIPS | ||
47027 | Clay | 0.615584 |
47095 | Lake | 0.615584 |
47067 | Hancock | 0.615584 |
47149 | Rutherford | 0.640059 |
47125 | Montgomery | 0.640059 |
... | ... | ... |
47091 | Johnson | 0.687877 |
47055 | Giles | 0.687877 |
47075 | Haywood | 0.692379 |
47189 | Wilson | 0.812499 |
47187 | Williamson | 0.812499 |
95 rows × 2 columns
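The `join` above needs `lsuffix`/`rsuffix` because both frames carry an `Unemployed` column; that is where the `Unemployeddisp` and `Unemployeddemo` names in the column listing come from. A toy reproduction of the behavior:

```python
import pandas as pd

# Two frames sharing an index and one overlapping column name
left = pd.DataFrame({'Unemployed': [0.05]}, index=[47001])
right = pd.DataFrame({'Unemployed': [0.04], 'predict_prob': [0.64]}, index=[47001])

# Overlapping names get the suffixes appended, as in disparities.join(county_demo)
joined = left.join(right, lsuffix='disp', rsuffix='demo')
```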
full_df.columns
Index(['FIPS', 'year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
'primary_race', 'primary_eligibility', 'primary_denominator_well',
'analysis_value_well', 'primary_denominator_hosp',
'analysis_value_hosp', 'primary_denominator_emer',
'analysis_value_emer', 'laus_code', 'Name', 'Employed',
'Unemployeddisp', 'unemployment_rate', 'FIPS.1', 'county',
'primary_care_physicians', 'population', 'urban', 'shadac_category',
'pcp_per_100k', 'num_hospitals', 'num_hospitals_open', 'mean_beds',
'num_beds', 'General_health', 'Excessive_drinking', 'High_school_grad',
'College', 'Unemployeddemo', 'Inactivity', 'Smoker', 'Health_plan',
'Diabetes', 'Owns_home', 'Rural', '_IMPRACE_White, Non-Hispanic',
'_IMPRACE_Black, Non-Hispanic',
'_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
'_IMPRACE_Asian, Non-Hispanic', '_IMPRACE_Other race, Non-Hispanic',
'_IMPRACE_Hispanic', 'INCOME2_$75,000 or more',
'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
'INCOME2_Less than $75,000 ($50,000 to less than $75,000)',
'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
'INCOME2_Less than $10,000', 'predict_prob'],
dtype='object')
# Pass x and y as keywords (positional use is deprecated in seaborn 0.12+)
sns.scatterplot(x='Health_plan', y='predict_prob', data=full_df)
<AxesSubplot:xlabel='Health_plan', ylabel='predict_prob'>

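A scatter plot hints at a relationship but does not quantify it; `Series.corr` gives the Pearson coefficient directly. A sketch with made-up values standing in for `full_df`:

```python
import pandas as pd

# Toy stand-in for full_df: insurance coverage vs. predicted checkup probability
df = pd.DataFrame({'Health_plan': [0.82, 0.86, 0.90, 0.92],
                   'predict_prob': [0.66, 0.69, 0.70, 0.81]})

# Pearson correlation between the two columns
r = df['Health_plan'].corr(df['predict_prob'])
```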
county_demo.plot(kind='box', figsize=(10, 10))
plt.xticks(rotation=90);

Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?
counties = gpd.read_file('../data/county/tncounty.shp')
counties
OBJECTID | NAME | KEY | SHAPE_AREA | SHAPE_LEN | geometry | |
---|---|---|---|---|---|---|
0 | 76 | Chester | 47023 | 8.049024e+09 | 520461.080124 | POLYGON ((1137985.762 344601.643, 1137965.070 ... |
1 | 77 | Wayne | 47181 | 2.050741e+10 | 666520.678598 | POLYGON ((1365052.057 391716.806, 1365746.554 ... |
2 | 78 | Tipton | 47167 | 1.319125e+10 | 865093.887634 | MULTIPOLYGON (((886814.330 400456.525, 886774.... |
3 | 79 | Hamilton | 47065 | 1.604776e+10 | 652926.001078 | POLYGON ((2274954.438 239788.911, 2274090.610 ... |
4 | 80 | Stewart | 47161 | 1.375003e+10 | 490090.336180 | POLYGON ((1382472.783 743972.302, 1382445.171 ... |
... | ... | ... | ... | ... | ... | ... |
90 | 91 | McNairy | 47109 | 1.563586e+10 | 566369.132062 | POLYGON ((1137985.762 344601.643, 1139350.519 ... |
91 | 92 | Franklin | 47051 | 1.605093e+10 | 621176.096919 | POLYGON ((1873015.265 239618.144, 1872957.848 ... |
92 | 93 | Bradley | 47011 | 9.241234e+09 | 457372.233476 | POLYGON ((2274954.438 239788.911, 2275552.803 ... |
93 | 94 | Marion | 47115 | 1.428734e+10 | 529431.591556 | POLYGON ((2126056.390 236919.771, 2122873.509 ... |
94 | 95 | Polk | 47139 | 1.233228e+10 | 479994.126988 | POLYGON ((2355580.184 332970.851, 2355673.384 ... |
95 rows × 6 columns
full_df.county
FIPS
47033 Crockett
47061 Grundy
47095 Lake
47175 Van Buren
47007 Bledsoe
...
47093 Knox
47163 Sullivan
47113 Madison
47187 Williamson
47179 Washington
Name: county, Length: 95, dtype: object
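Before merging on county name, it is worth checking that every name in the shapefile has an exact match in `full_df.county`, since an inner merge silently drops any spelling mismatch. A sketch with hypothetical names (the DeKalb/De Kalb discrepancy is invented for illustration):

```python
import pandas as pd

shapes = pd.Series(['Chester', 'Wayne', 'DeKalb'])   # stand-in for counties.NAME
table = pd.Series(['Chester', 'Wayne', 'De Kalb'])   # stand-in for full_df.county

# Names present in one source but not the other would vanish from an inner merge
only_in_shapes = set(shapes) - set(table)
only_in_table = set(table) - set(shapes)
```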
counties = pd.merge(left=counties, right=full_df[['county', 'predict_prob']],
                    left_on='NAME', right_on='county')
counties['no_checkup'] = 1 - counties.predict_prob
from matplotlib.lines import Line2D
fig, ax = plt.subplots(figsize=(16,4))
counties.plot(column = 'no_checkup',
edgecolor = 'black',
legend = True,
cmap = 'viridis',
# scheme="NaturalBreaks",
ax = ax)
# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
# markers = []
# for line in leg.get_lines():
# marker = Line2D([0],[0], marker = 'o',
# markersize = line.get_markersize(),
# color = line.get_markerfacecolor(),
# linestyle = 'None',
# markeredgecolor = 'black',
# markeredgewidth = 1)
# markers.append(marker)
# Redraw the legend with the new labels and markers
# (disabled: markers and labels are only built in the commented-out block above)
# plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
plt.title('Predicted No Checkup in the Last Year for Average Resident', fontsize = 18)
ax.axis('off');

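To verify that the choropleth actually covers all 95 counties, an outer merge with `indicator=True` surfaces any county that failed to pick up a prediction. A sketch with toy data:

```python
import pandas as pd

shapes = pd.DataFrame({'NAME': ['Chester', 'Wayne', 'Tipton']})
preds = pd.DataFrame({'county': ['Chester', 'Wayne'],
                      'predict_prob': [0.64, 0.69]})

# An outer merge keeps unmatched counties, and _merge flags which side they came from;
# the default inner merge would silently drop them from the map
checked = shapes.merge(preds, left_on='NAME', right_on='county',
                       how='outer', indicator=True)
unmatched = checked[checked['_merge'] != 'both']
```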