Data Science Essentials

Final Project for Data Science Essentials Course

TN Med Helper is a fictional company whose mission is to ensure access to healthcare for all Tennesseans. TN Med Helper has approached your data science consultancy for help identifying communities in Tennessee that need the most help in expanding access to healthcare.

In this project, we will use the Medicare Disparities data as a starting point for identifying such communities. Specifically, you will be provided with datasets containing the percent of Medicare beneficiaries who had an annual wellness visit (annual_wellness.csv), the number of all-cause hospitalizations per 1,000 beneficiaries (hospitalizations.csv), and the number of emergency department visits per 1,000 beneficiaries (emergency_department.csv). Over the next 8 weeks, you will work towards addressing the following three objectives.

First, TN Med Helper is concerned about communities either lacking access to healthcare or losing access to healthcare. They are looking to expand telehealth technologies into vulnerable communities, and need your help to prioritize areas most needing attention. Your first objective is to identify which counties in Tennessee have the most severe lack of access to healthcare (either due to lack of hospitals, physicians, or both). Once you have identified these counties, see if you can find any common demographic or economic characteristics for these areas.

Second, TN Med Helper is interested in reducing the number of potentially preventable hospitalizations. Do areas that lack access to healthcare tend to have higher rates of emergency department visits or hospitalizations? Is there an association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits?

Finally, TN Med Helper is trying to identify specific subpopulations to focus more attention on. Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year. Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year. Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?

Over the course of this class, you will build up your data analysis skills and work towards answering these questions. At the end of the project, teams will present their findings to TN Med Helper.

Looking at the data

  • Annual wellness visits are reported as a percentage of beneficiaries.
  • Emergency department visits are reported as a number per 1,000 beneficiaries.
  • All-cause hospitalizations are reported as a number per 1,000 beneficiaries.
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
hospitalizations = pd.read_csv("../data/Medicare_Disparities_by_Population/hospitalizations.csv")
hospitalizations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3222 entries, 0 to 3221
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 3222 non-null   int64 
 1   geography            3222 non-null   object
 2   measure              3222 non-null   object
 3   adjustment           3222 non-null   object
 4   analysis             3222 non-null   object
 5   domain               3222 non-null   object
 6   condition            3222 non-null   object
 7   primary_sex          3222 non-null   object
 8   primary_age          3222 non-null   object
 9   primary_dual         3222 non-null   object
 10  fips                 3222 non-null   int64 
 11  county               3205 non-null   object
 12  state                3222 non-null   object
 13  urban                3222 non-null   object
 14  primary_race         3222 non-null   object
 15  primary_eligibility  3222 non-null   object
 16  primary_denominator  3222 non-null   object
 17  analysis_value       3222 non-null   int64 
dtypes: int64(3), object(15)
memory usage: 453.2+ KB
annual_wellness = pd.read_csv("../data/Medicare_Disparities_by_Population/annual_wellness.csv")
annual_wellness.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 3220 non-null   int64 
 1   geography            3220 non-null   object
 2   measure              3220 non-null   object
 3   adjustment           3220 non-null   object
 4   analysis             3220 non-null   object
 5   domain               3220 non-null   object
 6   condition            3220 non-null   object
 7   primary_sex          3220 non-null   object
 8   primary_age          3220 non-null   object
 9   primary_dual         3220 non-null   object
 10  fips                 3220 non-null   int64 
 11  county               3205 non-null   object
 12  state                3220 non-null   object
 13  urban                3220 non-null   object
 14  primary_race         3220 non-null   object
 15  primary_eligibility  3220 non-null   object
 16  primary_denominator  3220 non-null   object
 17  analysis_value       3220 non-null   int64 
dtypes: int64(3), object(15)
memory usage: 452.9+ KB
emergency_department = pd.read_csv("../data/Medicare_Disparities_by_Population/emergency_department.csv")
emergency_department.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 3220 non-null   int64 
 1   geography            3220 non-null   object
 2   measure              3220 non-null   object
 3   adjustment           3220 non-null   object
 4   analysis             3220 non-null   object
 5   domain               3220 non-null   object
 6   condition            3220 non-null   object
 7   primary_sex          3220 non-null   object
 8   primary_age          3220 non-null   object
 9   primary_dual         3220 non-null   object
 10  fips                 3220 non-null   int64 
 11  county               3205 non-null   object
 12  state                3220 non-null   object
 13  urban                3220 non-null   object
 14  primary_race         3220 non-null   object
 15  primary_eligibility  3220 non-null   object
 16  primary_denominator  3220 non-null   object
 17  analysis_value       3220 non-null   int64 
dtypes: int64(3), object(15)
memory usage: 452.9+ KB

All three dataframes have the same structure and the same column names, which would collide in a merge. I will rename the columns I want to keep from each, then merge them.
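As an aside, pd.merge can disambiguate colliding columns itself via its suffixes argument; a minimal sketch (note it would suffix every overlapping column, not just the ones we keep):

# Sketch only: suffix colliding columns during the merge instead of renaming by hand.
merged = pd.merge(annual_wellness, hospitalizations, on = 'fips',
                  suffixes = ('_well', '_hosp'))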

emergency_department.iloc[1] == hospitalizations.iloc[1]
year                    True
geography               True
measure                False
adjustment              True
analysis                True
domain                  True
condition              False
primary_sex             True
primary_age             True
primary_dual            True
fips                    True
county                  True
state                   True
urban                   True
primary_race            True
primary_eligibility     True
primary_denominator    False
analysis_value         False
Name: 1, dtype: bool
# Collect the unique values in each column, then blank out the columns that
# vary county-by-county so the constant columns stand out.
emer_dict = {col: set(emergency_department[col]) for col in emergency_department}
for col in ['fips', 'county', 'state', 'analysis_value']:
    emer_dict[col] = []
emer_dict
{'year': {2019},
 'geography': {'County'},
 'measure': {'Emergency department visit rate'},
 'adjustment': {'Unsmoothed actual'},
 'analysis': {'Base measure'},
 'domain': {'Primary chronic conditions'},
 'condition': {'All Emergency Department Visits'},
 'primary_sex': {'All'},
 'primary_age': {'All'},
 'primary_dual': {'Dual & non-dual'},
 'fips': [],
 'county': [],
 'state': [],
 'urban': {'Rural', 'Urban'},
 'primary_race': {'All'},
 'primary_eligibility': {'All'},
 'primary_denominator': {'undefined'},
 'analysis_value': []}

The row comparison above shows which columns differ between the dataframes: measure, condition, primary_denominator, and analysis_value. The unique-value listing confirms that, apart from urban and the county identifiers (fips, county, state), every other column holds a single value.
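A quick programmatic confirmation (a sketch; constant_columns is a hypothetical helper name):

def constant_columns(df):
    # Return the columns of df that hold a single unique value.
    return [col for col in df.columns if df[col].nunique(dropna = False) == 1]

print(constant_columns(emergency_department))
print(constant_columns(hospitalizations))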

hospitalizations = hospitalizations[['fips','measure','primary_denominator', 'analysis_value']]
emergency_department = emergency_department[['fips','measure','primary_denominator', 'analysis_value']]
hospitalizations.columns = ['fips','measure_hosp','primary_denominator_hosp', 'analysis_value_hosp']
emergency_department.columns = ['fips','measure_emer','primary_denominator_emer', 'analysis_value_emer']

annual_wellness = annual_wellness.rename(columns = {'analysis_value': 'analysis_value_well',
                                                    'measure': 'measure_well',
                                                    'primary_denominator': 'primary_denominator_well'})
disparities = pd.merge(left = annual_wellness, left_on = 'fips',
         right = hospitalizations, right_on = 'fips')
disparities = pd.merge(left = disparities, left_on = 'fips',
         right = emergency_department, right_on = 'fips')
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220 entries, 0 to 3219
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   year                      3220 non-null   int64 
 1   geography                 3220 non-null   object
 2   measure_well              3220 non-null   object
 3   adjustment                3220 non-null   object
 4   analysis                  3220 non-null   object
 5   domain                    3220 non-null   object
 6   condition                 3220 non-null   object
 7   primary_sex               3220 non-null   object
 8   primary_age               3220 non-null   object
 9   primary_dual              3220 non-null   object
 10  fips                      3220 non-null   int64 
 11  county                    3205 non-null   object
 12  state                     3220 non-null   object
 13  urban                     3220 non-null   object
 14  primary_race              3220 non-null   object
 15  primary_eligibility       3220 non-null   object
 16  primary_denominator_well  3220 non-null   object
 17  analysis_value_well       3220 non-null   int64 
 18  measure_hosp              3220 non-null   object
 19  primary_denominator_hosp  3220 non-null   object
 20  analysis_value_hosp       3220 non-null   int64 
 21  measure_emer              3220 non-null   object
 22  primary_denominator_emer  3220 non-null   object
 23  analysis_value_emer       3220 non-null   int64 
dtypes: int64(5), object(19)
memory usage: 628.9+ KB
print(set(disparities.adjustment),set(disparities.analysis),set(disparities.domain),set(disparities.condition))
{'Unsmoothed actual'} {'Base measure'} {'Preventive Services'} {'Annual Wellness Visit'}

Each of the columns above only has one value, so I am going to remove them.

disparities = disparities.drop(['adjustment','analysis','domain','condition'], axis = 1)
disparities.head(2)
year geography measure_well primary_sex primary_age primary_dual fips county state urban primary_race primary_eligibility primary_denominator_well analysis_value_well measure_hosp primary_denominator_hosp analysis_value_hosp measure_emer primary_denominator_emer analysis_value_emer
0 2019 County Preventive Services All All Dual & non-dual 1001 Autauga County ALABAMA Urban All All 1,000-4,999 29 Hospitalization 5,000-9,999 299 Emergency department visit rate undefined 723
1 2019 County Preventive Services All All Dual & non-dual 1003 Baldwin County ALABAMA Rural All All 10,000+ 29 Hospitalization 10,000+ 268 Emergency department visit rate undefined 601

Lastly, I will narrow down the list to only counties in TN.

disparities = disparities.loc[disparities.state == 'TENNESSEE']
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 2428 to 2522
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   year                      95 non-null     int64 
 1   geography                 95 non-null     object
 2   measure_well              95 non-null     object
 3   primary_sex               95 non-null     object
 4   primary_age               95 non-null     object
 5   primary_dual              95 non-null     object
 6   fips                      95 non-null     int64 
 7   county                    95 non-null     object
 8   state                     95 non-null     object
 9   urban                     95 non-null     object
 10  primary_race              95 non-null     object
 11  primary_eligibility       95 non-null     object
 12  primary_denominator_well  95 non-null     object
 13  analysis_value_well       95 non-null     int64 
 14  measure_hosp              95 non-null     object
 15  primary_denominator_hosp  95 non-null     object
 16  analysis_value_hosp       95 non-null     int64 
 17  measure_emer              95 non-null     object
 18  primary_denominator_emer  95 non-null     object
 19  analysis_value_emer       95 non-null     int64 
dtypes: int64(5), object(15)
memory usage: 15.6+ KB

As before, we have all 95 counties in TN.

Question 1: Identify which counties in TN have the most severe lack of access to healthcare.

Which counties have low numbers of physicians?

physicians = pd.read_csv('../data/primary_care_physicians.csv')
physicians = physicians.loc[physicians.state == 'Tennessee']
physicians.shape
(95, 4)
population = pd.read_csv('../data/population_by_county.csv')
population.head()
FIPS population county state urban
0 17051 21565 Fayette County ILLINOIS Rural
1 17107 29003 Logan County ILLINOIS Rural
2 17165 23994 Saline County ILLINOIS Rural
3 17097 701473 Lake County ILLINOIS Urban
4 17127 14219 Massac County ILLINOIS Rural
physicians = pd.merge(left = physicians, right = population[['FIPS','population','urban']], left_on='FIPS', right_on = 'FIPS')
physicians['urban'].value_counts()
Rural    57
Urban    38
Name: urban, dtype: int64
ratio = physicians.population / physicians.primary_care_physicians
physicians.loc[ratio < 1500, 'supply'] = 'adequate'
physicians.loc[(ratio >= 1500) & (ratio < 3500), 'supply'] = 'moderately adequate'
physicians.loc[ratio >= 3500, 'supply'] = 'low inadequate'
physicians = physicians.rename({'supply':'shadac_category'}, axis = 'columns')
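An equivalent way to assign these categories is pd.cut on the residents-per-physician ratio; a sketch using the same thresholds:

import numpy as np

# Residents per physician; counties with zero physicians divide to inf and
# fall into the lowest-supply bin.
ratio = physicians.population / physicians.primary_care_physicians
physicians['shadac_category'] = pd.cut(ratio, bins = [0, 1500, 3500, np.inf],
                                       labels = ['adequate', 'moderately adequate', 'low inadequate'])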
import matplotlib.pyplot as plt
pd.crosstab(physicians.urban, physicians.shadac_category, normalize='index').plot(kind = 'bar', stacked = True)
plt.xticks(rotation = 0)
plt.legend(bbox_to_anchor = (1, 0.8), loc = 'upper left');   # move the legend to the right side of the plot
plt.title("SHADAC Category by Type of County")
Text(0.5, 1.0, 'SHADAC Category by Type of County')
physicians
FIPS state county primary_care_physicians population urban shadac_category
0 47001 Tennessee Anderson 39.0 76061 Urban moderately adequate
1 47003 Tennessee Bedford 15.0 48292 Rural moderately adequate
2 47005 Tennessee Benton 3.0 16140 Rural low inadequate
3 47007 Tennessee Bledsoe 1.0 14836 Rural low inadequate
4 47009 Tennessee Blount 90.0 129927 Urban adequate
... ... ... ... ... ... ... ...
90 47181 Tennessee Wayne 5.0 16693 Rural moderately adequate
91 47183 Tennessee Weakley 18.0 33510 Rural moderately adequate
92 47185 Tennessee White 9.0 26800 Rural moderately adequate
93 47187 Tennessee Williamson 338.0 225389 Urban adequate
94 47189 Tennessee Wilson 43.0 136666 Urban moderately adequate

95 rows × 7 columns

unemployment = pd.read_csv('../data/tn_unemployment.csv')
unemployment.head()
laus_code State County Name Period LF Employed Unemployed unemployment_rate
0 CN4700100000000 47 1 Anderson County, TN Mar-21 34704 33010 1694 4.9
1 CN4700300000000 47 3 Bedford County, TN Mar-21 20623 19550 1073 5.2
2 CN4700500000000 47 5 Benton County, TN Mar-21 6723 6305 418 6.2
3 CN4700700000000 47 7 Bledsoe County, TN Mar-21 4252 3947 305 7.2
4 CN4700900000000 47 9 Blount County, TN Mar-21 64098 61119 2979 4.6
unemployment.Name = unemployment.Name.str.split(' County, TN', expand = True)[0]
physicians = pd.merge(left=unemployment[['laus_code','Name','Employed','Unemployed','unemployment_rate']], left_on='Name',
        right = physicians[['FIPS','county','primary_care_physicians','population','urban','shadac_category']], right_on='county')
physicians
laus_code Name Employed Unemployed unemployment_rate FIPS county primary_care_physicians population urban shadac_category
0 CN4700100000000 Anderson 33010 1694 4.9 47001 Anderson 39.0 76061 Urban moderately adequate
1 CN4700300000000 Bedford 19550 1073 5.2 47003 Bedford 15.0 48292 Rural moderately adequate
2 CN4700500000000 Benton 6305 418 6.2 47005 Benton 3.0 16140 Rural low inadequate
3 CN4700700000000 Bledsoe 3947 305 7.2 47007 Bledsoe 1.0 14836 Rural low inadequate
4 CN4700900000000 Blount 61119 2979 4.6 47009 Blount 90.0 129927 Urban adequate
... ... ... ... ... ... ... ... ... ... ... ...
90 CN4718100000000 Wayne 6074 342 5.3 47181 Wayne 5.0 16693 Rural moderately adequate
91 CN4718300000000 Weakley 14783 711 4.6 47183 Weakley 18.0 33510 Rural moderately adequate
92 CN4718500000000 White 11484 601 5.0 47185 White 9.0 26800 Rural moderately adequate
93 CN4718700000000 Williamson 125213 4271 3.3 47187 Williamson 338.0 225389 Urban adequate
94 CN4718900000000 Wilson 74347 3079 4.0 47189 Wilson 43.0 136666 Urban moderately adequate

95 rows × 11 columns

import seaborn as sns
sns.boxplot(x = physicians.urban, y = physicians.unemployment_rate)
plt.xticks(rotation = 0)
plt.title("Unemployment by Type of County")
Text(0.5, 1.0, 'Unemployment by Type of County')
physicians['pcp_per_100k'] = physicians.primary_care_physicians/physicians.population*100000
sns.lmplot(data = physicians, y = 'pcp_per_100k', x = 'unemployment_rate')
plt.ylabel("PCP per 100k residents")
plt.xlabel("Unemployment Rate")
physicians[['unemployment_rate', 'pcp_per_100k']].corr()
unemployment_rate pcp_per_100k
unemployment_rate 1.000000 -0.335333
pcp_per_100k -0.335333 1.000000
disparities = pd.merge(left = disparities, left_on = 'fips',
         right = physicians, right_on = 'FIPS')

Now get rid of repetitive columns:

disparities.head(2)
year geography measure_well primary_sex primary_age primary_dual fips county_x state urban_x ... Employed Unemployed unemployment_rate FIPS county_y primary_care_physicians population urban_y shadac_category pcp_per_100k
0 2019 County Preventive Services All All Dual & non-dual 47001 Anderson County TENNESSEE Urban ... 33010 1694 4.9 47001 Anderson 39.0 76061 Urban moderately adequate 51.274635
1 2019 County Preventive Services All All Dual & non-dual 47003 Bedford County TENNESSEE Rural ... 19550 1073 5.2 47003 Bedford 15.0 48292 Rural moderately adequate 31.061045

2 rows × 32 columns

disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 94
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   year                      95 non-null     int64  
 1   geography                 95 non-null     object 
 2   measure_well              95 non-null     object 
 3   primary_sex               95 non-null     object 
 4   primary_age               95 non-null     object 
 5   primary_dual              95 non-null     object 
 6   fips                      95 non-null     int64  
 7   county_x                  95 non-null     object 
 8   state                     95 non-null     object 
 9   urban_x                   95 non-null     object 
 10  primary_race              95 non-null     object 
 11  primary_eligibility       95 non-null     object 
 12  primary_denominator_well  95 non-null     object 
 13  analysis_value_well       95 non-null     int64  
 14  measure_hosp              95 non-null     object 
 15  primary_denominator_hosp  95 non-null     object 
 16  analysis_value_hosp       95 non-null     int64  
 17  measure_emer              95 non-null     object 
 18  primary_denominator_emer  95 non-null     object 
 19  analysis_value_emer       95 non-null     int64  
 20  laus_code                 95 non-null     object 
 21  Name                      95 non-null     object 
 22  Employed                  95 non-null     int64  
 23  Unemployed                95 non-null     int64  
 24  unemployment_rate         95 non-null     float64
 25  FIPS                      95 non-null     int64  
 26  county_y                  95 non-null     object 
 27  primary_care_physicians   95 non-null     float64
 28  population                95 non-null     int64  
 29  urban_y                   95 non-null     object 
 30  shadac_category           95 non-null     object 
 31  pcp_per_100k              95 non-null     float64
dtypes: float64(3), int64(9), object(20)
memory usage: 24.5+ KB
disparities = disparities.drop(['geography','county_x','urban_x', 'fips'], axis = 1)
disparities.index = disparities.FIPS
disparities.columns = disparities.columns.str.replace('county_y', 'county')
disparities.columns = disparities.columns.str.replace('urban_y', 'urban')
print(set(disparities.measure_well),set(disparities.measure_emer),set(disparities.measure_hosp))
{'Preventive Services'} {'Emergency department visit rate'} {'Hospitalization'}
disparities = disparities.drop(['measure_well','measure_emer','measure_hosp'], axis = 1)

disparities.iloc[0]
year                                       2019
primary_sex                                 All
primary_age                                 All
primary_dual                    Dual & non-dual
state                                 TENNESSEE
primary_race                                All
primary_eligibility                         All
primary_denominator_well            5,000-9,999
analysis_value_well                          35
primary_denominator_hosp                10,000+
analysis_value_hosp                         267
primary_denominator_emer              undefined
analysis_value_emer                         617
laus_code                       CN4700100000000
Name                                   Anderson
Employed                                  33010
Unemployed                                 1694
unemployment_rate                           4.9
FIPS                                      47001
county                                 Anderson
primary_care_physicians                    39.0
population                                76061
urban                                     Urban
shadac_category             moderately adequate
pcp_per_100k                          51.274635
Name: 47001, dtype: object
disparities.to_csv("../data/disparities.csv")

Analyzing disparities in access: physicians

disparities = pd.read_csv('../data/disparities.csv', index_col = 0)
hospitals = pd.read_csv('../data/Hospitals.csv')
hospitals = hospitals.loc[hospitals.STATE =='TN']
hospitals.COUNTYFIPS = pd.to_numeric(hospitals['COUNTYFIPS'])
# year is constant (2019), so drop it before correlating to avoid an all-NaN row.
disparities.drop(columns = 'year').corr().style.background_gradient('viridis')
  analysis_value_well analysis_value_hosp analysis_value_emer Employed Unemployed unemployment_rate FIPS.1 primary_care_physicians population pcp_per_100k
analysis_value_well 1.000000 -0.480732 -0.480607 0.288357 0.221078 -0.325228 0.032489 0.319609 0.281995 0.488941
analysis_value_hosp -0.480732 1.000000 0.599412 -0.093001 -0.049513 0.275980 0.085773 -0.121397 -0.091323 -0.308934
analysis_value_emer -0.480607 0.599412 1.000000 -0.179994 -0.121924 0.484027 -0.044881 -0.202044 -0.172820 -0.396181
Employed 0.288357 -0.093001 -0.179994 1.000000 0.965032 -0.196023 0.089982 0.961056 0.992966 0.502287
Unemployed 0.221078 -0.049513 -0.121924 0.965032 1.000000 -0.077655 0.091267 0.934610 0.985295 0.432351
unemployment_rate -0.325228 0.275980 0.484027 -0.196023 -0.077655 1.000000 -0.104688 -0.159979 -0.156259 -0.335333
FIPS.1 0.032489 0.085773 -0.044881 0.089982 0.091267 -0.104688 1.000000 0.093030 0.091808 0.103152
primary_care_physicians 0.319609 -0.121397 -0.202044 0.961056 0.934610 -0.159979 0.093030 1.000000 0.964170 0.624601
population 0.281995 -0.091323 -0.172820 0.992966 0.985295 -0.156259 0.091808 0.964170 1.000000 0.501609
pcp_per_100k 0.488941 -0.308934 -0.396181 0.502287 0.432351 -0.335333 0.103152 0.624601 0.501609 1.000000
disparities.columns
Index(['year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
       'primary_race', 'primary_eligibility', 'primary_denominator_well',
       'analysis_value_well', 'primary_denominator_hosp',
       'analysis_value_hosp', 'primary_denominator_emer',
       'analysis_value_emer', 'laus_code', 'Name', 'Employed', 'Unemployed',
       'unemployment_rate', 'FIPS.1', 'county', 'primary_care_physicians',
       'population', 'urban', 'shadac_category', 'pcp_per_100k'],
      dtype='object')
import seaborn as sns
sns.pairplot(disparities[['analysis_value_well', 
                          'analysis_value_hosp', 
                          'analysis_value_emer',
                          'unemployment_rate',
                          'population',
                          'pcp_per_100k','urban'
                         ]], hue = 'urban')
<seaborn.axisgrid.PairGrid at 0x19f0b01c0>

First, we will look at primary care physicians by county. We computed these numbers above, so here we just summarize them in plots.

sns.boxplot(x = disparities.shadac_category, y = disparities.pcp_per_100k)
<AxesSubplot:xlabel='shadac_category', ylabel='pcp_per_100k'>
disparities.shadac_category.value_counts()
moderately adequate    50
low inadequate         31
adequate               14
Name: shadac_category, dtype: int64

There are 31 counties considered low inadequate (>3500 residents per physician).

low_shadac = disparities.loc[disparities.shadac_category == 'low inadequate']
low_shadac.sort_values('pcp_per_100k').head(10)
year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well primary_denominator_hosp ... Employed Unemployed unemployment_rate FIPS.1 county primary_care_physicians population urban shadac_category pcp_per_100k
FIPS
47175 2019 All All Dual & non-dual TENNESSEE All All 500-999 31 500-999 ... 1884 137 6.8 47175 Van Buren 0.0 5760 Rural low inadequate 0.000000
47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 1,000-4,999 ... 6494 344 5.0 47033 Crockett 0.0 14399 Rural low inadequate 0.000000
47061 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 34 1,000-4,999 ... 4655 322 6.5 47061 Grundy 0.0 13344 Rural low inadequate 0.000000
47095 2019 All All Dual & non-dual TENNESSEE All All 500-999 20 1,000-4,999 ... 1537 157 9.3 47095 Lake 0.0 7401 Rural low inadequate 0.000000
47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 1,000-4,999 ... 3947 305 7.2 47007 Bledsoe 1.0 14836 Rural low inadequate 6.740361
47161 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 39 1,000-4,999 ... 5146 305 5.6 47161 Stewart 1.0 13427 Urban low inadequate 7.447680
47097 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 36 1,000-4,999 ... 9075 658 6.8 47097 Lauderdale 2.0 25989 Rural low inadequate 7.695564
47129 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 23 1,000-4,999 ... 7327 455 5.8 47129 Morgan 3.0 21545 Rural low inadequate 13.924344
47117 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 20 1,000-4,999 ... 14422 735 4.8 47117 Marshall 5.0 32965 Rural low inadequate 15.167602
47067 2019 All All Dual & non-dual TENNESSEE All All 500-999 27 500-999 ... 1997 144 6.7 47067 Hancock 1.0 6587 Rural low inadequate 15.181418

10 rows × 25 columns

There are 31 counties with low access to physicians, by SHADAC category. In particular, 4 counties have 0 physicians: Van Buren, Crockett, Grundy, and Lake. All four are rural, though Crockett and Grundy each have more than 13,000 residents.


Analyzing disparities in access: hospitals

We would like to add hospitals per 100K residents for each county to disparities. To do this, we match hospitals.COUNTYFIPS against the FIPS index of disparities, then count up the hospitals (plus associated variables) in each county and scale by population.
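A sketch of the target quantity (hosp_per_100k is an illustrative name, and the STATUS filter anticipates the open/closed distinction explored below):

# Count open hospitals per county FIPS, align on the FIPS index of disparities,
# and scale by county population.
open_counts = hospitals.loc[hospitals.STATUS == 'OPEN', 'COUNTYFIPS'].value_counts()
hosp_per_100k = disparities.index.to_series().map(open_counts).fillna(0) / disparities.population * 100000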

hospitals.iloc[0]
X                                           -9409026.0792
Y                                            4210878.1252
OBJECTID                                              517
ID                                               14737331
NAME                 STARR REGIONAL MEDICAL CENTER ETOWAH
ADDRESS                             886 HIGHWAY 411 NORTH
CITY                                               ETOWAH
STATE                                                  TN
ZIP                                                 37331
ZIP4                                        NOT AVAILABLE
TELEPHONE                                  (423) 263-3600
TYPE                                   GENERAL ACUTE CARE
STATUS                                               OPEN
POPULATION                                             72
COUNTY                                             MCMINN
COUNTYFIPS                                          47107
COUNTRY                                               USA
LATITUDE                                        35.345099
LONGITUDE                                      -84.522719
NAICS_CODE                                         622110
NAICS_DESC         GENERAL MEDICAL AND SURGICAL HOSPITALS
SOURCE        https://apps.health.tn.gov/facilitylistings
SOURCEDATE                            2019/08/10 00:00:00
VAL_METHOD                                        IMAGERY
VAL_DATE                              2014/02/10 00:00:00
WEBSITE                     http://www.starrregional.com/
STATE_ID                                    NOT AVAILABLE
ALT_NAME                          WOODS MEMORIAL HOSPITAL
ST_FIPS                                                47
OWNER                                          NON-PROFIT
TTL_STAFF                                            -999
BEDS                                                   72
TRAUMA                                          LEVEL III
HELIPAD                                                 Y
Name: 515, dtype: object
hospitals.COUNTYFIPS.value_counts()
47157    23
47037    15
47065    13
47093    11
47179     7
         ..
47019     1
47177     1
47027     1
47159     1
47181     1
Name: COUNTYFIPS, Length: 80, dtype: int64
# Map the per-county hospital counts onto the FIPS-indexed disparities frame.
disparities['num_hospitals'] = disparities.index.to_series().map(hospitals.COUNTYFIPS.value_counts())

However, some of these hospitals are closed.

set(hospitals.STATUS)
{'CLOSED', 'OPEN'}
hospitals.groupby('COUNTYFIPS').STATUS.value_counts()
COUNTYFIPS  STATUS
47001       OPEN      1
47003       CLOSED    1
            OPEN      1
47005       OPEN      1
47007       OPEN      1
                     ..
47181       OPEN      1
47183       OPEN      2
47185       OPEN      1
47187       OPEN      2
47189       OPEN      2
Name: STATUS, Length: 89, dtype: int64
# Repeat the count, restricted to hospitals that are still open.
disparities['num_hospitals_open'] = disparities.index.to_series().map(
    hospitals.loc[hospitals.STATUS == 'OPEN'].COUNTYFIPS.value_counts())
disparities
year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well primary_denominator_hosp ... unemployment_rate FIPS.1 county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open
FIPS
47001 2019 All All Dual & non-dual TENNESSEE All All 5,000-9,999 35 10,000+ ... 4.9 47001 Anderson 39.0 76061 Urban moderately adequate 51.274635 1.0 1.0
47003 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 46 5,000-9,999 ... 5.2 47003 Bedford 15.0 48292 Rural moderately adequate 31.061045 2.0 1.0
47005 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 30 1,000-4,999 ... 6.2 47005 Benton 3.0 16140 Rural low inadequate 18.587361 1.0 1.0
47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 1,000-4,999 ... 7.2 47007 Bledsoe 1.0 14836 Rural low inadequate 6.740361 1.0 1.0
47009 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 43 10,000+ ... 4.6 47009 Blount 90.0 129927 Urban adequate 69.269667 2.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 15 1,000-4,999 ... 5.3 47181 Wayne 5.0 16693 Rural moderately adequate 29.952675 1.0 1.0
47183 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 30 5,000-9,999 ... 4.6 47183 Weakley 18.0 33510 Rural moderately adequate 53.715309 2.0 2.0
47185 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 35 5,000-9,999 ... 5.0 47185 White 9.0 26800 Rural moderately adequate 33.582090 1.0 1.0
47187 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 48 10,000+ ... 3.3 47187 Williamson 338.0 225389 Urban adequate 149.962953 2.0 2.0
47189 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 45 10,000+ ... 4.0 47189 Wilson 43.0 136666 Urban moderately adequate 31.463568 2.0 2.0

95 rows × 27 columns

sns.scatterplot(data = disparities, x = 'pcp_per_100k', y = 'num_hospitals_open', hue = 'urban')
<AxesSubplot:xlabel='pcp_per_100k', ylabel='num_hospitals_open'>
disparities = disparities.fillna(0)
disparities.loc[(disparities.pcp_per_100k < 15) &
                (disparities.num_hospitals_open < 2 )]['county']
FIPS
47007       Bledsoe
47033      Crockett
47061        Grundy
47095          Lake
47097    Lauderdale
47129        Morgan
47161       Stewart
47175     Van Buren
Name: county, dtype: object

The counties above are the ones with fewer than 2 open hospitals and fewer than 15 primary care physicians per 100K residents.

Lastly, we can take a look at the number of beds. BEDS uses -999 as a missing-value code, so we replace it with 0 so it doesn't distort the sums and means.

hospitals.BEDS = hospitals.BEDS.replace(-999,0)
# Mean and total beds across the open hospitals in each county.
beds_open = hospitals.loc[hospitals.STATUS == 'OPEN'].groupby('COUNTYFIPS').BEDS
disparities['mean_beds'] = disparities.index.to_series().map(beds_open.mean())
disparities['num_beds'] = disparities.index.to_series().map(beds_open.sum())

disparities.num_beds = disparities.num_beds.fillna(0)
sns.scatterplot(data = disparities.loc[disparities.urban == 'Rural'], x = 'num_hospitals_open', y = 'num_beds')#, hue = 'urban')
<AxesSubplot:xlabel='num_hospitals_open', ylabel='num_beds'>
sns.scatterplot(data = disparities, x = 'pcp_per_100k', y = 'num_beds', hue = 'urban')
<AxesSubplot:xlabel='pcp_per_100k', ylabel='num_beds'>
disparities.to_csv('../data/disparities.csv')

Plotting disparities

median_income = pd.read_excel('../data/est18all.xls',
              sheet_name = 'est18ALL',
             header = 3,
             usecols = 'C,D,W')
median_income.head(2)
Postal Code Name Median Household Income
0 US United States 61937
1 AL Alabama 49881

For this map, we only need the counties located in Tennessee.

median_income = median_income.loc[median_income['Postal Code'] == 'TN']
median_income.head(2)
Postal Code Name Median Household Income
2471 TN Tennessee 52366
2472 TN Anderson County 50672

The first remaining row is the state-level value for Tennessee, so we remove it.

median_income = median_income.iloc[1:]
median_income.head(2)
Postal Code Name Median Household Income
2472 TN Anderson County 50672
2473 TN Bedford County 49860

Now, let’s read in our county shapefile, which was obtained from http://www.tngis.org/administrative-boundaries.htm

This creates a geopandas DataFrame, which is like a pandas DataFrame, but has geometry associated with it.

counties = gpd.read_file('../data/county/tncounty.shp')
counties.head()
OBJECTID NAME KEY SHAPE_AREA SHAPE_LEN geometry
0 76 Chester 47023 8.049024e+09 520461.080124 POLYGON ((1137985.762 344601.643, 1137965.070 ...
1 77 Wayne 47181 2.050741e+10 666520.678598 POLYGON ((1365052.057 391716.806, 1365746.554 ...
2 78 Tipton 47167 1.319125e+10 865093.887634 MULTIPOLYGON (((886814.330 400456.525, 886774....
3 79 Hamilton 47065 1.604776e+10 652926.001078 POLYGON ((2274954.438 239788.911, 2274090.610 ...
4 80 Stewart 47161 1.375003e+10 490090.336180 POLYGON ((1382472.783 743972.302, 1382445.171 ...
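As noted above, the geometry column supports spatial operations directly; for example (areas are in the shapefile's native units):

# Polygon area of each county, computed from the geometry column.
counties.geometry.area.head()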
median_income['NAME'] = median_income['Name'].str[:-7]
counties = pd.merge(left = counties,
                    right = median_income[['NAME', 'Median Household Income']])
counties['Median Household Income'] = pd.to_numeric(counties['Median Household Income'])
from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'Median Household Income', 
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme="NaturalBreaks",
              ax = ax)

leg = ax.get_legend()

# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "\${:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "\${:,.0f}".format(lower)
    else:
        new_text = "\${:,.0f}".format(lower + 1) + " - " + "\${:,.0f}".format(upper)
        
    labels.append(new_text)

# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0],[0], marker = 'o', 
                    markersize = line.get_markersize(), 
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)

# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
    
plt.title('Median Household Income by County, 2018', fontsize = 18)

ax.axis('off');
counties = pd.merge(left = counties, left_on = 'NAME',
         right = disparities, right_on = 'county')
counties
OBJECTID NAME KEY SHAPE_AREA SHAPE_LEN geometry Median Household Income year primary_sex primary_age ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
0 76 Chester 47023 8.049024e+09 520461.080124 POLYGON ((1137985.762 344601.643, 1137965.070 ... 47508 2019 All All ... Chester 4.0 17190 Urban low inadequate 23.269343 0.0 0.0 NaN 0.0
1 77 Wayne 47181 2.050741e+10 666520.678598 POLYGON ((1365052.057 391716.806, 1365746.554 ... 38879 2019 All All ... Wayne 5.0 16693 Rural moderately adequate 29.952675 1.0 1.0 80.0 80.0
2 78 Tipton 47167 1.319125e+10 865093.887634 MULTIPOLYGON (((886814.330 400456.525, 886774.... 60874 2019 All All ... Tipton 17.0 61447 Urban low inadequate 27.666119 1.0 1.0 100.0 100.0
3 79 Hamilton 47065 1.604776e+10 652926.001078 POLYGON ((2274954.438 239788.911, 2274090.610 ... 57196 2019 All All ... Hamilton 403.0 360919 Urban adequate 111.659403 13.0 12.0 160.0 1920.0
4 80 Stewart 47161 1.375003e+10 490090.336180 POLYGON ((1382472.783 743972.302, 1382445.171 ... 46565 2019 All All ... Stewart 1.0 13427 Urban low inadequate 7.447680 0.0 0.0 NaN 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
90 91 McNairy 47109 1.563586e+10 566369.132062 POLYGON ((1137985.762 344601.643, 1139350.519 ... 39859 2019 All All ... McNairy 9.0 25844 Rural moderately adequate 34.824331 0.0 0.0 NaN 0.0
91 92 Franklin 47051 1.605093e+10 621176.096919 POLYGON ((1873015.265 239618.144, 1872957.848 ... 50201 2019 All All ... Franklin 27.0 41725 Rural moderately adequate 64.709407 2.0 2.0 86.5 173.0
92 93 Bradley 47011 9.241234e+09 457372.233476 POLYGON ((2274954.438 239788.911, 2275552.803 ... 50427 2019 All All ... Bradley 55.0 105749 Urban moderately adequate 52.009948 2.0 2.0 174.0 348.0
93 94 Marion 47115 1.428734e+10 529431.591556 POLYGON ((2126056.390 236919.771, 2122873.509 ... 50819 2019 All All ... Marion 12.0 28538 Urban moderately adequate 42.049198 1.0 1.0 68.0 68.0
94 95 Polk 47139 1.233228e+10 479994.126988 POLYGON ((2355580.184 332970.851, 2355673.384 ... 41968 2019 All All ... Polk 8.0 16814 Urban moderately adequate 47.579398 1.0 0.0 NaN 0.0

95 rows × 36 columns

Before mapping hospital access, compute the number of open hospitals per 100K residents; the choropleth below and a later scatterplot use this column.

counties['num_hospitals_per_100k'] = counties.num_hospitals_open/counties.population*100000

from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'num_hospitals_per_100k', 
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme="Quantiles",
              ax = ax,
              # missing_kwds={
              # "color": "lightgrey",
              # "edgecolor": "grey",
              # "label": "Missing values",
              # },
              # classification_kwds ={k : 10}
)

leg = ax.get_legend()

# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    if label_text == 'Missing values': 
        labels.append(label_text)
        continue
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "{:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "{:,.0f}".format(lower)
    else:
        new_text = "{:,.0f}".format(lower + 1) + " - " + "{:,.0f}".format(upper)
        
    labels.append(new_text)

# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0],[0], marker = 'o', 
                    markersize = line.get_markersize(), 
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)

# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
    
plt.title('Number of Hospitals per 100K Residents by County', fontsize = 18)

ax.axis('off');
from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'pcp_per_100k', 
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme="Quantiles",
              ax = ax,
              # missing_kwds={
              # "color": "lightgrey",
              # "edgecolor": "grey",
              # "label": "Missing values",
              # },
              # classification_kwds ={k : 10}
)

leg = ax.get_legend()

# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    if label_text == 'Missing values': 
        labels.append(label_text)
        continue
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "{:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "{:,.0f}".format(lower)
    else:
        new_text = "{:,.0f}".format(lower + 1) + " - " + "{:,.0f}".format(upper)
        
    labels.append(new_text)

# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0],[0], marker = 'o', 
                    markersize = line.get_markersize(), 
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)

# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
    
plt.title('Number of PCP per 100K Residents by County', fontsize = 18)

ax.axis('off');
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_beds', hue = 'shadac_category')
<AxesSubplot:xlabel='Median Household Income', ylabel='num_beds'>
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_hospitals_open', hue = 'shadac_category')
<AxesSubplot:xlabel='Median Household Income', ylabel='num_hospitals_open'>
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_hospitals_per_100k', hue = 'urban')
plt.ylabel("Number Open Hospitals per 100K Residents")
plt.savefig('income_vs_hosp_urban.pdf')
disparities.sort_values(['pcp_per_100k','population'], ascending = [True,False])[['county', 'num_beds','num_hospitals','pcp_per_100k','population']][0:25]
county num_beds num_hospitals pcp_per_100k population
FIPS
47033 Crockett 0.0 0.0 0.000000 14399
47061 Grundy 0.0 0.0 0.000000 13344
47095 Lake 0.0 0.0 0.000000 7401
47175 Van Buren 0.0 0.0 0.000000 5760
47007 Bledsoe 25.0 1.0 6.740361 14836
47161 Stewart 0.0 0.0 7.447680 13427
47097 Lauderdale 25.0 1.0 7.695564 25989
47129 Morgan 0.0 0.0 13.924344 21545
47117 Marshall 25.0 1.0 15.167602 32965
47067 Hancock 10.0 1.0 15.181418 6587
47173 Union 0.0 0.0 15.394089 19488
47127 Moore 0.0 0.0 15.678896 6378
47069 Hardeman 25.0 1.0 15.721416 25443
47101 Lewis 0.0 0.0 16.629251 12027
47111 Macon 25.0 1.0 16.777116 23842
47087 Jackson 0.0 1.0 17.120356 11682
47005 Benton 25.0 1.0 18.587361 16140
47123 Monroe 59.0 1.0 19.538034 46064
47169 Trousdale 25.0 1.0 19.548431 10231
47081 Hickman 25.0 1.0 20.150727 24813
47015 Cannon 60.0 1.0 21.159543 14178
47057 Grainger 0.0 0.0 21.644085 23101
47023 Chester 0.0 0.0 23.269343 17190
47083 Houston 25.0 1.0 24.497795 8164
47077 Henderson 45.0 1.0 25.020553 27977

What happens if we add a few hospitals in TN?

sns.kdeplot(disparities.num_hospitals_open, hue = disparities.urban)
<AxesSubplot:xlabel='num_hospitals_open', ylabel='Density'>
predict = disparities.copy()
disparities.sort_values(['num_hospitals_open','pcp_per_100k'])[0:1]
year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well primary_denominator_hosp ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
FIPS
47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 1,000-4,999 ... Crockett 0.0 14399 Rural low inadequate 0.0 0.0 0.0 NaN 0.0

1 rows × 29 columns

import numpy as np
def add_hospitals_weak_link(data, category, num_hospitals = 5, return_mean = False, print_bool = True):
    # Give all of the new hospitals to the single worst-off county
    # (lowest value of `category`, ties broken by largest population).
    df = data.copy()
    sorted_df = df.sort_values([category, 'population'], ascending = [True, False])
    worst = sorted_df.index[0]
    # Note: this assigns rather than adds; the two are equivalent here only
    # because the worst-off counties start with zero open hospitals.
    df.loc[worst, 'num_hospitals_open'] = num_hospitals
    if print_bool:
        print("Baseline mean number of open hospitals before adding hospitals: "
              + str(np.round(data['num_hospitals_open'].mean(), 4)))
        print(f"Mean number of open hospitals after adding {num_hospitals} hospitals to {df.loc[worst,'county']} County: "
              + str(np.round(df['num_hospitals_open'].mean(), 4)))
    return df['num_hospitals_open'].mean() if return_mean else None

add_hospitals_weak_link(disparities, category = 'pcp_per_100k')
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 5 hospitals to Crockett County: 1.6947
def add_hospitals_equitably(data, category, num_hospitals = 5, num_counties = 5, print_bool = True, return_mean = False):
    # Split the new hospitals evenly across the num_counties worst-off counties.
    df = data.copy()
    sorted_df = df.sort_values([category, 'population'], ascending = [True, False])
    worst = sorted_df.index[0:num_counties]
    for i in worst:
        # Assignment overwrites any hospitals a county already has (Bledsoe has
        # one open hospital), so the net increase can be below num_hospitals.
        df.loc[i, 'num_hospitals_open'] = num_hospitals / num_counties
    if print_bool:
        print("Baseline mean number of open hospitals before adding hospitals: "
              + str(np.round(data['num_hospitals_open'].mean(), 4)))
        print(f"Mean number of open hospitals after adding {num_hospitals} hospitals to {df.loc[worst,'county'].values} Counties: "
              + str(np.round(df['num_hospitals_open'].mean(), 4)))
    return df['num_hospitals_open'].mean() if return_mean else None

add_hospitals_equitably(disparities, category = 'pcp_per_100k')
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 5 hospitals to ['Crockett' 'Grundy' 'Lake' 'Van Buren' 'Bledsoe'] Counties: 1.6842
add_hospitals_weak_link(disparities, category = 'pcp_per_100k', num_hospitals=10)
add_hospitals_equitably(disparities, category = 'pcp_per_100k', num_hospitals=10)
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 10 hospitals to Crockett County: 1.7474
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 10 hospitals to ['Crockett' 'Grundy' 'Lake' 'Van Buren' 'Bledsoe'] Counties: 1.7368
means_weak = []
means_equit = []
for i in range(20):
    means_weak.append(add_hospitals_weak_link(disparities, category='pcp_per_100k', num_hospitals=i, print_bool=False, return_mean = True))
    means_equit.append(add_hospitals_equitably(disparities, category='pcp_per_100k', num_hospitals=i,num_counties=4, print_bool=False, return_mean = True))

print(means_weak)
print(means_equit)

[1.6421052631578947, 1.6526315789473685, 1.6631578947368422, 1.6736842105263159, 1.6842105263157894, 1.694736842105263, 1.7052631578947368, 1.7157894736842105, 1.7263157894736842, 1.736842105263158, 1.7473684210526317, 1.7578947368421052, 1.768421052631579, 1.7789473684210526, 1.7894736842105263, 1.8, 1.8105263157894738, 1.8210526315789475, 1.831578947368421, 1.8421052631578947]
[1.6421052631578947, 1.6526315789473685, 1.6631578947368422, 1.6736842105263159, 1.6842105263157894, 1.694736842105263, 1.7052631578947368, 1.7157894736842105, 1.7263157894736842, 1.736842105263158, 1.7473684210526317, 1.7578947368421052, 1.768421052631579, 1.7789473684210526, 1.7894736842105263, 1.8, 1.8105263157894738, 1.8210526315789475, 1.831578947368421, 1.8421052631578947]
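This makes sense: with num_counties=4, all four target counties start with zero open hospitals, so both strategies add exactly i hospitals in total and the two lists are identical (the earlier five-county run came out lower only because Bledsoe already had one open hospital). A quick arithmetic check against the baseline mean:

# Each added hospital raises the mean over the 95 counties by 1/95.
baseline = disparities['num_hospitals_open'].mean()
assert all(abs(m - (baseline + i / 95)) < 1e-9 for i, m in enumerate(means_weak))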

Question 2: What is the relationship between areas without access and ER visits?

Do areas that lack access to healthcare tend to have higher rates of emergency department visits or hospitalizations? Is there an association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits?

Reordering disparities in order of least access:

disparities = disparities.sort_values(['pcp_per_100k','population'], ascending = [True,False]).reset_index()
disparities.columns
Index(['FIPS', 'year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
       'primary_race', 'primary_eligibility', 'primary_denominator_well',
       'analysis_value_well', 'primary_denominator_hosp',
       'analysis_value_hosp', 'primary_denominator_emer',
       'analysis_value_emer', 'laus_code', 'Name', 'Employed', 'Unemployed',
       'unemployment_rate', 'FIPS.1', 'county', 'primary_care_physicians',
       'population', 'urban', 'shadac_category', 'pcp_per_100k',
       'num_hospitals', 'num_hospitals_open', 'mean_beds', 'num_beds'],
      dtype='object')
plt.figure()
sns.scatterplot(x = disparities.pcp_per_100k, y = disparities.analysis_value_emer, hue = disparities.shadac_category)
plt.ylabel("Emergency department visits per 1K")
plt.xlabel("PCP per 100k")

Text(0.5, 0, 'PCP per 100k')
plt.figure()
# Use counties.shadac_category for the hue so it aligns with the x/y rows
# (disparities was re-sorted above, so its rows no longer line up with counties).
sns.scatterplot(x = counties['Median Household Income'], y = counties.analysis_value_emer, hue = counties.shadac_category)
plt.ylabel("Emergency Visits per 1K")
plt.xlabel("Median Household Income")

Text(0.5, 0, 'Median Household Income')
sns.boxplot(x = counties.urban, y = counties.analysis_value_emer)
plt.ylabel("Emergency visits per 1K")

Text(0, 0.5, 'Emergency visits per 1K')

Association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits

sns.lmplot(data = disparities, y = "analysis_value_hosp", x = "analysis_value_well")
plt.ylabel("Hospitalizations per 1K")
plt.xlabel("Percentage annual wellness visits")

Text(0.5, 21.70625000000002, 'Percentage annual wellness visits')
emer_per_1k = disparities.analysis_value_emer
hosp_per_1k = disparities.analysis_value_hosp
well_rate = disparities.analysis_value_well
rates = pd.DataFrame([emer_per_1k, hosp_per_1k, well_rate, disparities.pcp_per_100k, disparities.shadac_category, counties['Median Household Income']], index=['Emergency Visits','Hospitalizations','Wellness Visits', 'pcp', 'shadac', 'Income']).T
sns.pairplot(rates, hue = 'shadac', corner = True)
<seaborn.axisgrid.PairGrid at 0x1a5eb4190>
rates[['Emergency Visits','Hospitalizations','Wellness Visits','pcp','Income']] = rates[['Emergency Visits','Hospitalizations','Wellness Visits','pcp','Income']].astype(float)
rates.drop('shadac', axis = 1)
Emergency Visits Hospitalizations Wellness Visits pcp Income
0 622 289 25 0.0 47508
1 651 285 34 0.0 38879
2 1000 401 20 0.0 60874
3 759 304 31 0.0 57196
4 750 313 19 6.740361 46565
... ... ... ... ... ...
90 559 234 56 112.772823 39859
91 584 261 52 126.711238 50201
92 666 255 43 132.138284 50427
93 469 201 48 149.962953 50819
94 504 256 57 176.831892 41968

95 rows × 5 columns

rates.drop('shadac', axis = 1).astype(float).corr().style.background_gradient('coolwarm')


  Emergency Visits Hospitalizations Wellness Visits pcp Income
Emergency Visits 1.000000 0.599412 -0.480607 -0.396181 0.153085
Hospitalizations 0.599412 1.000000 -0.480732 -0.308934 0.126632
Wellness Visits -0.480607 -0.480732 1.000000 0.488941 -0.073377
pcp -0.396181 -0.308934 0.488941 1.000000 -0.080277
Income 0.153085 0.126632 -0.073377 -0.080277 1.000000
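The correlation matrix above is computed on only 95 counties, so it is worth checking whether these correlations are statistically distinguishable from zero. A minimal sketch, assuming scipy is installed and the rates frame has no missing values:

from scipy import stats

# pearsonr returns the correlation and its two-sided p-value
numeric = rates.drop('shadac', axis = 1).astype(float)
for col in ['Hospitalizations', 'Wellness Visits', 'pcp', 'Income']:
    r, p = stats.pearsonr(numeric['Emergency Visits'], numeric[col])
    print(f"Emergency Visits vs {col}: r = {r:.3f}, p = {p:.4f}")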
counties.to_csv('../data/counties_full.csv')
disparities.to_csv('../data/disparities.csv')

Question 3: Which subpopulations need more attention?

Finally, TN Med Helper is trying to identify specific subpopulations to focus more attention on. Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year. Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year. Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
disparities = pd.read_csv('../data/disparities.csv',header = 0, index_col = 0)
disparities.head()
FIPS year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
0 47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 ... Crockett 0.0 14399 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
1 47061 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 34 ... Grundy 0.0 13344 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
2 47095 2019 All All Dual & non-dual TENNESSEE All All 500-999 20 ... Lake 0.0 7401 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
3 47175 2019 All All Dual & non-dual TENNESSEE All All 500-999 31 ... Van Buren 0.0 5760 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
4 47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 ... Bledsoe 1.0 14836 Rural low inadequate 6.740361 1.0 1.0 25.0 25.0

5 rows × 30 columns

The file brfss.csv contains a subset of the responses and variables from the 2019 Behavioral Risk Factor Surveillance System (BRFSS). This dataset can be downloaded using this link: https://drive.google.com/file/d/1acJKmT2aFf2nZl_VYLE897yx0LPNajoY/view?usp=sharing.

A detailed Codebook can be found here.

Our target variable is the CHECKUP1 column, which contains the responses to the question “About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]” Specifically, we want to try and predict if someone gives an answer besides “Within past year (anytime less than 12 months ago)”.

brfss = pd.read_csv('../data/brfss.csv', header = 0)
brfss.head()
GENHLTH HLTHPLN1 PERSDOC2 MEDCOST CHECKUP1 _RFHYPE5 TOLDHI2 CVDINFR4 CVDCRHD4 CVDSTRK3 ... EXERANY2 _METSTAT _URBSTAT _IMPRACE _RFBMI5 _RFSMOK3 _RFBING5 _RFDRHV7 _TOTINDA target
0 Good Yes Yes, only one No Within past year (anytime less than 12 months ... Yes Yes No No No ... No Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) Black, Non-Hispanic Yes No No No No physical activity or exercise in last 30 days True
1 Fair Yes Yes, only one No Within past year (anytime less than 12 months ... No No No No No ... Yes Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) White, Non-Hispanic No No No No Had physical activity or exercise True
2 Good Yes More than one No Within past year (anytime less than 12 months ... Yes No No No No ... Yes Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) Black, Non-Hispanic Yes No No No Had physical activity or exercise True
3 Very good Yes Yes, only one No Within past year (anytime less than 12 months ... No No No No No ... Yes Nonmetropolitan counties (_URBNRRL = 5,6) Rural counties (_URBNRRL = 6) White, Non-Hispanic Yes Yes No No Had physical activity or exercise True
4 Poor Yes Yes, only one No Within past year (anytime less than 12 months ... No Yes No No No ... No Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) White, Non-Hispanic No No No No No physical activity or exercise in last 30 days True

5 rows × 40 columns

First, create a new column, “target”, by converting this to a binary outcome. (Note that, as coded below, target is True when the respondent did have a checkup within the past year; we invert it at the end when mapping predictions to counties.) After you do this, drop the CHECKUP1 column from your dataframe so that you don’t accidentally make predictions based off of it.

set(brfss.CHECKUP1)
{'5 or more years ago',
 'Never',
 'Within past 2 years (1 year but less than 2 years ago)',
 'Within past 5 years (2 years but less than 5 years ago)',
 'Within past year (anytime less than 12 months ago)'}
target = []
for x in brfss.CHECKUP1:
    if x == 'Within past year (anytime less than 12 months ago)':
        target.append(True)
    else: target.append(False)
brfss['target'] = target
brfss = brfss.drop('CHECKUP1', axis = 1)
brfss.target.value_counts()
True     215875
False     46174
Name: target, dtype: int64
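As an aside, the loop above can be replaced with a single vectorized comparison. A sketch, assuming it is run before CHECKUP1 is dropped:

# Vectorized equivalent of the loop: True when the respondent had a checkup within the past year
brfss['target'] = brfss.CHECKUP1 == 'Within past year (anytime less than 12 months ago)'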

Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year.

Then, experiment with making a logistic regression model to predict the target variable using one or more of the other columns. Note that you will need to convert the predictor columns into dummy variables prior to fitting a model. What do you find?

  • GENHLTH: Would you say that in general your health is
  • HLTHPLN1: Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
  • PERSDOC2: Do you have one person you think of as your personal doctor or health care provider?
  • MEDCOST: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
  • _RFHYPE5: Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional
  • TOLDHI2: Have you ever been told by a doctor, nurse or other health professional that your blood cholesterol is high?
  • CVDINFR4: (Ever told) you had a heart attack, also called a myocardial infarction?
  • CVDCRHD4: (Ever told) you had angina or coronary heart disease?
  • CVDSTRK3: (Ever told) you had a stroke.
  • ASTHMA3: (Ever told) you had asthma?
  • CHCSCNCR: (Ever told) you had skin cancer?
  • CHCOCNCR: (Ever told) you had other cancer?
  • CHCCOPD2: (Ever told) you had COPD?
  • ADDEPEV3: (Ever told) (you had) a depressive disorder
  • CHCKDNY2: were you ever told you had kidney disease?
  • DIABETE4: (Ever told) diabetes?
  • HAVARTH4: Arthritis
  • MARITAL: Marital status
  • EDUCA: What is the highest grade or year of school you completed?
  • RENTHOM1: Own or rent home
  • VETERAN3: Veteran?
  • EMPLOY1: Employment
  • INCOME2: Income categories
  • DEAF
  • BLIND
  • DECIDE: Trouble concentrating/deciding
  • DIFFWALK: Difficulty walking
  • DIFFDRES: Difficulty dressing
  • DIFFALON: Difficulty errands alone
  • EXERANY2: Exercise in past month
  • _METSTAT: Metropolitan Status
  • _URBSTAT: Urban status
  • _IMPRACE: Imputed race/ethnicity value
  • _RFBMI5: Adults who have a body mass index greater than 25.00 (Overweight or Obese)
  • _RFSMOK3: Adults who are current smokers
  • _RFBING5: Binge drinkers
  • _RFDRHV7: Heavy drinkers
  • _TOTINDA: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
  • target

We only want categorical variables for which we have county-level data, so we need to look at the County Health Rankings data to figure out which of these variables we have percentages for. For example, we can use _IMPRACE because we have the percentages of each race by TN county.

In the County Health Rankings Data:

  1. GENHLTH == “Poor or fair health” (percentage)
  2. HLTHPLN1 == “% Uninsured”
  3. PERSDOC2 == ratio primary care physicians?
  4. DIABETES4 == “diabetes prevalence”
  5. _RFBING5 & _RFDRHV7 == Binge and heavy drinkers == “Excessive drinking”
  6. RENTHOM1 == “Homeownership”
  7. _IMPRACE = race %
  8. _URBSTAT = ‘rural’
  9. MARITAL ~ % single-parent households
  10. EDUCA == “High school graduation rate”, “% some college”
  11. EMPLOY1 = % unemployed
  12. INCOME2 = household income
  13. _TOTINDA ~ % physically inactive
  14. _RFBMI5 (BMI > 25) ~ “% adults with obesity” (BMI > 30), only an approximate match
  15. _RFSMOK3 == ‘% smokers’
  16. ADDEPEV3 ~ frequent mental distress
categorical_variables = ['GENHLTH', 'HLTHPLN1', 'PERSDOC2', 'DIABETE4','_RFBING5','_RFDRHV7','RENTHOM1','_IMPRACE','_URBSTAT','EDUCA','EMPLOY1','INCOME2','_TOTINDA','_RFSMOK3']
predict = pd.get_dummies(brfss, columns = categorical_variables)
predict.head()
MEDCOST _RFHYPE5 TOLDHI2 CVDINFR4 CVDCRHD4 CVDSTRK3 ASTHMA3 CHCSCNCR CHCOCNCR CHCCOPD2 ... INCOME2_Less than $15,000 ($10,000 to less than $15,000) INCOME2_Less than $20,000 ($15,000 to less than $20,000) INCOME2_Less than $25,000 ($20,000 to less than $25,000) INCOME2_Less than $35,000 ($25,000 to less than $35,000) INCOME2_Less than $50,000 ($35,000 to less than $50,000) INCOME2_Less than $75,000 ($50,000 to less than $75,000) _TOTINDA_Had physical activity or exercise _TOTINDA_No physical activity or exercise in last 30 days _RFSMOK3_No _RFSMOK3_Yes
0 No Yes Yes No No No No No No No ... 0 1 0 0 0 0 0 1 1 0
1 No No No No No No No No No No ... 0 0 0 1 0 0 1 0 1 0
2 No Yes No No No No No No No No ... 0 0 0 0 0 1 1 0 1 0
3 No No No No No No Yes No No Yes ... 0 0 0 0 0 1 1 0 0 1
4 No No Yes No No No Yes No No Yes ... 0 1 0 0 0 0 0 1 1 0

5 rows × 80 columns

# Drop the raw survey columns that were not converted to dummies, keeping only target and the dummy variables
predict = predict.drop(predict.columns[0:24], axis = 1)

First, we need to recode these dummy variables so that they line up with the county-level measures, so that when we apply the model later we can feed it the same variables.

  • GENHLTH:
    • Change GENHLTH columns to “GENHLTH_Poor_or_Fair”
    • Matched county % poor or fair health in Ranked Measure data
  • HLTHPLN1
    • Matches inverse of county % uninsured in Additional Measure data
  • DIABETE4
    • Change DIABETE4 columns to combine Yes’s and No’s
    • Matches % adults with diabetes in Additional Measure data
  • _RFBING5 and _RFDRHV7
    • Combine so binge and heavy drinking are one category
    • Matches excessive drinking in Ranked Measure data
  • RENTHOM1
    • Own == 1, else == 0
    • Matches % homeowners
  • _IMPRACE
    • Should be able to keep the same
    • Matches demographic %s in Additional Measure Data
  • _URBSTAT
    • replace with Boolean for rural
    • matches inverse of % rural
  • EDUCA
    • “Grade 12 or GED (High school graduate)”
    • matches “High school graduation rate” in ranked measure data
    • “College 1 year to 3 years (Some college or technical school)” or “College 4 years or more (College graduate)” == % some college
  • EMPLOY1
    • “Out of work for 1 year or more” or “Out of work for less than 1 year” or “Unable to work” == % unemployment rate in Ranked measure data
  • INCOME2:
    • group “Household Income (AIAN)” into same groups as above
  • _TOTINDA
    • only “No physical activity…”
    • % physically inactive
  • _RFSMOK3_Yes
    • % smokers
X = predict.copy()

Rural

X['Rural'] = [{ "Urban counties (_URBNRRL = 1,2,3,4,5)":0,"Rural counties (_URBNRRL = 6)":1}[i] for i in brfss._URBSTAT]
X = X.drop([i for i in X.columns if "URBSTAT" in str(i)], axis = 1)

Diabetes

X.loc[(X['DIABETE4_Yes, but female told only during pregnancy'] == 1) |
      (X["DIABETE4_Yes"]==1), 'Diabetes'] = 1
X.loc[(X['DIABETE4_Yes, but female told only during pregnancy'] == 0) &
      (X["DIABETE4_Yes"]==0), 'Diabetes'] = 0
X = X.drop([i for i in X.columns if "DIABETE4" in str(i)], axis = 1)

GENHLTH

X.loc[(X['GENHLTH_Fair'] == 1) |
      (X["GENHLTH_Poor"]==1), 'General_health'] = 1
X.loc[(X['GENHLTH_Fair'] == 0) &
      (X["GENHLTH_Poor"]==0), 'General_health'] = 0

X = X.drop([i for i in X.columns if "GENHLTH" in str(i)], axis = 1)

HLTHPLN1

X['Health_plan'] = X.HLTHPLN1_Yes
X = X.drop([i for i in X.columns if "HLTHPLN1" in str(i)], axis = 1)

_RFBING5 and _RFDRHV7

X.loc[(X._RFDRHV7_Yes == 1) |
      (X._RFBING5_Yes==1), 'Excessive_drinking'] = 1
X.loc[(X['_RFDRHV7_Yes'] == 0) &
      (X["_RFBING5_Yes"]==0), 'Excessive_drinking'] = 0
X = X.drop([i for i in X.columns if "_RFDRHV" in str(i)], axis = 1)
X = X.drop([i for i in X.columns if "_RFBING" in str(i)], axis = 1)

RENTHOM1

X['Owns_home'] = X['RENTHOM1_Own']
X = X.drop([i for i in X.columns if "RENTHOM1" in str(i)], axis = 1)

EDUCA

X['High_school_grad'] = X['EDUCA_Grade 12 or GED (High school graduate)']
X.loc[(X['EDUCA_College 1 year to 3 years (Some college or technical school)'] == 1) |
      (X['EDUCA_College 4 years or more (College graduate)']==1), 'College'] = 1
X.loc[(X['EDUCA_College 1 year to 3 years (Some college or technical school)'] == 0) &
      (X["EDUCA_College 4 years or more (College graduate)"]==0), 'College'] = 0
X = X.drop([i for i in X.columns if "EDUCA" in str(i)], axis = 1)

EMPLOY1

X.loc[(X['EMPLOY1_Out of work for 1 year or more'] == 1) |
      (X['EMPLOY1_Out of work for less than 1 year']==1) |
      (X['EMPLOY1_Unable to work']==1), 'Unemployed'] = 1
X.loc[(X['EMPLOY1_Out of work for 1 year or more'] == 0) &
      (X['EMPLOY1_Out of work for less than 1 year']==0) &
      (X['EMPLOY1_Unable to work']==0), 'Unemployed'] = 0
X = X.drop([i for i in X.columns if "EMPLOY1" in str(i)], axis = 1)

_TOTINDA

X['Inactivity'] = X['_TOTINDA_No physical activity or exercise in last 30 days']
X = X.drop([i for i in X.columns if "_TOTINDA" in str(i)], axis = 1)

_RFSMOK3_Yes

X['Smoker'] = X._RFSMOK3_Yes
X = X.drop([i for i in X.columns if "_RFSMOK3" in str(i)], axis = 1)
X = X.drop([i for i in X.columns if "PERSDOC" in str(i)], axis = 1)
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262049 entries, 0 to 262048
Data columns (total 26 columns):
 #   Column                                                    Non-Null Count   Dtype  
---  ------                                                    --------------   -----  
 0   target                                                    262049 non-null  bool   
 1   _IMPRACE_American Indian/Alaskan Native, Non-Hispanic     262049 non-null  uint8  
 2   _IMPRACE_Asian, Non-Hispanic                              262049 non-null  uint8  
 3   _IMPRACE_Black, Non-Hispanic                              262049 non-null  uint8  
 4   _IMPRACE_Hispanic                                         262049 non-null  uint8  
 5   _IMPRACE_Other race, Non-Hispanic                         262049 non-null  uint8  
 6   _IMPRACE_White, Non-Hispanic                              262049 non-null  uint8  
 7   INCOME2_$75,000 or more                                   262049 non-null  uint8  
 8   INCOME2_Less than $10,000                                 262049 non-null  uint8  
 9   INCOME2_Less than $15,000 ($10,000 to less than $15,000)  262049 non-null  uint8  
 10  INCOME2_Less than $20,000 ($15,000 to less than $20,000)  262049 non-null  uint8  
 11  INCOME2_Less than $25,000 ($20,000 to less than $25,000)  262049 non-null  uint8  
 12  INCOME2_Less than $35,000 ($25,000 to less than $35,000)  262049 non-null  uint8  
 13  INCOME2_Less than $50,000 ($35,000 to less than $50,000)  262049 non-null  uint8  
 14  INCOME2_Less than $75,000 ($50,000 to less than $75,000)  262049 non-null  uint8  
 15  Rural                                                     262049 non-null  int64  
 16  Diabetes                                                  262049 non-null  float64
 17  General_health                                            262049 non-null  float64
 18  Health_plan                                               262049 non-null  uint8  
 19  Excessive_drinking                                        262049 non-null  float64
 20  Owns_home                                                 262049 non-null  uint8  
 21  High_school_grad                                          262049 non-null  uint8  
 22  College                                                   262049 non-null  float64
 23  Unemployed                                                262049 non-null  float64
 24  Inactivity                                                262049 non-null  uint8  
 25  Smoker                                                    262049 non-null  uint8  
dtypes: bool(1), float64(5), int64(1), uint8(19)
memory usage: 17.0 MB
X.to_csv("../data/predictors.csv")
y = X.pop('target')

The main objective is to have a model which makes good predictions on unseen data. Therefore, in order to evaluate how good a model is, it is necessary to set aside some data as a test set for evaluation purposes. This can be accomplished using the train_test_split function.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify = y,     # Keep the same proportions of the target in the training and test data
                                                    test_size = 0.25,
                                                    random_state = 321)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()         # Create a logistic regression model
logreg.fit(X_train, y_train)          # Fit it to the training data
LogisticRegression()

To understand the model, we can look at its coefficients.

logreg.intercept_
array([-0.06956167])
logreg.coef_
array([[-0.15000703, -0.12631886,  0.53903011, -0.1734734 , -0.12965132,
        -0.02906713, -0.17897394,  0.02512163,  0.13639016,  0.09111363,
         0.06724727, -0.03598943, -0.04823376, -0.12616319, -0.09815453,
         1.19410076,  0.44202482,  1.4508977 , -0.35476091,  0.4756881 ,
         0.01472026, -0.00793255,  0.15622678, -0.04213603, -0.36442622]])
y_pred_prob = logreg.predict_proba(X_test)[:,1]
i = 120

print('Patient Information:\n{}'.format(X_test.iloc[i]))
print('---------------------------------')
print('Predicted Probability of Checkup in Last Year: {}'.format(y_pred_prob[i]))
print('Actual: {}'.format(y_test.iloc[i]))
Patient Information:
_IMPRACE_American Indian/Alaskan Native, Non-Hispanic       0.0
_IMPRACE_Asian, Non-Hispanic                                0.0
_IMPRACE_Black, Non-Hispanic                                0.0
_IMPRACE_Hispanic                                           0.0
_IMPRACE_Other race, Non-Hispanic                           0.0
_IMPRACE_White, Non-Hispanic                                1.0
INCOME2_$75,000 or more                                     1.0
INCOME2_Less than $10,000                                   0.0
INCOME2_Less than $15,000 ($10,000 to less than $15,000)    0.0
INCOME2_Less than $20,000 ($15,000 to less than $20,000)    0.0
INCOME2_Less than $25,000 ($20,000 to less than $25,000)    0.0
INCOME2_Less than $35,000 ($25,000 to less than $35,000)    0.0
INCOME2_Less than $50,000 ($35,000 to less than $50,000)    0.0
INCOME2_Less than $75,000 ($50,000 to less than $75,000)    0.0
Rural                                                       0.0
Diabetes                                                    0.0
General_health                                              0.0
Health_plan                                                 1.0
Excessive_drinking                                          0.0
Owns_home                                                   1.0
High_school_grad                                            0.0
College                                                     1.0
Unemployed                                                  1.0
Inactivity                                                  0.0
Smoker                                                      0.0
Name: 232982, dtype: float64
---------------------------------
Predicted Probability of Checkup in Last Year: 0.857817177035579
Actual: True

One way to analyze your model is to look at the receiver operating characteristic (ROC) curve. This shows how the true positive rate and false positive rate change as the prediction threshold changes.

The area under this curve (AUC) can be interpreted as how likely the model is to assign a higher probability to a positive observation than to a negative one.

from sklearn.metrics import roc_auc_score, roc_curve
fp_rate, tp_rate, thresholds = roc_curve(y_test == 1, y_pred_prob)

plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC Curve for Checkup Prediction')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

AUC: the percentage of the ROC plot that is under the curve.

A perfect model would have AUC = 1; a model that predicts at random would have AUC = 0.5.

roc_auc_score(y_test, y_pred_prob)
0.6789666785879891

Analyzing Predictions

To generate predictions, you can use the predict method of your model.

from sklearn.metrics import confusion_matrix, classification_report
from cm import plot_confusion_matrix  # plot_confusion_matrix comes from a local cm.py helper
y_pred = logreg.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[ 1052, 10492],
       [  772, 53197]])
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.58      0.09      0.16     11544
        True       0.84      0.99      0.90     53969

    accuracy                           0.83     65513
   macro avg       0.71      0.54      0.53     65513
weighted avg       0.79      0.83      0.77     65513
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'])

There are a number of metrics you can use to evaluate your model.

Accuracy: the total proportion of predictions which are correct.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

Sensitivity/True Positive Rate: The proportion of true positives (in our case, people who did have a checkup within the past year) that are identified as such.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'sensitivity')

Specificity/True Negative Rate: The proportion of true negatives (in our case, people who did not have a checkup within the past year) that are identified as such.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'specificity')

Precision: The proportion of predicted positives that are actually positive (actually had a checkup).

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'precision')
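Since plot_confusion_matrix is a local helper, the same four metrics can be cross-checked directly from sklearn's confusion matrix. A minimal sketch:

# With labels = [False, True], the confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels = [False, True]).ravel()
print('Accuracy:   ', (tp + tn) / (tp + tn + fp + fn))
print('Sensitivity:', tp / (tp + fn))   # true positive rate
print('Specificity:', tn / (tn + fp))   # true negative rate
print('Precision:  ', tp / (tp + fp))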

The logistic regression above already used all of the predictors, but we fit it with the default solver settings.

With so many predictors, it takes longer for the coefficients to converge. Let's increase the number of iterations.

logreg = LogisticRegression(max_iter = 10000)
logreg.fit(X_train, y_train)
LogisticRegression(max_iter=10000)
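An alternative to raising max_iter is to standardize the predictors, which usually lets the solver converge in far fewer iterations. A sketch using a pipeline (the name scaled_logreg is our own):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing the columns before fitting typically speeds up convergence
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
scaled_logreg.fit(X_train, y_train)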
y_test
149338     True
201150     True
63901      True
17995     False
192704     True
          ...  
218617     True
181851     True
54798      True
177920    False
22276      True
Name: target, Length: 65513, dtype: bool

Let’s take a look at the coefficients.

Caution: Our variables are on vastly different scales, so do not interpret the value of a coefficient as the importance.

coefficients = pd.DataFrame({
    'variable': X.columns,
    'coefficient': logreg.coef_[0]
})
importances = pd.DataFrame({'variable': coefficients.variable,
                           'importance': coefficients.coefficient}).sort_values('importance', ascending = False)


fig, ax = plt.subplots(figsize = (7,8))
sns.barplot(data = importances,
            x = 'importance', y = 'variable', ax = ax, edgecolor = 'black')
plt.title('Logistic Regression Coefficients');
fig, ax = plt.subplots(figsize = (7,5))
sns.barplot(data = coefficients,
            x = 'coefficient', 
            y = 'variable', 
            ax = ax, 
            edgecolor = 'black')
plt.title('Logistic Regression Coefficients')

ymin, ymax = plt.ylim()
plt.vlines(x = 0, ymin = ymin, ymax = ymax);

Try using regularization or a tree-based model to improve your model’s performance on the BRFSS dataset.
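On the regularization side, here is a quick sketch of an L1-penalized logistic regression (the value C = 0.1 is an arbitrary starting point, not a tuned choice):

# L1 regularization shrinks uninformative coefficients to exactly zero;
# smaller C means stronger regularization. The liblinear solver supports the l1 penalty.
logreg_l1 = LogisticRegression(penalty = 'l1', C = 0.1, solver = 'liblinear')
logreg_l1.fit(X_train, y_train)
print(roc_auc_score(y_test, logreg_l1.predict_proba(X_test)[:,1]))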


from sklearn.tree import DecisionTreeClassifier

A decision tree model is fit in much the same way as a logistic regression model.

tree = DecisionTreeClassifier(random_state = 321).fit(X_train, y_train)

First, let’s see how it does on the training data.

plot_confusion_matrix(y_train, tree.predict(X_train), labels = ['No', 'Yes'], metric = 'accuracy')
y_train.value_counts()
True     161906
False     34630
Name: target, dtype: int64

Since decision trees are so flexible, they will often perfectly predict the training data. Remember though, that what we care about is how well the model generalizes.

y_pred = tree.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')
roc_auc_score(y_test, tree.predict_proba(X_test)[:,1])
0.6345258214120431

We do see a drop in model performance on the test data. This is a sign of overfitting.

To correct for this problem, we can take an ensemble approach, which means that we will build many decision trees on subsets of the features and data and then average the predictions of all of the trees. This will force our model to try and find more general patterns that will work on the test set.

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state = 321)
forest.fit(X_train, y_train)
RandomForestClassifier(random_state=321)
y_pred = forest.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')
roc_auc_score(y_test, forest.predict_proba(X_test)[:,1])
0.6501432873215749

We see some improvement over using a single tree, but we could do better. Random forests have a lot of hyperparameters that can be tuned to improve our model. Here are a few of these parameters:

  • n_estimators: Number of decision trees to train. The default is 100. More trees = less variance, but slower to train and predict.
  • max_depth: Maximum depth (number of splits). By default, there is no max depth.
  • min_samples_leaf: Minimum number of samples per leaf. Setting this higher keeps the decision trees from paying too much attention to any single data point.

These parameters can be tuned to try to improve the model that you get, and there are ways to automatically tune these parameters. See, for example, sklearn’s GridSearchCV or RandomizedSearchCV.
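For instance, here is a minimal GridSearchCV sketch (the grid values are arbitrary starting points, not tuned recommendations):

from sklearn.model_selection import GridSearchCV

# Search over a small grid, scoring each candidate by cross-validated AUC
param_grid = {'max_depth': [5, 10, 20],
              'min_samples_leaf': [1, 5, 25]}
search = GridSearchCV(RandomForestClassifier(random_state = 321),
                      param_grid, scoring = 'roc_auc', cv = 3, n_jobs = -1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)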

forest = RandomForestClassifier(n_estimators = 1000, max_depth = 5, min_samples_leaf = 5, random_state = 321)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, min_samples_leaf=5, n_estimators=1000,
                       random_state=321)
y_pred = forest.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'sensitivity')

Specificity/True Negative Rate: The proportion of true negatives (in our case, people who did not have a checkup within the past year) that are identified as such.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'specificity')
roc_auc_score(y_test, forest.predict_proba(X_test)[:,1])
0.678213761501158
fp_rate, tp_rate, thresholds = roc_curve(y_test == 1, forest.predict_proba(X_test)[:,1])

plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC Curve for Checkup Prediction')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

A nice perk of using random forest models is that we can see which features are the most important in making predictions.

Caution: Feature importances tell which features the model is using, but not how it is using those features.

importances = pd.DataFrame({'variable': X.columns,
                           'importance': forest.feature_importances_}).sort_values('importance', ascending = False)


fig, ax = plt.subplots(figsize = (7,8))
sns.barplot(data = importances,
            x = 'importance', y = 'variable', ax = ax, edgecolor = 'black')
plt.title('Random Forest Feature Importance');
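As a complement to the impurity-based importances above, permutation importance measures how much the test-set score drops when a single column is shuffled. A sketch, assuming scikit-learn 0.22 or later:

from sklearn.inspection import permutation_importance

# Shuffle each column a few times and record the mean drop in test-set AUC
perm = permutation_importance(forest, X_test, y_test, scoring = 'roc_auc',
                              n_repeats = 5, random_state = 321)
perm_importances = pd.Series(perm.importances_mean, index = X.columns).sort_values()
perm_importances.plot(kind = 'barh', figsize = (7,8));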

Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year.

df1 = pd.read_csv('../data/County_Health_Rankings/Ranked_measure_data_TN.csv', header = 0, index_col = 0)
df1.head()
State County Deaths Years of Potential Life Lost Rate 95% CI - Low 95% CI - High Quartile YPLL Rate (AIAN) YPLL Rate (AIAN) 95% CI - Low YPLL Rate (AIAN) 95% CI - High ... % Drive Alone (Hispanic) 95% CI - Low % Drive Alone (Hispanic) 95% CI - High % Drive Alone (White) % Drive Alone (White) 95% CI - Low % Drive Alone (White) 95% CI - High # Workers who Drive Alone % Long Commute - Drives Alone 95% CI - Low.18 95% CI - High.18 Quartile.34
FIPS
47000 Tennessee NaN 106691 9285 9202 9368 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 3003226 35 34 35 NaN
47001 Tennessee Anderson 1369 10009 9172 10846 2.0 NaN NaN NaN ... 63.0 96.0 90.0 89.0 92.0 32206 36 32 39 2.0
47003 Tennessee Bedford 859 10294 9268 11319 2.0 NaN NaN NaN ... 54.0 74.0 80.0 76.0 84.0 21250 35 31 38 2.0
47005 Tennessee Benton 428 14479 12233 16726 4.0 NaN NaN NaN ... NaN NaN NaN NaN NaN 5448 28 23 33 1.0
47007 Tennessee Bledsoe 229 8126 6427 9824 1.0 NaN NaN NaN ... NaN NaN NaN NaN NaN 5603 53 43 62 4.0

5 rows × 239 columns

df2 = pd.read_csv('../data/County_Health_Rankings/Addition_measure_data_TN.csv', header = 0, index_col = 0)
df2.head()
State County Life Expectancy 95% CI - Low 95% CI - High Life Expectancy (AIAN) Life Expectancy (AIAN) 95% CI - Low Life Expectancy (AIAN) 95% CI - High Life Expectancy (Asian) Life Expectancy (Asian) 95% CI - Low ... % Hispanic # Non-Hispanic White % Non-Hispanic White # Not Proficient in English % Not Proficient in English 95% CI - Low.19 95% CI - High.19 % Female # Rural % Rural
FIPS
47000 Tennessee NaN 76.0 75.9 76.1 NaN NaN NaN NaN NaN ... 5.6 4989758 73.7 94554 2 1.0 2.0 51.2 2132860 33.6
47001 Tennessee Anderson 75.5 74.9 76.2 NaN NaN NaN NaN NaN ... 3.1 68116 89.1 619 1 0.0 1.0 51.3 26041 34.7
47003 Tennessee Bedford 74.3 73.5 75.1 NaN NaN NaN NaN NaN ... 12.6 37556 76.6 1474 3 2.0 4.0 50.9 25053 55.6
47005 Tennessee Benton 71.7 70.2 73.2 NaN NaN NaN NaN NaN ... 2.5 14970 92.5 25 0 0.0 1.0 51.0 12937 78.5
47007 Tennessee Bledsoe 78.0 76.4 79.6 NaN NaN NaN NaN NaN ... 2.5 13038 88.4 245 2 0.0 3.0 41.1 12876 100.0

5 rows × 269 columns

As a reminder, here is how each recoded variable matches up with the county-level measures, so that the model can be applied using the same variables.

  • GENHLTH:
    • Matched county % poor or fair health in Ranked Measure data
  • HLTHPLN1
    • Matches inverse of county % uninsured in Additional Measure data
  • DIABETE4
    • Matches % adults with diabetes in Additional Measure data
  • _RFBING5 and _RFDRHV7
    • Matches excessive drinking in Ranked Measure data
  • RENTHOM1
    • Matches % homeowners in Additional
  • _IMPRACE
    • Matches demographic %s in Additional Measure Data
  • _URBSTAT
    • matches inverse of % rural in additional
  • EDUCA
    • “Grade 12 or GED (High school graduate)”
    • matches “High school graduation rate” in ranked measure data
    • “College 1 year to 3 years (Some college or technical school)” or “College 4 years or more (College graduate)” == % some college
  • EMPLOY1
    • “Out of work for 1 year or more” or “Out of work for less than 1 year” or “Unable to work” == % unemployment rate in Ranked measure data
  • INCOME2:
    • group “Household Income (AIAN)” into same groups as above in additional
  • _TOTINDA
    • % physically inactive in ranked data
  • _RFSMOK3_Yes
    • % smokers in ranked
# [print(i) for i in df2 if "%" in str(i)]
df1 = df1[['% Fair or Poor Health', "% Excessive Drinking","High School Graduation Rate","% Some College","% Unemployed","% Physically Inactive","% Smokers"]]
df2 = df2[["% Uninsured","% Adults with Diabetes","% Homeowners","% Rural","Median Household Income",
     "% Non-Hispanic White", "% Black","% American Indian & Alaska Native","% Asian","% Native Hawaiian/Other Pacific Islander","% Hispanic"]]

We now need to convert these columns to the ones in X.

county_demo = df1.join(df2)
#everything is a percentage except Income, so divide by 100 and then fix income column
county_demo = county_demo/100
county_demo['Median Household Income'] *=100 
county_demo
% Fair or Poor Health % Excessive Drinking High School Graduation Rate % Some College % Unemployed % Physically Inactive % Smokers % Uninsured % Adults with Diabetes % Homeowners % Rural Median Household Income % Non-Hispanic White % Black % American Indian & Alaska Native % Asian % Native Hawaiian/Other Pacific Islander % Hispanic
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.14 0.13 0.66 0.336 52366.0 0.737 0.167 0.005 0.019 0.001 0.056
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.13 0.11 0.68 0.347 50672.0 0.891 0.039 0.005 0.015 0.001 0.031
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.18 0.11 0.68 0.556 49860.0 0.766 0.076 0.011 0.011 0.002 0.126
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.15 0.14 0.76 0.785 38940.0 0.925 0.024 0.006 0.006 0.000 0.025
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.18 0.10 0.74 1.000 40195.0 0.884 0.071 0.006 0.003 0.000 0.025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.14 0.15 0.81 1.000 38879.0 0.899 0.063 0.004 0.003 0.000 0.021
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.12 0.11 0.66 0.670 42351.0 0.868 0.075 0.004 0.012 0.000 0.027
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.14 0.22 0.81 0.782 42603.0 0.930 0.018 0.005 0.004 0.001 0.028
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.08 0.08 0.81 0.194 115930.0 0.843 0.043 0.003 0.047 0.001 0.049
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.10 0.11 0.77 0.385 76507.0 0.847 0.071 0.005 0.017 0.001 0.045

96 rows × 18 columns

county_demo.columns
Index(['% Fair or Poor Health', '% Excessive Drinking',
       'High School Graduation Rate', '% Some College', '% Unemployed',
       '% Physically Inactive', '% Smokers', '% Uninsured',
       '% Adults with Diabetes', '% Homeowners', '% Rural',
       'Median Household Income', '% Non-Hispanic White', '% Black',
       '% American Indian & Alaska Native', '% Asian',
       '% Native Hawaiian/Other Pacific Islander', '% Hispanic'],
      dtype='object')
X.columns
Index(['_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
       '_IMPRACE_Asian, Non-Hispanic', '_IMPRACE_Black, Non-Hispanic',
       '_IMPRACE_Hispanic', '_IMPRACE_Other race, Non-Hispanic',
       '_IMPRACE_White, Non-Hispanic', 'INCOME2_$75,000 or more',
       'INCOME2_Less than $10,000',
       'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
       'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
       'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
       'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
       'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
       'INCOME2_Less than $75,000 ($50,000 to less than $75,000)', 'Rural',
       'Diabetes', 'General_health', 'Health_plan', 'Excessive_drinking',
       'Owns_home', 'High_school_grad', 'College', 'Unemployed', 'Inactivity',
       'Smoker'],
      dtype='object')
column_dict = {'% Fair or Poor Health':"General_health", '% Excessive Drinking':"Excessive_drinking",
'High School Graduation Rate':"High_school_grad", '% Some College':"College", '% Unemployed':"Unemployed",
'% Physically Inactive':"Inactivity", '% Smokers':"Smoker", '% Uninsured':"Health_plan",
'% Adults with Diabetes':"Diabetes", '% Homeowners':"Owns_home", '% Rural':"Rural",
'% Non-Hispanic White':"_IMPRACE_White, Non-Hispanic", 
'% Black':'_IMPRACE_Black, Non-Hispanic',
'% American Indian & Alaska Native':'_IMPRACE_American Indian/Alaskan Native, Non-Hispanic', 
'% Asian':"_IMPRACE_Asian, Non-Hispanic",
'% Native Hawaiian/Other Pacific Islander':"_IMPRACE_Other race, Non-Hispanic", 
'% Hispanic':"_IMPRACE_Hispanic"}
for i in column_dict:
    county_demo.columns = county_demo.columns.str.replace(i,column_dict[i])
county_demo
General_health Excessive_drinking High_school_grad College Unemployed Inactivity Smoker Health_plan Diabetes Owns_home Rural Median Household Income _IMPRACE_White, Non-Hispanic _IMPRACE_Black, Non-Hispanic _IMPRACE_American Indian/Alaskan Native, Non-Hispanic _IMPRACE_Asian, Non-Hispanic _IMPRACE_Other race, Non-Hispanic _IMPRACE_Hispanic
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.14 0.13 0.66 0.336 52366.0 0.737 0.167 0.005 0.019 0.001 0.056
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.13 0.11 0.68 0.347 50672.0 0.891 0.039 0.005 0.015 0.001 0.031
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.18 0.11 0.68 0.556 49860.0 0.766 0.076 0.011 0.011 0.002 0.126
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.15 0.14 0.76 0.785 38940.0 0.925 0.024 0.006 0.006 0.000 0.025
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.18 0.10 0.74 1.000 40195.0 0.884 0.071 0.006 0.003 0.000 0.025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.14 0.15 0.81 1.000 38879.0 0.899 0.063 0.004 0.003 0.000 0.021
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.12 0.11 0.66 0.670 42351.0 0.868 0.075 0.004 0.012 0.000 0.027
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.14 0.22 0.81 0.782 42603.0 0.930 0.018 0.005 0.004 0.001 0.028
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.08 0.08 0.81 0.194 115930.0 0.843 0.043 0.003 0.047 0.001 0.049
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.10 0.11 0.77 0.385 76507.0 0.847 0.071 0.005 0.017 0.001 0.045

96 rows × 18 columns

Don’t forget, we need 1 - Health_plan (% Uninsured -> Health plan). All of the others should already be oriented correctly when “negative,” such as physical inactivity and unemployment; i.e., TRUE for Unemployed truly means unemployed.

county_demo['Health_plan'] = 1- county_demo['Health_plan']

Now we work on income categories.

income_categories = []
for i in county_demo['Median Household Income']:
    if float(i) < 10000:
        income_categories.append('INCOME2_Less than $10,000')
    elif (i >= 10000) & (i < 15000):
        income_categories.append('INCOME2_Less than $15,000 ($10,000 to less than $15,000)')
    elif (i >= 15000) & (i < 20000):
        income_categories.append('INCOME2_Less than $20,000 ($15,000 to less than $20,000)')
    elif (i >= 20000) & (i < 25000):
        income_categories.append('INCOME2_Less than $25,000 ($20,000 to less than $25,000)')
    elif (i >= 25000) & (i < 35000):
        income_categories.append('INCOME2_Less than $35,000 ($25,000 to less than $35,000)')
    elif (i >= 35000) & (i < 50000):
        income_categories.append('INCOME2_Less than $50,000 ($35,000 to less than $50,000)')
    elif (i >= 50000) & (i < 75000):
        income_categories.append('INCOME2_Less than $75,000 ($50,000 to less than $75,000)')
    elif (i >= 75000):
        income_categories.append('INCOME2_$75,000 or more')
county_demo['income_categories'] = income_categories
county_demo = pd.get_dummies(county_demo, 'income_categories')
county_demo = county_demo.drop('Median Household Income', axis = 1)
# No TN county has a median household income below $25,000, so add the
# missing income dummy columns as all zeros to match the model's feature set
county_demo['INCOME2_Less than $25,000 ($20,000 to less than $25,000)'] = 0
county_demo['INCOME2_Less than $20,000 ($15,000 to less than $20,000)'] = 0
county_demo['INCOME2_Less than $15,000 ($10,000 to less than $15,000)'] = 0
county_demo['INCOME2_Less than $10,000'] = 0
county_demo.columns = county_demo.columns.str.replace("income_categories_", "")
county_demo
General_health Excessive_drinking High_school_grad College Unemployed Inactivity Smoker Health_plan Diabetes Owns_home ... _IMPRACE_Other race, Non-Hispanic _IMPRACE_Hispanic INCOME2_$75,000 or more INCOME2_Less than $35,000 ($25,000 to less than $35,000) INCOME2_Less than $50,000 ($35,000 to less than $50,000) INCOME2_Less than $75,000 ($50,000 to less than $75,000) INCOME2_Less than $25,000 ($20,000 to less than $25,000) INCOME2_Less than $20,000 ($15,000 to less than $20,000) INCOME2_Less than $15,000 ($10,000 to less than $15,000) INCOME2_Less than $10,000
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.86 0.13 0.66 ... 0.001 0.056 0 0 0 1 0 0 0 0
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.87 0.11 0.68 ... 0.001 0.031 0 0 0 1 0 0 0 0
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.82 0.11 0.68 ... 0.002 0.126 0 0 1 0 0 0 0 0
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.85 0.14 0.76 ... 0.000 0.025 0 0 1 0 0 0 0 0
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.82 0.10 0.74 ... 0.000 0.025 0 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.86 0.15 0.81 ... 0.000 0.021 0 0 1 0 0 0 0 0
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.88 0.11 0.66 ... 0.000 0.027 0 0 1 0 0 0 0 0
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.86 0.22 0.81 ... 0.001 0.028 0 0 1 0 0 0 0 0
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.92 0.08 0.81 ... 0.001 0.049 1 0 0 0 0 0 0 0
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.90 0.11 0.77 ... 0.001 0.045 1 0 0 0 0 0 0 0

96 rows × 25 columns
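For reference, the if/elif chain above can be replaced with pd.cut. A sketch, assuming it is run before the Median Household Income column is dropped, with bin edges mirroring the INCOME2 categories:

# right = False makes each bin half-open, e.g. [35000, 50000), matching the elif logic
bins = [0, 10000, 15000, 20000, 25000, 35000, 50000, 75000, float('inf')]
labels = ['INCOME2_Less than $10,000',
          'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
          'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
          'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
          'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
          'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
          'INCOME2_Less than $75,000 ($50,000 to less than $75,000)',
          'INCOME2_$75,000 or more']
income_cat = pd.cut(county_demo['Median Household Income'], bins = bins,
                    labels = labels, right = False)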

# Use the tuned random forest, reindexing to the training column order so the features line up
county_pred_prob = forest.predict_proba(county_demo[X.columns])[:,1]
county_demo['predict_prob'] = county_pred_prob
county_demo
General_health Excessive_drinking High_school_grad College Unemployed Inactivity Smoker Health_plan Diabetes Owns_home ... _IMPRACE_Hispanic INCOME2_$75,000 or more INCOME2_Less than $35,000 ($25,000 to less than $35,000) INCOME2_Less than $50,000 ($35,000 to less than $50,000) INCOME2_Less than $75,000 ($50,000 to less than $75,000) INCOME2_Less than $25,000 ($20,000 to less than $25,000) INCOME2_Less than $20,000 ($15,000 to less than $20,000) INCOME2_Less than $15,000 ($10,000 to less than $15,000) INCOME2_Less than $10,000 predict_prob
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.86 0.13 0.66 ... 0.056 0 0 0 1 0 0 0 0 0.640059
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.87 0.11 0.68 ... 0.031 0 0 0 1 0 0 0 0 0.640059
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.82 0.11 0.68 ... 0.126 0 0 1 0 0 0 0 0 0.687877
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.85 0.14 0.76 ... 0.025 0 0 1 0 0 0 0 0 0.687877
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.82 0.10 0.74 ... 0.025 0 0 1 0 0 0 0 0 0.687877
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.86 0.15 0.81 ... 0.021 0 0 1 0 0 0 0 0 0.687877
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.88 0.11 0.66 ... 0.027 0 0 1 0 0 0 0 0 0.667665
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.86 0.22 0.81 ... 0.028 0 0 1 0 0 0 0 0 0.687877
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.92 0.08 0.81 ... 0.049 1 0 0 0 0 0 0 0 0.812499
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.90 0.11 0.77 ... 0.045 1 0 0 0 0 0 0 0 0.812499

96 rows × 26 columns

disparities = pd.read_csv('../data/disparities.csv', header = 0, index_col = 0)
disparities.index = disparities.FIPS
disparities
FIPS year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
FIPS
47033 47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 ... Crockett 0.0 14399 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47061 47061 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 34 ... Grundy 0.0 13344 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47095 47095 2019 All All Dual & non-dual TENNESSEE All All 500-999 20 ... Lake 0.0 7401 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47175 47175 2019 All All Dual & non-dual TENNESSEE All All 500-999 31 ... Van Buren 0.0 5760 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47007 47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 ... Bledsoe 1.0 14836 Rural low inadequate 6.740361 1.0 1.0 25.000000 25.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47093 47093 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 56 ... Knox 520.0 461104 Urban adequate 112.772823 11.0 10.0 306.600000 3066.0
47163 47163 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 52 ... Sullivan 199.0 157050 Urban adequate 126.711238 5.0 5.0 235.000000 1175.0
47113 47113 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 43 ... Madison 129.0 97625 Urban adequate 132.138284 3.0 3.0 330.666667 992.0
47187 47187 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 48 ... Williamson 338.0 225389 Urban adequate 149.962953 2.0 2.0 112.500000 225.0
47179 47179 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 57 ... Washington 226.0 127805 Urban adequate 176.831892 7.0 6.0 167.000000 1002.0

95 rows × 30 columns

full_df = disparities.join(county_demo, lsuffix = 'disp', rsuffix = 'demo')
full_df.sort_values('predict_prob')[['county','predict_prob']]
county predict_prob
FIPS
47027 Clay 0.615584
47095 Lake 0.615584
47067 Hancock 0.615584
47149 Rutherford 0.640059
47125 Montgomery 0.640059
... ... ...
47091 Johnson 0.687877
47055 Giles 0.687877
47075 Haywood 0.692379
47189 Wilson 0.812499
47187 Williamson 0.812499

95 rows × 2 columns

full_df.columns
Index(['FIPS', 'year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
       'primary_race', 'primary_eligibility', 'primary_denominator_well',
       'analysis_value_well', 'primary_denominator_hosp',
       'analysis_value_hosp', 'primary_denominator_emer',
       'analysis_value_emer', 'laus_code', 'Name', 'Employed',
       'Unemployeddisp', 'unemployment_rate', 'FIPS.1', 'county',
       'primary_care_physicians', 'population', 'urban', 'shadac_category',
       'pcp_per_100k', 'num_hospitals', 'num_hospitals_open', 'mean_beds',
       'num_beds', 'General_health', 'Excessive_drinking', 'High_school_grad',
       'College', 'Unemployeddemo', 'Inactivity', 'Smoker', 'Health_plan',
       'Diabetes', 'Owns_home', 'Rural', '_IMPRACE_White, Non-Hispanic',
       '_IMPRACE_Black, Non-Hispanic',
       '_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
       '_IMPRACE_Asian, Non-Hispanic', '_IMPRACE_Other race, Non-Hispanic',
       '_IMPRACE_Hispanic', 'INCOME2_$75,000 or more',
       'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
       'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
       'INCOME2_Less than $75,000 ($50,000 to less than $75,000)',
       'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
       'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
       'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
       'INCOME2_Less than $10,000', 'predict_prob'],
      dtype='object')
sns.scatterplot(x = full_df['Health_plan'], y = full_df.predict_prob)

<AxesSubplot:xlabel='Health_plan', ylabel='predict_prob'>
(county_demo).plot(kind = 'box', figsize = (10,10))
plt.xticks(rotation = 90);

Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?

counties = gpd.read_file('../data/county/tncounty.shp')
counties
OBJECTID NAME KEY SHAPE_AREA SHAPE_LEN geometry
0 76 Chester 47023 8.049024e+09 520461.080124 POLYGON ((1137985.762 344601.643, 1137965.070 ...
1 77 Wayne 47181 2.050741e+10 666520.678598 POLYGON ((1365052.057 391716.806, 1365746.554 ...
2 78 Tipton 47167 1.319125e+10 865093.887634 MULTIPOLYGON (((886814.330 400456.525, 886774....
3 79 Hamilton 47065 1.604776e+10 652926.001078 POLYGON ((2274954.438 239788.911, 2274090.610 ...
4 80 Stewart 47161 1.375003e+10 490090.336180 POLYGON ((1382472.783 743972.302, 1382445.171 ...
... ... ... ... ... ... ...
90 91 McNairy 47109 1.563586e+10 566369.132062 POLYGON ((1137985.762 344601.643, 1139350.519 ...
91 92 Franklin 47051 1.605093e+10 621176.096919 POLYGON ((1873015.265 239618.144, 1872957.848 ...
92 93 Bradley 47011 9.241234e+09 457372.233476 POLYGON ((2274954.438 239788.911, 2275552.803 ...
93 94 Marion 47115 1.428734e+10 529431.591556 POLYGON ((2126056.390 236919.771, 2122873.509 ...
94 95 Polk 47139 1.233228e+10 479994.126988 POLYGON ((2355580.184 332970.851, 2355673.384 ...

95 rows × 6 columns

full_df.county
FIPS
47033      Crockett
47061        Grundy
47095          Lake
47175     Van Buren
47007       Bledsoe
            ...    
47093          Knox
47163      Sullivan
47113       Madison
47187    Williamson
47179    Washington
Name: county, Length: 95, dtype: object
counties = pd.merge(left = counties,left_on = 'NAME',
                    right = full_df[['county', 'predict_prob']], right_on = 'county')
counties['no_checkup'] = 1- counties.predict_prob

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'no_checkup', 
              edgecolor = 'black',
              legend = True,
              cmap = 'viridis',
              # scheme="NaturalBreaks",
              ax = ax)

# With a continuous colormap, legend = True draws a colorbar rather than a
# legend, so only reposition the legend if one exists (e.g. when a
# classification scheme is used)
leg = ax.get_legend()
if leg is not None:
    leg.set_bbox_to_anchor((1, 0.5))

plt.title('Predicted No Checkup in the Last Year for Average Resident', fontsize = 18)

ax.axis('off');