Data Science Essentials

Final Project for Data Science Essentials Course

TN Med Helper is a fictional company whose mission is to ensure access to healthcare for all Tennesseans. TN Med Helper has approached your data science consultancy for help identifying communities in Tennessee that need the most help in expanding access to healthcare.

In this project, we will use the Medicare Disparities data as a starting point for identifying such communities. Specifically, you will be provided with datasets containing the percent of Medicare beneficiaries who had an annual wellness visit (annual_wellness.csv), the number of all-cause hospitalizations per 1,000 beneficiaries (hospitalizations.csv), and the number of emergency department visits per 1,000 beneficiaries (emergency_department.csv). Over the next 8 weeks, you will work towards addressing the following three objectives.

First, TN Med Helper is concerned about communities either lacking access to healthcare or losing access to healthcare. They are looking to expand telehealth technologies into vulnerable communities, and need your help to prioritize areas most needing attention. Your first objective is to identify which counties in Tennessee have the most severe lack of access to healthcare (either due to lack of hospitals, physicians, or both). Once you have identified these counties, see if you can find any common demographic or economic characteristics for these areas.

Second, TN Med Helper is interested in reducing the number of potentially preventable hospitalizations. Do areas that lack access to healthcare tend to have higher rates of emergency department visits or hospitalizations? Is there an association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits?

Finally, TN Med Helper is trying to identify specific subpopulations to focus more attention on. Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year. Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year. Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?

Over the course of this class, you will build up your data analysis skills and work towards answering these questions. At the end of the project, teams will present their findings to TN Med Helper.

Looking at the data

  • Annual wellness visits are reported as a percentage of beneficiaries.
  • Emergency department visits are reported as a number per 1,000 beneficiaries.
  • All-cause hospitalizations are reported as a number per 1,000 beneficiaries.
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
hospitalizations = pd.read_csv("../data/Medicare_Disparities_by_Population/hospitalizations.csv")
hospitalizations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3222 entries, 0 to 3221
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 3222 non-null   int64 
 1   geography            3222 non-null   object
 2   measure              3222 non-null   object
 3   adjustment           3222 non-null   object
 4   analysis             3222 non-null   object
 5   domain               3222 non-null   object
 6   condition            3222 non-null   object
 7   primary_sex          3222 non-null   object
 8   primary_age          3222 non-null   object
 9   primary_dual         3222 non-null   object
 10  fips                 3222 non-null   int64 
 11  county               3205 non-null   object
 12  state                3222 non-null   object
 13  urban                3222 non-null   object
 14  primary_race         3222 non-null   object
 15  primary_eligibility  3222 non-null   object
 16  primary_denominator  3222 non-null   object
 17  analysis_value       3222 non-null   int64 
dtypes: int64(3), object(15)
memory usage: 453.2+ KB
annual_wellness = pd.read_csv("../data/Medicare_Disparities_by_Population/annual_wellness.csv")
annual_wellness.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 3220 non-null   int64 
 1   geography            3220 non-null   object
 2   measure              3220 non-null   object
 3   adjustment           3220 non-null   object
 4   analysis             3220 non-null   object
 5   domain               3220 non-null   object
 6   condition            3220 non-null   object
 7   primary_sex          3220 non-null   object
 8   primary_age          3220 non-null   object
 9   primary_dual         3220 non-null   object
 10  fips                 3220 non-null   int64 
 11  county               3205 non-null   object
 12  state                3220 non-null   object
 13  urban                3220 non-null   object
 14  primary_race         3220 non-null   object
 15  primary_eligibility  3220 non-null   object
 16  primary_denominator  3220 non-null   object
 17  analysis_value       3220 non-null   int64 
dtypes: int64(3), object(15)
memory usage: 452.9+ KB
emergency_department = pd.read_csv("../data/Medicare_Disparities_by_Population/emergency_department.csv")
emergency_department.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   year                 3220 non-null   int64 
 1   geography            3220 non-null   object
 2   measure              3220 non-null   object
 3   adjustment           3220 non-null   object
 4   analysis             3220 non-null   object
 5   domain               3220 non-null   object
 6   condition            3220 non-null   object
 7   primary_sex          3220 non-null   object
 8   primary_age          3220 non-null   object
 9   primary_dual         3220 non-null   object
 10  fips                 3220 non-null   int64 
 11  county               3205 non-null   object
 12  state                3220 non-null   object
 13  urban                3220 non-null   object
 14  primary_race         3220 non-null   object
 15  primary_eligibility  3220 non-null   object
 16  primary_denominator  3220 non-null   object
 17  analysis_value       3220 non-null   int64 
dtypes: int64(3), object(15)
memory usage: 452.9+ KB

All three dataframes have the same structure and the same column names, which would collide in a merge. I will rename the columns I want to keep from each, then merge them.
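As an aside, pd.merge can disambiguate colliding columns itself via its suffixes argument; a minimal sketch (note it would suffix every overlapping column, not just the ones we keep):

# Sketch only: suffix colliding columns during the merge instead of renaming by hand.
merged = pd.merge(annual_wellness, hospitalizations, on = 'fips',
                  suffixes = ('_well', '_hosp'))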

emergency_department.iloc[1] == hospitalizations.iloc[1]
year                    True
geography               True
measure                False
adjustment              True
analysis                True
domain                  True
condition              False
primary_sex             True
primary_age             True
primary_dual            True
fips                    True
county                  True
state                   True
urban                   True
primary_race            True
primary_eligibility     True
primary_denominator    False
analysis_value         False
Name: 1, dtype: bool
# Collect the unique values in each column, then blank out the columns that
# vary county-by-county so the constant columns stand out.
emer_dict = {col: set(emergency_department[col]) for col in emergency_department}
for col in ['fips', 'county', 'state', 'analysis_value']:
    emer_dict[col] = []
emer_dict
{'year': {2019},
 'geography': {'County'},
 'measure': {'Emergency department visit rate'},
 'adjustment': {'Unsmoothed actual'},
 'analysis': {'Base measure'},
 'domain': {'Primary chronic conditions'},
 'condition': {'All Emergency Department Visits'},
 'primary_sex': {'All'},
 'primary_age': {'All'},
 'primary_dual': {'Dual & non-dual'},
 'fips': [],
 'county': [],
 'state': [],
 'urban': {'Rural', 'Urban'},
 'primary_race': {'All'},
 'primary_eligibility': {'All'},
 'primary_denominator': {'undefined'},
 'analysis_value': []}

The row comparison above shows which columns differ between the dataframes: measure, condition, primary_denominator, and analysis_value. The unique-value listing confirms that, apart from urban and the county identifiers (fips, county, state), every other column holds a single value.
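A quick programmatic confirmation (a sketch; constant_columns is a hypothetical helper name):

def constant_columns(df):
    # Return the columns of df that hold a single unique value.
    return [col for col in df.columns if df[col].nunique(dropna = False) == 1]

print(constant_columns(emergency_department))
print(constant_columns(hospitalizations))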

hospitalizations = hospitalizations[['fips','measure','primary_denominator', 'analysis_value']]
emergency_department = emergency_department[['fips','measure','primary_denominator', 'analysis_value']]
hospitalizations.columns = ['fips','measure_hosp','primary_denominator_hosp', 'analysis_value_hosp']
emergency_department.columns = ['fips','measure_emer','primary_denominator_emer', 'analysis_value_emer']

annual_wellness = annual_wellness.rename(columns = {'analysis_value': 'analysis_value_well',
                                                    'measure': 'measure_well',
                                                    'primary_denominator': 'primary_denominator_well'})
disparities = pd.merge(left = annual_wellness, left_on = 'fips',
         right = hospitalizations, right_on = 'fips')
disparities = pd.merge(left = disparities, left_on = 'fips',
         right = emergency_department, right_on = 'fips')
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220 entries, 0 to 3219
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   year                      3220 non-null   int64 
 1   geography                 3220 non-null   object
 2   measure_well              3220 non-null   object
 3   adjustment                3220 non-null   object
 4   analysis                  3220 non-null   object
 5   domain                    3220 non-null   object
 6   condition                 3220 non-null   object
 7   primary_sex               3220 non-null   object
 8   primary_age               3220 non-null   object
 9   primary_dual              3220 non-null   object
 10  fips                      3220 non-null   int64 
 11  county                    3205 non-null   object
 12  state                     3220 non-null   object
 13  urban                     3220 non-null   object
 14  primary_race              3220 non-null   object
 15  primary_eligibility       3220 non-null   object
 16  primary_denominator_well  3220 non-null   object
 17  analysis_value_well       3220 non-null   int64 
 18  measure_hosp              3220 non-null   object
 19  primary_denominator_hosp  3220 non-null   object
 20  analysis_value_hosp       3220 non-null   int64 
 21  measure_emer              3220 non-null   object
 22  primary_denominator_emer  3220 non-null   object
 23  analysis_value_emer       3220 non-null   int64 
dtypes: int64(5), object(19)
memory usage: 628.9+ KB
print(set(disparities.adjustment),set(disparities.analysis),set(disparities.domain),set(disparities.condition))
{'Unsmoothed actual'} {'Base measure'} {'Preventive Services'} {'Annual Wellness Visit'}

Each of the columns above only has one value, so I am going to remove them.

disparities = disparities.drop(['adjustment','analysis','domain','condition'], axis = 1)
disparities.head(2)
year geography measure_well primary_sex primary_age primary_dual fips county state urban primary_race primary_eligibility primary_denominator_well analysis_value_well measure_hosp primary_denominator_hosp analysis_value_hosp measure_emer primary_denominator_emer analysis_value_emer
0 2019 County Preventive Services All All Dual & non-dual 1001 Autauga County ALABAMA Urban All All 1,000-4,999 29 Hospitalization 5,000-9,999 299 Emergency department visit rate undefined 723
1 2019 County Preventive Services All All Dual & non-dual 1003 Baldwin County ALABAMA Rural All All 10,000+ 29 Hospitalization 10,000+ 268 Emergency department visit rate undefined 601

Lastly, I will narrow down the list to only counties in TN.

disparities = disparities.loc[disparities.state == 'TENNESSEE']
disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 2428 to 2522
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   year                      95 non-null     int64 
 1   geography                 95 non-null     object
 2   measure_well              95 non-null     object
 3   primary_sex               95 non-null     object
 4   primary_age               95 non-null     object
 5   primary_dual              95 non-null     object
 6   fips                      95 non-null     int64 
 7   county                    95 non-null     object
 8   state                     95 non-null     object
 9   urban                     95 non-null     object
 10  primary_race              95 non-null     object
 11  primary_eligibility       95 non-null     object
 12  primary_denominator_well  95 non-null     object
 13  analysis_value_well       95 non-null     int64 
 14  measure_hosp              95 non-null     object
 15  primary_denominator_hosp  95 non-null     object
 16  analysis_value_hosp       95 non-null     int64 
 17  measure_emer              95 non-null     object
 18  primary_denominator_emer  95 non-null     object
 19  analysis_value_emer       95 non-null     int64 
dtypes: int64(5), object(15)
memory usage: 15.6+ KB

As before, we have all 95 counties in TN.

Question 1: Identify which counties in TN have the most severe lack of access to healthcare.

Which counties have low numbers of physicians?

physicians = pd.read_csv('../data/primary_care_physicians.csv')
physicians = physicians.loc[physicians.state == 'Tennessee']
physicians.shape
(95, 4)
population = pd.read_csv('../data/population_by_county.csv')
population.head()
FIPS population county state urban
0 17051 21565 Fayette County ILLINOIS Rural
1 17107 29003 Logan County ILLINOIS Rural
2 17165 23994 Saline County ILLINOIS Rural
3 17097 701473 Lake County ILLINOIS Urban
4 17127 14219 Massac County ILLINOIS Rural
physicians = pd.merge(left = physicians, right = population[['FIPS','population','urban']], left_on='FIPS', right_on = 'FIPS')
physicians['urban'].value_counts()
Rural    57
Urban    38
Name: urban, dtype: int64
ratio = physicians.population / physicians.primary_care_physicians
physicians.loc[ratio < 1500, 'supply'] = 'adequate'
physicians.loc[(ratio >= 1500) & (ratio < 3500), 'supply'] = 'moderately adequate'
physicians.loc[ratio >= 3500, 'supply'] = 'low inadequate'
physicians = physicians.rename({'supply':'shadac_category'}, axis = 'columns')
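An equivalent way to assign these categories is pd.cut on the residents-per-physician ratio; a sketch using the same thresholds:

import numpy as np

# Residents per physician; counties with zero physicians divide to inf and
# fall into the lowest-supply bin.
ratio = physicians.population / physicians.primary_care_physicians
physicians['shadac_category'] = pd.cut(ratio, bins = [0, 1500, 3500, np.inf],
                                       labels = ['adequate', 'moderately adequate', 'low inadequate'])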
import matplotlib.pyplot as plt
pd.crosstab(physicians.urban, physicians.shadac_category, normalize='index').plot(kind = 'bar', stacked = True)
plt.xticks(rotation = 0)
plt.legend(bbox_to_anchor = (1, 0.8), loc = 'upper left');   # move the legend to the right side of the plot
plt.title("SHADAC Category by Type of County")
Text(0.5, 1.0, 'SHADAC Category by Type of County')
physicians
FIPS state county primary_care_physicians population urban shadac_category
0 47001 Tennessee Anderson 39.0 76061 Urban moderately adequate
1 47003 Tennessee Bedford 15.0 48292 Rural moderately adequate
2 47005 Tennessee Benton 3.0 16140 Rural low inadequate
3 47007 Tennessee Bledsoe 1.0 14836 Rural low inadequate
4 47009 Tennessee Blount 90.0 129927 Urban adequate
... ... ... ... ... ... ... ...
90 47181 Tennessee Wayne 5.0 16693 Rural moderately adequate
91 47183 Tennessee Weakley 18.0 33510 Rural moderately adequate
92 47185 Tennessee White 9.0 26800 Rural moderately adequate
93 47187 Tennessee Williamson 338.0 225389 Urban adequate
94 47189 Tennessee Wilson 43.0 136666 Urban moderately adequate

95 rows × 7 columns

unemployment = pd.read_csv('../data/tn_unemployment.csv')
unemployment.head()
laus_code State County Name Period LF Employed Unemployed unemployment_rate
0 CN4700100000000 47 1 Anderson County, TN Mar-21 34704 33010 1694 4.9
1 CN4700300000000 47 3 Bedford County, TN Mar-21 20623 19550 1073 5.2
2 CN4700500000000 47 5 Benton County, TN Mar-21 6723 6305 418 6.2
3 CN4700700000000 47 7 Bledsoe County, TN Mar-21 4252 3947 305 7.2
4 CN4700900000000 47 9 Blount County, TN Mar-21 64098 61119 2979 4.6
unemployment.Name = unemployment.Name.str.split(' County, TN', expand = True)[0]
physicians = pd.merge(left=unemployment[['laus_code','Name','Employed','Unemployed','unemployment_rate']], left_on='Name',
        right = physicians[['FIPS','county','primary_care_physicians','population','urban','shadac_category']], right_on='county')
physicians
laus_code Name Employed Unemployed unemployment_rate FIPS county primary_care_physicians population urban shadac_category
0 CN4700100000000 Anderson 33010 1694 4.9 47001 Anderson 39.0 76061 Urban moderately adequate
1 CN4700300000000 Bedford 19550 1073 5.2 47003 Bedford 15.0 48292 Rural moderately adequate
2 CN4700500000000 Benton 6305 418 6.2 47005 Benton 3.0 16140 Rural low inadequate
3 CN4700700000000 Bledsoe 3947 305 7.2 47007 Bledsoe 1.0 14836 Rural low inadequate
4 CN4700900000000 Blount 61119 2979 4.6 47009 Blount 90.0 129927 Urban adequate
... ... ... ... ... ... ... ... ... ... ... ...
90 CN4718100000000 Wayne 6074 342 5.3 47181 Wayne 5.0 16693 Rural moderately adequate
91 CN4718300000000 Weakley 14783 711 4.6 47183 Weakley 18.0 33510 Rural moderately adequate
92 CN4718500000000 White 11484 601 5.0 47185 White 9.0 26800 Rural moderately adequate
93 CN4718700000000 Williamson 125213 4271 3.3 47187 Williamson 338.0 225389 Urban adequate
94 CN4718900000000 Wilson 74347 3079 4.0 47189 Wilson 43.0 136666 Urban moderately adequate

95 rows × 11 columns

import seaborn as sns
sns.boxplot(x = physicians.urban, y = physicians.unemployment_rate)
plt.xticks(rotation = 0)
plt.title("Unemployment by Type of County")
Text(0.5, 1.0, 'Unemployment by Type of County')
physicians['pcp_per_100k'] = physicians.primary_care_physicians/physicians.population*100000
sns.lmplot(data = physicians, y = 'pcp_per_100k', x = 'unemployment_rate')
plt.ylabel("PCP per 100k residents")
plt.xlabel("Unemployment Rate")
physicians[['unemployment_rate', 'pcp_per_100k']].corr()
unemployment_rate pcp_per_100k
unemployment_rate 1.000000 -0.335333
pcp_per_100k -0.335333 1.000000
disparities = pd.merge(left = disparities, left_on = 'fips',
         right = physicians, right_on = 'FIPS')

Now get rid of repetitive columns:

disparities.head(2)
year geography measure_well primary_sex primary_age primary_dual fips county_x state urban_x ... Employed Unemployed unemployment_rate FIPS county_y primary_care_physicians population urban_y shadac_category pcp_per_100k
0 2019 County Preventive Services All All Dual & non-dual 47001 Anderson County TENNESSEE Urban ... 33010 1694 4.9 47001 Anderson 39.0 76061 Urban moderately adequate 51.274635
1 2019 County Preventive Services All All Dual & non-dual 47003 Bedford County TENNESSEE Rural ... 19550 1073 5.2 47003 Bedford 15.0 48292 Rural moderately adequate 31.061045

2 rows × 32 columns

disparities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 0 to 94
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   year                      95 non-null     int64  
 1   geography                 95 non-null     object 
 2   measure_well              95 non-null     object 
 3   primary_sex               95 non-null     object 
 4   primary_age               95 non-null     object 
 5   primary_dual              95 non-null     object 
 6   fips                      95 non-null     int64  
 7   county_x                  95 non-null     object 
 8   state                     95 non-null     object 
 9   urban_x                   95 non-null     object 
 10  primary_race              95 non-null     object 
 11  primary_eligibility       95 non-null     object 
 12  primary_denominator_well  95 non-null     object 
 13  analysis_value_well       95 non-null     int64  
 14  measure_hosp              95 non-null     object 
 15  primary_denominator_hosp  95 non-null     object 
 16  analysis_value_hosp       95 non-null     int64  
 17  measure_emer              95 non-null     object 
 18  primary_denominator_emer  95 non-null     object 
 19  analysis_value_emer       95 non-null     int64  
 20  laus_code                 95 non-null     object 
 21  Name                      95 non-null     object 
 22  Employed                  95 non-null     int64  
 23  Unemployed                95 non-null     int64  
 24  unemployment_rate         95 non-null     float64
 25  FIPS                      95 non-null     int64  
 26  county_y                  95 non-null     object 
 27  primary_care_physicians   95 non-null     float64
 28  population                95 non-null     int64  
 29  urban_y                   95 non-null     object 
 30  shadac_category           95 non-null     object 
 31  pcp_per_100k              95 non-null     float64
dtypes: float64(3), int64(9), object(20)
memory usage: 24.5+ KB
disparities = disparities.drop(['geography','county_x','urban_x', 'fips'], axis = 1)
disparities.index = disparities.FIPS
disparities.columns = disparities.columns.str.replace('county_y', 'county')
disparities.columns = disparities.columns.str.replace('urban_y', 'urban')
print(set(disparities.measure_well),set(disparities.measure_emer),set(disparities.measure_hosp))
{'Preventive Services'} {'Emergency department visit rate'} {'Hospitalization'}
disparities = disparities.drop(['measure_well','measure_emer','measure_hosp'], axis = 1)

disparities.iloc[0]
year                                       2019
primary_sex                                 All
primary_age                                 All
primary_dual                    Dual & non-dual
state                                 TENNESSEE
primary_race                                All
primary_eligibility                         All
primary_denominator_well            5,000-9,999
analysis_value_well                          35
primary_denominator_hosp                10,000+
analysis_value_hosp                         267
primary_denominator_emer              undefined
analysis_value_emer                         617
laus_code                       CN4700100000000
Name                                   Anderson
Employed                                  33010
Unemployed                                 1694
unemployment_rate                           4.9
FIPS                                      47001
county                                 Anderson
primary_care_physicians                    39.0
population                                76061
urban                                     Urban
shadac_category             moderately adequate
pcp_per_100k                          51.274635
Name: 47001, dtype: object
disparities.to_csv("../data/disparities.csv")

Analyzing disparities in access: physicians

disparities = pd.read_csv('../data/disparities.csv', index_col = 0)
hospitals = pd.read_csv('../data/Hospitals.csv')
hospitals = hospitals.loc[hospitals.STATE =='TN']
hospitals.COUNTYFIPS = pd.to_numeric(hospitals['COUNTYFIPS'])
# year is constant (2019), so drop it before correlating to avoid an all-NaN row.
disparities.drop(columns = 'year').corr().style.background_gradient('viridis')
  analysis_value_well analysis_value_hosp analysis_value_emer Employed Unemployed unemployment_rate FIPS.1 primary_care_physicians population pcp_per_100k
analysis_value_well 1.000000 -0.480732 -0.480607 0.288357 0.221078 -0.325228 0.032489 0.319609 0.281995 0.488941
analysis_value_hosp -0.480732 1.000000 0.599412 -0.093001 -0.049513 0.275980 0.085773 -0.121397 -0.091323 -0.308934
analysis_value_emer -0.480607 0.599412 1.000000 -0.179994 -0.121924 0.484027 -0.044881 -0.202044 -0.172820 -0.396181
Employed 0.288357 -0.093001 -0.179994 1.000000 0.965032 -0.196023 0.089982 0.961056 0.992966 0.502287
Unemployed 0.221078 -0.049513 -0.121924 0.965032 1.000000 -0.077655 0.091267 0.934610 0.985295 0.432351
unemployment_rate -0.325228 0.275980 0.484027 -0.196023 -0.077655 1.000000 -0.104688 -0.159979 -0.156259 -0.335333
FIPS.1 0.032489 0.085773 -0.044881 0.089982 0.091267 -0.104688 1.000000 0.093030 0.091808 0.103152
primary_care_physicians 0.319609 -0.121397 -0.202044 0.961056 0.934610 -0.159979 0.093030 1.000000 0.964170 0.624601
population 0.281995 -0.091323 -0.172820 0.992966 0.985295 -0.156259 0.091808 0.964170 1.000000 0.501609
pcp_per_100k 0.488941 -0.308934 -0.396181 0.502287 0.432351 -0.335333 0.103152 0.624601 0.501609 1.000000
disparities.columns
Index(['year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
       'primary_race', 'primary_eligibility', 'primary_denominator_well',
       'analysis_value_well', 'primary_denominator_hosp',
       'analysis_value_hosp', 'primary_denominator_emer',
       'analysis_value_emer', 'laus_code', 'Name', 'Employed', 'Unemployed',
       'unemployment_rate', 'FIPS.1', 'county', 'primary_care_physicians',
       'population', 'urban', 'shadac_category', 'pcp_per_100k'],
      dtype='object')
import seaborn as sns
sns.pairplot(disparities[['analysis_value_well', 
                          'analysis_value_hosp', 
                          'analysis_value_emer',
                          'unemployment_rate',
                          'population',
                          'pcp_per_100k','urban'
                         ]], hue = 'urban')
<seaborn.axisgrid.PairGrid at 0x19f0b01c0>

First, we will look at primary care physicians by county. We computed these numbers above, so here we just summarize them in plots.

sns.boxplot(x = disparities.shadac_category, y = disparities.pcp_per_100k)
<AxesSubplot:xlabel='shadac_category', ylabel='pcp_per_100k'>
disparities.shadac_category.value_counts()
moderately adequate    50
low inadequate         31
adequate               14
Name: shadac_category, dtype: int64

There are 31 counties considered low inadequate (>3500 residents per physician).

low_shadac = disparities.loc[disparities.shadac_category == 'low inadequate']
low_shadac.sort_values('pcp_per_100k').head(10)
year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well primary_denominator_hosp ... Employed Unemployed unemployment_rate FIPS.1 county primary_care_physicians population urban shadac_category pcp_per_100k
FIPS
47175 2019 All All Dual & non-dual TENNESSEE All All 500-999 31 500-999 ... 1884 137 6.8 47175 Van Buren 0.0 5760 Rural low inadequate 0.000000
47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 1,000-4,999 ... 6494 344 5.0 47033 Crockett 0.0 14399 Rural low inadequate 0.000000
47061 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 34 1,000-4,999 ... 4655 322 6.5 47061 Grundy 0.0 13344 Rural low inadequate 0.000000
47095 2019 All All Dual & non-dual TENNESSEE All All 500-999 20 1,000-4,999 ... 1537 157 9.3 47095 Lake 0.0 7401 Rural low inadequate 0.000000
47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 1,000-4,999 ... 3947 305 7.2 47007 Bledsoe 1.0 14836 Rural low inadequate 6.740361
47161 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 39 1,000-4,999 ... 5146 305 5.6 47161 Stewart 1.0 13427 Urban low inadequate 7.447680
47097 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 36 1,000-4,999 ... 9075 658 6.8 47097 Lauderdale 2.0 25989 Rural low inadequate 7.695564
47129 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 23 1,000-4,999 ... 7327 455 5.8 47129 Morgan 3.0 21545 Rural low inadequate 13.924344
47117 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 20 1,000-4,999 ... 14422 735 4.8 47117 Marshall 5.0 32965 Rural low inadequate 15.167602
47067 2019 All All Dual & non-dual TENNESSEE All All 500-999 27 500-999 ... 1997 144 6.7 47067 Hancock 1.0 6587 Rural low inadequate 15.181418

10 rows × 25 columns

There are 31 counties with low access to physicians, by SHADAC category. In particular, 4 counties have 0 physicians: Van Buren, Crockett, Grundy, and Lake. All four are rural, though Crockett and Grundy each have more than 13,000 residents.


Analyzing disparities in access: hospitals

We would like to add hospitals per 100K residents for each county to disparities. To do this, we match hospitals.COUNTYFIPS against the FIPS index of disparities, then count up the hospitals (plus associated variables) in each county and scale by population.
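A sketch of the target quantity (hosp_per_100k is an illustrative name, and the STATUS filter anticipates the open/closed distinction explored below):

# Count open hospitals per county FIPS, align on the FIPS index of disparities,
# and scale by county population.
open_counts = hospitals.loc[hospitals.STATUS == 'OPEN', 'COUNTYFIPS'].value_counts()
hosp_per_100k = disparities.index.to_series().map(open_counts).fillna(0) / disparities.population * 100000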

hospitals.iloc[0]
X                                           -9409026.0792
Y                                            4210878.1252
OBJECTID                                              517
ID                                               14737331
NAME                 STARR REGIONAL MEDICAL CENTER ETOWAH
ADDRESS                             886 HIGHWAY 411 NORTH
CITY                                               ETOWAH
STATE                                                  TN
ZIP                                                 37331
ZIP4                                        NOT AVAILABLE
TELEPHONE                                  (423) 263-3600
TYPE                                   GENERAL ACUTE CARE
STATUS                                               OPEN
POPULATION                                             72
COUNTY                                             MCMINN
COUNTYFIPS                                          47107
COUNTRY                                               USA
LATITUDE                                        35.345099
LONGITUDE                                      -84.522719
NAICS_CODE                                         622110
NAICS_DESC         GENERAL MEDICAL AND SURGICAL HOSPITALS
SOURCE        https://apps.health.tn.gov/facilitylistings
SOURCEDATE                            2019/08/10 00:00:00
VAL_METHOD                                        IMAGERY
VAL_DATE                              2014/02/10 00:00:00
WEBSITE                     http://www.starrregional.com/
STATE_ID                                    NOT AVAILABLE
ALT_NAME                          WOODS MEMORIAL HOSPITAL
ST_FIPS                                                47
OWNER                                          NON-PROFIT
TTL_STAFF                                            -999
BEDS                                                   72
TRAUMA                                          LEVEL III
HELIPAD                                                 Y
Name: 515, dtype: object
hospitals.COUNTYFIPS.value_counts()
47157    23
47037    15
47065    13
47093    11
47179     7
         ..
47019     1
47177     1
47027     1
47159     1
47181     1
Name: COUNTYFIPS, Length: 80, dtype: int64
# Map the per-county hospital counts onto the FIPS-indexed disparities frame.
disparities['num_hospitals'] = disparities.index.to_series().map(hospitals.COUNTYFIPS.value_counts())

However, some of these hospitals are closed.

set(hospitals.STATUS)
{'CLOSED', 'OPEN'}
hospitals.groupby('COUNTYFIPS').STATUS.value_counts()
COUNTYFIPS  STATUS
47001       OPEN      1
47003       CLOSED    1
            OPEN      1
47005       OPEN      1
47007       OPEN      1
                     ..
47181       OPEN      1
47183       OPEN      2
47185       OPEN      1
47187       OPEN      2
47189       OPEN      2
Name: STATUS, Length: 89, dtype: int64
# Repeat the count, restricted to hospitals that are still open.
disparities['num_hospitals_open'] = disparities.index.to_series().map(
    hospitals.loc[hospitals.STATUS == 'OPEN'].COUNTYFIPS.value_counts())
disparities
year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well primary_denominator_hosp ... unemployment_rate FIPS.1 county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open
FIPS
47001 2019 All All Dual & non-dual TENNESSEE All All 5,000-9,999 35 10,000+ ... 4.9 47001 Anderson 39.0 76061 Urban moderately adequate 51.274635 1.0 1.0
47003 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 46 5,000-9,999 ... 5.2 47003 Bedford 15.0 48292 Rural moderately adequate 31.061045 2.0 1.0
47005 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 30 1,000-4,999 ... 6.2 47005 Benton 3.0 16140 Rural low inadequate 18.587361 1.0 1.0
47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 1,000-4,999 ... 7.2 47007 Bledsoe 1.0 14836 Rural low inadequate 6.740361 1.0 1.0
47009 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 43 10,000+ ... 4.6 47009 Blount 90.0 129927 Urban adequate 69.269667 2.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 15 1,000-4,999 ... 5.3 47181 Wayne 5.0 16693 Rural moderately adequate 29.952675 1.0 1.0
47183 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 30 5,000-9,999 ... 4.6 47183 Weakley 18.0 33510 Rural moderately adequate 53.715309 2.0 2.0
47185 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 35 5,000-9,999 ... 5.0 47185 White 9.0 26800 Rural moderately adequate 33.582090 1.0 1.0
47187 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 48 10,000+ ... 3.3 47187 Williamson 338.0 225389 Urban adequate 149.962953 2.0 2.0
47189 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 45 10,000+ ... 4.0 47189 Wilson 43.0 136666 Urban moderately adequate 31.463568 2.0 2.0

95 rows × 27 columns

sns.scatterplot(data = disparities, x = 'pcp_per_100k', y = 'num_hospitals_open', hue = 'urban')
<AxesSubplot:xlabel='pcp_per_100k', ylabel='num_hospitals_open'>
disparities = disparities.fillna(0)
disparities.loc[(disparities.pcp_per_100k < 15) &
                (disparities.num_hospitals_open < 2 )]['county']
FIPS
47007       Bledsoe
47033      Crockett
47061        Grundy
47095          Lake
47097    Lauderdale
47129        Morgan
47161       Stewart
47175     Van Buren
Name: county, dtype: object

The counties above are the ones with fewer than 2 open hospitals and fewer than 15 primary care physicians per 100K residents.

Lastly, we can take a look at the number of beds. BEDS uses -999 as a missing-value code, so we replace it with 0 so it doesn't distort the sums and means.

hospitals.BEDS = hospitals.BEDS.replace(-999,0)
# Mean and total beds across the open hospitals in each county.
beds_open = hospitals.loc[hospitals.STATUS == 'OPEN'].groupby('COUNTYFIPS').BEDS
disparities['mean_beds'] = disparities.index.to_series().map(beds_open.mean())
disparities['num_beds'] = disparities.index.to_series().map(beds_open.sum())

disparities.num_beds = disparities.num_beds.fillna(0)
sns.scatterplot(data = disparities.loc[disparities.urban == 'Rural'], x = 'num_hospitals_open', y = 'num_beds')#, hue = 'urban')
<AxesSubplot:xlabel='num_hospitals_open', ylabel='num_beds'>
sns.scatterplot(data = disparities, x = 'pcp_per_100k', y = 'num_beds', hue = 'urban')
<AxesSubplot:xlabel='pcp_per_100k', ylabel='num_beds'>
disparities.to_csv('../data/disparities.csv')

Plotting disparities

median_income = pd.read_excel('../data/est18all.xls',
              sheet_name = 'est18ALL',
             header = 3,
             usecols = 'C,D,W')
median_income.head(2)
Postal Code Name Median Household Income
0 US United States 61937
1 AL Alabama 49881

For this map, we only need the counties located in Tennessee.

median_income = median_income.loc[median_income['Postal Code'] == 'TN']
median_income.head(2)
Postal Code Name Median Household Income
2471 TN Tennessee 52366
2472 TN Anderson County 50672

The first remaining row is the state-level value for Tennessee, so we remove it.

median_income = median_income.iloc[1:]
median_income.head(2)
Postal Code Name Median Household Income
2472 TN Anderson County 50672
2473 TN Bedford County 49860

Now, let’s read in our county shapefile, which was obtained from http://www.tngis.org/administrative-boundaries.htm

This creates a geopandas DataFrame, which is like a pandas DataFrame, but has geometry associated with it.

counties = gpd.read_file('../data/county/tncounty.shp')
counties.head()
OBJECTID NAME KEY SHAPE_AREA SHAPE_LEN geometry
0 76 Chester 47023 8.049024e+09 520461.080124 POLYGON ((1137985.762 344601.643, 1137965.070 ...
1 77 Wayne 47181 2.050741e+10 666520.678598 POLYGON ((1365052.057 391716.806, 1365746.554 ...
2 78 Tipton 47167 1.319125e+10 865093.887634 MULTIPOLYGON (((886814.330 400456.525, 886774....
3 79 Hamilton 47065 1.604776e+10 652926.001078 POLYGON ((2274954.438 239788.911, 2274090.610 ...
4 80 Stewart 47161 1.375003e+10 490090.336180 POLYGON ((1382472.783 743972.302, 1382445.171 ...
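As noted above, the geometry column supports spatial operations directly; for example (areas are in the shapefile's native units):

# Polygon area of each county, computed from the geometry column.
counties.geometry.area.head()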
median_income['NAME'] = median_income['Name'].str[:-7]
counties = pd.merge(left = counties,
                    right = median_income[['NAME', 'Median Household Income']])
counties['Median Household Income'] = pd.to_numeric(counties['Median Household Income'])
from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'Median Household Income', 
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme="NaturalBreaks",
              ax = ax)

leg = ax.get_legend()

# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "\${:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "\${:,.0f}".format(lower)
    else:
        new_text = "\${:,.0f}".format(lower + 1) + " - " + "\${:,.0f}".format(upper)
        
    labels.append(new_text)

# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0],[0], marker = 'o', 
                    markersize = line.get_markersize(), 
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)

# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
    
plt.title('Median Household Income by County, 2018', fontsize = 18)

ax.axis('off');
counties = pd.merge(left = counties, left_on = 'NAME',
         right = disparities, right_on = 'county')
counties
OBJECTID NAME KEY SHAPE_AREA SHAPE_LEN geometry Median Household Income year primary_sex primary_age ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
0 76 Chester 47023 8.049024e+09 520461.080124 POLYGON ((1137985.762 344601.643, 1137965.070 ... 47508 2019 All All ... Chester 4.0 17190 Urban low inadequate 23.269343 0.0 0.0 NaN 0.0
1 77 Wayne 47181 2.050741e+10 666520.678598 POLYGON ((1365052.057 391716.806, 1365746.554 ... 38879 2019 All All ... Wayne 5.0 16693 Rural moderately adequate 29.952675 1.0 1.0 80.0 80.0
2 78 Tipton 47167 1.319125e+10 865093.887634 MULTIPOLYGON (((886814.330 400456.525, 886774.... 60874 2019 All All ... Tipton 17.0 61447 Urban low inadequate 27.666119 1.0 1.0 100.0 100.0
3 79 Hamilton 47065 1.604776e+10 652926.001078 POLYGON ((2274954.438 239788.911, 2274090.610 ... 57196 2019 All All ... Hamilton 403.0 360919 Urban adequate 111.659403 13.0 12.0 160.0 1920.0
4 80 Stewart 47161 1.375003e+10 490090.336180 POLYGON ((1382472.783 743972.302, 1382445.171 ... 46565 2019 All All ... Stewart 1.0 13427 Urban low inadequate 7.447680 0.0 0.0 NaN 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
90 91 McNairy 47109 1.563586e+10 566369.132062 POLYGON ((1137985.762 344601.643, 1139350.519 ... 39859 2019 All All ... McNairy 9.0 25844 Rural moderately adequate 34.824331 0.0 0.0 NaN 0.0
91 92 Franklin 47051 1.605093e+10 621176.096919 POLYGON ((1873015.265 239618.144, 1872957.848 ... 50201 2019 All All ... Franklin 27.0 41725 Rural moderately adequate 64.709407 2.0 2.0 86.5 173.0
92 93 Bradley 47011 9.241234e+09 457372.233476 POLYGON ((2274954.438 239788.911, 2275552.803 ... 50427 2019 All All ... Bradley 55.0 105749 Urban moderately adequate 52.009948 2.0 2.0 174.0 348.0
93 94 Marion 47115 1.428734e+10 529431.591556 POLYGON ((2126056.390 236919.771, 2122873.509 ... 50819 2019 All All ... Marion 12.0 28538 Urban moderately adequate 42.049198 1.0 1.0 68.0 68.0
94 95 Polk 47139 1.233228e+10 479994.126988 POLYGON ((2355580.184 332970.851, 2355673.384 ... 41968 2019 All All ... Polk 8.0 16814 Urban moderately adequate 47.579398 1.0 0.0 NaN 0.0

95 rows × 36 columns

Before mapping hospital access, compute the number of open hospitals per 100K residents; the choropleth below and a later scatterplot use this column.

counties['num_hospitals_per_100k'] = counties.num_hospitals_open/counties.population*100000

from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'num_hospitals_per_100k', 
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme="Quantiles",
              ax = ax,
              # missing_kwds={
              # "color": "lightgrey",
              # "edgecolor": "grey",
              # "label": "Missing values",
              # },
              # classification_kwds ={k : 10}
)

leg = ax.get_legend()

# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    if label_text == 'Missing values': 
        labels.append(label_text)
        continue
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "{:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "{:,.0f}".format(lower)
    else:
        new_text = "{:,.0f}".format(lower + 1) + " - " + "{:,.0f}".format(upper)
        
    labels.append(new_text)

# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0],[0], marker = 'o', 
                    markersize = line.get_markersize(), 
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)

# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
    
plt.title('Number of Hospitals per 100K Residents by County', fontsize = 18)

ax.axis('off');
from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'pcp_per_100k', 
              edgecolor = 'black',
              legend = True,
              cmap = 'Blues',
              scheme="Quantiles",
              ax = ax,
              # missing_kwds={
              # "color": "lightgrey",
              # "edgecolor": "grey",
              # "label": "Missing values",
              # },
              # classification_kwds ={k : 10}
)

leg = ax.get_legend()

# Adjust the formatting of the legend
labels = []
n = len(leg.get_texts())
for i, lbl in enumerate(leg.get_texts()):
    label_text = lbl.get_text()
    if label_text == 'Missing values': 
        labels.append(label_text)
        continue
    lower = float(label_text.split()[0][:-1])
    upper = float(label_text.split()[1][:-1])
    if i == 0:
        new_text = "Below " + "{:,.0f}".format(upper + 1)
    elif i == n - 1:
        new_text = "Above " + "{:,.0f}".format(lower)
    else:
        new_text = "{:,.0f}".format(lower + 1) + " - " + "{:,.0f}".format(upper)
        
    labels.append(new_text)

# Adjust the marker appearance
# Extract the old markers and then modify by setting the edgecolor and edgewidth
markers = []
for line in leg.get_lines():
    marker = Line2D([0],[0], marker = 'o', 
                    markersize = line.get_markersize(), 
                    color = line.get_markerfacecolor(),
                    linestyle = 'None',
                    markeredgecolor = 'black',
                    markeredgewidth = 1)
    markers.append(marker)

# Redraw the legend with the new labels and markers
plt.legend(markers, labels, fontsize = 12)
leg = ax.get_legend()
leg.set_bbox_to_anchor((1, 0.5))
    
plt.title('Number of PCP per 100K Residents by County', fontsize = 18)

ax.axis('off');
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_beds', hue = 'shadac_category')
<AxesSubplot:xlabel='Median Household Income', ylabel='num_beds'>
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_hospitals_open', hue = 'shadac_category')
<AxesSubplot:xlabel='Median Household Income', ylabel='num_hospitals_open'>
sns.scatterplot(data = counties, x = 'Median Household Income', y = 'num_hospitals_per_100k', hue = 'urban')
plt.ylabel("Number Open Hospitals per 100K Residents")
plt.savefig('income_vs_hosp_urban.pdf')
disparities.sort_values(['pcp_per_100k','population'], ascending = [True,False])[['county', 'num_beds','num_hospitals','pcp_per_100k','population']][0:25]
county num_beds num_hospitals pcp_per_100k population
FIPS
47033 Crockett 0.0 0.0 0.000000 14399
47061 Grundy 0.0 0.0 0.000000 13344
47095 Lake 0.0 0.0 0.000000 7401
47175 Van Buren 0.0 0.0 0.000000 5760
47007 Bledsoe 25.0 1.0 6.740361 14836
47161 Stewart 0.0 0.0 7.447680 13427
47097 Lauderdale 25.0 1.0 7.695564 25989
47129 Morgan 0.0 0.0 13.924344 21545
47117 Marshall 25.0 1.0 15.167602 32965
47067 Hancock 10.0 1.0 15.181418 6587
47173 Union 0.0 0.0 15.394089 19488
47127 Moore 0.0 0.0 15.678896 6378
47069 Hardeman 25.0 1.0 15.721416 25443
47101 Lewis 0.0 0.0 16.629251 12027
47111 Macon 25.0 1.0 16.777116 23842
47087 Jackson 0.0 1.0 17.120356 11682
47005 Benton 25.0 1.0 18.587361 16140
47123 Monroe 59.0 1.0 19.538034 46064
47169 Trousdale 25.0 1.0 19.548431 10231
47081 Hickman 25.0 1.0 20.150727 24813
47015 Cannon 60.0 1.0 21.159543 14178
47057 Grainger 0.0 0.0 21.644085 23101
47023 Chester 0.0 0.0 23.269343 17190
47083 Houston 25.0 1.0 24.497795 8164
47077 Henderson 45.0 1.0 25.020553 27977

What happens if we add a few hospitals in TN?

sns.kdeplot(disparities.num_hospitals_open, hue = disparities.urban)
<AxesSubplot:xlabel='num_hospitals_open', ylabel='Density'>
predict = disparities.copy()
disparities.sort_values(['num_hospitals_open','pcp_per_100k'])[0:1]
year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well primary_denominator_hosp ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
FIPS
47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 1,000-4,999 ... Crockett 0.0 14399 Rural low inadequate 0.0 0.0 0.0 NaN 0.0

1 rows × 29 columns

import numpy as np
def add_hospitals_weak_link(data, category, num_hospitals = 5, return_mean = False, print_bool = True):
    # Give all of the new hospitals to the single worst-off county
    # (lowest value of `category`, ties broken by largest population).
    df = data.copy()
    sorted_df = df.sort_values([category, 'population'], ascending = [True, False])
    worst = sorted_df.index[0]
    # Note: this assigns rather than adds; the two are equivalent here only
    # because the worst-off counties start with zero open hospitals.
    df.loc[worst, 'num_hospitals_open'] = num_hospitals
    if print_bool:
        print("Baseline mean number of open hospitals before adding hospitals: "
              + str(np.round(data['num_hospitals_open'].mean(), 4)))
        print(f"Mean number of open hospitals after adding {num_hospitals} hospitals to {df.loc[worst,'county']} County: "
              + str(np.round(df['num_hospitals_open'].mean(), 4)))
    return df['num_hospitals_open'].mean() if return_mean else None

add_hospitals_weak_link(disparities, category = 'pcp_per_100k')
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 5 hospitals to Crockett County: 1.6947
def add_hospitals_equitably(data, category, num_hospitals = 5, num_counties = 5, print_bool = True, return_mean = False):
    # Split the new hospitals evenly across the num_counties worst-off counties.
    df = data.copy()
    sorted_df = df.sort_values([category, 'population'], ascending = [True, False])
    worst = sorted_df.index[0:num_counties]
    for i in worst:
        # Assignment overwrites any hospitals a county already has (Bledsoe has
        # one open hospital), so the net increase can be below num_hospitals.
        df.loc[i, 'num_hospitals_open'] = num_hospitals / num_counties
    if print_bool:
        print("Baseline mean number of open hospitals before adding hospitals: "
              + str(np.round(data['num_hospitals_open'].mean(), 4)))
        print(f"Mean number of open hospitals after adding {num_hospitals} hospitals to {df.loc[worst,'county'].values} Counties: "
              + str(np.round(df['num_hospitals_open'].mean(), 4)))
    return df['num_hospitals_open'].mean() if return_mean else None

add_hospitals_equitably(disparities, category = 'pcp_per_100k')
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 5 hospitals to ['Crockett' 'Grundy' 'Lake' 'Van Buren' 'Bledsoe'] Counties: 1.6842
add_hospitals_weak_link(disparities, category = 'pcp_per_100k', num_hospitals=10)
add_hospitals_equitably(disparities, category = 'pcp_per_100k', num_hospitals=10)
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 10 hospitals to Crockett County: 1.7474
Baseline mean number of open hospitals before adding hospitals: 1.6421
Mean number of open hospitals after adding 10 hospitals to ['Crockett' 'Grundy' 'Lake' 'Van Buren' 'Bledsoe'] Counties: 1.7368
means_weak = []
means_equit = []
for i in range(20):
    means_weak.append(add_hospitals_weak_link(disparities, category='pcp_per_100k', num_hospitals=i, print_bool=False, return_mean = True))
    means_equit.append(add_hospitals_equitably(disparities, category='pcp_per_100k', num_hospitals=i,num_counties=4, print_bool=False, return_mean = True))

print(means_weak)
print(means_equit)

[1.6421052631578947, 1.6526315789473685, 1.6631578947368422, 1.6736842105263159, 1.6842105263157894, 1.694736842105263, 1.7052631578947368, 1.7157894736842105, 1.7263157894736842, 1.736842105263158, 1.7473684210526317, 1.7578947368421052, 1.768421052631579, 1.7789473684210526, 1.7894736842105263, 1.8, 1.8105263157894738, 1.8210526315789475, 1.831578947368421, 1.8421052631578947]
[1.6421052631578947, 1.6526315789473685, 1.6631578947368422, 1.6736842105263159, 1.6842105263157894, 1.694736842105263, 1.7052631578947368, 1.7157894736842105, 1.7263157894736842, 1.736842105263158, 1.7473684210526317, 1.7578947368421052, 1.768421052631579, 1.7789473684210526, 1.7894736842105263, 1.8, 1.8105263157894738, 1.8210526315789475, 1.831578947368421, 1.8421052631578947]
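This makes sense: with num_counties=4, all four target counties start with zero open hospitals, so both strategies add exactly i hospitals in total and the two lists are identical (the earlier five-county run came out lower only because Bledsoe already had one open hospital). A quick arithmetic check against the baseline mean:

# Each added hospital raises the mean over the 95 counties by 1/95.
baseline = disparities['num_hospitals_open'].mean()
assert all(abs(m - (baseline + i / 95)) < 1e-9 for i, m in enumerate(means_weak))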

Question 2: What is the relationship between areas without access and ER visits?

Do areas that lack access to healthcare tend to have higher rates of emergency department visits or hospitalizations? Is there an association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits?

Reordering disparities in order of least access:

disparities = disparities.sort_values(['pcp_per_100k','population'], ascending = [True,False]).reset_index()
disparities.columns
Index(['FIPS', 'year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
       'primary_race', 'primary_eligibility', 'primary_denominator_well',
       'analysis_value_well', 'primary_denominator_hosp',
       'analysis_value_hosp', 'primary_denominator_emer',
       'analysis_value_emer', 'laus_code', 'Name', 'Employed', 'Unemployed',
       'unemployment_rate', 'FIPS.1', 'county', 'primary_care_physicians',
       'population', 'urban', 'shadac_category', 'pcp_per_100k',
       'num_hospitals', 'num_hospitals_open', 'mean_beds', 'num_beds'],
      dtype='object')
plt.figure()
sns.scatterplot(x = disparities.pcp_per_100k, y = disparities.analysis_value_emer, hue = disparities.shadac_category)
plt.ylabel("Emergency department visits per 1K")
plt.xlabel("PCP per 100k")

Text(0.5, 0, 'PCP per 100k')
plt.figure()
# Use counties.shadac_category for the hue so it aligns with the x/y rows
# (disparities was re-sorted above, so its rows no longer line up with counties).
sns.scatterplot(x = counties['Median Household Income'], y = counties.analysis_value_emer, hue = counties.shadac_category)
plt.ylabel("Emergency Visits per 1K")
plt.xlabel("Median Household Income")

Text(0.5, 0, 'Median Household Income')
sns.boxplot(x = counties.urban, y = counties.analysis_value_emer)
plt.ylabel("Emergency visits per 1K")

Text(0, 0.5, 'Emergency visits per 1K')

Association between the percentage of beneficiaries who had an annual wellness visit and rate of hospitalizations or emergency department visits

sns.lmplot(data = disparities, y = "analysis_value_hosp", x = "analysis_value_well")
plt.ylabel("Hospitalizations per 1K")
plt.xlabel("Percentage annual wellness visits")

Text(0.5, 21.70625000000002, 'Percentage annual wellness visits')
emer_per_1k = disparities.analysis_value_emer
hosp_per_1k = disparities.analysis_value_hosp
well_rate = disparities.analysis_value_well
rates = pd.DataFrame([emer_per_1k, hosp_per_1k, well_rate, disparities.pcp_per_100k, disparities.shadac_category, counties['Median Household Income']], index=['Emergency Visits','Hospitalizations','Wellness Visits', 'pcp', 'shadac', 'Income']).T
sns.pairplot(rates, hue = 'shadac', corner = True)
<seaborn.axisgrid.PairGrid at 0x1a5eb4190>
rates[['Emergency Visits','Hospitalizations','Wellness Visits','pcp','Income']] = rates[['Emergency Visits','Hospitalizations','Wellness Visits','pcp','Income']].astype(float)
rates.drop('shadac', axis = 1)
Emergency Visits Hospitalizations Wellness Visits pcp Income
0 622 289 25 0.0 47508
1 651 285 34 0.0 38879
2 1000 401 20 0.0 60874
3 759 304 31 0.0 57196
4 750 313 19 6.740361 46565
... ... ... ... ... ...
90 559 234 56 112.772823 39859
91 584 261 52 126.711238 50201
92 666 255 43 132.138284 50427
93 469 201 48 149.962953 50819
94 504 256 57 176.831892 41968

95 rows × 5 columns

rates.drop('shadac', axis = 1).astype(float).corr().style.background_gradient('coolwarm')


  Emergency Visits Hospitalizations Wellness Visits pcp Income
Emergency Visits 1.000000 0.599412 -0.480607 -0.396181 0.153085
Hospitalizations 0.599412 1.000000 -0.480732 -0.308934 0.126632
Wellness Visits -0.480607 -0.480732 1.000000 0.488941 -0.073377
pcp -0.396181 -0.308934 0.488941 1.000000 -0.080277
Income 0.153085 0.126632 -0.073377 -0.080277 1.000000
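The correlation matrix above is computed on only 95 counties, so it is worth checking whether these correlations are statistically distinguishable from zero. A minimal sketch, assuming scipy is installed and the rates frame has no missing values:

from scipy import stats

# pearsonr returns the correlation and its two-sided p-value
numeric = rates.drop('shadac', axis = 1).astype(float)
for col in ['Hospitalizations', 'Wellness Visits', 'pcp', 'Income']:
    r, p = stats.pearsonr(numeric['Emergency Visits'], numeric[col])
    print(f"Emergency Visits vs {col}: r = {r:.3f}, p = {p:.4f}")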
counties.to_csv('../data/counties_full.csv')
disparities.to_csv('../data/disparities.csv')

Question 3: Which subpopulations need more attention?

Finally, TN Med Helper is trying to identify specific subpopulations to focus more attention on. Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year. Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year. Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
disparities = pd.read_csv('../data/disparities.csv',header = 0, index_col = 0)
disparities.head()
FIPS year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
0 47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 ... Crockett 0.0 14399 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
1 47061 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 34 ... Grundy 0.0 13344 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
2 47095 2019 All All Dual & non-dual TENNESSEE All All 500-999 20 ... Lake 0.0 7401 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
3 47175 2019 All All Dual & non-dual TENNESSEE All All 500-999 31 ... Van Buren 0.0 5760 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
4 47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 ... Bledsoe 1.0 14836 Rural low inadequate 6.740361 1.0 1.0 25.0 25.0

5 rows × 30 columns

The file brfss.csv contains a subset of the responses and variables from the 2019 Behavioral Risk Factor Surveillance System (BRFSS). This dataset can be downloaded using this link: https://drive.google.com/file/d/1acJKmT2aFf2nZl_VYLE897yx0LPNajoY/view?usp=sharing.

A detailed Codebook can be found here.

Our target variable is the CHECKUP1 column, which contains the responses to the question “About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]” Specifically, we want to try and predict if someone gives an answer besides “Within past year (anytime less than 12 months ago)”.

brfss = pd.read_csv('../data/brfss.csv', header = 0)
brfss.head()
GENHLTH HLTHPLN1 PERSDOC2 MEDCOST CHECKUP1 _RFHYPE5 TOLDHI2 CVDINFR4 CVDCRHD4 CVDSTRK3 ... EXERANY2 _METSTAT _URBSTAT _IMPRACE _RFBMI5 _RFSMOK3 _RFBING5 _RFDRHV7 _TOTINDA target
0 Good Yes Yes, only one No Within past year (anytime less than 12 months ... Yes Yes No No No ... No Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) Black, Non-Hispanic Yes No No No No physical activity or exercise in last 30 days True
1 Fair Yes Yes, only one No Within past year (anytime less than 12 months ... No No No No No ... Yes Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) White, Non-Hispanic No No No No Had physical activity or exercise True
2 Good Yes More than one No Within past year (anytime less than 12 months ... Yes No No No No ... Yes Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) Black, Non-Hispanic Yes No No No Had physical activity or exercise True
3 Very good Yes Yes, only one No Within past year (anytime less than 12 months ... No No No No No ... Yes Nonmetropolitan counties (_URBNRRL = 5,6) Rural counties (_URBNRRL = 6) White, Non-Hispanic Yes Yes No No Had physical activity or exercise True
4 Poor Yes Yes, only one No Within past year (anytime less than 12 months ... No Yes No No No ... No Metropolitan counties (_URBNRRL = 1,2,3,4) Urban counties (_URBNRRL = 1,2,3,4,5) White, Non-Hispanic No No No No No physical activity or exercise in last 30 days True

5 rows × 40 columns

First, create a new column, “target”, by converting this to a binary outcome. (Note that, as coded below, target is True when the respondent did have a checkup within the past year; we invert it at the end when mapping predictions to counties.) After you do this, drop the CHECKUP1 column from your dataframe so that you don’t accidentally make predictions based off of it.

set(brfss.CHECKUP1)
{'5 or more years ago',
 'Never',
 'Within past 2 years (1 year but less than 2 years ago)',
 'Within past 5 years (2 years but less than 5 years ago)',
 'Within past year (anytime less than 12 months ago)'}
target = []
for x in brfss.CHECKUP1:
    if x == 'Within past year (anytime less than 12 months ago)':
        target.append(True)
    else: target.append(False)
brfss['target'] = target
brfss = brfss.drop('CHECKUP1', axis = 1)
brfss.target.value_counts()
True     215875
False     46174
Name: target, dtype: int64
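As an aside, the loop above can be replaced with a single vectorized comparison. A sketch, assuming it is run before CHECKUP1 is dropped:

# Vectorized equivalent of the loop: True when the respondent had a checkup within the past year
brfss['target'] = brfss.CHECKUP1 == 'Within past year (anytime less than 12 months ago)'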

Using data from the Behavioral Risk Factor Surveillance System, build a model to predict whether an individual has not had a checkup in the last year.

Then, experiment with making a logistic regression model to predict the target variable using one or more of the other columns. Note that you will need to convert the predictor columns into dummy variables prior to fitting a model. What do you find?

  • GENHLTH: Would you say that in general your health is
  • HLTHPLN1: Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
  • PERSDOC2: Do you have one person you think of as your personal doctor or health care provider?
  • MEDCOST: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
  • _RFHYPE5: Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional
  • TOLDHI2: Have you ever been told by a doctor, nurse or other health professional that your blood cholesterol is high?
  • CVDINFR4: (Ever told) you had a heart attack, also called a myocardial infarction?
  • CVDCRHD4: (Ever told) you had angina or coronary heart disease?
  • CVDSTRK3: (Ever told) you had a stroke.
  • ASTHMA3: (Ever told) you had asthma?
  • CHCSCNCR: (Ever told) you had skin cancer?
  • CHCOCNCR: (Ever told) you had other cancer?
  • CHCCOPD2: (Ever told) you had COPD?
  • ADDEPEV3: (Ever told) (you had) a depressive disorder
  • CHCKDNY2: were you ever told you had kidney disease?
  • DIABETE4: (Ever told) diabetes?
  • HAVARTH4: Arthritis
  • MARITAL: Marital status
  • EDUCA: What is the highest grade or year of school you completed?
  • RENTHOM1: Own or rent home
  • VETERAN3: Veteran?
  • EMPLOY1: Employment
  • INCOME2: Income categories
  • DEAF
  • BLIND
  • DECIDE: Trouble concentrating/deciding
  • DIFFWALK: Difficulty walking
  • DIFFDRES: Difficulty dressing
  • DIFFALON: Difficulty errands alone
  • EXERANY2: Exercise in past month
  • _METSTAT: Metropolitan Status
  • _URBSTAT: Urban status
  • _IMPRACE: Imputed race/ethnicity value
  • _RFBMI5: Adults who have a body mass index greater than 25.00 (Overweight or Obese)
  • _RFSMOK3: Adults who are current smokers
  • _RFBING5: Binge drinkers
  • _RFDRHV7: Heavy drinkers
  • _TOTINDA: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
  • target

We only want categorical variables for which we have county-level data, so we need to look at the County Health Rankings data to figure out which of these variables we have percentages for. For example, we can use _IMPRACE because we have the percentages of each race by TN county.

In the County Health Rankings Data:

  1. GENHLTH == “Poor or fair health” (percentage)
  2. HLTHPLN1 == “% Uninsured”
  3. PERSDOC2 == ratio primary care physicians?
  4. DIABETES4 == “diabetes prevalence”
  5. _RFBING5 & _RFDRHV7 == Binge and heavy drinkers == “Excessive drinking”
  6. RENTHOM1 == “Homeownership”
  7. _IMPRACE = race %
  8. _URBSTAT = ‘rural’
  9. MARITAL ~ % single-parent households
  10. EDUCA == “High school graduation rate”, “% some college”
  11. EMPLOY1 = % unemployed
  12. INCOME2 = household income
  13. _TOTINDA ~ % physically inactive
  14. _RFBMI5 (BMI > 25) ~ “% adults with obesity” (BMI > 30), only an approximate match
  15. _RFSMOK3 == ‘% smokers’
  16. ADDEPEV3 ~ frequent mental distress
categorical_variables = ['GENHLTH', 'HLTHPLN1', 'PERSDOC2', 'DIABETE4','_RFBING5','_RFDRHV7','RENTHOM1','_IMPRACE','_URBSTAT','EDUCA','EMPLOY1','INCOME2','_TOTINDA','_RFSMOK3']
predict = pd.get_dummies(brfss, columns = categorical_variables)
predict.head()
MEDCOST _RFHYPE5 TOLDHI2 CVDINFR4 CVDCRHD4 CVDSTRK3 ASTHMA3 CHCSCNCR CHCOCNCR CHCCOPD2 ... INCOME2_Less than $15,000 ($10,000 to less than $15,000) INCOME2_Less than $20,000 ($15,000 to less than $20,000) INCOME2_Less than $25,000 ($20,000 to less than $25,000) INCOME2_Less than $35,000 ($25,000 to less than $35,000) INCOME2_Less than $50,000 ($35,000 to less than $50,000) INCOME2_Less than $75,000 ($50,000 to less than $75,000) _TOTINDA_Had physical activity or exercise _TOTINDA_No physical activity or exercise in last 30 days _RFSMOK3_No _RFSMOK3_Yes
0 No Yes Yes No No No No No No No ... 0 1 0 0 0 0 0 1 1 0
1 No No No No No No No No No No ... 0 0 0 1 0 0 1 0 1 0
2 No Yes No No No No No No No No ... 0 0 0 0 0 1 1 0 1 0
3 No No No No No No Yes No No Yes ... 0 0 0 0 0 1 1 0 0 1
4 No No Yes No No No Yes No No Yes ... 0 1 0 0 0 0 0 1 1 0

5 rows × 80 columns

# Drop the raw survey columns that were not converted to dummies, keeping only target and the dummy variables
predict = predict.drop(predict.columns[0:24], axis = 1)

First, we need to recode these dummy variables so that they line up with the county-level measures, so that when we apply the model later we can feed it the same variables.

  • GENHLTH:
    • Change GENHLTH columns to “GENHLTH_Poor_or_Fair”
    • Matched county % poor or fair health in Ranked Measure data
  • HLTHPLN1
    • Matches inverse of county % uninsured in Additional Measure data
  • DIABETE4
    • Change DIABETE4 columns to combine Yes’s and No’s
    • Matches % adults with diabetes in Additional Measure data
  • _RFBING5 and _RFDRHV7
    • Combine so binge and heavy drinking are one category
    • Matches excessive drinking in Ranked Measure data
  • RENTHOM1
    • Own == 1, else == 0
    • Matches % homeowners
  • _IMPRACE
    • Should be able to keep the same
    • Matches demographic %s in Additional Measure Data
  • _URBSTAT
    • replace with Boolean for rural
    • matches inverse of % rural
  • EDUCA
    • “Grade 12 or GED (High school graduate)”
    • matches “High school graduation rate” in ranked measure data
    • “College 1 year to 3 years (Some college or technical school)” or “College 4 years or more (College graduate)” == % some college
  • EMPLOY1
    • “Out of work for 1 year or more” or “Out of work for less than 1 year” or “Unable to work” == % unemployment rate in Ranked measure data
  • INCOME2:
    • group “Household Income (AIAN)” into same groups as above
  • _TOTINDA
    • only “No physical activity…”
    • % physically inactive
  • _RFSMOK3_Yes
    • % smokers
X = predict.copy()

Rural

X['Rural'] = [{ "Urban counties (_URBNRRL = 1,2,3,4,5)":0,"Rural counties (_URBNRRL = 6)":1}[i] for i in brfss._URBSTAT]
X = X.drop([i for i in X.columns if "URBSTAT" in str(i)], axis = 1)

Diabetes

X.loc[(X['DIABETE4_Yes, but female told only during pregnancy'] == 1) |
      (X["DIABETE4_Yes"]==1), 'Diabetes'] = 1
X.loc[(X['DIABETE4_Yes, but female told only during pregnancy'] == 0) &
      (X["DIABETE4_Yes"]==0), 'Diabetes'] = 0
X = X.drop([i for i in X.columns if "DIABETE4" in str(i)], axis = 1)

GENHLTH

X.loc[(X['GENHLTH_Fair'] == 1) |
      (X["GENHLTH_Poor"]==1), 'General_health'] = 1
X.loc[(X['GENHLTH_Fair'] == 0) &
      (X["GENHLTH_Poor"]==0), 'General_health'] = 0

X = X.drop([i for i in X.columns if "GENHLTH" in str(i)], axis = 1)

HLTHPLN1

X['Health_plan'] = X.HLTHPLN1_Yes
X = X.drop([i for i in X.columns if "HLTHPLN1" in str(i)], axis = 1)

_RFBING5 and _RFDRHV7

X.loc[(X._RFDRHV7_Yes == 1) |
      (X._RFBING5_Yes==1), 'Excessive_drinking'] = 1
X.loc[(X['_RFDRHV7_Yes'] == 0) &
      (X["_RFBING5_Yes"]==0), 'Excessive_drinking'] = 0
X = X.drop([i for i in X.columns if "_RFDRHV" in str(i)], axis = 1)
X = X.drop([i for i in X.columns if "_RFBING" in str(i)], axis = 1)

RENTHOM1

X['Owns_home'] = X['RENTHOM1_Own']
X = X.drop([i for i in X.columns if "RENTHOM1" in str(i)], axis = 1)

EDUCA

X['High_school_grad'] = X['EDUCA_Grade 12 or GED (High school graduate)']
X.loc[(X['EDUCA_College 1 year to 3 years (Some college or technical school)'] == 1) |
      (X['EDUCA_College 4 years or more (College graduate)']==1), 'College'] = 1
X.loc[(X['EDUCA_College 1 year to 3 years (Some college or technical school)'] == 0) &
      (X["EDUCA_College 4 years or more (College graduate)"]==0), 'College'] = 0
X = X.drop([i for i in X.columns if "EDUCA" in str(i)], axis = 1)

EMPLOY1

X.loc[(X['EMPLOY1_Out of work for 1 year or more'] == 1) |
      (X['EMPLOY1_Out of work for less than 1 year']==1) |
      (X['EMPLOY1_Unable to work']==1), 'Unemployed'] = 1
X.loc[(X['EMPLOY1_Out of work for 1 year or more'] == 0) &
      (X['EMPLOY1_Out of work for less than 1 year']==0) &
      (X['EMPLOY1_Unable to work']==0), 'Unemployed'] = 0
X = X.drop([i for i in X.columns if "EMPLOY1" in str(i)], axis = 1)

_TOTINDA

X['Inactivity'] = X['_TOTINDA_No physical activity or exercise in last 30 days']
X = X.drop([i for i in X.columns if "_TOTINDA" in str(i)], axis = 1)

_RFSMOK3_Yes

X['Smoker'] = X._RFSMOK3_Yes
X = X.drop([i for i in X.columns if "_RFSMOK3" in str(i)], axis = 1)
X = X.drop([i for i in X.columns if "PERSDOC" in str(i)], axis = 1)
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262049 entries, 0 to 262048
Data columns (total 26 columns):
 #   Column                                                    Non-Null Count   Dtype  
---  ------                                                    --------------   -----  
 0   target                                                    262049 non-null  bool   
 1   _IMPRACE_American Indian/Alaskan Native, Non-Hispanic     262049 non-null  uint8  
 2   _IMPRACE_Asian, Non-Hispanic                              262049 non-null  uint8  
 3   _IMPRACE_Black, Non-Hispanic                              262049 non-null  uint8  
 4   _IMPRACE_Hispanic                                         262049 non-null  uint8  
 5   _IMPRACE_Other race, Non-Hispanic                         262049 non-null  uint8  
 6   _IMPRACE_White, Non-Hispanic                              262049 non-null  uint8  
 7   INCOME2_$75,000 or more                                   262049 non-null  uint8  
 8   INCOME2_Less than $10,000                                 262049 non-null  uint8  
 9   INCOME2_Less than $15,000 ($10,000 to less than $15,000)  262049 non-null  uint8  
 10  INCOME2_Less than $20,000 ($15,000 to less than $20,000)  262049 non-null  uint8  
 11  INCOME2_Less than $25,000 ($20,000 to less than $25,000)  262049 non-null  uint8  
 12  INCOME2_Less than $35,000 ($25,000 to less than $35,000)  262049 non-null  uint8  
 13  INCOME2_Less than $50,000 ($35,000 to less than $50,000)  262049 non-null  uint8  
 14  INCOME2_Less than $75,000 ($50,000 to less than $75,000)  262049 non-null  uint8  
 15  Rural                                                     262049 non-null  int64  
 16  Diabetes                                                  262049 non-null  float64
 17  General_health                                            262049 non-null  float64
 18  Health_plan                                               262049 non-null  uint8  
 19  Excessive_drinking                                        262049 non-null  float64
 20  Owns_home                                                 262049 non-null  uint8  
 21  High_school_grad                                          262049 non-null  uint8  
 22  College                                                   262049 non-null  float64
 23  Unemployed                                                262049 non-null  float64
 24  Inactivity                                                262049 non-null  uint8  
 25  Smoker                                                    262049 non-null  uint8  
dtypes: bool(1), float64(5), int64(1), uint8(19)
memory usage: 17.0 MB
X.to_csv("../data/predictors.csv")
y = X.pop('target')

The main objective is to have a model which makes good predictions on unseen data. Therefore, in order to evaluate how good a model is, it is necessary to set aside some data as a test set for evaluation purposes. This can be accomplished using the train_test_split function.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify = y,     # Keep the same proportions of the target in the training and test data
                                                    test_size = 0.25,
                                                    random_state = 321)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()         # Create a logistic regression model
logreg.fit(X_train, y_train)          # Fit it to the training data
LogisticRegression()

To understand the model, we can look at its coefficients.

logreg.intercept_
array([-0.06956167])
logreg.coef_
array([[-0.15000703, -0.12631886,  0.53903011, -0.1734734 , -0.12965132,
        -0.02906713, -0.17897394,  0.02512163,  0.13639016,  0.09111363,
         0.06724727, -0.03598943, -0.04823376, -0.12616319, -0.09815453,
         1.19410076,  0.44202482,  1.4508977 , -0.35476091,  0.4756881 ,
         0.01472026, -0.00793255,  0.15622678, -0.04213603, -0.36442622]])
y_pred_prob = logreg.predict_proba(X_test)[:,1]
i = 120

print('Patient Information:\n{}'.format(X_test.iloc[i]))
print('---------------------------------')
print('Predicted Probability of Checkup in Last Year: {}'.format(y_pred_prob[i]))
print('Actual: {}'.format(y_test.iloc[i]))
Patient Information:
_IMPRACE_American Indian/Alaskan Native, Non-Hispanic       0.0
_IMPRACE_Asian, Non-Hispanic                                0.0
_IMPRACE_Black, Non-Hispanic                                0.0
_IMPRACE_Hispanic                                           0.0
_IMPRACE_Other race, Non-Hispanic                           0.0
_IMPRACE_White, Non-Hispanic                                1.0
INCOME2_$75,000 or more                                     1.0
INCOME2_Less than $10,000                                   0.0
INCOME2_Less than $15,000 ($10,000 to less than $15,000)    0.0
INCOME2_Less than $20,000 ($15,000 to less than $20,000)    0.0
INCOME2_Less than $25,000 ($20,000 to less than $25,000)    0.0
INCOME2_Less than $35,000 ($25,000 to less than $35,000)    0.0
INCOME2_Less than $50,000 ($35,000 to less than $50,000)    0.0
INCOME2_Less than $75,000 ($50,000 to less than $75,000)    0.0
Rural                                                       0.0
Diabetes                                                    0.0
General_health                                              0.0
Health_plan                                                 1.0
Excessive_drinking                                          0.0
Owns_home                                                   1.0
High_school_grad                                            0.0
College                                                     1.0
Unemployed                                                  1.0
Inactivity                                                  0.0
Smoker                                                      0.0
Name: 232982, dtype: float64
---------------------------------
Predicted Probability of Checkup in Last Year: 0.857817177035579
Actual: True

One way to analyze your model is to look at the receiver operating characteristic (ROC) curve. This shows how the true positive rate and false positive rate change as the prediction threshold changes.

The area under this curve (AUC) can be interpreted as how likely the model is to assign a higher probability to a positive observation than to a negative one.

from sklearn.metrics import roc_auc_score, roc_curve
fp_rate, tp_rate, thresholds = roc_curve(y_test == 1, y_pred_prob)

plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC Curve for Checkup Prediction')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

AUC: the percentage of the ROC plot that is under the curve.

A perfect model would have AUC = 1; a model that predicts at random would have AUC = 0.5.

roc_auc_score(y_test, y_pred_prob)
0.6789666785879891

Analyzing Predictions

To generate predictions, you can use the predict method of your model.

from sklearn.metrics import confusion_matrix, classification_report
from cm import plot_confusion_matrix  # plot_confusion_matrix comes from a local cm.py helper
y_pred = logreg.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[ 1052, 10492],
       [  772, 53197]])
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.58      0.09      0.16     11544
        True       0.84      0.99      0.90     53969

    accuracy                           0.83     65513
   macro avg       0.71      0.54      0.53     65513
weighted avg       0.79      0.83      0.77     65513
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'])

There are a number of metrics you can use to evaluate your model.

Accuracy: the total proportion of predictions which are correct.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

Sensitivity/True Positive Rate: The proportion of true positives (in our case, people who did have a checkup within the past year) that are identified as such.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'sensitivity')

Specificity/True Negative Rate: The proportion of true negatives (in our case, people who did not have a checkup within the past year) that are identified as such.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'specificity')

Precision: The proportion of predicted positives that are actually positive (actually had a checkup).

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'precision')
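Since plot_confusion_matrix is a local helper, the same four metrics can be cross-checked directly from sklearn's confusion matrix. A minimal sketch:

# With labels = [False, True], the confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels = [False, True]).ravel()
print('Accuracy:   ', (tp + tn) / (tp + tn + fp + fn))
print('Sensitivity:', tp / (tp + fn))   # true positive rate
print('Specificity:', tn / (tn + fp))   # true negative rate
print('Precision:  ', tp / (tp + fp))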

The logistic regression above already used all of the predictors, but we fit it with the default solver settings.

With so many predictors, it takes longer for the coefficients to converge. Let's increase the number of iterations.

logreg = LogisticRegression(max_iter = 10000)
logreg.fit(X_train, y_train)
LogisticRegression(max_iter=10000)
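An alternative to raising max_iter is to standardize the predictors, which usually lets the solver converge in far fewer iterations. A sketch using a pipeline (the name scaled_logreg is our own):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing the columns before fitting typically speeds up convergence
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
scaled_logreg.fit(X_train, y_train)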
y_test
149338     True
201150     True
63901      True
17995     False
192704     True
          ...  
218617     True
181851     True
54798      True
177920    False
22276      True
Name: target, Length: 65513, dtype: bool

Let’s take a look at the coefficients.

Caution: Our variables are on vastly different scales, so do not interpret the value of a coefficient as the importance.

coefficients = pd.DataFrame({
    'variable': X.columns,
    'coefficient': logreg.coef_[0]
})
importances = pd.DataFrame({'variable': coefficients.variable,
                           'importance': coefficients.coefficient}).sort_values('importance', ascending = False)


fig, ax = plt.subplots(figsize = (7,8))
sns.barplot(data = importances,
            x = 'importance', y = 'variable', ax = ax, edgecolor = 'black')
plt.title('Logistic Regression Coefficients');
fig, ax = plt.subplots(figsize = (7,5))
sns.barplot(data = coefficients,
            x = 'coefficient', 
            y = 'variable', 
            ax = ax, 
            edgecolor = 'black')
plt.title('Logistic Regression Coefficients')

ymin, ymax = plt.ylim()
plt.vlines(x = 0, ymin = ymin, ymax = ymax);

Try using regularization or a tree-based model to improve your model’s performance on the BRFSS dataset.
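On the regularization side, here is a quick sketch of an L1-penalized logistic regression (the value C = 0.1 is an arbitrary starting point, not a tuned choice):

# L1 regularization shrinks uninformative coefficients to exactly zero;
# smaller C means stronger regularization. The liblinear solver supports the l1 penalty.
logreg_l1 = LogisticRegression(penalty = 'l1', C = 0.1, solver = 'liblinear')
logreg_l1.fit(X_train, y_train)
print(roc_auc_score(y_test, logreg_l1.predict_proba(X_test)[:,1]))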


from sklearn.tree import DecisionTreeClassifier

A decision tree model is fit in much the same way as a logistic regression model.

tree = DecisionTreeClassifier(random_state = 321).fit(X_train, y_train)

First, let’s see how it does on the training data.

plot_confusion_matrix(y_train, tree.predict(X_train), labels = ['No', 'Yes'], metric = 'accuracy')
y_train.value_counts()
True     161906
False     34630
Name: target, dtype: int64

Since decision trees are so flexible, they will often perfectly predict the training data. Remember though, that what we care about is how well the model generalizes.

y_pred = tree.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')
roc_auc_score(y_test, tree.predict_proba(X_test)[:,1])
0.6345258214120431

We do see a drop in model performance on the test data. This is a sign of overfitting.

To correct for this problem, we can take an ensemble approach, which means that we will build many decision trees on subsets of the features and data and then average the predictions of all of the trees. This will force our model to try and find more general patterns that will work on the test set.

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state = 321)
forest.fit(X_train, y_train)
RandomForestClassifier(random_state=321)
y_pred = forest.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')
roc_auc_score(y_test, forest.predict_proba(X_test)[:,1])
0.6501432873215749

We see some improvement over using a single tree, but we could do better. Random forests have a lot of hyperparameters that can be tuned to improve our model. Here are a few of these parameters:

  • n_estimators: Number of decision trees to train. The default is 100. More trees = less variance, but slower to train and predict.
  • max_depth: Maximum depth (number of splits). By default, there is no max depth.
  • min_samples_leaf: Minimum number of samples per leaf. Setting this higher keeps the decision trees from paying too much attention to any single data point.

These parameters can be tuned to try to improve the model that you get, and there are ways to automatically tune these parameters. See, for example, sklearn’s GridSearchCV or RandomizedSearchCV.
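For instance, here is a minimal GridSearchCV sketch (the grid values are arbitrary starting points, not tuned recommendations):

from sklearn.model_selection import GridSearchCV

# Search over a small grid, scoring each candidate by cross-validated AUC
param_grid = {'max_depth': [5, 10, 20],
              'min_samples_leaf': [1, 5, 25]}
search = GridSearchCV(RandomForestClassifier(random_state = 321),
                      param_grid, scoring = 'roc_auc', cv = 3, n_jobs = -1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)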

forest = RandomForestClassifier(n_estimators = 1000, max_depth = 5, min_samples_leaf = 5, random_state = 321)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, min_samples_leaf=5, n_estimators=1000,
                       random_state=321)
y_pred = forest.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'accuracy')

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'sensitivity')

Specificity/True Negative Rate: The proportion of true negatives (in our case, people who did not have a checkup within the past year) that are identified as such.

plot_confusion_matrix(y_test, y_pred, labels = ['No', 'Yes'], metric = 'specificity')
roc_auc_score(y_test, forest.predict_proba(X_test)[:,1])
0.678213761501158
fp_rate, tp_rate, thresholds = roc_curve(y_test == 1, forest.predict_proba(X_test)[:,1])

plt.plot(fp_rate, tp_rate)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC Curve for Checkup Prediction')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

A nice perk of using random forest models is that we can see which features are the most important in making predictions.

Caution: Feature importances tell which features the model is using, but not how it is using those features.

importances = pd.DataFrame({'variable': X.columns,
                           'importance': forest.feature_importances_}).sort_values('importance', ascending = False)


fig, ax = plt.subplots(figsize = (7,8))
sns.barplot(data = importances,
            x = 'importance', y = 'variable', ax = ax, edgecolor = 'black')
plt.title('Random Forest Feature Importance');
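As a complement to the impurity-based importances above, permutation importance measures how much the test-set score drops when a single column is shuffled. A sketch, assuming scikit-learn 0.22 or later:

from sklearn.inspection import permutation_importance

# Shuffle each column a few times and record the mean drop in test-set AUC
perm = permutation_importance(forest, X_test, y_test, scoring = 'roc_auc',
                              n_repeats = 5, random_state = 321)
perm_importances = pd.Series(perm.importances_mean, index = X.columns).sort_values()
perm_importances.plot(kind = 'barh', figsize = (7,8));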

Apply this model to the counties you identified above to predict how likely it is that the average person from those counties has not had a checkup in the last year.

df1 = pd.read_csv('../data/County_Health_Rankings/Ranked_measure_data_TN.csv', header = 0, index_col = 0)
df1.head()
State County Deaths Years of Potential Life Lost Rate 95% CI - Low 95% CI - High Quartile YPLL Rate (AIAN) YPLL Rate (AIAN) 95% CI - Low YPLL Rate (AIAN) 95% CI - High ... % Drive Alone (Hispanic) 95% CI - Low % Drive Alone (Hispanic) 95% CI - High % Drive Alone (White) % Drive Alone (White) 95% CI - Low % Drive Alone (White) 95% CI - High # Workers who Drive Alone % Long Commute - Drives Alone 95% CI - Low.18 95% CI - High.18 Quartile.34
FIPS
47000 Tennessee NaN 106691 9285 9202 9368 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 3003226 35 34 35 NaN
47001 Tennessee Anderson 1369 10009 9172 10846 2.0 NaN NaN NaN ... 63.0 96.0 90.0 89.0 92.0 32206 36 32 39 2.0
47003 Tennessee Bedford 859 10294 9268 11319 2.0 NaN NaN NaN ... 54.0 74.0 80.0 76.0 84.0 21250 35 31 38 2.0
47005 Tennessee Benton 428 14479 12233 16726 4.0 NaN NaN NaN ... NaN NaN NaN NaN NaN 5448 28 23 33 1.0
47007 Tennessee Bledsoe 229 8126 6427 9824 1.0 NaN NaN NaN ... NaN NaN NaN NaN NaN 5603 53 43 62 4.0

5 rows × 239 columns

df2 = pd.read_csv('../data/County_Health_Rankings/Addition_measure_data_TN.csv', header = 0, index_col = 0)
df2.head()
State County Life Expectancy 95% CI - Low 95% CI - High Life Expectancy (AIAN) Life Expectancy (AIAN) 95% CI - Low Life Expectancy (AIAN) 95% CI - High Life Expectancy (Asian) Life Expectancy (Asian) 95% CI - Low ... % Hispanic # Non-Hispanic White % Non-Hispanic White # Not Proficient in English % Not Proficient in English 95% CI - Low.19 95% CI - High.19 % Female # Rural % Rural
FIPS
47000 Tennessee NaN 76.0 75.9 76.1 NaN NaN NaN NaN NaN ... 5.6 4989758 73.7 94554 2 1.0 2.0 51.2 2132860 33.6
47001 Tennessee Anderson 75.5 74.9 76.2 NaN NaN NaN NaN NaN ... 3.1 68116 89.1 619 1 0.0 1.0 51.3 26041 34.7
47003 Tennessee Bedford 74.3 73.5 75.1 NaN NaN NaN NaN NaN ... 12.6 37556 76.6 1474 3 2.0 4.0 50.9 25053 55.6
47005 Tennessee Benton 71.7 70.2 73.2 NaN NaN NaN NaN NaN ... 2.5 14970 92.5 25 0 0.0 1.0 51.0 12937 78.5
47007 Tennessee Bledsoe 78.0 76.4 79.6 NaN NaN NaN NaN NaN ... 2.5 13038 88.4 245 2 0.0 3.0 41.1 12876 100.0

5 rows × 269 columns

As a reminder, here is how each recoded variable matches up with the county-level measures, so that the model can be applied using the same variables.

  • GENHLTH:
    • Matched county % poor or fair health in Ranked Measure data
  • HLTHPLN1
    • Matches inverse of county % uninsured in Additional Measure data
  • DIABETE4
    • Matches % adults with diabetes in Additional Measure data
  • _RFBING5 and _RFDRHV7
    • Matches excessive drinking in Ranked Measure data
  • RENTHOM1
    • Matches % homeowners in Additional
  • _IMPRACE
    • Matches demographic %s in Additional Measure Data
  • _URBSTAT
    • matches inverse of % rural in additional
  • EDUCA
    • “Grade 12 or GED (High school graduate)”
    • matches “High school graduation rate” in ranked measure data
    • “College 1 year to 3 years (Some college or technical school)” or “College 4 years or more (College graduate)” == % some college
  • EMPLOY1
    • “Out of work for 1 year or more” or “Out of work for less than 1 year” or “Unable to work” == % unemployment rate in Ranked measure data
  • INCOME2:
    • group “Household Income (AIAN)” into same groups as above in additional
  • _TOTINDA
    • % physically inactive in ranked data
  • _RFSMOK3_Yes
    • % smokers in ranked
# [print(i) for i in df2 if "%" in str(i)]
df1 = df1[['% Fair or Poor Health', "% Excessive Drinking","High School Graduation Rate","% Some College","% Unemployed","% Physically Inactive","% Smokers"]]
df2 = df2[["% Uninsured","% Adults with Diabetes","% Homeowners","% Rural","Median Household Income",
     "% Non-Hispanic White", "% Black","% American Indian & Alaska Native","% Asian","% Native Hawaiian/Other Pacific Islander","% Hispanic"]]

We now need to convert these columns to the ones in X.

county_demo = df1.join(df2)
#everything is a percentage except Income, so divide by 100 and then fix income column
county_demo = county_demo/100
county_demo['Median Household Income'] *=100 
county_demo
% Fair or Poor Health % Excessive Drinking High School Graduation Rate % Some College % Unemployed % Physically Inactive % Smokers % Uninsured % Adults with Diabetes % Homeowners % Rural Median Household Income % Non-Hispanic White % Black % American Indian & Alaska Native % Asian % Native Hawaiian/Other Pacific Islander % Hispanic
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.14 0.13 0.66 0.336 52366.0 0.737 0.167 0.005 0.019 0.001 0.056
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.13 0.11 0.68 0.347 50672.0 0.891 0.039 0.005 0.015 0.001 0.031
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.18 0.11 0.68 0.556 49860.0 0.766 0.076 0.011 0.011 0.002 0.126
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.15 0.14 0.76 0.785 38940.0 0.925 0.024 0.006 0.006 0.000 0.025
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.18 0.10 0.74 1.000 40195.0 0.884 0.071 0.006 0.003 0.000 0.025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.14 0.15 0.81 1.000 38879.0 0.899 0.063 0.004 0.003 0.000 0.021
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.12 0.11 0.66 0.670 42351.0 0.868 0.075 0.004 0.012 0.000 0.027
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.14 0.22 0.81 0.782 42603.0 0.930 0.018 0.005 0.004 0.001 0.028
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.08 0.08 0.81 0.194 115930.0 0.843 0.043 0.003 0.047 0.001 0.049
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.10 0.11 0.77 0.385 76507.0 0.847 0.071 0.005 0.017 0.001 0.045

96 rows × 18 columns

county_demo.columns
Index(['% Fair or Poor Health', '% Excessive Drinking',
       'High School Graduation Rate', '% Some College', '% Unemployed',
       '% Physically Inactive', '% Smokers', '% Uninsured',
       '% Adults with Diabetes', '% Homeowners', '% Rural',
       'Median Household Income', '% Non-Hispanic White', '% Black',
       '% American Indian & Alaska Native', '% Asian',
       '% Native Hawaiian/Other Pacific Islander', '% Hispanic'],
      dtype='object')
X.columns
Index(['_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
       '_IMPRACE_Asian, Non-Hispanic', '_IMPRACE_Black, Non-Hispanic',
       '_IMPRACE_Hispanic', '_IMPRACE_Other race, Non-Hispanic',
       '_IMPRACE_White, Non-Hispanic', 'INCOME2_$75,000 or more',
       'INCOME2_Less than $10,000',
       'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
       'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
       'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
       'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
       'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
       'INCOME2_Less than $75,000 ($50,000 to less than $75,000)', 'Rural',
       'Diabetes', 'General_health', 'Health_plan', 'Excessive_drinking',
       'Owns_home', 'High_school_grad', 'College', 'Unemployed', 'Inactivity',
       'Smoker'],
      dtype='object')
column_dict = {'% Fair or Poor Health':"General_health", '% Excessive Drinking':"Excessive_drinking",
'High School Graduation Rate':"High_school_grad", '% Some College':"College", '% Unemployed':"Unemployed",
'% Physically Inactive':"Inactivity", '% Smokers':"Smoker", '% Uninsured':"Health_plan",
'% Adults with Diabetes':"Diabetes", '% Homeowners':"Owns_home", '% Rural':"Rural",
'% Non-Hispanic White':"_IMPRACE_White, Non-Hispanic", 
'% Black':'_IMPRACE_Black, Non-Hispanic',
'% American Indian & Alaska Native':'_IMPRACE_American Indian/Alaskan Native, Non-Hispanic', 
'% Asian':"_IMPRACE_Asian, Non-Hispanic",
'% Native Hawaiian/Other Pacific Islander':"_IMPRACE_Other race, Non-Hispanic", 
'% Hispanic':"_IMPRACE_Hispanic"}
for i in column_dict:
    county_demo.columns = county_demo.columns.str.replace(i,column_dict[i])
county_demo
General_health Excessive_drinking High_school_grad College Unemployed Inactivity Smoker Health_plan Diabetes Owns_home Rural Median Household Income _IMPRACE_White, Non-Hispanic _IMPRACE_Black, Non-Hispanic _IMPRACE_American Indian/Alaskan Native, Non-Hispanic _IMPRACE_Asian, Non-Hispanic _IMPRACE_Other race, Non-Hispanic _IMPRACE_Hispanic
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.14 0.13 0.66 0.336 52366.0 0.737 0.167 0.005 0.019 0.001 0.056
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.13 0.11 0.68 0.347 50672.0 0.891 0.039 0.005 0.015 0.001 0.031
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.18 0.11 0.68 0.556 49860.0 0.766 0.076 0.011 0.011 0.002 0.126
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.15 0.14 0.76 0.785 38940.0 0.925 0.024 0.006 0.006 0.000 0.025
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.18 0.10 0.74 1.000 40195.0 0.884 0.071 0.006 0.003 0.000 0.025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.14 0.15 0.81 1.000 38879.0 0.899 0.063 0.004 0.003 0.000 0.021
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.12 0.11 0.66 0.670 42351.0 0.868 0.075 0.004 0.012 0.000 0.027
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.14 0.22 0.81 0.782 42603.0 0.930 0.018 0.005 0.004 0.001 0.028
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.08 0.08 0.81 0.194 115930.0 0.843 0.043 0.003 0.047 0.001 0.049
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.10 0.11 0.77 0.385 76507.0 0.847 0.071 0.005 0.017 0.001 0.045

96 rows × 18 columns

Don’t forget, we need 1 - Health_plan (% Uninsured -> Health plan). All of the others should already be oriented correctly when “negative,” such as physical inactivity and unemployment; i.e., TRUE for Unemployed truly means unemployed.

county_demo['Health_plan'] = 1- county_demo['Health_plan']

Now we work on income categories.

income_categories = []
for i in county_demo['Median Household Income']:
    if float(i) < 10000:
        income_categories.append('INCOME2_Less than $10,000')
    elif (i >= 10000) & (i < 15000):
        income_categories.append('INCOME2_Less than $15,000 ($10,000 to less than $15,000)')
    elif (i >= 15000) & (i < 20000):
        income_categories.append('INCOME2_Less than $20,000 ($15,000 to less than $20,000)')
    elif (i >= 20000) & (i < 25000):
        income_categories.append('INCOME2_Less than $25,000 ($20,000 to less than $25,000)')
    elif (i >= 25000) & (i < 35000):
        income_categories.append('INCOME2_Less than $35,000 ($25,000 to less than $35,000)')
    elif (i >= 35000) & (i < 50000):
        income_categories.append('INCOME2_Less than $50,000 ($35,000 to less than $50,000)')
    elif (i >= 50000) & (i < 75000):
        income_categories.append('INCOME2_Less than $75,000 ($50,000 to less than $75,000)')
    elif (i >= 75000):
        income_categories.append('INCOME2_$75,000 or more')
county_demo['income_categories'] = income_categories
county_demo = pd.get_dummies(county_demo, 'income_categories')
county_demo = county_demo.drop('Median Household Income', axis = 1)
# No TN county has a median household income below $25,000, so add the
# missing income dummy columns as all zeros to match the model's feature set
county_demo['INCOME2_Less than $25,000 ($20,000 to less than $25,000)'] = 0
county_demo['INCOME2_Less than $20,000 ($15,000 to less than $20,000)'] = 0
county_demo['INCOME2_Less than $15,000 ($10,000 to less than $15,000)'] = 0
county_demo['INCOME2_Less than $10,000'] = 0
county_demo.columns = county_demo.columns.str.replace("income_categories_", "")
county_demo
General_health Excessive_drinking High_school_grad College Unemployed Inactivity Smoker Health_plan Diabetes Owns_home ... _IMPRACE_Other race, Non-Hispanic _IMPRACE_Hispanic INCOME2_$75,000 or more INCOME2_Less than $35,000 ($25,000 to less than $35,000) INCOME2_Less than $50,000 ($35,000 to less than $50,000) INCOME2_Less than $75,000 ($50,000 to less than $75,000) INCOME2_Less than $25,000 ($20,000 to less than $25,000) INCOME2_Less than $20,000 ($15,000 to less than $20,000) INCOME2_Less than $15,000 ($10,000 to less than $15,000) INCOME2_Less than $10,000
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.86 0.13 0.66 ... 0.001 0.056 0 0 0 1 0 0 0 0
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.87 0.11 0.68 ... 0.001 0.031 0 0 0 1 0 0 0 0
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.82 0.11 0.68 ... 0.002 0.126 0 0 1 0 0 0 0 0
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.85 0.14 0.76 ... 0.000 0.025 0 0 1 0 0 0 0 0
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.82 0.10 0.74 ... 0.000 0.025 0 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.86 0.15 0.81 ... 0.000 0.021 0 0 1 0 0 0 0 0
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.88 0.11 0.66 ... 0.000 0.027 0 0 1 0 0 0 0 0
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.86 0.22 0.81 ... 0.001 0.028 0 0 1 0 0 0 0 0
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.92 0.08 0.81 ... 0.001 0.049 1 0 0 0 0 0 0 0
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.90 0.11 0.77 ... 0.001 0.045 1 0 0 0 0 0 0 0

96 rows × 25 columns
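For reference, the if/elif chain above can be replaced with pd.cut. A sketch, assuming it is run before the Median Household Income column is dropped, with bin edges mirroring the INCOME2 categories:

# right = False makes each bin half-open, e.g. [35000, 50000), matching the elif logic
bins = [0, 10000, 15000, 20000, 25000, 35000, 50000, 75000, float('inf')]
labels = ['INCOME2_Less than $10,000',
          'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
          'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
          'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
          'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
          'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
          'INCOME2_Less than $75,000 ($50,000 to less than $75,000)',
          'INCOME2_$75,000 or more']
income_cat = pd.cut(county_demo['Median Household Income'], bins = bins,
                    labels = labels, right = False)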

# Use the tuned random forest, reindexing to the training column order so the features line up
county_pred_prob = forest.predict_proba(county_demo[X.columns])[:,1]
county_demo['predict_prob'] = county_pred_prob
county_demo
General_health Excessive_drinking High_school_grad College Unemployed Inactivity Smoker Health_plan Diabetes Owns_home ... _IMPRACE_Hispanic INCOME2_$75,000 or more INCOME2_Less than $35,000 ($25,000 to less than $35,000) INCOME2_Less than $50,000 ($35,000 to less than $50,000) INCOME2_Less than $75,000 ($50,000 to less than $75,000) INCOME2_Less than $25,000 ($20,000 to less than $25,000) INCOME2_Less than $20,000 ($15,000 to less than $20,000) INCOME2_Less than $15,000 ($10,000 to less than $15,000) INCOME2_Less than $10,000 predict_prob
FIPS
47000 0.20 0.14 0.90 0.61 0.035 0.27 0.23 0.86 0.13 0.66 ... 0.056 0 0 0 1 0 0 0 0 0.640059
47001 0.19 0.14 0.92 0.58 0.038 0.26 0.21 0.87 0.11 0.68 ... 0.031 0 0 0 1 0 0 0 0 0.640059
47003 0.22 0.16 0.91 0.41 0.037 0.32 0.23 0.82 0.11 0.68 ... 0.126 0 0 1 0 0 0 0 0 0.687877
47005 0.23 0.12 0.96 0.47 0.049 0.29 0.24 0.85 0.14 0.76 ... 0.025 0 0 1 0 0 0 0 0 0.687877
47007 0.23 0.15 0.85 0.38 0.058 0.26 0.25 0.82 0.10 0.74 ... 0.025 0 0 1 0 0 0 0 0 0.687877
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47181 0.24 0.14 0.96 0.40 0.049 0.41 0.25 0.86 0.15 0.81 ... 0.021 0 0 1 0 0 0 0 0 0.687877
47183 0.22 0.14 0.92 0.56 0.042 0.34 0.22 0.88 0.11 0.66 ... 0.027 0 0 1 0 0 0 0 0 0.667665
47185 0.25 0.13 0.94 0.49 0.037 0.23 0.23 0.86 0.22 0.81 ... 0.028 0 0 1 0 0 0 0 0 0.687877
47187 0.14 0.16 0.96 0.87 0.025 0.18 0.15 0.92 0.08 0.81 ... 0.049 1 0 0 0 0 0 0 0 0.812499
47189 0.17 0.17 0.96 0.69 0.028 0.21 0.18 0.90 0.11 0.77 ... 0.045 1 0 0 0 0 0 0 0 0.812499

96 rows × 26 columns

disparities = pd.read_csv('../data/disparities.csv', header = 0, index_col = 0)
disparities.index = disparities.FIPS
disparities
FIPS year primary_sex primary_age primary_dual state primary_race primary_eligibility primary_denominator_well analysis_value_well ... county primary_care_physicians population urban shadac_category pcp_per_100k num_hospitals num_hospitals_open mean_beds num_beds
FIPS
47033 47033 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 25 ... Crockett 0.0 14399 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47061 47061 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 34 ... Grundy 0.0 13344 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47095 47095 2019 All All Dual & non-dual TENNESSEE All All 500-999 20 ... Lake 0.0 7401 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47175 47175 2019 All All Dual & non-dual TENNESSEE All All 500-999 31 ... Van Buren 0.0 5760 Rural low inadequate 0.000000 0.0 0.0 NaN 0.0
47007 47007 2019 All All Dual & non-dual TENNESSEE All All 1,000-4,999 19 ... Bledsoe 1.0 14836 Rural low inadequate 6.740361 1.0 1.0 25.000000 25.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47093 47093 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 56 ... Knox 520.0 461104 Urban adequate 112.772823 11.0 10.0 306.600000 3066.0
47163 47163 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 52 ... Sullivan 199.0 157050 Urban adequate 126.711238 5.0 5.0 235.000000 1175.0
47113 47113 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 43 ... Madison 129.0 97625 Urban adequate 132.138284 3.0 3.0 330.666667 992.0
47187 47187 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 48 ... Williamson 338.0 225389 Urban adequate 149.962953 2.0 2.0 112.500000 225.0
47179 47179 2019 All All Dual & non-dual TENNESSEE All All 10,000+ 57 ... Washington 226.0 127805 Urban adequate 176.831892 7.0 6.0 167.000000 1002.0

95 rows × 30 columns

full_df = disparities.join(county_demo, lsuffix = 'disp', rsuffix = 'demo')
full_df.sort_values('predict_prob')[['county','predict_prob']]
county predict_prob
FIPS
47027 Clay 0.615584
47095 Lake 0.615584
47067 Hancock 0.615584
47149 Rutherford 0.640059
47125 Montgomery 0.640059
... ... ...
47091 Johnson 0.687877
47055 Giles 0.687877
47075 Haywood 0.692379
47189 Wilson 0.812499
47187 Williamson 0.812499

95 rows × 2 columns

full_df.columns
Index(['FIPS', 'year', 'primary_sex', 'primary_age', 'primary_dual', 'state',
       'primary_race', 'primary_eligibility', 'primary_denominator_well',
       'analysis_value_well', 'primary_denominator_hosp',
       'analysis_value_hosp', 'primary_denominator_emer',
       'analysis_value_emer', 'laus_code', 'Name', 'Employed',
       'Unemployeddisp', 'unemployment_rate', 'FIPS.1', 'county',
       'primary_care_physicians', 'population', 'urban', 'shadac_category',
       'pcp_per_100k', 'num_hospitals', 'num_hospitals_open', 'mean_beds',
       'num_beds', 'General_health', 'Excessive_drinking', 'High_school_grad',
       'College', 'Unemployeddemo', 'Inactivity', 'Smoker', 'Health_plan',
       'Diabetes', 'Owns_home', 'Rural', '_IMPRACE_White, Non-Hispanic',
       '_IMPRACE_Black, Non-Hispanic',
       '_IMPRACE_American Indian/Alaskan Native, Non-Hispanic',
       '_IMPRACE_Asian, Non-Hispanic', '_IMPRACE_Other race, Non-Hispanic',
       '_IMPRACE_Hispanic', 'INCOME2_$75,000 or more',
       'INCOME2_Less than $35,000 ($25,000 to less than $35,000)',
       'INCOME2_Less than $50,000 ($35,000 to less than $50,000)',
       'INCOME2_Less than $75,000 ($50,000 to less than $75,000)',
       'INCOME2_Less than $25,000 ($20,000 to less than $25,000)',
       'INCOME2_Less than $20,000 ($15,000 to less than $20,000)',
       'INCOME2_Less than $15,000 ($10,000 to less than $15,000)',
       'INCOME2_Less than $10,000', 'predict_prob'],
      dtype='object')
sns.scatterplot(x = full_df['Health_plan'], y = full_df.predict_prob)

<AxesSubplot:xlabel='Health_plan', ylabel='predict_prob'>
(county_demo).plot(kind = 'box', figsize = (10,10))
plt.xticks(rotation = 90);

Which groups within these counties might need to be focused on to maximize the impact of TN Med Helper’s efforts?

counties = gpd.read_file('../data/county/tncounty.shp')
counties
OBJECTID NAME KEY SHAPE_AREA SHAPE_LEN geometry
0 76 Chester 47023 8.049024e+09 520461.080124 POLYGON ((1137985.762 344601.643, 1137965.070 ...
1 77 Wayne 47181 2.050741e+10 666520.678598 POLYGON ((1365052.057 391716.806, 1365746.554 ...
2 78 Tipton 47167 1.319125e+10 865093.887634 MULTIPOLYGON (((886814.330 400456.525, 886774....
3 79 Hamilton 47065 1.604776e+10 652926.001078 POLYGON ((2274954.438 239788.911, 2274090.610 ...
4 80 Stewart 47161 1.375003e+10 490090.336180 POLYGON ((1382472.783 743972.302, 1382445.171 ...
... ... ... ... ... ... ...
90 91 McNairy 47109 1.563586e+10 566369.132062 POLYGON ((1137985.762 344601.643, 1139350.519 ...
91 92 Franklin 47051 1.605093e+10 621176.096919 POLYGON ((1873015.265 239618.144, 1872957.848 ...
92 93 Bradley 47011 9.241234e+09 457372.233476 POLYGON ((2274954.438 239788.911, 2275552.803 ...
93 94 Marion 47115 1.428734e+10 529431.591556 POLYGON ((2126056.390 236919.771, 2122873.509 ...
94 95 Polk 47139 1.233228e+10 479994.126988 POLYGON ((2355580.184 332970.851, 2355673.384 ...

95 rows × 6 columns

full_df.county
FIPS
47033      Crockett
47061        Grundy
47095          Lake
47175     Van Buren
47007       Bledsoe
            ...    
47093          Knox
47163      Sullivan
47113       Madison
47187    Williamson
47179    Washington
Name: county, Length: 95, dtype: object
counties = pd.merge(left = counties,left_on = 'NAME',
                    right = full_df[['county', 'predict_prob']], right_on = 'county')
counties['no_checkup'] = 1- counties.predict_prob

fig, ax = plt.subplots(figsize=(16,4))

counties.plot(column = 'no_checkup', 
              edgecolor = 'black',
              legend = True,
              cmap = 'viridis',
              # scheme="NaturalBreaks",
              ax = ax)

# With a continuous colormap, legend = True draws a colorbar rather than a
# legend, so only reposition the legend if one exists (e.g. when a
# classification scheme is used)
leg = ax.get_legend()
if leg is not None:
    leg.set_bbox_to_anchor((1, 0.5))

plt.title('Predicted No Checkup in the Last Year for Average Resident', fontsize = 18)

ax.axis('off');