Merge various datasets with Pandas
Here I present the code I used to manipulate and merge multiple datasets, as part of my dance business project. I hope it is useful!
Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
DataFrame to save
# Dataframe to merge everything
df = None
Scale/adjustment factor data
Before we begin, let’s import some useful economic index and census data. Even though these data won’t be used for my modeling, I will use them during exploratory analysis to see business trends.
Consumer Price Index (CPI)
The CPI can be used to account for inflation. I downloaded the numbers for the areas of interest from the source below.
- Source: U.S. Bureau of Labor Statistics (the link is one example page)
- All area: Mean of half1 and half2 of “All items in U.S. city average, all urban consumers, not seasonally adjusted” (CUUR0000SA0)
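Since the CPI is an index, a dollar amount from one year can be restated in another year’s dollars by the ratio of the index values. A minimal sketch (a hypothetical helper, assuming the cpi frame loaded below):
# Hypothetical helper: restate `amount` from `year` dollars into 2021 dollars
def to_2021_dollars(amount, year, area, cpi):
    base = cpi.loc[cpi.year == 2021, area].iloc[0]
    then = cpi.loc[cpi.year == year, area].iloc[0]
    return amount * base / then
# e.g. to_2021_dollars(100, 1990, 'NY', cpi) -> roughly $211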
cpi = pd.read_csv('data/CPI.csv')
display(cpi)
year | NY | Chicago | Seattle | LA | SanFran | All | |
---|---|---|---|---|---|---|---|
0 | 1990 | 138.500 | 131.700 | 126.800 | 135.900 | 132.100 | 130.6500 |
1 | 1991 | 144.800 | 137.000 | 134.100 | 141.400 | 137.900 | 136.2000 |
2 | 1992 | 150.000 | 141.100 | 139.000 | 146.500 | 142.500 | 140.3000 |
3 | 1993 | 154.500 | 145.400 | 142.900 | 150.300 | 146.300 | 144.5000 |
4 | 1994 | 158.200 | 148.600 | 147.800 | 152.300 | 148.700 | 148.2500 |
5 | 1995 | 162.200 | 153.300 | 152.300 | 154.600 | 151.600 | 152.3500 |
6 | 1996 | 166.900 | 157.400 | 157.500 | 157.500 | 155.100 | 156.8500 |
7 | 1997 | 170.800 | 161.700 | 163.000 | 160.000 | 160.400 | 160.5500 |
8 | 1998 | 173.600 | 165.000 | 167.700 | 162.300 | 165.500 | 163.0000 |
9 | 1999 | 177.000 | 168.400 | 172.800 | 166.100 | 172.500 | 166.6000 |
10 | 2000 | 182.500 | 173.800 | 179.200 | 171.600 | 180.200 | 172.2000 |
11 | 2001 | 187.100 | 178.300 | 185.700 | 177.300 | 189.900 | 177.0500 |
12 | 2002 | 191.900 | 181.200 | 189.300 | 182.200 | 193.000 | 179.9000 |
13 | 2003 | 197.800 | 184.500 | 192.300 | 187.000 | 196.400 | 183.9500 |
14 | 2004 | 204.800 | 188.600 | 194.700 | 193.200 | 198.800 | 188.9000 |
15 | 2005 | 212.700 | 194.300 | 200.200 | 201.800 | 202.700 | 195.3000 |
16 | 2006 | 220.700 | 198.300 | 207.600 | 210.400 | 209.200 | 201.6000 |
17 | 2007 | 226.940 | 204.818 | 215.656 | 217.338 | 216.048 | 207.3425 |
18 | 2008 | 235.782 | 212.536 | 224.719 | 225.008 | 222.767 | 215.3030 |
19 | 2009 | 236.825 | 209.995 | 226.028 | 223.219 | 224.395 | 214.5370 |
20 | 2010 | 240.864 | 212.870 | 226.693 | 225.894 | 227.469 | 218.0555 |
21 | 2011 | 247.718 | 218.684 | 232.765 | 231.928 | 233.390 | 224.9390 |
22 | 2012 | 252.588 | 222.005 | 238.663 | 236.648 | 239.650 | 229.5940 |
23 | 2013 | 256.833 | 224.545 | 241.563 | 239.207 | 245.023 | 232.9570 |
24 | 2014 | 260.230 | 228.468 | 246.018 | 242.434 | 251.985 | 236.7360 |
25 | 2015 | 260.558 | 227.792 | 249.364 | 244.632 | 258.572 | 237.0170 |
26 | 2016 | 263.365 | 229.302 | 254.886 | 249.246 | 266.344 | 240.0075 |
27 | 2017 | 268.520 | 233.611 | 262.668 | 256.210 | 274.924 | 245.1195 |
28 | 2018 | 273.641 | 237.706 | 271.089 | 265.962 | 285.550 | 251.1070 |
29 | 2019 | 278.164 | 241.181 | 277.984 | 274.114 | 295.004 | 255.6575 |
30 | 2020 | 282.920 | 243.873 | 282.693 | 278.567 | 300.084 | 258.8110 |
31 | 2021 | 292.303 | 254.159 | 295.560 | 289.244 | 309.721 | 270.9695 |
# Start df from the CPI table, reshaped to long format
df = pd.melt(cpi, id_vars=['year'], value_vars=cpi.columns[1:],
var_name='area', value_name='cpi')
print(len(df)) # should be 32 years x 6 area = 192 rows
display(df.sample(5))
192
year | area | cpi | |
---|---|---|---|
138 | 2000 | SanFran | 180.200 |
165 | 1995 | All | 152.350 |
89 | 2015 | Seattle | 249.364 |
85 | 2011 | Seattle | 232.765 |
35 | 1993 | Chicago | 145.400 |
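As a sanity check, pivot inverts the melt and should recover the wide table:
# Pivoting back should reproduce the original wide CPI table
wide = df.pivot(index='year', columns='area', values='cpi')
print(wide.shape) # (32, 6)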
Cost of Living Index (COLI)
The CPI does not capture the actual cost of living in each area; the COLI comes closer to what living there actually costs. When judging income levels, we should scale by the COLI.
Unlike the CPI, COLI data was difficult to find. I found COLI figures for 2010 on Census.gov, and I will assume the relative differences between cities stay the same over the years.
Also, it is not clear whether this index is calculated for the metropolitan statistical area (broader) or only for the city proper (smaller area). I'll assume the former, which matches the area division of the wage statistics.
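To illustrate the scaling with the 2010 figures loaded below: with NY at 216.7 and Chicago at 116.9, a salary has to stretch almost twice as far in NY.
# A $70,000 Chicago salary corresponds to roughly 70000 * 216.7/116.9
# ≈ $129,700 in NY to keep the same living standard
print(70000 * 216.7 / 116.9)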
coli = pd.read_csv('data/COLI.csv')
coli = coli[coli.area!='Percent']
display(coli)
area | coli | Grocery | Housing | Utilities | Transportation | HealthCare | Etc | |
---|---|---|---|---|---|---|---|---|
0 | NY | 216.7 | 154.3 | 386.7 | 169.6 | 120.3 | 130.2 | 145.7 |
1 | Chicago | 116.9 | 111.2 | 134.8 | 117.3 | 116.5 | 108.5 | 104.4 |
2 | Seattle | 121.4 | 115.1 | 140.3 | 85.7 | 118.8 | 119.9 | 119.1 |
3 | LA | 136.4 | 106.0 | 207.1 | 101.7 | 113.6 | 109.1 | 107.0 |
4 | SanFran | 164.0 | 111.9 | 281.0 | 94.5 | 113.0 | 117.0 | 124.3 |
Housing price is the dominant driving factor.
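A quick check backs this up (a sketch using the component columns above):
# Correlation of each component with the composite index across the five cities
comp = coli.set_index('area')
print(comp[['Grocery','Housing','Utilities','Transportation','HealthCare','Etc']].corrwith(comp['coli']))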
# Merge to df
df = df.merge(coli[['area','coli']], how = 'outer', on = ['area'])
print(len(df)) # should be 32 years x 6 area = 192 rows
display(df.sample(5))
192
year | area | cpi | coli | |
---|---|---|---|---|
147 | 2009 | SanFran | 224.395 | 164.0 |
72 | 1998 | Seattle | 167.700 | 121.4 |
61 | 2019 | Chicago | 241.181 | 116.9 |
37 | 1995 | Chicago | 153.300 | 116.9 |
119 | 2013 | LA | 239.207 | 136.4 |
# COLI adjusted CPI (2010)
t = df[df.year==2010].copy() # .copy() avoids a SettingWithCopyWarning
t['coli_cpi'] = t.coli/t.cpi
t = t[['area','coli_cpi']]
df = df.merge(t,how='left',on=['area'])
Census - U.S. and Metropolitan area population
- Source
- National population
- Metropolitan statistical area population
Metropolitan statistical area population data
# Check metropolitan area census dataset
census = pd.read_csv('data/census.csv')
census.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2797 entries, 0 to 2796
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CBSA 2797 non-null int64
1 MDIV 141 non-null float64
2 STCOU 1840 non-null float64
3 NAME 2797 non-null object
4 LSAD 2797 non-null object
5 CENSUS2010POP 2797 non-null int64
6 ESTIMATESBASE2010 2797 non-null int64
7 POPESTIMATE2010 2797 non-null int64
8 POPESTIMATE2011 2797 non-null int64
9 POPESTIMATE2012 2797 non-null int64
10 POPESTIMATE2013 2797 non-null int64
11 POPESTIMATE2014 2797 non-null int64
12 POPESTIMATE2015 2797 non-null int64
13 POPESTIMATE2016 2797 non-null int64
14 POPESTIMATE2017 2797 non-null int64
15 POPESTIMATE2018 2797 non-null int64
16 POPESTIMATE2019 2797 non-null int64
17 POPESTIMATE2020 5 non-null float64
18 POPESTIMATE2021 5 non-null float64
dtypes: float64(4), int64(13), object(2)
memory usage: 415.3+ KB
# Select only the areas of interest
# CBSA codes of the Metropolitan Statistical Areas
# (the dict is named zipcode_area, but these are CBSA codes, not zip codes)
# LA has two codes because its CBSA code changed over the years
zipcode_area = {31100:'LA',31080:'LA',41860:'SanFran',16980:'Chicago',35620:'NY',42660:'Seattle'}
lst = []
for i in zipcode_area.keys():
    if i==31100:  # this census file uses the newer LA code, 31080
        continue
    lst.append(census.loc[(census.CBSA==i)&(census.LSAD=='Metropolitan Statistical Area')])
census = pd.concat(lst)
display(census)
CBSA | MDIV | STCOU | NAME | LSAD | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | POPESTIMATE2019 | POPESTIMATE2020 | POPESTIMATE2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
850 | 31080 | NaN | NaN | Los Angeles-Long Beach-Anaheim, CA | Metropolitan Statistical Area | 12828837 | 12828957 | 12838417 | 12925753 | 13013443 | 13097434 | 13166609 | 13234696 | 13270694 | 13278000 | 13249879 | 13214799 | 13173266.0 | 12997353.0 |
1307 | 41860 | NaN | NaN | San Francisco-Oakland-Berkeley, CA | Metropolitan Statistical Area | 4335391 | 4335593 | 4343634 | 4395725 | 4455473 | 4519636 | 4584981 | 4647924 | 4688198 | 4712421 | 4726314 | 4731803 | 4739649.0 | 4623264.0 |
291 | 16980 | NaN | NaN | Chicago-Naperville-Elgin, IL-IN-WI | Metropolitan Statistical Area | 9461105 | 9461537 | 9470634 | 9500870 | 9528090 | 9550194 | 9560430 | 9552554 | 9533662 | 9514113 | 9484158 | 9458539 | 9601605.0 | 9509934.0 |
1017 | 35620 | NaN | NaN | New York-Newark-Jersey City, NY-NJ-PA | Metropolitan Statistical Area | 18897109 | 18896277 | 18923407 | 19052774 | 19149689 | 19226449 | 19280929 | 19320968 | 19334778 | 19322607 | 19276644 | 19216182 | 20096413.0 | 19768458.0 |
1337 | 42660 | NaN | NaN | Seattle-Tacoma-Bellevue, WA | Metropolitan Statistical Area | 3439809 | 3439808 | 3449241 | 3503891 | 3558829 | 3612347 | 3675160 | 3739654 | 3816355 | 3885579 | 3935179 | 3979845 | 4024730.0 | 4011553.0 |
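The same selection can be written with isin instead of the loop (a sketch over the raw file; like the loop above, it keeps only the newer LA code):
# keep = census.CBSA.isin([31080, 41860, 16980, 35620, 42660]) \
#        & (census.LSAD == 'Metropolitan Statistical Area')
# census = census[keep]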
# Keep only the fields of interest
census.drop(['MDIV','STCOU','LSAD','CENSUS2010POP','ESTIMATESBASE2010'],axis=1,inplace=True)
# Rename the fields
census.columns = ['CBSA','NAME',2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021]
# Transpose so the years become the index
census = census.T
# Put area names as field names
census.columns = ['LA','SanFran','Chicago','NY','Seattle']
# Leave only the yearly population rows
census.drop(['CBSA','NAME'],inplace=True)
display(census)
LA | SanFran | Chicago | NY | Seattle | |
---|---|---|---|---|---|
2010 | 12838417 | 4343634 | 9470634 | 18923407 | 3449241 |
2011 | 12925753 | 4395725 | 9500870 | 19052774 | 3503891 |
2012 | 13013443 | 4455473 | 9528090 | 19149689 | 3558829 |
2013 | 13097434 | 4519636 | 9550194 | 19226449 | 3612347 |
2014 | 13166609 | 4584981 | 9560430 | 19280929 | 3675160 |
2015 | 13234696 | 4647924 | 9552554 | 19320968 | 3739654 |
2016 | 13270694 | 4688198 | 9533662 | 19334778 | 3816355 |
2017 | 13278000 | 4712421 | 9514113 | 19322607 | 3885579 |
2018 | 13249879 | 4726314 | 9484158 | 19276644 | 3935179 |
2019 | 13214799 | 4731803 | 9458539 | 19216182 | 3979845 |
2020 | 13173266.0 | 4739649.0 | 9601605.0 | 20096413.0 | 4024730.0 |
2021 | 12997353.0 | 4623264.0 | 9509934.0 | 19768458.0 | 4011553.0 |
National population data
# Check one example file of national population
demo = pd.read_csv('data/C2015.csv')
demo.info()
display(demo.head(5))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Label (Grouping) 89 non-null object
1 United States!!Estimate 84 non-null object
2 United States!!Margin of Error 84 non-null object
3 United States!!Percent 84 non-null object
4 United States!!Percent Margin of Error 84 non-null object
dtypes: object(5)
memory usage: 3.6+ KB
Label (Grouping) | United States!!Estimate | United States!!Margin of Error | United States!!Percent | United States!!Percent Margin of Error | |
---|---|---|---|---|---|
0 | SEX AND AGE | NaN | NaN | NaN | NaN |
1 | Total population | 321,418,821 | ***** | 321,418,821 | (X) |
2 | Male | 158,167,834 | ±31,499 | 49.2% | ±0.1 |
3 | Female | 163,250,987 | ±31,500 | 50.8% | ±0.1 |
4 | Under 5 years | 19,793,807 | ±16,520 | 6.2% | ±0.1 |
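Incidentally, read_csv can strip the comma thousands separators and treat the census placeholder tokens as missing right at load time (a sketch; extend na_values as needed):
demo = pd.read_csv('data/C2015.csv', thousands=',', na_values=['(X)', '*****'])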
%%script false --no-raise-error
# If you already have data/usDemo.csv, this block can be skipped.
# Combine multiple year files of national population
df_save = []
for year in range(2010,2020):
    demo = pd.read_csv('data/C{0}.csv'.format(year))
    demo['year'] = year
    df_save.append(demo)
# Make a csv file
df_save = pd.concat(df_save)
df_save.columns = ['label','estimate','estimate_err','pct','pct_err','year','estimate_err2']
df_save.to_csv('data/usDemo.csv', index=False)
census_national = pd.read_csv('data/usDemo.csv')
display(census_national)
label | estimate | estimate_err | pct | pct_err | year | estimate_err2 | |
---|---|---|---|---|---|---|---|
0 | SEX AND AGE | NaN | NaN | NaN | NaN | 2010 | NaN |
1 | Total population | 309,349,689 | ***** | 309,349,689 | (X) | 2010 | NaN |
2 | Male | 152,089,450 | ±27,325 | 49.2% | ±0.1 | 2010 | NaN |
3 | Female | 157,260,239 | ±27,325 | 50.8% | ±0.1 | 2010 | NaN |
4 | Under 5 years | 20,133,943 | ±20,568 | 6.5% | ±0.1 | 2010 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
880 | Total housing units | 139,686,209 | NaN | (X) | (X) | 2019 | ±6,973 |
881 | CITIZEN, VOTING AGE POPULATION | NaN | NaN | NaN | NaN | 2019 | NaN |
882 | Citizen, 18 and over population | 235,418,734 | NaN | 235,418,734 | (X) | 2019 | ±159,764 |
883 | Male | 114,206,194 | NaN | 48.5% | ±0.1 | 2019 | ±98,225 |
884 | Female | 121,212,540 | NaN | 51.5% | ±0.1 | 2019 | ±79,689 |
885 rows × 7 columns
# Keep only the data to use
# Reshape to merge with the metropolitan census dataframe
population = []
years = []
# collect the national 'Total population' figure for each year
for year in range(2010,2020):
    pop = int(''.join(census_national[(census_national.label.str.contains('Total population'))\
                                      &(census_national.year==year)].iloc[0].estimate.split(',')))
    population.append(pop)
    years.append(year)
census_national = pd.DataFrame({'year':years,'All':population})
# Add extra years
census_national_other = pd.DataFrame([
[2020, 331501080],
[2021, 331893745],
[1998, 275854000],
[1999, 279040000],
[2000, 282172000],
[2001, 285082000],
[2002, 287804000],
[2003, 290326000],
[2004, 293046000],
[2005, 295753000],
[2006, 298593000],
[2007, 301580000],
[2008, 304375000],
[2009, 307007000]],
columns=['year','All'])
census_national = pd.concat([census_national, census_national_other], ignore_index=True) # DataFrame.append was removed in pandas 2.0
census_national = census_national.sort_values(by=['year'])
display(census_national)
year | All | |
---|---|---|
12 | 1998 | 275854000 |
13 | 1999 | 279040000 |
14 | 2000 | 282172000 |
15 | 2001 | 285082000 |
16 | 2002 | 287804000 |
17 | 2003 | 290326000 |
18 | 2004 | 293046000 |
19 | 2005 | 295753000 |
20 | 2006 | 298593000 |
21 | 2007 | 301580000 |
22 | 2008 | 304375000 |
23 | 2009 | 307007000 |
0 | 2010 | 309349689 |
1 | 2011 | 311591919 |
2 | 2012 | 313914040 |
3 | 2013 | 316128839 |
4 | 2014 | 318857056 |
5 | 2015 | 321418821 |
6 | 2016 | 323127515 |
7 | 2017 | 325719178 |
8 | 2018 | 327167439 |
9 | 2019 | 328239523 |
10 | 2020 | 331501080 |
11 | 2021 | 331893745 |
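For reference, the comma-stripping loop above can also be vectorized (a sketch over the raw usDemo.csv frame):
raw = pd.read_csv('data/usDemo.csv')
tot = raw[raw.label.str.contains('Total population', na=False)]
pops = pd.to_numeric(tot.estimate.str.replace(',', '', regex=False))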
Merge national population with metropolitan population
census = pd.merge(census_national, census, how='outer', right_index=True, left_on='year')
# Change data type to numeric
census = census.apply(pd.to_numeric)
census.info()
display(census)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 12 to 11
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 24 non-null int64
1 All 24 non-null int64
2 LA 12 non-null float64
3 SanFran 12 non-null float64
4 Chicago 12 non-null float64
5 NY 12 non-null float64
6 Seattle 12 non-null float64
dtypes: float64(5), int64(2)
memory usage: 1.5 KB
year | All | LA | SanFran | Chicago | NY | Seattle | |
---|---|---|---|---|---|---|---|
12 | 1998 | 275854000 | NaN | NaN | NaN | NaN | NaN |
13 | 1999 | 279040000 | NaN | NaN | NaN | NaN | NaN |
14 | 2000 | 282172000 | NaN | NaN | NaN | NaN | NaN |
15 | 2001 | 285082000 | NaN | NaN | NaN | NaN | NaN |
16 | 2002 | 287804000 | NaN | NaN | NaN | NaN | NaN |
17 | 2003 | 290326000 | NaN | NaN | NaN | NaN | NaN |
18 | 2004 | 293046000 | NaN | NaN | NaN | NaN | NaN |
19 | 2005 | 295753000 | NaN | NaN | NaN | NaN | NaN |
20 | 2006 | 298593000 | NaN | NaN | NaN | NaN | NaN |
21 | 2007 | 301580000 | NaN | NaN | NaN | NaN | NaN |
22 | 2008 | 304375000 | NaN | NaN | NaN | NaN | NaN |
23 | 2009 | 307007000 | NaN | NaN | NaN | NaN | NaN |
0 | 2010 | 309349689 | 12838417.0 | 4343634.0 | 9470634.0 | 18923407.0 | 3449241.0 |
1 | 2011 | 311591919 | 12925753.0 | 4395725.0 | 9500870.0 | 19052774.0 | 3503891.0 |
2 | 2012 | 313914040 | 13013443.0 | 4455473.0 | 9528090.0 | 19149689.0 | 3558829.0 |
3 | 2013 | 316128839 | 13097434.0 | 4519636.0 | 9550194.0 | 19226449.0 | 3612347.0 |
4 | 2014 | 318857056 | 13166609.0 | 4584981.0 | 9560430.0 | 19280929.0 | 3675160.0 |
5 | 2015 | 321418821 | 13234696.0 | 4647924.0 | 9552554.0 | 19320968.0 | 3739654.0 |
6 | 2016 | 323127515 | 13270694.0 | 4688198.0 | 9533662.0 | 19334778.0 | 3816355.0 |
7 | 2017 | 325719178 | 13278000.0 | 4712421.0 | 9514113.0 | 19322607.0 | 3885579.0 |
8 | 2018 | 327167439 | 13249879.0 | 4726314.0 | 9484158.0 | 19276644.0 | 3935179.0 |
9 | 2019 | 328239523 | 13214799.0 | 4731803.0 | 9458539.0 | 19216182.0 | 3979845.0 |
10 | 2020 | 331501080 | 13173266.0 | 4739649.0 | 9601605.0 | 20096413.0 | 4024730.0 |
11 | 2021 | 331893745 | 12997353.0 | 4623264.0 | 9509934.0 | 19768458.0 | 4011553.0 |
# Merge to df
census = pd.melt(census, id_vars=['year'], value_vars=cpi.columns[1:],
var_name='area', value_name='population')
df = df.merge(census, how='outer', on= ['area','year'])
print(len(df)) # should be 32 years x 6 area = 192 rows
display(df.sample(5))
192
year | area | cpi | coli | coli_cpi | population | |
---|---|---|---|---|---|---|
53 | 2011 | Chicago | 218.6840 | 116.9 | 0.549161 | 9500870.0 |
61 | 2019 | Chicago | 241.1810 | 116.9 | 0.549161 | 9458539.0 |
42 | 2000 | Chicago | 173.8000 | 116.9 | 0.549161 | NaN |
180 | 2010 | All | 218.0555 | NaN | NaN | 309349689.0 |
91 | 2017 | Seattle | 262.6680 | 121.4 | 0.535526 | 3885579.0 |
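Rather than checking row counts by hand, merge’s indicator flag reports match status directly; a sketch of how the census merge above could be audited at merge time:
# chk = df.merge(census, how='outer', on=['area','year'], indicator=True)
# print(chk['_merge'].value_counts()) # 'left_only' rows lack census data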
U.S. market data
I created “data/rev.csv” by combining U.S. market statistics from multiple sources. Data sources are
- U.S. dance studio
- U.S. fitness and recreational sports centers
# Read data file
rev = pd.read_csv('data/rev.csv')
# Set every money scale to billion dollars
rev.fitness = rev.fitness/1000
rev.studio_wage = rev.studio_wage/1000
display(rev)
# fitness: U.S. fitness and recreational sports center revenue
# dance_studio: U.S. dance studio revenue
# studio_num: number of U.S. dance studio businesses
# studio_emp: number of U.S. dance studio employees
# studio_wage: total wages of U.S. dance studios
year | fitness | dance_studio | studio_num | studio_emp | studio_wage | |
---|---|---|---|---|---|---|
0 | 1998 | 10.797 | NaN | NaN | NaN | NaN |
1 | 1999 | 11.777 | NaN | NaN | NaN | NaN |
2 | 2000 | 12.543 | NaN | NaN | NaN | NaN |
3 | 2001 | 13.542 | NaN | NaN | NaN | NaN |
4 | 2002 | 14.987 | NaN | NaN | NaN | NaN |
5 | 2003 | 16.287 | NaN | NaN | NaN | NaN |
6 | 2004 | 17.174 | NaN | NaN | NaN | NaN |
7 | 2005 | 18.286 | NaN | NaN | NaN | NaN |
8 | 2006 | 19.447 | NaN | NaN | NaN | NaN |
9 | 2007 | 21.416 | NaN | NaN | NaN | NaN |
10 | 2008 | 22.339 | NaN | NaN | NaN | NaN |
11 | 2009 | 21.842 | NaN | NaN | NaN | NaN |
12 | 2010 | 22.311 | NaN | NaN | NaN | NaN |
13 | 2011 | 23.191 | 3.04 | NaN | NaN | NaN |
14 | 2012 | 24.051 | 3.22 | 47269.0 | 90668.0 | 0.9026 |
15 | 2013 | 25.803 | 3.28 | 48399.0 | 93420.0 | 0.9029 |
16 | 2014 | 27.001 | 3.42 | 52942.0 | 99696.0 | 0.9504 |
17 | 2015 | 28.838 | 3.59 | 55523.0 | 104321.0 | 1.0148 |
18 | 2016 | 31.223 | 3.70 | 56412.0 | 107832.0 | 1.0798 |
19 | 2017 | 33.042 | 3.87 | 58515.0 | 114075.0 | 1.1448 |
20 | 2018 | 33.971 | 4.10 | 63363.0 | 120456.0 | 1.1787 |
21 | 2019 | 35.889 | 4.20 | 65723.0 | 126288.0 | 1.2295 |
22 | 2020 | 24.361 | 3.43 | 62808.0 | 112485.0 | 1.0768 |
23 | 2021 | NaN | 3.72 | 66266.0 | 120081.0 | 1.1531 |
24 | 2022 | NaN | 3.83 | 68393.0 | 123680.0 | 1.1876 |
# Merge to df
df= df.merge(rev, how = 'outer', on= ['year'])
# Rows that came only from rev have no area; they describe the national market
df.loc[df.area.isna(),'area']='All'
print(len(df)) # should be 32 years x 6 area + 1 extra row for 2022 = 193 rows
display(df.sample(3))
193
year | area | cpi | coli | coli_cpi | population | fitness | dance_studio | studio_num | studio_emp | studio_wage | |
---|---|---|---|---|---|---|---|---|---|---|---|
138 | 2013 | NY | 256.833 | 216.7 | 0.899678 | 19226449.0 | 25.803 | 3.28 | 48399.0 | 93420.0 | 0.9029 |
115 | 2009 | Chicago | 209.995 | 116.9 | 0.549161 | NaN | 21.842 | NaN | NaN | NaN | NaN |
102 | 2007 | NY | 226.940 | 216.7 | 0.899678 | NaN | 21.416 | NaN | NaN | NaN | NaN |
Employee statistics data
This website provides tables of employment statistics (wage, number of employees, etc.) for different areas in each year. I downloaded each year’s file; below, the files are cleaned and concatenated.
Explanation of fields
Here are the definitions of each field. Fields not explained here are not used in this analysis.
Area identifier
- area: area code
- area_name(title): Area name
Job identifier
- occ_code: The 6-digit Standard Occupational Classification (SOC) code or OEWS-specific code for the occupation
- occ_title: SOC title or OEWS-specific title for the occupation
Number of employees
- tot_emp: Estimated total employment rounded to the nearest 10 (excludes self-employed).
- emp_prse: Percent relative standard error (PRSE) for the employment estimate. PRSE is a measure of sampling error, expressed as a percentage of the corresponding estimate. Sampling error occurs when values for a population are estimated from a sample survey of the population, rather than calculated from data for all members of the population. Estimates with lower PRSEs are typically more precise in the presence of sampling error.
Wage
- h_mean: Mean hourly wage
- a_mean: Mean annual wage
- mean_prse: Percent relative standard error (PRSE) for the mean wage estimate.
- h_pct10: Hourly 10th percentile wage
- h_pct25: Hourly 25th percentile wage
- h_median: Hourly median wage (or the 50th percentile)
- h_pct75: Hourly 75th percentile wage
- h_pct90: Hourly 90th percentile wage
- a_pct10: Annual 10th percentile wage
- a_pct25: Annual 25th percentile wage
- a_median: Annual median wage (or the 50th percentile)
- a_pct75: Annual 75th percentile wage
- a_pct90: Annual 90th percentile wage
# Let's check what each file looks like
sample = pd.read_excel('data/2010.xls')
sample.info()
display(sample.sample(2))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7940 entries, 0 to 7939
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PRIM_STATE 7940 non-null object
1 AREA 7940 non-null int64
2 AREA_NAME 7940 non-null object
3 OCC_CODE 7940 non-null object
4 OCC_TITLE 7940 non-null object
5 GROUP 253 non-null object
6 TOT_EMP 7940 non-null object
7 EMP_PRSE 7940 non-null object
8 JOBS_1000 7940 non-null object
9 LOC QUOTIENT 7940 non-null object
10 H_MEAN 7940 non-null object
11 A_MEAN 7940 non-null object
12 MEAN_PRSE 7940 non-null object
13 H_PCT10 7940 non-null object
14 H_PCT25 7940 non-null object
15 H_MEDIAN 7940 non-null object
16 H_PCT75 7940 non-null object
17 H_PCT90 7940 non-null object
18 A_PCT10 7940 non-null object
19 A_PCT25 7940 non-null object
20 A_MEDIAN 7940 non-null object
21 A_PCT75 7940 non-null object
22 A_PCT90 7940 non-null object
23 ANNUAL 562 non-null object
24 HOURLY 38 non-null object
dtypes: int64(1), object(24)
memory usage: 1.5+ MB
PRIM_STATE | AREA | AREA_NAME | OCC_CODE | OCC_TITLE | GROUP | TOT_EMP | EMP_PRSE | JOBS_1000 | LOC QUOTIENT | ... | H_MEDIAN | H_PCT75 | H_PCT90 | A_PCT10 | A_PCT25 | A_MEDIAN | A_PCT75 | A_PCT90 | ANNUAL | HOURLY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3384 | IL | 16980 | Chicago-Naperville-Joliet, IL-IN-WI | 47-2041 | Carpet Installers | NaN | 1470 | 17.6 | 0.352 | 1.749 | ... | 22.34 | 32.03 | 39.24 | 23180 | 35810 | 46470 | 66630 | 81620 | NaN | NaN |
2762 | FL | 33100 | Miami-Fort Lauderdale-Pompano Beach, FL | 51-4023 | Rolling Machine Setters, Operators, and Tender... | NaN | 190 | 16.4 | 0.09 | 0.359 | ... | 15.46 | 18.64 | 20.53 | 18690 | 25990 | 32160 | 38780 | 42700 | NaN | NaN |
2 rows × 25 columns
# jobs of interest
# dancer: dancer
# choreo: choreographer
# fit_trainer: fitness trainer/instructor
# rec_worker: recreation worker
# all_jobs: all jobs sum/mean
code_job = {'27-2031':'dancer','27-2032':'choreo',
'39-9031':'fit_trainer','39-9032':'rec_worker',
'00-0000':'all_jobs'}
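As an aside, Series.map with these dictionaries is a lighter alternative to the row-wise apply calls used in the block below:
# e.g. metro['area'] = metro['area'].map(zipcode_area)
#      emp['occ'] = emp['occ_code'].map(code_job)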
%%script false --no-raise-error
# This block combines multiple wage data files, then generates a single csv file.
# If you already have data/emp.csv, this block can be skipped. It takes time to run.
df_save = []
for year in range(2003,2022):
    print(year)
    metro = None     # metropolitan area statistics data
    national = None  # national statistics data
    try:
        if year>2004:
            metro = pd.read_excel('data/'+str(year)+'.xls')
        national = pd.read_excel('data/'+str(year)+'nat.xls')
    except FileNotFoundError: # later years ship as .xlsx
        if year>2004:
            metro = pd.read_excel('data/'+str(year)+'.xlsx')
        national = pd.read_excel('data/'+str(year)+'nat.xlsx')
    if year>2004:
        metro.columns = metro.columns.str.strip().str.lower()
    national.columns = national.columns.str.strip().str.lower()
    # unify feature names in all years
    if year>2004:
        metro.rename(columns={'area_title':'area_name'},inplace=True)
    # LA area code changed
    area_la = 31100
    if year>2014:
        area_la = 31080
    if year>2004:
        # Select the metropolitan areas of interest
        metro = metro.loc[(metro.area==area_la) | (metro.area==41860) | (metro.area==16980) |
                          (metro.area==35620) | (metro.area==42660)]
        # Select the occupations of interest
        metro = metro.loc[(metro.occ_code=='27-2031') | (metro.occ_code=='27-2032') |
                          (metro.occ_code=='39-9031') | (metro.occ_code=='39-9032') |
                          (metro.occ_code=='00-0000')]
        # Change area codes to the unique area names
        metro['area'] = metro.apply(lambda x: zipcode_area[x['area']], axis=1)
    # Select the occupations of interest
    national = national.loc[(national.occ_code=='27-2031') | (national.occ_code=='27-2032') |
                            (national.occ_code=='39-9031') | (national.occ_code=='39-9032') |
                            (national.occ_code=='00-0000')]
    # To match columns with the metropolitan dataframe
    national['area'] = 'All'
    national['area_name'] = 'U.S. all'
    # Keep only the columns to use
    cols = ['area', 'area_name', 'occ_code', 'occ_title',
            'tot_emp', 'emp_prse', 'h_mean', 'a_mean', 'mean_prse', 'h_pct10',
            'h_pct25', 'h_median', 'h_pct75', 'h_pct90', 'a_pct10', 'a_pct25',
            'a_median', 'a_pct75', 'a_pct90']
    if year>2004:
        metro = metro[cols]
    national = national[cols]
    # combine national data with metropolitan data
    if year>2004:
        emp = pd.concat([national,metro], ignore_index=True)
    else:
        emp = national
    # add year
    emp['year'] = year
    # add the unique occupation name
    emp['occ'] = emp.apply(lambda x: code_job[x['occ_code']], axis=1)
    # Cleaning: '**' and '*' mark unavailable estimates
    emp.replace('**',np.nan,inplace=True)
    emp.replace('*',np.nan,inplace=True)
    # Append to a list to save
    df_save.append(emp)
# Make a csv file
df_save = pd.concat(df_save)
df_save.to_csv('data/emp.csv', index=False)
# Check data is prepared as intended
emp = pd.read_csv('data/emp.csv')
emp.info()
# Confirm that city and occupation labels are correct
print("Check area names are correctly marked ------ ")
for x in zipcode_area.values():
    print(x, emp[emp.area==x].area_name.unique())
print("\n Check occupation names are correctly marked ------ ")
for x in code_job.values():
    print(x, emp[emp.occ==x].occ_title.unique())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 497 non-null object
1 area_name 497 non-null object
2 occ_code 497 non-null object
3 occ_title 497 non-null object
4 tot_emp 462 non-null float64
5 emp_prse 462 non-null float64
6 h_mean 491 non-null float64
7 a_mean 402 non-null float64
8 mean_prse 491 non-null float64
9 h_pct10 491 non-null float64
10 h_pct25 491 non-null float64
11 h_median 491 non-null float64
12 h_pct75 491 non-null float64
13 h_pct90 491 non-null float64
14 a_pct10 402 non-null float64
15 a_pct25 402 non-null float64
16 a_median 402 non-null float64
17 a_pct75 402 non-null float64
18 a_pct90 402 non-null float64
19 year 497 non-null int64
20 occ 497 non-null object
dtypes: float64(15), int64(1), object(5)
memory usage: 81.7+ KB
Check area names are correctly marked ------
LA ['Los Angeles-Long Beach-Santa Ana, CA'
'Los Angeles-Long Beach-Anaheim, CA']
LA ['Los Angeles-Long Beach-Santa Ana, CA'
'Los Angeles-Long Beach-Anaheim, CA']
SanFran ['San Francisco-Oakland-Fremont, CA' 'San Francisco-Oakland-Hayward, CA']
Chicago ['Chicago-Naperville-Joilet, IL-IN-WI'
'Chicago-Naperville-Joliet, IL-IN-WI'
'Chicago-Joliet-Naperville, IL-IN-WI'
'Chicago-Naperville-Elgin, IL-IN-WI']
NY ['New York-Northern New Jersey-Long Island, NY-NJ-PA'
'New York-Newark-Jersey City, NY-NJ-PA']
Seattle ['Seattle-Tacoma-Bellevue, WA']
Check occupation names are correctly marked ------
dancer ['Dancers']
choreo ['Choreographers']
fit_trainer ['Fitness trainers and aerobics instructors'
'Fitness Trainers and Aerobics Instructors'
'Exercise Trainers and Group Fitness Instructors']
rec_worker ['Recreation workers' 'Recreation Workers']
all_jobs ['All Occupations']
# Since areas and occupations are correctly marked, drop the redundant label columns
emp.drop(['area_name','occ_code','occ_title'],axis=1,inplace=True)
print(len(emp)) # 17 years x 6 areas x 5 jobs + 2 years x 1 area x 5 jobs = 520 rows, if no records were missing
display(emp.sample(5))
497
area | tot_emp | emp_prse | h_mean | a_mean | mean_prse | h_pct10 | h_pct25 | h_median | h_pct75 | h_pct90 | a_pct10 | a_pct25 | a_median | a_pct75 | a_pct90 | year | occ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
385 | Seattle | NaN | NaN | 37.82 | 78670.0 | 10.3 | 20.78 | 34.29 | 40.35 | 46.00 | 49.36 | 43220.0 | 71320.0 | 83920.0 | 95690.0 | 102660.0 | 2017 | choreo |
410 | NY | 70.0 | 33.8 | 40.86 | 84990.0 | 12.8 | 17.68 | 21.40 | 37.44 | 59.55 | 73.45 | 36780.0 | 44500.0 | 77870.0 | 123870.0 | 152780.0 | 2018 | choreo |
384 | Seattle | 110.0 | 10.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2017 | dancer |
166 | LA | 13510.0 | 3.8 | 12.09 | 25150.0 | 1.1 | 8.76 | 9.65 | 11.32 | 13.75 | 16.49 | 18210.0 | 20060.0 | 23550.0 | 28590.0 | 34290.0 | 2010 | rec_worker |
291 | NY | 8615710.0 | 0.3 | 28.39 | 59060.0 | 0.5 | 9.19 | 12.47 | 20.99 | 35.83 | 55.72 | 19120.0 | 25950.0 | 43660.0 | 74530.0 | 115900.0 | 2014 | all_jobs |
# Merge to df
df = df.merge(emp, how='outer', on=['area','year'])
print(len(df)) # 520 wage rows if none were missing, plus 89 rows with no wage record
# (15 years x 5 metro areas + 14 years x 1 national area = 89): at most 609 rows
display(df.sample(5))
586
year | area | cpi | coli | coli_cpi | population | fitness | dance_studio | studio_num | studio_emp | ... | h_pct25 | h_median | h_pct75 | h_pct90 | a_pct10 | a_pct25 | a_median | a_pct75 | a_pct90 | occ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64 | 2000 | SanFran | 180.200 | 164.0 | 0.720977 | NaN | 12.543 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
520 | 2019 | LA | 274.114 | 136.4 | 0.603823 | 13214799.0 | 35.889 | 4.20 | 65723.0 | 126288.0 | ... | 20.69 | 34.81 | 54.68 | 59.31 | 26400.0 | 43020.0 | 72400.0 | 113740.0 | 123350.0 | choreo |
203 | 2008 | LA | 225.008 | 136.4 | 0.603823 | NaN | 22.339 | NaN | NaN | NaN | ... | 14.11 | 21.63 | 28.35 | 33.29 | 25440.0 | 29340.0 | 44980.0 | 58970.0 | 69240.0 | choreo |
42 | 1997 | NY | 170.800 | 216.7 | 0.899678 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
357 | 2013 | All | 232.957 | NaN | NaN | 316128839.0 | 25.803 | 3.28 | 48399.0 | 93420.0 | ... | 10.15 | 15.88 | 23.36 | 32.19 | 17840.0 | 21110.0 | 33020.0 | 48590.0 | 66950.0 | fit_trainer |
5 rows × 27 columns
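merge can also assert the expected cardinality via validate; each (area, year) pair on the left matches up to five occupation rows, so the wage merge above is one-to-many (a sketch):
# df = df.merge(emp, how='outer', on=['area','year'], validate='one_to_many')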
Save organized dataset
Now we have all the data prepared. Let’s save it for the next steps.
# Rows with no matching wage record get the aggregate occupation label
df.loc[df.occ.isna(),'occ']='all_jobs'
#%%script false --no-raise-error
# If you already have data/dance1.csv, this block can be skipped.
df.to_csv('data/dance1.csv',index=False)
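A quick round-trip check that the saved file reloads intact (a sketch):
check = pd.read_csv('data/dance1.csv')
print(len(check) == len(df)) # True if the save succeeded
display(check.sample(3))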