Here is the code I used to manipulate and merge multiple datasets for my dancer business project. I hope it is useful!



Load libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

DataFrame to save

# Dataframe to merge everything
df = None

Scale/adjustment factor data

Before we begin, let’s import some useful economic index data and census data. Although these data won’t be used for modeling, they will help us spot business trends during the exploratory analysis.

Consumer Price Index (CPI)

The CPI can be used to account for inflation. I downloaded the numbers for the areas of interest from the website below.
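As a quick illustration of how the index will be used (the amount is made up; the index values are NY’s 1990 and 2021 CPI from the table below): to restate a dollar amount in another year’s dollars, multiply by the ratio of the two index values.

```python
# Sketch: inflation-adjust a dollar amount with CPI
def to_base_year_dollars(amount, cpi_then, cpi_base):
    """Rescale `amount` from the year with index `cpi_then`
    to the year with index `cpi_base`."""
    return amount * cpi_base / cpi_then

# $100 in 1990 NY (CPI 138.5) restated in 2021 NY terms (CPI 292.303)
print(round(to_base_year_dollars(100, 138.5, 292.303), 2))  # -> 211.05
```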

cpi = pd.read_csv('data/CPI.csv')
display(cpi)
year NY Chicago Seattle LA SanFran All
0 1990 138.500 131.700 126.800 135.900 132.100 130.6500
1 1991 144.800 137.000 134.100 141.400 137.900 136.2000
2 1992 150.000 141.100 139.000 146.500 142.500 140.3000
3 1993 154.500 145.400 142.900 150.300 146.300 144.5000
4 1994 158.200 148.600 147.800 152.300 148.700 148.2500
5 1995 162.200 153.300 152.300 154.600 151.600 152.3500
6 1996 166.900 157.400 157.500 157.500 155.100 156.8500
7 1997 170.800 161.700 163.000 160.000 160.400 160.5500
8 1998 173.600 165.000 167.700 162.300 165.500 163.0000
9 1999 177.000 168.400 172.800 166.100 172.500 166.6000
10 2000 182.500 173.800 179.200 171.600 180.200 172.2000
11 2001 187.100 178.300 185.700 177.300 189.900 177.0500
12 2002 191.900 181.200 189.300 182.200 193.000 179.9000
13 2003 197.800 184.500 192.300 187.000 196.400 183.9500
14 2004 204.800 188.600 194.700 193.200 198.800 188.9000
15 2005 212.700 194.300 200.200 201.800 202.700 195.3000
16 2006 220.700 198.300 207.600 210.400 209.200 201.6000
17 2007 226.940 204.818 215.656 217.338 216.048 207.3425
18 2008 235.782 212.536 224.719 225.008 222.767 215.3030
19 2009 236.825 209.995 226.028 223.219 224.395 214.5370
20 2010 240.864 212.870 226.693 225.894 227.469 218.0555
21 2011 247.718 218.684 232.765 231.928 233.390 224.9390
22 2012 252.588 222.005 238.663 236.648 239.650 229.5940
23 2013 256.833 224.545 241.563 239.207 245.023 232.9570
24 2014 260.230 228.468 246.018 242.434 251.985 236.7360
25 2015 260.558 227.792 249.364 244.632 258.572 237.0170
26 2016 263.365 229.302 254.886 249.246 266.344 240.0075
27 2017 268.520 233.611 262.668 256.210 274.924 245.1195
28 2018 273.641 237.706 271.089 265.962 285.550 251.1070
29 2019 278.164 241.181 277.984 274.114 295.004 255.6575
30 2020 282.920 243.873 282.693 278.567 300.084 258.8110
31 2021 292.303 254.159 295.560 289.244 309.721 270.9695
# Merge to df
df = pd.melt(cpi, id_vars=['year'], value_vars=cpi.columns[1:], 
            var_name='area', value_name='cpi') 

print(len(df)) # should be 32 years x 6 area = 192 rows
display(df.sample(5))
192
year area cpi
138 2000 SanFran 180.200
165 1995 All 152.350
89 2015 Seattle 249.364
85 2011 Seattle 232.765
35 1993 Chicago 145.400

Cost of Living Index (COLI)

The CPI may not reflect the actual cost of living. The COLI is closer to what people actually spend, so when judging income levels we should scale by the COLI.

Unlike the CPI, COLI data was difficult to find. I found 2010 COLI data on Census.gov, and I will assume the relative values between cities stay the same over the years.

It is also unclear whether this index is calculated for the metropolitan statistical area (broader) or only for the city proper (smaller). I’ll assume the former, which matches the area division of the wage statistics.
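To illustrate the scaling (a sketch using the NY and Chicago values from the COLI table loaded below): a salary in one city is converted to its purchasing-power equivalent in another by the ratio of the two indices.

```python
# Sketch: compare nominal salaries across cities using COLI
coli_index = {'NY': 216.7, 'Chicago': 116.9}  # values from the COLI table

def equivalent_salary(salary, from_area, to_area):
    """Salary in `to_area` with the same purchasing power as `salary` in `from_area`."""
    return salary * coli_index[to_area] / coli_index[from_area]

# A $100k NY salary buys roughly like a ~$53.9k Chicago salary
print(round(equivalent_salary(100_000, 'NY', 'Chicago')))
```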

coli = pd.read_csv('data/COLI.csv')
coli = coli[coli.area!='Percent']
display(coli)
area coli Grocery Housing Utilities Transportation HealthCare Etc
0 NY 216.7 154.3 386.7 169.6 120.3 130.2 145.7
1 Chicago 116.9 111.2 134.8 117.3 116.5 108.5 104.4
2 Seattle 121.4 115.1 140.3 85.7 118.8 119.9 119.1
3 LA 136.4 106.0 207.1 101.7 113.6 109.1 107.0
4 SanFran 164.0 111.9 281.0 94.5 113.0 117.0 124.3

Housing price is the dominant driving factor.

# Merge to df
df = df.merge(coli[['area','coli']], how = 'outer', on = ['area'])

print(len(df)) # should be 32 years x 6 area = 192 rows
display(df.sample(5))
192
year area cpi coli
147 2009 SanFran 224.395 164.0
72 1998 Seattle 167.700 121.4
61 2019 Chicago 241.181 116.9
37 1995 Chicago 153.300 116.9
119 2013 LA 239.207 136.4
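Since `coli` holds one row per area, this merge broadcasts each area’s COLI across all of its years. pandas can check that expectation for you; a minimal sketch on toy frames (`validate='m:1'` raises a `MergeError` if the right side contained duplicate areas):

```python
import pandas as pd

left = pd.DataFrame({'area': ['NY', 'NY', 'LA'], 'year': [2010, 2011, 2010]})
right = pd.DataFrame({'area': ['NY', 'LA'], 'coli': [216.7, 136.4]})

# validate='m:1' asserts `right` contributes at most one row per area
merged = left.merge(right, how='left', on='area', validate='m:1')
print(merged)
```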
# COLI adjusted CPI (2010)
t = df[df.year==2010].copy() # explicit copy avoids SettingWithCopyWarning
t['coli_cpi'] = t.coli/t.cpi
t = t[['area','coli_cpi']]
df = df.merge(t,how='left',on=['area'])
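Chained assignment on a filtered slice is what triggers pandas’ SettingWithCopyWarning; the computed result is still correct, but the unambiguous pattern is to take an explicit copy before adding the column. A minimal sketch with NY’s values from the tables above:

```python
import pandas as pd

# Toy frame standing in for df (NY's 2009/2010 CPI and its constant COLI)
df = pd.DataFrame({'year': [2009, 2010],
                   'cpi': [236.825, 240.864],
                   'coli': [216.7, 216.7]})

# .copy() makes `t` an independent frame, so the new column is added unambiguously
t = df[df.year == 2010].copy()
t['coli_cpi'] = t.coli / t.cpi
print(t)
```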

Census - U.S. and Metropolitan area population

Metropolitan statistical area population data

# Check metropolitan area census dataset
census = pd.read_csv('data/census.csv')
census.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2797 entries, 0 to 2796
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CBSA               2797 non-null   int64  
 1   MDIV               141 non-null    float64
 2   STCOU              1840 non-null   float64
 3   NAME               2797 non-null   object 
 4   LSAD               2797 non-null   object 
 5   CENSUS2010POP      2797 non-null   int64  
 6   ESTIMATESBASE2010  2797 non-null   int64  
 7   POPESTIMATE2010    2797 non-null   int64  
 8   POPESTIMATE2011    2797 non-null   int64  
 9   POPESTIMATE2012    2797 non-null   int64  
 10  POPESTIMATE2013    2797 non-null   int64  
 11  POPESTIMATE2014    2797 non-null   int64  
 12  POPESTIMATE2015    2797 non-null   int64  
 13  POPESTIMATE2016    2797 non-null   int64  
 14  POPESTIMATE2017    2797 non-null   int64  
 15  POPESTIMATE2018    2797 non-null   int64  
 16  POPESTIMATE2019    2797 non-null   int64  
 17  POPESTIMATE2020    5 non-null      float64
 18  POPESTIMATE2021    5 non-null      float64
dtypes: float64(4), int64(13), object(2)
memory usage: 415.3+ KB
# Select only interesting area

# CBSA codes of the Metropolitan Statistical Areas
# LA has two codes because its code changed over the years
zipcode_area = {31100:'LA',31080:'LA',41860:'SanFran',16980:'Chicago',35620:'NY',42660:'Seattle'}

lst=[]
for i in zipcode_area.keys():
    if i==31100:
        continue
    lst.append(census.loc[(census.CBSA==i)&(census.LSAD=='Metropolitan Statistical Area')])

census = pd.concat(lst)

display(census)
CBSA MDIV STCOU NAME LSAD CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016 POPESTIMATE2017 POPESTIMATE2018 POPESTIMATE2019 POPESTIMATE2020 POPESTIMATE2021
850 31080 NaN NaN Los Angeles-Long Beach-Anaheim, CA Metropolitan Statistical Area 12828837 12828957 12838417 12925753 13013443 13097434 13166609 13234696 13270694 13278000 13249879 13214799 13173266.0 12997353.0
1307 41860 NaN NaN San Francisco-Oakland-Berkeley, CA Metropolitan Statistical Area 4335391 4335593 4343634 4395725 4455473 4519636 4584981 4647924 4688198 4712421 4726314 4731803 4739649.0 4623264.0
291 16980 NaN NaN Chicago-Naperville-Elgin, IL-IN-WI Metropolitan Statistical Area 9461105 9461537 9470634 9500870 9528090 9550194 9560430 9552554 9533662 9514113 9484158 9458539 9601605.0 9509934.0
1017 35620 NaN NaN New York-Newark-Jersey City, NY-NJ-PA Metropolitan Statistical Area 18897109 18896277 18923407 19052774 19149689 19226449 19280929 19320968 19334778 19322607 19276644 19216182 20096413.0 19768458.0
1337 42660 NaN NaN Seattle-Tacoma-Bellevue, WA Metropolitan Statistical Area 3439809 3439808 3449241 3503891 3558829 3612347 3675160 3739654 3816355 3885579 3935179 3979845 4024730.0 4011553.0
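The loop above can be collapsed into a single boolean filter with `isin`; a sketch on a toy frame standing in for the census data:

```python
import pandas as pd

zipcode_area = {31080: 'LA', 41860: 'SanFran', 16980: 'Chicago',
                35620: 'NY', 42660: 'Seattle'}

# Toy frame with two matching MSA rows and one non-MSA row
census = pd.DataFrame({
    'CBSA': [31080, 16980, 99999],
    'LSAD': ['Metropolitan Statistical Area'] * 2 + ['County'],
})

# One vectorized filter instead of looping over the codes
selected = census[census.CBSA.isin(zipcode_area.keys()) &
                  (census.LSAD == 'Metropolitan Statistical Area')]
print(selected)
```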
# Select only interesting fields
census.drop(['MDIV','STCOU','LSAD','CENSUS2010POP','ESTIMATESBASE2010'],axis=1,inplace=True)

# Change names of fields
census.columns = ['CBSA','NAME',2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021]

# Transpose
census = census.T

# Put area names as field names
census.columns = ['LA','SanFran','Chicago','NY','Seattle']


# Leave only yearly population rows
census.drop(['CBSA','NAME'],inplace=True)

display(census)
LA SanFran Chicago NY Seattle
2010 12838417 4343634 9470634 18923407 3449241
2011 12925753 4395725 9500870 19052774 3503891
2012 13013443 4455473 9528090 19149689 3558829
2013 13097434 4519636 9550194 19226449 3612347
2014 13166609 4584981 9560430 19280929 3675160
2015 13234696 4647924 9552554 19320968 3739654
2016 13270694 4688198 9533662 19334778 3816355
2017 13278000 4712421 9514113 19322607 3885579
2018 13249879 4726314 9484158 19276644 3935179
2019 13214799 4731803 9458539 19216182 3979845
2020 13173266.0 4739649.0 9601605.0 20096413.0 4024730.0
2021 12997353.0 4623264.0 9509934.0 19768458.0 4011553.0

National population data

# Check one example file of national population
demo = pd.read_csv('data/C2015.csv')

demo.info()
display(demo.head(5))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 5 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Label (Grouping)                        89 non-null     object
 1   United States!!Estimate                 84 non-null     object
 2   United States!!Margin of Error          84 non-null     object
 3   United States!!Percent                  84 non-null     object
 4   United States!!Percent Margin of Error  84 non-null     object
dtypes: object(5)
memory usage: 3.6+ KB
Label (Grouping) United States!!Estimate United States!!Margin of Error United States!!Percent United States!!Percent Margin of Error
0 SEX AND AGE NaN NaN NaN NaN
1 Total population 321,418,821 ***** 321,418,821 (X)
2 Male 158,167,834 ±31,499 49.2% ±0.1
3 Female 163,250,987 ±31,500 50.8% ±0.1
4 Under 5 years 19,793,807 ±16,520 6.2% ±0.1
%%script false --no-raise-error
# If you already have data/usDemo.csv, this block can be skipped.

# Combine multiple year files of national population
df_save = []
for year in range(2010,2020):
    demo = pd.read_csv('data/C{0}.csv'.format(year))
    demo['year'] = year
    df_save.append(demo)
    
# Make a csv file
df_save = pd.concat(df_save)

df_save.columns = ['label','estimate','estimate_err','pct','pct_err','year','estimate_err2']
df_save.to_csv('data/usDemo.csv', index=False)
census_national = pd.read_csv('data/usDemo.csv')
display(census_national)
label estimate estimate_err pct pct_err year estimate_err2
0 SEX AND AGE NaN NaN NaN NaN 2010 NaN
1 Total population 309,349,689 ***** 309,349,689 (X) 2010 NaN
2 Male 152,089,450 ±27,325 49.2% ±0.1 2010 NaN
3 Female 157,260,239 ±27,325 50.8% ±0.1 2010 NaN
4 Under 5 years 20,133,943 ±20,568 6.5% ±0.1 2010 NaN
... ... ... ... ... ... ... ...
880 Total housing units 139,686,209 NaN (X) (X) 2019 ±6,973
881 CITIZEN, VOTING AGE POPULATION NaN NaN NaN NaN 2019 NaN
882 Citizen, 18 and over population 235,418,734 NaN 235,418,734 (X) 2019 ±159,764
883 Male 114,206,194 NaN 48.5% ±0.1 2019 ±98,225
884 Female 121,212,540 NaN 51.5% ±0.1 2019 ±79,689

885 rows × 7 columns

# Leave data to use only
# Change format to merge with metropolitan census dataframe

population=[]
years=[]

# make year:population dictionary
for year in range(2010,2020):
    pop = int(''.join(census_national[(census_national.label.str.contains('Total population'))\
                                      &(census_national.year==year)].iloc[0].estimate.split(',')))
    population.append(pop)
    years.append(year)
    

census_national = pd.DataFrame({'year':years,'All':population})
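The comma-stripping done with `split`/`join` above can also be vectorized over a whole column; a sketch of the `str.replace` + `astype` idiom on two of the totals:

```python
import pandas as pd

# Formatted population strings as they appear in the census export
estimates = pd.Series(['309,349,689', '321,418,821'])

# Strip thousands separators, then cast to integer
parsed = estimates.str.replace(',', '', regex=False).astype('int64')
print(parsed.tolist())  # -> [309349689, 321418821]
```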
# Add extra years

census_national_other = pd.DataFrame([
    [2020, 331501080],
    [2021, 331893745],
    [1998, 275854000],
    [1999, 279040000],
    [2000, 282172000],
    [2001, 285082000],
    [2002, 287804000],
    [2003, 290326000],
    [2004, 293046000],
    [2005, 295753000],
    [2006, 298593000],
    [2007, 301580000],
    [2008, 304375000],
    [2009, 307007000]], 
    columns=['year','All'])


census_national = pd.concat([census_national, census_national_other], ignore_index=True)


census_national = census_national.sort_values(by=['year'])

display(census_national)
year All
12 1998 275854000
13 1999 279040000
14 2000 282172000
15 2001 285082000
16 2002 287804000
17 2003 290326000
18 2004 293046000
19 2005 295753000
20 2006 298593000
21 2007 301580000
22 2008 304375000
23 2009 307007000
0 2010 309349689
1 2011 311591919
2 2012 313914040
3 2013 316128839
4 2014 318857056
5 2015 321418821
6 2016 323127515
7 2017 325719178
8 2018 327167439
9 2019 328239523
10 2020 331501080
11 2021 331893745

Merge national population to metropolitan population

census = pd.merge(census_national, census, how='outer', right_index=True, left_on='year')

# Change data type to numeric
census = census.apply(pd.to_numeric)

census.info()
display(census)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 12 to 11
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   year     24 non-null     int64  
 1   All      24 non-null     int64  
 2   LA       12 non-null     float64
 3   SanFran  12 non-null     float64
 4   Chicago  12 non-null     float64
 5   NY       12 non-null     float64
 6   Seattle  12 non-null     float64
dtypes: float64(5), int64(2)
memory usage: 1.5 KB
year All LA SanFran Chicago NY Seattle
12 1998 275854000 NaN NaN NaN NaN NaN
13 1999 279040000 NaN NaN NaN NaN NaN
14 2000 282172000 NaN NaN NaN NaN NaN
15 2001 285082000 NaN NaN NaN NaN NaN
16 2002 287804000 NaN NaN NaN NaN NaN
17 2003 290326000 NaN NaN NaN NaN NaN
18 2004 293046000 NaN NaN NaN NaN NaN
19 2005 295753000 NaN NaN NaN NaN NaN
20 2006 298593000 NaN NaN NaN NaN NaN
21 2007 301580000 NaN NaN NaN NaN NaN
22 2008 304375000 NaN NaN NaN NaN NaN
23 2009 307007000 NaN NaN NaN NaN NaN
0 2010 309349689 12838417.0 4343634.0 9470634.0 18923407.0 3449241.0
1 2011 311591919 12925753.0 4395725.0 9500870.0 19052774.0 3503891.0
2 2012 313914040 13013443.0 4455473.0 9528090.0 19149689.0 3558829.0
3 2013 316128839 13097434.0 4519636.0 9550194.0 19226449.0 3612347.0
4 2014 318857056 13166609.0 4584981.0 9560430.0 19280929.0 3675160.0
5 2015 321418821 13234696.0 4647924.0 9552554.0 19320968.0 3739654.0
6 2016 323127515 13270694.0 4688198.0 9533662.0 19334778.0 3816355.0
7 2017 325719178 13278000.0 4712421.0 9514113.0 19322607.0 3885579.0
8 2018 327167439 13249879.0 4726314.0 9484158.0 19276644.0 3935179.0
9 2019 328239523 13214799.0 4731803.0 9458539.0 19216182.0 3979845.0
10 2020 331501080 13173266.0 4739649.0 9601605.0 20096413.0 4024730.0
11 2021 331893745 12997353.0 4623264.0 9509934.0 19768458.0 4011553.0
# Merge to df
census = pd.melt(census, id_vars=['year'], value_vars=cpi.columns[1:], 
        var_name='area', value_name='population')

df = df.merge(census, how='outer', on= ['area','year'])

print(len(df)) # should be 32 years x 6 area = 192 rows
display(df.sample(5))
192
year area cpi coli coli_cpi population
53 2011 Chicago 218.6840 116.9 0.549161 9500870.0
61 2019 Chicago 241.1810 116.9 0.549161 9458539.0
42 2000 Chicago 173.8000 116.9 0.549161 NaN
180 2010 All 218.0555 NaN NaN 309349689.0
91 2017 Seattle 262.6680 121.4 0.535526 3885579.0

U.S. market data

I created the “data/rev.csv” file by combining U.S. market statistics from multiple sources.

# Read data file
rev = pd.read_csv('data/rev.csv')

# Set every money scale to billion dollars
rev.fitness = rev.fitness/1000
rev.studio_wage = rev.studio_wage/1000

display(rev)
# fitness: U.S. fitness and recreational sports center revenue
# dance_studio: U.S. dance studio revenue
# studio_num: number of businesses of U.S. dance studio
# studio_emp: number of employees of U.S. dance studio
# studio_wage: total wage of U.S. dance studio
year fitness dance_studio studio_num studio_emp studio_wage
0 1998 10.797 NaN NaN NaN NaN
1 1999 11.777 NaN NaN NaN NaN
2 2000 12.543 NaN NaN NaN NaN
3 2001 13.542 NaN NaN NaN NaN
4 2002 14.987 NaN NaN NaN NaN
5 2003 16.287 NaN NaN NaN NaN
6 2004 17.174 NaN NaN NaN NaN
7 2005 18.286 NaN NaN NaN NaN
8 2006 19.447 NaN NaN NaN NaN
9 2007 21.416 NaN NaN NaN NaN
10 2008 22.339 NaN NaN NaN NaN
11 2009 21.842 NaN NaN NaN NaN
12 2010 22.311 NaN NaN NaN NaN
13 2011 23.191 3.04 NaN NaN NaN
14 2012 24.051 3.22 47269.0 90668.0 0.9026
15 2013 25.803 3.28 48399.0 93420.0 0.9029
16 2014 27.001 3.42 52942.0 99696.0 0.9504
17 2015 28.838 3.59 55523.0 104321.0 1.0148
18 2016 31.223 3.70 56412.0 107832.0 1.0798
19 2017 33.042 3.87 58515.0 114075.0 1.1448
20 2018 33.971 4.10 63363.0 120456.0 1.1787
21 2019 35.889 4.20 65723.0 126288.0 1.2295
22 2020 24.361 3.43 62808.0 112485.0 1.0768
23 2021 NaN 3.72 66266.0 120081.0 1.1531
24 2022 NaN 3.83 68393.0 123680.0 1.1876
# Merge to df
df= df.merge(rev, how = 'outer', on= ['year'])

# for convenience
df.loc[df.area.isna(),'area']='All'

print(len(df)) # should be 32 years x 6 area + 1 year row = 193 rows
display(df.sample(3))
193
year area cpi coli coli_cpi population fitness dance_studio studio_num studio_emp studio_wage
138 2013 NY 256.833 216.7 0.899678 19226449.0 25.803 3.28 48399.0 93420.0 0.9029
115 2009 Chicago 209.995 116.9 0.549161 NaN 21.842 NaN NaN NaN NaN
102 2007 NY 226.940 216.7 0.899678 NaN 21.416 NaN NaN NaN NaN

Employee statistics data

This website provides tables of employment statistics (wage, number of employees, etc.) for each area and year. I downloaded each year’s file; below they are cleaned and concatenated.

Explanation of fields

Here are the definitions of the fields. Fields not explained here are not used in this analysis.

Area identifier

  • area: area code
  • area_name(title): Area name

Job identifier

  • occ_code: The 6-digit Standard Occupational Classification (SOC) code or OEWS-specific code for the occupation
  • occ_title: SOC title or OEWS-specific title for the occupation

Number of employee

  • tot_emp: Estimated total employment rounded to the nearest 10 (excludes self-employed).
  • emp_prse: Percent relative standard error (PRSE) for the employment estimate. PRSE is a measure of sampling error, expressed as a percentage of the corresponding estimate. Sampling error occurs when values for a population are estimated from a sample survey of the population, rather than calculated from data for all members of the population. Estimates with lower PRSEs are typically more precise in the presence of sampling error.

Wage

  • h_mean: Mean hourly wage
  • a_mean: Mean annual wage

  • mean_prse: Percent relative standard error (PRSE) for the mean wage estimate.
  • h_pct10: Hourly 10th percentile wage
  • h_pct25: Hourly 25th percentile wage
  • h_median: Hourly median wage (or the 50th percentile)
  • h_pct75: Hourly 75th percentile wage
  • h_pct90: Hourly 90th percentile wage

  • a_pct10: Annual 10th percentile wage
  • a_pct25: Annual 25th percentile wage
  • a_median: Annual median wage (or the 50th percentile)
  • a_pct75: Annual 75th percentile wage
  • a_pct90: Annual 90th percentile wage
# Let's check how each file looks
sample = pd.read_excel('data/2010.xls')

sample.info()
display(sample.sample(2))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7940 entries, 0 to 7939
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   PRIM_STATE    7940 non-null   object
 1   AREA          7940 non-null   int64 
 2   AREA_NAME     7940 non-null   object
 3   OCC_CODE      7940 non-null   object
 4   OCC_TITLE     7940 non-null   object
 5   GROUP         253 non-null    object
 6   TOT_EMP       7940 non-null   object
 7   EMP_PRSE      7940 non-null   object
 8   JOBS_1000     7940 non-null   object
 9   LOC QUOTIENT  7940 non-null   object
 10  H_MEAN        7940 non-null   object
 11  A_MEAN        7940 non-null   object
 12  MEAN_PRSE     7940 non-null   object
 13  H_PCT10       7940 non-null   object
 14  H_PCT25       7940 non-null   object
 15  H_MEDIAN      7940 non-null   object
 16  H_PCT75       7940 non-null   object
 17  H_PCT90       7940 non-null   object
 18  A_PCT10       7940 non-null   object
 19  A_PCT25       7940 non-null   object
 20  A_MEDIAN      7940 non-null   object
 21  A_PCT75       7940 non-null   object
 22  A_PCT90       7940 non-null   object
 23  ANNUAL        562 non-null    object
 24  HOURLY        38 non-null     object
dtypes: int64(1), object(24)
memory usage: 1.5+ MB
PRIM_STATE AREA AREA_NAME OCC_CODE OCC_TITLE GROUP TOT_EMP EMP_PRSE JOBS_1000 LOC QUOTIENT ... H_MEDIAN H_PCT75 H_PCT90 A_PCT10 A_PCT25 A_MEDIAN A_PCT75 A_PCT90 ANNUAL HOURLY
3384 IL 16980 Chicago-Naperville-Joliet, IL-IN-WI 47-2041 Carpet Installers NaN 1470 17.6 0.352 1.749 ... 22.34 32.03 39.24 23180 35810 46470 66630 81620 NaN NaN
2762 FL 33100 Miami-Fort Lauderdale-Pompano Beach, FL 51-4023 Rolling Machine Setters, Operators, and Tender... NaN 190 16.4 0.09 0.359 ... 15.46 18.64 20.53 18690 25990 32160 38780 42700 NaN NaN

2 rows × 25 columns

# jobs of interest
# dancer: dancer
# choreo: choreographer
# fit_trainer: fitness trainer/instructor
# rec_worker: recreational worker
# all_jobs: all jobs sum/mean

code_job = {'27-2031':'dancer','27-2032':'choreo',
            '39-9031':'fit_trainer','39-9032':'rec_worker',
            '00-0000':'all_jobs'}
%%script false --no-raise-error
# This block combines multiple wage data files, then generate a single csv file.
# If you already have data/wage.csv, this block can be skipped. It takes time to run.


df_save = []
for year in range(2003,2022):
    
    print(year)
    metro = None # metropolitan area statistics data
    national = None # national statistics data
    
    try:
        if year>2004:
            metro = pd.read_excel('data/'+str(year)+'.xls')
        national = pd.read_excel('data/'+str(year)+'nat.xls')
    except FileNotFoundError:
        if year>2004:
            metro = pd.read_excel('data/'+str(year)+'.xlsx')
        national = pd.read_excel('data/'+str(year)+'nat.xlsx')

    if year>2004:
        metro.columns = metro.columns.str.strip().str.lower()
    national.columns = national.columns.str.strip().str.lower()
    
    # unify feature names in all years
    if year>2004:
        metro.rename(columns={'area_title':'area_name'},inplace=True)
        
    # LA area code changed
    area_la = 31100
    if year>2014:
        area_la=31080

    if year>2004:
        # Select metropolitan area in interest
        metro = metro.loc[(metro.area==area_la) | (metro.area==41860) | (metro.area==16980) | 
                        (metro.area==35620) | (metro.area==42660)]

        # Select occupation in interest
        metro = metro.loc[(metro.occ_code=='27-2031') | (metro.occ_code=='27-2032') | 
                          (metro.occ_code=='39-9031') | (metro.occ_code=='39-9032') | 
                          (metro.occ_code=='00-0000')]
        # Map the CBSA code to the unique area names
        metro['area']=metro.apply(lambda x: zipcode_area[x['area']], axis=1)
        
    # Select occupation in interest
    national = national.loc[(national.occ_code=='27-2031') | (national.occ_code=='27-2032') |
                            (national.occ_code=='39-9031') | (national.occ_code=='39-9032') |                          
                            (national.occ_code=='00-0000')]

    


    # To match columns with metropolitan dataframe
    national['area'] = 'All'
    national['area_name'] = 'U.S. all'
    
    # Keep only columns to use
    if year>2004:
        metro = metro[['area', 'area_name', 'occ_code', 'occ_title', 
           'tot_emp', 'emp_prse', 'h_mean', 'a_mean', 'mean_prse', 'h_pct10',
           'h_pct25', 'h_median', 'h_pct75', 'h_pct90', 'a_pct10', 'a_pct25',
           'a_median', 'a_pct75', 'a_pct90']]

    national = national[['area', 'area_name', 'occ_code', 'occ_title', 
   'tot_emp', 'emp_prse', 'h_mean', 'a_mean', 'mean_prse', 'h_pct10',
   'h_pct25', 'h_median', 'h_pct75', 'h_pct90', 'a_pct10', 'a_pct25',
   'a_median', 'a_pct75', 'a_pct90']]
    
    
    # combine national data with metropolitan data
    emp=None
    
    if year>2004:
        emp = pd.concat([national,metro], ignore_index=True)
    else:
        emp = national

    # add year
    emp['year']=year

    # add the unique occupation name
    emp['occ']=emp.apply(lambda x: code_job[x['occ_code']], axis=1)

    # Cleaning
    emp.replace('**',np.nan,inplace=True)
    emp.replace('*',np.nan,inplace=True)

    # Append to a list to save
    df_save.append(emp)

# Make a csv file
df_save = pd.concat(df_save)
df_save.to_csv('data/emp.csv', index=False)
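As an alternative to the `replace('*', ...)` cleaning inside the loop, pandas readers accept an `na_values` list, so the OEWS suppression markers can become NaN at read time. A sketch using `read_csv` on an in-memory file with made-up rows:

```python
import io
import pandas as pd

# Made-up rows mimicking the wage files, with '*' / '**' suppression markers
raw = io.StringIO("occ_code,tot_emp,h_mean\n"
                  "27-2031,180,**\n"
                  "27-2032,*,34.50\n")

# '*' and '**' are converted to NaN during parsing, so no post-hoc replace is needed
emp = pd.read_csv(raw, na_values=['*', '**'])
print(emp)
```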
# Check data is prepared as intended
emp = pd.read_csv('data/emp.csv')

emp.info()

# Confirm if city and occupation labels are correct
print("Check area names are correctly marked ------ ")
for x in zipcode_area.values():
    print(x,emp[emp.area==x].area_name.unique())
    
print("\n Check occupation names are correctly marked ------ ")    
for x in code_job.values():
    print(x,emp[emp.occ==x].occ_title.unique())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   area       497 non-null    object 
 1   area_name  497 non-null    object 
 2   occ_code   497 non-null    object 
 3   occ_title  497 non-null    object 
 4   tot_emp    462 non-null    float64
 5   emp_prse   462 non-null    float64
 6   h_mean     491 non-null    float64
 7   a_mean     402 non-null    float64
 8   mean_prse  491 non-null    float64
 9   h_pct10    491 non-null    float64
 10  h_pct25    491 non-null    float64
 11  h_median   491 non-null    float64
 12  h_pct75    491 non-null    float64
 13  h_pct90    491 non-null    float64
 14  a_pct10    402 non-null    float64
 15  a_pct25    402 non-null    float64
 16  a_median   402 non-null    float64
 17  a_pct75    402 non-null    float64
 18  a_pct90    402 non-null    float64
 19  year       497 non-null    int64  
 20  occ        497 non-null    object 
dtypes: float64(15), int64(1), object(5)
memory usage: 81.7+ KB
Check area names are correctly marked ------ 
LA ['Los Angeles-Long Beach-Santa Ana, CA'
 'Los Angeles-Long Beach-Anaheim, CA']
LA ['Los Angeles-Long Beach-Santa Ana, CA'
 'Los Angeles-Long Beach-Anaheim, CA']
SanFran ['San Francisco-Oakland-Fremont, CA' 'San Francisco-Oakland-Hayward, CA']
Chicago ['Chicago-Naperville-Joilet, IL-IN-WI'
 'Chicago-Naperville-Joliet, IL-IN-WI'
 'Chicago-Joliet-Naperville, IL-IN-WI'
 'Chicago-Naperville-Elgin, IL-IN-WI']
NY ['New York-Northern New Jersey-Long Island, NY-NJ-PA'
 'New York-Newark-Jersey City, NY-NJ-PA']
Seattle ['Seattle-Tacoma-Bellevue, WA']

 Check occupation names are correctly marked ------ 
dancer ['Dancers']
choreo ['Choreographers']
fit_trainer ['Fitness trainers and aerobics instructors'
 'Fitness Trainers and Aerobics Instructors'
 'Exercise Trainers and Group Fitness Instructors']
rec_worker ['Recreation workers' 'Recreation Workers']
all_jobs ['All Occupations']
# Since area and occupations are correctly marked, let's remove them
emp.drop(['area_name','occ_code','occ_title'],axis=1,inplace=True)
print(len(emp)) # 17 years x 6 area x 5 jobs + 2 years x 1 area x 5 jobs = 520 rows, if no missing record
display(emp.sample(5))
497
area tot_emp emp_prse h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 year occ
385 Seattle NaN NaN 37.82 78670.0 10.3 20.78 34.29 40.35 46.00 49.36 43220.0 71320.0 83920.0 95690.0 102660.0 2017 choreo
410 NY 70.0 33.8 40.86 84990.0 12.8 17.68 21.40 37.44 59.55 73.45 36780.0 44500.0 77870.0 123870.0 152780.0 2018 choreo
384 Seattle 110.0 10.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2017 dancer
166 LA 13510.0 3.8 12.09 25150.0 1.1 8.76 9.65 11.32 13.75 16.49 18210.0 20060.0 23550.0 28590.0 34290.0 2010 rec_worker
291 NY 8615710.0 0.3 28.39 59060.0 0.5 9.19 12.47 20.99 35.83 55.72 19120.0 25950.0 43660.0 74530.0 115900.0 2014 all_jobs
# Merge to df
df= df.merge(emp, how = 'outer', on= ['area','year'])

print(len(df)) # 520 rows, if no missing record +
                # 89 rows, 15 years x 5 area + 14 years x 1 area = 609
display(df.sample(5))
586
year area cpi coli coli_cpi population fitness dance_studio studio_num studio_emp ... h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 occ
64 2000 SanFran 180.200 164.0 0.720977 NaN 12.543 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
520 2019 LA 274.114 136.4 0.603823 13214799.0 35.889 4.20 65723.0 126288.0 ... 20.69 34.81 54.68 59.31 26400.0 43020.0 72400.0 113740.0 123350.0 choreo
203 2008 LA 225.008 136.4 0.603823 NaN 22.339 NaN NaN NaN ... 14.11 21.63 28.35 33.29 25440.0 29340.0 44980.0 58970.0 69240.0 choreo
42 1997 NY 170.800 216.7 0.899678 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
357 2013 All 232.957 NaN NaN 316128839.0 25.803 3.28 48399.0 93420.0 ... 10.15 15.88 23.36 32.19 17840.0 21110.0 33020.0 48590.0 66950.0 fit_trainer

5 rows × 27 columns

Save organized dataset

Now we have all the data prepared. Let’s save it for the next steps.

# for convenience
df.loc[df.occ.isna(),'occ']='all_jobs'
#%%script false --no-raise-error
# If you already have data/dance1.csv, this block can be skipped.
df.to_csv('data/dance1.csv',index=False)
