Analysing the factors of growth and spatial distribution of Airbnb properties across New York City
Introduction ¶
Airbnb properties offer travellers a comparatively cheaper alternative to hotels, and they give homeowners an opportunity to turn spare or unused rooms into an additional source of income. In recent years, however, the rapid spread of Airbnb properties has become a topic of debate among the public and city authorities around the world.
Considering the above, this sample notebook studies the factors fuelling the widespread growth in the number of Airbnb listings. These include location characteristics of the concerned neighbourhoods (here, NYC census tracts) as well as demographic information about the people who live in them. The goal is to help city planners deal with the negative externalities of the Airbnb phenomenon (and similar short-term rentals) by making informed decisions when framing suitable policies.
The primary data is downloaded from the Airbnb website for the city of New York. It is supplemented with 2019 and 2017 census data obtained through Esri's geoenrichment services, and with various other datasets from the NYC Open Data portal.
Necessary Imports ¶
%matplotlib inline
import matplotlib.pyplot as plt
from datetime import datetime as dt
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from IPython.core.pylabtools import figsize
import seaborn as sns
# Machine Learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.metrics as metrics
from sklearn import preprocessing
# Arcgis api imports
import arcgis
from arcgis.geoenrichment import Country
from arcgis.features import summarize_data
from arcgis.features.enrich_data import enrich_layer
from arcgis.features import SpatialDataFrame
from arcgis.features import use_proximity
from arcgis.gis import GIS
gis = GIS('home')
Access the NYC Airbnb and Tracts dataset ¶
Airbnb Data - Contains information about 48,000 Airbnb properties available in New York as of 2019, including the location of each property, its neighbourhood characteristics and available transit facilities, information about the owner, details of the room (number of bedrooms, etc.), and the rental price per night.
NYC Tracts - A polygon shapefile consisting of 2,167 census tracts of New York City, including the area of each tract along with a unique id for each tract.
# Accessing NYCTracts
nyc_tract_full = gis.content.search('NYCTractData owner:api_data_owner', 'feature layer')[0]
nyc_tract_full
nyc_tracts_layer = nyc_tract_full.layers[0]
# Accessing airbnb NYC
airbnb_nyc2019 = gis.content.search('AnBNYC2019 owner:api_data_owner', 'feature layer')[0]
airbnb_nyc2019
airbnb_layer = airbnb_nyc2019.layers[0]
Visualizing the datasets ¶
# NYC Tracts
m1 = gis.map('New York City')
m1.add_layer(nyc_tracts_layer)
m1
# NYC Airbnb Properties
m = gis.map('Springfield Gardens, NY')
m.add_layer(airbnb_layer)
m
# extracting the dataframe from the layer and visualizing it as a pandas dataframe
pd.set_option('display.max_columns', 110)
sdf_airbnb_layer = pd.DataFrame.spatial.from_layer(airbnb_layer)
sdf_airbnb_layer.head(2)
Aggregating number of Airbnb properties by Tracts for NYC ¶
The number of Airbnb properties per tract is estimated using the polygon tract layer and the Airbnb point layer.
The Aggregate Points tool uses area features to summarize a set of point features. The boundaries from the area feature are used to collect the points within each area and use them to calculate statistics. The resulting layer displays the count of points within each area. Here, the polygon tract layer is used as the area feature, and the Airbnb point layer is used as the point feature.
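Conceptually, this is a point-in-polygon count. For readers who want to verify the logic locally, a rough GeoPandas equivalent is sketched below; it is hypothetical in that it assumes both layers have been exported to GeoDataFrames named airbnb_gdf and tracts_gdf sharing a common CRS (the notebook itself uses the hosted tool in the next cell).
# Rough local equivalent of aggregate_points using GeoPandas (hypothetical
# airbnb_gdf and tracts_gdf GeoDataFrames in the same CRS are assumed)
import geopandas as gpd

joined = gpd.sjoin(airbnb_gdf, tracts_gdf, how='inner', predicate='within')
point_counts = (joined.groupby('geoid')
                      .size()
                      .rename('Point_Count')
                      .reset_index())
point_counts.head()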
agg_result = summarize_data.aggregate_points(point_layer=airbnb_layer,
polygon_layer=nyc_tracts_layer,
output_name='airbnb_counts'+ str(dt.now().microsecond))
agg_result
# mapping the aggregated airbnb data with darker areas showing more airbnb properties per tract
aggr_map = gis.map('NY', zoomlevel=10)
aggr_map.add_layer(agg_result,{"renderer":"ClassedColorRenderer", "field_name": "Point_Count"})
aggr_map
airbnb_count_by_tract = agg_result.layers[0]
sdf_airbnb_count_by_tract = airbnb_count_by_tract.query().sdf
sdf_airbnb_count_by_tract = sdf_airbnb_count_by_tract.sort_values('geoid')
sdf_airbnb_count_by_tract.head()
Here the Point_Count field from the above aggregated dataframe returns the number of Airbnb properties per tract. This would form the target variable for this problem.
Enriching tracts with demographic data using geoenrichment service from Esri ¶
The feature data is now created using selected demographic information for each tract. This is accomplished by accessing Esri's geoenrichment services, which contain the latest census data. The full data repository is first inspected, the relevant variables are shortlisted based on a literature study, and these selected variables are then searched for and added to the feature set.
# Displaying the various data topics available for geoenrichment for USA in the Esri database
usa = Country.get('US')
type(usa)
usa_data = usa.data_collections
df_usa_data = pd.DataFrame(usa_data)
df_usa_data.head()
The dataframe above lists all the data topics available in the geoenrichment services.
# Listing the unique topics under dataCollectionID
df_usa_data.reset_index(inplace=True)
list(df_usa_data.dataCollectionID.unique())
Items can be searched using the alias field to find the related analysis variable name -- here, as an example, variables containing 'Nonprofit' are searched. From these, the relevant 'Nonprofit' data is selected.
df_usa_data[df_usa_data['alias'].str.contains('Nonprofit')]
Adding data using enrichment - At this stage a literature study is undertaken to narrow down the factors that might influence the opening of new Airbnb properties in NYC.
These factors are then identified in the USA geoenrichment database as shown above, and their variable names are compiled in a dictionary to pass to the enrichment tool.
enrichment_variables = {'classofworker.ACSCIVEMP': 'Employed Population Age 16+',
'classofworker.ACSMCIVEMP': 'Employed Male Pop Age 16+',
'classofworker.ACSMPRIVNP': 'Male 16+:Priv Nonprofit',
'classofworker.ACSMEPRIVP': 'Male 16+:Priv Profit Empl',
'classofworker.ACSMSELFI': 'Male 16+:Priv Profit Self Empl',
'classofworker.ACSMSTGOV': 'Male 16+:State Govt Wrkr',
'classofworker.ACSMFEDGOV': 'Male 16+:Fed Govt Wrkr',
'classofworker.ACSMSELFNI': 'Male 16+:Self-Emp Not Inc',
'classofworker.ACSMUNPDFM': 'Male 16+:Unpaid Family Wrkr',
'classofworker.ACSFCIVEMP': 'Employed Female Pop Age 16+',
'classofworker.ACSFEPRIVP': 'Female 16+:Priv Profit Empl',
'classofworker.ACSFSELFI': 'Female 16+:Priv Profit Self Empl',
'classofworker.ACSFPRIVNP': 'Female 16+:Priv Nonprofit',
'classofworker.ACSFLOCGOV': 'Female 16+:Local Govt Wrkr',
'classofworker.ACSFSTGOV': 'Female 16+:State Govt Wrkr',
'classofworker.ACSFFEDGOV': 'Female 16+:Fed Govt Wrkr',
'classofworker.ACSFSELFNI': 'Female 16+:Self-Emp Not Inc',
'classofworker.ACSFUNPDFM': 'Female 16+:Unpaid Family Wrkr',
'gender.MEDAGE_CY': '2019 Median Age',
'Generations.GENALPHACY': '2019 Generation Alpha Population',
'Generations.GENZ_CY': '2019 Generation Z Population',
'Generations.MILLENN_CY': '2019 Millennial Population',
'Generations.GENX_CY': '2019 Generation X Population',
'Generations.BABYBOOMCY': '2019 Baby Boomer Population',
'Generations.OLDRGENSCY': '2019 Silent & Greatest Generations Population',
'Generations.GENBASE_CY': '2019 Population by Generation Base',
'populationtotals.POPDENS_CY': '2019 Population Density',
'DaytimePopulation.DPOP_CY': '2019 Total Daytime Population',
'raceandhispanicorigin.WHITE_CY': '2019 White Population',
'raceandhispanicorigin.BLACK_CY': '2019 Black Population',
'raceandhispanicorigin.AMERIND_CY': '2019 American Indian Population',
'raceandhispanicorigin.ASIAN_CY': '2019 Asian Population',
'raceandhispanicorigin.PACIFIC_CY': '2019 Pacific Islander Population',
'raceandhispanicorigin.OTHRACE_CY': '2019 Other Race Population',
'raceandhispanicorigin.DIVINDX_CY': '2019 Diversity Index',
'households.ACSHHBPOV': 'HHs: Inc Below Poverty Level',
'households.ACSHHAPOV': 'HHs:Inc at/Above Poverty Level',
'households.ACSFAMHH': 'ACS Family Households',
'businesses.S01_BUS': 'Total Businesses (SIC)',
'businesses.N05_BUS': 'Construction Businesses (NAICS)',
'businesses.N08_BUS': 'Retail Trade Businesses (NAICS)',
'businesses.N21_BUS': 'Transportation/Warehouse Bus (NAICS)',
'ElectronicsInternet.MP09147a_B': 'Own any tablet',
'ElectronicsInternet.MP09148a_B': 'Own any e-reader',
'ElectronicsInternet.MP19001a_B': 'Have access to Internet at home',
'ElectronicsInternet.MP19070a_I': 'Index: Spend 0.5-0.9 hrs online (excl email/IM time) daily',
'ElectronicsInternet.MP19071a_B': 'Spend <0.5 hrs online (excl email/IM time) daily',
'populationtotals.TOTPOP_CY': '2019 Total Population',
'gender.MALES_CY': '2019 Male Population',
'gender.FEMALES_CY': '2019 Female Population',
'industry.EMP_CY': '2019 Employed Civilian Pop 16+',
'industry.UNEMP_CY': '2019 Unemployed Population 16+',
'industry.UNEMPRT_CY': '2019 Unemployment Rate',
'commute.ACSWORKERS': 'ACS Workers Age 16+',
'commute.ACSDRALONE': 'ACS Workers 16+: Drove Alone',
'commute.ACSCARPOOL': 'ACS Workers 16+: Carpooled',
'commute.ACSPUBTRAN': 'ACS Workers 16+: Public Transportation',
'commute.ACSBUS': 'ACS Workers 16+: Bus',
'commute.ACSSTRTCAR': 'ACS Workers 16+: Streetcar',
'commute.ACSSUBWAY': 'ACS Workers 16+: Subway',
'commute.ACSRAILRD': 'ACS Workers 16+: Railroad',
'commute.ACSFERRY': 'ACS Workers 16+: Ferryboat',
'commute.ACSTAXICAB': 'ACS Workers 16+: Taxicab',
'commute.ACSMCYCLE': 'ACS Workers 16+: Motorcycle',
'commute.ACSBICYCLE': 'ACS Workers 16+: Bicycle',
'commute.ACSWALKED': 'ACS Workers 16+: Walked',
'commute.ACSOTHTRAN': 'ACS Workers 16+: Other Means',
'commute.ACSWRKHOME': 'ACS Wrkrs 16+: Worked at Home',
'OwnerRenter.OWNER_CY': '2019 Owner Occupied HUs',
'OwnerRenter.RENTER_CY': '2019 Renter Occupied HUs',
'vacant.VACANT_CY': '2019 Vacant Housing Units',
'homevalue.MEDVAL_CY': '2019 Median Home Value',
'housingunittotals.TOTHU_CY': '2019 Total Housing Units',
'yearbuilt.ACSMEDYBLT': 'ACS Median Year Structure Built: HUs',
'SpendingTotal.X1001_X': '2019 Annual Budget Exp',
'transportation.X6001_X': '2019 Transportation',
'households.ACSTOTHH': 'ACS Total Households',
'DaytimePopulation.DPOPWRK_CY': '2019 Daytime Pop: Workers',
'DaytimePopulation.DPOPRES_CY': '2019 Daytime Pop: Residents',
'DaytimePopulation.DPOPDENSCY': '2019 Daytime Pop Density',
'occupation.OCCPROT_CY': '2019 Occupation: Protective Service',
'occupation.OCCFOOD_CY': '2019 Occupation: Food Preparation',
'occupation.OCCPERS_CY': '2019 Occupation: Personal Care',
'occupation.OCCADMN_CY': '2019 Occupation: Office/Admin',
'occupation.OCCCONS_CY': '2019 Occupation: Construction/Extraction',
'occupation.OCCPROD_CY': '2019 Occupation: Production'
}
# Enrichment operation using ArcGIS API for Python
enrichment_variables_df = pd.DataFrame.from_dict(enrichment_variables, orient='index',columns=['Variable Definition'])
enrichment_variables_df.reset_index(level=0, inplace=True)
enrichment_variables_df.columns = ['AnalysisVariable','Variable Definition']
enrichment_variables_df.head()
# Converting the variable names to a list for passing them to the enrichment tool
variable_names = enrichment_variables_df['AnalysisVariable'].tolist()
# checking the first few values of the list
variable_names[:5]
# Data Enriching operation
airbnb_count_by_tract_enriched = enrich_layer(airbnb_count_by_tract,
analysis_variables = variable_names,
output_name='airbnb_tract_enrich1'+ str(dt.now().microsecond))
# Extracting the resulting enriched dataframe after the geoenrichment method
sdf_airbnb_count_by_tract_enriched = airbnb_count_by_tract_enriched.layers[0].query().sdf
# Visualizing the data as a pandas dataframe
print(sdf_airbnb_count_by_tract_enriched.columns)
sdf_airbnb_count_by_tract_enriched_sorted = sdf_airbnb_count_by_tract_enriched.sort_values('geoid')
sdf_airbnb_count_by_tract_enriched_sorted.head()
The field names of the enriched dataframe are code words that need elaboration. They are therefore renamed with their actual definitions, using the variable-definition mapping created earlier when the variables were selected.
enrichment_variables_df.head()
enrichment_variables_copy = enrichment_variables_df.copy()
enrichment_variables_copy.head(2)
enrichment_variables_copy['AnalysisVariable'] = enrichment_variables_copy.AnalysisVariable.str.split(pat='.', expand=True)[1]
enrichment_variables_copy
enrichment_variables_copy.set_index("AnalysisVariable", drop=True, inplace=True)
dictionary = enrichment_variables_copy.to_dict()
new_columns = dictionary['Variable Definition']
# Field renamed and new dataframe visualized
pd.set_option('display.max_columns', 150)
sdf_airbnb_count_by_tract_enriched_sorted.rename(columns=new_columns, inplace=True)
sdf_airbnb_count_by_tract_enriched_sorted.head()
The renamed dataframe above is now self-explanatory and hence more interpretable.
Estimating distances of tracts from various city features ¶
The next set of features will be the distances of each tract from various city features. These distance variables accomplish two important tasks.
First, they bring the spatial component of the Airbnb development phenomenon into the model.
Second, each Airbnb property is affected by unique locational factors. This is reflected in Airbnb reviews, where the most highly rated and in-demand properties are located in neighbourhoods with good transit accessibility. These factors are accounted for in the model by including the distances from each tract to different public transit options.
The hypothesis is that tracts located near transit hubs -- subway stations, bus stops, railroad lines, subway routes, etc. -- might attract more Airbnb properties. Similarly, the central business district (CBD), which for New York is in lower Manhattan, might also influence Airbnb activity, since it is the city's main business hub. Below, these distances are estimated using the proximity methods of the ArcGIS API for Python.
# accessing the various city feature shapefiles from the ArcGIS portal
busi_distr = gis.content.search('BusinessDistricts owner:api_data_owner', 'feature layer')[0]
cbd = gis.content.search('NYCBD owner:api_data_owner', 'feature layer')[0]
bus_stop = gis.content.search('NYCBusStop owner:api_data_owner', 'feature layer')[0]
hotels = gis.content.search('NYCHotels owner:api_data_owner', 'feature layer')[0]
railroad = gis.content.search('NYCRailroad owner:api_data_owner', 'feature layer')[0]
subwy_rt = gis.content.search('NYCSubwayRoutes owner:api_data_owner', 'feature layer')[0]
subwy_stn = gis.content.search('NYCSubwayStation owner:api_data_owner', 'feature layer')[0]
bus_stop_lyr = bus_stop.layers[0]
cbd_lyr = cbd.layers[0]
hotels_lyr = hotels.layers[0]
subwy_stn_lyr =subwy_stn.layers[0]
subwy_rt_lyr = subwy_rt.layers[0]
railroad_lyr = railroad.layers[0]
busi_distrs_lyr = busi_distr.layers[0]
# Avoid warnings for chained assignment operations
pd.set_option('mode.chained_assignment', None)
# Estimating Tract to hotel distances
tract_hotel_dist = use_proximity.find_nearest(nyc_tracts_layer,
hotels_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='ny_tract_hotel_dist1' + str(dt.now().microsecond))
tract_hotel_dist.layers
tract_hotel_dist_lyr = tract_hotel_dist.layers[1]
sdf_tract_hotel_dist_lyr = pd.DataFrame.spatial.from_layer(tract_hotel_dist_lyr)
sdf_tract_hotel_dist_lyr.head()
In the above dataframe, the Total_Kilometers field holds the distance from each tract to its nearest hotel in kilometers. This field is converted into feet (1 km = 3,280.84 ft) and retained. The same procedure is repeated for each of the other distance estimations; a reusable helper for this pattern is sketched after this cell.
# Final hotel distances in feet — here, in each row, the column "hotel_dist" returns the distance of the nearest hotel from the tract indicated by its geoid.
# For example, in the first row the tract with ID 36005000100 has its nearest hotel 5571.75 feet away.
sdf_tract_hotel_dist_lyr_new = sdf_tract_hotel_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_hotel_dist_lyr_new['hotel_dist'] = round(sdf_tract_hotel_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_hotel_dist_lyr_new.sort_values('From_geoid').head()
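The same three steps (run find_nearest, read back layer 1, convert Total_Kilometers to feet) are repeated below for bus stops, the CBD, subway stations, subway routes, railroads, and business districts. A helper like the following sketch could factor out that repetition; it is a hypothetical function that assumes the same find_nearest behaviour shown above, where layers[1] of the result carries From_geoid and Total_Kilometers.
# Hypothetical helper factoring out the repeated find_nearest pattern
KM_TO_FEET = 3280.84

def tract_feature_distance(feature_lyr, col_name, out_prefix):
    result = use_proximity.find_nearest(nyc_tracts_layer,
                                        feature_lyr,
                                        measurement_type='StraightLine',
                                        max_count=1,
                                        output_name=out_prefix + str(dt.now().microsecond))
    sdf = result.layers[1].query().sdf
    out = sdf[['From_geoid', 'Total_Kilometers']].copy()
    out[col_name] = round(out['Total_Kilometers'] * KM_TO_FEET, 2)
    return out

# Example usage, equivalent to the bus stop cells below:
# sdf_tract_bustop_dist_lyr_new = tract_feature_distance(bus_stop_lyr, 'busstop_dist', 'ny_tract_bus_stop_dist')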
# Estimating Busstop distances from tracts
tract_bustop_dist = use_proximity.find_nearest(nyc_tracts_layer,
bus_stop_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='ny_tract_bus_stop_dist'+ str(dt.now().microsecond))
tract_bustop_dist_lyr = tract_bustop_dist.layers[1]
sdf_tract_bustop_dist_lyr =tract_bustop_dist_lyr.query().sdf
# Final bus stop distances in feet — here, in each row, the column "busstop_dist" returns the distance of the nearest bus stop
# from the tract indicated by its geoid
sdf_tract_bustop_dist_lyr_new = sdf_tract_bustop_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_bustop_dist_lyr_new['busstop_dist'] = round(sdf_tract_bustop_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_bustop_dist_lyr_new.sort_values('From_geoid').head()
# estimating number of bus stops per tract
num_bustops_tracts = summarize_data.aggregate_points(point_layer=bus_stop_lyr,
polygon_layer=nyc_tracts_layer,
output_name='bustops_by_tracts'+ str(dt.now().microsecond))
num_bustops_tracts_lyr = num_bustops_tracts.layers[0]
sdf_num_bustops_tracts_lyr = pd.DataFrame.spatial.from_layer(num_bustops_tracts_lyr)
sdf_num_bustops_tracts_lyr.head()
# Number of bus stops per tract — here, in each row, the column "num_bustop" returns the number of bus stops inside the respective tract
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr[['geoid', 'Point_Count']]
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr_new.rename(columns={'Point_Count':'num_bustop'})
sdf_num_bustops_tracts_lyr_new.sort_values('geoid').head()
# estimating tract distances from the CBD
tract_cbd_dist=use_proximity.find_nearest(nyc_tracts_layer,
cbd_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='ny_tract_cbd_dist'+ str(dt.now().microsecond))
tract_cbd_dist_lyr = tract_cbd_dist.layers[1]
sdf_tract_cbd_dist_lyr = tract_cbd_dist_lyr.query().sdf
sdf_tract_cbd_dist_lyr.head()
# Final CBD distances in feet — here, in each row, the column "cbd_dist" returns the distance of the CBD from the respective tract
sdf_tract_cbd_dist_lyr_new = sdf_tract_cbd_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_cbd_dist_lyr_new['cbd_dist'] = round(sdf_tract_cbd_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_cbd_dist_lyr_new.sort_values('From_geoid').head()
# Estimating NYCSubwayStation distances from tracts
tract_subwy_stn_dist = use_proximity.find_nearest(nyc_tracts_layer,
subwy_stn_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='ny_tract_subway_station_dist'+ str(dt.now().microsecond))
tract_subwy_stn_dist_lyr = tract_subwy_stn_dist.layers[1]
sdf_tract_subwy_stn_dist_lyr = pd.DataFrame.spatial.from_layer(tract_subwy_stn_dist_lyr)
sdf_tract_subwy_stn_dist_lyr.head()
# Final Tract to NYC Subway Station distances in feet — Here in each row, column "subwy_stn_dist" returns the distance of
# the nearest subway station from that tract
sdf_tract_subwy_stn_dist_lyr_new = sdf_tract_subwy_stn_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_subwy_stn_dist_lyr_new['subwy_stn_dist'] = round(sdf_tract_subwy_stn_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_subwy_stn_dist_lyr_new.sort_values('From_geoid').head()
# Estimating distances to NYCSubwayRoutes
tract_subwy_rt_dist=use_proximity.find_nearest(nyc_tracts_layer,
subwy_rt_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='ny_tract_subway_routes_dist'+ str(dt.now().microsecond))
tract_subwy_rt_dist_lyr = tract_subwy_rt_dist.layers[1]
sdf_tract_subwy_rt_dist_lyr = tract_subwy_rt_dist_lyr.query().sdf
sdf_tract_subwy_rt_dist_lyr.head()
# Final Tract to NYCSubwayRoutes distances in feet — Here in each row, column "subwy_rt_dist" returns the distance of
# the nearest subway route from that tract
sdf_tract_subwy_rt_dist_lyr_new = sdf_tract_subwy_rt_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_subwy_rt_dist_lyr_new['subwy_rt_dist'] = round(sdf_tract_subwy_rt_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_subwy_rt_dist_lyr_new.sort_values('From_geoid').head()
# Estimating distances to NYCRailroad
tract_railroad_dist = use_proximity.find_nearest(nyc_tracts_layer,
railroad_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='tract_railroad_dist'+ str(dt.now().microsecond))
tract_railroad_dist_lyr = tract_railroad_dist.layers[1]
sdf_tract_railroad_dist_lyr = pd.DataFrame.spatial.from_layer(tract_railroad_dist_lyr)
sdf_tract_railroad_dist_lyr.head()
# Final Tract to NYCRailroad distances in feet — Here in each row, column "railroad_dist" returns the distance of
# the nearest railroad line from that tract
sdf_tract_railroad_dist_lyr_new = sdf_tract_railroad_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_railroad_dist_lyr_new['railroad_dist'] = round(sdf_tract_railroad_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_railroad_dist_lyr_new.sort_values('From_geoid').head()
# Estimating distances to NYC Business Districts
tract_busi_distrs_dist = use_proximity.find_nearest(nyc_tracts_layer,
busi_distrs_lyr,
measurement_type='StraightLine',
max_count=1,
output_name='tract_busi_distrs_dist'+ str(dt.now().microsecond))
tract_busi_distrs_dist_lyr = tract_busi_distrs_dist.layers[1]
sdf_tract_busi_distrs_dist_lyr = pd.DataFrame.spatial.from_layer(tract_busi_distrs_dist_lyr)
sdf_tract_busi_distrs_dist_lyr.head()
# Final tract to NYC Business Districts distances in feet — here, in each row, the column "busi_distr_dist" returns the distance of the nearest business district from the respective tract
sdf_tract_busi_distrs_dist_lyr_new = sdf_tract_busi_distrs_dist_lyr[['From_geoid', 'Total_Kilometers']]
sdf_tract_busi_distrs_dist_lyr_new['busi_distr_dist'] = round(sdf_tract_busi_distrs_dist_lyr_new['Total_Kilometers'] * 3280.84, 2)
sdf_tract_busi_distrs_dist_lyr_new.sort_values('From_geoid').head()
Importing Borough Info for each Tract ¶
# Name of the borough, inside which the tracts are located
ny_tract_boro = gis.content.search('NYCTractBorough owner:api_data_owner', 'feature layer')[0]
ny_tract_boro_lyr = ny_tract_boro.layers[0]
sdf_ny_tract_boro_lyr = pd.DataFrame.spatial.from_layer(ny_tract_boro_lyr)
sdf_ny_tract_boro_lyr_new = sdf_ny_tract_boro_lyr[['geoid', 'boro_name']]
sdf_ny_tract_boro_lyr_new.sort_values('geoid').head()
Merging all the estimated feature datasets ¶
tract_merge_dist = sdf_tract_hotel_dist_lyr_new.merge(sdf_tract_subwy_rt_dist_lyr_new,
on='From_geoid').merge(sdf_tract_railroad_dist_lyr_new,
on='From_geoid').merge(sdf_tract_subwy_stn_dist_lyr_new,
on='From_geoid').merge(sdf_tract_busi_distrs_dist_lyr_new,
on='From_geoid').merge(sdf_tract_cbd_dist_lyr_new, on='From_geoid')
tract_merge_dist_new = tract_merge_dist[['From_geoid',
'hotel_dist',
'subwy_rt_dist',
'railroad_dist',
'subwy_stn_dist',
'busi_distr_dist',
'cbd_dist']]
tract_merge_dist_new = tract_merge_dist_new.rename(columns={'From_geoid':'geoid'})
tract_merge_dist_new.sort_values('geoid').head()
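As an aside, the chained merge above can be written more compactly with functools.reduce; the sketch below produces the same columns as tract_merge_dist_new before the geoid rename, assuming each frame shares the From_geoid key.
# Equivalent merge using functools.reduce
from functools import reduce

# keep only the key and the computed distance column from each frame,
# avoiding duplicate 'Total_Kilometers' columns in the result
dist_frames = [df.drop(columns='Total_Kilometers')
               for df in (sdf_tract_hotel_dist_lyr_new,
                          sdf_tract_subwy_rt_dist_lyr_new,
                          sdf_tract_railroad_dist_lyr_new,
                          sdf_tract_subwy_stn_dist_lyr_new,
                          sdf_tract_busi_distrs_dist_lyr_new,
                          sdf_tract_cbd_dist_lyr_new)]
tract_merge_dist_alt = reduce(lambda left, right: left.merge(right, on='From_geoid'),
                              dist_frames)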
# merging number of bus stops and borough name
tract_merge_dist_new = tract_merge_dist_new.merge(sdf_num_bustops_tracts_lyr_new,
on='geoid').merge(sdf_ny_tract_boro_lyr_new,
on='geoid')
tract_merge_dist_new = tract_merge_dist_new.sort_values('geoid')
tract_merge_dist_new.head()
# Accessing the airbnb count for each tract
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract[['geoid','Point_Count']]
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract_new.rename(columns={'Point_Count':'total_airbnb'})
sdf_airbnb_count_by_tract_new.head()
# preparing the final distance table with airbnb count by tract
tract_merge_dist_all = sdf_airbnb_count_by_tract_new.merge(tract_merge_dist_new, on='geoid')
tract_merge_dist_all.head()
tract_merge_dist_all.info()
The borough column, being an important location indicator, is converted into numerical (dummy) variables and included in the feature data.
tract_merge_dist_final = pd.get_dummies(tract_merge_dist_all, columns=['boro_name'])
tract_merge_dist_final.head()
Adding the 2019 census data obtained using geoenrichment ¶
The distance dataset above is now merged with the census data to form the final feature set for the model.
sdf_airbnb_count_by_tract_enriched_sorted_new = sdf_airbnb_count_by_tract_enriched_sorted.drop(['AnalysisArea',
'ENRICH_FID',
'HasData',
'ID',
'OBJECTID',
'Point_Count',
'SHAPE',
'aggregationMethod',
'aland',
'apportionmentConfidence',
'awater',
'countyfp',
'funcstat',
'intptlat',
'intptlon',
'mtfcc',
'name',
'namelsad',
'populationToPolygonSizeRating',
'sourceCountry',
'statefp','tractce'], axis=1)
sdf_airbnb_count_by_tract_enriched_sorted_new.shape
# checking the rows of the table for nan values
row_with_null = sdf_airbnb_count_by_tract_enriched_sorted_new.isnull().any(axis=1)
# printing the rows which have nan values
sdf_airbnb_count_by_tract_enriched_sorted_new[row_with_null]
# checking total number of nan values
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_new.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()
These two tracts are actually water areas within NYC; hence they have NaN values, which are filled with zeros.
sdf_airbnb_count_by_tract_enriched_sorted_fill = sdf_airbnb_count_by_tract_enriched_sorted_new.fillna(0)
#nan rechecked
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_fill.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()
Merging the distance data with the enriched data
final_df = pd.merge(tract_merge_dist_final,
sdf_airbnb_count_by_tract_enriched_sorted_fill,
left_on = 'geoid',
right_on = 'geoid',
how = 'left')
print(final_df.shape)
final_df.head()
# rechecking nan values of the final dataframe
final_nan_test = final_df.drop('geoid', axis=1)
np.isnan(final_nan_test).sum().sum()
Model Building ¶
The goal here is to find the factors contributing to the development of new Airbnb properties in New York City. A model is therefore fitted to predict the number of Airbnb properties per tract from a feature set composed of the distance and demographic characteristics of each tract. Once a good fit is obtained, the most important predictors of the model are estimated, which is the main objective.
# Creating feature data
X = final_df.drop(['geoid','total_airbnb'], axis=1)
# Creating target data -- the number of Airbnb properties per tract
y = pd.DataFrame(final_df['total_airbnb'])
Split the dataframe into train and test sets in a 90%/10% ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 20)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
# Converting the target into 1d array
y_train_array = y_train.values.flatten()
y_test_array = y_test.values.flatten()
print(y_train_array.shape)
print(y_test_array.shape)
As a best practice, since scaled data often improves model fitting, the features are normalized using a robust scaler.
scaler = preprocessing.RobustScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
# the test set is transformed with the scaler fitted on the training data, avoiding data leakage
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
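The robust scaler centres each feature on its median and scales by the interquartile range, which limits the influence of outlier tracts. A quick sanity check of that formula follows (a sketch; cbd_dist stands in for any numeric feature column):
# RobustScaler transforms each column as (x - median) / IQR, where the
# median and IQR (75th minus 25th percentile) are computed on the training data
col = 'cbd_dist'
median = X_train[col].median()
iqr = X_train[col].quantile(0.75) - X_train[col].quantile(0.25)
manual = (X_train[col] - median) / iqr
# True if the manual computation matches the scaler output (up to floating point error)
print(np.allclose(manual.values, X_train_scaled[col].values))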
RandomForest Regressor Model ¶
The modelling was first attempted with a linear regression; however, the linear model failed to fit the data well, so a non-linear algorithm is used instead. The baseline sketch below can be used to reproduce the comparison and observe the improvement of Random Forest over linear regression.
The accuracy metrics used are the mean absolute error and r-square.
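The linear baseline itself is not included in the original cells; a minimal sketch for reproducing the comparison, using the LinearRegression import loaded earlier, is:
# Baseline linear regression for comparison (a minimal sketch)
model_LR = LinearRegression()
model_LR.fit(X_train_scaled, y_train_array)
print('r-square_LR_train:', round(model_LR.score(X_train_scaled, y_train_array), 2))
print('r-square_LR_test: ', round(model_LR.score(X_test_scaled, y_test_array), 2))
print('MAE_LR_test: ', round(metrics.mean_absolute_error(y_test_array,
                                                         model_LR.predict(X_test_scaled)), 4))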
# Random forest with scaled data
# for the best parameters a grid search could be done, which could take some time;
# this model instead uses the default parameters of the RF algorithm, with the number of estimators tuned until the best fit is obtained
model_RF = RandomForestRegressor(n_estimators = 500, random_state=43)
# Train the model
model_RF.fit(X_train_scaled, y_train_array)
# Training metrics for Random forest model
print('Training metrics for Random forest model using scaled data')
ypred_RF_train = model_RF.predict(X_train_scaled)
print('r-square_RF_Train: ', round(model_RF.score(X_train_scaled, y_train_array), 2))
mse_RF_train = metrics.mean_squared_error(y_train_array, ypred_RF_train)
print('RMSE_RF_train: ', round(np.sqrt(mse_RF_train),4))
mean_absolute_error_RF_train = metrics.mean_absolute_error(y_train_array, ypred_RF_train)
print('MAE_RF_train: ', round(mean_absolute_error_RF_train, 4))
# Test metrics for Random Forest model
print('\nTest metrics for Random Forest model using scaled data')
ypred_RF_test = model_RF.predict(X_test_scaled)
print('r-square_RF_test: ', round(model_RF.score(X_test_scaled, y_test_array), 2))
mse_RF_test = metrics.mean_squared_error(y_test_array, ypred_RF_test)
print('RMSE_RF_test: ', round(np.sqrt(mse_RF_test), 4))
mean_absolute_error_RF_test = metrics.mean_absolute_error(y_test_array, ypred_RF_test)
print('MAE_RF_test: ', round(mean_absolute_error_RF_test, 4))
The results show that the model returns a test r-square of 0.85 with a mean absolute error of 9.28.
Feature importance for the RF model ¶
feature_imp_RF = model_RF.feature_importances_
#relative feature importance
rel_feature_imp = 100 * (feature_imp_RF / max(feature_imp_RF))
rel_feature_imp = pd.DataFrame({'features':list(X_train.columns),
'rel_importance':rel_feature_imp })
rel_feature_imp = rel_feature_imp.sort_values('rel_importance', ascending=False)
#plotting the top twenty important features
top20_features = rel_feature_imp.head(20)
plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features",
data=top20_features,
palette="Accent_r")
plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()
rel_feature_imp.head()
The feature importance plot reveals that distance from the city centre (cbd_dist) is the most important predictor of the number of Airbnb listings in NYC. This is expected: since hotel rates near the CBD are quite high, rental income from Airbnb properties there would be high as well, making an Airbnb property a more lucrative option than long-term rental income in areas near the CBD.
This is followed by the millennial population, i.e. the tracts with the most residents aged 25 to 40. One reason might be that this group is more active online and more comfortable with internet technologies, which is in a way a necessary prerequisite for setting up an Airbnb property. This is supported by the presence of another interesting predictor in the top twenty: the index of spending 0.5-0.9 hrs online daily.
The third most important predictor is the number of workers who commute by bicycle, followed by the Generation Alpha population (persons born after 2011), then by tracts with residents commuting by subway, and so on. The median home value of the tracts is also an interesting predictor.
Gradient Boosting Regressor Model ¶
Trials show that the gradient boosting model performs better with unscaled data.
# GradientBoosting with unscaled data
# this model uses the default parameters of the GB algorithm, with the number of estimators tuned to obtain the best fit
model_GB_nonscale = GradientBoostingRegressor(n_estimators=500, random_state=60)
# Train the model
model_GB_nonscale.fit(X_train, y_train_array)
# Training metrics for Gradient Boosting Regressor model
print('Training metrics for Gradient Boosting Regressor model using unscaled data')
ypred_GB_train = model_GB_nonscale.predict(X_train)
print('r-square_GB_Train: ', round(model_GB_nonscale.score(X_train, y_train_array), 2))
mse_GB_train = metrics.mean_squared_error(y_train_array, ypred_GB_train)
print('RMSE_GB_Train: ', round(np.sqrt(mse_GB_train), 4))
mean_absolute_error_GB_train = metrics.mean_absolute_error(y_train_array, ypred_GB_train)
print('MAE_GB_Train: ', round(mean_absolute_error_GB_train, 4))
#Test metrics for Gradient Boosting Regressor model
print('\nTest metrics for Gradient Boosting Regressor model using unscaled data')
ypred_GB_test = model_GB_nonscale.predict(X_test)
print('r-square_GB_Test: ', round(model_GB_nonscale.score(X_test, y_test_array),2))
mse_GB_test = metrics.mean_squared_error(y_test_array, ypred_GB_test)
print('RMSE_GB_Test: ', round(np.sqrt(mse_GB_test), 4))
mean_absolute_error_GB_Test = metrics.mean_absolute_error(y_test_array, ypred_GB_test)
print('MAE_GB_Test: ', round(mean_absolute_error_GB_Test, 4))
The results show that the gradient boosting regressor performs slightly better than the random forest model, in terms of both mean absolute error and r-square.
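To compare the two models directly, the test metrics can be recomputed side by side from the predictions already in memory (a short sketch):
# Side-by-side test metrics for the two fitted models
comparison = pd.DataFrame({
    'r2': [metrics.r2_score(y_test_array, ypred_RF_test),
           metrics.r2_score(y_test_array, ypred_GB_test)],
    'MAE': [metrics.mean_absolute_error(y_test_array, ypred_RF_test),
            metrics.mean_absolute_error(y_test_array, ypred_GB_test)],
    'RMSE': [np.sqrt(metrics.mean_squared_error(y_test_array, ypred_RF_test)),
             np.sqrt(metrics.mean_squared_error(y_test_array, ypred_GB_test))]
}, index=['RandomForest', 'GradientBoosting'])
print(comparison.round(4))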
Feature Importance of Gradient Boosting Model ¶
#checking the feature importance for the Gradient Boosting regressor
feature_imp_GB = model_GB_nonscale.feature_importances_
rel_feature_imp_GB = 100 * feature_imp_GB / max(feature_imp_GB)
rel_feature_imp_GB = pd.DataFrame({'features':list(X_train.columns),
'rel_importance':rel_feature_imp_GB})
rel_feature_imp_GB = rel_feature_imp_GB.sort_values('rel_importance', ascending=False)
rel_feature_imp_GB.head()
# Plot feature importance for the Gradient Boosting regressor
top20_features_GB = rel_feature_imp_GB.head(20)
plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features", data = top20_features_GB, palette="Accent_r")
plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()
The feature importances from the gradient boosting model are almost identical to those returned by the random forest model, which is expected.
Running cross validation ¶
The model above was fitted, and its accuracy measured, on one particular train-test split of the data; how the accuracy holds up across multiple splits remains to be seen. This is checked with k-fold cross validation, which partitions the data into k different train-test splits and fits the model on each. A 10-fold cross validation is therefore run, with overall model accuracy measured as the mean absolute error of the model fit across the 10 splits.
# Validating with a 10 fold cross validation for the Gradient Boosting models
y_array = y.values.flatten()
modelGB_cross_val = GradientBoostingRegressor(n_estimators=500, random_state=60)
modelGB_cross_val_scores = cross_val_score(modelGB_cross_val,
X,
y_array,
cv=10,
scoring='neg_mean_absolute_error')
print("All Model Scores: ", modelGB_cross_val_scores)
print("Negative Mean Absolute Error: {}".format(np.mean(modelGB_cross_val_scores)))
# Validating with a 10 fold cross validation for the Random forest models
y_array = y.values.flatten()
modelRF_cross_val = RandomForestRegressor(n_estimators=500, random_state=43)
modelRF_cross_val_scores = cross_val_score(modelRF_cross_val,
X,
y_array,
cv=10,
scoring='neg_mean_absolute_error')
print("All Model Scores: ", modelRF_cross_val_scores)
print("Negative Mean Absolute Error: {}".format(np.mean(modelRF_cross_val_scores)))
Final Result Visualization ¶
# Plotting a kernel density map of the predicted vs. observed data
plt.figure(figsize=[15,5])
# plotting the prediction
sns.kdeplot(ypred_RF_test, label = 'Predictions', color = 'orange')
y_observed = np.array(y_test).reshape((-1, ))
sns.kdeplot(y_observed, label = 'Observation', color = 'green')
# label the plot
plt.xlabel('No. of Airbnb listings per census tract', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.title('Density Plot: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,500,25), fontsize=10)
plt.yticks(fontsize=10)
plt.legend(fontsize=15)
plt.show()
# Converting the predicted and observed values to dataframe and plotting the observed vs predicted
y_test_df = y_test.copy()
y_test_df['Predicted'] = (ypred_RF_test)
y_test_df.head()
# plotting the actual observed vs predicted airbnb properties by tract
plt.figure(figsize = [25,12])
sns.set(style = 'whitegrid')
sns.lineplot(data = y_test_df, markers=True)
#label the plot
plt.xlabel('Tract ID', fontsize=15)
plt.ylabel('Total No. of Airbnb', fontsize=15)
plt.title('Actual No. of Airbnb by Tract: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,2000,100), fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
The plot shows that the predicted values closely match the observed values. However, there are instances of under-prediction for tracts with an extremely high number of Airbnb properties, and of over-prediction for some tracts with comparatively few.
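An alternative view of the same pattern is a predicted-versus-observed scatter with a 45-degree reference line (a sketch using the test predictions above):
# Predicted vs observed scatter; points below the dashed line are
# under-predicted tracts, points above it are over-predicted ones
plt.figure(figsize=[8, 8])
plt.scatter(y_observed, ypred_RF_test, alpha=0.5, color='teal')
lims = [0, max(y_observed.max(), ypred_RF_test.max())]
plt.plot(lims, lims, 'k--', label='Perfect prediction')
plt.xlabel('Observed Airbnb count per tract', fontsize=12)
plt.ylabel('Predicted Airbnb count per tract', fontsize=12)
plt.legend(fontsize=12)
plt.show()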
Conclusion ¶
The study shows that distance from the CBD is the single most important location factor stimulating the creation of Airbnb properties.
The proximity tools of the ArcGIS API for Python were used for all the distance estimations. The other factors returned by the feature importance results could each be examined individually. Another Esri capability utilized in this study is Esri's data repository, accessed here via the geoenrichment services; data enrichment provides the analyst with a wide array of data for critical analysis. Further analysis of this dataset will be carried out in a follow-up study.
Summary of methods used ¶
Method | Question | Example
---|---|---
aggregate_points | How many points lie within each polygon? | Counting the number of Airbnb rentals within each NYC tract
Data Enrichment | Which demographic attributes are relevant for the problem? | Millennial population of each tract
find_nearest | Which distances from city features are relevant for the problem? | Distance of the CBD from each tract