Analysing the factors of growth and spatial distribution of Airbnb properties across New York City

Introduction

Airbnb gives travellers a comparatively cheap alternative to conventional accommodation, and it gives homeowners a way to earn additional income from spare or unused rooms. In recent times, however, the rapid spread of Airbnb properties has become a topic of debate among the public and city authorities across the world.

With this in mind, this sample notebook studies the factors that are fuelling the widespread growth in the number of Airbnb listings. These might include the location characteristics of the neighbourhoods concerned (in this case, NYC census tracts) as well as qualitative information about the people living in them. The goal is to help city planners deal with the negative externalities of the Airbnb phenomenon (and similar short-term rentals) by making informed decisions when framing suitable policies.

The primary data is downloaded from the Airbnb website for the city of New York. Other data includes 2019 and 2017 census data obtained through Esri's enrichment services, and various other datasets from the NYCOpenData portal.

Necessary Imports

%matplotlib inline
import matplotlib.pyplot as plt


from datetime import datetime as dt
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from IPython.core.pylabtools import figsize
import seaborn as sns


# Machine Learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.metrics as metrics
from sklearn import preprocessing

# Arcgis api imports
import arcgis
from arcgis.geoenrichment import Country
from arcgis.features import summarize_data
from arcgis.features.enrich_data import enrich_layer
from arcgis.features import SpatialDataFrame
from arcgis.features import use_proximity
from arcgis.gis import GIS

# Connect to the active ArcGIS session
gis = GIS('home')

Access the NYC Airbnb and Tracts dataset

Airbnb Data - Contains information about 48,000 Airbnb properties available in New York as of 2019. This includes the location of each property, its neighbourhood characteristics and available transit facilities, information about the owner, details of the room including the number of bedrooms, and the rental price per night.

NYC Tracts - A polygon shapefile consisting of 2167 tracts of New York City, including the area of each tract along with a unique id for each tract.

# Accessing NYCTracts
nyc_tract_full = gis.content.search('NYCTractData owner:api_data_owner', 'feature layer')[0]
nyc_tract_full
NYCTractData
Feature Layer Collection by api_data_owner
Last Modified: August 14, 2019
0 comments, 2 views
nyc_tracts_layer = nyc_tract_full.layers[0]
# Accessing airbnb NYC
airbnb_nyc2019 = gis.content.search('AnBNYC2019 owner:api_data_owner', 'feature layer')[0]
airbnb_nyc2019
AnBNYC2019
Feature Layer Collection by api_data_owner
Last Modified: September 30, 2019
0 comments, 5 views
airbnb_layer = airbnb_nyc2019.layers[0]
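
Both lookups above take the first search hit with [0], which fails with a bare IndexError if the search returns nothing. A minimal sketch of a more descriptive guard is given here, assuming the search result behaves like a list. The helper name first_item is hypothetical, and plain lists stand in for real search results so the sketch runs without a portal connection.

```python
def first_item(results, query):
    """Return the first search hit, or raise a descriptive error if there is none."""
    if not results:
        raise LookupError(f"No items found for query: {query!r}")
    return results[0]


# Stand-in lists replace real search results so the sketch runs offline.
print(first_item(["NYCTractData"], "NYCTractData owner:api_data_owner"))
try:
    first_item([], "missing owner:nobody")
except LookupError as err:
    print(err)
```

In the notebook this would wrap the existing call, for example first_item(gis.content.search(query, 'feature layer'), query).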

Visualizing dataset

# NYC Tracts
m1 = gis.map('New York City')
m1.add_layer(nyc_tracts_layer)
m1
# NYC Airbnb Properties
m = gis.map('Springfield Gardens, NY')
m.add_layer(airbnb_layer)
m
# Extracting the layer into a dataframe and visualizing it as a pandas dataframe
pd.set_option('display.max_columns', 110)
sdf_airbnb_layer = pd.DataFrame.spatial.from_layer(airbnb_layer)
sdf_airbnb_layer.head(2)
FIDSHAPEaccessaccommodatairbnbamenitiesavailabi_1availabi_2availabi_3availabilibathroomsbed_typebedroomsbedscalculat_1calculat_2calculat_3calculatedcalendar_lcalendar_ucancellaticitycleaning_fcountrycountry_codescriptioexperienceextra_peopfirst_reviguests_inchas_availahost_accephost_has_phost_identhost_is_suhost_listihost_locathost_neighhost_res_1host_respohost_sincehost_totalhouse_ruleidinstant_bointeractiois_businesis_locatiojurisdictilast_revielast_scraplatitudelicenselongitudemarketmaximum__1maximum_mamaximum_mimaximum_niminimum__1minimum_maminimum_miminimum_nimonthly_prnameneighborhoneighbou_1neighbou_2neighbourhnotesnumber_of1number_of_priceproperty_trequire__1require_gurequires_lreview_s_1review_s_2review_s_3review_s_4review_s_5review_s_6review_scoreviews_peroom_typescrape_idsecurity_dsmart_locaspacesquare_feestatestreetsummarytransitweekly_prizipcode
01{"x": -8235507.210868829, "y": 4964733.1453062...41{TV,"Cable TV",Internet,Wifi,"Air conditioning...00002.0Real Bed2211022019-06-034 weeks agomoderateBrooklyn$60.00United StatesUSImagine a quiet, spacious apartment, with beau...none$0.002011-05-292tN/Attt2New York, New York, United StatesPark Slope100%within a day2011-05-222No pets, no smoking. The $25/night for each gu...121861fft2016-05-022019-06-0340.67644-73.98082New York730.073027302.073022Park Slope Apt:, Spacious 2 bedroomPark SlopeBrooklynBrooklyn023165Apartmentfff101010101010990.24Entire home/apt20200000000000$250.00Brooklyn, NYImagine a quiet, spacious apartment, with beau...1500NYBrooklyn, NY, United States$1,050.0011215
12{"x": -8231847.026011546, "y": 4983593.6741002...Everything in the studio is for their use.31{TV,"Cable TV",Wifi,"Air conditioning","Paid p...284329681.0Real Bed0211022019-06-037 months agostrict_14_with_grace_periodNew York$60.00United StatesUSComfortable, spacious studio in one of the mos...none$25.002011-05-301tN/Atff2New York, New York, United StatesHarlem100%within an hour2011-05-232no loud music no pets no children $300 dollar...123784tAs much as the guest would like.ft2019-05-172019-06-0340.80481-73.94794New York365.036523652.036522$3,200.00NYC Studio for Rent in TownhouseThe new restaurants, stores and cafes. Everyth...HarlemManhattanHarlem45.00 dollar fee for air-conditioner in the su...42138110Apartmentttf1091010109941.41Entire home/apt20200000000000$500.00New York, NYThis is a large studio room with a private bat...0NYNew York, NY, United StatesComfortable, spacious studio in one of the mos...$735.0010027

Aggregating number of Airbnb properties by Tracts for NYC

The number of Airbnb properties per tract is estimated using the polygon tract layer and the Airbnb point layer.

The Aggregate Points tool uses area features to summarize a set of point features. The boundaries from the area feature are used to collect the points within each area and use them to calculate statistics. The resulting layer displays the count of points within each area. Here, the polygon tract layer is used as the area feature, and the Airbnb point layer is used as the point feature.
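
The counting idea behind this step can be sketched in plain Python. This is only a conceptual illustration using a ray-casting point-in-polygon test, not how the ArcGIS tool is implemented. The two square "tracts" and the four listing locations are made up, and the function names are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points(polygons, points):
    """Count the points falling inside each named polygon."""
    counts = {name: 0 for name in polygons}
    for x, y in points:
        for name, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # assumes polygons do not overlap
    return counts


# Two made-up square "tracts" and four made-up listing locations.
tracts = {
    "tract_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "tract_b": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
listings = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 5.0)]
point_count = count_points(tracts, listings)
print(point_count)
```

The last listing lies outside both squares and is not counted. Every polygon still appears in the result, with a count of zero if it contains no points.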

agg_result = summarize_data.aggregate_points(point_layer=airbnb_layer,
                                             polygon_layer=nyc_tracts_layer,
                                             output_name='airbnb_counts'+ str(dt.now().microsecond))
agg_result
airbnb_counts748780
Feature Layer Collection by arcgis_python
Last Modified: October 01, 2019
0 comments, 0 views
# mapping the aggregated airbnb data with darker areas showing more airbnb properties per tract
aggr_map = gis.map('NY', zoomlevel=10)
aggr_map.add_layer(agg_result,{"renderer":"ClassedColorRenderer", "field_name": "Point_Count"})
aggr_map
airbnb_count_by_tract = agg_result.layers[0]
sdf_airbnb_count_by_tract = airbnb_count_by_tract.query().sdf
sdf_airbnb_count_by_tract = sdf_airbnb_count_by_tract.sort_values('geoid')
sdf_airbnb_count_by_tract.head()
      AnalysisArea  OBJECTID  Point_Count  SHAPE                                               Shape__Area   Shape__Length  aland    awater   countyfp  funcstat  geoid        intptlat     intptlon      mtfcc  name  namelsad         statefp  tractce
2095  1.044430      2096      0            {"rings": [[[-8226256.9418, 4982172.581], [-82...  4.724201e+06  9395.023908    1579361  1125765  005       S         36005000100  +40.7934921  -073.8835318  G5020  1     Census Tract 1   36       000100
2059  0.533681      2060      0            {"rings": [[[-8222638.612, 4985024.3226], [-82...  2.414911e+06  8067.034661    455322   926899   005       S         36005000200  +40.8045733  -073.8568585  G5020  2     Census Tract 2   36       000200
2067  0.585075      2068      15           {"rings": [[[-8222012.885, 4985135.2266], [-82...  2.647647e+06  8312.955974    912392   602945   005       S         36005000400  +40.8089152  -073.8504884  G5020  4     Census Tract 4   36       000400
1668  0.187289      1669      1            {"rings": [[[-8222181.7567, 4986069.1354], [-8...  8.478126e+05  3898.163149    485079   0        005       S         36005001600  +40.8188478  -073.8580764  G5020  16    Census Tract 16  36       001600
2127  1.074650      2128      24           {"rings": [[[-8230028.8927, 4984061.5402], [-8...  4.862123e+06  11742.631147   1643654  1139660  005       S         36005001900  +40.8009990  -073.9093729  G5020  19    Census Tract 19  36       001900

Here the Point_Count field of the aggregated dataframe above gives the number of Airbnb properties per tract. This forms the target variable for this problem.
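
Before modelling, it can be useful to look at how the target is distributed, for example how many tracts have no listings at all. The sketch below is only illustrative: it uses the five Point_Count values visible in the preview above, so the numbers say nothing about the city as a whole.

```python
from statistics import mean, median

# Point_Count values of the five tracts shown in the preview above.
point_counts = [0, 0, 15, 1, 24]

zero_share = sum(1 for c in point_counts if c == 0) / len(point_counts)
print("mean:", mean(point_counts))
print("median:", median(point_counts))
print("share of tracts with no listings:", zero_share)
```

On the full data the same summary would be computed from the Point_Count column of sdf_airbnb_count_by_tract.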

Enriching tracts with demographic data using geoenrichment service from Esri

The feature data is now created using selected demographic information for each tract. This is done through Esri's geoenrichment services, which contain the latest census data. The entire data repository is first inspected, and the relevant variables are then chosen on the basis of a literature study. These selected variables are searched for and added to the feature set.

# Displaying the various data topics available for geoenrichment for USA in the Esri database
usa = Country.get('US')
type(usa)
usa_data = usa.data_collections
df_usa_data = pd.DataFrame(usa_data)
df_usa_data.head()
                  analysisVariable         alias                   fieldCategory                       vintage
dataCollectionID
1yearincrements   1yearincrements.AGE0_CY  2019 Population Age <1  2019 Age: 1 Year Increments (Esri)  2019
1yearincrements   1yearincrements.AGE1_CY  2019 Population Age 1   2019 Age: 1 Year Increments (Esri)  2019
1yearincrements   1yearincrements.AGE2_CY  2019 Population Age 2   2019 Age: 1 Year Increments (Esri)  2019
1yearincrements   1yearincrements.AGE3_CY  2019 Population Age 3   2019 Age: 1 Year Increments (Esri)  2019
1yearincrements   1yearincrements.AGE4_CY  2019 Population Age 4   2019 Age: 1 Year Increments (Esri)  2019

Next, all the data topics available in the geoenrichment services are listed.

# Filtering the unique topics under dataCollectionID
df_usa_data.reset_index(inplace=True)
list(df_usa_data.dataCollectionID.unique())
['1yearincrements',
 '5yearincrements',
 'ACS_Housing_Summary_rep',
 'ACS_Population_Summary_rep',
 'Age',
 'AgeDependency',
 'Age_50_Profile_rep',
 'Age_by_Sex_Profile_rep',
 'Age_by_Sex_by_Race_Profile_rep',
 'AtRisk',
 'AutomobilesAutomotiveProducts',
 'Automotive_Aftermarket_Expenditures_rep',
 'BabyProductsToysGames',
 'Business_Summary_rep',
 'CivicActivitiesPoliticalAffiliation',
 'ClothingShoesAccessories',
 'Community_Profile_rep',
 'DaytimePopulation',
 'Demographic_and_Income_Comparison_Profile_rep',
 'Demographic_and_Income_Profile_rep',
 'Disposable_Income_Profile_rep',
 'ElectronicsInternet',
 'Electronics_and_Internet_Market_Potential_rep',
 'Executive_Summary_rep',
 'Finances_Market_Potential_rep',
 'FinancialInsurance',
 'Financial_Expenditures_rep',
 'Generations',
 'Graphic_Profile_rep',
 'GroceryAlcoholicBeverages',
 'Health',
 'HealthPersonalCare',
 'HealthPersonalCareCEX',
 'Health_and_Beauty_Market_Potential_rep',
 'HistoricalHouseholds',
 'HistoricalHousing',
 'HistoricalPopulation',
 'HomeImprovementGardenLawn',
 'House_and_Home_Expenditures_rep',
 'HouseholdGoodsFurnitureAppliances',
 'Household_Budget_Expenditures_rep',
 'Household_Income_Profile_rep',
 'HouseholdsByIncome',
 'HousingHousehold',
 'Housing_Profile_rep',
 'Infrastructure',
 'KeyGlobalFacts',
 'KeyUSFacts',
 'LandCover',
 'LandscapeFacts',
 'LeisureActivitiesLifestyle',
 'LifeInsurancePensions',
 'Market_Profile_rep',
 'MediaMagazinesNewspapers',
 'MediaRadioOtherAudio',
 'MediaTVViewing',
 'Medical_Expenditures_rep',
 'Net_Worth_Profile_rep',
 'OwnerRenter',
 'PetsPetProducts',
 'Pets_and_Products_Market_Potential_rep',
 'PhonesYellowPages',
 'Policy',
 'PsychographicsAdvertising',
 'PublicLands',
 'RaceAndEthnicity',
 'Recreation_Expenditures_rep',
 'Restaurant_Market_Potential_rep',
 'Retail_Goods_and_Services_Expenditures_rep',
 'Retail_MarketPlace_Profile_rep',
 'Retail_Market_Potential_rep',
 'Soils',
 'SpendingTotal',
 'Sports_and_Leisure_Market_Potential_rep',
 'TapestryHouseholdsProjections',
 'TapestryNEW',
 'Tapestry_Segmentation_Area_Profile_rep',
 'TravelCEX',
 'WaterWetlands',
 'Wealth',
 '_2010_Census_Profile_rep',
 'agebyracebysex',
 'basicFactsForMobileApps',
 'businesses',
 'classofworker',
 'clothing',
 'commute',
 'crime',
 'disability',
 'disposableincome',
 'education',
 'educationalattainment',
 'employees',
 'entertainment',
 'financial',
 'food',
 'foodstampsSNAP',
 'gender',
 'groupquarters',
 'healthinsurancecoverage',
 'heatingfuel',
 'hispanicorigin',
 'homevalue',
 'householdincome',
 'households',
 'householdsbyageofhouseholder',
 'householdsbyraceofhouseholder',
 'householdsbysize',
 'householdtotals',
 'householdtype',
 'housingbyageofhouseholder',
 'housingbyraceofhouseholder',
 'housingbysize',
 'housingcosts',
 'housingunittotals',
 'incomebyage',
 'industry',
 'industrybynaicscode',
 'industrybysiccode',
 'language',
 'lifemodegroupsNEW',
 'maritalstatustotals',
 'miscellaneous',
 'networth',
 'occupation',
 'population',
 'populationtotals',
 'presenceofchildren',
 'raceandhispanicorigin',
 'restaurants',
 'retailmarketplace',
 'sales',
 'schoolenrollment',
 'shopping',
 'spendingFactsForMobileApps',
 'sports',
 'tapestryadultsNEW',
 'tapestryhouseholdsNEW',
 'transportation',
 'travelMPI',
 'unitsinstructure',
 'urbanizationgroupsNEW',
 'vacant',
 'vehiclesavailable',
 'veterans',
 'women',
 'yearbuilt',
 'yearmovedin']

Items can be searched by the alias field to find the related analysis variable name. Here, as an example, variables whose alias contains 'Nonprofit' are searched. From these results, the relevant 'Nonprofit' variables are then selected.

df_usa_data[df_usa_data['alias'].str.contains('Nonprofit')]                        
       dataCollectionID  analysisVariable          alias                                  fieldCategory                                      vintage
10587  classofworker     classofworker.ACSMPRIVNP  ACS Civ Emp Male 16+:Priv Nonprofit    2013-2017 Civilian Population 16+ by Class of ...  2013-2017
10588  classofworker     classofworker.MOEMPRIVNP  MOE Civ Emp Male 16+:Priv Nonprofit    2013-2017 Civilian Population 16+ by Class of ...  2013-2017
10595  classofworker     classofworker.RELMPRIVNP  REL Civ Emp Male 16+:Priv Nonprofit    2013-2017 Civilian Population 16+ by Class of ...  2013-2017
10620  classofworker     classofworker.ACSFPRIVNP  ACS Civ Emp Female 16+:Priv Nonprofit  2013-2017 Civilian Population 16+ by Class of ...  2013-2017
10621  classofworker     classofworker.MOEFPRIVNP  MOE Civ Emp Female 16+:Priv Nonprofit  2013-2017 Civilian Population 16+ by Class of ...  2013-2017
10622  classofworker     classofworker.RELFPRIVNP  REL Civ Emp Female 16+:Priv Nonprofit  2013-2017 Civilian Population 16+ by Class of ...  2013-2017
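
The search above is case-sensitive, so a keyword typed as 'nonprofit' would find nothing. A minimal sketch of a case-insensitive variant follows, written in plain Python over a few rows copied from the outputs above so that it runs without a geoenrichment connection. The helper name search_alias is hypothetical.

```python
def search_alias(records, keyword):
    """Return records whose 'alias' contains keyword, ignoring case."""
    needle = keyword.lower()
    return [r for r in records if needle in (r.get("alias") or "").lower()]


# Three rows copied from the outputs above, as plain dictionaries.
records = [
    {"analysisVariable": "classofworker.ACSMPRIVNP",
     "alias": "ACS Civ Emp Male 16+:Priv Nonprofit"},
    {"analysisVariable": "classofworker.ACSFPRIVNP",
     "alias": "ACS Civ Emp Female 16+:Priv Nonprofit"},
    {"analysisVariable": "1yearincrements.AGE0_CY",
     "alias": "2019 Population Age <1"},
]

matches = search_alias(records, "nonprofit")
for row in matches:
    print(row["analysisVariable"])
```

On the dataframe itself, the same effect should be obtainable with df_usa_data['alias'].str.contains('nonprofit', case=False, na=False).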

Adding data using enrichment - At this stage a literature study is undertaken to narrow down the various factors that might influence the opening of new Airbnb properties in NYC.

These factors are then identified in the USA geoenrichment database as shown above, and the variable names are compiled in a dictionary to be passed to the enrichment tool.

enrichment_variables = {'classofworker.ACSCIVEMP':      'Employed Population Age 16+',
 'classofworker.ACSMCIVEMP':                      'Employed Male Pop Age 16+',
 'classofworker.ACSMPRIVNP':                      'Male 16+Priv Nonprofit',
 'classofworker.ACSMEPRIVP':                      'Male 16+:Priv Profit Empl',
 'classofworker.ACSMSELFI':                       'Male 16+:Priv Profit Self Empl',
 'classofworker.ACSMSTGOV':                       'Male 16+:State Govt Wrkr',
 'classofworker.ACSMFEDGOV':                      'Male 16+:Fed Govt Wrkr',
 'classofworker.ACSMSELFNI':                      'Male 16+:Self-Emp Not Inc',
 'classofworker.ACSMUNPDFM':                      'Male 16+:Unpaid Family Wrkr',              
 'classofworker.ACSFCIVEMP':                      'Female Pop Age 16+',
 'classofworker.ACSFEPRIVP':                      'Female 16+:Priv Profit Empl',
 'classofworker.ACSFSELFI':                       'Female 16+:Priv Profit Self Empl',                      
 'classofworker.ACSFPRIVNP':                      'Female 16+:Priv Nonprofit',
 'classofworker.ACSFLOCGOV':                      'Female 16+:Local Govt Wrkr',
 'classofworker.ACSFSTGOV':                       'Female 16+:State Govt Wrkr',
 'classofworker.ACSFFEDGOV':                      'Female 16+:Fed Govt Wrkr',                      
 'classofworker.ACSFSELFNI':                      'Female 16+:Self-Emp Not Inc',                      
 'classofworker.ACSFUNPDFM':                      'Female 16+:Unpaid Family Wrkr',                      
 'gender.MEDAGE_CY':                              '2019 Median Age',
 'Generations.GENALPHACY':                        '2019 Generation Alpha Population',
 'Generations.GENZ_CY':                           '2019 Generation Z Population',
 'Generations.MILLENN_CY':                        '2019 Millennial Population',
 'Generations.GENX_CY':                           '2019 Generation X Population',
 'Generations.BABYBOOMCY':                        '2019 Baby Boomer Population',
 'Generations.OLDRGENSCY':                        '2019 Silent & Greatest Generations Population',
 'Generations.GENBASE_CY':                        '2019 Population by Generation Base',
 'populationtotals.POPDENS_CY':                   '2019 Population Density',
 'DaytimePopulation.DPOP_CY':                     '2019 Total Daytime Population',
 'raceandhispanicorigin.WHITE_CY':                '2019 White Population',
 'raceandhispanicorigin.BLACK_CY':                '2019 Black Population',
 'raceandhispanicorigin.AMERIND_CY':              '2019 American Indian Population',
 'raceandhispanicorigin.ASIAN_CY':                '2019 Asian Population',
 'raceandhispanicorigin.PACIFIC_CY':              '2019 Pacific Islander Population',
 'raceandhispanicorigin.OTHRACE_CY':              '2019 Other Race Population',
 'raceandhispanicorigin.DIVINDX_CY':              '2019 Diversity Index',
 'households.ACSHHBPOV':                          'HHs: Inc Below Poverty Level',
 'households.ACSHHAPOV':                          'HHs:Inc at/Above Poverty Level',
 'households.ACSFAMHH':                           'ACS Family Households',
 'businesses.S01_BUS':                            'Total Businesses (SIC)',
 'businesses.N05_BUS':                            'Construction Businesses (NAICS)',
 'businesses.N08_BUS':                            'Retail Trade Businesses (NAICS)',
 'businesses.N21_BUS':                            'Transportation/Warehouse Bus (NAICS)',
 'ElectronicsInternet.MP09147a_B':                'Own any tablet',
 'ElectronicsInternet.MP09148a_B':                'Own any e-reader',
 'ElectronicsInternet.MP19001a_B':                'Have access to Internet at home',                
 'ElectronicsInternet.MP19070a_I':                'Index: Spend 0.5-0.9 hrs online(excl email/IM .',               
 'ElectronicsInternet.MP19071a_B':                'Spend <0.5 hrs online (excl email/IM time) daily',
 'populationtotals.TOTPOP_CY':                    '2019 Total Population',              
 'gender.MALES_CY':                               '2019 Male Population',
 'gender.FEMALES_CY':                             '2019 Female Population',
 'industry.EMP_CY':                               '2019 Employed Civilian Pop 16+',
 'industry.UNEMP_CY':                             '2019 Unemployed Population 16+',                     
 'industry.UNEMPRT_CY':                           '2019 Unemployment Rate',
 'commute.ACSWORKERS':                            'ACS Workers Age 16+',
 'commute.ACSDRALONE':                            'ACS Workers 16+: Drove Alone',
 'commute.ACSCARPOOL':                            'ACS Workers 16+: Carpooled',
 'commute.ACSPUBTRAN':                            'ACS Workers 16+: Public Transportation',
 'commute.ACSBUS':                                'ACS Workers 16+: Bus',
 'commute.ACSSTRTCAR':                            'ACS Workers 16+: Streetcar',
 'commute.ACSSUBWAY':                             'ACS Workers 16+: Subway',
 'commute.ACSRAILRD':                             'ACS Workers 16+: Railroad',
 'commute.ACSFERRY':                              'ACS Workers 16+: Ferryboat',
 'commute.ACSTAXICAB':                            'ACS Workers 16+: Taxicab',           
 'commute.ACSMCYCLE':                             'ACS Workers 16+: Motorcycle',
 'commute.ACSBICYCLE':                            'ACS Workers 16+: Bicycle',                             
 'commute.ACSWALKED':                             'ACS Workers 16+: Walked',
 'commute.ACSOTHTRAN':                            'ACS Workers 16+: Other Means',
 'commute.ACSWRKHOME':                            'ACS Wrkrs 16+: Worked at Home',
 'OwnerRenter.OWNER_CY':                          '2019 Owner Occupied HUs', 
 'OwnerRenter.RENTER_CY':                         '2019 Renter Occupied HUs', 
 'vacant.VACANT_CY':                              '2019 Vacant Housing Units', 
 'homevalue.MEDVAL_CY':                           '2019 Median Home Value',
 'housingunittotals.TOTHU_CY':                    '2019 Total Housing Units',
 'yearbuilt.ACSMEDYBLT':                          'ACS Median Year Structure Built: HUs',
 'SpendingTotal.X1001_X':                         '2019 Annual Budget Exp',
 'transportation.X6001_X':                        '2019 Transportation',
 'households.ACSTOTHH':                           'ACS Total Households',
 'DaytimePopulation.DPOPWRK_CY':                  '2019 Daytime Pop: Workers',
 'DaytimePopulation.DPOPRES_CY':                  '2019 Daytime Pop: Residents',
 'DaytimePopulation.DPOPDENSCY':                  '2019 Daytime Pop Density',
 'occupation.OCCPROT_CY':                         '2019 Occupation: Protective Service',
 'occupation.OCCFOOD_CY':                         '2019 Occupation: Food Preperation',
 'occupation.OCCPERS_CY':                         '2019 Occupation: Personal Care',
 'occupation.OCCADMN_CY':                         '2019 Occupation: Office/Admin',
 'occupation.OCCCONS_CY':                         '2019 Occupation: Construction/Extraction',
 'occupation.OCCPROD_CY':                         '2019 Occupation: Production'
                  }
# Converting the variable dictionary to a dataframe for easier handling
enrichment_variables_df = pd.DataFrame.from_dict(enrichment_variables, orient='index',columns=['Variable Definition'])
enrichment_variables_df.reset_index(level=0, inplace=True)
enrichment_variables_df.columns = ['AnalysisVariable','Variable Definition']
enrichment_variables_df.head()
   AnalysisVariable          Variable Definition
0  classofworker.ACSCIVEMP   Employed Population Age 16+
1  classofworker.ACSMCIVEMP  Employed Male Pop Age 16+
2  classofworker.ACSMPRIVNP  Male 16+Priv Nonprofit
3  classofworker.ACSMEPRIVP  Male 16+:Priv Profit Empl
4  classofworker.ACSMSELFI   Male 16+:Priv Profit Self Empl
# Converting the variable names to a list for passing them to the enrichment tool
variable_names = enrichment_variables_df['AnalysisVariable'].tolist()

# Checking a few values of the list (elements 1 to 4)
variable_names[1:5]
['classofworker.ACSMCIVEMP',
 'classofworker.ACSMPRIVNP',
 'classofworker.ACSMEPRIVP',
 'classofworker.ACSMSELFI']
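
Before the list is sent to the enrichment tool, it may be worth checking that every entry has the 'collection.variable' form and that nothing is listed twice. The following is a sketch of such a check. The function name check_variable_names is hypothetical, and the sample list mixes a few names from the dictionary above with one deliberately malformed entry.

```python
from collections import Counter


def check_variable_names(names):
    """Split 'collection.variable' names; return (per-collection counts, malformed, duplicates)."""
    malformed = [n for n in names
                 if n.count(".") != 1 or n.startswith(".") or n.endswith(".")]
    seen = Counter(names)
    duplicates = sorted(n for n, k in seen.items() if k > 1)
    per_collection = Counter(n.split(".")[0] for n in names if n not in malformed)
    return per_collection, malformed, duplicates


# A few names from the dictionary above, plus one deliberately malformed entry.
sample_names = [
    "classofworker.ACSCIVEMP",
    "classofworker.ACSMCIVEMP",
    "gender.MEDAGE_CY",
    "commute.ACSSUBWAY",
    "POPDENS_CY",  # hypothetical mistake: collection prefix missing
]
per_collection, malformed, duplicates = check_variable_names(sample_names)
print(dict(per_collection))
print(malformed)
print(duplicates)
```

In the notebook the function would be called on variable_names instead of the sample list.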
# Data Enriching operation
airbnb_count_by_tract_enriched = enrich_layer(airbnb_count_by_tract,
                                              analysis_variables = variable_names,
                                              output_name='airbnb_tract_enrich1'+ str(dt.now().microsecond))
{"messageCode": "AO_100047", "message": "Enrichment may not be available for some features."}
{"messageCode": "AO_100000", "message": "Unable to detect country for study area at [15]."}
{"messageCode": "AO_100000", "message": "Unable to detect country for study area at [13]."}
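
The messages above indicate that enrichment could not be resolved for some study areas. A small sketch for collecting the indices they mention is shown here, assuming the messages are available as JSON strings in exactly the form printed above. How to capture them from the tool is not covered, the function name is hypothetical, and reading the bracketed number as a study-area index is an assumption.

```python
import json
import re

# The three messages printed by the enrichment call above.
raw_messages = [
    '{"messageCode": "AO_100047", "message": "Enrichment may not be available for some features."}',
    '{"messageCode": "AO_100000", "message": "Unable to detect country for study area at [15]."}',
    '{"messageCode": "AO_100000", "message": "Unable to detect country for study area at [13]."}',
]


def failed_study_areas(messages):
    """Return the sorted study-area indices named in 'Unable to detect country' messages."""
    indices = []
    for raw in messages:
        text = json.loads(raw)["message"]
        match = re.search(r"study area at \[(\d+)\]", text)
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)


failed = failed_study_areas(raw_messages)
print(failed)
```

The resulting indices could then be used to inspect the affected tracts before they enter the feature set.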
# Extracting the resulting enriched dataframe after the geoenrichment method
sdf_airbnb_count_by_tract_enriched = airbnb_count_by_tract_enriched.layers[0].query().sdf
# Visualizing the data as a pandas dataframe
print(sdf_airbnb_count_by_tract_enriched.columns)
sdf_airbnb_count_by_tract_enriched_sorted = sdf_airbnb_count_by_tract_enriched.sort_values('geoid')
sdf_airbnb_count_by_tract_enriched_sorted.head()
Index(['ACSBICYCLE', 'ACSBUS', 'ACSCARPOOL', 'ACSCIVEMP', 'ACSDRALONE',
       'ACSFAMHH', 'ACSFCIVEMP', 'ACSFEPRIVP', 'ACSFERRY', 'ACSFFEDGOV',
       ...
       'geoid', 'intptlat', 'intptlon', 'mtfcc', 'name', 'namelsad',
       'populationToPolygonSizeRating', 'sourceCountry', 'statefp', 'tractce'],
      dtype='object', length=111)
ACSBICYCLEACSBUSACSCARPOOLACSCIVEMPACSDRALONEACSFAMHHACSFCIVEMPACSFEPRIVPACSFERRYACSFFEDGOVACSFLOCGOVACSFPRIVNPACSFSELFIACSFSELFNIACSFSTGOVACSFUNPDFMACSHHAPOVACSHHBPOVACSMCIVEMPACSMCYCLEACSMEDYBLTACSMEPRIVPACSMFEDGOVACSMPRIVNPACSMSELFIACSMSELFNIACSMSTGOVACSMUNPDFMACSOTHTRANACSPUBTRANACSRAILRDACSSTRTCARACSSUBWAYACSTAXICABACSTOTHHACSWALKEDACSWORKERSACSWRKHOMEAMERIND_CYASIAN_CYAnalysisAreaBABYBOOMCYBLACK_CYDIVINDX_CYDPOPDENSCYDPOPRES_CYDPOPWRK_CYDPOP_CYEMP_CYENRICH_FIDFEMALES_CYGENALPHACYGENBASE_CYGENX_CYGENZ_CY...IDMALES_CYMEDAGE_CYMEDVAL_CYMILLENN_CYMP09147a_BMP09148a_BMP19001a_BMP19070a_IMP19071a_BN05_BUSN08_BUSN21_BUSOBJECTIDOCCADMN_CYOCCCONS_CYOCCFOOD_CYOCCPERS_CYOCCPROD_CYOCCPROT_CYOLDRGENSCYOTHRACE_CYOWNER_CYPACIFIC_CYPOPDENS_CYPoint_CountRENTER_CYS01_BUSSHAPEShape__AreaShape__LengthTOTHU_CYTOTPOP_CYUNEMPRT_CYUNEMP_CYVACANT_CYWHITE_CYX1001_XX6001_XaggregationMethodalandapportionmentConfidenceawatercountyfpfuncstatgeoidintptlatintptlonmtfccnamenamelsadpopulationToPolygonSizeRatingsourceCountrystatefptractce
20950.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0110.0194.01.044430396.06016.081.4475.90.0497.0497.00.02096.0862.00.010471.03162.01599.0...459609.031.50.05308.00.00.00.00.00.00.00.00.020960.00.00.00.00.00.06.02629.00.027.010025.900.010.0{"rings": [[[-8226256.9418, 4982172.581], [-82...4.724201e+069395.0239080.010471.00.00.00.01489.00.00.0BlockApportionment:US.BlockGroups15793612.5761125765005S36005000100+40.7934921-073.8835318G50201Census Tract 12.191US36000100
20590.0287.0173.01858.0951.01013.0718.0427.00.027.0101.065.017.049.032.00.01021.0300.01140.00.01962.0758.037.087.042.061.025.00.00.0615.00.00.0328.016.01321.071.01835.09.054.0163.00.533681876.01286.089.95823.52888.0220.03108.01795.02060.02479.0177.04638.0915.01188.0...92159.035.0437798.01233.01637.0192.03091.067.0250.02.03.03.02060257.079.076.032.079.021.0249.01440.0768.015.08690.30667.036.0{"rings": [[[-8222638.612, 4985024.3226], [-82...2.414911e+068067.0346611511.04638.05.3100.076.01404.0106532171.011165486.0BlockApportionment:US.BlockGroups4553222.576926899005S36005000200+40.8045733-073.8568585G50202Census Tract 22.191US36000200
206710.0427.0224.02917.01257.01497.01349.0794.00.024.0299.0163.023.035.011.00.01696.0239.01568.00.01999.0823.040.0128.030.083.015.00.00.01251.023.00.0801.07.01935.046.02848.053.052.0155.00.5850751168.01925.090.46855.23470.0541.04011.02890.02068.03231.0246.06288.01372.01613.0...173057.035.3392481.01622.02233.0349.04390.092.0215.05.08.02.02068432.070.0113.099.045.0276.0267.02067.01384.05.010746.915658.054.0{"rings": [[[-8222012.885, 4985135.2266], [-82...2.647647e+068312.9559742150.06288.05.0153.0108.01662.0166787168.017539648.0BlockApportionment:US.BlockGroups9123922.576602945005S36005000400+40.8089152-073.8504884G50204Census Tract 42.191US36000400
16680.0538.0113.02120.0759.01470.01284.0801.00.018.0329.074.010.052.00.00.01405.0557.0836.00.01973.0525.00.045.050.00.065.00.00.0991.00.00.0453.00.01962.0154.02098.081.054.078.00.1872891205.02298.087.830304.33716.01960.05676.02332.01669.03365.0238.05982.01100.01484.0...182617.035.6415686.01515.01769.0211.03424.070.0247.02.04.02.01669378.023.0157.0219.070.0131.0440.01550.0356.00.031938.111675.053.0{"rings": [[[-8222181.7567, 4986069.1354], [-8...8.478126e+053898.1631492099.05982.05.2128.068.01649.092988868.09541987.0BlockApportionment:US.BlockGroups4850792.5760005S36005001600+40.8188478-073.8580764G502016Census Tract 162.191US36001600
212712.0142.032.01290.0251.0495.0627.0372.00.04.092.097.010.012.040.00.0687.0282.0663.00.01954.0440.013.053.00.091.031.00.012.0843.023.026.0652.00.0969.076.01250.024.023.032.01.074650266.0642.089.67699.61143.07131.08274.0896.02128.0927.0127.02019.0486.0473.0...271092.033.0393478.0614.0504.064.01135.088.066.065.039.029.02128118.040.039.031.039.00.053.0642.048.01.01878.824547.0380.0{"rings": [[[-8230028.8927, 4984061.5402], [-8...4.862123e+0611742.631147690.02019.012.2124.095.0590.025499197.02670728.0BlockApportionment:US.BlockGroups16436542.5761139660005S36005001900+40.8009990-073.9093729G502019Census Tract 192.191US36001900

5 rows × 111 columns

The field names of the enriched dataframe are codes that need to be spelled out. They are therefore renamed with their actual definitions, taken from the variable definitions in the list created when the variables were selected.
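
The notebook builds the rename mapping through a dataframe in the cells that follow. As a compact alternative sketch (not the notebook's own method), the same mapping can be derived directly from the dictionary by dropping the collection prefix from each key. The helper name build_rename_map is hypothetical, and the check for clashing short codes is an added safeguard.

```python
from collections import Counter


def build_rename_map(variables):
    """Map short field codes (text after the first dot) to their definitions."""
    short_codes = [key.split(".", 1)[1] for key in variables]
    clashes = [code for code, k in Counter(short_codes).items() if k > 1]
    if clashes:
        raise ValueError(f"Short codes are not unique: {clashes}")
    return dict(zip(short_codes, variables.values()))


# A three-entry excerpt of the enrichment_variables dictionary defined earlier.
excerpt = {
    "classofworker.ACSCIVEMP": "Employed Population Age 16+",
    "gender.MEDAGE_CY": "2019 Median Age",
    "populationtotals.POPDENS_CY": "2019 Population Density",
}
rename_map = build_rename_map(excerpt)
print(rename_map)
```

The resulting dictionary has the shape expected by the columns argument of DataFrame.rename.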

enrichment_variables_df.head()
   AnalysisVariable          Variable Definition
0  classofworker.ACSCIVEMP   Employed Population Age 16+
1  classofworker.ACSMCIVEMP  Employed Male Pop Age 16+
2  classofworker.ACSMPRIVNP  Male 16+Priv Nonprofit
3  classofworker.ACSMEPRIVP  Male 16+:Priv Profit Empl
4  classofworker.ACSMSELFI   Male 16+:Priv Profit Self Empl
enrichment_variables_copy = enrichment_variables_df.copy()
enrichment_variables_copy.head(2)
   AnalysisVariable          Variable Definition
0  classofworker.ACSCIVEMP   Employed Population Age 16+
1  classofworker.ACSMCIVEMP  Employed Male Pop Age 16+
enrichment_variables_copy['AnalysisVariable'] = enrichment_variables_copy.AnalysisVariable.str.split(pat='.', expand=True)[1]
enrichment_variables_copy
    AnalysisVariable  Variable Definition
0   ACSCIVEMP         Employed Population Age 16+
1   ACSMCIVEMP        Employed Male Pop Age 16+
2   ACSMPRIVNP        Male 16+Priv Nonprofit
3   ACSMEPRIVP        Male 16+:Priv Profit Empl
4   ACSMSELFI         Male 16+:Priv Profit Self Empl
5   ACSMSTGOV         Male 16+:State Govt Wrkr
6   ACSMFEDGOV        Male 16+:Fed Govt Wrkr
7   ACSMSELFNI        Male 16+:Self-Emp Not Inc
8   ACSMUNPDFM        Male 16+:Unpaid Family Wrkr
9   ACSFCIVEMP        Female Pop Age 16+
10  ACSFEPRIVP        Female 16+:Priv Profit Empl
11  ACSFSELFI         Female 16+:Priv Profit Self Empl
12  ACSFPRIVNP        Female 16+:Priv Nonprofit
13  ACSFLOCGOV        Female 16+:Local Govt Wrkr
14  ACSFSTGOV         Female 16+:State Govt Wrkr
15  ACSFFEDGOV        Female 16+:Fed Govt Wrkr
16  ACSFSELFNI        Female 16+:Self-Emp Not Inc
17  ACSFUNPDFM        Female 16+:Unpaid Family Wrkr
18  MEDAGE_CY         2019 Median Age
19  GENALPHACY        2019 Generation Alpha Population
20  GENZ_CY           2019 Generation Z Population
21  MILLENN_CY        2019 Millennial Population
22  GENX_CY           2019 Generation X Population
23  BABYBOOMCY        2019 Baby Boomer Population
24  OLDRGENSCY        2019 Silent & Greatest Generations Population
25  GENBASE_CY        2019 Population by Generation Base
26  POPDENS_CY        2019 Population Density
27  DPOP_CY           2019 Total Daytime Population
28  WHITE_CY          2019 White Population
29  BLACK_CY          2019 Black Population
..  ...               ...
56  ACSPUBTRAN        ACS Workers 16+: Public Transportation
57  ACSBUS            ACS Workers 16+: Bus
58  ACSSTRTCAR        ACS Workers 16+: Streetcar
59  ACSSUBWAY         ACS Workers 16+: Subway
60  ACSRAILRD         ACS Workers 16+: Railroad
61  ACSFERRY          ACS Workers 16+: Ferryboat
62  ACSTAXICAB        ACS Workers 16+: Taxicab
63  ACSMCYCLE         ACS Workers 16+: Motorcycle
64  ACSBICYCLE        ACS Workers 16+: Bicycle
65  ACSWALKED         ACS Workers 16+: Walked
66  ACSOTHTRAN        ACS Workers 16+: Other Means
67  ACSWRKHOME        ACS Wrkrs 16+: Worked at Home
68  OWNER_CY          2019 Owner Occupied HUs
69  RENTER_CY         2019 Renter Occupied HUs
70  VACANT_CY         2019 Vacant Housing Units
71  MEDVAL_CY         2019 Median Home Value
72  TOTHU_CY          2019 Total Housing Units
73  ACSMEDYBLT        ACS Median Year Structure Built: HUs
74  X1001_X           2019 Annual Budget Exp
75  X6001_X           2019 Transportation
76  ACSTOTHH          ACS Total Households
77  DPOPWRK_CY        2019 Daytime Pop: Workers
78  DPOPRES_CY        2019 Daytime Pop: Residents
79  DPOPDENSCY        2019 Daytime Pop Density
80  OCCPROT_CY        2019 Occupation: Protective Service
81  OCCFOOD_CY        2019 Occupation: Food Preperation
82  OCCPERS_CY        2019 Occupation: Personal Care
83  OCCADMN_CY        2019 Occupation: Office/Admin
84  OCCCONS_CY        2019 Occupation: Construction/Extraction
85  OCCPROD_CY        2019 Occupation: Production

86 rows × 2 columns

enrichment_variables_copy.set_index("AnalysisVariable", drop=True, inplace=True)
dictionary = enrichment_variables_copy.to_dict()
new_columns = dictionary['Variable Definition']
# Field renamed and new dataframe visualized
pd.set_option('display.max_columns', 150)
sdf_airbnb_count_by_tract_enriched_sorted.rename(columns=new_columns, inplace=True)
sdf_airbnb_count_by_tract_enriched_sorted.head()
ACS Workers 16+: BicycleACS Workers 16+: BusACS Workers 16+: CarpooledEmployed Population Age 16+ACS Workers 16+: Drove AloneACS Family HouseholdsFemale Pop Age 16+Female 16+:Priv Profit EmplACS Workers 16+: FerryboatFemale 16+:Fed Govt WrkrFemale 16+:Local Govt WrkrFemale 16+:Priv NonprofitFemale 16+:Priv Profit Self EmplFemale 16+:Self-Emp Not IncFemale 16+:State Govt WrkrFemale 16+:Unpaid Family WrkrHHs:Inc at/Above Poverty LevelHHs: Inc Below Poverty LevelEmployed Male Pop Age 16+ACS Workers 16+: MotorcycleACS Median Year Structure Built: HUsMale 16+:Priv Profit EmplMale 16+:Fed Govt WrkrMale 16+Priv NonprofitMale 16+:Priv Profit Self EmplMale 16+:Self-Emp Not IncMale 16+:State Govt WrkrMale 16+:Unpaid Family WrkrACS Workers 16+: Other MeansACS Workers 16+: Public TransportationACS Workers 16+: RailroadACS Workers 16+: StreetcarACS Workers 16+: SubwayACS Workers 16+: TaxicabACS Total HouseholdsACS Workers 16+: WalkedACS Workers Age 16+ACS Wrkrs 16+: Worked at Home2019 American Indian Population2019 Asian PopulationAnalysisArea2019 Baby Boomer Population2019 Black Population2019 Diversity Index2019 Daytime Pop Density2019 Daytime Pop: Residents2019 Daytime Pop: Workers2019 Total Daytime Population2019 Employed Civilian Pop 16+ENRICH_FID2019 Female Population2019 Generation Alpha Population2019 Population by Generation Base2019 Generation X Population2019 Generation Z PopulationHasDataID2019 Male Population2019 Median Age2019 Median Home Value2019 Millennial PopulationOwn any tabletOwn any e-readerHave access to Internet at homeIndex: Spend 0.5-0.9 hrs online(excl email/IM .Spend <0.5 hrs online (excl email/IM time) dailyConstruction Businesses (NAICS)Retail Trade Businesses (NAICS)Transportation/Warehouse Bus (NAICS)OBJECTID2019 Occupation: Office/Admin2019 Occupation: Construction/Extraction2019 Occupation: Food Preperation2019 Occupation: Personal Care2019 Occupation: Production2019 Occupation: Protective Service2019 Silent & Greatest Generations 
Population2019 Other Race Population2019 Owner Occupied HUs2019 Pacific Islander Population2019 Population DensityPoint_Count2019 Renter Occupied HUsTotal Businesses (SIC)SHAPEShape__AreaShape__Length2019 Total Housing Units2019 Total Population2019 Unemployment Rate2019 Unemployed Population 16+2019 Vacant Housing Units2019 White Population2019 Annual Budget Exp2019 TransportationaggregationMethodalandapportionmentConfidenceawatercountyfpfuncstatgeoidintptlatintptlonmtfccnamenamelsadpopulationToPolygonSizeRatingsourceCountrystatefptractce
20950.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0110.0194.01.044430396.06016.081.4475.90.0497.0497.00.02096.0862.00.010471.03162.01599.01.0459609.031.50.05308.00.00.00.00.00.00.00.00.020960.00.00.00.00.00.06.02629.00.027.010025.900.010.0{"rings": [[[-8226256.9418, 4982172.581], [-82...4.724201e+069395.0239080.010471.00.00.00.01489.00.00.0BlockApportionment:US.BlockGroups15793612.5761125765005S36005000100+40.7934921-073.8835318G50201Census Tract 12.191US36000100
20590.0287.0173.01858.0951.01013.0718.0427.00.027.0101.065.017.049.032.00.01021.0300.01140.00.01962.0758.037.087.042.061.025.00.00.0615.00.00.0328.016.01321.071.01835.09.054.0163.00.533681876.01286.089.95823.52888.0220.03108.01795.02060.02479.0177.04638.0915.01188.01.092159.035.0437798.01233.01637.0192.03091.067.0250.02.03.03.02060257.079.076.032.079.021.0249.01440.0768.015.08690.30667.036.0{"rings": [[[-8222638.612, 4985024.3226], [-82...2.414911e+068067.0346611511.04638.05.3100.076.01404.0106532171.011165486.0BlockApportionment:US.BlockGroups4553222.576926899005S36005000200+40.8045733-073.8568585G50202Census Tract 22.191US36000200
206710.0427.0224.02917.01257.01497.01349.0794.00.024.0299.0163.023.035.011.00.01696.0239.01568.00.01999.0823.040.0128.030.083.015.00.00.01251.023.00.0801.07.01935.046.02848.053.052.0155.00.5850751168.01925.090.46855.23470.0541.04011.02890.02068.03231.0246.06288.01372.01613.01.0173057.035.3392481.01622.02233.0349.04390.092.0215.05.08.02.02068432.070.0113.099.045.0276.0267.02067.01384.05.010746.915658.054.0{"rings": [[[-8222012.885, 4985135.2266], [-82...2.647647e+068312.9559742150.06288.05.0153.0108.01662.0166787168.017539648.0BlockApportionment:US.BlockGroups9123922.576602945005S36005000400+40.8089152-073.8504884G50204Census Tract 42.191US36000400
16680.0538.0113.02120.0759.01470.01284.0801.00.018.0329.074.010.052.00.00.01405.0557.0836.00.01973.0525.00.045.050.00.065.00.00.0991.00.00.0453.00.01962.0154.02098.081.054.078.00.1872891205.02298.087.830304.33716.01960.05676.02332.01669.03365.0238.05982.01100.01484.01.0182617.035.6415686.01515.01769.0211.03424.070.0247.02.04.02.01669378.023.0157.0219.070.0131.0440.01550.0356.00.031938.111675.053.0{"rings": [[[-8222181.7567, 4986069.1354], [-8...8.478126e+053898.1631492099.05982.05.2128.068.01649.092988868.09541987.0BlockApportionment:US.BlockGroups4850792.5760005S36005001600+40.8188478-073.8580764G502016Census Tract 162.191US36001600
212712.0142.032.01290.0251.0495.0627.0372.00.04.092.097.010.012.040.00.0687.0282.0663.00.01954.0440.013.053.00.091.031.00.012.0843.023.026.0652.00.0969.076.01250.024.023.032.01.074650266.0642.089.67699.61143.07131.08274.0896.02128.0927.0127.02019.0486.0473.01.0271092.033.0393478.0614.0504.064.01135.088.066.065.039.029.02128118.040.039.031.039.00.053.0642.048.01.01878.824547.0380.0{"rings": [[[-8230028.8927, 4984061.5402], [-8...4.862123e+0611742.631147690.02019.012.2124.095.0590.025499197.02670728.0BlockApportionment:US.BlockGroups16436542.5761139660005S36005001900+40.8009990-073.9093729G502019Census Tract 192.191US36001900

The renamed data frame above now has self-explanatory column names and is therefore easier to interpret.
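The renaming step boils down to building a code-to-alias dictionary and passing it to `rename`. As a minimal sketch, assuming a toy lookup table and a toy enriched frame (the names `lookup`, `enriched` and `alias_by_code` and all values are made up for illustration and do not appear in the notebook), the same idea looks like this:

```python
import pandas as pd

# Hypothetical two-row lookup table with the same layout as the variable table
lookup = pd.DataFrame({
    "AnalysisVariable": ["MEDAGE_CY", "POPDENS_CY"],
    "Variable Definition": ["2019 Median Age", "2019 Population Density"],
})

# Toy "enriched" frame with coded column names (values are illustrative only)
enriched = pd.DataFrame({
    "geoid": ["tract_a", "tract_b"],
    "MEDAGE_CY": [35.0, 31.5],
    "POPDENS_CY": [8690.3, 10025.9],
})

# Map each variable code to its human-readable definition
alias_by_code = lookup.set_index("AnalysisVariable")["Variable Definition"].to_dict()

# Columns missing from the dictionary (here "geoid") keep their original names
renamed = enriched.rename(columns=alias_by_code)
print(renamed.columns.tolist())
```

Because `rename` ignores keys that are not present and leaves unmapped columns alone, the same dictionary can be applied safely to frames that contain only a subset of the enrichment variables.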

Estimating distances of tracts from various city features

The next set of features is the distance of each tract from various city features. These distance variables accomplish two important tasks.

First, they bring the spatial component of the Airbnb development phenomenon into the model.

Second, each Airbnb property is affected by its own locational factors. Airbnb reviews reflect this: the most highly rated, in-demand properties tend to be located in neighbourhoods with good transit accessibility. The model accounts for this by including the distances from the tracts to different public transit options.

The hypothesis formed here is that tracts located near transit features such as subway stations, bus stops, railroad lines and subway routes might attract more Airbnb properties. Similarly, distance to the central business district, which for New York is located in Lower Manhattan, might also influence the number of Airbnb properties, since this is the city's main business hub. In the following cells these distances are estimated using the proximity methods of the ArcGIS API for Python.
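To make the idea of a straight-line distance to the nearest feature concrete, here is a minimal point-to-point sketch. It is not how `find_nearest` is implemented: the ArcGIS tool works on full feature geometries on the server, whereas this illustration applies the haversine formula to latitude/longitude pairs. The function names and all coordinates below are assumptions chosen for illustration only.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_feature(origin, candidates):
    """Return (name, miles) for the candidate closest to origin."""
    distances = (
        (name, haversine_miles(origin[0], origin[1], lat, lon))
        for name, (lat, lon) in candidates.items()
    )
    return min(distances, key=lambda pair: pair[1])


# Hypothetical tract point and two hypothetical transit stops
tract_point = (40.7935, -73.8835)
stops = {"stop_a": (40.8000, -73.8800), "stop_b": (40.8500, -73.9500)}

name, miles = nearest_feature(tract_point, stops)
print(name, round(miles, 3))
```

Keeping only the single closest candidate mirrors the `max_count=1` setting used in the cells that follow.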

# accessing the various city feature shapefile from arcgis portal
busi_distr = gis.content.search('BusinessDistricts owner:api_data_owner', 'feature layer')[0]
cbd = gis.content.search('NYCBD owner:api_data_owner', 'feature layer')[0]
bus_stop = gis.content.search('NYCBusStop owner:api_data_owner', 'feature layer')[0]
hotels = gis.content.search('NYCHotels owner:api_data_owner', 'feature layer')[0]
railroad = gis.content.search('NYCRailroad owner:api_data_owner', 'feature layer')[0]
subwy_rt = gis.content.search('NYCSubwayRoutes owner:api_data_owner', 'feature layer')[0]
subwy_stn = gis.content.search('NYCSubwayStation owner:api_data_owner', 'feature layer')[0]
bus_stop_lyr = bus_stop.layers[0]
cbd_lyr = cbd.layers[0] 
hotels_lyr = hotels.layers[0] 
subwy_stn_lyr =subwy_stn.layers[0]
subwy_rt_lyr = subwy_rt.layers[0] 
railroad_lyr = railroad.layers[0]
busi_distrs_lyr = busi_distr.layers[0] 
# Avoid warning for chain operation
pd.set_option('mode.chained_assignment', None) 

# Estimating Tract to hotel distances
tract_hotel_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                              hotels_lyr,
                                              measurement_type='StraightLine',
                                              max_count=1,
                                              output_name='ny_tract_hotel_dist1' + str(dt.now().microsecond))
tract_hotel_dist.layers
[<FeatureLayer url:"https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/ny_tract_hotel_dist1479635/FeatureServer/0">,
 <FeatureLayer url:"https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/ny_tract_hotel_dist1479635/FeatureServer/1">]
tract_hotel_dist_lyr = tract_hotel_dist.layers[1]
sdf_tract_hotel_dist_lyr = pd.DataFrame.spatial.from_layer(tract_hotel_dist_lyr)
sdf_tract_hotel_dist_lyr.head()
From_IDFrom_NameFrom_Shape__AreaFrom_Shape__LengthFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_ACRESTo_ADD_ADDRTo_ADD_CITYTo_ADD_OWNERTo_ADD_POBOXTo_ADD_STATETo_ADD_ZIPTo_AGDISTCODETo_AGDISTNAMETo_BLDG_DESCTo_BLDG_STYLETo_BOOKTo_CALC_ACRESTo_COUNTYTo_CT_NAMETo_CT_SWISTo_DEPTHTo_DUP_GEOTo_FRONTTo_FUEL_DESCTo_FUEL_TYPETo_FULL_MVTo_GRID_EASTTo_GRID_NORTHTo_HEAT_DESCTo_HEAT_TYPETo_IDTo_LAND_AVTo_LOC_STREETTo_LOC_ST_NBRTo_LOC_UNITTo_LOC_ZIPTo_MAIL_ADDRTo_MAIL_CITYTo_MAIL_STATETo_MAIL_ZIPTo_MUNI_NAMETo_MUNI_PCLIDTo_NAMESOURCETo_NBR_BEDRMTo_NBR_F_BATHTo_NBR_KITCHNTo_NYS_NAMETo_OWNER_TYPETo_PAGETo_PARCELADDRTo_PO_BOXTo_PRINT_KEYTo_PRMY_OWNERTo_PROP_CLASSTo_ROLL_SECTTo_ROLL_YRTo_SBLTo_SCH_CODETo_SCH_NAMETo_SEWER_DESCTo_SEWER_TYPETo_SPATIAL_YRTo_SQFT_LIVTo_SQ_FTTo_SWISTo_SWISPKIDTo_SWISSBLIDTo_Shape__AreaTo_Shape__LengthTo_TOTAL_AVTo_USEDASCODETo_USEDASDESCTo_UTILITIESTo_UTIL_DESCTo_WATER_DESCTo_WATER_SUPPTo_YR_BLTTotal_Miles
01332227.0372591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227552.1448, 4989516.0602], [-8...0MiscellaneousH900.162924BronxBronx600100159420101188624835230202500WEBSTER AVENUE193010457Bronx000801930 WEBSTER AVENUEWEBSTER TREMONT EQUIT052017203027001009201719110675360010060010020302700101154.062500166.305662153180019310.519338
1501207.0182598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112None0DormitoriesH800.237481New YorkManhattan6201001001010995721234272701389700AMSTERDAM AVENUE123510027Manhattan000801235 AMSTERDAM AVENUEBARNARD COLLEGE0820171019630030032017790361009262010062010010196300301680.277344163.992960678735019680.000000
2469174.0288608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213None0DormitoriesH800.338975New YorkManhattan62010010114209989662293166951291050EAST 110 STREET5510029Manhattan0008055 EAST 110 STREETEDWIN GOULD RESIDENCE0820171016160024042017375701434762010062010010161600242397.410156198.095165305145020040.000000
3454160.0289790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214None0DormitoriesH800.742213New YorkManhattan62010010130509974212263556896925950EAST 98 STREET5010029Manhattan0008050 EAST 98 STREETMSMC RESIDENTIAL REAL08201710160300390220172400003078162010062010010160300395248.046875332.868992897435019840.000000
4440150.0190155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232878.2271, 4979973.2013], [-8...0Transient Occupancy - Midtown Manhattan AreaH300.036426New YorkManhattan6201001012609968412233416871395000EAST 87 STREET16410128Manhattan00080164 EAST 87 STREET164 EAST 87TH ST LLC05201710151500450220171830025716201006201001015150045257.50000094.764453462735019300.139874

In the above dataframe, the Total_Miles field gives the distance from each tract to its nearest hotel in miles. This field is converted into feet and retained. The same procedure is then repeated for each of the other distance estimates.
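The conversion itself is a single multiplication by 5280, the number of feet in a mile. A small worked sketch, using the first three Total_Miles values from the hotel output table (the helper name `miles_to_feet` is an assumption for illustration):

```python
FEET_PER_MILE = 5280


def miles_to_feet(miles, ndigits=2):
    """Convert a distance in miles to feet, rounded to ndigits decimals."""
    return round(miles * FEET_PER_MILE, ndigits)


# Total_Miles values for tracts 36005000100, 36005000200 and 36005000400
sample_miles = [1.055256, 1.039099, 0.472664]
sample_feet = [miles_to_feet(m) for m in sample_miles]
print(sample_feet)
```

The results should agree with the hotel_dist values shown for the same tracts.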

# Final hotel distances in feet. In each row, the column "hotel_dist" gives the distance from the tract (identified by its geoid) to the nearest hotel.
# For example, in the first row the tract with ID 36005000100 has its nearest hotel 5571.75 feet away.
sdf_tract_hotel_dist_lyr_new = sdf_tract_hotel_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_hotel_dist_lyr_new['hotel_dist'] = round(sdf_tract_hotel_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_hotel_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  hotel_dist
2095  36005000100     1.055256     5571.75
2059  36005000200     1.039099     5486.44
2067  36005000400     0.472664     2495.67
1668  36005001600     0.585977     3093.96
2127  36005001900     0.000000        0.00
# Estimating Busstop distances from tracts
tract_bustop_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                               bus_stop_lyr,
                                               measurement_type='StraightLine',
                                               max_count=1,
                                               output_name='ny_tract_bus_stop_dist'+ str(dt.now().microsecond))
tract_bustop_dist_lyr = tract_bustop_dist.layers[1]
sdf_tract_bustop_dist_lyr =tract_bustop_dist_lyr.query().sdf
# Final bus stop distances in feet. In each row, the column "busstop_dist" gives the distance from the tract
# (identified by its geoid) to the nearest bus stop
sdf_tract_bustop_dist_lyr_new = sdf_tract_bustop_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_bustop_dist_lyr_new['busstop_dist'] = round(sdf_tract_bustop_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_bustop_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  busstop_dist
2095  36005000100     0.744344        3930.0
2059  36005000200     0.005983          32.0
2067  36005000400     0.000000           0.0
1668  36005001600     0.000000           0.0
2127  36005001900     0.000000           0.0
# estimating number of bus stops per tract
num_bustops_tracts = summarize_data.aggregate_points(point_layer=bus_stop_lyr,
                                                   polygon_layer=nyc_tracts_layer,
                                                   output_name='bustops_by_tracts'+ str(dt.now().microsecond)) 
num_bustops_tracts_lyr = num_bustops_tracts.layers[0]
sdf_num_bustops_tracts_lyr = pd.DataFrame.spatial.from_layer(num_bustops_tracts_lyr)
sdf_num_bustops_tracts_lyr.head()
AnalysisAreaOBJECTIDPoint_CountSHAPEShape__AreaShape__Lengthalandawatercountyfpfuncstatgeoidintptlatintptlonmtfccnamenamelsadstatefptractce
00.01602411{"rings": [[[-8227813.3004, 4989345.3624], [-8...72591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020227.03Census Tract 227.0336022703
10.01825222{"rings": [[[-8233183.0202, 4984115.3687], [-8...82598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020207.01Census Tract 207.0136020701
20.01958731{"rings": [[[-8231989.6748, 4982433.1457], [-8...88608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020174.02Census Tract 174.0236017402
30.01985441{"rings": [[[-8232691.8783, 4981159.0609], [-8...89790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020160.02Census Tract 160.0236016002
40.01993950{"rings": [[[-8233292.0018, 4980071.8459], [-8...90155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020150.01Census Tract 150.0136015001
# Number of bus stops per tract. In each row, the column "num_bustop" gives the number of bus stops inside the respective tract
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr[['geoid', 'Point_Count']] 
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr_new.rename(columns={'Point_Count':'num_bustop'})
sdf_num_bustops_tracts_lyr_new.sort_values('geoid').head()
            geoid  num_bustop
2095  36005000100           0
2059  36005000200           0
2067  36005000400           1
1668  36005001600           3
2127  36005001900           2
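Conceptually, counting bus stops per tract is a point-in-polygon count. The sketch below illustrates that idea with a ray-casting test on toy square "tracts"; it is an assumption-laden illustration, not the algorithm used by `aggregate_points`, and the function names, polygons and points are all hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_polygon(points, polygons):
    """Return {polygon_id: number of points falling inside that polygon}."""
    return {
        pid: sum(point_in_polygon(x, y, ring) for x, y in points)
        for pid, ring in polygons.items()
    }


# Two adjacent toy "tracts" and four toy "bus stops" (one lies outside both)
polygons = {
    "tract_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "tract_b": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = [(1, 1), (0.5, 1.5), (3, 1), (5, 5)]

counts = count_points_per_polygon(points, polygons)
print(counts)
```

Points that fall outside every polygon are simply not counted, which is why a tract can end up with a count of zero.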
# estimating tracts distances from CBD 
tract_cbd_dist=use_proximity.find_nearest(nyc_tracts_layer,
                                          cbd_lyr,
                                          measurement_type='StraightLine',
                                          max_count=1,
                                          output_name='ny_tract_cbd_dist'+ str(dt.now().microsecond))
tract_cbd_dist_lyr = tract_cbd_dist.layers[1]
sdf_tract_cbd_dist_lyr = tract_cbd_dist_lyr.query().sdf
sdf_tract_cbd_dist_lyr.head()
From_IDFrom_NameFrom_Shape__AreaFrom_Shape__LengthFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_Shape__AreaTo_Shape__LengthTo_bidTo_boroughTo_date_creatTo_date_modifTo_objectidTo_shape_areaTo_shape_lenTo_time_creatTo_time_modifTotal_Miles
01332227.0372591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227840.685, 4989242.9453], [-82...1198651.3945312349.824219Bryant Park BIDManhattan2008-11-192016-10-3158NoneNone00:00:00.00000:00:00.0007.102363
1501207.0182598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8233012.3673, 4983978.5918], [-8...1198651.3945312349.824219Bryant Park BIDManhattan2008-11-192016-10-3158NoneNone00:00:00.00000:00:00.0003.809966
2469174.0288608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213{"paths": [[[-8231927.2364, 4982389.3819], [-8...1198651.3945312349.824219Bryant Park BIDManhattan2008-11-192016-10-3158NoneNone00:00:00.00000:00:00.0003.363737
3454160.0289790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214{"paths": [[[-8232719.7081, 4981110.3946], [-8...1198651.3945312349.824219Bryant Park BIDManhattan2008-11-192016-10-3158NoneNone00:00:00.00000:00:00.0002.658677
4440150.0190155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8233341.205, 4979982.316], [-823...1198651.3945312349.824219Bryant Park BIDManhattan2008-11-192016-10-3158NoneNone00:00:00.00000:00:00.0002.055165
# Final CBD distances in feet. In each row, the column "cbd_dist" gives the distance from the respective tract to the CBD
sdf_tract_cbd_dist_lyr_new = sdf_tract_cbd_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_cbd_dist_lyr_new['cbd_dist'] = round(sdf_tract_cbd_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_cbd_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  cbd_dist
2095  36005000100     4.999247  26396.02
2059  36005000200     6.858514  36212.95
2067  36005000400     7.321927  38659.77
1668  36005001600     7.525535  39734.83
2127  36005001900     4.333590  22881.35
# Estimating NYCSubwayStation distances from tracts 
tract_subwy_stn_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                                  subwy_stn_lyr,
                                                  measurement_type='StraightLine',
                                                  max_count=1,
                                                  output_name='ny_tract_subway_station_dist'+ str(dt.now().microsecond))
tract_subwy_stn_dist_lyr = tract_subwy_stn_dist.layers[1]
sdf_tract_subwy_stn_dist_lyr = pd.DataFrame.spatial.from_layer(tract_subwy_stn_dist_lyr)
sdf_tract_subwy_stn_dist_lyr.head()
From_IDFrom_NameFrom_Shape__AreaFrom_Shape__LengthFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_NameTo_lineTo_notesTo_objectidTo_urlTotal_Miles
01332227.0372591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227646.1872, 4989522.1588], [-8...21174th-175th StsB-DB-rush hours, D-all times, skips rush hours AM...21http://web.mta.info/nyct/service/0.054525
1501207.0182598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8233201.4992, 4984081.6891], [-8...167116th St - Columbia University11-all times167http://web.mta.info/nyct/service/0.211254
2469174.0288608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213{"paths": [[[-8231617.2329, 4982254.7552], [-8...450110th St4-6-6 Express4 nights, 6-all times, 6 Express-weekdays AM s...450http://web.mta.info/nyct/service/0.097270
3454160.0289790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214{"paths": [[[-8232360.48, 4980909.7037], [-823...3396th St4-6-6 Express4 nights, 6-all times, 6 Express-weekdays AM s...33http://web.mta.info/nyct/service/0.098659
4440150.0190155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232878.9944, 4979971.8171], [-8...45186th St4-5-6-6 Express4,6-all times, 5-all times exc nights, 6 Expre...451http://web.mta.info/nyct/service/0.097110
# Final tract to NYC subway station distances in feet. In each row, the column "subwy_stn_dist" gives the distance
# from the tract to the nearest subway station
sdf_tract_subwy_stn_dist_lyr_new = sdf_tract_subwy_stn_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_subwy_stn_dist_lyr_new['subwy_stn_dist'] = round(sdf_tract_subwy_stn_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_subwy_stn_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  subwy_stn_dist
2095  36005000100     0.946226         4996.07
2059  36005000200     1.108173         5851.15
2067  36005000400     1.191505         6291.15
1668  36005001600     0.729661         3852.61
2127  36005001900     0.080063          422.73
# Estimating distances to NYCSubwayRoutes
tract_subwy_rt_dist=use_proximity.find_nearest(nyc_tracts_layer,
                                               subwy_rt_lyr,
                                               measurement_type='StraightLine',
                                               max_count=1,
                                               output_name='ny_tract_subway_routes_dist'+ str(dt.now().microsecond))
tract_subwy_rt_dist_lyr = tract_subwy_rt_dist.layers[1]
sdf_tract_subwy_rt_dist_lyr = tract_subwy_rt_dist_lyr.query().sdf
sdf_tract_subwy_rt_dist_lyr.head()
From_IDFrom_NameFrom_Shape__AreaFrom_Shape__LengthFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_Shape__LengthTo_group_To_route_idTo_route_longTo_route_shorTotal_Miles
01332227.0372591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311None1251293.284426BDFMB6 Avenue ExpressB0.000000
1501207.0182598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8233150.5149, 4984174.9334], [-8...331208.1378311231Broadway - 7 Avenue Local10.169658
2469174.0288608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213{"paths": [[[-8231635.9013, 4982220.9583], [-8...931863.5453394566Lexington Avenue Express/Local60.096920
3454160.0289790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214{"paths": [[[-8232305.5995, 4981008.652], [-82...931863.5453394566Lexington Avenue Express/Local60.096942
4440150.0190155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232878.2271, 4979973.2013], [-8...931863.5453394566Lexington Avenue Express/Local60.096767
# Final tract to NYCSubwayRoutes distances in feet. In each row, the column "subwy_rt_dist" gives the distance
# from the tract to the nearest subway route
sdf_tract_subwy_rt_dist_lyr_new = sdf_tract_subwy_rt_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_subwy_rt_dist_lyr_new['subwy_rt_dist'] = round(sdf_tract_subwy_rt_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_subwy_rt_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  subwy_rt_dist
2095  36005000100     0.905310         4780.0
2059  36005000200     1.108725         5854.0
2067  36005000400     1.192022         6294.0
1668  36005001600     0.724321         3824.0
2127  36005001900     0.002853           15.0
# Estimating distances to NYCRailroad
tract_railroad_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                           railroad_lyr,
                                           measurement_type='StraightLine',
                                           max_count=1,
                                           output_name='tract_railroad_dist'+ str(dt.now().microsecond))
tract_railroad_dist_lyr = tract_railroad_dist.layers[1]
sdf_tract_railroad_dist_lyr = pd.DataFrame.spatial.from_layer(tract_railroad_dist_lyr)
sdf_tract_railroad_dist_lyr.head()
From_IDFrom_NameFrom_Shape__AreaFrom_Shape__LengthFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_Id_OrigTo_Shape__LengthTotal_Miles
01332227.0372591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227770.665, 4989475.2983], [-82...102.194199e+060.140554
1501207.0182598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8232997.3392, 4984450.7008], [-8...102.194199e+060.166535
2469174.0288608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213None102.194199e+060.000000
3454160.0289790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214None102.194199e+060.000000
4440150.0190155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232883.8969, 4979976.3554], [-8...102.194199e+060.559931
# Final tract to NYCRailroad distances in feet. In each row, the column "railroad_dist" gives the distance
# from the tract to the nearest railroad line
sdf_tract_railroad_dist_lyr_new = sdf_tract_railroad_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_railroad_dist_lyr_new['railroad_dist'] = round(sdf_tract_railroad_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_railroad_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  railroad_dist
2095  36005000100     0.403054        2128.12
2059  36005000200     0.215395        1137.29
2067  36005000400     0.708551        3741.15
1668  36005001600     0.614506        3244.59
2127  36005001900     0.000000           0.00
# Estimating distances to NYC Business Districts
tract_busi_distrs_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                                      busi_distrs_lyr,
                                                      measurement_type='StraightLine',
                                                      max_count=1,
                                                      output_name='tract_busi_distrs_dist'+ str(dt.now().microsecond))
tract_busi_distrs_dist_lyr = tract_busi_distrs_dist.layers[1]
sdf_tract_busi_distrs_dist_lyr = pd.DataFrame.spatial.from_layer(tract_busi_distrs_dist_lyr)
sdf_tract_busi_distrs_dist_lyr.head()
From_IDFrom_NameFrom_Shape__AreaFrom_Shape__LengthFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_Shape__AreaTo_Shape__LengthTo_bidTo_boroughTo_date_creatTo_date_modifTo_objectidTo_shape_areaTo_shape_lenTo_time_creatTo_time_modifTotal_Miles
01332227.0372591.2343751083.721395415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227809.0397, 4989358.3474], [-8...14180282.2265623281.574700Washington Heights BIDManhattan2008-11-192016-10-2569NoneNone00:00:00.00000:00:00.0001.034897
1501207.0182598.6289061212.182959472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8232865.8707, 4984369.9553], [-8...16269468.5078124849.117421125th Street BIDManhattan2008-11-192016-10-2567NoneNone00:00:00.00000:00:00.0000.159359
2469174.0288608.6718751246.964673507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213{"paths": [[[-8231888.4853, 4982612.3984], [-8...16269468.5078124849.117421125th Street BIDManhattan2008-11-192016-10-2567NoneNone00:00:00.00000:00:00.0000.604451
3454160.0289790.9023441256.907546514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214{"paths": [[[-8232539.9271, 4981009.3871], [-8...70409639.4140628507.996060Madison Avenue BIDManhattan2008-11-192016-10-2664NoneNone00:00:00.00000:00:00.0000.502002
4440150.0190155.2617191257.359120516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115None70409639.4140628507.996060Madison Avenue BIDManhattan2008-11-192016-10-2664NoneNone00:00:00.00000:00:00.0000.000000
# Final tract to NYC Business Districts distances in feet. In each row, the column "busi_distr_dist" gives the distance from the respective tract to the nearest business district
sdf_tract_busi_distrs_dist_lyr_new = sdf_tract_busi_distrs_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_busi_distrs_dist_lyr_new['busi_distr_dist'] = round(sdf_tract_busi_distrs_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_busi_distrs_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  busi_distr_dist
2095  36005000100     1.308636          6909.60
2059  36005000200     1.292505          6824.43
2067  36005000400     1.596395          8428.97
1668  36005001600     1.237620          6534.63
2127  36005001900     0.510611          2696.02

Importing Borough Info for Each Tract

# Name of the borough, inside which the tracts are located 
ny_tract_boro = gis.content.search('NYCTractBorough owner:api_data_owner', 'feature layer')[0]
ny_tract_boro_lyr = ny_tract_boro.layers[0]
sdf_ny_tract_boro_lyr = pd.DataFrame.spatial.from_layer(ny_tract_boro_lyr)
sdf_ny_tract_boro_lyr_new = sdf_ny_tract_boro_lyr[['geoid', 'boro_name']]
sdf_ny_tract_boro_lyr_new.sort_values('geoid').head()
         geoid  boro_name
0  36005000100      Bronx
2  36005000200      Bronx
5  36005000400      Bronx
7  36005001600      Bronx
9  36005001900      Bronx

Merging all the feature data sets estimated above

tract_merge_dist = sdf_tract_hotel_dist_lyr_new.merge(sdf_tract_subwy_rt_dist_lyr_new,
                                                           on='From_geoid').merge(sdf_tract_railroad_dist_lyr_new,
                                                           on='From_geoid').merge(sdf_tract_subwy_stn_dist_lyr_new,
                                                           on='From_geoid').merge(sdf_tract_busi_distrs_dist_lyr_new,
                                                           on='From_geoid').merge(sdf_tract_cbd_dist_lyr_new, on='From_geoid')
tract_merge_dist_new = tract_merge_dist[['From_geoid',
                                         'hotel_dist',
                                         'subwy_rt_dist',
                                         'railroad_dist',
                                         'subwy_stn_dist',
                                         'busi_distr_dist',
                                         'cbd_dist']]
tract_merge_dist_new = tract_merge_dist_new.rename(columns={'From_geoid':'geoid'})
tract_merge_dist_new.sort_values('geoid').head()
            geoid  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist
2095  36005000100     5571.75         4780.0        2128.12         4996.07          6909.60  26396.02
2059  36005000200     5486.44         5854.0        1137.29         5851.15          6824.43  36212.95
2067  36005000400     2495.67         6294.0        3741.15         6291.15          8428.97  38659.77
1668  36005001600     3093.96         3824.0        3244.59         3852.61          6534.63  39734.83
2127  36005001900        0.00           15.0           0.00          422.73          2696.02  22881.35
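The chained merge above joins one distance table per city feature on the shared tract identifier. As a minimal sketch of the same pattern, assuming three toy tables populated with a few values from the outputs above (the variable names are hypothetical), the chain can also be written with `functools.reduce`. Selecting only the key and the distance column before merging keeps repeated helper columns such as Total_Miles from colliding, and `validate="one_to_one"` makes pandas raise an error if a tract appears more than once in any table.

```python
from functools import reduce

import pandas as pd

# Toy per-feature distance tables keyed by tract id (values copied from the outputs above)
hotel = pd.DataFrame({"From_geoid": ["36005000100", "36005000200"],
                      "hotel_dist": [5571.75, 5486.44]})
railroad = pd.DataFrame({"From_geoid": ["36005000100", "36005000200"],
                         "railroad_dist": [2128.12, 1137.29]})
cbd = pd.DataFrame({"From_geoid": ["36005000100", "36005000200"],
                    "cbd_dist": [26396.02, 36212.95]})

# Fold the list of tables into one frame, joining on the tract id each time
merged = reduce(
    lambda left, right: left.merge(right, on="From_geoid", validate="one_to_one"),
    [hotel, railroad, cbd],
)
merged = merged.rename(columns={"From_geoid": "geoid"})
print(merged)
```

Adding another distance table then only requires appending it to the list.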
# merging number of bus stop and borough name
tract_merge_dist_new = tract_merge_dist_new.merge(sdf_num_bustops_tracts_lyr_new,
                                                 on='geoid').merge(sdf_ny_tract_boro_lyr_new,
                                                 on='geoid') 
tract_merge_dist_new = tract_merge_dist_new.sort_values('geoid')
tract_merge_dist_new.head()
            geoid  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist  num_bustop boro_name
2095  36005000100     5571.75         4780.0        2128.12         4996.07          6909.60  26396.02           0     Bronx
2059  36005000200     5486.44         5854.0        1137.29         5851.15          6824.43  36212.95           0     Bronx
2067  36005000400     2495.67         6294.0        3741.15         6291.15          8428.97  38659.77           1     Bronx
1668  36005001600     3093.96         3824.0        3244.59         3852.61          6534.63  39734.83           3     Bronx
2127  36005001900        0.00           15.0           0.00          422.73          2696.02  22881.35           2     Bronx
# Accessing the airbnb count for each tract
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract[['geoid','Point_Count']]
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract_new.rename(columns={'Point_Count':'total_airbnb'})
sdf_airbnb_count_by_tract_new.head()
            geoid  total_airbnb
2095  36005000100             0
2059  36005000200             0
2067  36005000400            15
1668  36005001600             1
2127  36005001900            24
# preparing the final distance table with airbnb count by tract
tract_merge_dist_all = sdf_airbnb_count_by_tract_new.merge(tract_merge_dist_new, on='geoid')
tract_merge_dist_all.head()
         geoid  total_airbnb  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist  num_bustop boro_name
0  36005000100             0     5571.75         4780.0        2128.12         4996.07          6909.60  26396.02           0     Bronx
1  36005000200             0     5486.44         5854.0        1137.29         5851.15          6824.43  36212.95           0     Bronx
2  36005000400            15     2495.67         6294.0        3741.15         6291.15          8428.97  38659.77           1     Bronx
3  36005001600             1     3093.96         3824.0        3244.59         3852.61          6534.63  39734.83           3     Bronx
4  36005001900            24        0.00           15.0           0.00          422.73          2696.02  22881.35           2     Bronx
tract_merge_dist_all.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2167 entries, 0 to 2166
Data columns (total 10 columns):
geoid              2167 non-null object
total_airbnb       2167 non-null int64
hotel_dist         2167 non-null float64
subwy_rt_dist      2167 non-null float64
railroad_dist      2167 non-null float64
subwy_stn_dist     2167 non-null float64
busi_distr_dist    2167 non-null float64
cbd_dist           2167 non-null float64
num_bustop         2167 non-null int64
boro_name          2167 non-null object
dtypes: float64(6), int64(2), object(2)
memory usage: 186.2+ KB

The borough column is an important location indicator, so it is converted into numerical (one-hot encoded) variables and included in the feature data

tract_merge_dist_final = pd.get_dummies(tract_merge_dist_all, columns=['boro_name'])
tract_merge_dist_final.head()
         geoid  total_airbnb  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist  num_bustop  boro_name_Bronx  boro_name_Brooklyn  boro_name_Manhattan  boro_name_Queens  boro_name_Staten Island
0  36005000100             0     5571.75         4780.0        2128.12         4996.07          6909.60  26396.02           0                1                   0                    0                 0                        0
1  36005000200             0     5486.44         5854.0        1137.29         5851.15          6824.43  36212.95           0                1                   0                    0                 0                        0
2  36005000400            15     2495.67         6294.0        3741.15         6291.15          8428.97  38659.77           1                1                   0                    0                 0                        0
3  36005001600             1     3093.96         3824.0        3244.59         3852.61          6534.63  39734.83           3                1                   0                    0                 0                        0
4  36005001900            24        0.00           15.0           0.00          422.73          2696.02  22881.35           2                1                   0                    0                 0                        0
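
As a small illustration of the one-hot encoding step, the sketch below runs `pd.get_dummies` on a made-up three-row table (the `toy` frame is hypothetical). Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here on the assumption that 0/1 integers like those shown above are wanted.

```python
import pandas as pd

# Hypothetical toy table with a categorical borough column.
toy = pd.DataFrame({
    'geoid': ['1', '2', '3'],
    'boro_name': ['Bronx', 'Queens', 'Bronx'],
})

# One 0/1 indicator column is created per borough; dtype=int keeps integers
# instead of the boolean columns newer pandas versions produce by default.
encoded = pd.get_dummies(toy, columns=['boro_name'], dtype=int)
print(encoded)

# Each row belongs to exactly one borough, so the indicators sum to 1.
dummy_cols = [c for c in encoded.columns if c.startswith('boro_name_')]
print(encoded[dummy_cols].sum(axis=1).tolist())
```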

Adding the 2019 census data obtained using geoenrichment

The census data is now added to the distance data set above to form the final feature set for the model

sdf_airbnb_count_by_tract_enriched_sorted_new = sdf_airbnb_count_by_tract_enriched_sorted.drop(['AnalysisArea',
                                                                                                'ENRICH_FID',
                                                                                                'HasData',
                                                                                                'ID',
                                                                                                'OBJECTID',
                                                                                                'Point_Count',
                                                                                                'SHAPE',                      
                                                                                                'aggregationMethod',
                                                                                                'aland',
                                                                                                'apportionmentConfidence',
                                                                                                'awater',
                                                                                                'countyfp',
                                                                                                'funcstat',
                                                                                                'intptlat',
                                                                                                'intptlon',
                                                                                                'mtfcc',
                                                                                                'name',
                                                                                                'namelsad',
                                                                                                'populationToPolygonSizeRating',
                                                                                                'sourceCountry',
                                                                                                'statefp','tractce'], axis=1)
sdf_airbnb_count_by_tract_enriched_sorted_new.shape
(2167, 87)
# checking the rows of the table for nan values
row_with_null = sdf_airbnb_count_by_tract_enriched_sorted_new.isnull().any(axis=1)

# printing the row which has nan values
sdf_airbnb_count_by_tract_enriched_sorted_new[row_with_null]
ACS Workers 16+: BicycleACS Workers 16+: BusACS Workers 16+: CarpooledEmployed Population Age 16+ACS Workers 16+: Drove AloneACS Family HouseholdsFemale Pop Age 16+Female 16+:Priv Profit EmplACS Workers 16+: FerryboatFemale 16+:Fed Govt WrkrFemale 16+:Local Govt WrkrFemale 16+:Priv NonprofitFemale 16+:Priv Profit Self EmplFemale 16+:Self-Emp Not IncFemale 16+:State Govt WrkrFemale 16+:Unpaid Family WrkrHHs:Inc at/Above Poverty LevelHHs: Inc Below Poverty LevelEmployed Male Pop Age 16+ACS Workers 16+: MotorcycleACS Median Year Structure Built: HUsMale 16+:Priv Profit EmplMale 16+:Fed Govt WrkrMale 16+Priv NonprofitMale 16+:Priv Profit Self EmplMale 16+:Self-Emp Not IncMale 16+:State Govt WrkrMale 16+:Unpaid Family WrkrACS Workers 16+: Other MeansACS Workers 16+: Public TransportationACS Workers 16+: RailroadACS Workers 16+: StreetcarACS Workers 16+: SubwayACS Workers 16+: TaxicabACS Total HouseholdsACS Workers 16+: WalkedACS Workers Age 16+ACS Wrkrs 16+: Worked at Home2019 American Indian Population2019 Asian Population2019 Baby Boomer Population2019 Black Population2019 Diversity Index2019 Daytime Pop Density2019 Daytime Pop: Residents2019 Daytime Pop: Workers2019 Total Daytime Population2019 Employed Civilian Pop 16+2019 Female Population2019 Generation Alpha Population2019 Population by Generation Base2019 Generation X Population2019 Generation Z Population2019 Male Population2019 Median Age2019 Median Home Value2019 Millennial PopulationOwn any tabletOwn any e-readerHave access to Internet at homeIndex: Spend 0.5-0.9 hrs online(excl email/IM .Spend <0.5 hrs online (excl email/IM time) dailyConstruction Businesses (NAICS)Retail Trade Businesses (NAICS)Transportation/Warehouse Bus (NAICS)2019 Occupation: Office/Admin2019 Occupation: Construction/Extraction2019 Occupation: Food Preperation2019 Occupation: Personal Care2019 Occupation: Production2019 Occupation: Protective Service2019 Silent & Greatest Generations Population2019 Other Race Population2019 Owner 
Occupied HUs2019 Pacific Islander Population2019 Population Density2019 Renter Occupied HUsTotal Businesses (SIC)2019 Total Housing Units2019 Total Population2019 Unemployment Rate2019 Unemployed Population 16+2019 Vacant Housing Units2019 White Population2019 Annual Budget Exp2019 Transportationgeoid
2163NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN36047990100
2165NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN36085990100
# checking total number of nan values
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_new.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()
172

These two tracts are water areas within NYC, which is why they have NaN values; the missing values are filled with zeros

sdf_airbnb_count_by_tract_enriched_sorted_fill = sdf_airbnb_count_by_tract_enriched_sorted_new.fillna(0)

#nan rechecked
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_fill.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()
0
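
The check above relies on `np.isnan`, which only works when every remaining column is numeric. A minimal alternative sketch, using a made-up `toy` frame (hypothetical values and column names), is to use the pandas `isna` method, which also tolerates string columns such as `geoid`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one tract with missing enrichment values.
toy = pd.DataFrame({
    'geoid': ['36047990100', '36005000200'],
    'total_pop': [np.nan, 4638.0],
    'median_age': [np.nan, 35.0],
})

# Rows containing at least one missing value.
rows_with_nan = toy[toy.isna().any(axis=1)]

# Total number of missing cells; no need to drop the string column first.
total_nan = int(toy.isna().sum().sum())

# Fill the gaps with zeros and confirm nothing is left.
filled = toy.fillna(0)
remaining_nan = int(filled.isna().sum().sum())

print(rows_with_nan['geoid'].tolist(), total_nan, remaining_nan)
```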

Merging the distance data with the enriched data

final_df = pd.merge(tract_merge_dist_final,
                    sdf_airbnb_count_by_tract_enriched_sorted_fill,
                    left_on = 'geoid',
                    right_on = 'geoid',
                    how = 'left')

print(final_df.shape)
final_df.head()
(2167, 100)
geoidtotal_airbnbhotel_distsubwy_rt_distrailroad_distsubwy_stn_distbusi_distr_distcbd_distnum_bustopboro_name_Bronxboro_name_Brooklynboro_name_Manhattanboro_name_Queensboro_name_Staten IslandACS Workers 16+: BicycleACS Workers 16+: BusACS Workers 16+: CarpooledEmployed Population Age 16+ACS Workers 16+: Drove AloneACS Family HouseholdsFemale Pop Age 16+Female 16+:Priv Profit EmplACS Workers 16+: FerryboatFemale 16+:Fed Govt WrkrFemale 16+:Local Govt WrkrFemale 16+:Priv NonprofitFemale 16+:Priv Profit Self EmplFemale 16+:Self-Emp Not IncFemale 16+:State Govt WrkrFemale 16+:Unpaid Family WrkrHHs:Inc at/Above Poverty LevelHHs: Inc Below Poverty LevelEmployed Male Pop Age 16+ACS Workers 16+: MotorcycleACS Median Year Structure Built: HUsMale 16+:Priv Profit EmplMale 16+:Fed Govt WrkrMale 16+Priv NonprofitMale 16+:Priv Profit Self EmplMale 16+:Self-Emp Not IncMale 16+:State Govt WrkrMale 16+:Unpaid Family WrkrACS Workers 16+: Other MeansACS Workers 16+: Public TransportationACS Workers 16+: RailroadACS Workers 16+: StreetcarACS Workers 16+: SubwayACS Workers 16+: TaxicabACS Total HouseholdsACS Workers 16+: WalkedACS Workers Age 16+ACS Wrkrs 16+: Worked at Home2019 American Indian Population2019 Asian Population2019 Baby Boomer Population2019 Black Population2019 Diversity Index2019 Daytime Pop Density2019 Daytime Pop: Residents2019 Daytime Pop: Workers2019 Total Daytime Population2019 Employed Civilian Pop 16+2019 Female Population2019 Generation Alpha Population2019 Population by Generation Base2019 Generation X Population2019 Generation Z Population2019 Male Population2019 Median Age2019 Median Home Value2019 Millennial PopulationOwn any tabletOwn any e-readerHave access to Internet at homeIndex: Spend 0.5-0.9 hrs online(excl email/IM .Spend <0.5 hrs online (excl email/IM time) dailyConstruction Businesses (NAICS)Retail Trade Businesses (NAICS)Transportation/Warehouse Bus (NAICS)2019 Occupation: Office/Admin2019 Occupation: Construction/Extraction2019 Occupation: Food 
Preperation2019 Occupation: Personal Care2019 Occupation: Production2019 Occupation: Protective Service2019 Silent & Greatest Generations Population2019 Other Race Population2019 Owner Occupied HUs2019 Pacific Islander Population2019 Population Density2019 Renter Occupied HUsTotal Businesses (SIC)2019 Total Housing Units2019 Total Population2019 Unemployment Rate2019 Unemployed Population 16+2019 Vacant Housing Units2019 White Population2019 Annual Budget Exp2019 Transportation
03600500010005571.754780.02128.124996.076909.6026396.020100000.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0110.0194.0396.06016.081.4475.90.0497.0497.00.0862.00.010471.03162.01599.09609.031.50.05308.00.00.00.00.00.00.00.00.00.00.00.00.00.00.06.02629.00.027.010025.90.010.00.010471.00.00.00.01489.00.00.0
13600500020005486.445854.01137.295851.156824.4336212.950100000.0287.0173.01858.0951.01013.0718.0427.00.027.0101.065.017.049.032.00.01021.0300.01140.00.01962.0758.037.087.042.061.025.00.00.0615.00.00.0328.016.01321.071.01835.09.054.0163.0876.01286.089.95823.52888.0220.03108.01795.02479.0177.04638.0915.01188.02159.035.0437798.01233.01637.0192.03091.067.0250.02.03.03.0257.079.076.032.079.021.0249.01440.0768.015.08690.3667.036.01511.04638.05.3100.076.01404.0106532171.011165486.0
236005000400152495.676294.03741.156291.158428.9738659.7711000010.0427.0224.02917.01257.01497.01349.0794.00.024.0299.0163.023.035.011.00.01696.0239.01568.00.01999.0823.040.0128.030.083.015.00.00.01251.023.00.0801.07.01935.046.02848.053.052.0155.01168.01925.090.46855.23470.0541.04011.02890.03231.0246.06288.01372.01613.03057.035.3392481.01622.02233.0349.04390.092.0215.05.08.02.0432.070.0113.099.045.0276.0267.02067.01384.05.010746.9658.054.02150.06288.05.0153.0108.01662.0166787168.017539648.0
33600500160013093.963824.03244.593852.616534.6339734.833100000.0538.0113.02120.0759.01470.01284.0801.00.018.0329.074.010.052.00.00.01405.0557.0836.00.01973.0525.00.045.050.00.065.00.00.0991.00.00.0453.00.01962.0154.02098.081.054.078.01205.02298.087.830304.33716.01960.05676.02332.03365.0238.05982.01100.01484.02617.035.6415686.01515.01769.0211.03424.070.0247.02.04.02.0378.023.0157.0219.070.0131.0440.01550.0356.00.031938.11675.053.02099.05982.05.2128.068.01649.092988868.09541987.0
436005001900240.0015.00.00422.732696.0222881.3521000012.0142.032.01290.0251.0495.0627.0372.00.04.092.097.010.012.040.00.0687.0282.0663.00.01954.0440.013.053.00.091.031.00.012.0843.023.026.0652.00.0969.076.01250.024.023.032.0266.0642.089.67699.61143.07131.08274.0896.0927.0127.02019.0486.0473.01092.033.0393478.0614.0504.064.01135.088.066.065.039.029.0118.040.039.031.039.00.053.0642.048.01.01878.8547.0380.0690.02019.012.2124.095.0590.025499197.02670728.0
# rechecking nan values of the final dataframe
final_nan_test = final_df.drop('geoid', axis=1)
np.isnan(final_nan_test).sum().sum()
0
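
When joining two tables that should each hold one row per tract, it can help to let pandas verify that assumption. The sketch below uses two made-up frames (`dist` and `census` are hypothetical) and the `validate` and `indicator` options of `pd.merge` to flag duplicate keys and list tracts with no match in the right-hand table.

```python
import pandas as pd

# Hypothetical toy tables keyed by tract id.
dist = pd.DataFrame({'geoid': ['1', '2', '3'], 'cbd_dist': [1.0, 2.0, 3.0]})
census = pd.DataFrame({'geoid': ['1', '2'], 'population': [100, 200]})

# validate='one_to_one' raises an error if either table has duplicate keys;
# indicator=True adds a '_merge' column describing where each row came from.
checked = pd.merge(dist, census, on='geoid', how='left',
                   validate='one_to_one', indicator=True)

# Tracts present in the distance table but missing from the census table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'geoid'].tolist()
print(checked.shape, unmatched)
```

With a left join, unmatched tracts keep their distance values but receive NaN for the census columns, so a non-empty `unmatched` list would explain any NaN values appearing after the merge.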

Model Building

The goal here is to find the factors contributing towards the development of new Airbnb properties in New York City. Thus a model is fitted predicting the number of Airbnb properties per tract with the feature set composed of the distance and demographics characteristics of each tract. Once a good fit is obtained the most important predictors of the model are estimated which is our main ask.

# Creating feature data 
X = final_df.drop(['geoid','total_airbnb'], axis=1)

# Creating target data -- the number of Airbnb listings per tract
y = pd.DataFrame(final_df['total_airbnb'])

Split the data into training and test sets in a 90% to 10% ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 20)

print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

# Converting the target into 1d array
y_train_array = y_train.values.flatten()
y_test_array = y_test.values.flatten() 

print(y_train_array.shape)
print(y_test_array.shape)
(1950, 98)
(1950, 1)
(217, 98)
(217, 1)
(1950,)
(217,)

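As a best practice, the features are scaled before model fitting using the robust scaler, which centres each feature on its median and scales it by its interquartile range, making it less sensitive to outliers. The scaler is fitted on the training data only and then applied to the test data.

scaler = preprocessing.RobustScaler()

# Fit the scaler on the training data only
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns) 
# Reuse the statistics learned from the training data; refitting on the test set would scale it differently
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)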
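
One way to keep the scaling statistics tied to the training data is to wrap the scaler and the model in a scikit-learn pipeline. The following is a minimal sketch on synthetic data: `X_demo`, `y_demo` and the small forest size are assumptions chosen for speed, not the notebook's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the feature matrix and target (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=20)

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse them when transforming new rows.
pipe = make_pipeline(
    RobustScaler(),
    RandomForestRegressor(n_estimators=50, random_state=43))
pipe.fit(X_tr, y_tr)

r2 = pipe.score(X_te, y_te)
print('r-square on held-out rows:', round(r2, 2))
```

The same pipeline object can be passed directly to `cross_val_score`, in which case the scaler is refitted inside each fold.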

RandomForest Regressor Model

The modelling was first attempted with a linear regression, but the linear model failed to fit the data well. It was therefore carried out with a non-linear algorithm, as follows. The user can fit both models to see the improvement of random forest over linear regression.

The accuracy metrics used are the mean absolute error and r-square

# Random forest with scaled data
# for the best parameters a grid search could be done which could take some time
# however this model uses the default parameters of RF algorithm, while the estimators are changed till the best fit is obtained
model_RF = RandomForestRegressor(n_estimators = 500, random_state=43)

# Train the model
model_RF.fit(X_train_scaled, y_train_array)

# Training metrics for Random forest model
print('Training metrics for Random forest model using scaled data')
ypred_RF_train = model_RF.predict(X_train_scaled)
print('r-square_RF_Train: ', round(model_RF.score(X_train_scaled, y_train_array), 2))

mse_RF_train = metrics.mean_squared_error(y_train_array, ypred_RF_train)  
print('RMSE_RF_train: ', round(np.sqrt(mse_RF_train),4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(y_train_array, ypred_RF_train)
print('MAE_RF_train: ', round(mean_absolute_error_RF_train, 4)) 

# Test metrics for Random Forest model
print('\nTest metrics for Random Forest model scaled data')
ypred_RF_test = model_RF.predict(X_test_scaled)
print('r-square_RF_test: ', round(model_RF.score(X_test_scaled, y_test_array), 2))

mse_RF_test = metrics.mean_squared_error(y_test_array, ypred_RF_test) 
print('RMSE_RF_test: ', round(np.sqrt(mse_RF_test), 4))

mean_absolute_error_RF_test = metrics.mean_absolute_error(y_test_array, ypred_RF_test)
print('MAE_RF_test: ', round(mean_absolute_error_RF_test, 4))
Training metrics for Random forest model using scaled data
r-square_RF_Train:  0.97
RMSE_RF_train:  7.1627
MAE_RF_train:  3.5437

Test metrics for Random Forest model scaled data
r-square_RF_test:  0.85
RMSE_RF_test:  18.2192
MAE_RF_test:  9.2817

The results show that the model achieves an r-square of 0.85 and a mean absolute error of 9.28 on the test set, compared with 0.97 and 3.54 on the training set
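
The comparison between a linear baseline and a random forest mentioned above could be checked along the following lines. This is a sketch on synthetic, deliberately non-linear data: the helper `regression_report` and the arrays `X_demo` and `y_demo` are hypothetical, so the numbers it prints say nothing about the Airbnb data itself.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def regression_report(model, X, y):
    """Return r-square, RMSE and MAE of a fitted model on (X, y)."""
    pred = model.predict(X)
    return {
        'r2': metrics.r2_score(y, pred),
        'rmse': float(np.sqrt(metrics.mean_squared_error(y, pred))),
        'mae': metrics.mean_absolute_error(y, pred),
    }


# Synthetic target with a purely non-linear dependence on the features.
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=20)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=43).fit(X_tr, y_tr)

report_linear = regression_report(linear, X_te, y_te)
report_forest = regression_report(forest, X_te, y_te)
print('linear:', report_linear)
print('forest:', report_forest)
```

Collecting the three metrics in one helper also avoids repeating the same print statements for every model and split.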

Feature importance for the RF model

feature_imp_RF = model_RF.feature_importances_

#relative feature importance  
rel_feature_imp = 100 * (feature_imp_RF / max(feature_imp_RF)) 
rel_feature_imp = pd.DataFrame({'features':list(X_train.columns),
                                'rel_importance':rel_feature_imp })

rel_feature_imp = rel_feature_imp.sort_values('rel_importance', ascending=False)


#plotting the top twenty important features
top20_features = rel_feature_imp.head(20) 

plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features",
                 data=top20_features,
                 palette="Accent_r")

plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()
<Figure size 1440x720 with 1 Axes>
rel_feature_imp.head()
featuresrel_importance
5cbd_dist100.000000
682019 Millennial Population44.053143
12ACS Workers 16+: Bicycle18.573569
612019 Generation Alpha Population7.116598
44ACS Workers 16+: Subway4.931776

The feature importance plot reveals that distance from the city centre (cbd_dist) is the most important predictor of the number of Airbnb listings in NYC. This is expected: hotel rates near the CBD are quite high, so rental income from Airbnb properties there is likely to be high as well, which makes setting up an Airbnb property a lucrative option compared with long-term rental income in areas near the CBD.

The second most important predictor is the millennial population, that is, tracts with the largest number of people aged roughly 25 to 40. One reason might be that this group is more active online and comfortable with internet technologies, which is in a way a prerequisite for setting up Airbnb properties. This is supported by the presence of another interesting predictor in the top twenty: spending 0.5-0.9 hours online daily.

The third most important predictor is the number of workers who commute by bicycle, followed by the generation alpha population (people born after 2011), then by the number of people commuting by subway, and so on. The median home value of the tracts is also an interesting predictor.
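
The ranking above comes from the impurity-based `feature_importances_` attribute. As a cross-check, permutation importance on held-out data is a commonly used alternative; it is named here as a different technique, not the one used in this notebook. The sketch below runs it on synthetic data in which only a made-up `cbd_dist` column carries signal (all column names and values are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target decays with distance, the other columns are noise.
rng = np.random.default_rng(2)
X_demo = pd.DataFrame({
    'cbd_dist': rng.uniform(0, 40000, 400),
    'noise_a': rng.normal(size=400),
    'noise_b': rng.normal(size=400),
})
y_demo = 200 * np.exp(-X_demo['cbd_dist'] / 10000) + rng.normal(scale=2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=20)

model = RandomForestRegressor(n_estimators=100, random_state=43).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out rows and record the score drop.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranked = pd.Series(result.importances_mean, index=X_te.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

If both methods place the same features at the top, the interpretation is less likely to be an artefact of how tree impurity is accumulated.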

Gradient Boosting Regressor Model

Trials show that the gradient boosting model performs better with non-scaled data

# GradientBoosting with non scaled data
# this model uses the default parameters of GB algorithm, while the estimators are changed to obtain the best fit 
model_GB_nonscale = GradientBoostingRegressor(n_estimators=500, random_state=60)

# Train the model
model_GB_nonscale.fit(X_train, y_train_array)

# Training metrics for Gradient Boosting Regressor model
print('Training metrics for Gradient Boosting Regressor model using non-scaled data')

ypred_GB_train = model_GB_nonscale.predict(X_train)
print('r-square_GB_Train: ', round(model_GB_nonscale.score(X_train, y_train_array), 2))

mse_GB_train = metrics.mean_squared_error(y_train_array, ypred_GB_train)
print('RMSE_GB_Train: ', round(np.sqrt(mse_GB_train), 4))

mean_absolute_error_GB_train = metrics.mean_absolute_error(y_train_array, ypred_GB_train)
print('MAE_GB_Train: ', round(mean_absolute_error_GB_train, 4))

# Test metrics for Gradient Boosting Regressor model
print('\nTest metrics for Gradient Boosting Regressor model using non-scaled data')

ypred_GB_test = model_GB_nonscale.predict(X_test)
print('r-square_GB_Test: ', round(model_GB_nonscale.score(X_test, y_test_array),2))

mse_GB_test = metrics.mean_squared_error(y_test_array, ypred_GB_test)  
print('RMSE_GB_Test: ', round(np.sqrt(mse_GB_test),4))

mean_absolute_error_GB_test = metrics.mean_absolute_error(y_test_array, ypred_GB_test)
print('MAE_GB_Test: ', round(mean_absolute_error_GB_test, 4))
Training metrics for Gradient Boosting Regressor model using non-scaled data
r-square_GB_Train:  0.99
RMSE_GB_Train:  3.1854
MAE_GB_Train:  2.3426

Test metrics for Gradient Boosting Regressor model using non-scaled data
r-square_GB_Test:  0.88
RMSE_GB_Test:  16.2517
MAE_GB_Test:  8.6577

The results show that the gradient boosting regressor performs slightly better than the random forest model on the test set, in terms of both mean absolute error (8.66 versus 9.28) and r-square (0.88 versus 0.85).

Feature Importance of Gradient Boosting Model

#checking the feature importance for the Gradient Boosting regressor
feature_imp_GB = model_GB_nonscale.feature_importances_
rel_feature_imp_GB = 100 * feature_imp_GB / max(feature_imp_GB)
rel_feature_imp_GB = pd.DataFrame({'features':list(X_train.columns),
                                   'rel_importance':rel_feature_imp_GB})
rel_feature_imp_GB = rel_feature_imp_GB.sort_values('rel_importance', ascending=False)
rel_feature_imp_GB.head()
featuresrel_importance
5cbd_dist100.000000
682019 Millennial Population46.725049
12ACS Workers 16+: Bicycle31.834925
44ACS Workers 16+: Subway15.586322
612019 Generation Alpha Population12.891210
# Plot  feature importance for the Gradient Boosting regressor
top20_features_GB = rel_feature_imp_GB.head(20) 

plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features", data = top20_features_GB, palette="Accent_r")
plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()
<Figure size 1440x720 with 1 Axes>

The feature importances returned by the gradient boosting model are almost identical to those returned by the random forest model, which is expected: the top three predictors are the same, and only the fourth and fifth swap places.

Running cross validation

The models above were fitted and evaluated on one particular train-test split of the data, so how the accuracy varies across different splits remains to be seen. This is checked using k-fold cross validation, which divides the data into k folds and holds out each fold in turn for testing while the model is fitted on the remaining folds. A 10-fold cross validation is run here, and the overall accuracy is measured as the mean absolute error averaged across the 10 splits.

# Validating with a 10 fold cross validation for the Gradient Boosting models
y_array = y.values.flatten()

modelGB_cross_val = GradientBoostingRegressor(n_estimators=500, random_state=60) 

modelGB_cross_val_scores = cross_val_score(modelGB_cross_val,
                                           X, 
                                           y_array,
                                           cv=10,
                                           scoring='neg_mean_absolute_error')

print("All Model Scores: ", modelGB_cross_val_scores)

print("Negative Mean Absolute Error: {}".format(np.mean(modelGB_cross_val_scores)))
All Model Scores:  [ -9.028579 -11.43918   -9.866992 -23.537995  -5.697105 -39.70685  -15.560179  -9.992469  -4.796929  -5.15258 ]
Negative Mean Absolute Error: -13.477885926281335
# Validating with a 10 fold cross validation for the Random forest models
y_array = y.values.flatten()

modelRF_cross_val = RandomForestRegressor(n_estimators=500, random_state=43)

modelRF_cross_val_scores = cross_val_score(modelRF_cross_val,
                                           X, 
                                           y_array,
                                           cv=10,
                                           scoring='neg_mean_absolute_error')

print("All Model Scores: ", modelRF_cross_val_scores)

print("Negative Mean Absolute Error: {}".format(np.mean(modelRF_cross_val_scores)))
All Model Scores:  [-11.675733  -9.465871 -11.120866 -22.958313  -4.940608 -37.30388  -18.044691 -11.955269  -5.43287   -4.168185]
Negative Mean Absolute Error: -13.706628720771466
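
The fold scores above vary widely (from roughly 4 to nearly 40). With an integer `cv`, `cross_val_score` uses unshuffled k-fold splits for regressors, so if the rows are ordered by `geoid`, each fold is a block of neighbouring tracts; this may be one reason for the spread, though that is an assumption rather than something verified here. The sketch below shows, on synthetic data (`X_demo` and `y_demo` are hypothetical), how shuffled folds and a mean and standard deviation summary could be obtained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

# Shuffle the rows before splitting so folds are not contiguous blocks.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(
    GradientBoostingRegressor(n_estimators=50, random_state=60),
    X_demo, y_demo, cv=cv, scoring='neg_mean_absolute_error')

# The scorer returns negated errors; flip the sign to report MAE directly.
mae_per_fold = -scores
print('MAE per fold:', np.round(mae_per_fold, 3))
print('Mean MAE: {:.3f} (std {:.3f})'.format(mae_per_fold.mean(), mae_per_fold.std()))
```

Reporting the standard deviation alongside the mean makes it easier to judge whether the difference between two models is larger than the fold-to-fold variation.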

Final Result Visualization

# Plotting a kernel density plot of the predicted vs. observed data
plt.figure(figsize=[15,5])

# plotting the prediction
sns.kdeplot(ypred_RF_test, label = 'Predictions', color = 'orange')
y_observed = np.array(y_test).reshape((-1, ))
sns.kdeplot(y_observed, label = 'Observation', color = 'green')

# label the plot
plt.xlabel('No. of Airbnb listings per census tract', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.title('Density Plot: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,500,25), fontsize=10)
plt.yticks(fontsize=10)
plt.legend(fontsize=15)
plt.show()
<Figure size 1080x360 with 1 Axes>
# Converting the predicted and observed values to dataframe and plotting the observed vs predicted
y_test_df = y_test.copy()
y_test_df['Predicted'] = (ypred_RF_test)  
y_test_df.head()
total_airbnbPredicted
2043.940
91064.186
68524.908
45012.236
104428.422
# plotting the actual observed vs predicted airbnb properties by tract
plt.figure(figsize = [25,12])
sns.set(style = 'whitegrid')
sns.lineplot(data = y_test_df, markers=True) 

#label the plot
plt.xlabel('Tract ID', fontsize=15)
plt.ylabel('Total No. of Airbnb', fontsize=15)
plt.title('Actual No. of Airbnb by Tract: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,2000,100), fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
<Figure size 1800x864 with 1 Axes>

The plot shows that the predicted values closely match the observed values. However, the model underpredicts for tracts with an extremely high number of Airbnb properties and overpredicts for some tracts with a comparatively low number of Airbnb properties.

Conclusion

The study shows that distance from the CBD is the most important factor stimulating the creation of Airbnb properties.

The proximity tools from the ArcGIS API for Python were used for all the distance estimation. The other factors returned by the feature importance results could be examined individually. Another capability of Esri utilized in the study is its data repository, accessed here via the geoenrichment services, which can provide the analyst with a wide array of data for critical analysis. Further analysis of this dataset will be carried out in the next study.

Summary of methods used

| Method | Question | Examples |
|---|---|---|
| aggregate_points | How many points are within each polygon? | Counting the number of Airbnb rentals within each NYC tract |
| Data Enrichment | Which demographic attributes are relevant for the problem? | Population of millennials for each tract |
| find_nearest | Which distances from city features are relevant for the problem? | Distance of the CBD from each tract |

Data resources

| Shapefile | Source | Link |
|---|---|---|
| airbnb_nyc2019 | NYC Airbnb Data, Inside Airbnb: Get the Data | http://insideairbnb.com/get-the-data.html |
| nyc_tract_fulll | NYC Open Data: 2010 Census Tracts (water areas included) | https://data.cityofnewyork.us/City-Government/2010-Census-Tracts-water-areas-included-/gx7x-82rk |
| busi_distr | NYC Open Data: Business Improvement Districts | https://data.cityofnewyork.us/Business/Business-Improvement-Districts/ejxk-d93y |
| cbd | NYC Open Data: Business Improvement Districts | https://data.cityofnewyork.us/Business/Business-Improvement-Districts/ejxk-d93y |
| bus_stop | NYC Open Data: Bus Stop Shelters | https://data.cityofnewyork.us/Transportation/Bus-Stop-Shelters/qafz-7myz |
| hotels | NYC Open Data: Facilities Database | https://data.cityofnewyork.us/City-Government/Facilities-Database-Shapefile/2fpa-bnsx |
| railroad | NYC Open Data: Railroad Line | https://data.cityofnewyork.us/Transportation/Railroad-Line/i7a5-bsik |
| subwy_rt | NYC Open Data: Subway Lines | https://data.cityofnewyork.us/Transportation/Subway-Lines/3qz8-muuu |
| subwy_stn | NYC Open Data: Subway Stations | https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49 |
