Analysing the factors of growth and spatial distribution of Airbnb properties across New York City

Introduction

Airbnb offers travellers a comparatively cheap alternative to conventional accommodation in cities, and gives homeowners an opportunity to use spare or unused rooms as an additional source of income. In recent times, however, the rapid spread of Airbnb properties has become a topic of debate among the public and city authorities across the world.

Against this background, this sample notebook studies the factors that are fuelling the widespread growth in the number of Airbnb listings. These may include the location characteristics of the neighbourhoods concerned (in this case, NYC census tracts) as well as qualitative information about their inhabitants. The goal is to help city planners deal with the negative externalities of the Airbnb phenomenon (and similar short-term rentals) by making informed decisions when framing suitable policies.

The primary data is downloaded from the Airbnb website for the city of New York. Other data includes 2019 and 2017 census data obtained through Esri's enrichment services, and various other datasets from the NYCOpenData portal.

Note: This notebook requires pillow version 9.0.0, seaborn and scikit-learn to be installed.

Necessary Imports

If you already have pillow version 9.0.0, seaborn and scikit-learn installed, you can skip the next three cells.

pip install pillow==9.0.0
pip install seaborn
pip install scikit-learn
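
Before running the install cells, it can help to check which of the three packages are already present. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustrative assumption and is not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI.
for name, ver in installed_versions(["pillow", "seaborn", "scikit-learn"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

A package reported as "not installed", or pillow at a version other than 9.0.0, suggests that the corresponding install cell should be run.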
%matplotlib inline
import matplotlib.pyplot as plt


from datetime import datetime as dt
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from IPython.core.pylabtools import figsize
import seaborn as sns


# Machine Learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.metrics as metrics
from sklearn import preprocessing

# Arcgis api imports
import arcgis
from arcgis.geoenrichment import Country
from arcgis.features import summarize_data
from arcgis.features.enrich_data import enrich_layer
from arcgis.features import use_proximity 
from arcgis.gis import GIS
gis = GIS(profile='your_online_profile')

Access the NYC Airbnb and Tracts dataset

Airbnb Data - It contains information about 48,000 Airbnb properties available in New York as of 2019. This includes the location of each property, its neighbourhood characteristics and available transit facilities, information about the owner, details of the room such as the number of bedrooms, and the rental price per night.

NYC Tracts - It is a polygon shapefile consisting of 2167 tracts of New York City, including the area of each tract along with a unique id for each tract.

# Accessing NYCTracts
nyc_tract_full = gis.content.search('NYCTractData owner:api_data_owner', 'feature layer')[0]
nyc_tract_full
NYCTractData
Feature Layer Collection by api_data_owner
Last Modified: August 14, 2019
0 comments, 558 views
nyc_tracts_layer = nyc_tract_full.layers[0]
# Accessing airbnb NYC
airbnb_nyc2019 = gis.content.search('AnBNYC2019 owner:api_data_owner', 'feature layer')[0]
airbnb_nyc2019
AnBNYC2019
Feature Layer Collection by api_data_owner
Last Modified: September 30, 2019
0 comments, 133 views
airbnb_layer = airbnb_nyc2019.layers[0]

Visualizing dataset

# NYC Tracts
m1 = gis.map('New York City')
m1.add_layer(nyc_tracts_layer)
m1
# NYC Airbnb Properties
m = gis.map('Springfield Gardens, NY')
m.add_layer(airbnb_layer)
m
# Extracting the layer as a spatially enabled dataframe and displaying its first rows
pd.set_option('display.max_columns', 110)
sdf_airbnb_layer = pd.DataFrame.spatial.from_layer(airbnb_layer)
sdf_airbnb_layer.head(2)
(Output abridged: first two rows of the Airbnb dataframe; only a few of the columns are shown.)

   last_scrap  name                                 latitude  longitude  property_t  room_type
0  2019-06-03  Park Slope Apt:, Spacious 2 bedroom  40.67644  -73.98082  Apartment   Entire home/apt
1  2019-06-03  NYC Studio for Rent in Townhouse     40.80481  -73.94794  Apartment   Entire home/apt

Aggregating number of Airbnb properties by Tracts for NYC

The number of Airbnb properties per tract is estimated using the polygon tract layer and the Airbnb point layer.

The Aggregate Points tool uses area features to summarize a set of point features. The boundaries from the area feature are used to collect the points within each area and use them to calculate statistics. The resulting layer displays the count of points within each area. Here, the polygon tract layer is used as the area feature, and the Airbnb point layer is used as the point feature.
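
To make the idea concrete, the following is a plain-Python sketch of what counting points per polygon involves. It is not how the Aggregate Points tool is implemented; it uses a standard ray-casting test on hypothetical toy data, and the function names are illustrative assumptions.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_polygon(points, polygons):
    """Return {polygon id: number of points falling inside it}."""
    counts = {pid: 0 for pid in polygons}
    for x, y in points:
        for pid, ring in polygons.items():
            if point_in_polygon(x, y, ring):
                counts[pid] += 1
                break  # assumes the polygons do not overlap
    return counts


# Hypothetical toy data: two unit-square "tracts" and four "listings".
tracts = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
listings = [(0.2, 0.5), (0.7, 0.1), (1.5, 0.5), (3.0, 3.0)]
print(count_points_per_polygon(listings, tracts))
```

Points that fall outside every polygon, such as the last listing here, are simply not counted.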

agg_result = summarize_data.aggregate_points(point_layer=airbnb_layer,
                                             polygon_layer=nyc_tracts_layer,
                                             output_name='airbnb_counts'+ str(dt.now().microsecond))
{"cost": 50.968}
agg_result
airbnb_counts131815
Feature Layer Collection by arcgis_python
Last Modified: April 30, 2024
0 comments, 0 views
# mapping the aggregated airbnb data with darker areas showing more airbnb properties per tract
aggr_map = gis.map('NY', zoomlevel=10)
aggr_map.add_layer(agg_result,{"renderer":"ClassedColorRenderer", "field_name": "Point_Count"})
aggr_map
airbnb_count_by_tract = agg_result.layers[0]
sdf_airbnb_count_by_tract = airbnb_count_by_tract.query().sdf
sdf_airbnb_count_by_tract = sdf_airbnb_count_by_tract.sort_values('geoid')
sdf_airbnb_count_by_tract.head()
      OBJECTID  statefp  countyfp  tractce  geoid        name  namelsad         mtfcc  funcstat  aland    awater   intptlat     intptlon      Point_Count  AnalysisArea  SHAPE
2095  2096      36       005       000100   36005000100  1     Census Tract 1   G5020  S         1579361  1125765  +40.7934921  -073.8835318  0            2.705062      {"rings": [[[-8226256.9418, 4982172.581], [-82...
2059  2060      36       005       000200   36005000200  2     Census Tract 2   G5020  S         455322   926899   +40.8045733  -073.8568585  0            1.382228      {"rings": [[[-8222638.612, 4985024.3226], [-82...
2067  2068      36       005       000400   36005000400  4     Census Tract 4   G5020  S         912392   602945   +40.8089152  -073.8504884  15           1.515336      {"rings": [[[-8222012.885, 4985135.2266], [-82...
1668  1669      36       005       001600   36005001600  16    Census Tract 16  G5020  S         485079   0        +40.8188478  -073.8580764  1            0.485076      {"rings": [[[-8222181.7567, 4986069.1354], [-8...
2127  2128      36       005       001900   36005001900  19    Census Tract 19  G5020  S         1643654  1139660  +40.8009990  -073.9093729  24           2.783331      {"rings": [[[-8230028.8927, 4984061.5402], [-8...

Here, the Point_Count field of the aggregated dataframe gives the number of Airbnb properties per tract. It forms the target variable for this problem.
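
Since Point_Count is the target, a quick look at how it is distributed can be useful before modelling. The sketch below uses only the Point_Count values of the five tracts in the head() output above (0, 0, 15, 1, 24) as stand-in data; the helper name `summarize_counts` is an illustrative assumption, and the same figures could be computed on the full Point_Count column.

```python
from statistics import mean, median


def summarize_counts(counts):
    """Basic descriptive figures for a list of per-tract listing counts."""
    n = len(counts)
    zero = sum(1 for c in counts if c == 0)
    return {
        "tracts": n,
        "total_listings": sum(counts),
        "mean": mean(counts),
        "median": median(counts),
        "share_zero": zero / n,  # fraction of tracts without any listing
    }


# Point_Count values of the five tracts shown in the head() output above.
sample_counts = [0, 0, 15, 1, 24]
print(summarize_counts(sample_counts))
```

A large share of zero-count tracts or a mean well above the median would indicate a skewed target, which is worth keeping in mind when choosing and evaluating models.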

Enriching tracts with demographic data using geoenrichment service from Esri

The feature data is now created from selected demographic information for each tract. This is done by accessing Esri's geoenrichment services, which provide the latest census data. The entire data repository is explored first, and the relevant variables are then chosen on the basis of a literature study. These selected variables are searched for and added to the feature set.

# Displaying the data topics available for geoenrichment for the USA in the Esri database
usa = Country.get('US')
type(usa)
usa_data = usa.data_collections
df_usa_data = pd.DataFrame(usa_data)
df_usa_data.head()
dataCollectionID  analysisVariable         alias                   fieldCategory                       vintage
1yearincrements   1yearincrements.AGE0_CY  2023 Population Age <1  2023 Age: 1 Year Increments (Esri)  2023
1yearincrements   1yearincrements.AGE1_CY  2023 Population Age 1   2023 Age: 1 Year Increments (Esri)  2023
1yearincrements   1yearincrements.AGE2_CY  2023 Population Age 2   2023 Age: 1 Year Increments (Esri)  2023
1yearincrements   1yearincrements.AGE3_CY  2023 Population Age 3   2023 Age: 1 Year Increments (Esri)  2023
1yearincrements   1yearincrements.AGE4_CY  2023 Population Age 4   2023 Age: 1 Year Increments (Esri)  2023

Next, all the data topics available in the geoenrichment services are listed.

# Listing the unique topics under dataCollectionID
df_usa_data.reset_index(inplace=True)
list(df_usa_data.dataCollectionID.unique())
['1yearincrements',
 '5yearincrements',
 'Age',
 'agebyracebysex',
 'agebyracebysex2010',
 'agebyracebysex2020',
 'AgeDependency',
 'AtRisk',
 'AutomobilesAutomotiveProducts',
 'BabyProductsToysGames',
 'basicFactsForMobileApps',
 'businesses',
 'CivicActivitiesPoliticalAffiliation',
 'classofworker',
 'clothing',
 'ClothingShoesAccessories',
 'commute',
 'crime',
 'DaytimePopulation',
 'disability',
 'disposableincome',
 'DniRates',
 'education',
 'educationalattainment',
 'ElectronicsInternet',
 'employees',
 'EmploymentUnemployment',
 'entertainment',
 'financial',
 'FinancialInsurance',
 'food',
 'foodstampsSNAP',
 'gender',
 'Generations',
 'GroceryAlcoholicBeverages',
 'groupquarters',
 'Health',
 'healthinsurancecoverage',
 'HealthPersonalCare',
 'HealthPersonalCareCEX',
 'heatingfuel',
 'hispanicorigin',
 'HistoricalHouseholds',
 'HistoricalHousing',
 'HistoricalPopulation',
 'HomeImprovementGardenLawn',
 'homevalue',
 'HouseholdGoodsFurnitureAppliances',
 'householdincome',
 'households',
 'householdsbyageofhouseholder',
 'HouseholdsByIncome',
 'householdsbyraceofhouseholder',
 'householdsbysize',
 'householdtotals',
 'householdtype',
 'housingbyageofhouseholder',
 'housingbyraceofhouseholder',
 'housingbysize',
 'housingcosts',
 'HousingHousehold',
 'housingunittotals',
 'incomebyage',
 'industry',
 'industrybynaicscode',
 'industrybysiccode',
 'InternetComputerUsage',
 'KeyGlobalFacts',
 'KeyUSFacts',
 'language',
 'LeisureActivitiesLifestyle',
 'LifeInsurancePensions',
 'lifemodegroupsNEW',
 'maritalstatustotals',
 'MediaMagazinesNewspapers',
 'MediaRadioOtherAudio',
 'MediaTVViewing',
 'miscellaneous',
 'networth',
 'NonHispanicOrigin',
 'occupation',
 'OwnerRenter',
 'PetsPetProducts',
 'PhonesYellowPages',
 'Policy',
 'population',
 'populationtotals',
 'presenceofchildren',
 'PsychographicsAdvertising',
 'PsychographicsFood',
 'PsychographicsLifestyle',
 'PsychographicsMedia',
 'PsychographicsShopping',
 'RaceAndEthnicity',
 'raceandhispanicorigin',
 'restaurants',
 'RetailDemandbyNAICS',
 'sales',
 'schoolenrollment',
 'shopping',
 'spendingFactsForMobileApps',
 'SpendingTotal',
 'sports',
 'tapestryadultsNEW',
 'tapestryhouseholdsNEW',
 'TapestryNEW',
 'transportation',
 'TravelCEX',
 'travelMPI',
 'unitsinstructure',
 'urbanizationgroupsNEW',
 'vacant',
 'vehiclesavailable',
 'veterans',
 'Wealth',
 'women',
 'yearbuilt',
 'yearmovedin']

Items can be searched by the alias field to find the related analysis variable name. As an example, variables whose alias contains 'Nonprofit' are searched here. From these results, the relevant 'Nonprofit' variables are then selected.

df_usa_data[df_usa_data['alias'].str.contains('Nonprofit')]                        
      dataCollectionID  analysisVariable          alias                                              fieldCategory                    vintage
3840  classofworker     classofworker.ACSMPRIVNP  2021 Civ Male 16+:Priv Nonprofit (ACS 5-Yr)        2017-2021 Class of Worker (ACS)  2017-2021
3841  classofworker     classofworker.MOEMPRIVNP  2021 Civ Male 16+:Priv Nonprofit MOE (ACS 5-Yr)    2017-2021 Class of Worker (ACS)  2017-2021
3848  classofworker     classofworker.RELMPRIVNP  2021 Civ Male 16+:Priv Nonprofit REL (ACS 5-Yr)    2017-2021 Class of Worker (ACS)  2017-2021
3873  classofworker     classofworker.ACSFPRIVNP  2021 Civ Female 16+:Priv Nonprofit (ACS 5-Yr)      2017-2021 Class of Worker (ACS)  2017-2021
3874  classofworker     classofworker.MOEFPRIVNP  2021 Civ Female 16+:Priv Nonprofit MOE (ACS 5-Yr)  2017-2021 Class of Worker (ACS)  2017-2021
3875  classofworker     classofworker.RELFPRIVNP  2021 Civ Female 16+:Priv Nonprofit REL (ACS 5-Yr)  2017-2021 Class of Worker (ACS)  2017-2021

Adding data using enrichment - At this stage, a literature study is undertaken to narrow down the various factors that might influence the opening of new Airbnb properties in NYC.

These factors are then identified in the USA geoenrichment database as shown above. Their variable names are compiled in a dictionary so that they can be passed to the enrichment tool.

enrichment_variables = {'classofworker.ACSCIVEMP':      'Employed Population Age 16+',
 'classofworker.ACSMCIVEMP':                      'Employed Male Pop Age 16+',
 'classofworker.ACSMPRIVNP':                      'Male 16+Priv Nonprofit',
 'classofworker.ACSMEPRIVP':                      'Male 16+:Priv Profit Empl',
 'classofworker.ACSMSELFI':                       'Male 16+:Priv Profit Self Empl',
 'classofworker.ACSMSTGOV':                       'Male 16+:State Govt Wrkr',
 'classofworker.ACSMFEDGOV':                      'Male 16+:Fed Govt Wrkr',
 'classofworker.ACSMSELFNI':                      'Male 16+:Self-Emp Not Inc',
 'classofworker.ACSMUNPDFM':                      'Male 16+:Unpaid Family Wrkr',              
 'classofworker.ACSFCIVEMP':                      'Female Pop Age 16+',
 'classofworker.ACSFEPRIVP':                      'Female 16+:Priv Profit Empl',
 'classofworker.ACSFSELFI':                       'Female 16+:Priv Profit Self Empl',                      
 'classofworker.ACSFPRIVNP':                      'Female 16+:Priv Nonprofit',
 'classofworker.ACSFLOCGOV':                      'Female 16+:Local Govt Wrkr',
 'classofworker.ACSFSTGOV':                       'Female 16+:State Govt Wrkr',
 'classofworker.ACSFFEDGOV':                      'Female 16+:Fed Govt Wrkr',                      
 'classofworker.ACSFSELFNI':                      'Female 16+:Self-Emp Not Inc',                      
 'classofworker.ACSFUNPDFM':                      'Female 16+:Unpaid Family Wrkr',                      
 'gender.MEDAGE_CY':                              '2019 Median Age',
 'Generations.GENALPHACY':                        '2019 Generation Alpha Population',
 'Generations.GENZ_CY':                           '2019 Generation Z Population',
 'Generations.MILLENN_CY':                        '2019 Millennial Population',
 'Generations.GENX_CY':                           '2019 Generation X Population',
 'Generations.BABYBOOMCY':                        '2019 Baby Boomer Population',
 'Generations.OLDRGENSCY':                        '2019 Silent & Greatest Generations Population',
 'Generations.GENBASE_CY':                        '2019 Population by Generation Base',
 'populationtotals.POPDENS_CY':                   '2019 Population Density',
 'DaytimePopulation.DPOP_CY':                     '2019 Total Daytime Population',
 'raceandhispanicorigin.WHITE_CY':                '2019 White Population',
 'raceandhispanicorigin.BLACK_CY':                '2019 Black Population',
 'raceandhispanicorigin.AMERIND_CY':              '2019 American Indian Population',
 'raceandhispanicorigin.ASIAN_CY':                '2019 Asian Population',
 'raceandhispanicorigin.PACIFIC_CY':              '2019 Pacific Islander Population',
 'raceandhispanicorigin.OTHRACE_CY':              '2019 Other Race Population',
 'raceandhispanicorigin.DIVINDX_CY':              '2019 Diversity Index',
 'households.ACSHHBPOV':                          'HHs: Inc Below Poverty Level',
 'households.ACSHHAPOV':                          'HHs:Inc at/Above Poverty Level',
 'households.ACSFAMHH':                           'ACS Family Households',
 'businesses.S01_BUS':                            'Total Businesses (SIC)',
 'businesses.N05_BUS':                            'Construction Businesses (NAICS)',
 'businesses.N08_BUS':                            'Retail Trade Businesses (NAICS)',
 'businesses.N21_BUS':                            'Transportation/Warehouse Bus (NAICS)',
 'ElectronicsInternet.MP09147a_B':                'Own any tablet',
 'ElectronicsInternet.MP09148a_B':                'Own any e-reader',
 'ElectronicsInternet.MP19001a_B':                'Have access to Internet at home',                
 'ElectronicsInternet.MP19070a_I':                'Index: Spend 0.5-0.9 hrs online(excl email/IM .',               
 'ElectronicsInternet.MP19071a_B':                'Spend <0.5 hrs online (excl email/IM time) daily',
 'populationtotals.TOTPOP_CY':                    '2019 Total Population',              
 'gender.MALES_CY':                               '2019 Male Population',
 'gender.FEMALES_CY':                             '2019 Female Population',
 'industry.EMP_CY':                               '2019 Employed Civilian Pop 16+',
 'industry.UNEMP_CY':                             '2019 Unemployed Population 16+',                     
 'industry.UNEMPRT_CY':                           '2019 Unemployment Rate',
 'commute.ACSWORKERS':                            'ACS Workers Age 16+',
 'commute.ACSDRALONE':                            'ACS Workers 16+: Drove Alone',
 'commute.ACSCARPOOL':                            'ACS Workers 16+: Carpooled',
 'commute.ACSPUBTRAN':                            'ACS Workers 16+: Public Transportation',
 'commute.ACSBUS':                                'ACS Workers 16+: Bus',
 'commute.ACSSTRTCAR':                            'ACS Workers 16+: Streetcar',
 'commute.ACSSUBWAY':                             'ACS Workers 16+: Subway',
 'commute.ACSRAILRD':                             'ACS Workers 16+: Railroad',
 'commute.ACSFERRY':                              'ACS Workers 16+: Ferryboat',
 'commute.ACSTAXICAB':                            'ACS Workers 16+: Taxicab',           
 'commute.ACSMCYCLE':                             'ACS Workers 16+: Motorcycle',
 'commute.ACSBICYCLE':                            'ACS Workers 16+: Bicycle',                             
 'commute.ACSWALKED':                             'ACS Workers 16+: Walked',
 'commute.ACSOTHTRAN':                            'ACS Workers 16+: Other Means',
 'commute.ACSWRKHOME':                            'ACS Wrkrs 16+: Worked at Home',
 'OwnerRenter.OWNER_CY':                          '2019 Owner Occupied HUs', 
 'OwnerRenter.RENTER_CY':                         '2019 Renter Occupied HUs', 
 'vacant.VACANT_CY':                              '2019 Vacant Housing Units', 
 'homevalue.MEDVAL_CY':                           '2019 Median Home Value',
 'housingunittotals.TOTHU_CY':                    '2019 Total Housing Units',
 'yearbuilt.ACSMEDYBLT':                          'ACS Median Year Structure Built: HUs',
 'SpendingTotal.X1001_X':                         '2019 Annual Budget Exp',
 'transportation.X6001_X':                        '2019 Transportation',
 'households.ACSTOTHH':                           'ACS Total Households',
 'DaytimePopulation.DPOPWRK_CY':                  '2019 Daytime Pop: Workers',
 'DaytimePopulation.DPOPRES_CY':                  '2019 Daytime Pop: Residents',
 'DaytimePopulation.DPOPDENSCY':                  '2019 Daytime Pop Density',
 'occupation.OCCPROT_CY':                         '2019 Occupation: Protective Service',
 'occupation.OCCFOOD_CY':                         '2019 Occupation: Food Preperation',
 'occupation.OCCPERS_CY':                         '2019 Occupation: Personal Care',
 'occupation.OCCADMN_CY':                         '2019 Occupation: Office/Admin',
 'occupation.OCCCONS_CY':                         '2019 Occupation: Construction/Extraction',
 'occupation.OCCPROD_CY':                         '2019 Occupation: Production'
                  }
# Enrichment operation using ArcGIS API for Python 
enrichment_variables_df = pd.DataFrame.from_dict(enrichment_variables, orient='index',columns=['Variable Definition'])
enrichment_variables_df.reset_index(level=0, inplace=True)
enrichment_variables_df.columns = ['AnalysisVariable','Variable Definition']
enrichment_variables_df.head()
   AnalysisVariable          Variable Definition
0  classofworker.ACSCIVEMP   Employed Population Age 16+
1  classofworker.ACSMCIVEMP  Employed Male Pop Age 16+
2  classofworker.ACSMPRIVNP  Male 16+Priv Nonprofit
3  classofworker.ACSMEPRIVP  Male 16+:Priv Profit Empl
4  classofworker.ACSMSELFI   Male 16+:Priv Profit Self Empl
# Converting the variable names to a list for passing them to the enrichment tool
variable_names = enrichment_variables_df['AnalysisVariable'].tolist()

# Checking a few values of the list
variable_names[1:5]
['classofworker.ACSMCIVEMP',
 'classofworker.ACSMPRIVNP',
 'classofworker.ACSMEPRIVP',
 'classofworker.ACSMSELFI']
# Data Enriching operation
airbnb_count_by_tract_enriched = enrich_layer(airbnb_count_by_tract,
                                              analysis_variables = variable_names,
                                              output_name='airbnb_tract_enrich1'+ str(dt.now().microsecond))
{"messageCode": "AO_100047", "message": "Enrichment may not be available for some features."}
{"messageCode": "AO_100288", "message": "Unable to detect the country for one or more features."}
{"messageCode": "AO_100047", "message": "Enrichment may not be available for some features."}
{"messageCode": "AO_100000", "message": "Variables [commute.ACSRAILRD] are not defined for country 'US'."}
{"messageCode": "AO_100000", "message": "Variables [households.ACSFAMHH, households.ACSTOTHH] are not defined for country 'US'."}
{"cost": -1}
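
The messages above report that a few requested variables (commute.ACSRAILRD, households.ACSFAMHH and households.ACSTOTHH) are not defined for the US, so they will be absent from the enriched output. Such gaps could also be detected programmatically by comparing the requested names with the columns that come back. The sketch below assumes, as the enriched output in this notebook suggests, that each returned column is named after the part following the dot; `find_missing_variables` and the short sample lists are illustrative assumptions.

```python
def find_missing_variables(requested, returned_columns):
    """Return the requested 'collection.VARIABLE' names whose VARIABLE part
    does not appear among the returned column names."""
    returned = set(returned_columns)
    return [name for name in requested if name.split(".", 1)[-1] not in returned]


# Small hypothetical sample of requested variables and returned columns.
requested = [
    "commute.ACSSUBWAY",
    "commute.ACSRAILRD",
    "households.ACSFAMHH",
    "households.ACSHHBPOV",
]
returned_columns = ["OBJECTID", "geoid", "ACSSUBWAY", "ACSHHBPOV", "SHAPE"]

print(find_missing_variables(requested, returned_columns))
```

With the real data, the two arguments would be the list of requested variable names and the column names of the enriched dataframe.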
# Extracting the resulting enriched dataframe after the geoenrichment method
sdf_airbnb_count_by_tract_enriched = airbnb_count_by_tract_enriched.layers[0].query().sdf
# Visualizing the data as a pandas dataframe
print(sdf_airbnb_count_by_tract_enriched.columns)
sdf_airbnb_count_by_tract_enriched_sorted = sdf_airbnb_count_by_tract_enriched.sort_values('geoid')
sdf_airbnb_count_by_tract_enriched_sorted.head()
Index(['OBJECTID', 'statefp', 'countyfp', 'tractce', 'geoid', 'name',
       'namelsad', 'mtfcc', 'funcstat', 'aland',
       ...
       'DPOPWRK_CY', 'DPOPRES_CY', 'DPOPDENSCY', 'OCCPROT_CY', 'OCCFOOD_CY',
       'OCCPERS_CY', 'OCCADMN_CY', 'OCCCONS_CY', 'OCCPROD_CY', 'SHAPE'],
      dtype='object', length=106)
(Output abridged: first five rows of the enriched dataframe; only selected columns of the 106 are shown.)

      OBJECTID  geoid        namelsad         Point_Count  AnalysisArea  ENRICH_FID  HasData  ACSCIVEMP  ACSMCIVEMP  ACSMPRIVNP
2095  2096      36005000100  Census Tract 1   0            2.705062      2096        1        0.0        0.0         0.0
2059  2060      36005000200  Census Tract 2   0            1.382228      2060        1        1664.0     1068.0      51.0
2067  2068      36005000400  Census Tract 4   15           1.515336      2068        1        3128.0     1675.0      253.0
1668  1669      36005001600  Census Tract 16  1            0.485076      1669        1        2513.0     959.0       31.0
2127  2128      36005001900  Census Tract 19  24           2.783331      2128        1        1790.0     887.0       144.0

The field names of the enriched dataframe are short codes that need to be spelled out. They are therefore renamed to their full definitions, taken from the variable definitions compiled when the variables were first selected.

enrichment_variables_df.head()
   AnalysisVariable          Variable Definition
0  classofworker.ACSCIVEMP   Employed Population Age 16+
1  classofworker.ACSMCIVEMP  Employed Male Pop Age 16+
2  classofworker.ACSMPRIVNP  Male 16+Priv Nonprofit
3  classofworker.ACSMEPRIVP  Male 16+:Priv Profit Empl
4  classofworker.ACSMSELFI   Male 16+:Priv Profit Self Empl
enrichment_variables_copy = enrichment_variables_df.copy()
enrichment_variables_copy.head(2)
   AnalysisVariable          Variable Definition
0  classofworker.ACSCIVEMP   Employed Population Age 16+
1  classofworker.ACSMCIVEMP  Employed Male Pop Age 16+
enrichment_variables_copy['AnalysisVariable'] = enrichment_variables_copy.AnalysisVariable.str.split(pat='.', expand=True)[1]
enrichment_variables_copy
    AnalysisVariable  Variable Definition
0   ACSCIVEMP         Employed Population Age 16+
1   ACSMCIVEMP        Employed Male Pop Age 16+
2   ACSMPRIVNP        Male 16+Priv Nonprofit
3   ACSMEPRIVP        Male 16+:Priv Profit Empl
4   ACSMSELFI         Male 16+:Priv Profit Self Empl
..  ...               ...
81  OCCFOOD_CY        2019 Occupation: Food Preperation
82  OCCPERS_CY        2019 Occupation: Personal Care
83  OCCADMN_CY        2019 Occupation: Office/Admin
84  OCCCONS_CY        2019 Occupation: Construction/Extraction
85  OCCPROD_CY        2019 Occupation: Production

86 rows × 2 columns

enrichment_variables_copy.set_index("AnalysisVariable", drop=True, inplace=True)
dictionary = enrichment_variables_copy.to_dict()
new_columns = dictionary['Variable Definition']
# Field renamed and new dataframe visualized
pd.set_option('display.max_columns', 150)
sdf_airbnb_count_by_tract_enriched_sorted.rename(columns=new_columns, inplace=True)
sdf_airbnb_count_by_tract_enriched_sorted.head()
(Output abridged: rows of the renamed dataframe; only selected columns are shown.)

      OBJECTID  geoid        namelsad         Point_Count  AnalysisArea  Employed Population Age 16+  Employed Male Pop Age 16+  Male 16+Priv Nonprofit
2095  2096      36005000100  Census Tract 1   0            2.705062      0.0                          0.0                        0.0
2059  2060      36005000200  Census Tract 2   0            1.382228      1664.0                       1068.0                     51.0
2067  2068      36005000400  Census Tract 4   15           1.515336      3128.0                       1675.0                     253.0
1668  1669      36005001600  Census Tract 16  1            0.485076      2513.0                       959.0                      31.0
21272128360050019003600500190019Census Tract 19G5020S16436541139660+40.8009990-073.9093729242.78333127US2128BlockApportionment:US.BlockGroups;PointsLayer:...2.1912.57611790.0887.0144.0580.046.00.07.041.017.0903.0585.015.0133.064.017.00.089.00.036.1362.0946.01286.0885.0553.0134.04166.03876.68595.0637.01656.062.067.05.01236.085.8323.0938.0409.075.045.026.01836.0347.03093.080.094.04166.02182.01984.02152.0114.05.01751.0279.089.01054.0142.014.0898.00.016.00.021.075.035.0182.0118.01486.0124.0381395.01728.01964.0118882651.012251786.06516.02079.07998.032.0127.0162.0177.058.064.0{"rings": [[[-8230028.8927, 4984061.5402], [-8...

With the renamed columns, the data frame above is self-explanatory and easier to interpret.

Estimating distances of tracts from various city features

The next set of features consists of the distances from each tract to various city features. These distance variables accomplish two important tasks.

First, they bring the spatial component of the Airbnb development phenomenon into the model.

Second, each Airbnb property is affected by its own locational factors. Airbnb reviews reflect this: the most highly rated, in-demand properties are located in neighbourhoods with good transit accessibility. The model accounts for this by including the distances from the tracts to different public transit options.

The hypothesis formed here is that tracts located near transit features, such as subway stations, bus stops, railroad lines and subway routes, might attract more Airbnb properties. Similarly, proximity to the central business district, which for New York is located in Lower Manhattan, might also influence the number of Airbnb properties, since this is the city's main business hub. In the following cells these distances are estimated using the find_nearest proximity method of the ArcGIS API for Python.
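To make the idea of a straight-line distance to the nearest feature concrete, here is a minimal sketch in plain Python. It is not how find_nearest is implemented: it assumes each tract is reduced to a single representative point (the service measures from the tract geometry itself), it approximates the distance with the haversine formula, and the helper names and coordinates are hypothetical values chosen only for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_distance_miles(origin, candidates):
    """Distance from origin to the closest candidate (the analogue of max_count=1)."""
    return min(haversine_miles(*origin, *c) for c in candidates)


# Hypothetical coordinates, for illustration only
tract_point = (40.7935, -73.8835)
stations = [(40.8045, -73.8569), (40.8010, -73.9094), (40.7802, -73.9593)]

dist_miles = nearest_distance_miles(tract_point, stations)
dist_feet = round(dist_miles * 5280, 2)  # 1 mile = 5280 feet
print(dist_miles, dist_feet)
```

Because the real tool measures from the tract polygon rather than from a point, a feature that lies inside a tract gets a distance of 0, which this point-based sketch does not reproduce.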

gis.content.search('NYCBusStop owner:api_data_owner')
[]
busi_distr = gis.content.search('BusinessDistricts owner:api_data_owner', 'feature layer')[0]
cbd = gis.content.search('NYCBD owner:api_data_owner', 'feature layer')[0]
bus_stop = gis.content.search('NYCBusStop owner:api_data_owner', 'feature layer')[0]
hotels = gis.content.search('NYCHotels owner:api_data_owner', 'feature layer')[0]
railroad = gis.content.search('NYCRailroad owner:api_data_owner', 'feature layer')[0]
subwy_rt = gis.content.search('NYCSubwayRoutes owner:api_data_owner', 'feature layer')[0]
subwy_stn = gis.content.search('NYCSubwayStation owner:api_data_owner', 'feature layer')[0]
# accessing the various city feature layers from the ArcGIS portal
busi_distr, cbd, bus_stop, hotels, railroad, subwy_rt, subwy_stn 
(<Item title:"tract_busi_distrs_dist874392" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"NYCBD" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_bus_stop_dist452536" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_hotel_dist128626" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"tract_railroad_dist613260" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_subway_routes_dist709262" type:Feature Layer Collection owner:api_data_owner>,
 <Item title:"ny_tract_subway_station_dist691504" type:Feature Layer Collection owner:api_data_owner>)
bus_stop_lyr = bus_stop.layers[0]
cbd_lyr = cbd.layers[0] 
hotels_lyr = hotels.layers[0] 
subwy_stn_lyr =subwy_stn.layers[0]
subwy_rt_lyr = subwy_rt.layers[0] 
railroad_lyr = railroad.layers[0]
busi_distrs_lyr = busi_distr.layers[0] 
# Avoid warning for chain operation
pd.set_option('mode.chained_assignment', None) 

# Estimating Tract to hotel distances
tract_hotel_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                              hotels_lyr,
                                              measurement_type='StraightLine',
                                              max_count=1,
                                              output_name='ny_tract_hotel_dist1' + str(dt.now().microsecond))
{"cost": 2.555}
tract_hotel_dist.layers
[<FeatureLayer url:"https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/ny_tract_hotel_dist1396959/FeatureServer/0">,
 <FeatureLayer url:"https://services7.arcgis.com/JEwYeAy2cc8qOe3o/arcgis/rest/services/ny_tract_hotel_dist1396959/FeatureServer/1">]
tract_hotel_dist_lyr = tract_hotel_dist.layers[1]
sdf_tract_hotel_dist_lyr = pd.DataFrame.spatial.from_layer(tract_hotel_dist_lyr)
sdf_tract_hotel_dist_lyr.head()
From_IDFrom_NameFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_ACRESTo_ADD_ADDRTo_ADD_CITYTo_ADD_OWNERTo_ADD_POBOXTo_ADD_STATETo_ADD_ZIPTo_AGDISTCODETo_AGDISTNAMETo_BLDG_DESCTo_BLDG_STYLETo_BOOKTo_CALC_ACRESTo_COUNTYTo_CT_NAMETo_CT_SWISTo_DEPTHTo_DUP_GEOTo_FRONTTo_FUEL_DESCTo_FUEL_TYPETo_FULL_MVTo_GRID_EASTTo_GRID_NORTHTo_HEAT_DESCTo_HEAT_TYPETo_IDTo_LAND_AVTo_LOC_STREETTo_LOC_ST_NBRTo_LOC_UNITTo_LOC_ZIPTo_MAIL_ADDRTo_MAIL_CITYTo_MAIL_STATETo_MAIL_ZIPTo_MUNI_NAMETo_MUNI_PCLIDTo_NAMESOURCETo_NBR_BEDRMTo_NBR_F_BATHTo_NBR_KITCHNTo_NYS_NAMETo_ORIG_FIDTo_OWNER_TYPETo_PAGETo_PARCELADDRTo_PO_BOXTo_PRINT_KEYTo_PRMY_OWNERTo_PROP_CLASSTo_ROLL_SECTTo_ROLL_YRTo_SBLTo_SCH_CODETo_SCH_NAMETo_SEWER_DESCTo_SEWER_TYPETo_SPATIAL_YRTo_SQFT_LIVTo_SQ_FTTo_SWISTo_SWISPKIDTo_SWISSBLIDTo_Shape__AreaTo_Shape__LengthTo_TOTAL_AVTo_USEDASCODETo_USEDASDESCTo_UTILITIESTo_UTIL_DESCTo_WATER_DESCTo_WATER_SUPPTo_YR_BLTTotal_Miles
01332227.03415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227552.1448, 4989516.0602], [-8...0.0MiscellaneousH900.162924BronxBronx600100159420.01011886.0248352.0237202500.0WEBSTER AVENUE193010457Bronx00030801930 WEBSTER AVENUEWEBSTER TREMONT EQUIT052017203027001009201719110.06753.060010060010020302700101154.0625166.3056621531800.019310.519338
1501207.01472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112None0.0DormitoriesH800.237481New YorkManhattan6201001001010.0995721.0234272.0231389700.0AMSTERDAM AVENUE123510027Manhattan000701801235 AMSTERDAM AVENUEBARNARD COLLEGE082017101963003003201779036.010092.062010062010010196300301680.277344163.992966787350.019680.0
2469174.02507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213None0.0DormitoriesH800.338975New YorkManhattan6201001011420.0998966.0229316.02731291050.0EAST 110 STREET5510029Manhattan0006958055 EAST 110 STREETEDWIN GOULD RESIDENCE082017101616002404201737570.014347.062010062010010161600242397.410156198.0951653051450.020040.0
3454160.02514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214None0.0DormitoriesH800.742213New YorkManhattan6201001013050.0997421.0226355.03666925950.0EAST 98 STREET5010029Manhattan0006898050 EAST 98 STREETMSMC RESIDENTIAL REAL0820171016030039022017240000.030781.062010062010010160300395248.046875332.8689928974350.019840.0
4440150.01516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232878.2271, 4979973.2013], [-8...0.0Transient Occupancy - Midtown Manhattan AreaH300.036426New YorkManhattan620100101260.0996841.0223341.0291395000.0EAST 87 STREET16410128Manhattan00068780164 EAST 87 STREET164 EAST 87TH ST LLC052017101515004502201718300.02571.06201006201001015150045257.594.7644534627350.019300.139874

In the above data frame, the Total_Miles field holds the distance in miles from each tract to its nearest hotel. This field is converted into feet and retained. The same steps are then repeated for each of the other distance estimates.
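Since the same select-and-convert step is repeated for every distance layer, it could be wrapped in a small helper. The following is only a sketch: the function name add_distance_in_feet is hypothetical, and the sample frame is a hand-built stand-in that reuses the first two Total_Miles values shown in the hotel output of this notebook. Taking an explicit .copy() of the column subset is one way to avoid chained-assignment warnings without turning them off globally.

```python
import pandas as pd

FEET_PER_MILE = 5280


def add_distance_in_feet(nearest_df, new_col):
    """Keep the tract geoid and Total_Miles, and add the distance converted to feet."""
    # .copy() makes the subset an independent frame, so the assignment below is safe
    out = nearest_df[['From_geoid', 'Total_Miles']].copy()
    out[new_col] = (out['Total_Miles'] * FEET_PER_MILE).round(2)
    return out


# Hand-built stand-in for a find_nearest result (not fetched from a layer)
sample = pd.DataFrame({
    'From_geoid': ['36005000100', '36005000200'],
    'Total_Miles': [1.055256, 1.039099],
    'NearRank': [1, 1],
})

hotel_feet = add_distance_in_feet(sample, 'hotel_dist')
print(hotel_feet)
```

The same helper could then be called with 'busstop_dist', 'cbd_dist' and so on for the other layers.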

# Final hotel distances in feet. In each row, the column "hotel_dist" holds the distance from the tract, identified by its geoid, to the nearest hotel.
# For example, in the first row the nearest hotel to the tract with geoid 36005000100 is 5571.75 feet away.
sdf_tract_hotel_dist_lyr_new = sdf_tract_hotel_dist_lyr[['From_geoid', 'Total_Miles']]

# 1 mile = 5280 feet
sdf_tract_hotel_dist_lyr_new['hotel_dist'] = round(sdf_tract_hotel_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_hotel_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  hotel_dist
2095  36005000100     1.055256     5571.75
2059  36005000200     1.039099     5486.44
2067  36005000400     0.472664     2495.67
1668  36005001600     0.585977     3093.96
2127  36005001900          0.0         0.0
# Estimating Busstop distances from tracts
tract_bustop_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                               bus_stop_lyr,
                                               measurement_type='StraightLine',
                                               max_count=1,
                                               output_name='ny_tract_bus_stop_dist'+ str(dt.now().microsecond))
tract_bustop_dist_lyr = tract_bustop_dist.layers[1]
sdf_tract_bustop_dist_lyr =tract_bustop_dist_lyr.query().sdf
{"cost": 3.846}
sdf_tract_bustop_dist_lyr
OBJECTIDFrom_IDTo_IDNearRankFrom_statefpFrom_countyfpFrom_tractceFrom_geoidFrom_NameFrom_namelsadFrom_mtfccFrom_funcstatFrom_alandFrom_awaterFrom_intptlatFrom_intptlonTo_CounDistTo_BoroCDTo_AssemDistTo_the_geomTo_CongDistTo_StSenDistTo_SHELTER_IDTo_LOCATIONTo_AT_BETWEENTo_LONGITUDETo_LATITUDETo_AssetIDTo_BoroCodeTo_BoroNameTo_StreetTo_SegmentIDTo_PhysicalIDTo_NODEIDTo_ORIG_FIDTotal_MilesSHAPE
011332104913600502270336005022703227.03Census Tract 227.03G5020S415020+40.8440198-073.91049991620477POINT (-73.91154799999998 40.84352500000006)1533BX0293Grand ConcourseMT EDEN AV EAST-73.91154840.8435259662BronxGRAND CONCOURSE190988166310019690.0None
125013513606102070136061020701207.01Census Tract 207.01G5020S472730+40.8089775-073.9584600710970POINT (-73.95978799999995 40.80835700000006)1030MN0273AMSTERDAM AVW 118 ST-73.95978840.80835720491ManhattanAMSTERDAM AVENUE3836138370490.0None
2346941313606101740236061017402174.02Census Tract 174.02G5020S507300+40.7968026-073.9471624811168POINT (-73.94691999999998 40.796970000000044)1330MN01334MADISON AVE 111 ST-73.9469240.7969717831ManhattanMADISON AVENUE37970353807840.0None
3445410813606101600236061016002160.02Census Tract 160.02G5020S514220+40.7878787-073.9536853411168POINT (-73.95376199999998 40.787168000000065)1229MN0868E 96 STMADISON AV-73.95376240.78716822011ManhattanEAST 96 STREET29413117210790451752010.0None
4544039813606101500136061015001150.01Census Tract 150.01G5020S516430+40.7801987-073.9592834510876POINT (-73.95346799999999 40.779304000000025)1228MN012713 AVE 87 ST-73.95346840.77930417391Manhattan3 AVENUE376693682236477400.204742{"paths": [[[-8232878.2271, 4979973.2013], [-8...
..................................................................................................................
216221632037165313608107160036081071600716Census Tract 716G5020S182424141117802+40.6476943-073.78605893141331POINT (-73.77444099999997 40.660487000000046)510QN04677ROCKAWAY BLVD147 AV-73.77444140.66048729134QueensROCKAWAY BOULEVARD594228417033410.035317{"paths": [[[-8212568.4035, 4962325.3532], [-8...
21632164287973136047990100360479901009901Census Tract 9901G5020S017793513+40.5649933-074.01488654331146POINT (-73.99949999999995 40.595282000000054)1123BR0471SHORE PKWYBAY PKWY-73.999540.5952828453BrooklynSHORE PARKWAY9009667167800018470.33537{"paths": [[[-8238249.5277, 4952568.7722], [-8...
216421652145238136081107202360811072021072.02Census Tract 1072.02G5020S750474516636750+40.6252538-073.81376463241423POINT (-73.82648099999994 40.58301400000005)515QN04053ROCKAWAY BEACH BLVDBEACH 105 ST-73.82648140.58301423904QueensROCKAWAY BEACH BOULEVARD1472661487003900.332147{"paths": [[[-8218474.4488, 4951720.8592], [-8...
216521666911445136085990100360859901009901Census Tract 9901G5020S080255169+40.5255512-074.10858295059564POINT (-74.07749299999995 40.57932000000005)1124SI05038FR CAPODANNO BLVDSEAVIEW AV-74.07749340.5793232805Staten IslandFR CAPODANNO BOULEVARD145481448569277027800.433019{"paths": [[[-8245592.2067, 4949864.8834], [-8...
216621675811620136081990100360819901009901Census Tract 9901G5020S0122602887+40.5401732-073.89096983241423POINT (-73.86499899999995 40.56884100000008)515QN04568ROCKAWAY BEACH BLVDBEACH 149 ST-73.86499940.56884128204QueensROCKAWAY BEACH BOULEVARD11419515135032480.54502{"paths": [[[-8222119.6803, 4947903.8282], [-8...

2167 rows × 37 columns

# Final bus stop distances in feet. In each row, the column "busstop_dist" holds the distance from the tract,
# identified by its geoid, to the nearest bus stop
sdf_tract_bustop_dist_lyr_new = sdf_tract_bustop_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_bustop_dist_lyr_new['busstop_dist'] = round(sdf_tract_bustop_dist_lyr_new['Total_Miles'] * 5280, 2)
sdf_tract_bustop_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  busstop_dist
2095  36005000100     0.744344       3930.14
2059  36005000200     0.005983         31.59
2067  36005000400          0.0           0.0
1668  36005001600          0.0           0.0
2127  36005001900          0.0           0.0
# estimating number of bus stops per tract
num_bustops_tracts = summarize_data.aggregate_points(point_layer=bus_stop_lyr,
                                                   polygon_layer=nyc_tracts_layer,
                                                   output_name='bustops_by_tracts'+ str(dt.now().microsecond)) 
{"cost": 3.846}
num_bustops_tracts_lyr = num_bustops_tracts.layers[0]
sdf_num_bustops_tracts_lyr = pd.DataFrame.spatial.from_layer(num_bustops_tracts_lyr)
sdf_num_bustops_tracts_lyr.head()
AnalysisAreaOBJECTIDPoint_CountSHAPEalandawatercountyfpfuncstatgeoidintptlatintptlonmtfccnamenamelsadstatefptractce
00.01602411{"rings": [[[-8227813.3004, 4989345.3624], [-8...415020005S36005022703+40.8440198-073.9104999G5020227.03Census Tract 227.0336022703
10.01825221{"rings": [[[-8233183.0202, 4984115.3687], [-8...472730061S36061020701+40.8089775-073.9584600G5020207.01Census Tract 207.0136020701
20.01958731{"rings": [[[-8231989.6748, 4982433.1457], [-8...507300061S36061017402+40.7968026-073.9471624G5020174.02Census Tract 174.0236017402
30.01985441{"rings": [[[-8232691.8783, 4981159.0609], [-8...514220061S36061016002+40.7878787-073.9536853G5020160.02Census Tract 160.0236016002
40.01993950{"rings": [[[-8233292.0018, 4980071.8459], [-8...516430061S36061015001+40.7801987-073.9592834G5020150.01Census Tract 150.0136015001
# Number of Bus stops per tract — Here in each row column "num_bustop" returns the number of bus stops inside respective tracts 
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr[['geoid', 'Point_Count']] 
sdf_num_bustops_tracts_lyr_new = sdf_num_bustops_tracts_lyr_new.rename(columns={'Point_Count':'num_bustop'})
sdf_num_bustops_tracts_lyr_new.sort_values('geoid').head()
            geoid  num_bustop
2095  36005000100           0
2059  36005000200           0
2067  36005000400           1
1668  36005001600           1
2127  36005001900           1
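The bus stop counts above come from aggregate_points, which counts the points that fall inside each polygon. As a rough illustration of that idea, and not of the service's actual implementation, the sketch below counts points per polygon with a simple ray-casting test. The function names, the square "tracts" and the point coordinates are all hypothetical, and the test ignores holes and points lying exactly on a boundary.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon described by ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_polygon(polygons, points):
    """Map each polygon id to the number of points that fall inside it."""
    return {
        pid: sum(point_in_polygon(px, py, ring) for px, py in points)
        for pid, ring in polygons.items()
    }


# Hypothetical toy geometry: three square "tracts" and three "bus stops"
tracts = {
    'tract_a': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'tract_b': [(2, 0), (4, 0), (4, 2), (2, 2)],
    'tract_c': [(0, 2), (2, 2), (2, 4), (0, 4)],
}
bus_stops = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0)]

counts = count_points_per_polygon(tracts, bus_stops)
print(counts)  # {'tract_a': 2, 'tract_b': 1, 'tract_c': 0}
```

A polygon with no points inside still appears with a count of 0, which matches the zero values of num_bustop seen for some tracts.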
# estimating tracts distances from CBD 
tract_cbd_dist=use_proximity.find_nearest(nyc_tracts_layer,
                                          cbd_lyr,
                                          measurement_type='StraightLine',
                                          max_count=1,
                                          output_name='ny_tract_cbd_dist'+ str(dt.now().microsecond))
tract_cbd_dist_lyr = tract_cbd_dist.layers[1]
sdf_tract_cbd_dist_lyr = tract_cbd_dist_lyr.query().sdf
sdf_tract_cbd_dist_lyr.head()
{"cost": 2.168}
OBJECTIDFrom_IDTo_IDNearRankFrom_statefpFrom_countyfpFrom_tractceFrom_geoidFrom_NameFrom_namelsadFrom_mtfccFrom_funcstatFrom_alandFrom_awaterFrom_intptlatFrom_intptlonTo_bidTo_boroughTo_date_creatTo_time_creatTo_date_modifTo_time_modifTo_objectidTo_shape_areaTo_shape_lenTotal_MilesSHAPE
011332113600502270336005022703227.03Census Tract 227.03G5020S415020+40.8440198-073.9104999Bryant Park BIDManhattan2008-11-1900:00:00.0002016-10-3100:00:00.00058.01225783.270845837.6035577.102363{"paths": [[[-8227840.685, 4989242.9453], [-82...
12501113606102070136061020701207.01Census Tract 207.01G5020S472730+40.8089775-073.9584600Bryant Park BIDManhattan2008-11-1900:00:00.0002016-10-3100:00:00.00058.01225783.270845837.6035573.809966{"paths": [[[-8233012.3673, 4983978.5918], [-8...
23469113606101740236061017402174.02Census Tract 174.02G5020S507300+40.7968026-073.9471624Bryant Park BIDManhattan2008-11-1900:00:00.0002016-10-3100:00:00.00058.01225783.270845837.6035573.363737{"paths": [[[-8231927.236, 4982389.3817], [-82...
34454113606101600236061016002160.02Census Tract 160.02G5020S514220+40.7878787-073.9536853Bryant Park BIDManhattan2008-11-1900:00:00.0002016-10-3100:00:00.00058.01225783.270845837.6035572.658677{"paths": [[[-8232719.7081, 4981110.3946], [-8...
45440113606101500136061015001150.01Census Tract 150.01G5020S516430+40.7801987-073.9592834Bryant Park BIDManhattan2008-11-1900:00:00.0002016-10-3100:00:00.00058.01225783.270845837.6035572.055165{"paths": [[[-8233341.205, 4979982.316], [-823...
# Final CBD distances in feet. In each row, the column "cbd_dist" holds the distance from the respective tract to the CBD
sdf_tract_cbd_dist_lyr_new = sdf_tract_cbd_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_cbd_dist_lyr_new['cbd_dist'] = round(sdf_tract_cbd_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_cbd_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  cbd_dist
2095  36005000100     4.999247  26396.02
2059  36005000200     6.858514  36212.95
2067  36005000400     7.321927  38659.77
1668  36005001600     7.525535  39734.83
2127  36005001900      4.33359  22881.35
# Estimating NYCSubwayStation distances from tracts 
tract_subwy_stn_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                                  subwy_stn_lyr,
                                                  measurement_type='StraightLine',
                                                  max_count=1,
                                                  output_name='ny_tract_subway_station_dist'+ str(dt.now().microsecond))
tract_subwy_stn_dist_lyr = tract_subwy_stn_dist.layers[1]
sdf_tract_subwy_stn_dist_lyr = pd.DataFrame.spatial.from_layer(tract_subwy_stn_dist_lyr)
sdf_tract_subwy_stn_dist_lyr.head()
{"cost": 2.599}
From_IDFrom_NameFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_NameTo_ORIG_FIDTo_lineTo_notesTo_objectidTo_urlTotal_Miles
01332227.03415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227646.1872, 4989522.1588], [-8...21174th-175th Sts21B-DB-rush hours, D-all times, skips rush hours AM...21.0http://web.mta.info/nyct/service/0.054525
1501207.01472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8233201.4992, 4984081.6891], [-8...159116th St - Columbia University16711-all times167.0http://web.mta.info/nyct/service/0.211254
2469174.02507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213{"paths": [[[-8231617.2328, 4982254.7552], [-8...413110th St4504-6-6 Express4 nights, 6-all times, 6 Express-weekdays AM s...450.0http://web.mta.info/nyct/service/0.09727
3454160.02514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214{"paths": [[[-8232360.48, 4980909.7037], [-823...3396th St334-6-6 Express4 nights, 6-all times, 6 Express-weekdays AM s...33.0http://web.mta.info/nyct/service/0.098659
4440150.01516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232878.9945, 4979971.8169], [-8...41486th St4514-5-6-6 Express4,6-all times, 5-all times exc nights, 6 Expre...451.0http://web.mta.info/nyct/service/0.09711
# Final Tract to NYC Subway Station distances in feet — Here in each row, column "subwy_stn_dist" returns the distance of
# the nearest subway station from that tract
sdf_tract_subwy_stn_dist_lyr_new = sdf_tract_subwy_stn_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_subwy_stn_dist_lyr_new['subwy_stn_dist'] = round(sdf_tract_subwy_stn_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_subwy_stn_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  subwy_stn_dist
2095  36005000100     0.946226         4996.07
2059  36005000200     1.108173         5851.15
2067  36005000400     1.191505         6291.15
1668  36005001600     0.729661         3852.61
2127  36005001900     0.080063          422.73
# Estimating distances to NYCSubwayRoutes
tract_subwy_rt_dist=use_proximity.find_nearest(nyc_tracts_layer,
                                               subwy_rt_lyr,
                                               measurement_type='StraightLine',
                                               max_count=1,
                                               output_name='ny_tract_subway_routes_dist'+ str(dt.now().microsecond))
tract_subwy_rt_dist_lyr = tract_subwy_rt_dist.layers[1]
sdf_tract_subwy_rt_dist_lyr = tract_subwy_rt_dist_lyr.query().sdf
sdf_tract_subwy_rt_dist_lyr.head()
{"cost": 2.191}
OBJECTIDFrom_IDTo_IDNearRankFrom_statefpFrom_countyfpFrom_tractceFrom_geoidFrom_NameFrom_namelsadFrom_mtfccFrom_funcstatFrom_alandFrom_awaterFrom_intptlatFrom_intptlonTo_route_idTo_route_shorTo_route_longTo_group_To_Shape__LengthTo_ORIG_FIDTotal_MilesSHAPE
0113321913600502270336005022703227.03Census Tract 227.03G5020S415020+40.8440198-073.9104999BB6 Avenue ExpressBDFM51293.284426120.0None
12501913606102070136061020701207.01Census Tract 207.01G5020S472730+40.8089775-073.958460011Broadway - 7 Avenue Local12331208.13783130.169658{"paths": [[[-8233150.5149, 4984174.9334], [-8...
234691013606101740236061017402174.02Census Tract 174.02G5020S507300+40.7968026-073.947162466Lexington Avenue Express/Local45631863.54533990.09692{"paths": [[[-8231635.9013, 4982220.9583], [-8...
344541013606101600236061016002160.02Census Tract 160.02G5020S514220+40.7878787-073.953685366Lexington Avenue Express/Local45631863.54533990.096942{"paths": [[[-8232305.5995, 4981008.652], [-82...
454401013606101500136061015001150.01Census Tract 150.01G5020S516430+40.7801987-073.959283466Lexington Avenue Express/Local45631863.54533990.096767{"paths": [[[-8232878.2271, 4979973.2013], [-8...
# Final Tract to NYCSubwayRoutes distances in feet — Here in each row, column "subwy_rt_dist" returns the distance of
# the nearest subway route from that tract
sdf_tract_subwy_rt_dist_lyr_new = sdf_tract_subwy_rt_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_subwy_rt_dist_lyr_new['subwy_rt_dist'] = round(sdf_tract_subwy_rt_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_subwy_rt_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  subwy_rt_dist
2095  36005000100      0.90531        4780.04
2059  36005000200     1.108726        5854.07
2067  36005000400     1.192022        6293.88
1668  36005001600     0.724321        3824.42
2127  36005001900     0.002853          15.06
# Estimating distances to NYCRailroad
tract_railroad_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                           railroad_lyr,
                                           measurement_type='StraightLine',
                                           max_count=1,
                                           output_name='tract_railroad_dist'+ str(dt.now().microsecond))
tract_railroad_dist_lyr = tract_railroad_dist.layers[1]
sdf_tract_railroad_dist_lyr = pd.DataFrame.spatial.from_layer(tract_railroad_dist_lyr)
sdf_tract_railroad_dist_lyr.head()
{"cost": 2.168}
From_IDFrom_NameFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_Id_OrigTo_ORIG_FIDTo_Shape__LengthTotal_Miles
01332227.03415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227770.665, 4989475.2983], [-82...1012194198.770380.140554
1501207.01472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8232997.3392, 4984450.7008], [-8...1012194198.770380.166535
2469174.02507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213None1012194198.770380.0
3454160.02514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214None1012194198.770380.0
4440150.01516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115{"paths": [[[-8232883.8965, 4979976.3551], [-8...1012194198.770380.559931
# Final Tract to NYCRailroad distances in feet — Here in each row, column "railroad_dist" returns the distance of
# the nearest rail road route from that tract
sdf_tract_railroad_dist_lyr_new = sdf_tract_railroad_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_railroad_dist_lyr_new['railroad_dist'] = round(sdf_tract_railroad_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_railroad_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  railroad_dist
2095  36005000100     0.403054        2128.12
2059  36005000200     0.215395        1137.29
2067  36005000400     0.708551        3741.15
1668  36005001600     0.614506        3244.59
2127  36005001900          0.0            0.0
# Estimating distances to NYC Business Districts
tract_busi_distrs_dist = use_proximity.find_nearest(nyc_tracts_layer,
                                                      busi_distrs_lyr,
                                                      measurement_type='StraightLine',
                                                      max_count=1,
                                                      output_name='tract_busi_distrs_dist'+ str(dt.now().microsecond))
tract_busi_distrs_dist_lyr = tract_busi_distrs_dist.layers[1]
sdf_tract_busi_distrs_dist_lyr = pd.DataFrame.spatial.from_layer(tract_busi_distrs_dist_lyr)
sdf_tract_busi_distrs_dist_lyr.head()
{"cost": 2.241}
From_IDFrom_NameFrom_alandFrom_awaterFrom_countyfpFrom_funcstatFrom_geoidFrom_intptlatFrom_intptlonFrom_mtfccFrom_namelsadFrom_statefpFrom_tractceNearRankOBJECTIDSHAPETo_IDTo_ORIG_FIDTo_Shape__AreaTo_Shape__LengthTo_bidTo_boroughTo_date_creatTo_date_modifTo_objectidTo_time_creatTo_time_modifTotal_Miles
01332227.03415020005S36005022703+40.8440198-073.9104999G5020Census Tract 227.033602270311{"paths": [[[-8227809.0396, 4989358.3475], [-8...1514180282.2265623281.5747Washington Heights BIDManhattan2008-11-192016-10-2569.000:00:00.00000:00:00.0001.034897
1501207.01472730061S36061020701+40.8089775-073.9584600G5020Census Tract 207.013602070112{"paths": [[[-8232865.8707, 4984369.9553], [-8...2916269468.5078124849.117421125th Street BIDManhattan2008-11-192016-10-2567.000:00:00.00000:00:00.0000.159359
2469174.02507300061S36061017402+40.7968026-073.9471624G5020Census Tract 174.023601740213{"paths": [[[-8231888.4853, 4982612.3984], [-8...2916269468.5078124849.117421125th Street BIDManhattan2008-11-192016-10-2567.000:00:00.00000:00:00.0000.604451
3454160.02514220061S36061016002+40.7878787-073.9536853G5020Census Tract 160.023601600214{"paths": [[[-8232539.9271, 4981009.3871], [-8...6570409639.4140628507.99606Madison Avenue BIDManhattan2008-11-192016-10-2664.000:00:00.00000:00:00.0000.502002
4440150.01516430061S36061015001+40.7801987-073.9592834G5020Census Tract 150.013601500115None6570409639.4140628507.99606Madison Avenue BIDManhattan2008-11-192016-10-2664.000:00:00.00000:00:00.0000.0
# Final tract to NYC business district distances in feet. In each row, the column "busi_distr_dist" holds the distance from the respective tract to the nearest business district
sdf_tract_busi_distrs_dist_lyr_new = sdf_tract_busi_distrs_dist_lyr[['From_geoid', 'Total_Miles']]
sdf_tract_busi_distrs_dist_lyr_new['busi_distr_dist'] = round(sdf_tract_busi_distrs_dist_lyr_new['Total_Miles'] * 5280, 2) 
sdf_tract_busi_distrs_dist_lyr_new.sort_values('From_geoid').head()
       From_geoid  Total_Miles  busi_distr_dist
2095  36005000100     1.308636           6909.6
2059  36005000200     1.292505          6824.43
2067  36005000400     1.596395          8428.97
1668  36005001600      1.23762          6534.63
2127  36005001900     0.510611          2696.02

Importing borough information for each tract

# Name of the borough, inside which the tracts are located 
ny_tract_boro = gis.content.search('NYCTractBorough owner:api_data_owner', 'feature layer')[0]
ny_tract_boro_lyr = ny_tract_boro.layers[0]
sdf_ny_tract_boro_lyr = pd.DataFrame.spatial.from_layer(ny_tract_boro_lyr)
sdf_ny_tract_boro_lyr_new = sdf_ny_tract_boro_lyr[['geoid', 'boro_name']]
sdf_ny_tract_boro_lyr_new.sort_values('geoid').head()
         geoid  boro_name
0  36005000100      Bronx
2  36005000200      Bronx
5  36005000400      Bronx
7  36005001600      Bronx
9  36005001900      Bronx

Merging all the feature data sets estimated above

# Chain the merges on the tract geoid; the suffixes only disambiguate the repeated Total_Miles columns
tract_merge_dist = (
    sdf_tract_hotel_dist_lyr_new
    .merge(sdf_tract_subwy_rt_dist_lyr_new, on='From_geoid', suffixes=('_left1', '_right1'))
    .merge(sdf_tract_railroad_dist_lyr_new, on='From_geoid', suffixes=('_left2', '_right2'))
    .merge(sdf_tract_subwy_stn_dist_lyr_new, on='From_geoid', suffixes=('_left3', '_right3'))
    .merge(sdf_tract_busi_distrs_dist_lyr_new, on='From_geoid', suffixes=('_left4', '_right4'))
    .merge(sdf_tract_cbd_dist_lyr_new, on='From_geoid')
)
tract_merge_dist_new = tract_merge_dist[['From_geoid',
                                         'hotel_dist',
                                         'subwy_rt_dist',
                                         'railroad_dist',
                                         'subwy_stn_dist',
                                         'busi_distr_dist',
                                         'cbd_dist']]
tract_merge_dist_new = tract_merge_dist_new.rename(columns={'From_geoid':'geoid'})
tract_merge_dist_new.sort_values('geoid').head()
            geoid  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist
2095  36005000100     5571.75        4780.04        2128.12         4996.07           6909.6  26396.02
2059  36005000200     5486.44        5854.07        1137.29         5851.15          6824.43  36212.95
2067  36005000400     2495.67        6293.88        3741.15         6291.15          8428.97  38659.77
1668  36005001600     3093.96        3824.42        3244.59         3852.61          6534.63  39734.83
2127  36005001900         0.0          15.06            0.0          422.73          2696.02  22881.35
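The merge above joins several per-tract tables on the tract identifier. A possible alternative, sketched below under stated assumptions, is to keep only the key and the feature column in each table and fold the merges with functools.reduce; passing validate='one_to_one' makes pandas raise an error if a geoid is duplicated in any table. The helper name merge_on_geoid is hypothetical, and the three small frames are hand-built from values shown in the outputs of this notebook rather than read from the layers.

```python
from functools import reduce

import pandas as pd


def merge_on_geoid(frames):
    """Inner-join the frames on 'geoid', checking that each geoid is unique in every frame."""
    return reduce(
        lambda left, right: left.merge(right, on='geoid', how='inner', validate='one_to_one'),
        frames,
    )


# Hand-built stand-ins: one feature column per frame, so no suffixes are needed
hotel = pd.DataFrame({'geoid': ['36005000100', '36005000200'],
                      'hotel_dist': [5571.75, 5486.44]})
railroad = pd.DataFrame({'geoid': ['36005000200', '36005000100'],
                         'railroad_dist': [1137.29, 2128.12]})
cbd = pd.DataFrame({'geoid': ['36005000100', '36005000200'],
                    'cbd_dist': [26396.02, 36212.95]})

merged = (
    merge_on_geoid([hotel, railroad, cbd])
    .sort_values('geoid')
    .reset_index(drop=True)
)
print(merged)
```

Dropping Total_Miles before merging is a design choice of this sketch; it keeps the column names unambiguous, at the cost of discarding the distances in miles.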
# merging number of bus stop and borough name
tract_merge_dist_new = tract_merge_dist_new.merge(sdf_num_bustops_tracts_lyr_new,
                                                 on='geoid').merge(sdf_ny_tract_boro_lyr_new,
                                                 on='geoid') 
tract_merge_dist_new = tract_merge_dist_new.sort_values('geoid')
tract_merge_dist_new.head()
      geoid        hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist  num_bustop  boro_name
2095  36005000100  5571.75     4780.04        2128.12        4996.07         6909.6           26396.02  0           Bronx
2059  36005000200  5486.44     5854.07        1137.29        5851.15         6824.43          36212.95  0           Bronx
2067  36005000400  2495.67     6293.88        3741.15        6291.15         8428.97          38659.77  1           Bronx
1668  36005001600  3093.96     3824.42        3244.59        3852.61         6534.63          39734.83  1           Bronx
2127  36005001900  0.01        5.06           0.04           22.73           2696.02          22881.35  1           Bronx
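
The chained merges above can also be written as a fold over a list of tables. The following is a minimal sketch on made-up data, not the notebook's own code: the three small frames and the helper `merge_on_key` are hypothetical stand-ins for the per-feature distance tables. Passing `validate="one_to_one"` makes pandas raise an error if a tract ID is duplicated in any table, which would otherwise silently multiply rows.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the per-feature distance tables
hotel = pd.DataFrame({"From_geoid": ["a", "b", "c"], "hotel_dist": [10.0, 20.0, 30.0]})
subway = pd.DataFrame({"From_geoid": ["a", "b", "c"], "subwy_stn_dist": [1.0, 2.0, 3.0]})
cbd = pd.DataFrame({"From_geoid": ["a", "b", "c"], "cbd_dist": [100.0, 200.0, 300.0]})


def merge_on_key(frames, key):
    """Inner-join a list of frames on `key`, requiring the key to be unique in each."""
    return reduce(
        lambda left, right: left.merge(right, on=key, validate="one_to_one"),
        frames,
    )


merged = merge_on_key([hotel, subway, cbd], key="From_geoid").rename(
    columns={"From_geoid": "geoid"}
)
print(merged)
```

Because every table contributes a differently named distance column, no suffixes are needed in this sketch.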
# Accessing the airbnb count for each tract
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract[['geoid','Point_Count']]
sdf_airbnb_count_by_tract_new = sdf_airbnb_count_by_tract_new.rename(columns={'Point_Count':'total_airbnb'})
sdf_airbnb_count_by_tract_new.head()
      geoid        total_airbnb
2095  36005000100  0
2059  36005000200  0
2067  36005000400  15
1668  36005001600  1
2127  36005001900  24
# preparing the final distance table with airbnb count by tract
tract_merge_dist_all = sdf_airbnb_count_by_tract_new.merge(tract_merge_dist_new, on='geoid')
tract_merge_dist_all.head()
   geoid        total_airbnb  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist  num_bustop  boro_name
0  36005000100  0             5571.75     4780.04        2128.12        4996.07         6909.6           26396.02  0           Bronx
1  36005000200  0             5486.44     5854.07        1137.29        5851.15         6824.43          36212.95  0           Bronx
2  36005000400  15            2495.67     6293.88        3741.15        6291.15         8428.97          38659.77  1           Bronx
3  36005001600  1             3093.96     3824.42        3244.59        3852.61         6534.63          39734.83  1           Bronx
4  36005001900  24            0.01        5.06           0.04           22.73           2696.02          22881.35  1           Bronx
tract_merge_dist_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2167 entries, 0 to 2166
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   geoid            2167 non-null   string 
 1   total_airbnb     2167 non-null   Int32  
 2   hotel_dist       2167 non-null   Float64
 3   subwy_rt_dist    2167 non-null   Float64
 4   railroad_dist    2167 non-null   Float64
 5   subwy_stn_dist   2167 non-null   Float64
 6   busi_distr_dist  2167 non-null   Float64
 7   cbd_dist         2167 non-null   Float64
 8   num_bustop       2167 non-null   Int32  
 9   boro_name        2167 non-null   string 
dtypes: Float64(6), Int32(2), string(2)
memory usage: 169.4 KB

The borough column, being an important location indicator, is converted into numerical (dummy) variables and included in the feature data

tract_merge_dist_final = pd.get_dummies(tract_merge_dist_all, columns=['boro_name'])
tract_merge_dist_final.head()
   geoid        total_airbnb  hotel_dist  subwy_rt_dist  railroad_dist  subwy_stn_dist  busi_distr_dist  cbd_dist  num_bustop  boro_name_Bronx  boro_name_Brooklyn  boro_name_Manhattan  boro_name_Queens  boro_name_Staten Island
0  36005000100  0             5571.75     4780.04        2128.12        4996.07         6909.6           26396.02  0           True             False               False                False             False
1  36005000200  0             5486.44     5854.07        1137.29        5851.15         6824.43          36212.95  0           True             False               False                False             False
2  36005000400  15            2495.67     6293.88        3741.15        6291.15         8428.97          38659.77  1           True             False               False                False             False
3  36005001600  1             3093.96     3824.42        3244.59        3852.61         6534.63          39734.83  1           True             False               False                False             False
4  36005001900  24            0.01        5.06           0.04           22.73           2696.02          22881.35  1           True             False               False                False             False
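
As an illustration of what the dummy encoding does, here is a small sketch on a made-up three-row table (the `toy` frame is hypothetical). It shows two optional arguments of `pd.get_dummies` that the notebook does not use: `dtype=int` returns 0/1 columns instead of booleans, and `drop_first=True` drops one borough as the reference level, which may be preferable when the features feed a linear model.

```python
import pandas as pd

# Hypothetical miniature of the tract table
toy = pd.DataFrame({
    "geoid": ["36005000100", "36047000100", "36061000100"],
    "boro_name": ["Bronx", "Brooklyn", "Manhattan"],
})

# One 0/1 indicator column per borough
encoded = pd.get_dummies(toy, columns=["boro_name"], dtype=int)

# Same encoding, with the first level (Bronx) dropped as the reference category
encoded_ref = pd.get_dummies(toy, columns=["boro_name"], dtype=int, drop_first=True)

print(encoded)
print(encoded_ref)
```

Tree-based models such as the ones used later in this notebook work with either variant.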

Adding census data 2019 obtained using geoenrichment

The distance dataset above is now combined with the census data to form the final feature set for the model

sdf_airbnb_count_by_tract_enriched_sorted_new = sdf_airbnb_count_by_tract_enriched_sorted.drop(['AnalysisArea',
                                                                                                'ENRICH_FID',
                                                                                                'HasData',
                                                                                                'ID',
                                                                                                'OBJECTID',
                                                                                                'Point_Count',
                                                                                                'SHAPE',                      
                                                                                                'aggregationMethod',
                                                                                                'aland',
                                                                                                'apportionmentConfidence',
                                                                                                'awater',
                                                                                                'countyfp',
                                                                                                'funcstat',
                                                                                                'intptlat',
                                                                                                'intptlon',
                                                                                                'mtfcc',
                                                                                                'name',
                                                                                                'namelsad',
                                                                                                'populationToPolygonSizeRating',
                                                                                                'sourceCountry',
                                                                                                'statefp','tractce'], axis=1)
sdf_airbnb_count_by_tract_enriched_sorted_new.shape
(2167, 84)
# checking the rows of the table for nan values
row_with_null = sdf_airbnb_count_by_tract_enriched_sorted_new.isnull().any(axis=1)

# printing the row which has nan values
sdf_airbnb_count_by_tract_enriched_sorted_new[row_with_null]
[Output: 2 rows. The tracts at index 2163 (geoid 36047990100) and index 2165 (geoid 36085990100) contain <NA> in every enriched census column.]
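# checking total number of nan values
# note: np.isnan leaves pandas <NA> entries as <NA> instead of flagging them, so the sum below skips them
# and reports 0 even though the two rows above are empty; isnull().sum().sum() would count them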
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_new.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()
0

These two tracts are actually water areas within NYC, hence they have no census values; the missing values are filled with zeros

sdf_airbnb_count_by_tract_enriched_sorted_fill = sdf_airbnb_count_by_tract_enriched_sorted_new.fillna(0)

#nan rechecked
nan_test = sdf_airbnb_count_by_tract_enriched_sorted_fill.drop(['geoid'], axis=1)
np.isnan(nan_test).sum().sum()
0
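
Since the enriched table holds pandas nullable columns (the missing values print as `<NA>`), a check based on `isna()` may be more reliable than `np.isnan`, which appears to pass `<NA>` through rather than count it. A minimal sketch on made-up data follows; the `toy` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical two-tract table with nullable dtypes, one tract entirely missing
toy = pd.DataFrame({
    "geoid": pd.array(["36047990100", "36005000200"], dtype="string"),
    "total_pop": pd.array([None, 4797.0], dtype="Float64"),
    "median_age": pd.array([None, 35.8], dtype="Float64"),
})

features = toy.drop(columns="geoid")

# Count missing cells and list the tracts that contain any
n_missing = int(features.isna().sum().sum())
rows_missing = toy.loc[features.isna().any(axis=1), "geoid"].tolist()

# Fill and recheck
filled = features.fillna(0)
n_missing_after = int(filled.isna().sum().sum())

print(n_missing, rows_missing, n_missing_after)
```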

Merging the distance data with the enriched data

final_df = pd.merge(tract_merge_dist_final,
                    sdf_airbnb_count_by_tract_enriched_sorted_fill,
                    on = 'geoid',
                    how = 'left')

print(final_df.shape)
final_df.head()
(2167, 97)
geoidtotal_airbnbhotel_distsubwy_rt_distrailroad_distsubwy_stn_distbusi_distr_distcbd_distnum_bustopboro_name_Bronxboro_name_Brooklynboro_name_Manhattanboro_name_Queensboro_name_Staten IslandEmployed Population Age 16+Employed Male Pop Age 16+Male 16+Priv NonprofitMale 16+:Priv Profit EmplMale 16+:Priv Profit Self EmplMale 16+:State Govt WrkrMale 16+:Fed Govt WrkrMale 16+:Self-Emp Not IncMale 16+:Unpaid Family WrkrFemale Pop Age 16+Female 16+:Priv Profit EmplFemale 16+:Priv Profit Self EmplFemale 16+:Priv NonprofitFemale 16+:Local Govt WrkrFemale 16+:State Govt WrkrFemale 16+:Fed Govt WrkrFemale 16+:Self-Emp Not IncFemale 16+:Unpaid Family Wrkr2019 Median Age2019 Generation Alpha Population2019 Generation Z Population2019 Millennial Population2019 Generation X Population2019 Baby Boomer Population2019 Silent & Greatest Generations Population2019 Population by Generation Base2019 Population Density2019 Total Daytime Population2019 White Population2019 Black Population2019 American Indian Population2019 Asian Population2019 Pacific Islander Population2019 Other Race Population2019 Diversity IndexHHs: Inc Below Poverty LevelHHs:Inc at/Above Poverty LevelTotal Businesses (SIC)Construction Businesses (NAICS)Retail Trade Businesses (NAICS)Transportation/Warehouse Bus (NAICS)Own any tabletOwn any e-readerHave access to Internet at homeIndex: Spend 0.5-0.9 hrs online(excl email/IM .Spend <0.5 hrs online (excl email/IM time) daily2019 Total Population2019 Male Population2019 Female Population2019 Employed Civilian Pop 16+2019 Unemployed Population 16+2019 Unemployment RateACS Workers Age 16+ACS Workers 16+: Drove AloneACS Workers 16+: CarpooledACS Workers 16+: Public TransportationACS Workers 16+: BusACS Workers 16+: StreetcarACS Workers 16+: SubwayACS Workers 16+: FerryboatACS Workers 16+: TaxicabACS Workers 16+: MotorcycleACS Workers 16+: BicycleACS Workers 16+: WalkedACS Workers 16+: Other MeansACS Wrkrs 16+: Worked at Home2019 Owner Occupied HUs2019 Renter Occupied 
HUs2019 Vacant Housing Units2019 Median Home Value2019 Total Housing UnitsACS Median Year Structure Built: HUs2019 Annual Budget Exp2019 Transportation2019 Daytime Pop: Workers2019 Daytime Pop: Residents2019 Daytime Pop Density2019 Occupation: Protective Service2019 Occupation: Food Preperation2019 Occupation: Personal Care2019 Occupation: Office/Admin2019 Occupation: Construction/Extraction2019 Occupation: Production
03600500010005571.754780.042128.124996.076909.626396.020TrueFalseFalseFalseFalse0.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.034.50.0698.01997.0921.0127.05.03748.03588.6663.0629.02330.017.067.01.0672.075.20.00.024.03.04.01.00.00.00.00.00.03748.03407.0341.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.01.00.00.01.00.00.00.0663.00.0634.80.00.00.00.00.00.0
13600500020005486.445854.071137.295851.156824.4336212.950TrueFalseFalseFalseFalse1664.01068.051.0709.00.043.00.038.00.0596.0316.015.078.0115.060.012.00.00.035.8430.01218.01245.0913.0823.0168.04797.08988.53210.0438.01195.069.0217.07.02150.083.6278.01114.042.02.02.05.02177.0437.03469.086.0123.04797.02249.02548.01834.0192.09.51664.0766.013.0619.0235.00.0362.022.039.00.00.0200.00.027.0767.0766.078.0558333.01611.01957.0132707872.013576654.0196.03014.06014.948.0106.015.0162.033.0130.0
236005000400152495.676293.883741.156291.158428.9738659.771TrueFalseFalseFalseFalse3128.01675.0253.0932.021.038.0104.075.00.01453.0804.037.0276.0281.016.039.00.00.035.7506.01589.01589.01226.0945.0166.06021.010291.03758.0641.01803.060.0240.06.02278.085.4107.02092.057.03.09.03.02754.0730.04455.0103.0172.06021.02937.03084.02807.0120.04.13082.01628.061.01077.0250.00.0776.033.00.00.00.0130.064.0122.01548.0537.0108.0552034.02193.02001.0197321413.020238565.0471.03287.06423.1239.083.096.0429.099.042.0
33600500160013093.963824.423244.593852.616534.6339734.831TrueFalseFalseFalseFalse2513.0959.031.0620.029.00.018.020.00.01554.0927.07.0159.0296.090.051.024.00.036.6477.01384.01313.0996.0997.0350.05517.029457.35334.0497.02127.070.0123.03.01900.084.6526.01661.057.00.07.02.02392.0491.03950.080.0114.05517.02394.03123.02087.0183.08.12422.0880.073.01154.0604.00.0538.012.019.00.012.0110.029.0145.0295.01712.0103.0650463.02110.01973.0117903160.011830562.01831.03503.028480.2101.0104.035.0299.03.021.0
436005001900240.015.060.0422.732696.0222881.351TrueFalseFalseFalseFalse1790.0887.0144.0580.046.00.07.041.017.0903.0585.015.0133.064.017.00.089.00.036.1362.0946.01286.0885.0553.0134.04166.03876.68595.0637.01656.062.067.05.01236.085.8323.0938.0409.075.045.026.01836.0347.03093.080.094.04166.02182.01984.02152.0114.05.01751.0279.089.01054.0142.014.0898.00.016.00.021.075.035.0182.0118.01486.0124.0381395.01728.01964.0118882651.012251786.06516.02079.07998.032.0127.0162.0177.058.064.0
# rechecking nan values of the final dataframe
final_nan_test = final_df.drop('geoid', axis=1)
np.isnan(final_nan_test).sum().sum()
0

Model Building

The goal here is to find the factors contributing to the development of new Airbnb properties in New York City. A model is therefore fitted to predict the number of Airbnb properties per tract from a feature set composed of the distance and demographic characteristics of each tract. Once a good fit is obtained, the most important predictors of the model are estimated, which is the main objective.

# Creating feature data 
X = final_df.drop(['geoid','total_airbnb'], axis=1)

# Creating target data -- the number of Airbnb listings per tract
y = pd.DataFrame(final_df['total_airbnb'])

Split the dataframe into train and test sets in a 90% to 10% ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 20)

print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

# Converting the target into 1d array
y_train_array = y_train.values.flatten()
y_test_array = y_test.values.flatten() 

print(y_train_array.shape)
print(y_test_array.shape)
(1950, 95)
(1950, 1)
(217, 95)
(217, 1)
(1950,)
(217,)

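As a best practice, since many models fit better on scaled data, the features are scaled with the robust scaler (RobustScaler), which centres each feature on its median and scales it by its interquartile range, making it less sensitive to outliers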

scaler = preprocessing.RobustScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns) 
# reuse the scaler fitted on the training data so that both sets share the same scaling
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
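
To make the scaling step concrete, the sketch below fits a `RobustScaler` on a made-up training matrix only and reuses it on a made-up test matrix, then reproduces the result by hand with the training medians and interquartile ranges. The arrays `X_train_toy` and `X_test_toy` are hypothetical; this illustrates the technique under default scaler settings and is not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical skewed features
rng = np.random.default_rng(0)
X_train_toy = rng.lognormal(size=(200, 3))
X_test_toy = rng.lognormal(size=(50, 3))

# Learn median and IQR from the training set only, then apply to both sets
scaler = RobustScaler()
X_train_s = scaler.fit_transform(X_train_toy)
X_test_s = scaler.transform(X_test_toy)

# Manual equivalent: (x - median) / (75th percentile - 25th percentile)
median = np.median(X_train_toy, axis=0)
q25, q75 = np.percentile(X_train_toy, [25, 75], axis=0)
manual_test = (X_test_toy - median) / (q75 - q25)

print(np.allclose(X_test_s, manual_test))
```

Fitting the scaler on the training data alone keeps the test set untouched until evaluation and ensures both sets are expressed on the same scale.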

RandomForest Regressor Model

The modelling was first attempted with a linear regression. However, the linear model failed to fit the data well, so a non-linear algorithm is used instead, as follows. The user can fit a linear regression on the same data to see the improvement Random Forest brings.

The accuracy metrics used are the mean absolute error and r-square
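
The comparison against a linear baseline could be sketched as follows. This is an illustration on synthetic data with a deliberately non-linear target, not a rerun of the notebook: `X_toy`, `y_toy` and the dictionary `r2_scores` are hypothetical names, and the gap between the two models on the real tract data may be smaller or larger.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the square of the first feature
rng = np.random.default_rng(1)
X_toy = rng.uniform(-2, 2, size=(400, 2))
y_toy = X_toy[:, 0] ** 2 + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.10, random_state=20)

# Fit both models on the same split and record the test r-square
r2_scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("random_forest", RandomForestRegressor(n_estimators=100, random_state=43)),
]:
    model.fit(X_tr, y_tr)
    r2_scores[name] = model.score(X_te, y_te)

print(r2_scores)
```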

# Random forest with scaled data
# for the best parameters a grid search could be done which could take some time
# however this model uses the default parameters of RF algorithm, while the estimators are changed till the best fit is obtained
model_RF = RandomForestRegressor(n_estimators = 500, random_state=43)

# Train the model
model_RF.fit(X_train_scaled, y_train_array)

# Training metrics for Random forest model
print('Training metrics for Random forest model using scaled data')
ypred_RF_train = model_RF.predict(X_train_scaled)
print('r-square_RF_Train: ', round(model_RF.score(X_train_scaled, y_train_array), 2))

mse_RF_train = metrics.mean_squared_error(y_train_array, ypred_RF_train)  
print('RMSE_RF_train: ', round(np.sqrt(mse_RF_train),4))

mean_absolute_error_RF_train = metrics.mean_absolute_error(y_train_array, ypred_RF_train)
print('MAE_RF_train: ', round(mean_absolute_error_RF_train, 4)) 

# Test metrics for Random Forest model
print('\nTest metrics for Random Forest model scaled data')
ypred_RF_test = model_RF.predict(X_test_scaled)
print('r-square_RF_test: ', round(model_RF.score(X_test_scaled, y_test_array), 2))

mse_RF_test = metrics.mean_squared_error(y_test_array, ypred_RF_test) 
print('RMSE_RF_test: ', round(np.sqrt(mse_RF_test), 4))

mean_absolute_error_RF_test = metrics.mean_absolute_error(y_test_array, ypred_RF_test)
print('MAE_RF_test: ', round(mean_absolute_error_RF_test, 4))
Training metrics for Random forest model using scaled data
r-square_RF_Train:  0.97
RMSE_RF_train:  6.9747
MAE_RF_train:  3.4844

Test metrics for Random Forest model scaled data
r-square_RF_test:  0.77
RMSE_RF_test:  22.2146
MAE_RF_test:  10.2689

The result shows that the model returns an r-square of 0.77 on the test data, with a mean absolute error of 10.27
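
For reference, the three reported metrics can be computed directly from the residuals. The helper `regression_metrics` below is a hypothetical name and the four-value example is made up; it is a sketch of the standard definitions rather than the scikit-learn implementation.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and r-square for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute miss
    rmse = np.sqrt(np.mean(residuals ** 2))   # penalises large misses more
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": mae, "rmse": rmse, "r2": 1.0 - ss_res / ss_tot}


scores = regression_metrics([3, 5, 2, 7], [2, 5, 4, 8])
print(scores)
```

RMSE is never smaller than MAE, and a wide gap between the two, as in the test metrics above, suggests that a few tracts are predicted with large errors.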

Feature importance for the RF model

feature_imp_RF = model_RF.feature_importances_

#relative feature importance  
rel_feature_imp = 100 * (feature_imp_RF / max(feature_imp_RF)) 
rel_feature_imp = pd.DataFrame({'features':list(X_train.columns),
                                'rel_importance':rel_feature_imp })

rel_feature_imp = rel_feature_imp.sort_values('rel_importance', ascending=False)


#plotting the top twenty important features
top20_features = rel_feature_imp.head(20) 

plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features",
                 data=top20_features,
                 palette="Accent_r")

plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()
<Figure size 2000x1000 with 1 Axes>
rel_feature_imp.head()
    features                       rel_importance
5   cbd_dist                       100.000000
77  ACS Wrkrs 16+: Worked at Home  54.125825
33  2019 Millennial Population     38.377447
74  ACS Workers 16+: Bicycle       18.789531
80  2019 Vacant Housing Units      11.898179

The feature importance plot reveals that distance from the city centre (cbd_dist) is the most important predictor of the number of Airbnb properties in a tract. This is expected: since hotel rates near the CBD are quite high, rental income from Airbnb properties would be high as well, so setting up an Airbnb property near the CBD is a lucrative option compared to long-term rental income.

In the table above, the second most important predictor is the number of workers who work at home, followed by the millennial population, i.e. the tracts with the largest number of people aged 25 to 40. One reason might be that this group is more active online and comfortable with internet technologies, which is in a way a prerequisite for setting up an Airbnb property. This is supported by the presence of another interesting predictor in the top twenty: 0.5-0.9 hours of daily online activity.

Next comes the number of workers who commute by bicycle, the fourth most important predictor, followed by the number of vacant housing units. The generation alpha population (people born after 2011), the number of people commuting by subway and the median home value of the tracts are also interesting predictors.
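
The ranking above comes from the impurity-based `feature_importances_` attribute, which is computed on the training data. A complementary check, not used in this notebook, is permutation importance on held-out data via `sklearn.inspection.permutation_importance`. The sketch below runs it on synthetic data where the true drivers are known; the feature names and arrays are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one strong driver, one weak driver, two pure-noise features
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(400, 4))
y_toy = 5.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + 0.1 * rng.normal(size=400)
names = ["strong", "weak", "noise_a", "noise_b"]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=43).fit(X_tr, y_tr)

# Shuffle each feature in turn on the test set and measure the drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = sorted(zip(names, result.importances_mean), key=lambda p: p[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

If both methods agree on the leading features, the interpretation given above rests on firmer ground.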

Gradient Boosting Regressor Model

Trials show that the gradient boosting model performs better with non-scaled data

# GradientBoosting with non scaled data
# this model uses the default parameters of GB algorithm, while the estimators are changed to obtain the best fit 
model_GB_nonscale = GradientBoostingRegressor(n_estimators=500, random_state=60)

# Train the model
model_GB_nonscale.fit(X_train, y_train_array)

# Training metrics for Gradient Boosting Regressor model
print('Training metrics for Gradient Boosting Regressor model using non-scaled data')

ypred_GB_train = model_GB_nonscale.predict(X_train)
print('r-square_GB_Train: ', round(model_GB_nonscale.score(X_train, y_train_array), 2))

mse_GB_train = metrics.mean_squared_error(y_train_array, ypred_GB_train)
print('RMSE_GB_Train: ', round(np.sqrt(mse_GB_train), 4))

mean_absolute_error_GB_train = metrics.mean_absolute_error(y_train_array, ypred_GB_train)
print('MAE_GB_Train: ', round(mean_absolute_error_GB_train, 4))

# Test metrics for Gradient Boosting Regressor model
print('\nTest metrics for Gradient Boosting Regressor model using non-scaled data')

ypred_GB_test = model_GB_nonscale.predict(X_test)
print('r-square_GB_Test: ', round(model_GB_nonscale.score(X_test, y_test_array),2))

mse_GB_test = metrics.mean_squared_error(y_test_array, ypred_GB_test)  
print('RMSE_GB_Test: ', round(np.sqrt(mse_GB_test),4))

mean_absolute_error_GB_test = metrics.mean_absolute_error(y_test_array, ypred_GB_test)
print('MAE_GB_Test: ', round(mean_absolute_error_GB_test, 4))
Training metrics for Gradient Boosting Regressor model using non-scaled data
r-square_GB_Train:  0.99
RMSE_GB_Train:  3.1363
MAE_GB_Train:  2.3217

Test metrics for Gradient Boosting Regressor model using non-scaled data
r-square_GB_Test:  0.81
RMSE_GB_Test:  20.0925
MAE_GB_Test:  9.5355

The result shows that the gradient boosting regressor performs slightly better than the random forest model on the test data, both in terms of mean absolute error (9.54 against 10.27) and r-square (0.81 against 0.77).

Feature Importance of Gradient Boosting Model

#checking the feature importance for the Gradient Boosting regressor
feature_imp_GB = model_GB_nonscale.feature_importances_
rel_feature_imp_GB = 100 * feature_imp_GB / max(feature_imp_GB)
rel_feature_imp_GB = pd.DataFrame({'features':list(X_train.columns),
                                   'rel_importance':rel_feature_imp_GB})
rel_feature_imp_GB = rel_feature_imp_GB.sort_values('rel_importance', ascending=False)
rel_feature_imp_GB.head()
    features                       rel_importance
5   cbd_dist                       100.000000
77  ACS Wrkrs 16+: Worked at Home  57.804037
74  ACS Workers 16+: Bicycle       48.524487
33  2019 Millennial Population     37.625884
80  2019 Vacant Housing Units      24.284889
# Plot  feature importance for the Gradient Boosting regressor
top20_features_GB = rel_feature_imp_GB.head(20) 

plt.figure(figsize=[20,10])
plt.yticks(fontsize=15)
ax = sns.barplot(x="rel_importance", y="features", data = top20_features_GB, palette="Accent_r")
plt.xlabel("Relative Importance", fontsize=25)
plt.ylabel("Features", fontsize=25)
plt.show()
<Figure size 2000x1000 with 1 Axes>

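The feature importances returned by the gradient boosting model are very similar to those returned by the random forest model, which is expected: the same five features lead both rankings, with only bicycle commuters and the millennial population swapping places.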

Running cross validation

The models above were fitted and evaluated on one particular train-test split of the data. How the accuracy holds up across multiple splits remains to be seen. K-fold cross validation addresses this by splitting the data into k different train-test splits and fitting the model on each of them. Hence a 10-fold cross validation is run to check the overall model accuracy, measured here as the mean absolute error of the model fit across the 10 different splits.

# Validating with a 10 fold cross validation for the Gradient Boosting models
y_array = y.values.flatten()

modelGB_cross_val = GradientBoostingRegressor(n_estimators=500, random_state=60) 

modelGB_cross_val_scores = cross_val_score(modelGB_cross_val,
                                           X, 
                                           y_array,
                                           cv=10,
                                           scoring='neg_mean_absolute_error')

print("All Model Scores: ", modelGB_cross_val_scores)

print("Negative Mean Absolute Error: {}".format(np.mean(modelGB_cross_val_scores)))
All Model Scores:  [ -6.8058649  -11.62520912 -11.02025161 -21.53356819  -4.63056795
 -41.11872138 -14.46560863  -9.15669676  -6.10330726  -4.9869393 ]
Negative Mean Absolute Error: -13.1446735095419
# Validating with a 10 fold cross validation for the Random forest models
y_array = y.values.flatten()

modelRF_cross_val = RandomForestRegressor(n_estimators=500, random_state=43)

modelRF_cross_val_scores = cross_val_score(modelRF_cross_val,
                                           X, 
                                           y_array,
                                           cv=10,
                                           scoring='neg_mean_absolute_error')

print("All Model Scores: ", modelRF_cross_val_scores)

print("Negative Mean Absolute Error: {}".format(np.mean(modelRF_cross_val_scores)))
All Model Scores:  [-10.648       -9.98169585 -10.22411982 -22.20942857  -4.53489401
 -37.69979724 -16.13404608 -12.30612963  -6.35166667  -4.42249074]
Negative Mean Absolute Error: -13.45122686038573
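
The fold scores above vary widely, from about -4.4 to -41.1. One possible contributor is that an integer `cv=10` uses consecutive, unshuffled folds for regressors, and the table appears to be ordered by geoid, so each fold may cover a contiguous block of tracts. The sketch below shows how shuffled folds could be requested with `KFold`; it runs on synthetic data (`X_toy`, `y_toy` are hypothetical), so its scores say nothing about the tract data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(300, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(size=300)

# Shuffle rows before splitting so every fold mixes rows from the whole table
cv = KFold(n_splits=10, shuffle=True, random_state=0)

neg_mae = cross_val_score(
    GradientBoostingRegressor(n_estimators=100, random_state=60),
    X_toy,
    y_toy,
    cv=cv,
    scoring="neg_mean_absolute_error",
)

# Flip the sign to report the mean absolute error as a positive number
mae_per_fold = -neg_mae
print("MAE per fold:", np.round(mae_per_fold, 3))
print("Mean MAE:", round(mae_per_fold.mean(), 3))
```

Whether shuffled folds are appropriate depends on the question: unshuffled folds test how well the model transfers to unseen parts of the city, while shuffled folds estimate average accuracy over similar tracts.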

Final Result Visualization

# Plotting a kernel density estimate (KDE) plot of the predicted vs. observed data
plt.figure(figsize=[15,5])

# plotting the prediction
sns.kdeplot(ypred_RF_test, label = 'Predictions', color = 'orange')
y_observed = np.array(y_test).reshape((-1, ))
sns.kdeplot(y_observed, label = 'Observation', color = 'green')

# label the plot
plt.xlabel('No. of Airbnb listings per census tract', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.title('Density Plot: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,500,25), fontsize=10)
plt.yticks(fontsize=10)
plt.legend(fontsize=15)
plt.show()
<Figure size 1500x500 with 1 Axes>
# Converting the predicted and observed values to dataframe and plotting the observed vs predicted
y_test_df = y_test.copy()
y_test_df['Predicted'] = (ypred_RF_test)  
y_test_df.head()
total_airbnbPredicted
2046.006
91063.980
68524.572
45012.000
104428.080
# plotting the actual observed vs predicted airbnb properties by tract
plt.figure(figsize = [25,12])
sns.set(style = 'whitegrid')
sns.lineplot(data = y_test_df, markers=True) 

#label the plot
plt.xlabel('Tract ID', fontsize=15)
plt.ylabel('Total No. of Airbnb', fontsize=15)
plt.title('Actual No. of Airbnb by Tract: Predicted vs Observed', fontsize=15)
plt.xticks(range(0,2000,100), fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
<Figure size 2500x1200 with 1 Axes>

The plot shows that the predicted values closely match the observed values. However, there are instances of underprediction for tracts with an extremely high number of Airbnb properties, and instances of overprediction for some tracts with a comparatively low number of Airbnb properties.
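
The under- and overpredicted tracts can also be listed numerically by computing residuals. A minimal sketch follows, using a made-up table with the same two columns as `y_test_df`; the index values and counts are hypothetical.

```python
import pandas as pd

# Hypothetical observed and predicted counts for five tracts
toy = pd.DataFrame(
    {
        "total_airbnb": [6, 0, 120, 15, 300],
        "Predicted": [6.0, 4.5, 90.0, 30.0, 180.0],
    },
    index=[204, 910, 685, 450, 1044],
)

# Positive residual: the model predicted too few listings (underprediction)
toy["residual"] = toy["total_airbnb"] - toy["Predicted"]

under = toy.nlargest(2, "residual")   # largest underpredictions
over = toy.nsmallest(2, "residual")   # largest overpredictions

print(under)
print(over)
```

Joining such a table back to the tract geometries would allow the largest misses to be mapped.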

Conclusion

The study shows that distance from the CBD is the most important location factor stimulating the creation of Airbnb properties.

The proximity tool from the ArcGIS API for Python was used to perform all the distance estimation, a significant part of the analysis. The other factors returned by the feature importance results could be examined individually. Another Esri capability utilized in the study is Esri's data repository, accessed here via the geoenrichment services. The data enrichment service can provide the analyst with a wide array of data for critical analysis. Further analysis of this dataset will be carried out in the next study.

Summary of methods used

Method            Question                                                          Examples
aggregate_points  How many points within each polygon?                              Counting the number of Airbnb rentals within each NYC tract
Data Enrichment   Which demographic attributes are relevant for the problem?        Population of Millennials for each tract
find_nearest      Which distances from city features are relevant for the problem?  Distance of the CBD from each tract

Data resources

Shapefile        Source                                                    Link
airbnb_nyc2019   NYC Airbnb Data Inside Airbnb: Get the Data               http://insideairbnb.com/get-the-data.html
nyc_tract_fulll  NYC Open Data: 2010 Census Tracts (water areas included)  https://data.cityofnewyork.us/City-Government/2010-Census-Tracts-water-areas-included-/gx7x-82rk
busi_distr       NYC Open Data: Business Improvement Districts             https://data.cityofnewyork.us/Business/Business-Improvement-Districts/ejxk-d93y
cbd              NYC Open Data: Business Improvement Districts             https://data.cityofnewyork.us/Business/Business-Improvement-Districts/ejxk-d93y
bus_stop         NYC Open Data: Bus Stop Shelters                          https://data.cityofnewyork.us/Transportation/Bus-Stop-Shelters/qafz-7myz
hotels           NYC Open Data: Facilities Database                        https://data.cityofnewyork.us/City-Government/Facilities-Database-Shapefile/2fpa-bnsx
railroad         NYC Open Data: Railroad Line                              https://data.cityofnewyork.us/Transportation/Railroad-Line/i7a5-bsik
subwy_rt         NYC Open Data: Subway Lines                               https://data.cityofnewyork.us/Transportation/Subway-Lines/3qz8-muuu
subwy_stn        NYC Open Data: Subway Stations                            https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49
