Geographically weighted regression (GWR)

Performs Geographically Weighted Regression (GWR), which is a local form of linear regression that is used to model spatially varying relationships.

GWR workflow diagram

Usage notes

  • GWR provides a local model of the variable or process you are trying to understand or predict by fitting a regression equation to every point in the DataFrame. GWR constructs these separate equations by incorporating the dependent and explanatory variables within the neighborhood of each target point. There are two options for defining the neighborhood:

    • Number of neighbors—The neighborhood size is a function of a specified number of neighbors included in calculations for each point. Where record locations are dense, the spatial extent of the neighborhood is smaller; where point locations are sparse, the spatial extent of the neighborhood is larger. When you pick this option, select the number of neighbors you want to include. The number should be an integer between 2 and 5000. To use this option, use setNumNeighbors().

    • Distance band—The neighborhood size is a constant or fixed distance for each point. When you pick this option, select the distance band to represent the spatial extent of the neighborhood. To use this option, use setDistanceBand().

  • It is common practice to explore your data globally using the Generalized Linear Regression tool prior to exploring your data locally using the GWR tool.

  • The fields specified with setDependentVariable() and setExplanatoryVariables() should be numeric fields containing a variety of values. There should be variation in these values both globally and locally. For this reason, do not use "dummy" explanatory variables to represent different spatial regimes in your GWR model (such as assigning a value of 1 to census tracts outside the urban core, while all others are assigned a value of 0). Because the GWR tool allows explanatory variable coefficients to vary, these spatial regime explanatory variables are unnecessary, and if included, will create problems with local multicollinearity.

  • The value specified using setLocalWeightingScheme() determines the kernel type that will be used to provide the spatial weighting in the model. The kernel defines how each point is related to other points within its neighborhood.

    • Bisquare—A weight of 0 will be assigned to any point outside the neighborhood specified. This is the default.

    • Gaussian—All points will receive weights, but weights become exponentially smaller when farther away from the target point.

  • In global regression models, such as Generalized Linear Regression, results are unreliable when two or more variables exhibit multicollinearity (when two or more variables are redundant or together tell the same story). The GWR tool builds a local regression equation for each point in the DataFrame. When the values for a particular explanatory variable cluster spatially, it is likely that there are problems with local multicollinearity. The condition number field (COND_ADJ) in the output DataFrame indicates when results are unstable due to local multicollinearity. As a general rule, be skeptical of results for points with a condition number greater than 30; equal to Null; or, for shapefiles, equal to -1.7976931348623158e+308.

  • Use caution when including nominal or categorical data in a GWR model. Where categories cluster spatially, there is strong risk of encountering local multicollinearity issues. The condition number included in the GWR output indicates when local collinearity is a problem (a condition number less than zero, greater than 30, or set to Null). Results in the presence of local multicollinearity are unstable.

  • A regression model is incorrectly specified if it is missing a key explanatory variable. Statistically significant spatial autocorrelation of the regression residuals or unexpected spatial variation among the coefficients of one or more explanatory variables suggests that your model is incorrectly specified. You should make every effort (through GLR residual analysis and GWR coefficient variation analysis, for example) to discover what these key missing variables are so they can be included in the model.

  • Always question whether it makes sense for an explanatory variable to be nonstationary. For example, suppose you are modeling the density of a particular plant species as a function of several variables including ASPECT. If you find that the coefficient for the ASPECT variable changes across the study area, you are likely seeing evidence of a key missing explanatory variable (perhaps prevalence of competing vegetation, for example). You should make every effort to include all key explanatory variables in your regression model.

  • Severe model design issues, or errors indicating that local equations do not include enough neighbors, often indicate a problem with global or local multicollinearity. To determine where the problem is, run a global model using Generalized Linear Regression and examine the VIF value for each explanatory variable. If some of the VIF values are large (above 7.5, for example), global multicollinearity is preventing GWR from solving. More likely, however, local multicollinearity is the problem. If, for example, you are modeling home values and have variables for bedrooms and bathrooms, you may want to combine these to increase value variation, or to represent them as bathroom/bedroom square footage. Avoid using spatial regime dummy variables, spatially clustering categorical or nominal variables, or using variables with very few possible values when constructing GWR models.

  • Geographically Weighted Regression (GWR) is a linear model subject to the same requirements as Generalized Linear Regression. Carefully review the diagnostics explained in How Geographically Weighted Regression works to ensure your GWR model is properly specified. The How regression models go bad section in Regression analysis basics also has information for ensuring your model is accurate.

  • Points with one or more null values in dependent or explanatory fields will be excluded from the output. If needed, you can modify values using the Calculate Field tool.

  • You should inspect the over- and under-predictions evident in your regression residuals to see if they provide clues about potential missing variables from your regression model.

  • When intercept, estimated coefficients, predicted values, residuals, and condition numbers are null, the model potentially has a poor fit. This may exist for one or more points in the model and can be caused by the following reasons:

    • Not enough neighbors. Points with fewer than two neighbors will not have a model fit.
    • Multicollinearity in the model.

In the above cases, the model should be assessed by examining the output diagnostics and potentially refit with different input values and coefficients.
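
The neighborhood and kernel options described in the usage notes can be illustrated with a minimal sketch in plain Python. The kernel formulas below are common textbook forms of the bisquare and Gaussian kernels; the exact formulas used by the tool are not stated here, so treat them as assumptions. The helper names (`bisquare_weight`, `gaussian_weight`, `adaptive_bandwidth`) and the distances are hypothetical and are not part of the GWR API.

```python
import math


def bisquare_weight(distance, bandwidth):
    """Bisquare kernel (textbook form): weight is 0 at or beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1.0 - (distance / bandwidth) ** 2) ** 2


def gaussian_weight(distance, bandwidth):
    """Gaussian kernel (textbook form): weight decays with distance but stays positive."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def adaptive_bandwidth(distances, num_neighbors):
    """Number-of-neighbors style bandwidth: distance to the k-th nearest point."""
    return sorted(distances)[num_neighbors - 1]


# Hypothetical distances (in meters) from one target point to surrounding points
distances = [120.0, 450.0, 800.0, 1500.0, 3000.0]

fixed_band = 1000.0                                  # distance band style neighborhood
adaptive_band = adaptive_bandwidth(distances, 3)     # number of neighbors style neighborhood

bisquare_fixed = [bisquare_weight(d, fixed_band) for d in distances]
gaussian_fixed = [gaussian_weight(d, fixed_band) for d in distances]

for d, wb, wg in zip(distances, bisquare_fixed, gaussian_fixed):
    print(f"distance={d:>7.1f}  bisquare={wb:.4f}  gaussian={wg:.4f}")
```

In this sketch, points beyond the fixed band receive a bisquare weight of 0, while the Gaussian kernel still assigns them a small positive weight. With the number-of-neighbors option, the bandwidth changes from point to point, which is why the spatial extent shrinks where points are dense.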

Limitations

This implementation of Geographically Weighted Regression has the following limitations:

  • You cannot predict to another dataset or create raster coefficient outputs.
  • You cannot model a binary (logistic) variable or count (Poisson value) variable.
  • You cannot define the neighborhood search using Golden Search or Manual Intervals.

Results

The Geographically Weighted Regression tool produces a variety of outputs.

Interpret messages and diagnostics

  • AICc—AICc applies a bias correction to AIC for small sample sizes. AICc will approach AIC as the number of points in the input increases.

  • R-Squared—The R-Squared is a measure of goodness of fit. Its value varies from 0.0 to 1.0, with higher values being preferable. It may be interpreted as the proportion of dependent variable variance accounted for by the regression model. The denominator for the R-Squared computation is the sum of squared dependent variable values. Adding an extra explanatory variable to the model does not alter the denominator but does alter the numerator; this gives the impression of improvement in model fit that may not be real. See Adjusted R-Squared below.
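
To illustrate why AICc approaches AIC for large inputs, the following minimal sketch computes the standard small-sample correction term, 2k(k+1)/(n-k-1), for a model with k estimated parameters and n points. This is the textbook form of the correction; the exact computation used by the tool is not stated here (GWR implementations typically rely on an effective number of parameters), so the function name `aicc_correction` and the values of n and k are illustrative assumptions.

```python
def aicc_correction(n, k):
    """Term added to AIC to obtain AICc (textbook form, assumed for illustration)."""
    if n - k - 1 <= 0:
        raise ValueError("n must be greater than k + 1")
    return (2.0 * k * (k + 1)) / (n - k - 1)


k = 3  # hypothetical number of estimated parameters

# The correction shrinks toward 0 as the number of points grows
corrections = {n: aicc_correction(n, k) for n in (10, 100, 1000, 10000)}
for n, c in corrections.items():
    print(f"n={n:>5}  correction={c:.4f}")
```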

GWR outputs a DataFrame that includes the input explanatory and dependent fields as well as the following additional columns:

Field | Description
Intercept | The intercept.
SE_Intercept | Standard error of the intercept.
C_<explanatory> | Coefficient for the explanatory variable.
SE_<explanatory> | Standard error of the coefficient for the explanatory variable.
PREDICTED_<dependent> | Predicted value for the dependent variable.
RESIDUAL | The residual of the fitted model.
STD_RESID | The standardized residual.
INFLUENCE | The influence.
COOKS_D | The Cook's distance.
LOCALR2 | Local R-squared.
COND_ADJ | Adjusted condition number.
NUM_NBRS | Number of neighbors used.
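
The condition number guidance in the usage notes (be skeptical when the value is Null, less than zero, or greater than 30) can be expressed as a small check. This is a minimal sketch in plain Python; the helper name `is_unstable` and the sample values are hypothetical and are not part of the GWR API.

```python
def is_unstable(cond_adj, threshold=30.0):
    """Return True when a local result should be treated with skepticism."""
    if cond_adj is None:  # Null condition number
        return True
    return cond_adj < 0 or cond_adj > threshold


# Hypothetical COND_ADJ values, for example collected from the output DataFrame
values = [12.4, 45.0, None, 29.9, -1.7976931348623158e+308]
flags = [is_unstable(v) for v in values]

for v, flag in zip(values, flags):
    print(f"COND_ADJ={v}  unstable={flag}")

# On a Spark DataFrame, the same rule could be written as a SQL filter, e.g.:
#   result.where("COND_ADJ IS NULL OR COND_ADJ < 0 OR COND_ADJ > 30")
```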

Performance notes

Improve the performance of Geographically Weighted Regression (GWR) by doing one or more of the following:

  • Only analyze the records in your area of interest. You can pick the records of interest by using one of the following SQL functions:

    • ST_Intersection—Clip to an area of interest represented by a polygon. This will modify your input records.
    • ST_EnvIntersects—Select records that intersect an envelope.
    • ST_Intersects—Select records that intersect another dataset or area of interest represented by a polygon.
  • Decrease the number of neighbors in your calculation.
  • Use the setNumNeighbors() option instead of setDistanceBand().
  • Use fewer fields for the setExplanatoryVariables() value when possible.

Similar capabilities

If you want to calculate a global regression model, use the generalized linear regression tool available through Spark MLlib.

Syntax

For more details, go to the GeoAnalytics Engine API reference for GWR.

Setter | Description | Required
setDependentVariable(dependent_variable) | The numeric field containing the observed values to model. | Yes
run(dataframe) | Runs the GWR tool using the provided DataFrame. | Yes
setDistanceBand(distance_band=None, distance_band_unit=None) | Sets the neighborhood size as a fixed distance for each feature. | One of setDistanceBand() or setNumNeighbors() is required.
setExplanatoryVariables(*explanatory_variables) | Sets one or more fields to represent independent explanatory variables in the model. | Yes
setLocalWeightingScheme(local_weighting_scheme) | Sets the kernel type that will be used to provide the spatial weighting in the model. The kernel defines how each point is related to other points within its neighborhood. Two options are supported: 'Bisquare' (default) and 'Gaussian'. | No
setNumNeighbors(number_of_neighbors) | Sets the neighborhood size as a function of a specified number of neighbors included in calculations for each point. Where points are dense, the spatial extent of the neighborhood is smaller; where points are sparse, the spatial extent of the neighborhood is larger. | One of setDistanceBand() or setNumNeighbors() is required.
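
Before building the tool chain, it can help to check the neighborhood settings against the constraints described on this page: one of a number of neighbors or a distance band is needed, and the number of neighbors should be an integer between 2 and 5000. The following is a minimal sketch in plain Python. The helper `check_neighborhood` is hypothetical and not part of the GWR API, and the requirement that a distance band be positive is an assumption, not something this page states.

```python
def check_neighborhood(num_neighbors=None, distance_band=None):
    """Raise ValueError if the neighborhood settings look invalid (hypothetical helper)."""
    if num_neighbors is None and distance_band is None:
        raise ValueError("Set either a number of neighbors or a distance band.")
    if num_neighbors is not None:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(num_neighbors, bool) or not isinstance(num_neighbors, int):
            raise ValueError("The number of neighbors should be an integer.")
        if not 2 <= num_neighbors <= 5000:
            raise ValueError("The number of neighbors should be between 2 and 5000.")
    if distance_band is not None and distance_band <= 0:
        # Assumption: a distance band must be a positive distance
        raise ValueError("The distance band should be greater than 0.")


def is_valid(**kwargs):
    """Convenience wrapper that returns True or False instead of raising."""
    try:
        check_neighborhood(**kwargs)
        return True
    except ValueError:
        return False


print(is_valid(num_neighbors=30))    # settings used in the example on this page
print(is_valid(num_neighbors=1))     # too few neighbors
print(is_valid())                    # neither option set
```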

Examples

Run Geographically Weighted Regression

Python
# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")

# Imports
from geoanalytics.tools import GWR
from geoanalytics.sql import functions as ST

# Path to the USA weather stations data
data_path = r"https://services2.arcgis.com/j80Jz20at6Bi0thr/arcgis/rest" \
            "/services/station_data_no_missing/FeatureServer/0"

# Create a DataFrame using the March 2012 USA weather stations data and transform geometry
#   to NAD 1983 Contiguous USA Albers (6350)
df = spark.read.format("feature-service").load(data_path) \
                    .where("LST_YRMO == 201203") \
                    .withColumn("shape", ST.transform("shape", 6350))

# Run the GWR tool to predict average monthly temperature for each of the USA weather stations
#   - T_MONTHLY_: Average monthly temperature in Celsius
#   - huss: Near-Surface Specific Humidity
#   - rlut: TOA Outgoing Longwave Radiation (W m-2)
result = GWR() \
            .setExplanatoryVariables("huss", "rlut") \
            .setDependentVariable(dependent_variable="T_MONTHLY_") \
            .setLocalWeightingScheme(local_weighting_scheme="Bisquare") \
            .setNumNeighbors(number_of_neighbors=30) \
            .run(dataframe=df)

# View the first 5 rows of the result DataFrame
result.show(5)
Result
+-----------+--------+----------+-------------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+------------------+-------------------+-------------------+--------------------+------------------+--------------------+--------+--------------------+
|       huss|    rlut|T_MONTHLY_|          Intercept|      SE_Intercept|            C_huss|           SE_huss|              C_rlut|            SE_rlut|PREDICTED_T_MONTHLY_|          RESIDUAL|          STD_RESID|          INFLUENCE|             COOKS_D|           LOCALR2|            COND_ADJ|NUM_NBRS|            geometry|
+-----------+--------+----------+-------------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+------------------+-------------------+-------------------+--------------------+------------------+--------------------+--------+--------------------+
|0.003209477|216.1092|      12.2| -67.05270401373855|  45.7625908684502|1878.5353981973603|1112.7369726085549|  0.3513017434646599|0.22970400926681692|  14.895953500543342|-2.695953500543343|-1.1034508910401566| 0.1750834976657929|0.005907358739974...| 0.837427141886643|2.3168939930203086E8|      30|{"x":2004815.1342...|
|0.002367065|208.0519|       9.5| -66.96336114874066|28.956354940844438|1233.8239714682568|1896.3696422846572|  0.3514890368142858|0.15118112704447417|   9.085140689216473|0.4148593107835268|0.19096659234135496|0.34780404255404695|4.445528753800241E-4|0.8754133327369877| 7.541161325010453E8|      30|{"x":9406.9496510...|
|0.002700314|201.8406|      12.3|-0.3729014011332765|36.911585747939206|  4758.90274715703|2647.1313879440995|-0.00768423232460691|0.20766762428019703|  10.926639843930595|1.3733601560694062| 0.5676508092327976|0.19109488636831884|0.001740067561838...|0.4178317631341004|1.3135900867102544E9|      30|{"x":-966935.8527...|
|0.003468615|218.7625|      14.6| -44.69811128470246|21.712925675142465| 2572.981239662273|2224.6744188712646| 0.22759236362099955|0.12311737526684455|   14.01524379753009|0.5847562024699098|0.23388728461669364| 0.1361712246112763|1.971165110486620...|0.6576686416537227| 7.775674654220295E8|      30|{"x":-563988.2354...|
|0.005393988|221.2435|      24.6| -39.75174675567541| 46.95631042826278| 6436.971645608544|2337.1874707743764| 0.11804130823998094| 0.2560032980668669|  21.085073125396388|3.5149268746036135| 1.3871167022953708|0.11264417840400887|0.005583275456322...|0.5126571020262354|1.6214697384262393E9|      30|{"x":-785789.2369...|
+-----------+--------+----------+-------------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+------------------+-------------------+-------------------+--------------------+------------------+--------------------+--------+--------------------+
only showing top 5 rows

Plot results

Python
# Create a contiguous USA states DataFrame and transform geometry to
# NAD 1983 Contiguous USA Albers projected coordinate system (6350)
states_data_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services" \
                   "/USA_State_Boundaries/FeatureServer/0"
states_subset_df = spark.read.format("feature-service").load(states_data_path) \
                                .where("""STATE_NAME != 'Alaska' and STATE_NAME != 'Hawaii' and
                                          STATE_NAME != 'District of Columbia'""") \
                                .withColumn("shape", ST.transform("shape", 6350))

# Plot the predicted values from the result DataFrame with the USA states data
states_subset_plot = states_subset_df.st.plot(facecolor="none",
                                              edgecolors="lightblue",
                                              figsize=(16,10))
result_plot = result.st.plot(cmap_values="PREDICTED_T_MONTHLY_",
                             legend=True,
                             cmap="Wistia",
                             ax=states_subset_plot)
result_plot.set_title("March 2012 predicted average monthly temperatures (Celsius) for USA weather stations")
result_plot.set_xlabel("X (Meters)")
result_plot.set_ylabel("Y (Meters)");

Plotting example for a Geographically Weighted Regression result. Predicted monthly temperatures for USA weather stations are shown.

Version table

Release | Notes
1.0.0 | Tool introduced
