Geographically weighted regression (GWR)

Performs Geographically Weighted Regression (GWR), which is a local form of linear regression that is used to model spatially varying relationships.

Usage notes

GWR provides a local model of the variable or process you are trying to understand or predict by fitting a regression equation to every point in the DataFrame. GWR constructs these separate equations by incorporating the dependent and explanatory variables within the neighborhood of each target point. There are two options for defining the neighborhood:
- Number of neighbors—The neighborhood size is a function of a specified number of neighbors included in calculations for each point. Where record locations are dense, the spatial extent of the neighborhood is smaller; where point locations are sparse, the spatial extent of the neighborhood is larger. When you pick this option, select the number of neighbors you want to include. The number should be an integer between 2 and 5000. To use this option, use setNumNeighbors().
- Distance band—The neighborhood size is a constant or fixed distance for each point. When you pick this option, select the distance band to represent the spatial extent of the neighborhood. To use this option, use setDistanceBand().
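
The difference between the two neighborhood options can be illustrated with a small pure-Python sketch. This is not the tool's implementation: the helper names `neighbors_by_count` and `neighbors_by_distance`, the planar Euclidean distance, and the sample coordinates are all assumptions made for illustration.

```python
import math

def _distance(a, b):
    """Planar Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def neighbors_by_count(target, points, k):
    """Hypothetical helper: the k closest points, so the spatial extent adapts to density."""
    return sorted(points, key=lambda p: _distance(target, p))[:k]

def neighbors_by_distance(target, points, band):
    """Hypothetical helper: every point within a fixed distance of the target."""
    return [p for p in points if _distance(target, p) <= band]

# Made-up sample points at distances 1, 2, 3, 10, and 20 from the target
points = [(1, 0), (0, 2), (3, 0), (0, 10), (20, 0)]
target = (0, 0)

nearest = neighbors_by_count(target, points, k=3)
within = neighbors_by_distance(target, points, band=2.5)
print(nearest)
print(within)
```

With a neighbor count, every point gets the same number of neighbors regardless of how far away they are. With a distance band, the number of neighbors varies from point to point, and sparse areas may end up with too few.
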
It is common practice to explore your data globally using the Generalized Linear Regression tool prior to exploring your data locally using the GWR tool.
The fields specified with setDependentVariable() and setExplanatoryVariables() should be numeric fields containing a variety of values. There should be variation in these values both globally and locally. For this reason, do not use "dummy" explanatory variables to represent different spatial regimes in your GWR model (such as assigning a value of 1 to census tracts outside the urban core, while all others are assigned a value of 0). Because the GWR tool allows explanatory variable coefficients to vary, these spatial regime explanatory variables are unnecessary, and if included, will create problems with local multicollinearity.
The value specified using setLocalWeightingScheme() determines the kernel type that will be used to provide the spatial weighting in the model. The kernel defines how each point is related to other points within its neighborhood.
- Bisquare—A weight of 0 will be assigned to any point outside the neighborhood specified. This is the default.
- Gaussian—All points will receive weights, but weights become exponentially smaller when farther away from the target point.
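
As a rough illustration of how the two kernels behave, the sketch below uses the textbook bisquare and Gaussian weight functions. The exact formulas and the bandwidth definition used by the tool are not stated here, so the functions `bisquare_weight` and `gaussian_weight` and the bandwidth value are assumptions, not the tool's implementation.

```python
import math

def bisquare_weight(distance, bandwidth):
    """Textbook bisquare kernel: the weight is exactly 0 at and beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1.0 - (distance / bandwidth) ** 2) ** 2

def gaussian_weight(distance, bandwidth):
    """Textbook Gaussian kernel: the weight decays with distance but never reaches 0."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

bandwidth = 10.0  # assumed neighborhood size, in the same units as the distances
for d in (0.0, 5.0, 10.0, 20.0):
    print(d, round(bisquare_weight(d, bandwidth), 4), round(gaussian_weight(d, bandwidth), 4))
```

Under these assumed formulas, a point at twice the bandwidth contributes nothing with the bisquare kernel but still carries a small positive weight with the Gaussian kernel.
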
To view the model diagnostics (R-Squared, AICc, etc.) for the Geographically Weighted Regression, use runIncludeDiagnostics(). This will return a named tuple with the result DataFrame and a dictionary containing the model diagnostics. To return only the result DataFrame use run().
In global regression models, such as Generalized Linear Regression, results are unreliable when two or more variables exhibit multicollinearity (when two or more variables are redundant or together tell the same story). The GWR tool builds a local regression equation for each point in the DataFrame. When the values for a particular explanatory variable cluster spatially, it is likely that there are problems with local multicollinearity. The condition number field (COND_ADJ) in the output DataFrame indicates when results are unstable due to local multicollinearity. As a general rule, be skeptical of results for points with a condition number greater than 30; equal to Null; or, for shapefiles, equal to -1.7976931348623158e+308.
Use caution when including nominal or categorical data in a GWR model. Where categories cluster spatially, there is strong risk of encountering local multicollinearity issues. The condition number included in the GWR output indicates when local collinearity is a problem (a condition number less than zero, greater than 30, or set to Null). Results in the presence of local multicollinearity are unstable.
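
The rule of thumb for condition numbers can be written as a simple predicate. The sketch below is plain Python over made-up rows. The helper name `is_unstable` and the sample values are hypothetical; only the thresholds (Null, less than zero, greater than 30) come from the notes above.

```python
def is_unstable(cond_adj, threshold=30.0):
    """Return True when a COND_ADJ value suggests local multicollinearity."""
    return cond_adj is None or cond_adj < 0 or cond_adj > threshold

# Made-up rows standing in for records of the output DataFrame
rows = [
    {"id": 1, "COND_ADJ": 12.4},
    {"id": 2, "COND_ADJ": 45.0},
    {"id": 3, "COND_ADJ": None},
    {"id": 4, "COND_ADJ": -1.7976931348623158e+308},
]

suspect_ids = [row["id"] for row in rows if is_unstable(row["COND_ADJ"])]
print(suspect_ids)
```

Against the output DataFrame, the same rule could presumably be expressed as a Spark SQL filter such as `COND_ADJ IS NULL OR COND_ADJ < 0 OR COND_ADJ > 30`.
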
A regression model is incorrectly specified if it is missing a key explanatory variable. Statistically significant spatial autocorrelation of the regression residuals or unexpected spatial variation among the coefficients of one or more explanatory variables suggests that your model is incorrectly specified. You should make every effort (through GLR residual analysis and GWR coefficient variation analysis, for example) to discover what these key missing variables are so they can be included in the model.
Always question whether it makes sense for an explanatory variable to be nonstationary. For example, suppose you are modeling the density of a particular plant species as a function of several variables including ASPECT. If you find that the coefficient for the ASPECT variable changes across the study area, you are likely seeing evidence of a key missing explanatory variable (perhaps prevalence of competing vegetation, for example). You should make every effort to include all key explanatory variables in your regression model.
Severe model design issues, or errors indicating that local equations do not include enough neighbors, often indicate a problem with global or local multicollinearity. To determine where the problem is, run a global model using Generalized Linear Regression and examine the VIF value for each explanatory variable. If some of the VIF values are large (above 7.5, for example), global multicollinearity is preventing GWR from solving. More likely, however, local multicollinearity is the problem. If, for example, you are modeling home values and have variables for bedrooms and bathrooms, you may want to combine these to increase value variation, or to represent them as bathroom/bedroom square footage. Avoid using spatial regime dummy variables, spatially clustering categorical or nominal variables, or using variables with very few possible values when constructing GWR models.
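
To make the VIF check concrete, the sketch below computes the variance inflation factor for the special case of exactly two explanatory variables, where VIF = 1 / (1 - r^2) and r is their Pearson correlation. The helper names and the bedroom and bathroom counts are made up for illustration. With more than two variables, each VIF requires regressing one variable on all the others, which this sketch does not do.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_variables(xs, ys):
    """VIF shared by both variables in a two-variable model.

    Raises ZeroDivisionError if the variables are perfectly correlated.
    """
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical home data: bedrooms and bathrooms tend to rise together
bedrooms = [2, 3, 3, 4, 5, 5]
bathrooms = [1, 2, 2, 3, 3, 4]

vif = vif_two_variables(bedrooms, bathrooms)
print(round(vif, 2), vif > 7.5)
```

For this made-up sample the VIF exceeds 7.5, which is the kind of redundancy that suggests combining the two variables or replacing them with a single measure.
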
Geographically Weighted Regression (GWR) is a linear model subject to the same requirements as Generalized Linear Regression. Carefully review the diagnostics explained in How Geographically Weighted Regression works to ensure your GWR model is properly specified. The How regression models go bad section in Regression analysis basics also has information for ensuring your model is accurate.
Points with one or more null values in dependent or explanatory fields will be excluded from the output. If needed, you can modify values using the Calculate Field tool.
You should inspect the over- and under-predictions evident in your regression residuals to see if they provide clues about potential missing variables from your regression model.
When the intercept, estimated coefficients, predicted values, residuals, and condition numbers are null, the model potentially has a poor fit. This can occur for one or more points in the model, for the following reasons:
- Not enough neighbors. Points with fewer than two neighbors will not have a model fit.
- Multicollinearity in the model.

In the above cases, assess the model by examining the output diagnostics and, if needed, refit it with different parameter values or explanatory variables.

Limitations

This implementation of Geographically Weighted Regression has the following limitations:

You cannot predict to another dataset or create raster coefficient outputs.
You cannot model a binary (logistic) variable or count (Poisson value) variable.
You cannot define the neighborhood search using Golden Search or Manual Intervals.

Results

The Geographically Weighted Regression tool produces a variety of outputs.

Output DataFrame columns

GWR outputs a DataFrame that includes the input explanatory and dependent fields as well as the following additional columns:

Field	Description
`Intercept`	The intercept.
`SE_Intercept`	Standard error of the intercept.
`C_<explanatory>`	Coefficient for the explanatory variable.
`SE_<explanatory>`	Standard error of the coefficient for the explanatory variable.
`PREDICTED_<dependent>`	Predicted value for the dependent variable.
`RESIDUAL`	The residual of the fitted model.
`STD_RESID`	The standardized residual.
`INFLUENCE`	The influence.
`COOKS_D`	The Cook's distance.
`LOCALR2`	Local R-squared.
`COND_ADJ`	Adjusted condition number.
`NUM_NBRS`	Number of neighbors used.

Model diagnostics

If runIncludeDiagnostics() is used, GWR outputs a dictionary containing the following model diagnostics:

Diagnostic	Description
`R2`	R-Squared is a measure of goodness of fit. Its value varies from 0.0 to 1.0, with higher values being preferable. It may be interpreted as the proportion of dependent variable variance accounted for by the regression model. The denominator for the R-Squared computation is the sum of squared dependent variable values. Adding an extra explanatory variable to the model does not alter the denominator but does alter the numerator; this gives the impression of improvement in model fit that may not be real. See Adjusted R-Squared below.
`AdjR2`	Adjusted R-Squared. Because of the problem described above for the R-Squared value, calculations for the Adjusted R-Squared value normalize the numerator and denominator by their degrees of freedom. This has the effect of compensating for the number of variables in a model, and consequently, the Adjusted R-Squared value is almost always less than the R-Squared value. However, in making this adjustment, you lose the interpretation of the value as a proportion of the variance explained. In GWR, the effective number of degrees of freedom is a function of the neighborhood used, so the adjustment may be marked in comparison to a global model such as that used by Generalized Linear Regression. For this reason, AICc is preferred as a means of comparing models.
`AICc`	Akaike's Information Criterion corrected (AICc) is a measure of model performance and can be used to compare regression models. Taking into account model complexity, the model with the lower AICc value provides a better fit to the observed data. AICc is not an absolute measure of goodness of fit but is useful for comparing models with different explanatory variables as long as they apply to the same dependent variable. If the AICc values for two models differ by more than 3, the model with the lower AICc value is considered to be better. Comparing the GWR AICc value to the generalized linear regression (GLR) AICc value is one way to assess the benefits of moving from a global model (GLR) to a local regression model (GWR). See Gollini et al. for the formulas used to compute AICc for all model types.
`Sigma2`	Sigma-Squared is the least-squares estimate of the variance (standard deviation squared) for the residuals. Smaller values of this statistic are preferable. This value is the normalized residual sum of squares in which the residual sum of squares is divided by the effective degrees of freedom of the residuals. Sigma-Squared is used for AICc computations.
`EDoF`	The Effective Degrees of Freedom value reflects a tradeoff between the variance of the fitted values and the bias in the coefficient estimates and is related to the choice of neighborhood size. As the neighborhood approaches infinity, the geographic weights for every feature approach 1, and the coefficient estimates will be very close to those for a global generalized linear regression model. For very large neighborhoods, the effective number of coefficients approaches the actual number; local coefficient estimates will have a small variance but will be biased. Conversely, as the neighborhood gets smaller and approaches zero, the geographic weights for every feature approach zero except for the regression point. For extremely small neighborhoods, the effective number of coefficients is the number of observations, and the local coefficient estimates will have a large variance but a low bias. The effective number is used to compute many other diagnostic measures.
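
Two of the rules in the table above can be sketched in a few lines of Python: Sigma-Squared as the residual sum of squares divided by the effective degrees of freedom, and the guideline that a model is considered better when its AICc is lower by more than 3. The function names and the numbers passed to them are hypothetical, apart from the GWR AICc value, which is taken from the example result on this page.

```python
def sigma_squared(residual_sum_of_squares, effective_dof):
    """Normalized residual sum of squares, as described for Sigma2."""
    return residual_sum_of_squares / effective_dof

def compare_aicc(aicc_a, aicc_b, min_difference=3.0):
    """Return 'a' or 'b' for the better model, or None when the AICc values
    differ by no more than min_difference."""
    if abs(aicc_a - aicc_b) <= min_difference:
        return None
    return "a" if aicc_a < aicc_b else "b"

glr_aicc = 1100.0      # hypothetical AICc from a global GLR model
gwr_aicc = 1046.0627   # AICc reported in the example GWR result

better = compare_aicc(glr_aicc, gwr_aicc)
print("GWR preferred" if better == "b" else "No clear preference for GWR")

# Hypothetical residual sum of squares and effective degrees of freedom
print(sigma_squared(1000.0, 125.0))
```

Returning None for small differences keeps the comparison from overstating a preference when the two AICc values are effectively tied.
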

Performance notes

Improve the performance of Geographically Weighted Regression (GWR) by doing one or more of the following:

Only analyze the records in your area of interest. You can pick the records of interest by using one of the following SQL functions:
- ST_Intersection—Clip to an area of interest represented by a polygon. This will modify your input records.
- ST_BboxIntersects—Select records that intersect an envelope.
- ST_EnvIntersects—Select records having an envelope that intersects the envelope of another geometry.
- ST_Intersects—Select records that intersect another dataset or area of interest represented by a polygon.
Decrease the number of neighbors in your calculation.
Use the setNumNeighbors() option instead of setDistanceBand().
Use fewer fields for the setExplanatoryVariables() value when possible.

Similar capabilities

If you want to calculate a global regression model, use the generalized linear regression tool available through Spark MLlib.

Syntax

For more details, go to the GeoAnalytics Engine API reference for GWR.

Setter	Description	Required
`setDependentVariable(dependent_variable)`	The numeric field containing the observed values to model.	Yes
`run(dataframe)`	Runs the GWR tool using the provided DataFrame and returns a result DataFrame.	One of `run()` or `runIncludeDiagnostics()` is required.
`runIncludeDiagnostics(dataframe)`	Runs the GWR tool using the provided DataFrame. Returns a named tuple containing the result DataFrame and a dictionary of model diagnostics.	One of `run()` or `runIncludeDiagnostics()` is required.
`setDistanceBand(distance_band=None, distance_band_unit=None)`	Sets the neighborhood size as a fixed distance for each feature.	One of `setDistanceBand()` or `setNumNeighbors()` is required.
`setExplanatoryVariables(*explanatory_variables)`	Sets one or more fields to represent independent explanatory variables in the model.	Yes
`setLocalWeightingScheme(local_weighting_scheme)`	Sets the kernel type that will be used to provide the spatial weighting in the model. The kernel defines how each point is related to other points within its neighborhood. Two options are supported: `'Bisquare'` (default) and `'Gaussian'`.	No
`setNumNeighbors(number_of_neighbors)`	Sets the neighborhood size as a function of a specified number of neighbors included in calculations for each point. Where points are dense, the spatial extent of the neighborhood is smaller; where points are sparse, the spatial extent of the neighborhood is larger.	One of `setDistanceBand()` or `setNumNeighbors()` is required.

Examples

Run Geographically Weighted Regression

Python
# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")

# Imports
from geoanalytics.tools import GWR
from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F

# Path to the USA weather stations data
data_path = r"https://services2.arcgis.com/j80Jz20at6Bi0thr/arcgis/rest" \
            "/services/station_data_no_missing/FeatureServer/0"

# Create a DataFrame using the March 2012 USA weather stations data and transform geometry
#   to NAD 1983 Contiguous USA Albers (6350)
df = spark.read.format("feature-service").load(data_path) \
                    .where("LST_YRMO == 201203") \
                    .withColumn("shape", ST.transform("shape", 6350))

# Run the GWR tool to predict average monthly temperature for each of the USA weather stations
#   - T_MONTHLY_: Average monthly temperature in Celsius
#   - huss: Near-Surface Specific Humidity
#   - rlut: TOA Outgoing Longwave Radiation (W m-2)
result = GWR() \
            .setExplanatoryVariables("huss", "rlut") \
            .setDependentVariable(dependent_variable="T_MONTHLY_") \
            .setLocalWeightingScheme(local_weighting_scheme="Bisquare") \
            .setNumNeighbors(number_of_neighbors=30) \
            .runIncludeDiagnostics(dataframe=df)

# View the 5 rows of the result DataFrame with the highest local R-squared
result.outputTrained.select(F.round("huss", 3).alias("huss"),
                            F.round("rlut", 3).alias("rlut"),
                            F.round("T_MONTHLY_", 3).alias("T_MONTHLY_"),
                            F.round("Intercept", 3).alias("Intercept"),
                            F.round("SE_Intercept", 3).alias("SE_Intercept"),
                            F.round("C_huss", 3).alias("C_huss"),
                            F.round("SE_huss", 3).alias("SE_huss"),
                            F.round("C_rlut", 3).alias("C_rlut"),
                            F.format_string("%.3e", F.col("COND_ADJ").cast("float")).alias("COND_ADJ"),
                            F.round("NUM_NBRS", 3).alias("NUM_NBRS"),
                            F.round("LOCALR2", 3).alias("LOCALR2"),
                            "NUM_NBRS", "geometry") \
                      .sort("LocalR2", ascending=False).show(5)

# View the model diagnostics
for k, v in result.modelDiagnostics.items():
    print(f"| {k} | {v:.4f} |")

Result
+-----+-------+----------+---------+------------+--------+--------+------+---------+--------+-------+--------+--------------------+
| huss|   rlut|T_MONTHLY_|Intercept|SE_Intercept|  C_huss| SE_huss|C_rlut| COND_ADJ|NUM_NBRS|LOCALR2|NUM_NBRS|            geometry|
+-----+-------+----------+---------+------------+--------+--------+------+---------+--------+-------+--------+--------------------+
|0.002|207.458|       9.4|   -55.67|       23.43|3280.889|1017.383| 0.279|2.846e+08|      30|  0.959|      30|{"x":699758.25892...|
|0.003|216.765|      14.4|  -51.669|      23.223|3398.201| 968.569| 0.259|2.224e+08|      30|  0.957|      30|{"x":464907.46887...|
|0.002|211.781|      11.2|  -67.349|      29.289|2727.166|   999.7| 0.342|2.143e+08|      30|  0.946|      30|{"x":887328.35535...|
|0.003|222.099|      16.5|  -57.705|      26.136| 2928.91| 892.178| 0.295|1.429e+08|      30|  0.942|      30|{"x":588862.78255...|
|0.003|211.452|      11.3|  -57.479|      26.842|2568.137|1535.328| 0.294|6.090e+08|      30|  0.939|      30|{"x":232780.89725...|
+-----+-------+----------+---------+------------+--------+--------+------+---------+--------+-------+--------+--------------------+
only showing top 5 rows

| R2 | 0.8544 |
| AdjR2 | 0.7995 |
| AICc | 1046.0627 |
| Sigma2 | 7.2362 |
| EDoF | 150.5360 |

Plot results

Python
# Create a contiguous USA states DataFrame and transform geometry to
# NAD 1983 Contiguous USA Albers projected coordinate system (6350)
states_data_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest/services" \
                   "/USA_State_Boundaries/FeatureServer/0"
states_subset_df = spark.read.format("feature-service").load(states_data_path) \
    .where("""STATE_NAME != 'Alaska' and STATE_NAME != 'Hawaii' and
              STATE_NAME != 'District of Columbia'""") \
    .withColumn("shape", ST.transform("shape", 6350))

# Plot the predicted values from the result DataFrame with the USA states data
states_subset_plot = states_subset_df.st.plot(facecolor="none",
                                              edgecolors="lightblue",
                                              figsize=(16, 10),
                                              basemap="light")
result_plot = result.outputTrained.st.plot(cmap_values="PREDICTED_T_MONTHLY_",
                                           legend=True,
                                           legend_kwds={"shrink": 0.8},
                                           cmap="Wistia",
                                           ax=states_subset_plot)
result_plot.set_title("March 2012 predicted average monthly temperatures (Celsius) for USA weather stations")
result_plot.set_xlabel("X (Meters)")
result_plot.set_ylabel("Y (Meters)");

Plotting example for a Geographically Weighted Regression result. Predicted monthly temperatures for USA weather stations are shown.

Version table

Release	Notes
1.0.0	Tool introduced
1.4.0	Added support for returning global model diagnostics.