Building Rule Engines with GeoPandas

Spatial data quality failures — invalid polygons, coordinate reference system (CRS) mismatches, topology violations, missing mandatory attributes — propagate silently into analytical models and public-facing maps unless intercepted by automated validation. A well-structured rule engine gives GIS analysts, QA engineers, data stewards, and platform teams a repeatable, auditable mechanism to intercept those failures before they reach production. This guide covers every layer of that engine, from environment setup through distributed scale-out, and sits within the broader Validation Pipeline Architecture as the rule-evaluation stage of the directed acyclic graph (DAG).

Prerequisites

Confirm every item below before writing rule logic. Skipping these steps is the most common cause of silent numeric errors and non-reproducible validation results.

Python 3.10+ with locked dependencies. Pin geopandas==0.14.*, shapely==2.0.*, pandas==2.2.*, pyproj==3.6.*, and numpy==1.26.* in pyproject.toml or requirements.txt. GeoPandas 0.14 uses Shapely 2.0 as its vectorised geometry engine when Shapely 2.0 is installed; environments that fall back to Shapely 1.8 or PyGEOS can produce different is_valid results on some degenerate geometries.
PROJ data directories verified. Run python -c "import pyproj; print(pyproj.datadir.get_data_dir())" and confirm the path exists. Missing PROJ data or datum grid files can make to_crs() silently fall back to a less accurate transformation, or return non-finite (inf) coordinates, rather than raising an error.
Source data in a supported vector format. GeoPackage, GeoJSON, or Parquet with a geometry column are preferred. Shapefiles truncate attribute names to 10 characters, cap text values at 254 characters, and limit each component file to 2 GB, any of which can corrupt rule inputs.
CRS documented for every input layer. Know the canonical target CRS (e.g., EPSG:4326 for global web mapping, EPSG:3857 for web tiles, or a local projected CRS for cadastral data) before authoring rules. Coordinate Reference System Precision Standards covers how to set and enforce decimal precision during CRS normalisation.
Attribute schema documented and stable. Rules that reference column names break silently if a schema migration renames a field. Validate DataFrame schema with pandera or pydantic at engine entry so missing columns raise an explicit error rather than a downstream KeyError.
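
The version pins above can also be checked at start-up rather than trusted to the lock file. The following is a minimal sketch using only the standard library; PINS and check_pins are hypothetical names, and the simple prefix match is an assumption that stands in for full version-specifier parsing.

```python
from importlib import metadata

# Hypothetical pin table mirroring the versions listed in the prerequisites
PINS = {
    "geopandas": "0.14.",
    "shapely": "2.0.",
    "pandas": "2.2.",
    "pyproj": "3.6.",
    "numpy": "1.26.",
}

def check_pins(pins, get_version=metadata.version):
    """Return a list of problems; an empty list means every pin matches."""
    problems = []
    for package, prefix in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        # Prefix comparison only: "0.14." matches 0.14.0, 0.14.4, and so on
        if not installed.startswith(prefix):
            problems.append(f"{package}: installed {installed}, expected {prefix}*")
    return problems
```

Calling check_pins(PINS) at engine start-up and refusing to run when the list is non-empty is one way to make version drift visible before any rule executes.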

Conceptual Foundation

A spatial rule engine is a structured execution loop: it applies a prioritised set of validation functions to a GeoDataFrame and collects per-feature, per-rule outcomes into a uniform result schema. The engine itself carries no spatial logic — it only orchestrates function calls, captures exceptions, and assembles results. Spatial logic lives exclusively in the rule functions, making each rule independently testable and replaceable.

Three principles govern good rule-engine design:

Separation of concerns. The engine (orchestration) is distinct from rules (spatial logic), which are distinct from routing (severity classification). This separation means you can add a new rule without touching the engine, change severity thresholds without rewriting rules, and swap the routing backend (from a log file to a Kafka topic) without modifying either.
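
To illustrate how routing can stay outside both the engine and the rules, here is a small sketch that dispatches failed rows to per-severity handlers. The function route_failures and the handler dictionary are hypothetical names, and the in-memory lists stand in for whatever backend (log file, message topic) a team actually uses.

```python
import pandas as pd

def route_failures(results: pd.DataFrame, handlers: dict) -> None:
    """Send failed rows to the handler registered for their severity."""
    failures = results[~results["passed"]]
    for severity, rows in failures.groupby("severity"):
        handler = handlers.get(severity)
        if handler is not None:
            handler(rows)

# In-memory lists stand in for a real backend such as a log file or a queue
blockers, warnings_seen = [], []
handlers = {
    "blocker": lambda rows: blockers.extend(rows["feature_id"]),
    "warning": lambda rows: warnings_seen.extend(rows["feature_id"]),
}

results = pd.DataFrame({
    "feature_id": ["a", "b", "c"],
    "rule_name": ["valid_topologies"] * 3,
    "passed": [True, False, False],
    "severity": ["informational", "blocker", "warning"],
})
route_failures(results, handlers)
print(blockers, warnings_seen)
```

Swapping the backend means replacing the values in handlers; neither the rules nor the engine need to change.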

Vectorisation over iteration. GeoPandas and Shapely 2.0 execute geometry operations on arrays via GEOS’s C layer. A rule that calls gdf.geometry.is_valid runs an order of magnitude faster than a rule that calls gdf.apply(lambda r: r.geometry.is_valid). The two-phase filter pattern — bounding-box pre-filter using sindex.query, then exact predicate — applies the same principle to spatial joins: filter cheaply first, then pay the exact-predicate cost only on the reduced candidate set.

Graceful degradation. A rule that raises an unhandled exception should mark all features as failed for that rule and continue to the next rule. One malformed geometry, one unexpected null, or one projection edge case must not suppress results from the remaining rule set. This property is essential for production pipelines where input quality is inherently variable.
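
The isolation pattern can be shown without any spatial code at all. The sketch below uses plain pandas and a hypothetical run_rules helper; it is a reduced illustration of the idea, not the engine built later in this guide.

```python
import pandas as pd

def run_rules(df, rules):
    """Apply each (name, check) pair; a raising rule fails every row but never stops the loop."""
    rows = []
    for name, check in rules:
        try:
            passed = check(df).reindex(df.index, fill_value=False)
            reasons = [None if ok else f"Failed rule: {name}" for ok in passed]
        except Exception as exc:
            # The rule itself broke: fail every feature for this rule and move on
            passed = pd.Series(False, index=df.index)
            reasons = [f"Rule execution error: {exc}"] * len(df)
        for idx, ok, reason in zip(df.index, passed, reasons):
            rows.append({
                "feature_id": str(idx),
                "rule_name": name,
                "passed": bool(ok),
                "failure_reason": reason,
            })
    return pd.DataFrame(rows)

df = pd.DataFrame({"height": [3.0, None]}, index=["a", "b"])
rules = [
    ("broken_rule", lambda d: d["missing_column"] > 0),  # raises KeyError
    ("height_present", lambda d: d["height"].notna()),
]
out = run_rules(df, rules)
print(out)
```

The broken rule produces two failed rows carrying the exception text, and the second rule still runs and reports its own results.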

The Open Geospatial Consortium (OGC) Simple Features Specification defines the authoritative vocabulary for geometry validity: a polygon is valid when its rings are closed and non-self-intersecting, do not cross one another, and enclose a connected interior. Ring orientation is not part of the validity test that GEOS (and therefore is_valid) applies, so check it with a separate rule if your data model requires it. Rules that check geometry validity should test against these definitions, not against heuristics, so that failures map unambiguously to documented standards. For deeper background on what makes a geometry structurally invalid per OGC rules, Understanding OGC Topology Rules is the recommended reference.
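
A short experiment makes the validity vocabulary concrete. This sketch uses Shapely's is_valid and explain_validity on three hand-built polygons (the variable names are illustrative only); it also shows that, in GEOS, winding order alone does not make a polygon invalid.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity

counter_clockwise = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
clockwise = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])  # edges cross in the middle

for name, geom in [
    ("counter_clockwise", counter_clockwise),
    ("clockwise", clockwise),
    ("bowtie", bowtie),
]:
    # explain_validity returns a human-readable reason from GEOS
    print(name, geom.is_valid, explain_validity(geom))
```

Recording the explain_validity text as the failure_reason of a topology rule ties each failure to a documented validity condition rather than to a generic message.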

Step-by-Step Implementation

Step 1: Define the ValidationResult Contract

Every rule must return outcomes in an identical shape. Enforce this with a dataclass so type checkers and test suites can verify conformance.

from dataclasses import dataclass
from typing import Any

@dataclass
class ValidationResult:
    feature_id: str
    rule_name: str
    passed: bool
    failure_reason: str | None = None
    severity: str = "warning"          # "blocker" | "warning" | "informational"
    reference_geometry: Any | None = None

Verification: instantiate one result and confirm asdict() serialises cleanly to JSON — this is the schema your routing and reporting stages will consume.

from dataclasses import asdict
import json

sample = ValidationResult(
    feature_id="parcel_001",
    rule_name="no_null_geometries",
    passed=False,
    failure_reason="geometry is None",
    severity="blocker"
)
print(json.dumps(asdict(sample), default=str))
# Expected: {"feature_id": "parcel_001", "rule_name": "no_null_geometries",
#            "passed": false, "failure_reason": "geometry is None",
#            "severity": "blocker", "reference_geometry": null}

Step 2: Build the Rule Engine Skeleton

Keep the engine as thin as possible: its only responsibilities are CRS enforcement, rule dispatch, and exception isolation.

import geopandas as gpd
import pandas as pd
from typing import Callable

RuleFn = Callable[[gpd.GeoDataFrame], pd.Series]

class SpatialRuleEngine:
    """
    Executes a prioritised list of spatial validation rules against a GeoDataFrame.
    Each rule must return a boolean Series aligned to gdf.index.
    """

    def __init__(
        self,
        rules: list[dict],   # [{"name": str, "check": RuleFn, "severity": str}, ...]
        target_epsg: int = 4326,
    ):
        self.rules = rules
        self.target_epsg = target_epsg

    def _enforce_crs(self, gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
        if gdf.crs is None:
            raise ValueError(
                "Input GeoDataFrame has no CRS. Set it explicitly before passing "
                "to the rule engine — never assume a projection."
            )
        if gdf.crs.to_epsg() != self.target_epsg:
            source_epsg = gdf.crs.to_epsg()
            gdf = gdf.to_crs(epsg=self.target_epsg)
            print(f"[CRS] Reprojected from EPSG:{source_epsg} → EPSG:{self.target_epsg}")
        return gdf

    def validate(self, gdf: gpd.GeoDataFrame) -> pd.DataFrame:
        gdf = self._enforce_crs(gdf)
        results: list[ValidationResult] = []

        for rule in self.rules:
            rule_name = rule["name"]
            check_fn: RuleFn = rule["check"]
            severity = rule.get("severity", "warning")

            try:
                passed_series: pd.Series = check_fn(gdf)
                # Ensure alignment with gdf index; features the rule did not
                # report on are treated as failed
                passed_series = passed_series.reindex(gdf.index, fill_value=False)

                # Collect this rule's results separately so that an exception
                # part-way through cannot leave partial rows behind
                rule_results = []
                for idx in gdf.index:
                    passed = bool(passed_series.loc[idx])
                    rule_results.append(ValidationResult(
                        feature_id=str(idx),  # the index label identifies the feature
                        rule_name=rule_name,
                        passed=passed,
                        failure_reason=None if passed else f"Failed rule: {rule_name}",
                        severity=severity if not passed else "informational",
                    ))

            except Exception as exc:
                # Graceful degradation: record failure for every feature in this rule
                rule_results = [
                    ValidationResult(
                        feature_id=str(idx),
                        rule_name=rule_name,
                        passed=False,
                        failure_reason=f"Rule execution error: {exc}",
                        severity="blocker",
                    )
                    for idx in gdf.index
                ]

            results.extend(rule_results)

        return pd.DataFrame([vars(r) for r in results])

Verification: run the engine against an empty GeoDataFrame with a known CRS; confirm it returns an empty DataFrame rather than raising.

import geopandas as gpd
from shapely.geometry import Point

empty_gdf = gpd.GeoDataFrame({"geometry": []}, crs="EPSG:4326")
engine = SpatialRuleEngine(rules=[], target_epsg=4326)
result = engine.validate(empty_gdf)
assert result.empty, "Engine must return empty DataFrame for empty input and no rules"

Step 3: Author Vectorised Rule Functions

Write each rule as a pure function returning a boolean Series. Apply cheap checks first to serve as pre-filters.

import geopandas as gpd
import pandas as pd
from shapely.geometry import box
from shapely.validation import explain_validity

# --- Attribute checks (cheapest — run first) ---

def rule_no_null_geometries(gdf: gpd.GeoDataFrame) -> pd.Series:
    """Fails features where the geometry column is None or empty."""
    return gdf.geometry.notna() & ~gdf.geometry.is_empty

def rule_required_attributes(
    gdf: gpd.GeoDataFrame,
    required_cols: list[str],
) -> pd.Series:
    """Fails features missing any of the specified mandatory attribute columns."""
    if not required_cols:
        return pd.Series(True, index=gdf.index)
    mask = pd.Series(True, index=gdf.index)
    for col in required_cols:
        if col not in gdf.columns:
            return pd.Series(False, index=gdf.index)
        mask &= gdf[col].notna()
    return mask

# --- Bounding-box pre-filter (moderate cost) ---

def rule_within_study_area(
    gdf: gpd.GeoDataFrame,
    bounds: tuple[float, float, float, float],
) -> pd.Series:
    """Fails features whose geometry centroid falls outside the bounding box."""
    minx, miny, maxx, maxy = bounds
    bbox = box(minx, miny, maxx, maxy)
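    # Judge each feature by its centroid so geometries that straddle the
    # boundary are not failed outright; in a geographic CRS GeoPandas warns
    # that centroids are approximate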
    return gdf.geometry.centroid.within(bbox)

# --- Geometry predicate checks (most expensive — run last) ---

def rule_valid_topologies(gdf: gpd.GeoDataFrame) -> pd.Series:
    """
    Checks OGC geometry validity using the vectorised GeoSeries.is_valid.
    Returns False for None and empty geometries (handled upstream, but defensive here too).
    """
    geoms = gdf.geometry
    # Empty geometries count as valid in GEOS, so exclude them explicitly
    return geoms.notna() & ~geoms.is_empty & geoms.is_valid

def rule_no_self_intersections(gdf: gpd.GeoDataFrame) -> pd.Series:
    """
    Passes valid geometries, and invalid geometries whose reported reason is
    something other than a self-intersection, so a failure here pinpoints
    self-intersecting rings. More specific than is_valid alone for debugging.
    """
    def _check(geom):
        if geom is None or geom.is_empty:
            return False
        if geom.is_valid:
            return True
        return "Self-intersection" not in explain_validity(geom)
    return gdf.geometry.apply(_check)

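def rule_minimum_area(
    gdf: gpd.GeoDataFrame,
    min_area_m2: float = 1.0,
) -> pd.Series:
    """Fails polygon features below a minimum area threshold in square metres."""
    geom_type = gdf.geometry.geom_type
    polygon_mask = geom_type.isin(["Polygon", "MultiPolygon"])
    result = pd.Series(True, index=gdf.index)
    if not polygon_mask.any():
        return result
    polygons = gdf.geometry[polygon_mask]
    # Area in a geographic CRS such as EPSG:4326 is in square degrees, so
    # measure in an estimated UTM zone instead (an approximation suited to
    # datasets of limited extent). A projected CRS is assumed to be in metres.
    if gdf.crs is not None and gdf.crs.is_geographic:
        polygons = polygons.to_crs(polygons.estimate_utm_crs())
    result[polygon_mask] = polygons.area >= min_area_m2
    return result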

Verification: unit-test each rule with a synthetic fixture containing at least one known-fail and one known-pass feature.

from shapely.geometry import Point, Polygon
import geopandas as gpd

def _make_gdf(geoms, crs="EPSG:4326"):
    return gpd.GeoDataFrame({"geometry": geoms}, crs=crs)

# rule_no_null_geometries
gdf_test = _make_gdf([Point(0, 0), None])
result = rule_no_null_geometries(gdf_test)
assert result.tolist() == [True, False], "Null geometry must fail"

# rule_valid_topologies
bowtie = Polygon([(0,0),(1,1),(1,0),(0,1),(0,0)])  # self-intersecting bowtie
gdf_test2 = _make_gdf([Polygon([(0,0),(1,0),(1,1),(0,1)]), bowtie])
result2 = rule_valid_topologies(gdf_test2)
assert result2.tolist() == [True, False], "Self-intersecting polygon must fail validity"

Step 4: Assemble and Run a Rule Set

import geopandas as gpd
from functools import partial

# Load your dataset — replace with your actual path and CRS
parcels = gpd.read_file("data/parcels.gpkg")

engine = SpatialRuleEngine(
    target_epsg=4326,
    rules=[
        # Attribute checks (cheapest — run first)
        {
            "name": "no_null_geometries",
            "check": rule_no_null_geometries,
            "severity": "blocker",
        },
        {
            "name": "required_attributes",
            "check": partial(
                rule_required_attributes,
                required_cols=["parcel_id", "owner_name", "zoning_code"]
            ),
            "severity": "blocker",
        },
        # Bounding-box guard (moderate cost)
        {
            "name": "within_study_area",
            "check": partial(
                rule_within_study_area,
                bounds=(-73.9, 40.6, -73.7, 40.8)   # example: NYC borough extent
            ),
            "severity": "warning",
        },
        # Geometry predicates (most expensive — run last)
        {
            "name": "valid_topologies",
            "check": rule_valid_topologies,
            "severity": "blocker",
        },
        {
            "name": "minimum_area_1m2",
            "check": partial(rule_minimum_area, min_area_m2=1.0),
            "severity": "warning",
        },
    ],
)

results_df = engine.validate(parcels)

# Quick summary
summary = results_df.groupby(["rule_name", "severity", "passed"]).size().reset_index(name="count")
print(summary)

# Verification: no rule should produce all-False across a healthy dataset
passed_counts = results_df.groupby("rule_name")["passed"].sum()
print(passed_counts)  # Inspect — zero passes on any rule is a red flag
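
The long-format results can be reshaped for a per-feature view. The sketch below works on a small synthetic DataFrame with the same columns as the engine output (the data is made up for illustration) and derives both a feature-by-rule matrix and the list of features with at least one blocker failure.

```python
import pandas as pd

# Synthetic stand-in for the DataFrame returned by engine.validate()
results_df = pd.DataFrame({
    "feature_id": ["0", "1", "0", "1"],
    "rule_name": [
        "no_null_geometries", "no_null_geometries",
        "valid_topologies", "valid_topologies",
    ],
    "passed": [True, True, True, False],
    "severity": ["informational", "informational", "informational", "blocker"],
})

# One row per feature, one column per rule
wide = results_df.pivot(index="feature_id", columns="rule_name", values="passed")
print(wide)

# Features that failed at least one blocker rule
blocked_ids = results_df.loc[
    ~results_df["passed"] & (results_df["severity"] == "blocker"), "feature_id"
].unique().tolist()
print(blocked_ids)
```

The pivot only works when each (feature, rule) pair appears once, which is also a useful sanity check on the engine output.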

Common Failure Modes & Fixes

Failure / exception	Root cause	Remediation
`ValueError: Input GeoDataFrame has no CRS`	Layer loaded from a source that carries no CRS metadata (some Parquet writers, or a Shapefile missing its `.prj` sidecar)	Always call `gdf.set_crs(epsg=..., allow_override=False)` immediately after read if CRS is known, or raise at ingestion boundary
`GEOSException: IllegalArgumentException: Points of LinearRing do not form a closed linestring`	WKB round-trip through a system that truncates coordinate precision, breaking ring closure	Repair geometries that load but are invalid with the vectorised `gdf["geometry"] = gdf.geometry.make_valid()` as a pre-processing step before validation; quarantine records that cannot be constructed at all
`KeyError: column_name` inside a rule function	Schema migration renamed or dropped a column the rule references	Validate DataFrame schema with `pandera` at engine entry; fail fast with a descriptive message naming the missing column
`CRSError: Input is not a CRS` during `to_crs()`	PROJ data directory not found, or CRS string is malformed	Verify `pyproj.datadir.get_data_dir()` resolves; use EPSG integer codes rather than PROJ4 strings
`passed_series` length mismatches `gdf.index`	Rule function calls `reset_index()` internally, breaking index alignment	Forbid `reset_index()` inside rule functions; use `.reindex(gdf.index, fill_value=False)` in the engine to recover gracefully
All features fail a topology rule after reprojection	Floating-point drift introduced by coordinate transformation produces invalid rings	Apply `make_valid` after any `to_crs()` call; set a consistent coordinate precision with `shapely.set_precision(gdf.geometry, grid_size=1e-9)`
`MemoryError` on large datasets	Entire GeoDataFrame materialised before any rule filtering	Switch to chunked ingestion (see Performance section below); apply `rule_no_null_geometries` as a drop filter before loading geometry-heavy predicates
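
As an illustration of the repair step mentioned in the table, the following sketch applies the vectorised GeoSeries.make_valid() to a two-feature frame containing one bowtie polygon. The fixture is synthetic, and whether automatic repair is acceptable for a given dataset is a policy decision this example does not settle.

```python
import geopandas as gpd
from shapely.geometry import Polygon

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])  # self-intersecting
gdf = gpd.GeoDataFrame({"geometry": [square, bowtie]}, crs="EPSG:4326")

invalid_before = int((~gdf.geometry.is_valid).sum())

# Repair into a copy so the original geometries remain available for audit
repaired = gdf.copy()
repaired["geometry"] = gdf.geometry.make_valid()
invalid_after = int((~repaired.geometry.is_valid).sum())

print(f"invalid before: {invalid_before}, after: {invalid_after}")
print(repaired.geometry.geom_type.tolist())
```

Repair may change a feature's geometry type, so rules that assume a single type should run after this step and be prepared for multi-part results.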

Performance & Scale Considerations

Profile before optimising. Use py-spy record -o profile.svg -- python validate.py to produce a flamegraph. The bottleneck is almost always one of: unindexed spatial joins, iterative apply() calls where vectorised operations are available, or geometry deserialization from disk.

Replace apply() with vectorised Shapely 2.0 operations. Shapely 2.0 exposes array-level GEOS calls via shapely.is_valid(geometries), shapely.area(geometries), and shapely.within(geoms_a, geoms_b). These bypass Python object overhead entirely. For datasets above 500 k features, the difference between gdf.geometry.apply(lambda g: g.is_valid) and shapely.is_valid(gdf.geometry.values) is typically 5–15x.
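
A quick way to see the effect on your own hardware is to time both styles on the same array. This sketch uses shapely.is_valid and shapely.area on a NumPy object array of synthetic polygons; the measured ratio depends on machine, geometry complexity, and library versions, so no particular speed-up is asserted here.

```python
import time

import numpy as np
import shapely
from shapely.geometry import Polygon

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
geoms = np.array([square, bowtie] * 5000, dtype=object)

# Per-object loop: one Python-level attribute access per geometry
start = time.perf_counter()
looped = np.array([g.is_valid for g in geoms])
loop_seconds = time.perf_counter() - start

# Array-level call: a single pass over the whole array inside GEOS
start = time.perf_counter()
vectorised = shapely.is_valid(geoms)
vector_seconds = time.perf_counter() - start

areas = shapely.area(geoms)
print(f"loop: {loop_seconds:.4f}s  vectorised: {vector_seconds:.4f}s")
```

Both approaches return the same boolean results, which makes the vectorised call a drop-in replacement inside a rule function.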

Use the spatial index for set-level checks. When a rule checks whether features intersect a reference layer (e.g., a protected zone boundary), build a spatial index on the reference layer and query it with reference.sindex.query(gdf.geometry, predicate="intersects"). The index first prunes candidates by bounding box and then evaluates the exact predicate only on those candidates, which reduces the exact-predicate workload by orders of magnitude for sparse intersections.
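
Here is one possible shape for such a rule. The sketch assumes a GeoPandas version whose sindex.query accepts an array of geometries; rule_outside_zones and both fixture layers are hypothetical and exist only for this example.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import Point, box

features = gpd.GeoDataFrame(
    {"geometry": [Point(0.5, 0.5), Point(5, 5), Point(2.5, 2.5)]},
    crs="EPSG:4326",
)
zones = gpd.GeoDataFrame(
    {"geometry": [box(0, 0, 1, 1), box(2, 2, 3, 3)]},
    crs="EPSG:4326",
)

def rule_outside_zones(gdf, reference):
    """Passes features that do not intersect any geometry in the reference layer."""
    # For array input the result has two rows: positions in gdf.geometry,
    # then positions in the reference layer
    feature_pos, _ = reference.sindex.query(gdf.geometry, predicate="intersects")
    hit = pd.Series(False, index=gdf.index)
    hit.iloc[np.unique(feature_pos)] = True
    return ~hit

print(rule_outside_zones(features, zones).tolist())
```

Wrapping the rule with functools.partial(rule_outside_zones, reference=zones) gives it the single-argument signature the engine expects.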

Chunk large datasets at the ingestion boundary. For files exceeding available RAM, use fiona.open() with a slice to process records in batches, or load a bounding-box subset with gpd.read_file(path, bbox=(minx, miny, maxx, maxy)). Aggregate per-chunk result DataFrames with pd.concat().

Transition to Dask-GeoPandas above 2 million features. dask_geopandas.from_geopandas(gdf, npartitions=N) splits an in-memory GeoDataFrame into N partitions by row order so that rules run in parallel across workers; call spatial_shuffle() on the result if rules benefit from spatially coherent partitions, and use dask_geopandas.read_parquet() when the data does not fit in memory at all. Rules that reference only the local partition (geometry validity, attribute completeness, area checks) distribute directly. Rules that require cross-partition context (gap detection, overlap checks between non-adjacent features) require a different strategy; see Asynchronous Validation Workflows for queue-based approaches that handle cross-partition spatial checks without loading all partitions into one worker.

For geometry-level checks involving Shapely internals — ring orientation, sliver detection, precision model enforcement — Implementing Shapely Geometry Checks in Python goes deeper on GEOS-layer operations and how to integrate them into the rule pattern established here.

Integration with the Validation Pipeline

The rule engine sits at the rule-evaluation stage of the validation DAG — it consumes a normalised GeoDataFrame from the ingestion stage and produces a structured results DataFrame that feeds the severity-routing stage. Before the engine runs, the geometry validity checks at ingestion should already have repaired or quarantined structurally unreadable features. The engine then applies business rules on top of structurally sound data.

Ingestion contract. The engine expects: a GeoDataFrame with a non-null CRS; geometry column named geometry; no mixed geometry types unless the rule set explicitly handles them; and a stable attribute schema that matches the rule set’s column references. Violating any of these should raise at the ingestion boundary, not inside a rule function.
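
The contract can be enforced with a small guard at the ingestion boundary. This is a sketch under the assumptions stated in the paragraph above; check_ingestion_contract is a hypothetical helper, and teams that already use pandera may prefer to express the attribute part as a schema instead.

```python
import geopandas as gpd
from shapely.geometry import Point, box

def check_ingestion_contract(gdf, required_cols):
    """Raise ValueError listing every contract violation found."""
    problems = []
    if gdf.crs is None:
        problems.append("CRS is missing")
    if gdf.geometry.name != "geometry":
        problems.append(f"geometry column is named '{gdf.geometry.name}'")
    # Null geometries have no type, so drop them before counting types
    geom_types = sorted(gdf.geom_type.dropna().unique())
    if len(geom_types) > 1:
        problems.append(f"mixed geometry types: {geom_types}")
    missing = [col for col in required_cols if col not in gdf.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if problems:
        raise ValueError("Ingestion contract violated: " + "; ".join(problems))

good = gpd.GeoDataFrame(
    {"parcel_id": [1], "geometry": [Point(0, 0)]}, crs="EPSG:4326"
)
check_ingestion_contract(good, ["parcel_id"])  # passes silently
```

Reporting all violations in one message, rather than stopping at the first, saves a round trip when a new data source breaks several assumptions at once.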

Output contract. The results DataFrame emitted by engine.validate() must include feature_id, rule_name, passed, failure_reason, and severity for every (feature, rule) pair. Downstream routing stages — including Categorizing and Prioritizing Spatial Errors — consume this schema to compute priority scores and route failures to auto-remediation queues, steward review dashboards, or compliance reports.
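
A matching guard on the output side catches contract drift before routing sees it. The sketch below is a hypothetical check_output_contract helper that verifies the required columns and the one-row-per-pair property on a plain pandas DataFrame.

```python
import pandas as pd

REQUIRED_COLUMNS = ["feature_id", "rule_name", "passed", "failure_reason", "severity"]

def check_output_contract(results, n_features, n_rules):
    """Raise ValueError if the results frame breaks the output contract."""
    missing = [col for col in REQUIRED_COLUMNS if col not in results.columns]
    if missing:
        raise ValueError(f"results missing columns: {missing}")
    if results.duplicated(["feature_id", "rule_name"]).any():
        raise ValueError("duplicate (feature, rule) pairs in results")
    expected = n_features * n_rules
    if len(results) != expected:
        raise ValueError(f"expected {expected} rows, found {len(results)}")

results = pd.DataFrame({
    "feature_id": ["0", "1"],
    "rule_name": ["valid_topologies", "valid_topologies"],
    "passed": [True, False],
    "failure_reason": [None, "Failed rule: valid_topologies"],
    "severity": ["informational", "blocker"],
})
check_output_contract(results, n_features=2, n_rules=1)  # passes silently
```

Running this check in the engine's test suite, and optionally in production, keeps downstream consumers from having to defend against malformed input.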

Observability hooks. Emit a structured log entry after each rule executes, recording rule name, feature count, pass count, fail count, and elapsed time. Feed these into Prometheus counters or a Datadog custom metric. A sudden spike in failures on a stable rule is the earliest signal of upstream schema drift or corrupted source data — catching it at the rule-execution layer is far cheaper than discovering it in a downstream model or map.
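
A thin wrapper around each rule call is enough to produce those log entries. In this sketch run_with_metrics and the logger name are assumptions; the metrics dictionary is returned as well as logged so that it can be forwarded to whichever metrics client is in use.

```python
import json
import logging
import time

import pandas as pd

logger = logging.getLogger("rule_engine")

def run_with_metrics(rule_name, check, df):
    """Run one rule, log a structured summary, and return (passed, metrics)."""
    start = time.perf_counter()
    passed = check(df)
    metrics = {
        "rule": rule_name,
        "features": int(len(df)),
        "passed": int(passed.sum()),
        "failed": int((~passed).sum()),
        "elapsed_s": round(time.perf_counter() - start, 6),
    }
    # One JSON object per line keeps the log easy to parse downstream
    logger.info(json.dumps(metrics))
    return passed, metrics

df = pd.DataFrame({"height": [3.0, None, 1.0]})
passed, metrics = run_with_metrics(
    "height_present", lambda d: d["height"].notna(), df
)
print(metrics)
```

Because the wrapper takes and returns the same boolean Series as the rule itself, it can sit between the engine and any rule without changing either.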

For teams processing continuous streams — IoT sensor feeds, real-time location updates, or CDC (Change Data Capture) streams from a PostGIS database — running the rule engine synchronously per batch is unsuitable. The Asynchronous Validation Workflows section covers how to decompose the engine into individual rule workers that consume from a message queue, apply a single rule per message, and publish results to an aggregation topic, enabling non-blocking execution and per-rule horizontal scaling.

Frequently Asked Questions

When should I use GeoPandas instead of PostGIS for rule evaluation?

GeoPandas is the right choice for iterative rule development, datasets that fit comfortably in memory (under roughly five million features), and pipelines where Python-native logic — regex checks, external API lookups, ML-based anomaly scores — must run alongside spatial predicates. PostGIS is preferable when data already lives in the database, when set-level topology checks across large feature collections are required, or when query plans can exploit existing GiST indexes without materialising the dataset in Python.

How do I prevent a single malformed geometry from halting the entire rule batch?

Wrap each rule’s execution block in a try/except. On exception, mark every feature in the batch as failed for that rule with a failure_reason that records the exception message, then continue to the next rule. This graceful-degradation pattern ensures one bad geometry or unexpected null value does not suppress results from all subsequent rules. The engine skeleton in Step 2 above demonstrates this pattern.

How should I handle mixed CRS inputs at rule-engine entry?

Enforce CRS normalisation as the very first step before any rule fires. Raise a hard error if gdf.crs is None, and call gdf.to_crs(target_epsg) if the incoming CRS differs from the pipeline canonical CRS. Log the source CRS, target CRS, and any detected datum shift for audit purposes. Never assume CRS correctness and never silently reproject without logging the transformation.
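
One way to follow that advice is sketched below. The helper normalise_crs and the logger name are hypothetical; the function compares CRS objects rather than EPSG integers so that inputs without an EPSG code are still handled, and it logs source and target but does not attempt to detect datum shifts.

```python
import logging

import geopandas as gpd
from pyproj import CRS
from shapely.geometry import Point

logger = logging.getLogger("rule_engine.crs")

def normalise_crs(gdf, target_epsg):
    """Return gdf in the target CRS, logging any reprojection that occurs."""
    if gdf.crs is None:
        raise ValueError("Input has no CRS; set it explicitly before validation")
    target = CRS.from_epsg(target_epsg)
    if gdf.crs == target:
        return gdf
    logger.info(
        "Reprojecting from %s to %s", gdf.crs.to_string(), target.to_string()
    )
    return gdf.to_crs(target)

source = gpd.GeoDataFrame({"geometry": [Point(0, 0)]}, crs="EPSG:4326")
projected = normalise_crs(source, 3857)
print(projected.crs.to_string())
```

Using the logging module instead of print lets the transformation record flow into the same audit trail as the rule results.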

What is the fastest way to apply many rules to millions of features?

Order rules by cost: null/type checks (pandas isnull/dtype comparisons) first, bounding-box pre-filters using sindex.query second, exact geometry predicates (is_valid, within, intersects) last. Replace apply() loops with vectorised Shapely array operations wherever possible. For datasets above roughly two million features or exceeding available RAM, partition with Dask-GeoPandas and run rules per partition in parallel, aggregating results afterward.

Implementing Shapely Geometry Checks in Python — deeper coverage of ring orientation, sliver detection, and GEOS-level precision models
Categorizing and Prioritizing Spatial Errors — how rule-engine output is scored and routed downstream
Asynchronous Validation Workflows — queue-based execution for streaming and cross-partition rule evaluation
Geometry Validity Checks for Vector Data — the ingestion-stage checks that precede rule-engine execution
Understanding OGC Topology Rules — the standards vocabulary underpinning topology validity rules
