Categorizing and Prioritizing Spatial Errors in Validation Pipelines

Automated spatial validation routinely generates thousands of flagged features across geometry, topology, and attribute domains in a single pipeline run. Without a structured approach to categorization and scoring, that output becomes unactionable noise — QA engineers waste cycles triaging minor coordinate-precision rounding alongside genuine compliance violations. This page explains how to build the decision layer inside a Validation Pipeline Architecture that transforms raw validation logs into a deterministic, prioritized work queue. GIS analysts, platform engineers, and data stewards all benefit: analysts get clear remediation tasks, engineers get stable queue contracts, and stewards get audit-ready quality metrics.

Prerequisites

Before wiring categorization logic into a pipeline, confirm the following:

Python 3.10+, GeoPandas 1.0+, Shapely 2.0+, NumPy 1.25+: The scoring implementation relies on vectorized numpy.select and GeoPandas geometry accessors, and the examples on this page were written against these versions. Steps 1 and 2 also import PyYAML and pandera. Pin all of these in pyproject.toml or requirements.txt.
Standardized spatial schema: Every dataset must conform to a documented coordinate reference system (CRS) and feature-type registry. Inconsistent schemas generate false positives during topology checks. Apply CRS precision standards before any categorization step.
Structured rule-engine output: Validation rules must emit records that include error_type, severity, rule_id, geometry (the affected feature's geometry), regulatory_flag, downstream_consumers, and affected_area_km2. These are the column names the Step 2 schema checks. The rule engine built with GeoPandas already serializes output in this format.
Baseline quality thresholds per dataset class: Define acceptable error rates for each dataset type (e.g., cadastral parcels tolerate zero topology gaps; environmental monitoring points allow low-severity attribute nulls). Thresholds determine tier boundary values.
Compliance mapping: Align severity assignments with the data quality elements defined in ISO 19157-1 (Geographic information, Data quality) so audit reports use recognized vocabulary.
Structured logging: Use structlog or loguru to capture validation traces. Every categorization decision should be machine-readable for downstream dashboarding.

Conceptual Foundation

What makes an error worth categorizing?

Spatial errors are not equal. A self-intersecting polygon in a parcel boundary dataset that feeds a municipal tax system carries far higher remediation urgency than a duplicate vertex on a decorative mapping layer. The ISO 19157-1 standard formalizes this intuition through its data quality elements (completeness, logical consistency, positional accuracy, temporal quality, and thematic accuracy), each of which can be associated with one or more of the error domains below.

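The Open Geospatial Consortium (OGC) Simple Feature Access specification further anchors geometry validity: a polygon is OGC-valid when its rings are closed and do not self-intersect or cross one another, and its interior is connected. Ring orientation is not checked by common validity tests (for example Shapely's is_valid or PostGIS ST_IsValid), but many formats and consumers expect a consistent winding order, so this page treats it as a geometry error as well. Any deviation triggers a GEO_* class error. Topology rules govern relationships between features: gaps, overlaps, and dangling nodes each need a precise, documented definition so that they translate directly into error codes.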

The four error domains

Domain	Representative error codes	Typical source
Geometry	`GEO_SELF_INTERSECT`, `GEO_EMPTY`, `GEO_RING_ORIENT`	CRS reprojection, CAD imports, coordinate truncation
Topology	`TOPO_GAP`, `TOPO_OVERLAP`, `TOPO_DANGLE`, `TOPO_SLIVER`	Digitizing errors, conflation misalignment
Spatial Relationship	`REL_CONTAINMENT`, `REL_BUFFER_VIOLATION`, `REL_CROSSING`	Business rule violations (parcel crosses zoning line)
Attribute-Spatial	`ATTR_NULL_COORD`, `ATTR_TYPE_MISMATCH`, `ATTR_OUT_OF_BOUNDS`	ETL pipeline bugs, schema migrations

Each domain should map to exactly one remediation handler type. This prevents ambiguous routing and simplifies escalation logic.
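
One way to keep the one-domain-one-handler rule enforceable is to derive the domain from the error-code prefix. The sketch below assumes the four prefixes shown in the table (GEO_, TOPO_, REL_, ATTR_). The function name and the handler names are hypothetical placeholders, not part of any library.

```python
# Hypothetical prefix -> (domain, handler name) registry; handler names are illustrative only.
DOMAIN_BY_PREFIX: dict[str, tuple[str, str]] = {
    "GEO": ("geometry", "geometry_repair"),
    "TOPO": ("topology", "topology_review"),
    "REL": ("relationship", "business_rule_review"),
    "ATTR": ("attribute", "attribute_fix"),
}


def resolve_domain(error_code: str) -> tuple[str, str]:
    """Return (domain, handler) for an error code, or raise on an unknown prefix."""
    prefix = error_code.split("_", 1)[0]
    try:
        return DOMAIN_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"Unregistered error-code prefix: {error_code!r}") from None
```

Failing loudly on an unknown prefix means a new error code cannot reach production without first being assigned to a domain.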

Priority as a composite score

Raw error counts mislead resource allocation. A dataset with 10,000 low-severity duplicate vertices needs far less remediation effort than one with 12 overlapping administrative boundaries that violate a statutory reporting requirement. Priority scoring fuses intrinsic severity with external business factors:

Priority Score = (Severity × Regulatory Weight) + (Exposure Factor × 0.5) + Normalized Spatial Extent

Where:

Severity: Critical = 3, High = 2, Medium = 1, Low = 0.5
Regulatory Weight: errors in compliance-mandated datasets get 2×; standard datasets 1×
Exposure Factor: count of downstream systems consuming the affected layer
Normalized Spatial Extent: affected area or length scaled 0–1 against the dataset’s total
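
As a worked example, the formula can be evaluated for single records by hand. The sketch below is a scalar form of the same arithmetic. The input values (consumer counts and normalized extents) are made up for illustration, and priority_score is a hypothetical helper name.

```python
SEVERITY = {"critical": 3.0, "high": 2.0, "medium": 1.0, "low": 0.5}


def priority_score(severity: str, regulated: bool, consumers: int, extent_norm: float) -> float:
    """Scalar form of the composite formula; extent_norm must already be in [0, 1]."""
    regulatory_weight = 2.0 if regulated else 1.0
    return SEVERITY[severity] * regulatory_weight + consumers * 0.5 + extent_norm


# One overlapping statutory boundary: high severity, regulated, 3 consumers, 40% of extent
overlap = priority_score("high", True, 3, 0.4)       # 2*2 + 3*0.5 + 0.4 = 5.9
# One duplicate vertex: low severity, unregulated, 1 consumer, 5% of extent
duplicate = priority_score("low", False, 1, 0.05)    # 0.5*1 + 1*0.5 + 0.05 = 1.05
```

Under these assumed inputs the boundary overlap scores more than five times higher than the duplicate vertex, even though duplicate vertices may be far more numerous.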

Step-by-Step Implementation

Step 1: Define and register the error taxonomy

Create a taxonomy registry as a version-controlled YAML file so both PostGIS triggers and GeoPandas runs consume identical definitions:

# taxonomy.py
import yaml
from pathlib import Path
from typing import TypedDict

class ErrorDefinition(TypedDict):
    domain: str        # geometry | topology | relationship | attribute
    severity: str      # critical | high | medium | low
    auto_fixable: bool
    iso_element: str   # ISO 19157-1 quality element name

def load_taxonomy(path: Path = Path("config/error_taxonomy.yaml")) -> dict[str, ErrorDefinition]:
    """Load error taxonomy from version-controlled YAML registry."""
    with path.open() as fh:
        raw = yaml.safe_load(fh)
    return {code: ErrorDefinition(**defn) for code, defn in raw["errors"].items()}

# Example taxonomy YAML structure:
# errors:
#   GEO_SELF_INTERSECT:
#     domain: geometry
#     severity: critical
#     auto_fixable: true
#     iso_element: logical_consistency
#   TOPO_GAP:
#     domain: topology
#     severity: high
#     auto_fixable: false
#     iso_element: logical_consistency

Verification: python -c "from taxonomy import load_taxonomy; t = load_taxonomy(); print(len(t), 'error types registered')" — expect the count to match your YAML entries with no KeyError.
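
TypedDict does not validate values at runtime, so a typo such as severity: urgent would load without error. A minimal consistency check, assuming the domain and severity vocabularies used on this page, might look like the following. check_taxonomy is a hypothetical helper, not part of the loader above.

```python
VALID_DOMAINS = {"geometry", "topology", "relationship", "attribute"}
VALID_SEVERITIES = {"critical", "high", "medium", "low"}
REQUIRED_KEYS = {"domain", "severity", "auto_fixable", "iso_element"}


def check_taxonomy(taxonomy: dict[str, dict]) -> list[str]:
    """Return human-readable problems; an empty list means the registry is consistent."""
    problems: list[str] = []
    for code, defn in taxonomy.items():
        missing = REQUIRED_KEYS - defn.keys()
        if missing:
            problems.append(f"{code}: missing keys {sorted(missing)}")
            continue  # the remaining checks need every key to be present
        if defn["domain"] not in VALID_DOMAINS:
            problems.append(f"{code}: unknown domain {defn['domain']!r}")
        if defn["severity"] not in VALID_SEVERITIES:
            problems.append(f"{code}: unknown severity {defn['severity']!r}")
        if not isinstance(defn["auto_fixable"], bool):
            problems.append(f"{code}: auto_fixable must be a boolean")
    return problems
```

Running this in CI against the YAML registry catches vocabulary drift before either engine consumes the file.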

Step 2: Validate the incoming error DataFrame

Before scoring, enforce schema integrity with pandera:

import pandera as pa
import pandas as pd

validation_schema = pa.DataFrameSchema(
    columns={
        "error_type":            pa.Column(str,   nullable=False),
        "severity":              pa.Column(str,   pa.Check.isin(["critical","high","medium","low"])),
        "regulatory_flag":       pa.Column(bool,  nullable=False),
        "downstream_consumers":  pa.Column(int,   pa.Check.ge(0)),
        "affected_area_km2":     pa.Column(float, pa.Check.ge(0.0)),
        "geometry":              pa.Column(object, nullable=True),
    },
    coerce=True,
)

def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and coerce the validation-flag DataFrame schema."""
    return validation_schema.validate(df, lazy=True)

Verification: Pass a DataFrame missing regulatory_flag — expect a pandera.errors.SchemaErrors with the missing column listed.

Step 3: Compute priority scores

import numpy as np
import pandas as pd

SEVERITY_MAP: dict[str, float] = {
    "critical": 3.0,
    "high":     2.0,
    "medium":   1.0,
    "low":      0.5,
}

REGULATORY_WEIGHT: dict[bool, float] = {True: 2.0, False: 1.0}

def calculate_priority_scores(
    df: pd.DataFrame,
    exposure_col: str = "downstream_consumers",
    extent_col: str = "affected_area_km2",
) -> pd.DataFrame:
    """
    Vectorized priority scoring for spatial validation flags.

    Returns df with two new columns: priority_score (float) and routing_tier (str).
    """
    if df.empty:
        return df.assign(priority_score=0.0, routing_tier="log_only")

    result = df.copy()

    # Map severity and regulatory categories to numeric weights
    result["_sev_w"] = result["severity"].map(SEVERITY_MAP).fillna(1.0)
    result["_reg_w"] = result["regulatory_flag"].map(REGULATORY_WEIGHT).fillna(1.0)

    # Min-max normalize spatial extent to [0, 1]
    e_min, e_max = result[extent_col].min(), result[extent_col].max()
    result["_ext_n"] = (
        (result[extent_col] - e_min) / (e_max - e_min)
        if e_max > e_min
        else 0.0
    )

    # Composite score — fully vectorized
    result["priority_score"] = (
        result["_sev_w"] * result["_reg_w"]
        + result[exposure_col].astype(float) * 0.5
        + result["_ext_n"]
    ).round(3)

    # Deterministic tier assignment with numpy.select
    result["routing_tier"] = np.select(
        condlist=[
            result["priority_score"] >= 8.0,
            result["priority_score"] >= 1.5,
        ],
        choicelist=["escalation", "manual_review"],
        default="auto_remediate",
    )

    return result.drop(columns=["_sev_w", "_reg_w", "_ext_n"])

Verification:

sample = pd.DataFrame([{
    "error_type": "GEO_SELF_INTERSECT", "severity": "critical",
    "regulatory_flag": True, "downstream_consumers": 4,
    "affected_area_km2": 100.0, "geometry": None,
}])
scored = calculate_priority_scores(sample)
assert scored.loc[0, "routing_tier"] == "escalation", scored.loc[0, "priority_score"]
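
The tier boundaries are inclusive, so it can be worth pinning the edge cases in a test. The sketch below is a scalar mirror of the numpy.select call, using the same 8.0 and 1.5 boundaries. tier_for is a hypothetical helper introduced only for this check.

```python
ESCALATION_MIN = 8.0      # same boundary as the numpy.select call in Step 3
MANUAL_REVIEW_MIN = 1.5


def tier_for(score: float) -> str:
    """Scalar mirror of the tier assignment; the first matching condition wins."""
    if score >= ESCALATION_MIN:
        return "escalation"
    if score >= MANUAL_REVIEW_MIN:
        return "manual_review"
    return "auto_remediate"


# Scores exactly on a boundary belong to the higher tier.
boundary_cases = {
    8.0: "escalation",
    7.999: "manual_review",
    1.5: "manual_review",
    1.499: "auto_remediate",
}
mismatches = {s: tier_for(s) for s, expected in boundary_cases.items() if tier_for(s) != expected}
```

An empty mismatches dict confirms the expected boundary behaviour. If the thresholds are recalibrated later, the boundary cases need to be updated with them.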

Step 4: Route to handlers

from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class RoutingConfig:
    auto_remediation_handler: Callable[[pd.DataFrame], None]
    manual_review_handler:    Callable[[pd.DataFrame], None]
    escalation_handler:       Callable[[pd.DataFrame], None]

def route_errors(
    scored_df: pd.DataFrame,
    config: RoutingConfig,
) -> dict[str, int]:
    """
    Dispatch scored error records to the appropriate handler.

    Returns a dict of tier → count for monitoring.
    """
    tiers: dict[str, int] = {}
    for tier, handler in [
        ("auto_remediate", config.auto_remediation_handler),
        ("manual_review",  config.manual_review_handler),
        ("escalation",     config.escalation_handler),
    ]:
        subset = scored_df[scored_df["routing_tier"] == tier]
        if not subset.empty:
            handler(subset)
        tiers[tier] = len(subset)
    return tiers

Each handler should write to an idempotent backend: an auto_remediate queue, a QA dashboard API, or an alert channel. Decoupling handlers from the scoring step allows you to swap a synchronous PostGIS repair call for an asynchronous Celery queue without changing the scoring logic.

Verification: Mock all three handlers and assert that the values of the dict returned by route_errors(scored_df, config) sum to len(scored_df).
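
A minimal version of that check, using unittest.mock from the standard library, could look like this. The dispatcher is restated in compact form so the snippet runs on its own, and the three-row DataFrame is made-up test data.

```python
from dataclasses import dataclass
from typing import Callable
from unittest.mock import Mock

import pandas as pd


@dataclass
class RoutingConfig:
    auto_remediation_handler: Callable[[pd.DataFrame], None]
    manual_review_handler: Callable[[pd.DataFrame], None]
    escalation_handler: Callable[[pd.DataFrame], None]


def route_errors(scored_df: pd.DataFrame, config: RoutingConfig) -> dict[str, int]:
    """Compact restatement of the Step 4 dispatcher so the check is self-contained."""
    handlers = {
        "auto_remediate": config.auto_remediation_handler,
        "manual_review": config.manual_review_handler,
        "escalation": config.escalation_handler,
    }
    counts: dict[str, int] = {}
    for tier, handler in handlers.items():
        subset = scored_df[scored_df["routing_tier"] == tier]
        if not subset.empty:
            handler(subset)
        counts[tier] = len(subset)
    return counts


# Made-up scored output: two auto-remediable records and one escalation.
scored_df = pd.DataFrame({"routing_tier": ["auto_remediate", "auto_remediate", "escalation"]})
config = RoutingConfig(Mock(), Mock(), Mock())
counts = route_errors(scored_df, config)

assert sum(counts.values()) == len(scored_df)
config.manual_review_handler.assert_not_called()   # empty tiers never invoke their handler
config.escalation_handler.assert_called_once()
```

The sum check also catches records carrying an unexpected routing_tier value, which the dispatcher would otherwise drop silently.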

Step 5: Feed the feedback loop

Attach outcome tracking to every remediation handler:

def record_outcome(
    error_id: str,
    tier: str,
    resolution: str,   # "fixed" | "escalated" | "deferred" | "false_positive"
    resolution_ts: str,
    outcome_store: pd.DataFrame,
) -> pd.DataFrame:
    """Append a remediation outcome for calibration analysis."""
    row = {
        "error_id": error_id,
        "tier": tier,
        "resolution": resolution,
        "resolution_ts": pd.Timestamp(resolution_ts),
    }
    return pd.concat([outcome_store, pd.DataFrame([row])], ignore_index=True)

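Compute false-positive rates and mean time to resolution (MTTR) per tier monthly. When the auto-remediation false-positive rate exceeds 10%, lower the score boundary between auto_remediate and manual_review (1.5 in Step 3) so that fewer records are repaired without human review. When MTTR for manual review exceeds your SLA, lower the escalation threshold so the most urgent records leave the review queue sooner.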
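
The two calibration metrics can be sketched with the standard library alone. Note that MTTR needs a detection timestamp, which record_outcome does not store. The detected_ts field below is an assumption, as are the function name and the sample rows.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def tier_metrics(outcomes: list[dict]) -> dict[str, dict[str, float]]:
    """Per-tier false-positive rate and mean time to resolution in hours."""
    by_tier: dict[str, list[dict]] = defaultdict(list)
    for row in outcomes:
        by_tier[row["tier"]].append(row)

    metrics: dict[str, dict[str, float]] = {}
    for tier, rows in by_tier.items():
        false_positives = sum(r["resolution"] == "false_positive" for r in rows)
        hours = [
            (datetime.fromisoformat(r["resolution_ts"])
             - datetime.fromisoformat(r["detected_ts"])).total_seconds() / 3600
            for r in rows
        ]
        metrics[tier] = {
            "false_positive_rate": false_positives / len(rows),
            "mttr_hours": mean(hours),
        }
    return metrics


# Illustrative outcome rows; detected_ts is a hypothetical extra field.
outcomes = [
    {"tier": "auto_remediate", "resolution": "fixed",
     "detected_ts": "2024-01-01T00:00:00", "resolution_ts": "2024-01-01T02:00:00"},
    {"tier": "auto_remediate", "resolution": "false_positive",
     "detected_ts": "2024-01-01T00:00:00", "resolution_ts": "2024-01-01T04:00:00"},
    {"tier": "manual_review", "resolution": "fixed",
     "detected_ts": "2024-01-01T00:00:00", "resolution_ts": "2024-01-02T00:00:00"},
]
metrics = tier_metrics(outcomes)
```

With these sample rows the auto_remediate tier shows a 50% false-positive rate and a 3-hour MTTR, which would trigger the 10% recalibration rule described above.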

Common Failure Modes and Fixes

Symptom	Root cause	Fix
`NaN` in `priority_score`	`severity` value not in `SEVERITY_MAP`	Keep the `.fillna(1.0)` on the map call (Step 3 already includes it); log unmapped types to a monitoring metric
All errors land in `auto_remediate`	Regulatory weights not applied (all `False`)	Verify `regulatory_flag` is correctly set in the rule-engine output schema
`pandera.errors.SchemaErrors: column 'downstream_consumers' not in dataframe`	Rule engine emitting an older schema version	Check the rule-engine output schema version; add a schema migration shim
Escalation queue floods on first run	Thresholds were tuned on a mature, already-remediated dataset, so the unremediated first run skews scores high	Run on a 1% sample first; establish the percentile distribution of scores before production routing
`SettingWithCopyWarning` from pandas	Scoring applied on a slice rather than `.copy()`	Always call `df.copy()` at the start of `calculate_priority_scores`
Spatial extent normalization returns `0.0` for all rows	Only one error record in the batch, or every record has the same extent	Expected behaviour: the `if e_max > e_min` guard avoids a division by zero and treats such batches as extent = 0

Performance and Scale Considerations

For batches under roughly 500,000 error records, the vectorized pandas implementation above typically completes within a few seconds on a standard 8-core instance. Beyond that threshold:

Arrow-backed DataFrames: Pass dtype_backend="pyarrow" to pd.read_parquet() when loading large validation logs. Arrow columnar layout reduces memory for string-heavy error_type and severity columns by 40–60%.
Chunked scoring: Read validation logs in Parquet partitions and score chunk-by-chunk. Normalize spatial extent globally (compute e_min/e_max in a first pass) before scoring each partition.
Distributed execution: At tens of millions of records, migrate to batch processing with Dask or Apache Sedona. The calculate_priority_scores function is partition-safe as long as extent normalization uses globally pre-computed bounds.
Indexing before routing: If the routing step queries PostGIS for spatial context (e.g., to count downstream consumers dynamically), ensure a spatial index on the affected-feature geometry column. A missing index on a large feature table can turn a millisecond lookup into a multi-second full scan.
Queue backpressure: Use a message broker (RabbitMQ, Apache Kafka, or AWS SQS) between the scoring step and handler invocations. This prevents the scoring process from blocking on slow remediation handlers during peak ingestion.
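
The chunked-scoring point depends on normalizing against dataset-wide bounds rather than per-chunk bounds. A minimal two-pass sketch, using plain lists as stand-ins for the extent column of each Parquet partition, is shown below. The function names are hypothetical.

```python
from collections.abc import Iterable, Iterator


def global_bounds(chunks: Iterable[list[float]]) -> tuple[float, float]:
    """First pass: scan every chunk once to find the dataset-wide min and max extent."""
    lo, hi = float("inf"), float("-inf")
    for chunk in chunks:
        if chunk:
            lo, hi = min(lo, min(chunk)), max(hi, max(chunk))
    return lo, hi


def normalize_chunks(
    chunks: Iterable[list[float]], lo: float, hi: float
) -> Iterator[list[float]]:
    """Second pass: scale each chunk with the shared bounds instead of per-chunk bounds."""
    span = hi - lo
    for chunk in chunks:
        yield [(v - lo) / span if span > 0 else 0.0 for v in chunk]


partitions = [[0.0, 5.0], [10.0], [2.5, 7.5]]   # stand-ins for per-partition extent columns
lo, hi = global_bounds(partitions)
normalized = list(normalize_chunks(partitions, lo, hi))
```

Because the data is read twice, the chunk source must be re-iterable (a list of partition paths, for example) rather than a one-shot generator. With per-chunk bounds, the single-value partition [10.0] would have normalized to 0.0 instead of 1.0.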

Integration with the Validation Pipeline

Within the overall Validation Pipeline Architecture, categorization sits in the middle of the DAG: downstream of rule evaluation and upstream of routing and remediation.

Typical DAG position:

1. Ingestion: load raw features, apply CRS normalization
2. Rule evaluation: run predicate checks via the GeoPandas rule engine, emit structured flag records
3. Categorization and scoring ← this page
4. Routing: dispatch to auto-remediation, manual review, or escalation
5. Remediation: execute repairs, update lineage store, close the feedback loop

The categorization step should be stateless and idempotent: given the same input DataFrame it must always return the same scored DataFrame. This makes it safe to re-run after upstream pipeline retries, which is a common pattern in asynchronous validation workflows.
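
One way to check idempotence across retries is to compare a fingerprint of the scored output from each run. The sketch below assumes the scored records can be expressed as JSON-serializable dicts. The fingerprint helper and the sample records are hypothetical.

```python
import hashlib
import json


def fingerprint(records: list[dict]) -> str:
    """Order-independent SHA-256 digest of scored records (values must be JSON-serializable)."""
    canonical = sorted(json.dumps(r, sort_keys=True) for r in records)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


first_run = [
    {"error_id": "e1", "priority_score": 8.0, "routing_tier": "escalation"},
    {"error_id": "e2", "priority_score": 1.05, "routing_tier": "auto_remediate"},
]
retry_run = list(reversed(first_run))   # same records, different arrival order

same_output = fingerprint(first_run) == fingerprint(retry_run)
```

Because extent is min-max normalized within a batch, the guarantee holds for identical input batches only. Splitting the same records into different batches can change their scores unless globally pre-computed bounds are used.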

Frequently Asked Questions

What is the difference between error severity and error priority?

Severity is an intrinsic property of the error type — a self-intersection is always Critical, a duplicate vertex is always Low. Priority is computed from severity plus external factors: regulatory exposure, downstream consumer count, and spatial extent. Two Critical errors can have very different priorities if one affects a compliance-mandated dataset and the other does not.

How often should I recalibrate priority scoring weights?

Recalibrate whenever a significant data source changes, when remediation SLAs are missed, or at least quarterly. Track false-positive rates and MTTR per tier; a rising false-positive rate in the auto-remediation queue is the clearest signal that thresholds have drifted from the actual data distribution.

Can I use the same taxonomy across PostGIS and GeoPandas pipelines?

Yes — the taxonomy is engine-agnostic. Store categories and severity maps in a shared YAML or database-backed registry that both a PostGIS trigger and a GeoPandas validation run can query. Keep error type codes consistent (e.g., GEO_SELF_INTERSECT, TOPO_GAP) so downstream dashboards aggregate correctly regardless of which engine produced the flag.

What is the right queue depth before triggering a circuit breaker?

A common threshold is when more than 15% of features in a single batch are flagged, or when the escalation queue exceeds 2× its 30-day moving average. These signals indicate a systemic upstream problem — a projection change, a schema migration, or a corrupted source file — that automated repair cannot resolve feature-by-feature. Halt downstream routing and trigger a data source investigation before continuing.
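
Both signals can be combined into a single guard. The sketch below uses the 15% and 2× figures from the answer above as defaults. The function name and parameter names are assumptions, and the queue statistics are expected to come from your own monitoring store.

```python
def should_trip_breaker(
    flagged: int,
    total: int,
    escalation_depth: int,
    escalation_30d_avg: float,
    max_flag_ratio: float = 0.15,
    max_depth_multiple: float = 2.0,
) -> bool:
    """Return True when either systemic-failure signal is present."""
    # Signal 1: too large a share of the batch was flagged.
    flag_ratio = flagged / total if total else 0.0
    too_many_flags = flag_ratio > max_flag_ratio
    # Signal 2: the escalation queue is well above its 30-day moving average.
    queue_spike = (
        escalation_30d_avg > 0
        and escalation_depth > max_depth_multiple * escalation_30d_avg
    )
    return too_many_flags or queue_spike
```

Calling this before the routing step lets the pipeline halt dispatch and raise a data-source investigation instead of flooding the handlers.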

Related

Building Rule Engines with GeoPandas — rule serialization patterns that produce the structured flag records this page scores
Classifying Topology Errors by Severity — deep-dive on severity assignment for gap, overlap, and dangle error codes
Asynchronous Validation Workflows — broker-backed queue patterns for decoupling scoring from remediation
Batch Processing Large Spatial Datasets — distributed scoring with Dask when record counts exceed single-node limits
Geometry Validity Checks for Vector Data — OGC validity rules underlying the GEO_* error domain
