
Annexure B: EO Data Pipeline Framework

Deliverable for Milestone 1: Data pipeline and storage architecture plan.

Overview

The EO Data Pipeline is a six-stage framework that takes raw satellite imagery from acquisition through to health surveillance output. All EO processing is performed via Google Earth Engine (GEE), with health and ground-truth data integrated at the fusion stage.

Stage 1: EO Data Acquisition
↓
Stage 2: Pre-processing & QA/QC
↓ ← Health data inputs (NICD, DoH)
Stage 3: Feature Extraction ← Ground truth (sensors, lab)
↓
Stage 4: Spatial-Temporal Data Fusion
↓
Stage 5: Risk Modelling Engine
↓
Stage 6: Surveillance Dashboard Output
↑________ validation feedback loop __________↑

Stage 1: EO Data Acquisition

Sources ingested:

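| Satellite | Band / Product | Spatial res. | Temporal res. |
| --- | --- | --- | --- |
| Sentinel-2 MSI | B3 (Green), B8 (NIR), B11 | 10 m (B3, B8), 20 m (B11) | 5 days |
| Sentinel-3 SLSTR | LST | 1 km | Daily |
| Landsat 8/9 OLI | NDVI, land cover | 30 m | 16 days |
| MODIS Terra | Soil moisture proxy | 500 m | Daily |
| MODIS Aqua | Chlorophyll-a, SST | 1 km | Daily |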

Acquisition method: Scheduled batch pulls via the GEE Python API. Assets are filtered by date range, cloud cover threshold (< 20 %) and study area boundary.

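```python
import ee

ee.Initialize()

# limpopo_geometry is assumed to be an ee.Geometry for the study area, defined elsewhere.
sentinel2 = (
    ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
    # filterDate treats the end date as exclusive, so 2026-02-01 keeps all of January 2026.
    .filterDate('2025-01-01', '2026-02-01')
    .filterBounds(limpopo_geometry)
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))
)
```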

Stage 2: Pre-processing & QA/QC

All imagery is converted to analysis-ready surface values (reflectance or temperature) before feature extraction.

| Step | Description |
| --- | --- |
| Atmospheric correction | The Sentinel-2 SR product is already corrected; Landsat uses the USGS surface reflectance product |
| Cloud masking | SCL band (Sentinel-2) and QA_PIXEL (Landsat) are used to mask cloud and shadow |
| Normalisation | Band values are scaled to the 0–1 reflectance range |
| Temporal compositing | Monthly median composites reduce noise from residual cloud |
| QA/QC checks | Automated scripts flag missing data, extreme outliers and CRS mismatches |
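The masking, normalisation and compositing steps can be illustrated outside GEE on the time series of a single pixel. The following is a minimal pure-Python sketch, not the project's GEE code: the function names are hypothetical, the masked SCL classes and the 1/10000 scale factor are assumptions based on the standard Sentinel-2 L2A product, and the sample observations are made up.

```python
from statistics import median

# Assumed SCL classes to mask: 3 = cloud shadow, 8 and 9 = cloud
# (medium and high probability), 10 = thin cirrus.
MASKED_SCL = {3, 8, 9, 10}
SCALE_DIVISOR = 10000  # assumed DN-to-reflectance divisor for Sentinel-2 SR


def to_reflectance(dn):
    """Scale a digital number to reflectance and clamp it to the 0-1 range."""
    return min(max(dn / SCALE_DIVISOR, 0.0), 1.0)


def monthly_median(observations):
    """Median reflectance of the clear observations for one pixel and month.

    observations: iterable of (digital_number, scl_class) pairs.
    Returns None when every observation in the month is masked.
    """
    clear = [to_reflectance(dn) for dn, scl in observations if scl not in MASKED_SCL]
    return median(clear) if clear else None


# Made-up January observations for one pixel; the 9000 DN value is cloud (SCL 9).
january = [(1200, 4), (9000, 9), (1300, 4), (1250, 5)]
composite = monthly_median(january)
```

Because the cloudy observation is dropped before the median is taken, a single bright outlier does not pull the composite upwards; a month with no clear observation yields None, which is the kind of gap the QA/QC step would need to flag.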

Stage 3: Feature Extraction

Five primary indices are derived from the pre-processed imagery:

NDWI: Normalised Difference Water Index

Detects surface water and waterlogged areas (mosquito breeding habitat):

NDWI = (Green - NIR) / (Green + NIR)

Values > -0.1 indicate significant surface water presence.
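As a worked illustration of the formula and threshold, here is a small pure-Python sketch. The function names are hypothetical and the reflectance values are made up; in the pipeline itself the index would be computed on imagery in GEE rather than per pixel in Python.

```python
WATER_THRESHOLD = -0.1  # threshold stated in the text above


def normalised_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    if total == 0:
        return None
    return (a - b) / total


def ndwi(green, nir):
    """NDWI = (Green - NIR) / (Green + NIR)."""
    return normalised_difference(green, nir)


def is_surface_water(green, nir):
    """True when NDWI is defined and exceeds the water threshold."""
    value = ndwi(green, nir)
    return value is not None and value > WATER_THRESHOLD


# Made-up reflectances: open water reflects more green than NIR,
# dense vegetation reflects far more NIR than green.
water_pixel = is_surface_water(green=0.10, nir=0.05)
vegetated_pixel = is_surface_water(green=0.08, nir=0.30)
```

The same `normalised_difference` helper would serve any other index of this form, since only the pair of input bands changes.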

NDVI: Normalised Difference Vegetation Index

Tracks vegetation density and seasonal land-cover change:

NDVI = (NIR - Red) / (NIR + Red)

LST: Land Surface Temperature

Derived from Sentinel-3 SLSTR and Landsat thermal bands. Key malaria vector activity range: 25 °C to 30 °C.

Soil Moisture

Approximated from MODIS surface reflectance and Sentinel-1 SAR backscatter. Threshold values:

| Value | Interpretation |
| --- | --- |
| > 0.35 | High moisture, high vector risk (+40 pts) |
| 0.25–0.35 | Moderate moisture (+20 pts) |
| < 0.25 | Low moisture |
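The threshold table can be read as a simple lookup from index value to risk points. The sketch below assumes that the low-moisture band contributes 0 points (the table does not state a value) and that the boundary values 0.25 and 0.35 both fall in the moderate band; the function name is hypothetical.

```python
def soil_moisture_points(value):
    """Map a soil moisture index (0-1) to risk points per the threshold table."""
    if value > 0.35:
        return 40  # high moisture, high vector risk
    if value >= 0.25:
        return 20  # moderate moisture
    return 0  # low moisture; 0 points is an assumption, not stated in the table


points = [soil_moisture_points(v) for v in (0.40, 0.35, 0.25, 0.10)]
```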

Agric_Percentage

Proportion of the ward covered by agricultural land, derived from the ESA WorldCover 10 m land cover classification (Class 40, Cropland).
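One way to picture this calculation is as a pixel count over the land cover codes that fall inside a ward. The sketch below is illustrative only: the function name and the sample codes are made up, and it assumes every pixel covers the same ground area, which a real zonal statistic in GEE would handle more carefully.

```python
CROPLAND_CLASS = 40  # ESA WorldCover class code for Cropland


def agric_fraction(class_codes):
    """Fraction of a ward's pixels classified as cropland (0-1).

    class_codes: WorldCover class codes for the pixels inside one ward.
    Returns None for a ward with no pixels.
    """
    codes = list(class_codes)
    if not codes:
        return None
    cropland = sum(1 for code in codes if code == CROPLAND_CLASS)
    return cropland / len(codes)


# Made-up ward of eight pixels, four of them cropland.
ward_pixels = [40, 40, 10, 30, 50, 40, 80, 40]
fraction = agric_fraction(ward_pixels)
```

The result is a fraction; multiplying by 100 would give a percentage if the column name is meant literally.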


Stage 4: Spatial-Temporal Data Fusion

EO features are joined to health and ground-truth datasets using ward boundary polygons (GADM ADM3 for Limpopo) and monthly time steps.

EO Features (per ward, per month)
+ NICD case counts (per ward, per month)
+ Ground sensor readings (per station, interpolated to ward)
─────────────────────────────────────────
→ Fused dataset: Limpopo_Risk_Jan25_Jan26_Safe.csv
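The join itself can be sketched as a left join keyed on ward and month. This is a minimal illustration under stated assumptions: the `Cases`, `Reading` and `Sensor_Mean` field names are hypothetical, averaging the stations in a ward is only a simple stand-in for whatever interpolation the pipeline uses, and all sample values are made up.

```python
from collections import defaultdict
from statistics import mean

KEY = ("WardLabel", "Month")


def key_of(row):
    """Build the (ward, month) join key for a row."""
    return tuple(row[k] for k in KEY)


def station_means(readings):
    """Average station readings per ward and month (stand-in for interpolation)."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[key_of(reading)].append(reading["Reading"])
    return {k: mean(values) for k, values in grouped.items()}


def fuse(eo_rows, case_rows, readings):
    """Left-join case counts and station means onto the EO feature rows."""
    cases = {key_of(r): r["Cases"] for r in case_rows}
    sensors = station_means(readings)
    fused = []
    for row in eo_rows:
        k = key_of(row)
        # Wards or months with no match keep their EO features and get None.
        fused.append({**row, "Cases": cases.get(k), "Sensor_Mean": sensors.get(k)})
    return fused


eo = [
    {"WardLabel": "LIM331_1", "Month": "Jan 2025", "NDWI_Water": -0.05},
    {"WardLabel": "LIM331_1", "Month": "Feb 2025", "NDWI_Water": -0.12},
]
cases = [{"WardLabel": "LIM331_1", "Month": "Jan 2025", "Cases": 3}]
readings = [
    {"WardLabel": "LIM331_1", "Month": "Jan 2025", "Reading": 0.30},
    {"WardLabel": "LIM331_1", "Month": "Jan 2025", "Reading": 0.40},
]
fused = fuse(eo, cases, readings)
```

A left join keeps every EO row, so gaps in the health or sensor data show up as explicit missing values rather than silently dropped wards.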

CSV schema:

| Column | Type | Description |
| --- | --- | --- |
| Month | string | Jan 2025 … Jan 2026 |
| WardLabel | string | Ward code (e.g. LIM331_1) |
| Municipali | string | Municipality name |
| latitude | float | Ward centroid latitude |
| longitude | float | Ward centroid longitude |
| LST_Surface_C | float | Land surface temperature (°C) |
| Air_Temp_C | float | Air temperature (°C) |
| Soil_Moisture | float | Soil moisture index (0–1) |
| NDWI_Water | float | NDWI surface water index |
| Habitat_Vegetation_Index | float | Vegetation density (0–1) |
| Agric_Percentage | float | Agricultural area fraction |
| Population_Density_Per_KM2 | float | Population density |
| Habitat_Class_Code | float | ESA WorldCover class code |

Current dataset: 7,384 records · 13 months · Limpopo Province · 9 municipalities
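A schema like this lends itself to an automated check before the file is published. The sketch below validates rows against the column list using only the standard library; the `validate_row` helper, the municipality name "Example" and all sample values are hypothetical, and the real QA/QC scripts may check more than presence and numeric type.

```python
import csv
import io

# Column names and types as listed in the CSV schema.
SCHEMA = {
    "Month": str,
    "WardLabel": str,
    "Municipali": str,
    "latitude": float,
    "longitude": float,
    "LST_Surface_C": float,
    "Air_Temp_C": float,
    "Soil_Moisture": float,
    "NDWI_Water": float,
    "Habitat_Vegetation_Index": float,
    "Agric_Percentage": float,
    "Population_Density_Per_KM2": float,
    "Habitat_Class_Code": float,
}


def validate_row(row):
    """Return a list of problems found in one CSV row (empty when clean)."""
    problems = []
    for column, kind in SCHEMA.items():
        value = row.get(column)
        if value is None or value.strip() == "":
            problems.append(f"{column}: missing")
        elif kind is float:
            try:
                float(value)
            except ValueError:
                problems.append(f"{column}: not numeric ({value!r})")
    return problems


# Made-up two-row file: the second row has a bad LST value and no soil moisture.
sample = io.StringIO(
    ",".join(SCHEMA) + "\n"
    "Jan 2025,LIM331_1,Example,-23.9,29.4,28.5,26.1,0.31,-0.05,0.42,0.18,55.0,40\n"
    "Feb 2025,LIM331_1,Example,-23.9,29.4,n/a,26.8,,-0.02,0.45,0.18,55.0,40\n"
)
rows = list(csv.DictReader(sample))
report = {index: validate_row(row) for index, row in enumerate(rows)}
```

The stated totals are at least arithmetically consistent: 7,384 records over 13 months is exactly 568 rows per month, which would correspond to 568 wards if every ward appears once per month.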


Stage 5: Risk Modelling Engine

See System Architecture (Analytics & Modelling Engine) for the full risk formula.

Key outputs per ward per month:

  • Composite risk score (0–100)
  • Risk label: High / Moderate / Low
  • Risk colour: #d93025 / #f9bb06 / #34a853
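To show how the three outputs relate, the sketch below maps a composite score to a label and colour. The colours are those listed above, taken in the same order as the labels; the cut-offs of 70 and 40 are purely hypothetical placeholders, since the actual thresholds are defined with the risk formula in the System Architecture document.

```python
RISK_COLOURS = {"High": "#d93025", "Moderate": "#f9bb06", "Low": "#34a853"}

# Hypothetical cut-offs for illustration only; not taken from the source.
HIGH_CUTOFF = 70
MODERATE_CUTOFF = 40


def classify(score):
    """Return (label, colour) for a composite risk score in the 0-100 range."""
    if not 0 <= score <= 100:
        raise ValueError(f"risk score out of range: {score}")
    if score >= HIGH_CUTOFF:
        label = "High"
    elif score >= MODERATE_CUTOFF:
        label = "Moderate"
    else:
        label = "Low"
    return label, RISK_COLOURS[label]


examples = [classify(score) for score in (85, 50, 10)]
```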

Stage 6: Surveillance Dashboard Output

The processed risk data is served to the frontend as a CSV file loaded at runtime by PapaParse:

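```javascript
// allCSVData (array) and districtLookup (object) are assumed to be declared earlier.
Papa.parse('../data/Limpopo_Risk_Jan25_Jan26_Safe.csv', {
  download: true,       // fetch the file over HTTP
  header: true,         // first row holds the column names
  dynamicTyping: true,  // convert numeric strings to numbers
  // Called once per row, so the file is processed as it streams in.
  step: ({ data }) => {
    // Skip blank or malformed rows that have no municipality.
    if (data.Municipali) {
      allCSVData.push(data);
      (districtLookup[data.Municipali] ||= []).push(data);
    }
  },
  complete: () => renderAllHotspots()
});
```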

The dashboard renders Leaflet circle markers coloured by risk score, with popups, time-slider filtering and dropdown drill-down.


Validation feedback loop

Post-render, stakeholder feedback from UCT, CSIR and NICD is used to:

  1. Audit risk score accuracy against known disease burden data
  2. Adjust index weights in the modelling engine
  3. Validate QA/QC thresholds
  4. Refine ward boundary alignment

This loop is formalised in Milestone 3 (Prototype Validation & Testing).