Lead Data Acquisition Engineer – UK Commercial Energy Platform
We’re building a national-scale data platform for UK commercial energy. At its core is a unified view of every commercial building in the UK, with an estimate of annual energy consumption and load profile for each occupant. We’ve already built the core spine (AddressBase, VOA, leases, CCOD/OCOD, INSPIRE, planning, NNDR, EPC, permits, renewables, Companies House). Now we need someone to own data acquisition and occupant modelling on top of it.

ROLE
Lead Data Acquisition Engineer – UK Commercial Energy Platform
Type: Full-time or long-term contract
Location: Remote (UK or Europe timezone preferred)

WHAT WE’VE BUILT SO FAR
Our current building/occupant spine includes:
• OS AddressBase Core as the UPRN spine
• VOA valuation and floor area data
• Long leases
• CCOD / OCOD + INSPIRE polygons
• Planning application data and NNDR (where available)
• EPC non-domestic data
• Environment Agency & DEFRA permitting datasets
• UK coverage of existing renewable projects
• Companies House API linkage

Your job is to sit on top of this spine and turn it into something truly useful for per-occupant energy modelling.

THE PROBLEM YOU’RE SOLVING
For each of ~2 million UK commercial buildings we want to know:
• Who the actual occupant(s) are
• How they operate in detail
• What that implies for energy use and load shape

A plastics manufacturer is not the same as a frozen food warehouse, an office, or a logistics hub. We care about:
• What they manufacture or do
• What machinery they have on-site
• What processes they run, and when they run them

This is not a one-off scrape. It’s a systematic, repeatable pipeline that touches millions of rows.

WHAT YOU WILL DO
• Own data acquisition and scraping
  • Design and run scraping / ingestion pipelines for:
    – DNO and other network datasets
    – Government and regulator datasets
    – Company-level and facility-level data beyond Companies House
    – Public signals of operations: websites, “our plant” pages, datasheets, job ads, fleet pages, Google Maps / Street View, industry directories, etc.
  • Build robust scrapers at scale (see the sketch after this block):
    – Parallelisation, retries, throttling, proxy management, error handling
    – Logging and monitoring so we know what ran, what failed, and why
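To show the flavour of the work, here is a minimal sketch of the kind of fetcher we mean: bounded concurrency, retries with exponential backoff, and per-URL logging so failures are visible rather than silent. It uses httpx and asyncio; the example URLs, concurrency cap and retry budget are placeholders, not our production settings.

```python
# Minimal sketch of a polite, observable async fetcher (illustrative only).
# httpx is assumed installed; URLS, MAX_CONCURRENCY and RETRIES are placeholders.
import asyncio
import logging

import httpx

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("acquisition")

URLS = ["https://example.com/a", "https://example.com/b"]  # placeholder targets
MAX_CONCURRENCY = 10  # throttle: at most 10 requests in flight
RETRIES = 3           # per-URL retry budget

async def fetch(client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str) -> str | None:
    async with sem:  # enforce the concurrency cap
        for attempt in range(1, RETRIES + 1):
            try:
                resp = await client.get(url, timeout=30.0)
                resp.raise_for_status()
                log.info("ok url=%s status=%s attempt=%d", url, resp.status_code, attempt)
                return resp.text
            except httpx.HTTPError as exc:
                log.warning("fail url=%s attempt=%d error=%r", url, attempt, exc)
                await asyncio.sleep(2 ** attempt)  # back off before retrying
    log.error("gave up on url=%s after %d attempts", url, RETRIES)
    return None  # recorded as a failure, not silently dropped

async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    # Proxy rotation would plug in here via httpx's proxy settings.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        pages = await asyncio.gather(*(fetch(client, sem, u) for u in URLS))
    log.info("fetched %d/%d pages", sum(p is not None for p in pages), len(URLS))

if __name__ == "__main__":
    asyncio.run(main())
```

At our scale the URL list would come from a persisted queue and outcomes would land in the warehouse, so failures can be replayed rather than re-run wholesale.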
• Resolve who actually occupies each building
  • Extend our NNDR-based approach and close the gaps: link buildings to occupants using NNDR, Companies House, planning & permitting data, web presence and other public sources
  • Build an entity resolution pipeline that:
    – Normalises and matches company names and addresses
    – Uses fuzzy matching with confidence scores (see the sketch after this block)
    – Maintains a master building-to-occupant table with history and provenance
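For the name-matching step, a minimal sketch follows using rapidfuzz (one common choice for scored fuzzy matching; naming it here is an illustration, not a tooling mandate). The legal-suffix list, the 90-point threshold and the toy registry with fabricated company numbers are assumptions for the example only.

```python
# Minimal sketch of company-name normalisation + scored fuzzy matching
# (illustrative only; suffix list, threshold and registry are assumptions).
import re

from rapidfuzz import fuzz, process

LEGAL_SUFFIXES = re.compile(r"\b(limited|ltd|plc|llp|holdings|group)\b\.?", re.I)

def normalise(name: str) -> str:
    """Lowercase, strip legal suffixes and punctuation, collapse whitespace."""
    name = LEGAL_SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

def best_match(candidate: str, registry: dict[str, str], threshold: float = 90.0):
    """Return (company_id, confidence 0-100), or None below the threshold."""
    keys = {normalise(k): v for k, v in registry.items()}
    hit = process.extractOne(normalise(candidate), list(keys), scorer=fuzz.token_sort_ratio)
    if hit is not None and hit[1] >= threshold:
        return keys[hit[0]], hit[1]
    return None  # low confidence: route to manual review, never silently accept

# Toy registry: Companies House name -> company number (fabricated numbers).
registry = {"ACME PLASTICS LIMITED": "01234567", "ACME LOGISTICS LTD": "07654321"}
print(best_match("Acme Plastics Ltd.", registry))  # ('01234567', 100.0)
```

In the real pipeline the match score and source dataset would be stored on every building-to-occupant link, which is what keeps the master table auditable across refreshes (the history and provenance requirement above).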
• Engineer occupant-specific, process-level variables: for each building occupant, design and populate the variables that matter for energy, for example:
  • Industry and sub-industry (SIC + text classification)
  • Building function / process type:
    – Manufacturing vs distribution vs office vs retail
    – Plastics vs food vs metals vs pharma, etc.
    – Cold storage, data centre, heavy process, light assembly
  • Operational characteristics:
    – Opening hours and shift patterns
    – 24/7 vs office hours
    – Indicative vehicle and truck movements
    – Refrigeration, compressed air, process heat, HVAC type
  • Machinery and equipment indicators, where possible:
    – Presence of large motors, injection moulders, CNC machines, presses, ovens, kilns, furnaces, chillers, freezers, compressors, data centre racks, etc.
    – Signals from permits, product specs, job adverts (“CNC milling centre”, “ammonia refrigeration plant”), site photos, equipment lists, OEM case studies and similar
  • Join all of this back to:
    – VOA dimensions
    – EPC primary energy and HVAC/fuel indicators
    – Scope 2 and emissions disclosures where available

The key is depth and uniformity. A cold-storage warehouse will have different variables from a law firm, and a plastics injection-moulding plant different again – but everything must land in a consistent, model-ready structure across ~2M rows.

• Build and document the data layer for the modelling team
  • Design schemas for long-term use and refresh
  • Implement ETL/ELT workflows (ingest → clean → enrich → publish)
  • Add basic data-quality checks and reporting
  • Document sources, joins and assumptions so others can work confidently on top of your layer

WHAT YOU SHOULD ALREADY HAVE DONE
• 3–6+ years as a Data Engineer, Data Acquisition Engineer or similar
• Proven experience scraping and integrating large public or government datasets at scale
• A track record of production scraping pipelines, not just one-off scripts
• Strong entity-resolution background:
  – Fuzzy matching, deduplication, record linkage across messy sources
  – Ideally with companies and addresses
• Experience turning unstructured information (websites, PDFs, job ads, photos) into structured variables
• Experience with UK data (ONS, EPC, VOA, NNDR, planning, AddressBase, etc.) is a strong plus

TECHNICAL SKILLS – MUST HAVE
• Strong Python:
  – requests or httpx
  – BeautifulSoup or lxml
  – Scrapy and/or Playwright or Selenium for JS-heavy sites
• Strong SQL and experience with a relational warehouse (Postgres, BigQuery, Snowflake or similar)
• Experience with an orchestration tool: Airflow, Prefect, Dagster or similar
• Comfort with:
  – Parallel and async scraping
  – Proxy rotation and basic anti-bot strategies
  – Designing and versioning schemas
  – Normalising and matching UK addresses and postcodes
• Basic geospatial comfort:
  – UPRN / UARN, postcodes, lat-long
  – GeoPandas / Shapely / PostGIS at a practical level
• Git and collaborative development workflows

NICE TO HAVE
• Direct exposure to OS AddressBase, VOA, EPC, NNDR, INSPIRE polygons or similar datasets
• Experience in energy, utilities, carbon accounting or real-estate analytics
• Use of NLP for text classification and keyword tagging over large corpora
• Experience with graph databases for relationship modelling

WHAT KIND OF PERSON WILL FIT
• You like turning messy, inconsistent public data into clean, reliable tables
• You enjoy thinking about data models and feature design, not just writing scrapers
• You’re comfortable working closely with founders and making pragmatic trade-offs
• You care about building pipelines that can run repeatedly without babysitting