Data Extraction Engineer (Remote)
Posted 20 hours ago
Madrid
At Hays, we are collaborating with a global manufacturer of metal components for the automotive industry, present in 22 countries with more than 40,000 employees. This industrial environment generates large volumes of data across heterogeneous sources (SAP ERP, SCADA systems, MES platforms, IoT sensors, relational databases, and document stores) that must be reliably extracted, normalised, and made available for analytics, reporting, and AI workloads.

We are currently looking for a Data Extraction Engineer who specialises in getting data out of complex source systems and into an Azure-based lakehouse reliably and at scale. You will own the extraction and ingestion layer end to end, building robust pipelines that connect the company's operational systems to the Bronze layer of its Medallion architecture on Databricks.

What are the requirements?
• 4+ years of hands-on experience building data extraction and ingestion pipelines in production environments.
• Python proficiency for pipeline scripting, custom Airflow operators, and data validation logic.
• Strong SQL skills across multiple engines: SQL Server, PostgreSQL, Azure SQL, and SAP HANA (CDS Views a strong plus).
• Practical experience extracting from and integrating MongoDB: change streams, oplog-based CDC, and schema-flexibility handling.
• Proficiency with Apache Airflow: authoring complex DAGs, managing dependencies, handling retries, and monitoring pipeline health (a minimal sketch follows this list).
• Solid understanding of CDC patterns: log-based vs. query-based, Debezium, Kafka connectors, and Azure Event Hubs integration.
• Experience landing data into cloud lakehouse architectures (Azure Data Lake + Delta Lake / Databricks).
• Experience extracting data from SAP (ABAP SDK, CDS Views, BAPI/RFC, OData) is highly valued.
• Familiarity with industrial data sources: SCADA historians, MES systems, OPC-UA, MQTT.
• Knowledge of dbt for downstream transformation layers.
• Experience in multi-plant or multi-country enterprise environments.
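To make the day-to-day concrete, here is a minimal sketch of the watermark-based incremental extraction pattern named above, assuming Airflow 2.x with the Microsoft SQL Server provider installed. The DAG id, connection id, table, and lake path are all hypothetical, and the Bronze write assumes the adlfs/fsspec driver and storage credentials are configured; treat this as an illustration of the pattern, not a reference implementation.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="bronze_orders_incremental",                  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:

    @task
    def extract_changed_rows(data_interval_start=None, data_interval_end=None):
        # Airflow injects the scheduled interval; using it as the extraction
        # watermark keeps the task idempotent: re-running a slot re-reads
        # exactly the same rows (query-based CDC).
        from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

        hook = MsSqlHook(mssql_conn_id="erp_sqlserver")  # hypothetical connection
        df = hook.get_pandas_df(
            "SELECT * FROM dbo.orders"
            " WHERE updated_at >= %s AND updated_at < %s",
            parameters=(data_interval_start, data_interval_end),
        )
        # Hypothetical Bronze path; interval-scoped file names keep loads replayable.
        df.to_parquet(
            f"abfss://bronze@lake.dfs.core.windows.net/orders/"
            f"{data_interval_start:%Y%m%d%H}.parquet",
            index=False,
        )
        return len(df)                                   # row count for monitoring

    extract_changed_rows()

Retries, interval-scoped queries, and deterministic output paths are what make a pipeline like this safe to re-run, which is the operational bar this role sets.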
Data Extraction & Ingestion (core focus) • Design and implement data extraction pipelines from SAP HANA (CDS Views, ABAP SDK) , relational databases (SQL Server, PostgreSQL, Azure SQL), document stores (MongoDB), and SCADA/MES/IoT platforms handling high‑frequency time‑series data., • Build and maintain CDC pipelines from SAP and operational systems into Azure Data Lake Storage Gen2 using Apache Kafka / Azure Event Hubs ., • Develop API-based connectors for SaaS platforms (HR systems, quality tools, third‑party suppliers) according to business needs., • Orchestrate ingestion workflows using Apache Airflow (MWAA or AKS) , ensuring pipelines are idempotent, observable, and operationally robust., • Implement monitoring, alerting, and SLA tracking to ensure data freshness and pipeline reliability., • Manage incremental versus full-load strategies tailored to source system load, data volume, and latency requirements., • Apply data quality validations at ingestion , including completeness checks, schema and type validation, referential integrity, and duplicate detection., • Document source system schemas, data models, data dictionaries, and known data quality issues., • Partner with AI/ML and AI Agents teams to deliver curated datasets for analytics, RAG, and machine‑learning workloads., • Produce and maintain technical documentation, including extraction specifications, data contracts, pipeline runbooks, and ADRs ., • Remote work model but with possibility of on-site/hybrid model., • Location: Madrid (Headquarters). Hay opciones de teletrabajo/trabajo desde casa disponibles para este puesto.