Senior Python Developer
22 hours ago
Vigo
Senior Python Developer, Data Platform — Architecture & Infrastructure Lead

ABOUT INFOBEL PRO

InfobelPro is a leading provider of B2B company data and points-of-interest intelligence. Our platform ingests, deduplicates, enriches, and cross-references company information from over a dozen authoritative sources — spanning national business registries, commercial data providers, open geographic datasets, and professional networks — to serve millions of entity records to customers via API, filtered queries, and bulk exports.

We are building an entirely new data platform from the ground up. This is not a migration of legacy code into a new framework — it is a clean-sheet architecture designed to process billions of raw records into a unified, entity-resolved dataset that is accurate, auditable, and GDPR-compliant by design. The platform is built on a deliberately modern, open-source stack: Dagster for orchestration, DuckDB for processing, ClickHouse for serving, Splink for entity resolution, and FastAPI for the customer-facing layer. Everything runs on bare-metal servers (Hetzner) with infrastructure managed through Ansible — no cloud lock-in, no managed services, no abstraction layers between the team and the hardware.

ABOUT THE ROLE

You will be the technical lead of a three-person team building InfobelPro’s next-generation data platform from a blank repository. This is not a role where you inherit someone else’s architecture and maintain it: you will design the schema, lay the infrastructure, choose the patterns, and set the engineering culture that two junior developers will grow into. The platform processes billions of records from diverse sources, resolves them into a unified entity graph, and serves them at low latency to customers who depend on the data for critical business decisions. You will own all of it: the pipeline that transforms raw files into enriched entities, the ClickHouse schema that serves them, the infrastructure that keeps it running, and the GDPR compliance layer that keeps it legal.

This is a high-autonomy role. You will report directly to the CTO and make architectural decisions with real consequences. The right candidate has previously owned a data pipeline end-to-end in production — including the infrastructure it runs on — and understands that the hardest problems are not algorithmic but operational: keeping a system reliable when data sources change format at midnight, pipelines fail silently, and the disk fills up on a Friday.

WHAT YOU WILL DO

• Design and implement the ClickHouse schema for the entity-centric data model — companies, POIs, persons, addresses — optimised for three access patterns: point lookups, filtered analytical queries, and bulk exports.
• Build and operate the infrastructure: provision and configure Hetzner bare-metal servers using Ansible, manage the ClickHouse cluster (3 replicas + processing node + hot spare), and configure networking, firewalls, and SSH access.
• Architect the Dagster pipeline: define the asset graph, set up partitioning for monthly source ingestion, implement retry and backfill strategies, and establish the patterns that junior developers will follow for every new data source (see the illustrative Dagster sketch at the end of this posting).
• Own entity resolution: configure and tune Splink for probabilistic matching across company records from different sources. Define blocking rules, comparison functions, and match thresholds. This requires judgment — balancing precision against recall for datasets with hundreds of millions of records (see the illustrative Splink sketch at the end of this posting).
• Implement the GDPR compliance layer: amendment tables, query-time filtering for erasure/rectification/restriction, immutable audit logging, and 7-year retention on tamper-proof backup storage.
• Set up monitoring and alerting: Grafana dashboards for infrastructure health, pipeline status, and data quality trends; Prometheus for server metrics; Pydantic Logfire for data validation; on-call response when something breaks.
• Design the backup and disaster recovery strategy: rclone-based backups to R2 and Hetzner Storage Box, tested recovery procedures, and documented runbooks.
• Mentor two junior developers: conduct code reviews, pair on complex problems, design onboarding tasks that build real competence, and create a team culture where people ship with confidence.
• Architect the AI enrichment layer (Phase 3): design the agent framework, define the multi-tier model strategy (cheap bulk models for gap-filling, frontier models for diagnostics), and implement token budget caps and ROI tracking per task type.
• Use AI coding tools as a force multiplier. We use tools like Claude Code and Cursor daily and are actively pushing towards multi-agent workflows — orchestrating parallel AI sessions to scaffold connectors, generate tests, and prototype solutions simultaneously. You will be expected to adopt this way of working and help the team get better at it.

WHAT WE ARE LOOKING FOR

Required

• 8+ years of professional software development experience, primarily in Python.
• You have owned a data pipeline in production end-to-end — from ingestion through transformation to serving — and you were responsible for keeping it running, not just writing the initial code.
• An enthusiastic and hands-on AI user, actively leveraging tools such as Claude Code, Cursor, and similar AI copilots multiple times a day to accelerate the delivery of working software, automate workflows, and enhance analytical tasks.
• You understand how to effectively prompt, iterate, and validate AI-generated outputs to ship reliable working solutions faster.
• You see AI as a force multiplier, not a shortcut, and know how to combine it with strong fundamentals in SQL and Python.
• Strong infrastructure skills: you are comfortable with Linux server administration, Ansible (or equivalent IaC), networking fundamentals (firewall rules, SSH tunnels, VPN/vSwitch configuration), and diagnosing production issues from logs and metrics.
• Deep SQL knowledge and experience with analytical/columnar databases. ClickHouse experience is a significant plus; experience with any OLAP system (BigQuery, Redshift, DuckDB, Druid) translates well.
• Experience designing data models for large-scale analytical workloads — you understand denormalisation trade-offs, partitioning strategies, and why the entity model matters more than the engine.
• You have mentored junior developers before and believe that a well-structured codebase and clear task scoping are more valuable than heroic individual effort.
• You use AI-assisted development tools (Claude Code, Cursor, GitHub Copilot, or equivalent) as part of your daily workflow — not as a novelty, but as a core part of how you write and ship code. You should be comfortable prompting AI agents, evaluating their output critically, and knowing when AI-generated code needs manual correction. We will ask you to demonstrate this during the hiring process.
• Comfort with ambiguity. This is a ground-up build: there is no existing architecture to reference and no established patterns to follow. You will define both.

Strong Pluses

• Experience with ClickHouse in production (schema design, ReplacingMergeTree, cluster management, query optimisation).
• Experience with probabilistic entity resolution or record linkage (Splink, Dedupe, Zingg, or custom implementations).
• Familiarity with Dagster, Airflow, Prefect, or similar orchestration frameworks.
• Experience with DuckDB or other embedded analytical engines.
• Background in B2B data, company data, or master data management.
• Experience with GDPR compliance in data platforms (erasure, rectification, audit trails).
• Experience orchestrating multi-agent AI development workflows — running concurrent Claude Code or Codex sessions, delegating parallelisable tasks to agents, and integrating results into a coherent codebase. If you are already doing this, you will fit right in.
• Experience building custom AI agent pipelines for data tasks — not just chat interfaces, but structured LLM workflows for data extraction, classification, or enrichment with programmatic validation.
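To give candidates a concrete feel for the orchestration work described under WHAT YOU WILL DO, here is a minimal Dagster sketch of a monthly-partitioned ingestion asset with a retry policy. The asset name, partition start date, and staging step are illustrative placeholders, not InfobelPro's actual pipeline code.

```python
from dagster import (
    AssetExecutionContext,
    Definitions,
    MonthlyPartitionsDefinition,
    RetryPolicy,
    asset,
)

# Hypothetical partition scheme: one partition per monthly source drop.
monthly_partitions = MonthlyPartitionsDefinition(start_date="2024-01-01")


@asset(
    partitions_def=monthly_partitions,
    retry_policy=RetryPolicy(max_retries=3, delay=60),  # retry transient source failures
)
def raw_companies(context: AssetExecutionContext) -> None:
    """Ingest one monthly drop of a raw company source into staging (illustrative)."""
    month = context.partition_key  # e.g. "2024-01-01"
    context.log.info(f"Ingesting source drop for partition {month}")
    # Download, validate, and load the month's files into DuckDB here.


defs = Definitions(assets=[raw_companies])
```

New connectors would follow the same pattern: one partitioned asset per source, with retries and backfills handled by the orchestrator rather than ad-hoc scripts.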
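To illustrate the precision/recall trade-offs mentioned in the entity-resolution bullet, here is a small sketch assuming Splink 4's top-level API with the DuckDB backend. The toy records, column names, blocking rules, and comparison choices are assumptions for illustration only, not the production configuration.

```python
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Toy input; Splink expects a unique_id column by default.
df_companies = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "company_name": ["Acme S.L.", "ACME SL", "Beta GmbH"],
        "postcode": ["36201", "36201", "10115"],
        "city": ["Vigo", "Vigo", "Berlin"],
    }
)

# Blocking rules bound the candidate-pair space; the comparisons and the chosen
# match threshold are where precision gets traded against recall.
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("postcode"),              # cheap, high-recall block
        block_on("company_name", "city"),  # tighter block for noisier fields
    ],
    comparisons=[
        cl.NameComparison("company_name"),
        cl.ExactMatch("postcode"),
        cl.ExactMatch("city"),
    ],
)

linker = Linker(df_companies, settings, db_api=DuckDBAPI())
# In a real pipeline the model's m/u parameters are estimated on the full data
# before scoring, and pairs above an agreed match-probability threshold become
# candidate merges for the unified entity graph.
```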