Data Engineer Needed to Build ETL Pipeline from AWS S3 to PostgreSQL
We are looking for an experienced Data Engineer or Backend Developer to design and implement an ETL process that ingests raw files from AWS S3 into our PostgreSQL database.
This project involves handling raw Excel and CSV files stored in S3, staging them in their raw form, and transforming them into structured tables that power internal tools and reporting.
Responsibilities:
Build an automated ETL pipeline that:
Downloads and reads files from specific AWS S3 folders
Parses and stores raw data into a staging schema in PostgreSQL
Transforms raw data into a clean, structured format based on business rules
Write and maintain PostgreSQL stored procedures (PL/pgSQL) for data transformation
Implement error handling, validation, and basic data quality checks
Collaborate with our internal team to understand schema requirements and transformation logic
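The pipeline responsibilities above could be sketched roughly as follows. All bucket, schema, and table names (staging.raw_rows, etc.) are hypothetical placeholders; the boto3/pandas/SQLAlchemy imports are kept inside the I/O functions so the pure routing helper stays testable without AWS credentials:

```python
import os
import pathlib


def is_ingestible(key: str) -> bool:
    """Route only the file types the pipeline understands (CSV and Excel)."""
    return pathlib.Path(key).suffix.lower() in {".csv", ".xls", ".xlsx"}


def download_new_files(bucket: str, prefix: str, dest: str) -> list[str]:
    """Pull raw files from a specific S3 folder into a local working directory."""
    import boto3  # assumed available in the ETL environment

    s3 = boto3.client("s3")
    paths = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not is_ingestible(key):
                continue
            local = os.path.join(dest, os.path.basename(key))
            s3.download_file(bucket, key, local)
            paths.append(local)
    return paths


def stage_file(path: str, dsn: str) -> None:
    """Parse one raw file and append it, untyped, to a hypothetical
    staging.raw_rows table; transformation happens later in PL/pgSQL."""
    import pandas as pd
    from sqlalchemy import create_engine

    reader = pd.read_excel if path.endswith((".xls", ".xlsx")) else pd.read_csv
    df = reader(path, dtype=str)  # keep everything as text at the staging layer
    df["source_file"] = os.path.basename(path)
    engine = create_engine(dsn)
    df.to_sql("raw_rows", engine, schema="staging", if_exists="append", index=False)
```

Keeping the staging layer text-typed defers all casting and business rules to the transformation step, which makes load failures rarer and easier to diagnose.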
Requirements:
Strong experience with PostgreSQL (including schema design, indexing, and PL/pgSQL)
Proficient in Python for scripting ETL processes
Experience working with AWS S3, including file access and automation
Ability to handle ingestion from Excel and CSV files
Familiarity with unstructured → structured data workflows
Strong SQL skills, including writing complex queries and stored procedures
Familiarity with Pandas, SQLAlchemy, or psycopg2 (a plus)
Bonus: Experience with Airflow, dbt, or other orchestration/transform tools
Deliverables:
A working ETL pipeline that can be triggered manually or on a schedule
Scripts for ingestion, transformation, and loading of data
Well-documented logic and setup instructions
PostgreSQL stored procedures for transforming staged data
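As an illustration of the stored-procedure deliverable, a minimal PL/pgSQL sketch that promotes staged rows into a production table. Table and column names (staging.raw_rows, prod.invoices) are hypothetical, not the client's actual schema:

```sql
-- Hypothetical transform: staging.raw_rows -> prod.invoices
CREATE OR REPLACE PROCEDURE prod.transform_invoices()
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO prod.invoices (invoice_no, amount, invoiced_at, source_file)
    SELECT
        trim(r.invoice_no),
        r.amount::numeric(12, 2),       -- casts enforce typing at promotion time
        r.invoiced_at::date,
        r.source_file
    FROM staging.raw_rows AS r
    WHERE r.invoice_no IS NOT NULL
    ON CONFLICT (invoice_no) DO NOTHING;  -- keeps re-runs idempotent
END;
$$;

-- Invoke manually or from the scheduler:
-- CALL prod.transform_invoices();
```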
Database Info:
Hosted on PostgreSQL
Access provided via secure VPN
Tables are divided into raw/staging schemas and production-ready schemas
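The raw/staging vs production split described above might look like the following DDL sketch (schema and column names are illustrative only):

```sql
CREATE SCHEMA IF NOT EXISTS staging;  -- raw, text-typed landing zone
CREATE SCHEMA IF NOT EXISTS prod;     -- clean, typed, constrained tables

CREATE TABLE IF NOT EXISTS staging.raw_rows (
    id          bigserial PRIMARY KEY,
    invoice_no  text,
    amount      text,                 -- everything lands as text
    invoiced_at text,
    source_file text NOT NULL,
    loaded_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS prod.invoices (
    invoice_no  text PRIMARY KEY,
    amount      numeric(12, 2) NOT NULL,
    invoiced_at date NOT NULL,
    source_file text NOT NULL
);
```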
Project Timeline:
We’d like to begin immediately, with a goal to complete the initial pipeline within 2–4 weeks, depending on complexity.
Screening Questions:
1. How do you detect and handle malformed or missing data during ETL?
2. Explain how you’d link each invoice row to carrier_id, carrier_service_id, and surcharge_id.
3. How do you handle partial failures (bad rows, missing FK matches, etc.) without halting the ETL?
4. Which tools/apps have you used in the past that may make this kind of work easier/faster while providing an informative GUI?