Client Project Handling Questions

 

Data Engineer Needed to Build ETL Pipeline from AWS S3 to PostgreSQL

We are looking for an experienced Data Engineer or Backend Developer to design and implement an ETL process that ingests raw files from AWS S3 into our PostgreSQL database. This project involves handling raw Excel and CSV files stored in S3, staging them in an unstructured format, and transforming them into structured tables that power internal tools and reporting.

Responsibilities:
- Build an automated ETL pipeline that:
  - Downloads and reads files from specific AWS S3 folders
  - Parses and stores raw data into a staging schema in PostgreSQL
  - Transforms raw data into a clean, structured format based on business rules
- Write and maintain PostgreSQL stored procedures (PL/pgSQL) for data transformation
- Implement error handling, validation, and basic data quality checks
- Collaborate with our internal team to understand schema requirements and transformation logic

Requirements:
- Strong experience with PostgreSQL (including schema design, indexing, and PL/pgSQL)
- Proficiency in Python for scripting ETL processes
- Experience working with AWS S3, including file access and automation
- Ability to handle ingestion from Excel and CSV files
- Familiarity with unstructured → structured data workflows
- Strong SQL skills, including writing complex queries and stored procedures
- Familiarity with Pandas, SQLAlchemy, or psycopg2 (a plus)
- Bonus: experience with Airflow, dbt, or other orchestration/transformation tools

Deliverables:
- A working ETL pipeline that can be triggered manually or on a schedule
- Scripts for ingestion, transformation, and loading of data
- Well-documented logic and setup instructions
- PostgreSQL stored procedures for transforming staged data

Database Info:
- Hosted on PostgreSQL
- Access provided via secure VPN
- Tables are divided into raw/staging schemas and production-ready schemas

Project Timeline: We'd like to begin immediately, with a goal of completing the initial pipeline within 2–4 weeks, depending on complexity.
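For context, here is a minimal sketch of the ingestion step described above. The bucket, prefix, connection string, and staging table names are all hypothetical placeholders, not the client's actual setup:

```python
import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical names: replace with the client's actual bucket, prefix, and DSN.
BUCKET = "client-invoices"
PREFIX = "incoming/"
engine = create_engine("postgresql+psycopg2://etl_user:secret@localhost:5432/warehouse")

def ingest_s3_files():
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        # Pick a parser by extension; Excel reading needs openpyxl installed.
        if key.endswith(".csv"):
            df = pd.read_csv(io.BytesIO(body), dtype=str)
        elif key.endswith((".xlsx", ".xls")):
            df = pd.read_excel(io.BytesIO(body), dtype=str)
        else:
            continue
        df["source_file"] = key  # keep lineage for auditing
        # Land everything as text in the staging schema; typing and business
        # rules are applied later by the PL/pgSQL transformation layer.
        df.to_sql("raw_files", engine, schema="staging", if_exists="append", index=False)

if __name__ == "__main__":
    ingest_s3_files()
```

Landing all columns as text first keeps the staging layer forgiving: a file with odd formatting still loads, and type enforcement happens in one place downstream.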
1. How do you detect and handle malformed or missing data during ETL?
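One common approach is to validate each file against a required-column contract, coerce types defensively, and quarantine failing rows rather than dropping them silently. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

REQUIRED = ["invoice_no", "carrier", "amount"]  # hypothetical required columns

def validate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a raw frame into clean rows and quarantined rows."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        # Structural problem: the whole file is rejected, not individual rows.
        raise ValueError(f"File is missing required columns: {missing}")
    # Coerce types defensively; unparseable values become NaN instead of raising.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    bad = df[df[REQUIRED].isna().any(axis=1)]
    good = df.drop(bad.index)
    return good, bad
```

Quarantined rows would then be written to a reject table with a reason code, so nothing is lost and data-quality issues can be reported back to the source.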

2. Explain how you’d link each invoice row to carrier_id, carrier_service_id, and surcharge_id.
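One way to do this is to join staged rows to the dimension tables on their natural keys and resolve all three surrogate ids in a single set-based statement. Every schema, table, and column name below is a hypothetical stand-in for the client's actual model:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://etl_user:secret@localhost:5432/warehouse")

# LEFT JOINs keep rows with no dimension match (their ids stay NULL),
# so they can be detected and quarantined instead of silently vanishing.
LINK_SQL = text("""
    INSERT INTO production.invoice_lines
        (invoice_no, carrier_id, carrier_service_id, surcharge_id, amount)
    SELECT r.invoice_no,
           c.carrier_id,
           cs.carrier_service_id,
           s.surcharge_id,
           r.amount::numeric
    FROM staging.invoice_rows r
    LEFT JOIN production.carriers c
           ON c.carrier_code = r.carrier_code
    LEFT JOIN production.carrier_services cs
           ON cs.carrier_id = c.carrier_id
          AND cs.service_code = r.service_code
    LEFT JOIN production.surcharges s
           ON s.surcharge_code = r.surcharge_code
""")

with engine.begin() as conn:
    conn.execute(LINK_SQL)
```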

3. How do you handle partial failures (bad rows, missing FK matches, etc.) without halting the ETL?
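A common pattern is per-row savepoints: a failing row is rolled back to its savepoint and diverted to a reject table while the rest of the run continues. A sketch using SQLAlchemy nested transactions, with hypothetical table and column names:

```python
import logging

from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO)
engine = create_engine("postgresql+psycopg2://etl_user:secret@localhost:5432/warehouse")

# Hypothetical promotion statement; in the real pipeline this could instead
# CALL one of the PL/pgSQL stored procedures.
PROMOTE = text("INSERT INTO production.invoice_lines (invoice_no, amount) "
               "VALUES (:invoice_no, :amount)")
REJECT = text("INSERT INTO staging.rejects (payload, reason) "
              "VALUES (:payload, :reason)")

def promote(rows):
    """Promote staged rows one at a time; a failing row rolls back to its
    savepoint and is diverted to a reject table instead of aborting the run."""
    with engine.begin() as conn:
        for row in rows:
            try:
                with conn.begin_nested():  # SAVEPOINT per row
                    conn.execute(PROMOTE, row)
            except Exception as exc:
                logging.warning("rejecting %s: %s", row, exc)
                conn.execute(REJECT, {"payload": str(row), "reason": str(exc)})
```

Per-row savepoints trade some speed for resilience; for large batches the same idea works at batch granularity, retrying a failed batch row by row.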

4. Which tools or apps have you used in the past that could make this kind of work easier or faster while also providing an informative GUI?
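Airflow, mentioned in the brief as a bonus, is a natural answer here: its web UI shows DAG runs, task states, logs, and retries at a glance. A minimal DAG sketch, assuming Airflow 2.x; the dag_id and the task callables are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_s3_files():
    ...  # hypothetical: the S3-to-staging loader sketched earlier

def run_transform_procedures():
    ...  # hypothetical: calls the PL/pgSQL transformation procedures

with DAG(
    dag_id="s3_to_postgres_etl",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",            # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_s3_files)
    transform = PythonOperator(task_id="transform",
                               python_callable=run_transform_procedures)
    ingest >> transform
```

This also satisfies the deliverable of a pipeline that can be triggered manually or on a schedule: the same DAG runs daily and can be launched on demand from the UI.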

