Available · Data engineering · On-site · Remote · Freelance

Kelash
Kumar.

Data Engineer · Sukkur, Pakistan  ·  BS Computer Science
Sukkur IBA University  ·  Class of 2026

I ship streaming pipelines — production-grade, observable, and built to survive load.

Selected work

§ 01 / Pipelines
01
Live Streaming Production

Real-Time Crypto Streaming Pipeline

Python · Redpanda · PostgreSQL · dbt · Airflow · Metabase · Docker · Oracle Cloud

A six-stage streaming system pulling live crypto prices from CoinGecko through Redpanda, persisting them in PostgreSQL, transforming with dbt, orchestrating with Airflow, and surfacing them in a Metabase dashboard. Eight containers, end-to-end. Deployed to Oracle Cloud's free-tier ARM VM with CI/CD shipping changes in ninety seconds.

Architecture  /  6 stages · 8 containers · end-to-end · Dockerized
01 Source CoinGecko REST API · 10 crypto assets · 60s polling cadence
02 Ingest Redpanda event broker · Python producer publishes to crypto-prices · consumer commits safely with poison-pill handling
03 Store PostgreSQL raw warehouse · indexed for read · volume-persisted across container restarts
04 Transform dbt three-layer models · staging → intermediate → mart · LAG windows, moving averages, daily summaries
05 Orchestrate Apache Airflow · 5-minute schedule · freshness gates · automatic retries · task-level observability
06 Analyze Metabase six-panel live dashboard · price trends, deltas, top movers, volumes — refreshed in near real-time
14,400/day
Throughput
< 5 min
Freshness
~ 90 s
Deploy
8 svc
Containers
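The ingest stage's poison-pill handling boils down to one rule: a record that can't be decoded is shunted to a dead-letter path and the offset is still committed, so one bad message never stalls the stream. A minimal sketch of that guard, with illustrative names (`decode_or_dlq` is assumed, and an in-memory list stands in for the real dead-letter topic):

```python
import json
from typing import Optional

def decode_or_dlq(raw: bytes, dlq: list) -> Optional[dict]:
    """Parse a crypto-prices message; route undecodable records to the DLQ."""
    try:
        event = json.loads(raw)
        # Minimal shape check: a price tick needs an asset id and a price.
        if "id" not in event or "price" not in event:
            raise ValueError("missing required fields")
        return event
    except ValueError:
        dlq.append(raw)   # keep the poison pill for later inspection
        return None       # caller still commits the offset, so consumption never stalls

dlq: list = []
good = decode_or_dlq(b'{"id": "bitcoin", "price": 64000.5}', dlq)
bad = decode_or_dlq(b"not-json", dlq)
```

Committing the offset even on failure is the design choice that keeps the consumer moving; the trade-off is that bad records must be replayed from the DLQ rather than from the source topic.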
02
CDC Open Source

Change-Data-Capture Pipeline

PostgreSQL · Debezium · Redpanda · Python · dbt · Airflow · Docker

Log-based replication that turns hourly batches into sub-ten-second streams. Six containerized services capture WAL events from a source PostgreSQL via Debezium, stream them through Redpanda, and apply them idempotently to a target database. Bad records flow to a DLQ; dbt builds the analytical layer with full SCD Type 2 history.

Architecture  /  6 services · sub-10s lag · Docker Compose
01 Source Source PostgreSQL :5433 · wal_level=logical · production database
02 Capture Debezium :8083 · pgoutput plugin · streams CDC events from WAL
03 Stream Redpanda :9092 · zero-JVM Kafka API · message broker
04 Consume Python consumer · confluent-kafka · ON CONFLICT upserts to target :5434 · DLQ for bad records
05 Model dbt · 3 models · 9 tests · staging → marts · SCD Type 2 with valid_from / valid_to dimensions
06 Orchestrate Airflow :8085 · cdc_pipeline_monitor DAG · 30-min cadence · check_cdc_lag → snapshot → test → healthy
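The SCD Type 2 bookkeeping in the model layer follows the standard pattern: each attribute change closes the current row's `valid_to` and opens a new current row. A sketch of that logic in plain Python, with illustrative column and function names rather than the project's actual dbt model:

```python
from datetime import datetime

def scd2_upsert(history: list, key: str, attrs: dict, at: datetime) -> None:
    """Record a new version of `key` if its attributes changed; else no-op."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["attrs"] == attrs:
                return              # no change: the current row stays open
            row["valid_to"] = at    # close out the superseded version
    history.append({"key": key, "attrs": attrs, "valid_from": at, "valid_to": None})

h: list = []
t1, t2 = datetime(2025, 1, 1), datetime(2025, 2, 1)
scd2_upsert(h, "cust-1", {"tier": "free"}, t1)
scd2_upsert(h, "cust-1", {"tier": "pro"}, t2)
# h now holds two versions: the first closed at t2, the second still current.
```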
< 10 s
Replication lag
9 / 3
Tests / models
SCD T2
History
At-least-once
Semantics
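At-least-once semantics only work because the apply step is idempotent: the real consumer issues `INSERT ... ON CONFLICT (id) DO UPDATE` against the target Postgres, so redelivered events converge to the same state. A hypothetical in-memory sketch of that property, with a dict keyed by primary key standing in for the target table:

```python
def apply_cdc_event(table: dict, event: dict) -> None:
    """Apply one Debezium-style change event (op codes: c/u/d/r) idempotently."""
    op = event["op"]
    if op in ("c", "u", "r"):            # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row           # upsert: replaying the same event is a no-op
    elif op == "d":                      # delete
        table.pop(event["before"]["id"], None)  # deleting twice is also a no-op

target: dict = {}
events = [
    {"op": "c", "after": {"id": 1, "name": "alice"}},
    {"op": "u", "after": {"id": 1, "name": "alicia"}},
    {"op": "u", "after": {"id": 1, "name": "alicia"}},  # duplicate delivery
    {"op": "d", "before": {"id": 1}},
]
for e in events:
    apply_cdc_event(target, e)
# target ends empty, and it would end empty under any replay of this sequence.
```

Because every operation is a no-op on repeat, the broker is free to redeliver on consumer restart without corrupting the target.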
03
Lakehouse Quality Gates

Medallion Data Lakehouse

MinIO · DuckDB · Open-Meteo · Great Expectations · Airflow · Parquet · Docker

A three-layer lakehouse on object storage. MinIO holds Bronze, Silver, and Gold buckets; DuckDB reads and transforms Parquet directly from S3-compatible storage via the httpfs extension; Great Expectations gates every layer transition — the pipeline halts when validation fails, so the warehouse never silently corrupts.

Architecture  /  3 layers · 5-task DAG · medallion_weather
01 Lakehouse MinIO S3-compatible store · bronze/ raw · silver/ cleaned · gold/ curated marts
02 Compute DuckDB embedded with httpfs extension · reads Parquet from MinIO · transforms · writes Parquet back
03 Ingest Python script · Open-Meteo public API → raw Parquet in bronze/
04 Govern Great Expectations validation suites at Bronze and Silver · pipeline FAILS if expectations break
05 Refine DuckDB cleans & deduplicates Bronze → partitioned Parquet in silver/
06 Aggregate DuckDB rolls Silver → business marts in gold/city_weekly_summary, regional_daily, city_extremes
07 Orchestrate Airflow 2.9.1 LocalExecutor · DAG: bronze_ingest → bronze_gate → silver_transform → silver_gate → gold_transform
3 layers
Bronze · Silver · Gold
GE gates
Quality validation
Idempotent
Backfills
Parquet
Columnar storage
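The gate contract is the core of the design: a layer transition runs its validation suite, and a single failed expectation raises, fails the Airflow task, and blocks promotion. The real pipeline runs Great Expectations suites; this minimal stand-in (all names illustrative) shows only the halt-on-failure behavior:

```python
def gate(rows: list, checks: dict) -> list:
    """Promote rows to the next layer only if every expectation holds."""
    failures = [name for name, predicate in checks.items()
                if not all(predicate(r) for r in rows)]
    if failures:
        # Raising fails the Airflow task, so nothing downstream runs.
        raise ValueError(f"quality gate failed: {failures}")
    return rows  # promotion happens only on a clean pass

bronze = [
    {"city": "Sukkur", "temp_c": 41.0},
    {"city": "Karachi", "temp_c": 33.5},
]
checks = {
    "temp_in_range": lambda r: -60 <= r["temp_c"] <= 60,
    "city_present": lambda r: bool(r.get("city")),
}
silver = gate(bronze, checks)  # passes: both rows satisfy both checks
```

Halting rather than filtering is deliberate: a gate that quietly drops bad rows hides upstream breakage, while a gate that fails loudly keeps Gold trustworthy.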

Technical stack

§ 02 / Tooling
Languages & Foundations 04
Python SQL Bash Java
Data Engineering & Streaming 07
Apache Airflow dbt Kafka Redpanda Debezium Pandas Great Expectations
Storage & Databases 05
PostgreSQL DuckDB MinIO Apache Parquet MongoDB
DevOps & Cloud 06
Docker Linux GitHub Actions Oracle Cloud pytest Caddy
Visualization & BI 05
Metabase Grafana Plotly Power BI Tableau

About & record

§ 03 / Background
Education 01
2022 — 2026
BS Computer Science
Sukkur IBA University · Pakistan

Database Systems · Data Structures & Algorithms · Operating Systems · Computer Networks · Software Engineering · System Design.

Certifications 03
2026
Google Cloud Data Engineer
Coursera
2025
IBM Data Engineering
Coursera
2025
Google Data Analytics
Coursera
Honors & Recognition 03
2026
National Skill Competency Test — 92.2nd percentile
NSCT · Pakistan

Top marks in Programming, Database, and AI/ML & Data Analytics across a 10-subject national competency exam.

2022
Sindh Talent Hunt Program scholarship
STHP · Government of Sindh

Fully-funded merit scholarship covering BS Computer Science at Sukkur IBA University.

2025 — 2026
Three production pipelines shipped
Personal · Open source

Streaming, CDC, and medallion lakehouse — each fully containerized and orchestrated, all deployed.

§ 04 / Get in touch

Let's build something good.

Open to data engineering roles, freelance work, and serious collaborations — remote, hybrid, or on-site in Pakistan. The fastest way to reach me is email.