AI Data Engineer

Started on September 8, 2024

Status: On Hold

Simple prompts that can accomplish boring data tasks through Crew AI

Well, you asked for it! Ta-da, it's here.

AI frameworks

You can use simple prompts to create entire cloud infrastructure and a set of agents that work toward a goal you define.

Crew AI is a framework that makes this possible.

There are other similar frameworks you can explore, such as:

  1. LangChain
  2. AutoGPT

For this exercise, we will be using Crew AI and AWS.

A common task for most people is getting insights from data, so that's what we will be doing in this project.

Tools

  • AWS Glue/Lambda (depending on processing needs) -> We can use Lambda for smaller tasks and Glue for bigger ETL jobs
  • Jupyter notebook for visualization (matplotlib) - a lightweight tool
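To make the Lambda side concrete, here is a minimal sketch of a handler. The handler name, event shape, and in-memory deduplication are all illustrative assumptions; a real deployment would read and write S3 objects via boto3 instead of passing records in the event.

```python
import json

def handler(event, context):
    """Hypothetical Lambda entry point: deduplicate a small batch of trip
    records passed in the event payload. Illustrative only; a real job
    would pull objects from S3 via boto3."""
    records = event.get("records", [])
    seen, cleaned = set(), []
    for rec in records:
        # Serialize each record deterministically so dicts compare by value.
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return {"statusCode": 200, "body": {"count": len(cleaned), "records": cleaned}}
```

For anything larger than a few MB per invocation, the same cleaning logic would move into a Glue job instead.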

Setup

We will follow the documentation from the Crew AI website.

  • Crew AI
crewai create crew <project_name>
  • You should have AWS keys set up in your environment file (I would encourage you to create an Admin role and use that here)

  • OpenAI API Key - Crew AI needs an LLM to work.
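Putting the bullets above together, the project's environment file might look like the following. The variable names are the standard ones for OpenAI and AWS tooling; the values are placeholders.

```
OPENAI_API_KEY=sk-...
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1
```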

Goal

We will be using open source data; the NYC TLC trip record data is a good source for this.
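The TLC publishes monthly Parquet files; the column names below follow its public data dictionary, though verifying the exact schema per month is one of the agents' jobs. A toy stand-in for the kind of frame we'll be working with:

```python
import pandas as pd

# Toy stand-in for one month of TLC yellow-taxi data (real files are
# Parquet; column names follow the public TLC data dictionary).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2024-01-01 00:05", "2024-01-01 00:20"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2024-01-01 00:15", "2024-01-01 00:45"]),
    "passenger_count": [1, 2],
    "trip_distance": [1.2, 5.4],
    "fare_amount": [7.5, 21.0],
})

# A derived field we'll want for metrics later: trip duration in minutes.
trips["duration_min"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60
```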

Status: On Hold

This project is currently on hold due to a lot of concurrent side projects. I'll resume it shortly after wrapping up ExpenseSnap.

Journal

09.08.2024

Defining high level requirements

Let us define some high-level requirements for our agents to follow; then we will subdivide these into tasks.

  • Read files from TLC Trip data for 2024
  • Understand if there are any schema changes over time
  • Process data (Fill any null values with some defaults) and remove any duplicates
  • Create a data model; maybe we can use a star schema for our simplified use case.
    • Extract entities and relationships from the raw data, then build that dataset.
    • All data quality checks have to happen before we write data to S3.
  • Each entity's data will be written to S3 for simplicity.
  • Gather the report-ready data into a Jupyter notebook for visualization.
  • Define metrics for this data.
  • For each metric, read the S3 data and come up with a visualization
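The processing requirement above (fill nulls with defaults, remove duplicates) can be sketched in pandas. The default values here are illustrative, not the agents' final choices:

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls with simple defaults and drop exact duplicates, per the
    processing requirement above. Defaults are illustrative placeholders."""
    defaults = {"passenger_count": 1, "trip_distance": 0.0, "fare_amount": 0.0}
    return df.fillna(defaults).drop_duplicates().reset_index(drop=True)
```

The quality checks would run on the frame this returns, before anything is written to S3.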

Let's go!!

Define Agents

We will define the agents below:

  • Data Engineer
  • Quality Engineer
  • Orchestration Engineer
  • Data Modeler
  • Business Analyst
  • BI Developer
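As a sketch, the roles above can be drafted as plain dicts before wiring them into CrewAI (whose Agent class takes role/goal/backstory fields). The goal wording here is illustrative, not final:

```python
# Draft agent definitions; each entry maps onto CrewAI's
# Agent(role=..., goal=..., backstory=...) constructor.
AGENTS = {
    "data_engineer": {
        "role": "Data Engineer",
        "goal": "Read the 2024 TLC trip files and land cleaned data in S3",
    },
    "quality_engineer": {
        "role": "Quality Engineer",
        "goal": "Run all data quality checks before any write to S3",
    },
    "orchestration_engineer": {
        "role": "Orchestration Engineer",
        "goal": "Sequence the Lambda/Glue jobs end to end",
    },
    "data_modeler": {
        "role": "Data Modeler",
        "goal": "Design the star schema entities and relationships",
    },
    "business_analyst": {
        "role": "Business Analyst",
        "goal": "Define the metrics the reports should answer",
    },
    "bi_developer": {
        "role": "BI Developer",
        "goal": "Build matplotlib visualizations in the Jupyter notebook",
    },
}
```

Next step will be turning each role into concrete CrewAI tasks tied to the requirements list above.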