AI Data Engineer

Started on September 8, 2024

Status: On Hold

Simple prompts that can accomplish boring data tasks through Crew AI

Well, you asked for it! Ta-da, it's here.

AI frameworks

You can use simple prompts to create entire cloud infrastructure and a set of agents that work toward a goal you define.

Crew AI is a framework that makes this possible.

There are other similar frameworks you can explore, such as:

  1. LangChain
  2. AutoGPT

For this exercise, we will be using Crew AI and AWS.

A common task for most people is getting insights from data, so that's what we will be doing in this project.

Tools

  • AWS Glue/Lambda (depending on processing needs) -> We can use Lambda for smaller tasks and Glue for bigger ETL jobs
  • Jupyter notebook for visualization (matplotlib) - a lightweight tool
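To make the Lambda side concrete, here is a minimal sketch of a handler. The handler name, event shape, and in-memory deduplication are all illustrative assumptions; a real deployment would read and write S3 objects via boto3 instead of passing records in the event.

```python
import json

def handler(event, context):
    """Hypothetical Lambda entry point: deduplicate a small batch of trip
    records passed in the event payload. Illustrative only; a real job
    would pull objects from S3 via boto3."""
    records = event.get("records", [])
    seen, cleaned = set(), []
    for rec in records:
        # Serialize each record deterministically so dicts compare by value.
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return {"statusCode": 200, "body": {"count": len(cleaned), "records": cleaned}}
```

For anything larger than a few MB per invocation, the same cleaning logic would move into a Glue job instead.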

Setup

We will follow the documentation from the Crew AI website.

  • Crew AI
crewai create crew <project_name>
  • You should have AWS keys set up in your environment file (I would encourage you to create an Admin role and use that here)

  • OpenAI API Key - Crew AI needs an LLM to work.
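Putting the bullets above together, the project's environment file might look like the following. The variable names are the standard ones for OpenAI and AWS tooling; the values are placeholders.

```
OPENAI_API_KEY=sk-...
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1
```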

Goal

We will be using open source data; the NYC TLC trip record data is a good source for this.
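The TLC publishes monthly Parquet files; the column names below follow its public data dictionary, though verifying the exact schema per month is one of the agents' jobs. A toy stand-in for the kind of frame we'll be working with:

```python
import pandas as pd

# Toy stand-in for one month of TLC yellow-taxi data (real files are
# Parquet; column names follow the public TLC data dictionary).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2024-01-01 00:05", "2024-01-01 00:20"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2024-01-01 00:15", "2024-01-01 00:45"]),
    "passenger_count": [1, 2],
    "trip_distance": [1.2, 5.4],
    "fare_amount": [7.5, 21.0],
})

# A derived field we'll want for metrics later: trip duration in minutes.
trips["duration_min"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60
```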

Status: On Hold

This project is currently on hold due to a lot of concurrent side projects. I'll resume it shortly after wrapping up ExpenseSnap.

Journal

09.08.2024

Defining high level requirements

Let us define some high-level requirements for our agents to follow; then we will subdivide these into tasks.

  • Read files from TLC Trip data for 2024
  • Understand if there are any schema changes over time
  • Process data (Fill any null values with some defaults) and remove any duplicates
  • Create a data model; maybe we can use a star schema for our simplified use case.
    • Extract entities and relationships from the raw data, then build that dataset.
    • All data quality checks have to happen before we write data to S3.
  • Each entity's data will be written to S3 for simplicity.
  • Gather the report-ready data into a Jupyter notebook for visualization.
  • Define metrics for this data.
  • For each metric, read the S3 data and come up with a visualization
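The processing requirement above (fill nulls with defaults, remove duplicates) can be sketched in pandas. The default values here are illustrative, not the agents' final choices:

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls with simple defaults and drop exact duplicates, per the
    processing requirement above. Defaults are illustrative placeholders."""
    defaults = {"passenger_count": 1, "trip_distance": 0.0, "fare_amount": 0.0}
    return df.fillna(defaults).drop_duplicates().reset_index(drop=True)
```

The quality checks would run on the frame this returns, before anything is written to S3.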

Let's go!!

Define Agents

We will define the agents below:

  • Data Engineer
  • Quality Engineer
  • Orchestration Engineer
  • Data Modeler
  • Business Analyst
  • BI Developer
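As a sketch, the roles above can be drafted as plain dicts before wiring them into CrewAI (whose Agent class takes role/goal/backstory fields). The goal wording here is illustrative, not final:

```python
# Draft agent definitions; each entry maps onto CrewAI's
# Agent(role=..., goal=..., backstory=...) constructor.
AGENTS = {
    "data_engineer": {
        "role": "Data Engineer",
        "goal": "Read the 2024 TLC trip files and land cleaned data in S3",
    },
    "quality_engineer": {
        "role": "Quality Engineer",
        "goal": "Run all data quality checks before any write to S3",
    },
    "orchestration_engineer": {
        "role": "Orchestration Engineer",
        "goal": "Sequence the Lambda/Glue jobs end to end",
    },
    "data_modeler": {
        "role": "Data Modeler",
        "goal": "Design the star schema entities and relationships",
    },
    "business_analyst": {
        "role": "Business Analyst",
        "goal": "Define the metrics the reports should answer",
    },
    "bi_developer": {
        "role": "BI Developer",
        "goal": "Build matplotlib visualizations in the Jupyter notebook",
    },
}
```

Next step will be turning each role into concrete CrewAI tasks tied to the requirements list above.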