Big Data vs. Traditional ETL
When big data went mainstream, a decision had to be made: stick with your current stack, or upgrade to big data in anticipation of future growth in data needs. But not all teams and tasks need a big data solution.
By traditional ETL tools I mean SSIS, Ab Initio, and Informatica; even today many companies and teams stick with them and get the job done. Many companies I know have at least part of their ETL workflow on traditional ETL, whether because they are on-premise for security reasons or because their data use case does not require a big data solution.
There are many players in this domain and all of them are popular, some more than others depending on the industry. They are still relevant today, and some are even available in the cloud.
Pros
The team knows the tool, has worked with it for many years, and has built a code base that tackles their specific data problems; at the same time, they have developers with extensive experience in this domain.
Why would we switch if reporting data has not changed and business requirements are incremental in nature?
Reporting data needs are vertically aligned with the ETL tool architecture.
Source and target systems have not changed their interfaces (file-based or API-based), so the ETL does not need a lot of rework.
Cons
As with any technology, it can be difficult to find talent for a tool that is no longer trendy. Some of these tools are proprietary, so you will not find documentation or training anywhere else, creating a huge barrier for anyone trying to learn them. Teams may struggle to fill open positions and retain talent when the trend across the industry is shifting.
Migration
If the team decides to migrate its tech stack, below are a few of the factors to consider:
- Understand the data needs
- Estimate the available budget
- Buy-in from users on SLAs, regulatory requirements, HIPAA compliance, etc.
- Effort involved in re-training the workforce, along with support teams
- Re-usability/extensibility of your current code
After much deliberation, if the team decides to upgrade its stack, I would propose it consider the following options, depending on team composition and skill set.
- Serverless ETL
- Cloud Data Warehouse (ELT)
- Traditional Big Data
Let’s explore each of these areas in a little more detail.
Serverless ETL
There is no infrastructure to provision; the team focuses purely on building business logic and scales horizontally as needed when there is a large surge in data volume. The team can use an event-driven architecture to fully automate ingestion, transformation, and eventual reporting. This approach has become quite popular recently, with many teams adopting it aggressively.
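As a minimal sketch of what this could look like (assuming an AWS S3 + Lambda setup; the bucket names and the filter logic are purely illustrative, not a prescribed design), a file landing in a raw bucket triggers a function that transforms it and writes the result onward for reporting:

```python
import csv
import io

import boto3  # AWS SDK, available in the Lambda runtime

s3 = boto3.client("s3")

# Hypothetical destination bucket for transformed output.
CURATED_BUCKET = "my-curated-bucket"


def handler(event, context):
    """Triggered by an S3 ObjectCreated event on the raw bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw CSV object that just landed.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))

        # Illustrative transform: keep only completed orders.
        kept = [r for r in rows if r.get("status") == "completed"]

        # Write the curated output; downstream reporting reads from here.
        out = io.StringIO()
        if kept:
            writer = csv.DictWriter(out, fieldnames=kept[0].keys())
            writer.writeheader()
            writer.writerows(kept)
        s3.put_object(Bucket=CURATED_BUCKET, Key=f"curated/{key}", Body=out.getvalue())
```

Because each event invokes its own function instance, the platform scales out automatically as more files arrive, which is the horizontal scaling referred to above.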
Cloud Data Warehouse
There are a few players in this category, notably Snowflake and BigQuery, which follow an ELT pattern. Data engineering teams build stages, the entry points through which data is brought into Snowflake, and then jobs continuously process that data and can scale horizontally as needed.
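For illustration, here is a minimal sketch of that stage-then-transform (ELT) flow using the Snowflake Python connector; the stage, table, and database names are hypothetical and the connection credentials are elided:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder connection parameters.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# 1. Stage: the entry point where files land before being loaded.
cur.execute("""
    CREATE STAGE IF NOT EXISTS orders_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# 2. Load: copy staged files into a raw table (the "EL" part).
cur.execute("COPY INTO raw_orders FROM @orders_stage")

# 3. Transform: SQL running inside the warehouse (the "T" part),
#    which scales by resizing or multi-clustering the warehouse.
cur.execute("""
    INSERT INTO ANALYTICS.CURATED.daily_order_totals
    SELECT order_date, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY order_date
""")

cur.close()
conn.close()
```

In practice the transform step is where tools like dbt come in, replacing hand-run SQL with versioned, testable models.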
This is attractive because development is mostly SQL-based, which most data engineering teams are already familiar with. There are a ton of new features, including dbt + Snowflake ❄️, that provide much-needed CI/CD-style, SQL-driven development.
Traditional Big Data
I consider traditional big data to be Apache Spark or Databricks doing the heavy lifting on hundreds of petabytes of data, with data-intensive pipelines built to handle custom loads. This requires extensive knowledge of distributed computation with Apache Spark, or of Databricks’ managed offering on the cloud. It may not be the first choice for many teams to explore.
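To give a sense of what such a pipeline looks like, here is a minimal PySpark sketch (the paths and column names are made up for illustration); the same code can run on a self-managed Spark cluster or on Databricks:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

# Read a large, partitioned dataset; Spark distributes the scan across the cluster.
orders = spark.read.parquet("s3://my-data-lake/raw/orders/")  # hypothetical path

# Illustrative transformation: daily revenue per region for completed orders.
daily_revenue = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy("region", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Write back partitioned by date so downstream jobs can prune efficiently.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-data-lake/curated/daily_revenue/"
)
```

The code itself is short; the expertise lies in sizing the cluster, partitioning the data, and tuning shuffles so this scales to the volumes mentioned above.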
That’s the end. I hope this made sense given my limited knowledge; I could be completely wrong, and I would love to hear feedback on it.