Data+AI Summit 2021 is Coming

April 11, 2021

It’s been almost half a year since the last summit.

Data+AI Summit 2021 runs from Monday, May 24, through Friday, May 28. The training will be held on May 24-25 and caters to a broader set of practitioners than in previous years: Data Analyst, Data Engineer, Data Scientist, ML Engineer, Partner Data Engineer, Platform Engineer, and Technical. The wide range of roles makes me curious about the various technical personas in the Data and AI space. It’s not only Data Engineers and Data Scientists; a wide range of people can benefit from attending the summit.

Read More

Machine Learning in Production - Concepts you should know

March 4, 2021

Are you interested in learning about the machine learning side of data? 🎉 You have reached the right place to start learning about it.

Here is a list of concepts for you to get started:

ML Algorithm

An ML algorithm is a procedure that runs on data and produces a machine learning model. Some of the popular ones are decision trees, Naive Bayes, and linear regression.

ML Model

An ML model is the outcome of running an ML algorithm; it often contains a statistical representation of the data ingested by the algorithm. An ML model’s input is data, and its output is a prediction, a decision, or a classification.
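To make the distinction concrete, here is a minimal sketch using scikit-learn (the data is made up purely for illustration): the call to fit is the algorithm running on data, the fitted estimator is the resulting model, and predict is the model turning input data into a classification.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: each row is an example, each column a feature.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 1, 1, 0]  # a label for each example

# The ML *algorithm*: a procedure that will run on the data.
algorithm = DecisionTreeClassifier()

# Running the algorithm on data produces the ML *model*.
model = algorithm.fit(X_train, y_train)

# The model's input is data; its output is a classification.
print(model.predict([[1, 0]]))  # -> [1]
```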

Read More

Delta Lake Essential Fundamentals: Part 4 - Practical Scenarios

February 22, 2021

🎉 Welcome to the 4th part of Delta Lake essential fundamentals: the practical scenarios! 🎉

There are many great features you can leverage in Delta Lake: ACID transactions, schema enforcement, time travel, exactly-once semantics, and more.

Let’s discuss two common data pipeline patterns and their solutions:

Spark Structured Streaming ETL with Delta Lake that serves multiple users

Spark Structured Streaming - Apache Spark Structured Streaming treats incoming data as an essentially unbounded table of information: a continuous stream of data is ingested into the system, and as developers, we write the code that processes it continuously. ETL stands for Extract, Transform, and Load.
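As a rough sketch of this pattern in PySpark (the paths, schema, and filter are hypothetical, and the session is assumed to be configured with the delta-spark package), the streaming ETL might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# A Delta-enabled session (assumes the delta-spark package is available).
spark = (SparkSession.builder
         .appName("streaming-etl")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Extract: continuously read incoming JSON files (hypothetical path/schema).
raw = (spark.readStream
       .format("json")
       .schema("user_id STRING, amount DOUBLE, ts TIMESTAMP")
       .load("/data/incoming/events"))

# Transform: keep only valid records.
cleaned = raw.filter(col("amount") > 0)

# Load: append to a Delta table that many readers can query concurrently.
query = (cleaned.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/data/checkpoints/events")
         .start("/data/delta/events"))

query.awaitTermination()
```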

Read More

Delta Lake Essential Fundamentals: Part 3 - Compaction and Checkpoint

February 15, 2021

Let’s understand what Delta Lake compaction and checkpoints are and why they are important.

Checkpoint

There are two well-known checkpoint mechanisms in Apache Spark that can be confused with the Delta Lake checkpoint, so let’s understand what they are and how they differ from each other:

Spark RDD Checkpoint

A checkpoint in Spark RDD is a mechanism that persists the current RDD to files in a dedicated checkpoint directory while removing all references to its parent RDDs. This operation breaks the data lineage: once checkpointed, the RDD no longer remembers the chain of transformations that produced it.
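A minimal PySpark sketch of RDD checkpointing (the directory path is just an example):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-checkpoint-demo")

# Checkpoint files go to this dedicated directory.
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
rdd.checkpoint()   # mark the RDD for checkpointing
rdd.count()        # an action materializes the RDD and writes the checkpoint

# The RDD is now read back from the saved files, and its lineage
# to the parent RDDs has been truncated.
print(rdd.isCheckpointed())  # -> True
```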

Read More

Delta Lake Essential Fundamentals: Part 2 - The DeltaLog

February 11, 2021

In the previous part, you learned what ACID transactions are.
In this part, you will understand how the Delta transaction log, named DeltaLog, achieves ACID.

Transaction Log

A transaction log is a history of the actions executed by a (ta-da 💡) database management system, with the goal of guaranteeing ACID properties over a crash.

Delta Lake transaction log - DeltaLog

DeltaLog is a transaction log directory that holds an ordered record of every transaction committed on a Delta Lake table since the table was created. The goal of DeltaLog is to be the single source of truth for readers who read from the same table at the same time; that means parallel readers read the exact same data. This is achieved by tracking all the changes that users make (read, delete, update, etc.) in the DeltaLog.
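On disk, the DeltaLog lives in a _delta_log directory next to the table’s data files, as ordered commit files. Here is a small sketch of inspecting a table’s commit history with the Delta Lake Python API (the table path is hypothetical, and the configs assume the delta-spark package is available):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("deltalog-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Each committed transaction appears as one ordered entry in the history.
table = DeltaTable.forPath(spark, "/data/delta/events")
table.history().select("version", "timestamp", "operation").show()
```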

Read More

Delta Lake Essential Fundamentals: Part 1 - ACID

February 4, 2021

🎉 Welcome to the first part of Delta Lake essential fundamentals! 🎉

What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Sparkβ„’ and big data workloads.

The Delta Lake open source consists of three projects:

  1. delta - Delta Lake core, written in Scala.
  2. delta-rs - a Rust library with Python and Ruby bindings.
  3. connectors - Connectors to popular big data engines outside Spark, written mostly in Scala.

Delta provides us with:

  - Time travel - the ability to “travel back in time” into previous versions of our data.
  - Scalable metadata - if we have a large set of raw data stored in a data lake, the metadata gives us the flexibility needed for analytics and exploration of the data.
  - A mechanism to unify streaming and batch data.
  - Schema enforcement - handling schema variations to prevent the insertion of bad/non-compliant records.
  - ACID transactions - ensuring that users/readers never see inconsistent data.
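As a taste of time travel, the Delta reader exposes it directly through options; a quick sketch (the table path is hypothetical, and the configs assume the delta-spark package is available):

```python
from pyspark.sql import SparkSession

# A Delta-enabled session.
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Time travel: read the table as it was at version 0...
df_v0 = (spark.read
         .format("delta")
         .option("versionAsOf", 0)
         .load("/data/delta/events"))

# ...or as it was at a point in time.
df_feb = (spark.read
          .format("delta")
          .option("timestampAsOf", "2021-02-01")
          .load("/data/delta/events"))
```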

Read More

Apache Spark Ecosystem, Jan 2021 Highlights

January 12, 2021

If you’ve been reading here for a while, you know that I’m a big fan of Apache Spark and have been using it for more than 8 years.
Apache Spark is continually growing. It started as part of the Hadoop family,
but with the slow death of Hadoop and the fast growth of Kubernetes, many new tools, connectors, and open-source projects have emerged.

Let’s take a look at three exciting open-source projects:

Read More

Kubernetes and Virtual Kubelet in a nutshell

January 10, 2021

Today, you will learn how to take a web app (in any programming language;
we used Java & Kotlin) and distribute it using Kubernetes (K8s) and Virtual Kubelet (VK).

Well, if you don’t yet know why you should consider distributing your web app, read my post here.

So, you are probably asking yourself,
“What is Kubernetes, and what can I use it for?”
Just keep reading.

Kubernetes is an open-source container-orchestration system for automating application deployment, scaling, and management. It is used to build distributed, scalable microservices.
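If you want a first feel for how programs talk to a cluster, here is a tiny sketch using the official Kubernetes Python client (purely illustrative; the web app in this post itself is written in Java & Kotlin):

```python
from kubernetes import client, config

# Load cluster credentials from ~/.kube/config (e.g. a local minikube).
config.load_kube_config()

v1 = client.CoreV1Api()

# List the pods Kubernetes is currently orchestrating, in every namespace.
for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```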

Read More