Data, ML & Open Source
🎉 Welcome to the 4th part of Delta Lake essential fundamentals: the practical scenarios! 🎉 There are many great features that you can leverage in Delta Lake, from ACID transactions, schema enforcement, and time travel to exactly-once semantics and more. Let’s discuss two common data pipeline patterns and solutions: Spark Structured Streaming ETL with Delta Lake that serves multiple users. Spark Structured Streaming: Apache Spark Structured Streaming streams are essentially unbounded tables of information.
Let’s understand what Delta Lake compaction and checkpoints are and why they are important. Checkpoint: There are two known checkpoint mechanisms in Apache Spark that can be confused with the Delta Lake checkpoint, so let’s understand them and how they differ from each other. Spark RDD Checkpoint: Checkpoint in Spark RDD is a mechanism to persist the current RDD to a file in a dedicated checkpoint directory while all references to its parent RDDs are removed.
In the previous part, you learned what ACID transactions are. In this part, you will understand how the Delta transaction log, named DeltaLog, achieves ACID. Transaction Log: A transaction log is a history of actions executed by a (TaDa 💡) database management system with the goal of guaranteeing ACID properties across a crash. Delta Lake transaction log - DeltaLog: DeltaLog is a transaction log directory that holds an ordered record of every transaction committed on a Delta Lake table since it was created.
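To make the "ordered record of every transaction" idea concrete, here is a minimal sketch in plain Python. It is not the Delta Lake implementation. It assumes a simplified log where each commit is a zero-padded, numbered JSON file holding only `add` and `remove` actions, and the helper names (`write_commit`, `replay`, `demo`) are hypothetical and used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list) -> None:
    """Write one commit as newline-delimited JSON, named by zero-padded version."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions))


def replay(log_dir: Path) -> set:
    """Replay commits in version order and return the set of live data files."""
    live = set()
    # Zero-padded names sort lexicographically in commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def demo() -> set:
    """Build a tiny three-commit log in a temp directory and replay it."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
        write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
        # Commit 2 replaces part-0 with part-2 (hypothetical file names).
        write_commit(log_dir, 2, [
            {"remove": {"path": "part-0.parquet"}},
            {"add": {"path": "part-2.parquet"}},
        ])
        return replay(log_dir)


if __name__ == "__main__":
    print(sorted(demo()))
```

Replaying the commits in order is what lets a reader reconstruct the table state at any version: stop the replay early and you get an older snapshot.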
🎉 Welcome to the first part of Delta Lake essential fundamentals! 🎉 What is Delta Lake? Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Open source Delta Lake consists of 3 projects: delta - Delta Lake core, written in Scala. delta-rs - Rust library for binding with Python and Ruby. connectors - Connectors to popular big data engines outside Spark, written mostly in Scala.
If you’ve been reading here for a while, you know that I’m a big fan of Apache Spark and have been using it for more than 8 years. Apache Spark is continually growing. It started as part of the Hadoop family, but with the slow death of Hadoop and the fast growth of Kubernetes, many new tools, connectors, and open source projects have emerged. Let’s take a look at three exciting open source projects:
Today, you will learn how to take a web app (it can be in any programming language; we used Java & Kotlin) and distribute it using Kubernetes (K8s) and Virtual Kubelet (VK). Well, if you don’t know yet why you should consider distributing your web app, read my post here. So, you are probably asking yourself, “What is Kubernetes and what can I use it for?” Just keep reading. Kubernetes is an open-source container-orchestration system for automating application deployment, scaling, and management.