Most large technology businesses collect data from their customers in a variety of ways, and most of the time this data arrives in raw form. When data is presented in an understandable and accessible form, however, it can inform and drive business decisions. The challenge is to process the data and, where necessary, transform or clean it so that it makes sense.
In their most basic form, data streaming applications move data from a source to a destination. More advanced streaming programs perform work on the fly, such as changing the structure of the output data or enriching it with additional attributes or fields. In this blog, we’ll examine how to use Kafka and Kafka UI to create a simple real-time data streaming application.
Why Apache Kafka?
Kafka was originally built at LinkedIn and is now an open-source project maintained by the Apache Software Foundation, with commercial backing from Confluent. It has a number of advantages. Kafka can be deployed in a distributed fashion, and its leader-follower replication model makes it highly fault-tolerant. Kafka scales horizontally to hundreds of brokers handling millions of messages per second, and it can achieve latencies below 10 ms, which qualifies as real-time.
You will need the following items to follow along with this tutorial:
- The most recent versions of Node.js and npm installed on your machine.
- The most recent Java version (JVM) installed.
- Kafka installed on your local machine (we will walk through installing Kafka locally in this blog).
- A fundamental grasp of Node.js application development.
However, before we begin, let’s go over some fundamental Kafka concepts and terminology so that we can easily follow along with this tutorial.
1. Topics, Partitions, and Offsets
Kafka’s central abstraction is the topic. Each topic represents a stream of data and is analogous to a table in a database. There is no limit to the number of topics you can create. Each topic is identified by its name and divided into partitions. The partitions themselves are ordered sequences of the messages that have been pushed to the topic. Each message in a partition is assigned an incremental id, also known as an offset; this offset is specific to that partition. The number of partitions must be specified when a topic is created.
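As a mental model (not the real client API), a topic’s partitions can be sketched as append-only arrays in plain JavaScript, where each message’s array index is its offset:

```javascript
// A simplified in-memory model of a topic: an array of partitions,
// each an append-only log where the array index is the offset.
// Conceptual sketch only; this is not the Kafka client API.
function createTopic(name, numPartitions) {
  return { name, partitions: Array.from({ length: numPartitions }, () => []) };
}

// Append a message to a partition; the returned offset is partition-specific.
function append(topic, partitionId, message) {
  const partition = topic.partitions[partitionId];
  partition.push(message);
  return partition.length - 1; // the message's offset within this partition
}

const orders = createTopic('orders', 3);
append(orders, 0, 'order-1'); // offset 0 in partition 0
append(orders, 0, 'order-2'); // offset 1 in partition 0
append(orders, 1, 'order-3'); // offset 0 in partition 1 (offsets are per-partition)
```

Note how the third message gets offset 0 again: offsets only have meaning within a single partition.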
Although the order of messages within each partition is guaranteed, there is no equivalent guarantee across partitions. When a producer (source system) sends a message to a topic, the message is routed to one of the partitions essentially at random, unless a key is supplied. Once a message has been written to a partition, it cannot be modified. Messages are deleted from the partition after a retention period (one week by default).
2. Brokers
While a topic and its partitions organize messages logically, the data must also be stored physically somewhere. Partitions are stored on brokers as append-only log files. A Kafka cluster is typically made up of several brokers (servers). A cluster with only one broker is also possible, although it is not recommended. Each broker is identified by an id and holds some of the topic partitions. Because Kafka is a distributed system, each broker holds some of the data, but not all of it. By connecting to any bootstrap broker, a client is connected to the entire cluster. Three brokers is a good starting point.
3. Producers
Producers write data to topics. Because each topic is partitioned, the load is balanced across brokers. Producers can use the acks setting, which has three possible values, to trade off durability guarantees against performance.
- acks = 0: The producer sends the data to the broker and considers it received, without waiting for any acknowledgment.
- acks = 1: The producer sends the data to the broker and waits for the leader’s confirmation (ack) that the data has arrived.
- acks = all: The leader broker waits for all in-sync replicas to confirm that they have replicated the data. Only after every replica has acknowledged does the leader respond to the producer with an ack.
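The trade-off can be summarized by counting how many acknowledgments the producer waits for under each mode. This is a conceptual sketch, not a real client; in an actual client library, acks is simply a producer configuration parameter:

```javascript
// How many acknowledgments a producer waits for under each acks mode.
// Conceptual sketch only; a real client handles this inside the wire protocol.
function requiredAcks(mode, inSyncReplicas) {
  switch (mode) {
    case 0:     return 0;               // fire and forget: wait for nothing
    case 1:     return 1;               // wait for the leader only
    case 'all': return inSyncReplicas;  // wait for every in-sync replica
    default:    throw new Error(`unknown acks mode: ${mode}`);
  }
}

requiredAcks(0, 3);     // 0 -> fastest, but data can be lost
requiredAcks(1, 3);     // 1 -> leader has it, replicas may still lag
requiredAcks('all', 3); // 3 -> strongest durability, highest latency
```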
When a producer sends a message to a topic without a key, the message is distributed to a partition effectively at random; when a key is supplied, all messages with the same key are routed to the same partition.
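That routing rule can be sketched as follows: with no key the partition is chosen arbitrarily, while a key is hashed so that messages with the same key always land in the same partition. The hash below is purely illustrative; the real Kafka producer uses a murmur2 hash:

```javascript
// Choose a target partition for a message, mimicking the producer's rule:
// no key -> arbitrary partition; with key -> deterministic hash of the key.
// Illustrative only: Kafka's actual partitioner uses a murmur2 hash.
function choosePartition(key, numPartitions) {
  if (key == null) {
    return Math.floor(Math.random() * numPartitions); // spread arbitrarily
  }
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % numPartitions; // same key, same partition
}

// Messages with the same key always map to the same partition:
choosePartition('customer-42', 3) === choosePartition('customer-42', 3); // true
```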
4. Consumers
Consumers read data from a given topic. Data is read in order within each partition, but a consumer can read from several partitions at the same time. If a broker fails, consumers recover automatically and continue receiving messages from the newly elected partition leader.
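The per-partition ordering guarantee can be illustrated with a toy consumer over an in-memory partition (again a sketch, not a real consumer API): reading a single partition always yields messages in offset order:

```javascript
// A toy consumer over an in-memory partition (an array indexed by offset).
// Within one partition, messages always come back in offset order.
// Conceptual sketch only; not a real Kafka consumer.
function* consumePartition(partition, fromOffset = 0) {
  for (let offset = fromOffset; offset < partition.length; offset++) {
    yield { offset, value: partition[offset] };
  }
}

const partition = ['a', 'b', 'c'];
const read = [...consumePartition(partition)];
// read[0] is { offset: 0, value: 'a' }: order within the partition is preserved
```

Resuming from a stored offset is just a matter of passing `fromOffset`, which mirrors how a real consumer commits and resumes from offsets.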
5. Kafka Broker Discovery
Each broker in Kafka is also known as a bootstrap server, and every broker knows about all other brokers, topics, and partitions (the cluster metadata). This means that connecting to a single broker is effectively the same as connecting to the entire cluster. This is also shown in the diagram below.
6. ZooKeeper
ZooKeeper’s job is to manage all the brokers in a Kafka cluster. It assists in leader elections and notifies the cluster of any changes (a new topic, a dead broker, etc.). A Kafka cluster cannot function without ZooKeeper, so it must be started first. ZooKeeper runs on an odd number of servers, with one leader and a number of followers that serve reads.
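The reason for an odd number of servers is majority voting: a quorum needs more than half of the ensemble, so an even-numbered ensemble tolerates no more failures than the next smaller odd one. A quick calculation:

```javascript
// Quorum size and fault tolerance for a ZooKeeper ensemble of n servers.
// A majority (floor(n/2) + 1) must agree; the remaining servers may fail.
function quorum(n) { return Math.floor(n / 2) + 1; }
function tolerableFailures(n) { return n - quorum(n); }

tolerableFailures(3); // 1
tolerableFailures(4); // still 1: the extra even server buys nothing
tolerableFailures(5); // 2
```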
Wrapping It Up
We have covered the foundations of Apache Kafka, a distributed messaging system, in this blog. We explained its architecture in terms of ZooKeeper, brokers, producers, and consumers. Data in Kafka is organized into partitioned, replicated topics to which messages are published.