Apache Kafka has become the de facto standard for building real-time data pipelines and streaming applications. Designed as a distributed commit log, Kafka decouples producers and consumers through a durable, high-throughput, fault-tolerant architecture. In this comprehensive blog post, we’ll explore Kafka’s core building blocks—Topics, Partitions, Offsets, Consumer Groups—and delve into the granular configuration options for Producers and Consumers that ensure performance, reliability, and exactly-once semantics.


[Interactive demo: Apache Kafka flow visualization showing real-time message streaming from a producer through the Kafka cluster to consumers]

1. Topics: The Logical Stream of Events

A Topic in Kafka is a named feed to which records are published. Think of it as a category or log file name. Topics provide a namespace for producers to write messages and for consumers to subscribe.

  • Key characteristics:
    • Append-only: Once written, records are immutable and stored in log segments on disk.
    • Retention policies: Configurable by size or time (retention.ms, retention.bytes).
    • Cleanup policies: delete (default) or compact, useful for changelog topics in stream processing.
# Create a topic named "orders" with 3 partitions and a replication factor of 2
kafka-topics --create \
  --topic orders \
  --partitions 3 \
  --replication-factor 2 \
  --bootstrap-server broker1:9092
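
Retention and cleanup settings can also be changed after a topic exists. The sketch below uses the Java Admin client against the orders topic and broker from the command above; the 7-day retention value is purely illustrative, and imports from org.apache.kafka.clients.admin are assumed.
// Adjust retention and cleanup policy on an existing topic
Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

try (Admin admin = Admin.create(adminProps)) {
  ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
  // Keep records for 7 days; switch cleanup.policy to "compact" for changelog-style topics
  Collection<AlterConfigOp> ops = List.of(
      new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),
      new AlterConfigOp(new ConfigEntry("cleanup.policy", "delete"), AlterConfigOp.OpType.SET));
  admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
}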

2. Partitions: Parallelism and Ordering

Each topic is divided into Partitions, which are the unit of parallelism and ordering in Kafka.

  • Ordering guarantee: Within a single partition, Kafka guarantees strict order by offset.
  • Scalability: More partitions → higher throughput and more consumer parallelism.
  • Leader/Follower: One broker acts as the leader for a partition; others are followers replicating the log.

Producers can assign messages to partitions via:

  1. Key-based partitioning (deterministic): partition = hash(key) % num_partitions.
  2. Round-robin (no key): Balances load evenly across partitions.
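
For example, key-based routing is a one-liner with the Java client. The sketch below assumes a KafkaProducer<String, String> named producer (configured with StringSerializer for keys and values) and the 3-partition orders topic from Section 1; the customer key and JSON payload are illustrative.
// Records with the same key always hash to the same partition, preserving their relative order
ProducerRecord<String, String> keyed =
    new ProducerRecord<>("orders", "customer-42", "{\"orderId\": 1001, \"amount\": 99.95}");
producer.send(keyed);

// Without a key, the producer spreads records across partitions to balance load
producer.send(new ProducerRecord<>("orders", null, "{\"orderId\": 1002}"));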

3. Offsets: Bookmarking Your Place

Within a partition, each record has a monotonically increasing Offset (0, 1, 2, …). Offsets serve as bookmarks for consumers:

  • Consumer offset: The next record to read. Stored either in Kafka’s __consumer_offsets topic or externally.
  • Auto vs. manual commit:
    • enable.auto.commit=true: Kafka will commit offsets every auto.commit.interval.ms.
    • enable.auto.commit=false: You control when to commit via the consumer API, enabling finer control.
// Manual commit example: process the batch, then commit its offsets
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
// ... process records ...
consumer.commitSync(); // commits the offsets returned by the last poll()

4. Consumer Groups: Scaling and Fault Tolerance

A set of consumers identified by the same group.id form a Consumer Group. Kafka ensures that each partition of a subscribed topic is consumed by exactly one consumer in the group, providing:

  • Load balancing: Distributes partitions across consumers.
  • Fault tolerance: If a consumer crashes, its partitions are reassigned.
  • Rebalancing: Triggered when consumers join/leave the group or subscriptions change.
# Sample consumer properties
group.id=order-processors
enable.auto.commit=false
auto.offset.reset=earliest
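
As a minimal sketch, assuming these properties (plus bootstrap.servers and the String deserializers) have been loaded into a java.util.Properties object named props, every process running the following code joins the same group and shares the topic's partitions:
// All instances started with group.id=order-processors form one consumer group
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("orders"));
// With 3 partitions: one instance reads all 3, two instances split them 2/1,
// and a 4th instance sits idle because there are only 3 partitions to assign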

5. Producers: Publishing Data with Precision

5.1 Essential Producer Configurations

  • bootstrap.servers: Comma-separated list of broker addresses. (no default)
  • key.serializer: Class to serialize message keys (e.g., StringSerializer). (no default)
  • value.serializer: Class to serialize message values (e.g., StringSerializer). (no default)
  • acks: Number of acknowledgments the leader must receive (0, 1, all). Default: 1
  • retries: Number of retry attempts on transient failures. Default: 2147483647
  • linger.ms: Time to wait for additional messages before sending a batch. Default: 0
  • batch.size: Maximum size (in bytes) of each batch. Default: 16384 (16 KB)
  • enable.idempotence: Ensure exactly-once delivery per producer session. Default: false
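
A minimal sketch wiring several of these settings together, reusing the broker address and topic from the earlier examples; the linger.ms and batch.size values are illustrative, not recommendations.
// Build a producer from explicit configs and send one keyed record
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ACKS_CONFIG, "all");      // wait for all in-sync replicas
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);      // trade a little latency for larger batches
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768); // 32 KB batches

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("orders", "customer-42", "order-payload"),
    (metadata, exception) -> {
      if (exception != null) {
        exception.printStackTrace(); // e.g., retries exhausted or broker unavailable
      }
    });
producer.close(); // flushes any outstanding batches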

5.2 Enabling Exactly-Once Semantics

enable.idempotence=true
acks=all
retries=2147483647
max.in.flight.requests.per.connection=1

With idempotence, Kafka assigns a producer ID and sequence numbers to detect duplicates on retries.


6. Consumers: Fetching and Processing Messages

6.1 Key Consumer Configurations

  • bootstrap.servers: Comma-separated list of broker addresses. (no default)
  • group.id: Identifier for the consumer group. (no default)
  • key.deserializer: Class to deserialize message keys (e.g., StringDeserializer). (no default)
  • value.deserializer: Class to deserialize message values (e.g., StringDeserializer). (no default)
  • auto.offset.reset: Action when no committed offset is found (earliest, latest, none). Default: latest
  • enable.auto.commit: Whether to auto-commit offsets. Default: true
  • fetch.min.bytes: Minimum bytes to fetch in a request before responding. Default: 1
  • max.poll.records: Maximum number of records returned in a single poll(). Default: 500

6.2 Poll Loop Example

while (running) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> record : records) {
    // process record.key(), record.value()
  }
  consumer.commitSync();
}

7. Under-the-Hood: Replication and Fault Tolerance

  • Replication factor: Number of copies per partition. Ensures data durability.
  • In-Sync Replicas (ISR): Followers fully caught up with the leader.
  • Unclean leader election: Avoided by default to prevent data loss.
min.insync.replicas=2

Sets the minimum number of in-sync replicas that must acknowledge a write for acks=all to succeed.
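
The same durability settings can be applied at topic-creation time. Below is a sketch with the Java Admin client; the payments topic name is a placeholder, and the broker address reuses the earlier example.
// Create a topic with 3 partitions, replication factor 2, and min.insync.replicas=2,
// so acks=all writes fail fast if a replica falls out of the ISR
Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

try (Admin admin = Admin.create(adminProps)) {
  NewTopic payments = new NewTopic("payments", 3, (short) 2)
      .configs(Map.of("min.insync.replicas", "2"));
  admin.createTopics(List.of(payments)).all().get();
}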


8. Delivery Guarantees

  • At most once: Messages may be lost but never redelivered (acks=0).
  • At least once: Messages may be redelivered, duplicates possible (default acks=1, retries > 0).
  • Exactly once: No duplicates even on retries (idempotent producer + transactional API).
// Transactional producer (requires transactional.id to be set in the producer config)
producer.initTransactions();
try {
  producer.beginTransaction();
  // send messages within the transaction
  producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
  producer.close();            // fatal: another producer took over, or the config is invalid
} catch (KafkaException e) {
  producer.abortTransaction(); // transient: abort this transaction and retry
}

9. Conclusion

Apache Kafka’s design—distributed, partitioned, replicated—provides the backbone for resilient, high-throughput data streams. By tuning producer and consumer configurations, you can adapt Kafka for use cases ranging from simple pub/sub to mission-critical exactly-once stream processing. We’ve covered the fundamental concepts and configuration knobs; your next step is hands‑on experimentation with the Kafka console tools, client libraries, and stream processing APIs. Happy streaming!