FreedomDev

Your Dedicated Dev Partner. Zero Hiring Risk. No Agency Contracts.

201 W Washington Ave, Ste. 210

Zeeland MI

616-737-6350

[email protected]



Affiliations

  • FreedomDev is an InnoGroup Company
  • Located in the historic Colonial Clock Building
  • Proudly serving Innotec Corp. globally

Certifications

Proud member of the Michigan West Coast Chamber of Commerce

Gov. Contractor Codes

NAICS: 541511 (Custom Computer Programming) | CAGE Code: oYVQ9 | UEI: QS1AEB2PGF73
Download Capabilities Statement

© 2026 FreedomDev Sensible Software. All rights reserved.

Core Technology Stack

Apache Kafka Development & Real-Time Data Streaming Solutions

Distributed event streaming platform engineering for high-throughput message processing, real-time data pipelines, and microservices communication that scales to millions of events per second.


Enterprise Event Streaming Infrastructure Built for Scale

Apache Kafka processes over 7 trillion messages daily across organizations like LinkedIn, Netflix, and Uber, making it the de facto standard for real-time data streaming. According to the 2023 Apache Software Foundation annual report, Kafka ranks among the top five most active Apache projects with over 1,200 contributors and deployment in more than 80% of Fortune 100 companies. At FreedomDev, we've architected Kafka-based solutions that process 50 million+ events daily for manufacturing, logistics, and financial services clients across West Michigan and beyond.

Apache Kafka functions as a distributed commit log—a fundamentally different architecture from traditional message queues like RabbitMQ or ActiveMQ. Rather than delete messages after consumption, Kafka persists all events in an append-only log stored across multiple brokers. This design enables multiple consumers to read the same data stream independently, replay historical events, and maintain complete audit trails. The event sourcing pattern we implement with Kafka provides organizations with a single source of truth for all state changes, critical for regulatory compliance and data lineage tracking.
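The distributed-commit-log behavior described above can be sketched in a few lines. This is an illustrative toy (class and method names are ours, not Kafka's) showing why independent per-group offsets allow replay, unlike a delete-on-acknowledgment queue:

```python
# Toy append-only log (one "partition"): records are never deleted on read;
# each consumer group just tracks its own position.
class ToyCommitLog:
    def __init__(self):
        self._log = []       # append-only record list
        self._offsets = {}   # consumer-group name -> next offset to read

    def append(self, record):
        self._log.append(record)
        return len(self._log) - 1   # offset of the appended record

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        # "Replay": rewind one group's offset without touching the data
        self._offsets[group] = offset

log = ToyCommitLog()
for evt in ["order_created", "payment_received", "order_shipped"]:
    log.append(evt)

analytics = log.poll("analytics")   # reads all three events
billing = log.poll("billing")       # independently reads the same three
log.seek("analytics", 0)            # rewind and replay for the audit trail
replayed = log.poll("analytics")
```

Both groups see the full stream, and rewinding one group has no effect on the other — the property that makes event sourcing and audit replay possible.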

We've deployed Kafka clusters ranging from three-node development environments to production systems with 20+ brokers handling terabytes of data daily. Our [Real-Time Fleet Management Platform](/case-studies/great-lakes-fleet) case study demonstrates Kafka processing GPS coordinates, engine telemetry, and maintenance alerts from 150+ vehicles with sub-100ms latency. The platform ingests 2.3 million messages daily through Kafka Connect source connectors, routes them to appropriate consumer groups, and maintains a 30-day event window for operational replay and analytics.

Kafka's architecture consists of several key components we configure based on client requirements. Producers publish messages to named topics, which are partitioned across brokers for parallelism. Each partition maintains an ordered, immutable sequence of records with sequential offset identifiers. Consumer groups coordinate to distribute partition consumption across multiple instances, enabling horizontal scaling. ZooKeeper (or KRaft in Kafka 3.0+) manages cluster metadata, leader election, and configuration. We typically implement producer idempotence, exactly-once semantics, and consumer offset management strategies to guarantee message delivery and processing.
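Key-based partition assignment is what keeps related events ordered. Kafka's default partitioner hashes the key bytes with murmur2; the sketch below substitutes `zlib.crc32` purely to illustrate the idea (stable hash of the key, modulo partition count):

```python
import zlib

def pick_partition(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner (which uses murmur2):
    # the same key always maps to the same partition, so all events
    # for one entity land in one ordered, immutable sequence.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = pick_partition("order-1001", 6)
p2 = pick_partition("order-1001", 6)   # same key -> same partition
p3 = pick_partition("order-2002", 6)   # other keys spread across brokers
```

Note the corollary: changing the partition count remaps keys, which is why partition counts are planned up front rather than grown casually.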

The platform excels in scenarios requiring high-throughput data ingestion, event-driven microservices, change data capture (CDC), and stream processing. Unlike traditional ETL batch processing that runs hourly or daily, Kafka enables continuous data pipelines with real-time transformation using Kafka Streams or external processors. Our [QuickBooks Bi-Directional Sync](/case-studies/lakeshore-quickbooks) integration leverages Kafka to maintain eventual consistency between cloud ERP systems and on-premise accounting software, processing invoice updates, payment records, and inventory adjustments with guaranteed ordering within partitions.

We implement Kafka using managed services like Confluent Cloud, Amazon MSK (Managed Streaming for Apache Kafka), or self-hosted clusters on AWS EC2, Azure VMs, or client data centers. The choice depends on factors including data sovereignty requirements, operational expertise, cost constraints, and integration complexity. A three-broker Confluent Cloud cluster costs approximately $1,200 monthly, while equivalent self-hosted infrastructure on AWS runs $800-900 monthly but requires dedicated DevOps resources for monitoring, patching, and scaling. For clients in regulated industries like healthcare or finance, we deploy on-premise Kafka clusters with encryption at rest and in transit, SASL/SCRAM authentication, and ACL-based authorization.

Performance optimization involves tuning producer batch size, linger time, compression algorithms, partition counts, replication factors, and consumer fetch configurations. We've achieved 99.9% uptime SLAs through proper broker placement across availability zones, configuring min.insync.replicas=2 for critical topics, implementing circuit breakers in producer applications, and establishing comprehensive monitoring with Prometheus and Grafana. Our standard Kafka deployment includes JMX metric collection, alerting on under-replicated partitions, consumer lag monitoring, and automated broker recovery procedures.

Kafka Connect provides pre-built connectors for databases, cloud storage, message queues, and SaaS platforms. We've implemented Debezium CDC connectors to stream MySQL and PostgreSQL change events into Kafka topics, JDBC sink connectors to populate data warehouses, S3 sink connectors for long-term archival, and custom connectors for proprietary systems. The Kafka Streams library enables stateful stream processing with windowing, joins, and aggregations without external dependencies. For complex event processing requirements, we integrate Apache Flink or Apache Spark Structured Streaming as Kafka consumers.

Organizations transitioning from batch processing to event streaming face architectural challenges around schema management, data consistency models, and operational complexity. We implement Confluent Schema Registry with Avro serialization to enforce schema evolution contracts between producers and consumers, preventing breaking changes that crash downstream applications. Event versioning strategies, backward/forward compatibility testing, and migration procedures ensure smooth deployments. Our [custom software development](/services/custom-software-development) practice includes Kafka adoption roadmaps, proof-of-concept implementations, and team training on event-driven design patterns.

Security implementation involves encryption (SSL/TLS), authentication (SASL/PLAIN, SASL/SCRAM, Kerberos, OAuth), authorization (ACLs), and audit logging. We configure separate topics with different replication factors based on data criticality—financial transactions replicated across five brokers versus debug logs on single replicas. Disaster recovery planning includes cluster mirroring with MirrorMaker 2.0 for multi-datacenter deployments, backup strategies for ZooKeeper state, and documented runbooks for broker failures, network partitions, and data corruption scenarios. The [Official Kafka Documentation](https://kafka.apache.org/documentation/) provides comprehensive configuration references we leverage for production deployments.

7T+
Daily Messages Processed Globally
80%
Fortune 100 Companies Using Kafka
50M+
Events/Day in Our Fleet Platform
99.9%
Uptime SLA Achieved
<100ms
Processing Latency
1,200+
Active Contributors to Project

Need to rescue a failing Apache Kafka project?

Our Apache Kafka Capabilities

High-Throughput Message Ingestion & Publishing

We design Kafka producer implementations that achieve 100,000+ messages per second throughput through batching, compression, and asynchronous send operations. Producer configurations include idempotence guarantees to prevent duplicate messages during retries, custom partitioning strategies to control message distribution across brokers, and callback handlers for send confirmation or error handling. Our [Java](/technologies/java) applications use the official Kafka client library with tuned parameters: batch.size=32768, linger.ms=10, compression.type=snappy, and buffer.memory=67108864. For Python applications, we implement kafka-python or confluent-kafka libraries with similar optimizations. Monitoring includes tracking producer request latency, batch size metrics, and buffer exhaustion events to identify performance bottlenecks.
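The tuned parameters named above translate to a confluent-kafka style config like the following. This is a starting-point sketch, not universal defaults, and since constructing a live `Producer` requires a reachable broker, only the configuration dict is shown here (the broker addresses are placeholders):

```python
# Producer tuning from the text, in confluent-kafka's flat config style.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "enable.idempotence": True,    # no duplicates on broker-side retries
    "acks": "all",                 # wait for all in-sync replicas
    "batch.size": 32768,           # bytes batched per partition
    "linger.ms": 10,               # wait up to 10 ms to fill a batch
    "compression.type": "snappy",  # cheap CPU cost, good ratio for JSON/Avro
}

# With confluent-kafka installed and a broker reachable, usage would be:
#   from confluent_kafka import Producer
#   producer = Producer(producer_config)
#   producer.produce("telemetry", key="machine-42", value=payload)
```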


Distributed Consumer Groups with Automatic Rebalancing

Consumer group implementations distribute partition consumption across multiple application instances, automatically rebalancing when instances join or leave. We configure session.timeout.ms, heartbeat.interval.ms, and max.poll.interval.ms parameters based on processing complexity to prevent unnecessary rebalances that cause duplicate processing. Our standard pattern implements manual offset commits after successful processing to prevent data loss, with configurable retry policies for transient failures. We've deployed consumer applications using [Python](/technologies/python) FastAPI services, [JavaScript](/technologies/javascript) Node.js workers, and Java Spring Boot microservices, all coordinating through Kafka consumer groups. Metrics collection includes consumer lag monitoring (records-lag-max) with alerts when lag exceeds 10,000 messages.
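The commit-after-processing pattern is easiest to see in a toy simulation (function and argument names here are illustrative, not a client API). Committing only after successful processing means a crash between the two steps causes a record to be reprocessed on restart — at-least-once delivery — rather than silently skipped:

```python
def run_consumer(records, committed_offset, crash_before_commit_at=None):
    # Toy model: process records from the last committed offset onward,
    # committing each offset only AFTER processing succeeds.
    processed = []
    offset = committed_offset
    for i in range(committed_offset, len(records)):
        processed.append(records[i])      # step 1: process the record
        if i == crash_before_commit_at:
            return processed, offset      # crash: this commit never happened
        offset = i + 1                    # step 2: commit the offset
    return processed, offset

records = ["a", "b", "c"]
first_run, committed = run_consumer(records, 0, crash_before_commit_at=1)
# Restart from the last committed offset: "b" is processed again, not lost.
second_run, committed = run_consumer(records, committed)
```

This is why downstream handlers need to be idempotent: the duplicate of "b" is the price of never losing it.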


Kafka Connect for Zero-Code Integration

We implement Kafka Connect clusters in distributed mode to ingest data from databases, cloud storage, and SaaS platforms without custom code. Debezium connectors stream MySQL binlog changes and PostgreSQL WAL records as Kafka events, enabling change data capture architectures. JDBC sink connectors write Kafka topics to SQL databases with configurable batch sizes, connection pooling, and error handling. S3 sink connectors archive events to object storage with time-based or size-based partitioning. Custom connector development uses the Connect API framework when pre-built connectors don't meet requirements. We deploy Connect clusters with three workers for high availability, configure task parallelism based on topic partition counts, and implement monitoring for connector failures and data drift.


Stream Processing with Kafka Streams & ksqlDB

The Kafka Streams library enables stateful processing including windowing, aggregations, joins between topics, and table-stream dualities without external infrastructure. We've implemented real-time analytics applications that compute 5-minute tumbling window aggregations on sensor data, join order events with customer lookup tables, and detect patterns across multiple event types. State stores use RocksDB for persistence with changelog topics for fault tolerance. For SQL-based stream processing, we deploy ksqlDB to enable business analysts to query Kafka topics with familiar SQL syntax and create materialized views that update continuously. Processing guarantees include exactly-once semantics through transactional writes and offset management. Our stream processing applications handle late-arriving data through configurable grace periods and watermarking strategies.
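A 5-minute tumbling window is just aligned, non-overlapping time buckets. The toy function below (our names, not the Kafka Streams API) shows the bucketing arithmetic a windowed count performs:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    # Each event is (timestamp_ms, key); a tumbling window assigns it to
    # the aligned bucket starting at floor(ts / window) * window.
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "sensor-a"), (200_000, "sensor-a"), (400_000, "sensor-b")]
result = tumbling_window_counts(events, window_ms=300_000)  # 5-minute windows
```

In real Kafka Streams the per-window state lives in a RocksDB state store backed by a changelog topic, and grace periods decide how long a closed window still accepts late events.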


Schema Management with Confluent Schema Registry

Schema Registry provides centralized schema versioning and validation for Avro, Protobuf, and JSON Schema formats, preventing incompatible producers from breaking consumer applications. We configure compatibility modes (backward, forward, full, none) based on evolution requirements and implement schema validation in CI/CD pipelines before deployment. Avro serialization reduces message size by 50-70% compared to JSON while providing schema evolution capabilities. Producer applications include schema IDs in message headers, and consumers automatically fetch and cache schemas from the registry. Our [systems integration](/services/systems-integration) projects leverage Schema Registry to enforce data contracts between microservices, with documentation generated from schema definitions. Migration procedures handle breaking schema changes through versioned topics and dual-read consumer patterns.
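The backward-compatibility rule reduces to a simple check in its basic form. This is a deliberately simplified sketch of one Avro rule (real Schema Registry checks types, unions, aliases, and promotion, not just field presence): the new schema (reader) can decode data written with the old schema only if every field it adds carries a default.

```python
def is_backward_compatible(old_fields, new_fields):
    # Simplified BACKWARD check: fields present in the new schema but
    # absent from the old one must have default values, or old records
    # cannot be decoded by the upgraded consumer.
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f]["has_default"] for f in added)

old = {"user_id": {"has_default": False}, "email": {"has_default": False}}
ok_new = dict(old, phone={"has_default": True})    # added with a default: safe
bad_new = dict(old, phone={"has_default": False})  # added, no default: breaks
```

Running a check like this (via the Schema Registry compatibility API) in CI is what catches the breaking change before a producer deploy, rather than as a consumer crash in production.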


Multi-Datacenter Replication with MirrorMaker 2.0

MirrorMaker 2.0 replicates topics across Kafka clusters for disaster recovery, multi-region deployments, and data aggregation architectures. We configure active-passive replication for backup clusters with automatic failover procedures, and active-active replication for geo-distributed applications serving regional users. Replication flows include offset translation, consumer group synchronization, and configuration propagation. Topic naming strategies like source.topic-name prevent naming conflicts in multi-cluster environments. Monitoring includes replication lag metrics, checkpoint intervals, and end-to-end latency measurements. Our disaster recovery runbooks document cluster failover procedures, DNS updates, and application reconfiguration steps with tested recovery time objectives (RTO) under 15 minutes.


Security Implementation with Authentication & Authorization

Enterprise Kafka deployments require SSL/TLS encryption for data in transit and encryption at rest for stored logs. We implement SASL/SCRAM authentication with username/password credentials stored in ZooKeeper or KRaft, configure ACLs to control topic read/write permissions per user or application, and enable audit logging for compliance requirements. Mutual TLS authentication uses client certificates for service-to-service communication. For LDAP/Active Directory integration, we configure SASL/PLAIN with external authorization services. Network segmentation places Kafka brokers in private subnets with security groups restricting access to application tiers. Key rotation procedures, certificate expiration monitoring, and credential management through HashiCorp Vault or AWS Secrets Manager complete our security architecture documented per [Official Kafka Security Documentation](https://kafka.apache.org/documentation/#security).


Production Monitoring & Operational Management

Comprehensive monitoring collects JMX metrics from brokers, producers, and consumers using Prometheus exporters and visualizes them in Grafana dashboards. Key metrics include broker under-replicated partitions, producer request latency, consumer lag, disk usage, and network throughput. We configure PagerDuty or Opsgenie alerts for critical conditions: any under-replicated partitions, consumer lag exceeding thresholds, broker unavailability, or disk space above 80%. Log aggregation with ELK stack or CloudWatch Logs enables troubleshooting and audit trail analysis. Capacity planning reviews track message rate growth, storage usage trends, and partition count increases. Our operational runbooks document scaling procedures (adding brokers, increasing partition counts), common failure scenarios (broker crashes, network partitions), and performance tuning adjustments based on production metrics.


Need Senior Talent for Your Project?

Skip the recruiting headaches. Our experienced developers integrate with your team and deliver from day one.

  • Senior-level developers, no juniors
  • Flexible engagement — scale up or down
  • Zero hiring risk, no agency contracts
“FreedomDev is very much the expert in the room for us. They've built us four or five successful projects, including things we didn't think were feasible.”
— Paul Z., Chief Operating Officer, Scott Group

Perfect Use Cases for Apache Kafka

Real-Time IoT Telemetry Collection & Processing

Manufacturing clients stream sensor data from production equipment, environmental monitors, and quality control systems into Kafka topics for real-time monitoring and predictive maintenance. Our implementation for a Grand Rapids automotive supplier ingests temperature, pressure, and vibration readings from 200+ machines every second (17.3 million events daily) into partitioned topics. Kafka Streams applications detect anomaly patterns, calculate rolling averages, and trigger alerts when thresholds are breached. Historical data flows through JDBC sink connectors into PostgreSQL for trend analysis. The architecture replaces batch file transfers that delayed anomaly detection by 15-30 minutes, enabling immediate equipment shutdown to prevent defects. Retention policies purge raw telemetry after 72 hours while aggregated metrics persist for two years.
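The rolling-average check at the heart of that anomaly detection can be sketched in a few lines. This is an illustrative stand-in (window size, factor, and names are ours), not the production Kafka Streams topology:

```python
from collections import deque

def detect_spikes(readings, window=4, factor=1.5):
    # Flag any reading that exceeds `factor` times the mean of the
    # previous `window` readings; returns the indexes of alerts.
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window and value > factor * (sum(recent) / window):
            alerts.append(i)
        recent.append(value)
    return alerts

temps = [70, 71, 69, 70, 72, 140, 71]   # the 140 reading is a spike
spike_indexes = detect_spikes(temps)
```

In the streaming version this state lives per machine key in a state store, so 200+ machines are checked in parallel across partitions rather than in one loop.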

Event-Driven Microservices Communication

Kafka serves as the event backbone for microservices architectures, enabling asynchronous communication that decouples services and improves resilience. Our e-commerce platform implementation publishes order placement, payment processing, inventory updates, and shipping notifications as Kafka events that multiple services consume independently. The order service publishes to orders topic, payment service consumes and publishes to payments topic, and warehouse service subscribes to both for fulfillment coordination. This choreography pattern eliminates point-to-point API calls that create tight coupling and cascading failures. Event sourcing stores all state changes as Kafka events, enabling complete audit trails and event replay for debugging. Schema Registry enforces event structure contracts between services, while consumer groups scale individual services independently based on load.

Change Data Capture for Database Synchronization

Debezium CDC connectors stream database changes (INSERT, UPDATE, DELETE) into Kafka topics in real-time, enabling data replication, cache invalidation, and event-driven workflows. Our implementation for a Holland-based distribution company captures PostgreSQL changes from an ERP system and publishes them to Kafka topics partitioned by table name. Downstream consumers update Elasticsearch for full-text search, invalidate Redis cache entries, trigger inventory reorder workflows, and replicate to a MySQL analytics database. This architecture replaces nightly batch ETL processes with continuous replication having sub-second latency. The [database services](/services/database-services) team configures logical replication slots, manages schema changes that affect Debezium, and monitors replication lag. Before/after event structures enable downstream consumers to implement upsert logic for idempotent processing.
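The upsert logic consumers apply to Debezium's before/after envelopes is small but important: it collapses inserts and updates into one write and tolerates replays. A toy sketch (simplified envelope, our field names mirror Debezium's `before`/`after` convention):

```python
def apply_cdc_event(store, event):
    # Idempotent apply of a Debezium-style change event: "after" is None
    # for deletes; replaying the same event leaves the store unchanged.
    key = event["key"]
    if event["after"] is None:
        store.pop(key, None)         # delete, tolerating already-deleted
    else:
        store[key] = event["after"]  # insert and update collapse to upsert
    return store

store = {}
create = {"key": 1, "before": None, "after": {"sku": "A-100", "qty": 5}}
update = {"key": 1, "before": {"sku": "A-100", "qty": 5},
          "after": {"sku": "A-100", "qty": 3}}
delete = {"key": 1, "before": {"sku": "A-100", "qty": 3}, "after": None}

apply_cdc_event(store, create)
apply_cdc_event(store, update)
apply_cdc_event(store, update)   # replayed duplicate: no change
```

Because applying the same event twice is a no-op, the at-least-once delivery guarantee upstream never corrupts the replica.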

Log Aggregation & Centralized Monitoring

Kafka collects application logs, system metrics, and security events from distributed services for centralized analysis and alerting. Our logging architecture uses Filebeat or Fluentd to tail log files and publish to Kafka topics, partitioned by application name and environment. Logstash consumers parse structured JSON logs, enrich with metadata, and write to Elasticsearch for analysis in Kibana. High-volume debug logs have 24-hour retention and single replication, while security audit logs replicate across five brokers with 365-day retention. This architecture processes 50GB of logs daily for a multi-tenant SaaS platform, providing tenant-isolated log views, regex-based search, and alerting on error patterns. Kafka's buffering handles log spikes during deployments without dropping messages, unlike direct Elasticsearch ingestion that throttles under load.

Fraud Detection & Real-Time Analytics

Financial services applications stream transaction events through Kafka for real-time fraud detection, risk scoring, and regulatory reporting. Our implementation for a regional credit union publishes ATM transactions, online banking activities, and card authorizations to Kafka topics consumed by a Kafka Streams application computing rolling 24-hour transaction counts and geographical velocity checks. Machine learning models deployed as Kafka consumers score transactions against historical patterns, flagging anomalies for manual review. The system processes 15,000+ transactions daily with decision latencies under 100ms. Exactly-once semantics prevent duplicate fraud alerts or missed detections during application restarts. Event replay capabilities enable backtesting new fraud detection rules against historical transaction streams, with A/B testing comparing rule effectiveness before production deployment.

Customer Activity Tracking & Personalization

E-commerce and SaaS platforms publish user interactions—page views, feature usage, search queries—as Kafka events for real-time personalization and analytics. Our retail client's web application publishes clickstream data to Kafka topics using JavaScript producers with batching to reduce request overhead. Consumer groups aggregate events into user sessions, identify product affinity patterns, and trigger personalized email campaigns. Real-time recommendation engines built with Kafka Streams maintain user preference state stores updated continuously from interaction events. The architecture processes 2 million+ events daily across 50,000 active users, enabling sub-second product recommendation updates. GDPR compliance features include event filtering by user consent flags, automated deletion requests processed as tombstone messages, and topic-level retention matching data retention policies.

Supply Chain Event Tracking & Coordination

Logistics companies use Kafka to track shipments, coordinate warehouse operations, and synchronize inventory across distribution centers. Our implementation for a West Michigan logistics provider publishes GPS coordinates, loading/unloading events, temperature readings for refrigerated shipments, and delivery confirmations to location-partitioned topics. Consumer applications update customer-facing tracking portals in real-time, calculate estimated arrival times based on current locations, and trigger exception workflows when delays occur. Integration with carriers' APIs publishes tracking updates from FedEx, UPS, and regional carriers into unified Kafka topics. The system replaced polling-based updates checking carrier APIs every 15 minutes with event-driven webhooks providing immediate status changes. Kafka's partition ordering guarantees sequential processing of location events per shipment, preventing race conditions that display incorrect status.

Regulatory Compliance & Audit Trail Maintenance

Kafka's immutable log provides comprehensive audit trails for regulated industries including finance, healthcare, and manufacturing. Our implementation for a medical device manufacturer publishes quality control measurements, production parameter changes, and calibration events to Kafka topics with 7-year retention per FDA CFR Part 11 requirements. Every message includes digital signatures for non-repudiation, timestamps from NTP-synchronized sources for temporal ordering, and operator IDs for accountability. Immutable storage with signed offsets prevents tampering, while Schema Registry enforces data structure validation. Audit consumers generate compliance reports, detect unauthorized access patterns, and alert on anomalous activities. The architecture passed FDA inspection demonstrating complete traceability from raw materials to finished devices, correlating production events with quality outcomes across manufacturing lines.

Talk to an Apache Kafka Architect

Schedule a technical scoping session to review your app architecture.

Frequently Asked Questions

How does Apache Kafka differ from traditional message queues like RabbitMQ?
Kafka is a distributed commit log that persists messages on disk for configurable retention periods (days, weeks, or indefinitely), while RabbitMQ deletes messages after consumer acknowledgment. This fundamental difference enables Kafka consumers to replay historical messages, multiple consumer groups to read the same data independently, and complete event sourcing patterns. Kafka achieves higher throughput (millions of messages per second) through batching and sequential disk writes, while RabbitMQ excels at complex routing patterns, per-message acknowledgments, and low-latency delivery of small message volumes. We recommend Kafka for high-volume event streaming and RabbitMQ for workflow orchestration with routing complexity.
What are the infrastructure requirements for running Kafka in production?
Minimum production Kafka clusters require three brokers for fault tolerance (allowing one failure), ZooKeeper ensemble with three nodes (or KRaft mode in Kafka 3.0+), and sufficient disk space for retention periods multiplied by daily message volume. A typical small deployment uses three m5.large EC2 instances (2 vCPU, 8GB RAM) with 500GB gp3 EBS volumes, costing approximately $800-900 monthly on AWS. Enterprise deployments use dedicated broker instances, separate ZooKeeper clusters, and tiered storage moving older data to S3. We recommend 10Gbps network connectivity between brokers, XFS filesystem for data directories, and Java 11+ with tuned JVM heap settings (typically 6GB for brokers with 16GB total RAM).
How do you handle Kafka schema evolution without breaking existing consumers?
Schema Registry with Avro serialization enforces compatibility rules preventing breaking changes. Backward compatibility allows new schemas to read old data (safe for consumers), forward compatibility allows old schemas to read new data (safe for producers), and full compatibility requires both. We implement versioned schemas with optional fields having default values, never remove required fields in backward-compatible changes, and use separate topics when breaking changes are necessary. Migration procedures deploy updated consumers before producers for backward-compatible changes, or maintain dual-read patterns during transitions. Schema validation in CI/CD pipelines catches incompatibilities before production deployment, documented per [Confluent Schema Evolution Documentation](https://docs.confluent.io/platform/current/schema-registry/avro.html#schema-evolution).
What guarantees does Kafka provide for message delivery and ordering?
Kafka guarantees messages are ordered within a single partition but not across partitions. Producer idempotence (enable.idempotence=true) prevents duplicates during retries, while transactions provide exactly-once semantics across multiple topic writes. Consumer ordering depends on partition assignment—single consumer per partition guarantees processing order, multiple consumers per partition (not recommended) may process out of order. We partition topics by entity ID (user ID, order ID) to maintain ordering for related events. At-least-once delivery is default (messages may duplicate), exactly-once requires transactional producers and consumers reading committed offsets only. Our standard configuration uses acks=all requiring all in-sync replicas to acknowledge writes, min.insync.replicas=2, and consumer manual offset commits after processing.
How does consumer lag monitoring work and what thresholds trigger alerts?
Consumer lag measures the offset difference between the last produced message and last consumed message per partition. We monitor records-lag-max metric per consumer group using JMX exporters to Prometheus, calculating total lag summed across all partitions. Alert thresholds depend on message rates and processing time—a consumer processing 1,000 messages/second with 10-second lag (10,000 messages behind) might be acceptable, while 5-minute lag (300,000 messages) indicates problems. Our standard alerts trigger at lag >10,000 messages or lag increasing >20% over 5 minutes. Lag causes include slow processing, insufficient consumer instances, partition skew (hot partitions), or external dependencies (database slowness). Remediation includes adding consumer instances, optimizing processing code, rebalancing partitions, or temporarily increasing consumer parallelism.
Can Kafka handle GDPR right-to-deletion requirements given its immutable log?
Kafka 2.1+ supports message deletion through tombstone messages (null value with same key) and log compaction, enabling GDPR compliance. We implement compacted topics with cleanup.policy=compact storing only the latest value per key—publishing a tombstone deletes the user record after compaction runs. For complete deletion, we partition topics by user ID, enabling targeted topic deletion while preserving other users' data. Alternative approaches include encrypting messages with user-specific keys (deleting keys makes data unreadable), anonymizing personally identifiable information in consumer applications before storage, or using short retention periods (7 days) with long-term storage in deletable systems. Our compliance implementations document data retention per topic, automated deletion workflows, and audit trails proving deletion completion.
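Log compaction's effect on tombstones is easy to model: keep only the latest value per key, and drop keys whose latest value is null. A toy sketch of what the broker's compaction pass achieves (the real process is segment-based and asynchronous):

```python
def compact(log):
    # Toy log compaction: retain only the latest value per key; a
    # tombstone (None value) removes the key entirely once compaction runs.
    latest = {}
    for key, value in log:
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [
    ("user-42", {"email": "a@example.com"}),
    ("user-7",  {"email": "b@example.com"}),
    ("user-42", {"email": "new@example.com"}),  # update supersedes original
    ("user-42", None),                          # GDPR deletion tombstone
]
compacted = compact(log)
```

After compaction, no trace of user-42 remains in the topic while other users' latest records survive — the basis of the tombstone-driven deletion workflow described above.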
What are the performance implications of increasing partition counts?
More partitions enable higher parallelism for producers and consumers but increase broker overhead and leader election time during failures. Each partition requires file handles, memory for network buffers, and periodic metadata updates. We've observed minimal performance impact up to 4,000 partitions per broker, degradation between 4,000-10,000 partitions, and significant issues beyond 10,000 partitions per broker. Leader election time increases linearly with partition count—clusters with 10,000 partitions may take 20-30 seconds for controller election during broker failures. Best practices include starting with 3-6 partitions per topic (matching typical consumer parallelism), monitoring broker CPU and file descriptor usage, and planning partition counts based on throughput requirements (each partition handles 10-100MB/sec depending on configuration).
How do you size Kafka clusters for specific throughput and retention requirements?
Sizing calculations start with daily message volume, average message size, retention period, and replication factor. Example: 10 million messages/day × 2KB average size × 7 day retention × 3 replication = 420GB storage required. Add 30% overhead for indexes and compaction, requiring 550GB. Throughput sizing assumes 50MB/sec per broker (conservative estimate)—100,000 messages/second × 2KB = 200MB/sec requires 4+ brokers. We recommend separate clusters for different SLAs (critical topics on dedicated brokers with higher replication), tiered storage moving old data to S3, and monitoring disk usage with alerts at 70% capacity. Network bandwidth (typically 1Gbps) often limits throughput before CPU or disk. Our standard three-broker clusters handle 50,000 messages/second, 100GB daily ingestion, and 7-day retention comfortably.
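The worked example above is straightforward to wrap in a reusable helper (using 1 KB = 1,000 bytes, matching the arithmetic in the answer):

```python
def storage_gb(msgs_per_day, avg_bytes, retention_days, replication,
               overhead=0.30):
    # Raw volume x retention x replication, plus ~30% headroom for
    # indexes and compaction, expressed in decimal gigabytes.
    raw_bytes = msgs_per_day * avg_bytes * retention_days * replication
    return raw_bytes * (1 + overhead) / 1e9

# The example from the text: 10M msgs/day x 2KB x 7 days x 3 replicas
gb = storage_gb(10_000_000, 2_000, 7, 3)   # ~546 GB, i.e. the ~550GB above
```

Throughput sizing is a separate calculation (MB/sec against per-broker capacity), so a cluster should satisfy both constraints, not just storage.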
What disaster recovery strategies work best for Kafka deployments?
MirrorMaker 2.0 replicates topics across geographically separated clusters for disaster recovery, with active-passive configurations maintaining backup clusters and active-active for multi-region deployments. We configure heartbeat intervals detecting primary cluster failures within 30 seconds and automated DNS failover redirecting producers to backup clusters. ZooKeeper/KRaft metadata requires separate backup strategies—periodic snapshots stored in S3 with documented restoration procedures. Recovery Time Objectives (RTO) under 15 minutes require automated failover, pre-warmed backup clusters, and tested runbooks. Recovery Point Objectives (RPO) depend on replication lag—MirrorMaker typically maintains <1 second lag. Our DR implementations include quarterly failover tests, documented application reconfiguration procedures, and monitoring replication lag with alerts at >10 seconds.
When should we choose managed Kafka services versus self-hosted clusters?
Managed services like Confluent Cloud or Amazon MSK eliminate operational overhead (patching, monitoring, scaling) at 30-50% higher cost than self-hosted infrastructure. Small deployments (<500GB data, <100,000 messages/sec) benefit from managed services' simplified operations and built-in monitoring. Self-hosted clusters provide cost savings at scale, customization for specific performance requirements, and compliance for data sovereignty requirements. Hybrid approaches use managed services for development/staging and self-hosted for production when cost savings justify DevOps investment. Our decision matrix considers monthly message volume, data retention, team operational expertise, and compliance requirements. Typical break-even occurs at 3TB storage or when dedicated DevOps resources are already available. We've implemented both approaches based on client needs documented in our [case studies](/case-studies).

Explore More

Custom Software Development · Systems Integration · Database Services · Java · Python · JavaScript

Need Senior Apache Kafka Talent?

Whether you need to build from scratch or rescue a failing project, we can help.