Rafael Natali

Improving the performance of a Kafka Connect CSV source connector

Updated: Jan 22

When you create a source connector in Kafka Connect, its producer uses the default batch.size of 16384 bytes (16 KB) unless you configure it otherwise. This can have a massive impact on the ingestion time of larger files. Let's use a file with 20,000,000+ lines as an example and look at some metrics:






Each record is around 493 bytes, so at the default batch.size of 16 KB this connector can send roughly 30 records per request. Looking at the Records Send Total metric, we can see that it took 7 hours to send all the records to the topic.
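As a quick sanity check on that figure, here is a minimal Python sketch of the arithmetic. It assumes every record is exactly the 493-byte average and ignores batch headers and compression, so it gives an upper bound rather than the exact number the producer will achieve. The function name is made up for illustration.

```python
# Figures taken from the text above.
DEFAULT_BATCH_SIZE_BYTES = 16384  # producer default batch.size
AVG_RECORD_SIZE_BYTES = 493       # average record size for this file


def records_per_batch(batch_size_bytes: int, record_size_bytes: int) -> int:
    """Upper bound on whole records per batch (ignores batch overhead and compression)."""
    return batch_size_bytes // record_size_bytes


print(records_per_batch(DEFAULT_BATCH_SIZE_BYTES, AVG_RECORD_SIZE_BYTES))  # prints 33
```

The upper bound of 33 is in line with the roughly 30 records per request seen in the metrics.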


As of Apache Kafka 2.3 (available as part of Confluent Platform 5.3), you can override producer properties per connector. Note that this is enabled by default on Confluent Platform 7.0+.


To do this you first need to allow it in the worker config:

connector.client.config.override.policy=All

If you're using Docker, the configuration is set through the environment variable CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY. For example, in Docker Compose it would look like this:

CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY: 'All'

According to the article Kafka Connect: How to Increase Throughput on Source Connectors, batch.size is calculated as:

batch.size = number_of_records * record_size_average_in_bytes

Considering we would like to include 1500 records in each batch:

batch.size = 1500 * 493 bytes
batch.size = 739500 bytes

Remember that batch.size cannot be higher than the max.message.bytes configured on the topic. Otherwise, you will get a “MESSAGE_TOO_LARGE” error.
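The calculation and the limit check can be sketched in a few lines of Python. The function names are hypothetical, and the limit used here is an assumption: it is the default max.message.bytes in recent Kafka versions (1048588 bytes), so replace it with the value actually configured on your topic.

```python
def batch_size_for(records_per_batch: int, avg_record_bytes: int) -> int:
    """batch.size = number_of_records * record_size_average_in_bytes"""
    return records_per_batch * avg_record_bytes


def fits_topic_limit(batch_size_bytes: int, max_message_bytes: int) -> bool:
    """True if the batch size does not exceed the topic's max.message.bytes."""
    return batch_size_bytes <= max_message_bytes


# Assumption: the topic keeps Kafka's default max.message.bytes.
ASSUMED_MAX_MESSAGE_BYTES = 1048588

batch_size = batch_size_for(1500, 493)
print(batch_size)                                               # prints 739500
print(fits_topic_limit(batch_size, ASSUMED_MAX_MESSAGE_BYTES))  # prints True
```

If the check returns False, either reduce the number of records per batch or raise max.message.bytes on the topic.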


Once that’s calculated, you can change the producer properties that you want in each connector’s configuration individually. For example:

"producer.override.batch.size": 739500,
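One way to apply the override to an existing connector is through the Kafka Connect REST API, using PUT /connectors/{name}/config. The Python sketch below only builds the request and does not send it. The worker URL, connector name, connector class, and topic are hypothetical placeholders, not values from this setup.

```python
import json
import urllib.request


def with_batch_size_override(connector_config: dict, batch_size_bytes: int) -> dict:
    """Return a copy of the connector config with the producer batch.size override added."""
    updated = dict(connector_config)
    updated["producer.override.batch.size"] = batch_size_bytes
    return updated


def build_update_request(connect_url: str, connector_name: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /connectors/{name}/config request."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors/{connector_name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values for illustration only.
existing_config = {
    "connector.class": "com.example.CsvSourceConnector",
    "topic": "csv-records",
}
new_config = with_batch_size_override(existing_config, 739500)
request = build_update_request("http://localhost:8083", "csv-source", new_config)

# Uncomment to apply the change against a running Connect worker:
# urllib.request.urlopen(request)
```

Because the function returns a copy, the original configuration stays untouched and you can compare the two before applying the change.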

Ingesting another file and looking at the same metrics, we can see the improvements:



Conclusion


In this example, we went from 7 hours to 30 minutes to ingest more than 30,000,000 records into a Kafka topic. There are other metrics and configurations you can try to improve your source connector.

