When you create a source connector in Kafka Connect, its producer uses the default batch.size of 16384 bytes (16 KB) unless you override it. This can have a massive impact on the ingestion time of larger files. Let's use a file with 20,000,000+ lines as an example and retrieve some metrics:
Each record is around 493 bytes, so with the default batch.size of 16 KB this connector sends on average 30 records per request. Looking at the Records Send Total metric, we can see that it took 7 hours to send all the records to the topic.
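To see where the figure of roughly 30 records per request comes from, here is a minimal Python sketch of the arithmetic. It gives only an upper bound, since it ignores per-batch and per-record overhead; the function name is illustrative and not part of any Kafka API.

```python
DEFAULT_BATCH_SIZE = 16384  # producer default batch.size, in bytes
AVG_RECORD_SIZE = 493       # average record size from the example, in bytes


def max_records_per_batch(batch_size_bytes, avg_record_bytes):
    """Upper bound on how many average-sized records fit in one batch."""
    return batch_size_bytes // avg_record_bytes


print(max_records_per_batch(DEFAULT_BATCH_SIZE, AVG_RECORD_SIZE))  # 33
```

The observed average of about 30 records per request sits just under this bound, which is consistent with batches filling up to the 16 KB limit.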
As of Apache Kafka 2.3 (available as part of Confluent Platform 5.3), you can override producer properties per connector. Note that this is enabled by default on Confluent Platform 7.0+.
To do this you first need to allow it in the worker config:
connector.client.config.override.policy=All
If you're using Docker, the configuration is set through the environment variable CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY. For example, in Docker Compose it would look like this:
CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY: 'All'
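The environment variable name follows from the worker property name. The sketch below assumes the naming convention used by the Confluent Connect Docker images (prefix with CONNECT_, upper-case, dots become underscores) and deliberately handles only simple dotted property names like the one above; the helper name is hypothetical.

```python
def connect_env_var(property_name):
    """Map a dotted worker property name to its Docker environment variable.

    Assumption: only the "dot becomes underscore" rule is applied here, so
    property names containing dashes or underscores are not covered.
    """
    return "CONNECT_" + property_name.upper().replace(".", "_")


print(connect_env_var("connector.client.config.override.policy"))
# CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY
```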
According to the article "Kafka Connect: How to Increase Throughput on Source Connectors", the calculation for the batch.size is:
batch.size = number_of_records * record_size_average_in_bytes
Considering we would like to include 1500 records in each batch:
batch.size = 1500 * 493 bytes
batch.size = 739500 bytes
Remember that batch.size cannot be higher than the max.message.bytes configured on the topic; otherwise, you will get a "MESSAGE_TOO_LARGE" error.
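The calculation and the max.message.bytes check can be combined in a small helper. This is a sketch only: the function name is hypothetical, and the limit of 1048588 bytes passed in the usage line is an assumed value, so check the actual max.message.bytes on your own topic.

```python
def batch_size_for(records_per_batch, avg_record_bytes, topic_max_message_bytes):
    """Return records * average size, refusing values above the topic limit."""
    batch_size = records_per_batch * avg_record_bytes
    if batch_size > topic_max_message_bytes:
        raise ValueError(
            f"batch.size {batch_size} exceeds max.message.bytes "
            f"{topic_max_message_bytes}"
        )
    return batch_size


# 1048588 is an assumed topic limit; replace it with your topic's setting.
print(batch_size_for(1500, 493, 1048588))  # 739500
```

Failing early like this is one way to catch an oversized value before the connector is deployed.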
Once that’s calculated, you can change the producer properties that you want in each connector’s configuration individually. For example:
"producer.override.batch.size": 739500,
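For context, the sketch below shows where that property sits in a complete connector configuration. The connector class, file path, and topic name are placeholders chosen for illustration and are not taken from the original setup; the helper function is hypothetical.

```python
import json


def with_batch_size_override(connector_config, batch_size):
    """Return a copy of the config with the producer batch.size override set."""
    updated = dict(connector_config)
    updated["producer.override.batch.size"] = batch_size
    return updated


# Placeholder connector settings, assumed for illustration only.
base_config = {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "file": "/tmp/input.txt",
    "topic": "example-topic",
}

config = with_batch_size_override(base_config, 739500)
print(json.dumps(config, indent=2))
```

The resulting JSON could then be sent to the Kafka Connect REST API, for example with a PUT to /connectors/<name>/config on the worker.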
Ingesting another file and looking at the same metrics, we can see the improvements:
Conclusion
In this example, we went from 7 hours to 30 minutes (roughly 14 times faster) to ingest more than 30,000,000 records into a Kafka topic. There are other metrics and configurations you can try to improve your source connector.
See the article "Kafka Connect: How to Increase Throughput on Source Connectors" for more details.