19

Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).

  1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?

  2. Which would be better performing (fast) in the above use case?

  3. Would the selection be different in case of a single-process vs multi-process approach?

  4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?

2
  • 1
    partition to create parts of csv + extra step to concat csv parts into single csv? Dec 4, 2013 at 22:05
  • @bellabax Please consider a single-process approach [for now]. Updated question to ask if it matters for single-process/mutli-process.
    – ram
    Dec 4, 2013 at 22:20

2 Answers 2

12

To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):

  1. Multithreaded step - This will process a chunk per thread until complete. This allows for parallelization in a very easy way (simply adding a TaskExecutor to your step definition). With this, you do loose restartability out of the box because you will need to turn off state persistence on either of the ItemReaders you have mentioned (there are ways around this with flagging records in the database as having been processed, etc).
  2. Partitioning - This breaks up your input data into partitions that are processed by step instances in parallel (master/slave configuration). The partitions can be executed locally via threads (via a TaskExecutor) or remotely via remote partitioning. In either case, you gain restartability (each step processes it's own data so there is no stepping on state from partition to partition) with parallization.

I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU

To your specific questions:

  1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, driver options available as well as processing models you can support. Another consideration is, do you need restartability?
  2. Which would be better performing (fast) in the above use case? - Again it depends on your processing model chosen.
  3. Would the selection be different in case of a single-process vs multi-process approach? - This goes to how you manage jobs more so than what Spring Batch can handle. The question is, do you want to manage partitioning external to the job (passing in the data description to the job as parameters) or do you want the job to manage it (via partitioning).
  4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.

I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.

One final note. Most of this has talked about parallelizing the processing of this data. Spring Batch's FlatFileItemWriter is not thread safe. Your best bet would be to write to multiple files in parallel, then aggregate them afterwards if speed is your number one concern.

2
  • You did mention about a Multithreaded step before Partitioning. However there is a possibility of ResultSetExhaustedException in case of a JdbcCursorItemReader in a multithreaded step. How could it be tackled?
    – ram
    Dec 31, 2013 at 15:56
  • JdbcCursorItemReader isn't considered thread safe. Use the JdbcPagingItemReader in a multithreaded step. You can read the details as to why here: jira.springsource.org/browse/… Jan 2, 2014 at 21:06
10
+25

You should profile this in order to make a choice. In plain JDBC I would start with something that:

  • prepares statements with ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. Several JDBC drivers "simulate" cursors in client side unless you use those two, and for large result sets you don't want that as it will probably lead you to OutOfMemoryError because your JDBC driver is buffering the entire data set in memory. By using those options you increase the chance that you get server side cursors and get the results "streamed" to you bit by bit, which is what you want for large result sets. Note that some JDBC drivers always "simulate" cursors in client side, so this tip might be useless for your particular DBMS.
  • set a reasonable fetch size to minimize the impact of network roundtrips. 50-100 is often a good starting value for profiling. As fetch size is hint, this might also be useless for your particular DBMS.

JdbcCursorItemReader seems to cover both things, but as it is said before they are not guaranteed to give you best performance in all DBMS, so I would start with that and then, if performance is inadequate, try JdbcPagingItemReader.

I don't think doing simple processing with JdbcCursorItemReader will be slow for your data set size unless you have very strict performance requirements. If you really need to parallelize using JdbcPagingItemReader might be easier, but the interface of those two is very similar, so I would not count on it.

Anyway, profile.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.