ETL in Java Spring Batch vs Apache Spark Benchmarking

Question

I have been working with Apache Spark + Scala for over 5 years now (academic and professional experience). I have always found Spark/Scala to be one of the most robust combinations for building any kind of batch or streaming ETL/ELT application.

But lately, my client decided to use Java Spring Batch for 2 of our major pipelines:

Read from MongoDB --> Business Logic --> Write to JSON File (~ 2GB | 600k Rows)
Read from Cassandra --> Business Logic --> Write to JSON File (~ 4GB | 2M Rows)

I was pretty baffled by this enterprise-level decision. I agree there are greater minds than mine in the industry, but I was unable to understand the need for this move.

My Questions here are:

Has anybody compared the performances between Apache Spark and Java Spring Batch?
What could be the advantages of using Spring Batch over Spark?
Is Spring Batch "truly distributed" when compared to Apache Spark? I came across methods like chunk(), partition etc. in the official docs, but I was not convinced that it is truly distributed. After all, Spring Batch runs on a single JVM instance, doesn't it?

I'm unable to wrap my head around these. So, I want to use this platform for an open discussion comparing Spring Batch and Apache Spark.

@thebluephantom: Is volume the only deciding factor? What could be the other reasons from a technological perspective? — underwood, Dec 9, 2018 at 19:25
Not sure really, maybe some bright spark can shed some light here. There are different ways to achieve the same goal. Having just read up on this, I see no real advantage over Spark. — thebluephantom, Dec 9, 2018 at 19:51
Out of curiosity, since this is from 2018, I assume your pipelines are now in production. What does your performance look like? — Philippe, Jul 23, 2020 at 2:32

Michael Minella · Accepted Answer · 2018-12-11 18:44:50Z

As the lead of the Spring Batch project, I’m sure you’ll understand I have a specific perspective. However, before beginning, I should call out that the frameworks we are talking about were designed for two very different use cases. Spring Batch was designed to handle traditional, enterprise batch processing on the JVM. It was designed to apply well-understood patterns that are commonplace in enterprise batch processing and make them convenient in a framework for the JVM. Spark, on the other hand, was designed for big data and machine learning use cases. Those use cases have different patterns, challenges, and goals than a traditional enterprise batch system, and that is reflected in the design of the framework. That being said, here are my answers to your specific questions.

Has anybody compared the performances between Apache Spark and Java Spring Batch?

No one can really answer this question for you. Performance benchmarks are a very specific thing. Use cases matter. Hardware matters. I encourage you to do your own benchmarks and performance profiling to determine what works best for your use cases in your deployment topologies.
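As a starting point for such a comparison, here is a minimal timing sketch in plain Java. The class and method names (TimingSketch, medianMillis) are hypothetical and belong to neither framework. It only measures wall-clock time of a task after a few warm-up runs; a serious comparison would also need realistic data volumes, the real source and sink systems, and a proper harness.

```java
import java.util.Arrays;

// Illustrative only: a naive wall-clock timer, not a rigorous benchmark harness.
class TimingSketch {

    /**
     * Runs the task `warmups` times untimed (to let the JIT settle), then
     * `runs` times timed, and returns the (upper) median duration in milliseconds.
     */
    static double medianMillis(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos[runs / 2] / 1_000_000.0;
    }
}
```

In practice the task would launch the whole pipeline (for example, one job run against a fixed copy of the input), and the same harness would be pointed at both implementations on the same hardware.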

What could be the advantages of using Spring Batch over Spark?

Programming model similar to other enterprise workloads
Enterprises need to be aware of the resources they have on hand when making architectural decisions. Is using new technology X worth the retraining or hiring overhead of technology Y? In the case of Spark vs Spring Batch, the ramp-up for an existing Spring developer on Spring Batch is minimal. I can take any developer who is comfortable with Spring and make them fully productive with Spring Batch very quickly. Spark has a steeper learning curve for the average enterprise developer, not only because of the overhead of learning the Spark framework but also because of all the related technologies needed to productionize a Spark job in that ecosystem (HDFS, Oozie, etc).

No dedicated infrastructure required
When running Spark in a distributed environment, you need to configure a cluster using YARN, Mesos, or Spark’s own clustering installation (there is an experimental Kubernetes option available at the time of this writing, but, as noted, it is labeled as experimental). This requires dedicated infrastructure for specific use cases. Spring Batch can be deployed on any infrastructure. You can execute it via Spring Boot with executable JAR files, you can deploy it into servlet containers or application servers, and you can run Spring Batch jobs via YARN or any cloud provider. Moreover, if you use Spring Boot’s executable JAR concept, there is nothing to set up in advance, even if running a distributed application on the same cloud-based infrastructure you run your other workloads on.

More out of the box readers/writers simplify job creation
The Spark ecosystem is focused around big data use cases. Because of that, the components it provides out of the box for reading and writing are focused on those use cases. Things like different serialization options for reading files commonly used in big data use cases are handled natively. However, things like processing chunks of records within a transaction are not.

Spring Batch, on the other hand, provides a complete suite of components for declarative input and output. Reading and writing flat files, XML files, from databases, from NoSQL stores, from messaging queues, writing emails...the list goes on. Spring Batch provides all of those out of the box.

Spark was built for big data...not all use cases are big data use cases
In short, Spark’s features are specific for the domain it was built for: big data and machine learning. Things like transaction management (or transactions at all) do not exist in Spark. The idea of rolling back when an error occurs doesn’t exist (to my knowledge) without custom code. More robust error handling use cases like skip/retry are not provided at the level of the framework. State management for things like restarting is much heavier in Spark than Spring Batch (persisting the entire RDD vs storing trivial state for specific components). All of these features are native features of Spring Batch.
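To make the chunk, skip, and restart ideas above concrete, here is a minimal plain-Java sketch of the pattern. It is an imitation written for illustration, not Spring Batch code: ChunkSketch, Result, run, and toUpper are hypothetical names, a list buffer stands in for a transaction, and an integer index stands in for the persisted restart state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative only: a plain-Java imitation of chunk-oriented processing.
// None of these names are Spring Batch APIs.
class ChunkSketch {

    /** Outcome of a run: committed output, skip count, and the restart offset. */
    static final class Result {
        final List<String> committed = new ArrayList<>();
        int skipped = 0;
        int nextIndex = 0; // trivial restart state: index of the first unprocessed item
    }

    static Result run(List<String> input, int startIndex, int chunkSize, int skipLimit,
                      Function<String, String> processor) {
        Result result = new Result();
        result.nextIndex = startIndex;
        while (result.nextIndex < input.size()) {
            int end = Math.min(result.nextIndex + chunkSize, input.size());
            List<String> buffer = new ArrayList<>(); // stands in for an open transaction
            for (String item : input.subList(result.nextIndex, end)) {
                try {
                    buffer.add(processor.apply(item));
                } catch (IllegalArgumentException bad) {
                    // Tolerate bad records up to the skip limit, then fail the run.
                    // The uncommitted buffer is discarded, which plays the role of a rollback.
                    if (++result.skipped > skipLimit) {
                        throw new IllegalStateException("skip limit exceeded at item: " + item, bad);
                    }
                }
            }
            result.committed.addAll(buffer); // "commit" the whole chunk at once
            result.nextIndex = end;          // advance the checkpoint only after the commit
        }
        return result;
    }

    /** Example processor (hypothetical business logic) that rejects empty records. */
    static String toUpper(String s) {
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty record");
        }
        return s.toUpperCase();
    }
}
```

Restarting amounts to calling run again with the last stored nextIndex as startIndex, which is why the state to persist is a single small value rather than the data set itself.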

Is Spring Batch “truly distributed”

One of the advantages of Spring Batch is the ability to evolve a batch process from a simple sequentially executed, single JVM process to a fully distributed, clustered solution with minimal changes. Spring Batch supports two main distributed modes:

Remote Partitioning - Here Spring Batch runs in a master/worker configuration. The master delegates work to workers based on the mechanism of orchestration (many options here). Full restartability, error handling, etc. is all available for this approach with minimal network overhead (transmission of metadata describing each partition only) to the remote JVMs. Spring Cloud Task also provides extensions to Spring Batch that allow for cloud native mechanisms for dynamically deploying the workers.
Remote Chunking - Remote chunking delegates only the processing and writing phases of a step to a remote JVM. Still using a master/worker configuration, the master is responsible for providing the data to the workers for processing and writing. In this topology, the data travels over the wire, causing a heavier network load. It is typically used only when the processing advantages can surpass the overhead of the added network traffic.
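
The difference between the two modes is mostly about what crosses the wire. The following plain-Java sketch illustrates that under loud assumptions: threads stand in for remote worker JVMs, readRow stands in for a data store that every worker can reach, and all names (DistributionSketch, Range, partition, sumByPartitioning, sumByChunking) are hypothetical rather than Spring Batch APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: threads stand in for remote worker JVMs.
class DistributionSketch {

    /** Partition metadata: the only thing the master sends in the partitioning style. */
    static final class Range {
        final int fromInclusive;
        final int toExclusive;
        Range(int fromInclusive, int toExclusive) {
            this.fromInclusive = fromInclusive;
            this.toExclusive = toExclusive;
        }
    }

    /** Master side of partitioning: describe the work without reading any data. */
    static List<Range> partition(int totalRows, int gridSize) {
        List<Range> ranges = new ArrayList<>();
        int size = (totalRows + gridSize - 1) / gridSize; // ceiling division
        for (int from = 0; from < totalRows; from += size) {
            ranges.add(new Range(from, Math.min(from + size, totalRows)));
        }
        return ranges;
    }

    /** Stand-in for a shared data store; here the row value is simply its id. */
    static int readRow(int id) {
        return id;
    }

    /** Partitioning: each worker receives only a Range and reads its own rows. */
    static long sumByPartitioning(int totalRows, int gridSize) {
        ExecutorService workers = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Range r : partition(totalRows, gridSize)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int id = r.fromInclusive; id < r.toExclusive; id++) {
                        sum += readRow(id);
                    }
                    return sum;
                };
                results.add(workers.submit(task));
            }
            return collect(results);
        } finally {
            workers.shutdown();
        }
    }

    /** Chunking: the master reads every row and ships the data itself to the workers. */
    static long sumByChunking(int totalRows, int chunkSize, int workerCount) {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int from = 0; from < totalRows; from += chunkSize) {
                List<Integer> chunk = new ArrayList<>();
                for (int id = from; id < Math.min(from + chunkSize, totalRows); id++) {
                    chunk.add(readRow(id)); // the read happens on the master
                }
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int value : chunk) {
                        sum += value;
                    }
                    return sum;
                };
                results.add(workers.submit(task));
            }
            return collect(results);
        } finally {
            workers.shutdown();
        }
    }

    private static long collect(List<Future<Long>> results) {
        long total = 0;
        for (Future<Long> f : results) {
            try {
                total += f.get();
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        return total;
    }
}
```

Both methods compute the same result. In the partitioning variant only a pair of integers travels to each worker, while in the chunking variant every row does, which mirrors the network trade-off described above.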

There are other Stack Overflow answers that discuss these features in further detail (as does the documentation):

Advantages of spring batch
Difference between spring batch remote chunking and remote partitioning
Spring Batch Documentation

It does not really strike me as a set of answers, but a set of perspectives. Interesting, but in relation to his/her question I see no real reason to switch. — thebluephantom, Dec 11, 2018 at 18:55
Apache Spark and Spring Batch are not comparable. A few products, especially Pivotal GemFire, give good connectivity with Spring Batch, whereas Apache Spark has no such connection. I am working on a few use cases and may try to compare performance. Spring Batch will be a pain when connecting to Hive to load data, because Hive MapReduce is very slow and kills performance, while with Spark you can read HDFS directly and it will be very fast. — vaquar khan, Aug 31, 2019 at 18:27
One of the major reasons, as mentioned in the answer, is finding developers who know, have used, or can quickly learn Spark. In my opinion there are lots of cases where Spark is not required, yet people love to migrate because it sounds cool. I handle 800 million+ rows, crunching data with a lot of transformations, using good old single-server Java. This amount of data is not a very common use case; I believe the logic is more important than the framework. — JustTry, Feb 6, 2022 at 0:14
Wondering if Spring Batch has considered Apache Beam with, say, a Spark runner? I’ve used Google Dataflow with Apache Beam and it was great for batch processing. — Tony Murphy, Jul 17, 2022 at 14:19
