Which Spring Batch Partition is best suited - spring-boot

I have a database with around 400k records, and I need to migrate them to another database using Spring Batch.
A single-threaded step may not perform well enough, so I thought of using the scalability options Spring Batch provides.
After reading several posts and the documentation, I learned that these are the ways to scale a batch job:
Multithreaded step: not good if you need retry functionality
AsyncItemProcessor/AsyncItemWriter: unsuitable for my use case, as the reader also needs to work in parallel
Partitioning: I am thinking of using local partitioning, as remote partitioning needs inbound/outbound channels
Remote chunking: I do not want to use it because of the extra complexity
Please suggest the best approach for my use case.
I am thinking of using local partitioning. However, the id column is a varchar, and I do not understand how to partition on it; the Spring Batch sample shows a ColumnRangePartitioner where the column is a numeric id.
Does gridSize represent the number of slave threads that will be spawned? If yes, I want to make it dynamic using Runtime.getRuntime().availableProcessors()+1. Is that the right approach for an I/O job?

Does gridSize represent the number of slave threads that will be spawned?
Not necessarily. The grid size is the number of partitions that the partitioner will create. Note that this is only a hint to the partitioner; some partitioners, such as MultiResourcePartitioner, do not use it.
This is different from the number of workers. You can have more partitions than workers and vice versa.
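To make the difference concrete, here is a minimal sketch in plain Java (no Spring Batch classes; `GridVsWorkers` and `run` are hypothetical names used only for illustration). It submits `gridSize` partitions to a pool of `workers` threads, so the number of threads doing the work is bounded by the pool, not by the grid size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class GridVsWorkers {
    /** Runs gridSize partitions on a pool of `workers` threads and returns the thread names used. */
    static Set<String> run(int gridSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        Set<String> threadsUsed = ConcurrentHashMap.newKeySet();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int partition = 0; partition < gridSize; partition++) {
                // Stand-in for "execute the worker step for this partition".
                futures.add(pool.submit(() -> {
                    threadsUsed.add(Thread.currentThread().getName());
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait until every partition has been processed
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return threadsUsed;
    }
}
```

With `run(8, 3)`, eight partitions are processed by at most three threads. In a locally partitioned Spring Batch step, the equivalent knob would be the pool size of the TaskExecutor you configure, while gridSize only controls how the data is split.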
If yes, I want to make it dynamic using Runtime.getRuntime().availableProcessors()+1.
You can use Runtime.getRuntime().availableProcessors() to dynamically spawn as many workers as there are available cores (although I don't see the value of adding 1 to that, unless I'm missing something).
Is that the right approach for an I/O job?
It depends on how you are processing each record. But I think this is a good start since each worker will handle a different partition and all workers can be executed in parallel.
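On the varchar id part of the question: one possible approach (an assumption on my side, not something from the Spring Batch samples) is to fetch the ids already sorted by the database, for example with `SELECT id FROM source ORDER BY id`, and cut that list into contiguous ranges. The sketch below is plain Java; `VarcharRangeSketch` and the partition names are hypothetical. It assumes the ids are unique and that 400k ids fit in memory.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VarcharRangeSketch {
    /**
     * Splits ids that are ALREADY sorted by the database into at most gridSize
     * contiguous ranges. Each value is {minId, maxId}, both inclusive.
     */
    static Map<String, String[]> partition(List<String> sortedIds, int gridSize) {
        Map<String, String[]> ranges = new LinkedHashMap<>();
        int total = sortedIds.size();
        if (total == 0 || gridSize <= 0) {
            return ranges;
        }
        int partitions = Math.min(gridSize, total);
        int base = total / partitions;      // minimum ids per partition
        int remainder = total % partitions; // the first `remainder` partitions get one extra id
        int start = 0;
        for (int i = 0; i < partitions; i++) {
            int size = base + (i < remainder ? 1 : 0);
            int end = start + size - 1;
            ranges.put("partition" + i, new String[] { sortedIds.get(start), sortedIds.get(end) });
            start = end + 1;
        }
        return ranges;
    }
}
```

In a real Partitioner you would put each minId/maxId pair into that partition's ExecutionContext and have the reader query `WHERE id >= :minId AND id <= :maxId`. Because both the ordering and the range comparison are done by the database, its collation stays consistent.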

Related

Putting cache entries to specific Ignite Server

I have an Ignite data grid of five servers (say A, B, C, D and E). A partitioned cache is distributed across these five servers, with the number of backups set to 1.
I want to store 100 million entries in this partitioned cache. But, I want to control the partitioning of my cache entries to the Ignite servers.
Is it possible that I can direct my Ignite client to put a cache entry on a particular server (say E)?
The only way to do this is to implement your own affinity function instead of using the ones provided out of the box. However, I would encourage you to rethink this approach, because it is not scalable. The affinity functions included in Ignite are designed to provide an even distribution on any set of nodes, so you can dynamically scale up and down whenever you need to. Your approach is much less flexible.
I would also recommend going through the documentation page on affinity collocation. Very likely it will give you hints on how to implement your logic in a better way.
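To illustrate the idea behind affinity collocation, here is a deliberately simplified sketch in plain Java. It does not use the Ignite API, and a plain hash-modulo is not how Ignite's built-in affinity functions assign partitions; `AffinitySketch`, `OrderKey` and `partitionFor` are hypothetical names. The point is only that entries sharing the same affinity key end up in the same partition, and therefore on the same node.

```java
import java.util.Objects;

final class AffinitySketch {
    /** Hypothetical cache key: orders are collocated by customerId, not by orderId. */
    static final class OrderKey {
        final String orderId;
        final String customerId; // plays the role of the affinity key

        OrderKey(String orderId, String customerId) {
            this.orderId = orderId;
            this.customerId = customerId;
        }
    }

    /** Picks a partition from the affinity key only, so one customer's orders stay together. */
    static int partitionFor(OrderKey key, int partitions) {
        return Math.floorMod(Objects.hashCode(key.customerId), partitions);
    }
}
```

In Ignite itself you would declare the affinity key on your key class (for example with the @AffinityKeyMapped annotation) and let the built-in affinity function decide which node owns each partition.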
And finally, can you give some more details about your use case? I will be happy to give some advice on how to approach it.

What is the difference and how to choose between distributed queue and distributed computing platform?

There are many files that need to be processed by two computers in real time. I want to distribute them across the two computers, and the tasks need to be completed as soon as possible (that is what I mean by real-time processing). I am considering the following options:
(1) a distributed queue such as Gearman
(2) a distributed computing platform such as Hadoop/Spark/Storm/S4 and so on
I have two questions:
(1) What are the advantages and disadvantages of (1) compared with (2)?
(2) How do I choose within (2): Hadoop, Spark, Storm, S4, or something else?
Thanks!
Maybe I have not described the question clearly. In most cases there are 1000-3000 files in the same format. The files are independent, and their order does not matter. A single file is tens to hundreds of KB, and in the future both the number of files and the size of each file will grow. I have written a program that processes a file, extracts the data, and stores it in MongoDB. Right now there are only two computers. I just want a solution that can process these files with that program as quickly as possible and that is easy to extend and maintain.
A distributed queue is easy to use in my case but may be hard to extend and maintain. Hadoop/Spark is too "big" for two computers but easy to extend and maintain. Which is better? I am confused.
It depends a lot on the nature of your "processing". Some dimensions that apply here are:
Are records independent from each other, or do you need some form of aggregation? That is, do you need some pieces of data to go together, say, all transactions from a single user account?
Is your processing CPU bound? Memory bound? Filesystem bound?
What will be persisted? How will you persist it?
Whenever you see new data, do you need to recompute any of the old?
Can you discard data?
Is the data somewhat ordered?
What is the expected load?
A good solution will depend on answers to these (and possibly others I'm forgetting). For instance:
If computation is simple but storage and retrieval is the main concern, you should maybe look into a distributed DB rather than either of your choices.
It could be that you are best served by just logging things into a distributed filesystem like HDFS and then running batch computations with Spark (which should generally be better than plain Hadoop).
Maybe not, and you can use Spark Streaming to process as you receive the data.
If order and consistency are important, you might be better served by a publish/subscribe architecture, especially if your load could be more than what your two servers can handle, but there are peak and slow hours where your workers can catch up.
And so on. So the answer to "how do you choose?" is "by carefully looking at the constraints of your particular problem, estimating the load on your system, and picking the solution that best matches those". Each of these solutions and frameworks beats the others in some scenario, which is why they are all alive and kicking. The choice is all in the tradeoffs you are willing and able to make.
Hope it helps.
First of all, dannyhow is right: this is not what real-time processing is about. There is a great book, http://www.manning.com/marz/, which says a lot about the lambda architecture.
The two options you mentioned serve completely different purposes, and the difference comes down to the definition of the word "task". For example, Spark will take a whole job you give it and divide it into "tasks", but the outcome of a single task is useless to you; you still need to wait for the whole job to finish. You can create small jobs working on the same dataset and use Spark's caching to speed them up, but then you won't gain much from distribution (if they have to run one after another).
Are the files big? Are they connected to each other somehow? If yes, I'd go with Spark. If no, a distributed queue.
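For the "independent files" case, the queue model can be sketched in a single JVM (plain Java; `FileQueueSketch` and `processAll` are hypothetical names, and the worker threads merely stand in for the two computers that a distributed queue such as Gearman would feed). Each worker pulls the next file whenever it is free, so no coordination between files is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

final class FileQueueSketch {
    /** Processes every file exactly once; handler must not return null. */
    static <R> Map<String, R> processAll(List<String> files, int workers, Function<String, R> handler) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(files);
        Map<String, R> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                String file;
                // Each worker keeps taking the next pending file until the queue is empty.
                while ((file = queue.poll()) != null) {
                    results.put(file, handler.apply(file));
                }
            }, "worker-" + i);
            threads.add(worker);
            worker.start();
        }
        for (Thread worker : threads) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return results;
    }
}
```

Adding capacity in this model means adding workers, which is why a queue fits well as long as the files really are independent.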

Spring Batch: Which ItemReader implementation to use for high volume & low latency

Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
Which would be better performing (fast) in the above use case?
Would the selection be different in case of a single-process vs multi-process approach?
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?
To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):
Multithreaded step - This will process a chunk per thread until complete. This allows for parallelization in a very easy way (simply adding a TaskExecutor to your step definition). With this, you do lose restartability out of the box, because you will need to turn off state persistence on either of the ItemReaders you have mentioned (there are ways around this, such as flagging records in the database as having been processed).
Partitioning - This breaks up your input data into partitions that are processed by step instances in parallel (master/slave configuration). The partitions can be executed locally via threads (via a TaskExecutor) or remotely via remote partitioning. In either case, you gain restartability (each step processes its own data, so there is no stepping on state from partition to partition) along with parallelization.
I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU
To your specific questions:
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, driver options available as well as processing models you can support. Another consideration is, do you need restartability?
Which would be better performing (fast) in the above use case? - Again it depends on your processing model chosen.
Would the selection be different in case of a single-process vs multi-process approach? - This goes to how you manage jobs more so than what Spring Batch can handle. The question is, do you want to manage partitioning external to the job (passing in the data description to the job as parameters) or do you want the job to manage it (via partitioning).
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.
I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.
One final note. Most of this has talked about parallelizing the processing of this data. Spring Batch's FlatFileItemWriter is not thread safe. Your best bet would be to write to multiple files in parallel, then aggregate them afterwards if speed is your number one concern.
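The "write to multiple files, then aggregate" idea can be sketched in plain Java (no Spring Batch classes; `PartFileMerge`, the `part-N.csv` naming and `merged.csv` are assumptions made for illustration). Each partition writes its own file in parallel, and a single thread concatenates the parts afterwards, so no writer is ever shared between threads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartFileMerge {
    /** Writes one part file per partition in parallel, then merges them in partition order. */
    static Path writeAndMerge(List<List<String>> partitions, Path dir) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Path>> parts = new ArrayList<>();
            for (int i = 0; i < partitions.size(); i++) {
                final int index = i;
                // Each task owns its own file, so the writes cannot interfere.
                parts.add(pool.submit(() -> Files.write(
                        dir.resolve("part-" + index + ".csv"),
                        partitions.get(index),
                        StandardCharsets.UTF_8)));
            }
            Path merged = dir.resolve("merged.csv");
            Files.write(merged, new byte[0]); // create or truncate the target file
            for (Future<Path> part : parts) {
                List<String> lines = Files.readAllLines(part.get(), StandardCharsets.UTF_8);
                Files.write(merged, lines, StandardCharsets.UTF_8, StandardOpenOption.APPEND);
            }
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For 10 million rows you would stream the parts into the merged file rather than read each one fully into memory; the sketch keeps it simple.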
You should profile this in order to make a choice. In plain JDBC I would start with something that:
prepares statements with ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. Several JDBC drivers "simulate" cursors in client side unless you use those two, and for large result sets you don't want that as it will probably lead you to OutOfMemoryError because your JDBC driver is buffering the entire data set in memory. By using those options you increase the chance that you get server side cursors and get the results "streamed" to you bit by bit, which is what you want for large result sets. Note that some JDBC drivers always "simulate" cursors in client side, so this tip might be useless for your particular DBMS.
sets a reasonable fetch size to minimize the impact of network round trips. 50-100 is often a good starting value for profiling. As the fetch size is only a hint, this might also be useless for your particular DBMS.
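In plain JDBC, the two points above look roughly like this (a minimal sketch; `StreamingQuery` and `prepare` are hypothetical names, and whether the driver actually streams rows with these settings remains driver-specific):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class StreamingQuery {
    /** Prepares a forward-only, read-only statement and passes the fetch size hint to the driver. */
    static PreparedStatement prepare(Connection connection, String sql, int fetchSize) throws SQLException {
        PreparedStatement statement = connection.prepareStatement(
                sql,
                ResultSet.TYPE_FORWARD_ONLY,  // no scrolling, so the driver need not buffer all rows
                ResultSet.CONCUR_READ_ONLY);  // no updatable result set
        statement.setFetchSize(fetchSize);    // only a hint; the driver may ignore it
        return statement;
    }
}
```

You would then call `executeQuery()` on the returned statement and iterate the ResultSet once, front to back, starting with a fetch size somewhere in the 50-100 range and profiling from there.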
JdbcCursorItemReader seems to cover both things, but as said above, these settings are not guaranteed to give you the best performance on every DBMS. So I would start with that and then, if performance is inadequate, try JdbcPagingItemReader.
I don't think simple processing with JdbcCursorItemReader will be slow for your data set size unless you have very strict performance requirements. If you really need to parallelize, using JdbcPagingItemReader might be easier, but the interfaces of the two are very similar, so I would not count on it.
Anyway, profile.

When to use MultithreadedMapper

When should I use the MultithreadedMapper?
Will my job run faster with MultithreadedMapper if my application is pure computation (no latency-bound mappers)?
It depends, but I would avoid MultithreadedMapper as a first solution.
It is generally better to scale with a single-threaded Mapper by launching more mappers simultaneously, so that they can work on multiple inputs. The more cores you have, the higher you can set your mapred.tasktracker.map.tasks.maximum value. Of course, you will need beefier machines for this.
My understanding is that MultithreadedMapper is useful when you are I/O bound, for example when fetching pages from the web, which has more latency than local I/O. In that case MultithreadedMapper helps because you are not blocked on a single network I/O call and can continue processing as data becomes available.
But if you have large data in HDFS to process, it is fetched quickly because the data is local, and if the computation is CPU bound, a multi-core, multi-process solution is more helpful.
You will also have to ensure that your mappers are thread safe.
Check these articles 1 and 2 on when to use and when not to use multiple threads in a mapper. The recommendation is to increase the number of map slots on each node rather than use MultithreadedMapper.

Best practices for deploying a high performance Berkeley DB system

I am looking to use Berkeley DB to create a simple key-value storage system. The keys will be SHA-1 hashes, so they are in 160-bit address space. I have a simple server working, that was easy enough thanks to the fairly well written documentation from Berkeley DB website. However, I have some questions about how best to set up such a system, to get good performance and flexibility. Hopefully, someone has had more experience with Berkeley DB and can help me.
The simplest setup is a single process, with a single thread, handling a single DB; inserts and gets are performed on this one DB, using transactions.
Alternative 1: single process, multiple threads, single DB; inserts and gets are performed on this DB, by all the threads in the process.
Does using multiple threads provide much of a performance improvement? There is a single DB, so it is on one disk, and I am guessing I won't get much of a boost. But if Berkeley DB caches a lot in memory, then perhaps one thread can run and answer from the cache while another is blocked waiting for disk? I am using GNU Pth, which is user-level cooperative threading. I am not familiar with the details of Pth, so I am also not sure whether Pth lets one user-level thread run while another user-level thread is blocked.
Alternative 2: single process, one or multiple threads, multiple DBs where each DB covers a fraction of the 160-bit address space for keys.
I see a few advantages in having multiple DBs: we can put them on different disks, less contention, easier to move/partition DBs onto different physical hosts if we want to do that. Does anyone have experience with this setup and see significant benefits?
Alternative 3: multiple processes, each with one thread, each handles a DB that covers a fraction of the 160-bit address space for keys.
This has the advantages of using multiple DBs, but with multiple processes. Is this better than the second alternative? I suspect that using processes rather than user-level threads for parallelism will give better SMP caching behavior (fewer invalidations, etc.), but will I get killed by the process overhead and context switches?
I would love to hear from anyone who has tried these options and has seen positive or negative results.
Thanks.
Alternative 2 gives you high scalability. You basically partition your database across multiple servers. If you need a high performance distributed key/value database, I would suggest looking at membase. I am doing that right now but we need to run on an appliance and would like to limit dependencies (of membase).
You can use BerkeleyDB replication and have read only copies with servers to serve read/get requests.
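For Alternative 2, routing a SHA-1 key to the DB that owns its slice of the address space could look like the sketch below. This is plain Java for illustration only, not Berkeley DB API; `Sha1Shards` and `shardFor` are hypothetical names, and it assumes the key space is cut into equal contiguous slices using the top 16 bits of the hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Shards {
    /** Returns the 20-byte SHA-1 digest of the UTF-8 bytes of value. */
    static byte[] sha1(String value) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Maps a key to one of `shards` equal, contiguous slices of the key space,
     * looking only at the two most significant bytes (assumes 1 <= shards <= 65536).
     */
    static int shardFor(byte[] key, int shards) {
        int prefix = ((key[0] & 0xFF) << 8) | (key[1] & 0xFF); // 0..65535
        return (int) ((long) prefix * shards / 65536);
    }
}
```

Because SHA-1 output is spread roughly uniformly, equal slices should receive roughly equal numbers of keys, and moving one slice to another disk or host only requires changing where that shard index points.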

Resources