Optimal Batch Size PostgreSQL Update - ruby

I am using Postgres and I have a ruby task that updates the contents of an entire table at an hourly rate. Currently this is achieved by updating the table in batches. However, I am not exactly sure what the formula is for finding an optimal batch size. Is there a formula or standard for determining an appropriate batch size?

In my opinion there is no theoretical optimal batch size. The optimal batch size will depend on your application model, the internal structure of the accessed tables, the query structure and so on. The only reliable way I see to determine it is benchmarking.
There are some optimization tips that can help you build a faster application, but these tips cannot be followed blindly, because many of them have corner cases where they cannot be applied successfully. Again, the way to determine whether a change (adding an index, changing the batch size, enabling the query cache...) improves performance is to benchmark before and after every single change.
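As a rough illustration of the benchmarking approach, here is a minimal Ruby sketch that times one full pass over the table for several candidate batch sizes. The method name `time_batch_sizes` and the block contract (it receives an offset and a limit) are assumptions made for this example; in a real task the block would run your batched UPDATE against Postgres, while here it only counts rows.

```ruby
require "benchmark"

# Times one full pass over `total_rows` rows for each candidate batch size.
# The caller's block receives (offset, limit) and is expected to do the
# real batched work; the harness itself knows nothing about the database.
def time_batch_sizes(total_rows, batch_sizes)
  batch_sizes.each_with_object({}) do |size, timings|
    timings[size] = Benchmark.realtime do
      (0...total_rows).step(size) do |offset|
        limit = [size, total_rows - offset].min
        yield offset, limit
      end
    end
  end
end

# Hypothetical stand-in for the database work: just count the rows.
rows_touched = 0
timings = time_batch_sizes(10_000, [100, 1_000, 5_000]) do |_offset, limit|
  rows_touched += limit
end

best_size = timings.min_by { |_size, seconds| seconds }.first
puts "fastest batch size in this run: #{best_size}"
```

A single run is noisy, so it is probably worth repeating the measurement several times, against realistic data volumes, before settling on a size.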

Related

loading method for the stage in business intelligence

Good evening,
A question: when data is passed from the source to the staging area in business intelligence, the loading method is either total (full) or total + incremental.
I'm thinking of deleting all the data and reloading it, but with a very large database and many records that would not be optimal. What do good practices suggest?
I would appreciate your opinions,
thank you very much,
I'm thinking of deleting all the data and reloading it, but with
a very large database and many records that would not be optimal. What do
good practices suggest?
It depends.
Many companies are more comfortable with the delete/truncate-and-reload pattern because it is easy to implement, and the amount of data isn't a problem as long as certain conditions are met (hardware, DBA...).
Incremental Loads (or Up-Sert pattern) are often used to keep data between two systems in sync with one another. They are used in cases when source data is being loaded into the destination on a repeating basis, such as every night or throughout the day.
Benefits of Incremental Data Loads :
They typically run considerably faster since they touch less data. Assuming no bottlenecks, the time to move and transform data is proportional to the amount of data being touched. If you touch half as much data, the run time is often reduced at a similar scale.
Disadvantages of Incremental Data Loads :
Maintainability: With a full load, if there's an error you can re-run the entire load without having to do much else in the way of cleanup / preparation. With an incremental load, the files generally need to be loaded in order. So if you have a problem with one batch, others queue up behind it till you correct it.
TRUNCATING and then INSERTING is two operations whereas UPDATEing is one, making the TRUNCATE and INSERT take (theoretically) more time.
There's also the ease-of-use factor. If you TRUNCATE then INSERT, you have to manually keep track of every column value. If you UPDATE, you just need to know what you want to change.
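To make the difference between the two patterns concrete, here is a small in-memory Ruby sketch. It is not a real ETL job: the target table is modelled as a Hash keyed by `id`, and the `updated_at` column used as a watermark is an assumption for this example (your source may expose change tracking differently).

```ruby
# Full load: discard the target and copy everything from the source.
def full_load(source_rows)
  source_rows.to_h { |row| [row[:id], row] }
end

# Incremental load: upsert only rows changed since the last watermark.
# Returns the new watermark and the number of rows touched.
def incremental_load(target, source_rows, watermark)
  changed = source_rows.select { |row| row[:updated_at] > watermark }
  changed.each { |row| target[row[:id]] = row }
  new_watermark = changed.map { |row| row[:updated_at] }.max || watermark
  [new_watermark, changed.size]
end

source = [
  { id: 1, name: "a", updated_at: 10 },
  { id: 2, name: "b", updated_at: 20 }
]
target = full_load(source)
watermark = 20

# Later, one row changes and one row is added in the source.
source[0] = { id: 1, name: "a2", updated_at: 30 }
source << { id: 3, name: "c", updated_at: 31 }

watermark, touched = incremental_load(target, source, watermark)
puts "touched #{touched} of #{source.size} rows, new watermark #{watermark}"
```

Note that this sketch does not detect rows deleted from the source, which is one of the things a full reload gives you for free.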

What considerations should I take into account when increasing the size in the Scroll API in Elasticsearch?

I am currently toying around with the Scroll API of Elasticsearch, and want to use it to obtain a large set of data and do some manual processing on it. The processing is performed by an external library and is not of the type that can easily be included as a script.
While this seems to work nicely at the moment, I was wondering what considerations I should take into account when fine-tuning the scroll size for this form of processing. A quick observation seems to indicate that increasing the scroll size will reduce the latency of the operation. While I suspect that larger scroll sizes will generally reduce throughput, I have no idea whether this hypothesis is correct. Also, I have no idea if there are any other consequences that I do not foresee right now.
So to summarize, my question is: what impact does changing Elasticsearch's scroll size have, especially on performance, in a scenario where the results are processed for each batch that is obtained?
Thanks in advance!
The one consideration (and the only one I know of) is being able to process each batch fast enough that the scroll context is not released (its lifetime is controlled by the ?scroll=X parameter).
Assuming that you will consume all the data from the query, the scroll size should be tuned based on network and third-party app performance, i.e.:
if your app can process data in a stream-like manner, bigger chunks are better
if your app processes data in batches (waiting for the full ES response first), the upper limit for the batch size should guarantee processing time < scroll release time
if you work in a poor network environment, a smaller batch size is better, to limit the overhead of dropped connections/retries
generally, a bigger batch is better, as it eliminates some network/ES CPU overhead
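The "processing time < scroll release time" rule can be turned into a back-of-the-envelope upper bound. The Ruby sketch below assumes you have measured the average processing time per document yourself; the method name, the parameter names and the `safety_factor` margin are all assumptions for this example, not part of Elasticsearch.

```ruby
# Largest batch size whose processing should finish before the scroll
# context is released. Inputs are your own measurements and settings:
#   keep_alive_ms   - the scroll keep-alive (the ?scroll=X value) in ms
#   ms_per_document - measured average processing time per document
#   safety_factor   - fraction of the keep-alive you are willing to use
def max_scroll_batch_size(keep_alive_ms:, ms_per_document:, safety_factor: 0.5)
  raise ArgumentError, "ms_per_document must be positive" unless ms_per_document.positive?

  budget_ms = keep_alive_ms * safety_factor
  (budget_ms / ms_per_document).floor
end

# Example: a 1 minute keep-alive and 2 ms of processing per document.
puts max_scroll_batch_size(keep_alive_ms: 60_000, ms_per_document: 2)
# prints 15000
```

This only gives a ceiling. Where in the range below it the best throughput sits still depends on the network and on the processing library, so it has to be measured.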

Hazelcast: What would be the implications of adding indexes to huge existing IMaps?

Given 4-5 nodes having many IMaps with lots of data in them, some of the predicate queries have become significantly slow. One solution to this performance issue (I think) could be adding indexes. However, this data is part of a sensitive system which is currently being used in production.
Before adding indexes, I was wondering what the consequences of doing it on huge IMaps would be (would it lock the entire map? would it bring down the entire system? etc.). The Hazelcast documentation includes information about how to do it, but doesn't give any other explanation.
If you want to add the index at runtime, this is what will happen:
the AddIndexOperation will be executed on every partition
during the execution of the AddIndexOperation the partition will be blocked until all partition data are iterated and added to the index.
Queries won't be blocked in this timeframe - but get/put operations will.
I would recommend doing it in the "maintenance window" where you have the smallest load.
"Lots of data" is relative - just run a test in your dev environment with exactly the same amount of data to see how long it takes to add an index.

Hive find expected run time of query

I want to find the expected run time of a query in Hive. Using EXPLAIN gives the execution plan. Is there a way to find the expected time?
I need the Hive equivalent of the SQL query EXPLAIN COSTS.
There is no OOTB feature at this moment that facilitates this. One way to achieve this would be to learn from history. Gather patterns based on similar data and queries you have run previously and try to deduce some insights. You might find tools like Starfish helpful in the process.
I would not recommend deciding anything based on a subset of your data, as running queries on a small dataset and on the actual dataset are very different things. A subset is good for testing functionality, but not for any kind of cost approximation. The reason is that a lot of factors are involved in the process, like system resources (disk, CPU slots, network etc.), system configuration, other running jobs and so on. You might see smooth operation on a small dataset, but as the data size increases all these factors start playing a much more important role. Even a small configuration parameter may matter. (You might have noticed that a Hive query sometimes runs fast initially but gradually gets slower.) Also, execution of a Hive query is much more involved than a simple MR job.
See this JIRA, to get some idea, where they are talking about developing a Cost Based Query optimization for Joins in Hive. You might also find this helpful.
I think that is not possible, because internally MapReduce jobs get executed for any particular Hive query. Moreover, a MapReduce job's execution time depends on the cluster load and its configuration, so it is tough to predict the execution time. Maybe one thing you can do is start a timer before running the query, and once it finishes you can calculate the exact time that was needed for execution.
Maybe you could sample a small % of records from your table using partitions, bucketing features etc., then run the query against the small dataset. Note the execution time and then multiply it by the factor (total_size/sample_size).
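Both suggestions (time the query, then scale by total_size/sample_size) can be sketched in a few lines of Ruby. The helper names are made up for this example, and the `sleep` call is only a placeholder for actually submitting the Hive query on the sample. The scaling is a plain linear extrapolation, so, as the other answer warns, treat the result as a rough guess rather than a cost estimate.

```ruby
# Wall-clock timing of any block, using a monotonic clock.
def measure_seconds
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Linear extrapolation: sample_seconds * (total_size / sample_size).
def extrapolate_runtime(sample_seconds, sample_size:, total_size:)
  raise ArgumentError, "sample_size must be positive" unless sample_size.positive?

  sample_seconds * (total_size.to_f / sample_size)
end

# Placeholder for running the query against the sampled table.
elapsed = measure_seconds { sleep 0.01 }
puts "sample run took #{elapsed.round(3)} s"

# Example: the sample of 1M rows took 12 s; the full table has 50M rows.
estimate = extrapolate_runtime(12.0, sample_size: 1_000_000, total_size: 50_000_000)
puts "estimated full run: #{estimate} s"
# prints "estimated full run: 600.0 s"
```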

Cost of a query in/dependent of amount of data

Could you please tell me whether the cost of a query depends on the amount of data available in the database at that time?
That is, does the cost vary with the amount of data?
Thanks,
Savitha
The answer is yes: the data size will influence the query execution plan. That is why you must test your queries with real amounts of data (and, if possible, realistic data, as the distribution of the data is also important and will influence the query cost).
Every database management system is different in some respects, and what works well for Oracle, MS SQL or PostgreSQL may not work well for MySQL, and the other way around. Even storage engines have very important differences which can affect performance dramatically.
Of course, a large amount of data will slow down the process. When you fire a query, it needs to traverse and search the database, and with more data that takes more time. The three main issues you should be concerned with if you're dealing with very large data sets are buffers, indexes and joins.

Resources