NiFi ExecuteSQL best configuration for larger tables - apache-nifi

We are trying to use ExecuteSQL on an Oracle IOT table which has around 50M+ rows.
Here is my configuration:
I have tried various combinations of Max Rows Per Flow File, Output Batch Size, and Fetch Size to make the extraction faster.
With the above (or any other) configuration, extraction is very fast for the first few minutes and then slows down.
Are there any best practices for speeding up the extraction of such large tables?

Related

Maximum size of database for DQ mode in Power BI

I am using a database of about 500 GB. I want to visualize different columns to study the relationships between them using Power BI. However, there are performance issues while loading graphs.
I am using DirectQuery (DQ) mode.
It's annoying to wait 10 minutes for each visual to load.
Could anyone tell me if it's a good idea to use Power BI for visualisation/making dashboards out of 500 GB of data?
What is the maximum database size we can use in DQ mode to create visuals efficiently?
DQ doesn't have a defined limit; MS have shown demos using a petabyte database. For long-running queries on a database, you have a few options:
Understand what queries are being run and optimise your indexing strategy, for example by adding a covering index (see the sketch below).
Optimise your data source, e.g. by using a columnstore index to move it into memory.
Create a database or table(s) with the necessary subset of data from your main data.
Examine what objects are being used, and remove nested logic, views on top of views, scalar conditions, etc.
The petabyte example from MS also used aggregation mode (mentioned by WB in their answer) to store a subset of the data.
I have used DirectQuery over data sources in the 200 GB range, but those were mostly standard star-schema data warehouses or a defined reporting table, both of which had the relevant indexes, covering indexes, or columnstore indexes to allow more efficient retrieval of data. DirectQuery mode will slow down due to the number of queries it has to run against the data source, based on the measures, relationships, and connection overhead. Another factor can be the number of visuals on a page, as each visual is a query and each one has to run against the data source.
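As a rough illustration of the indexing suggestions above (assuming a SQL Server source; the table and column names are hypothetical, not from the original post), a covering index serves the report query without touching the base table, while a columnstore index stores the data column-wise and compressed for analytical scans:
-- Hypothetical fact table dbo.FactSales with date, region and measure columns.
CREATE NONCLUSTERED INDEX ix_factsales_date_region
ON dbo.FactSales (SaleDate, RegionId)
INCLUDE (Quantity, Amount);   -- covers the measures the visuals read
CREATE NONCLUSTERED COLUMNSTORE INDEX cs_factsales
ON dbo.FactSales (SaleDate, RegionId, Quantity, Amount);
Which of the two pays off depends on whether the report queries are narrow lookups (covering index) or large aggregating scans (columnstore).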
You might want to look at aggregations in Power BI. You can basically import aggregate tables into Power BI that satisfy the needs of most of your visuals, and resort to DirectQuery for the details you might rarely need. When properly configured, aggregations will be cached, and visuals that hit the aggregation will make use of it, while those that don't will seamlessly query the DQ source.
Also, the VertiPaq engine with its columnar store is quite efficient at compressing data. So given some smart modelling (get rid of unneeded high-cardinality columns), you might actually end up with a much smaller model than your original data, even importing everything.
Your mileage may vary.
As for the dataset size limit itself, I believe it's 1 GB per dataset when uploading to the service.
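As a rough sketch of the aggregation idea (table and column names are hypothetical, assuming a SQL Server source), the aggregate table that the visuals import could be built with something like:
-- Pre-aggregate the fact table to one row per day and region; most visuals
-- can be answered from this, leaving DirectQuery for detail drill-through.
SELECT CAST(SaleDate AS date) AS SaleDay,
       RegionId,
       SUM(Quantity) AS TotalQuantity,
       SUM(Amount)   AS TotalAmount,
       COUNT_BIG(*)  AS RowCnt
INTO   dbo.FactSales_DailyAgg
FROM   dbo.FactSales
GROUP  BY CAST(SaleDate AS date), RegionId;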

Spring batch to process huge data

I have around 10 million files in my database in blob format which I need to convert and save in PDF format. Each file is between 0.5 and 10 MB, and the combined size is around 20 TB. I'm trying to implement the functionality using Spring Batch. However, my question is: when I run the batch, can the server memory hold that much data? I'm trying to use chunk-based processing and a thread pool task executor. Please suggest whether this is the best approach to process that much data in less time.
Since each file is 0.5 to 10 MB, the chunk-based approach you mentioned is the right one. You can get more control, and monitor the processing, with the steps below.
Create partitions based on the thread pool count (sized to your system resources) over the file table; a SQL sketch of this follows below.
Each partition step's reader selects only one file at a time.
You can calculate the memory needed from the number of parallel steps and pass it as a JVM argument.
Configure the commit chunk based on the memory calculation for the total parallel steps.
Please refer to the following for example code:
Spring Batch multiple process for heavy load with multiple thread under every process
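To make the partitioning idea concrete, here is a minimal sketch of the queries a range partitioner and its per-partition readers might run, assuming a hypothetical blob_files table with a numeric id primary key and a file_blob column (names are illustrative, not from the original post):
-- 1) The partitioner finds the overall key range once and splits it into
--    N slices, where N is the thread pool size.
SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM blob_files;
-- 2) Each partition step's reader then walks only its own slice, fetching
--    one row (one blob) at a time so memory stays bounded per thread.
SELECT id, file_blob
FROM blob_files
WHERE id BETWEEN :partitionMinId AND :partitionMaxId
ORDER BY id;
With this layout, peak memory is roughly the number of parallel partitions times the largest blob in flight, plus chunk overhead, which is what the JVM heap argument should be sized for.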

Which time series database supports these specific requirements?

We have a database with more than a billion daily statistical records. Each record has multiple metrics (m1 through m10) and several immutable tags.
A record can also be associated with zero or more groups. The idea was to use multiple tags (e.g. g1, g2) to indicate that a specific record belongs to a specific group.
Our data is stored at the daily level, and most time-series databases are really optimized for more granular data. This is a problem when we want to produce monthly or quarterly graphs (e.g. InfluxDB has a maximum aggregation period of 7d). We need a database that is really optimized for day-level data points and can produce quick aggregations at the month/quarter/year level.
Furthermore, the relationship between records and groups is mutable. We need the database to support batch updates of records (pseudo: ADD TAG group1 TO records WHERE record_id: 101), or at least fast deletion/reinsertion of updated data. This operation should be relatively fast.
We need something that can produce near-real-time results when aggregating data across tens of millions of (filtered) records.
Our original solution is based on Elasticsearch and it works quite well, but we wanted to explore alternatives in the time-series database niche. Can anyone recommend a time-series database that supports these features?
Try ClickHouse. It is optimized for real-time processing and querying of large amounts of data. We successfully used it to store hundreds of billions of records per day on a 15-node cluster. ClickHouse is able to scan billions of records per second per CPU core, and its query performance scales linearly with the number of available CPU cores.
ClickHouse also supports infrequent data updates, so you can update groups for particular rows.
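A minimal ClickHouse sketch of this setup, with hypothetical table and column names, might look like the following; keeping the groups in an array column keeps both month-level rollups and infrequent batch updates simple:
-- Day-level fact table, partitioned by month for fast month/quarter rollups.
CREATE TABLE daily_stats
(
    day       Date,
    record_id UInt64,
    groups    Array(String),   -- mutable group membership
    m1        Float64,
    m2        Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (record_id, day);
-- Month-level aggregation over filtered records.
SELECT toStartOfMonth(day) AS month, sum(m1) AS m1_total, avg(m2) AS m2_avg
FROM daily_stats
WHERE has(groups, 'g1')
GROUP BY month
ORDER BY month;
-- Infrequent batch update of group membership (runs as an asynchronous mutation).
ALTER TABLE daily_stats
UPDATE groups = arrayConcat(groups, ['group1'])
WHERE record_id = 101;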
If you want a more traditional TSDB, then take a look at VictoriaMetrics. It is built on architecture ideas from ClickHouse, so it is fast and provides good on-disk data compression.

Optimal Batch Size PostgreSQL Update

I am using Postgres and I have a ruby task that updates the contents of an entire table at an hourly rate. Currently this is achieved by updating the table in batches. However, I am not exactly sure what the formula is for finding an optimal batch size. Is there a formula or standard for determining an appropriate batch size?
In my opinion, there is no theoretical optimal batch size. The optimal batch size will surely depend on your application model, the internal structure of the accessed tables, the query structure, and so on. The only reliable way I see to determine it is benchmarking.
There are some optimization tips that can help you build a faster application, but these tips cannot be followed blindly, because many of them have corner cases where they cannot be applied successfully. Again, the way to determine whether a change (adding an index, changing the batch size, enabling the query cache...) improves performance is to benchmark before and after every single change.
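As a starting point for that benchmarking, a batched update keyed on the primary key keeps each transaction short and makes the batch size an explicit knob to vary. Table and column names below are hypothetical, and the SET expression stands in for whatever the Ruby task actually recomputes:
-- Update one batch at a time; rerun with :last_processed_id advanced to the
-- highest id returned, and benchmark different :batch_size values.
UPDATE items
SET    payload = upper(payload)          -- placeholder for the real per-row update
WHERE  id IN (
         SELECT id
         FROM   items
         WHERE  id > :last_processed_id  -- cursor kept by the Ruby task
         ORDER  BY id
         LIMIT  :batch_size              -- e.g. compare 1,000 / 5,000 / 20,000
       )
RETURNING id;
Timing a full pass at a few candidate batch sizes, while watching lock waits and WAL volume, usually narrows the choice quickly.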

Speeding up Redshift COPY loading

I am loading files into Redshift with the COPY command using a manifest. The files are in S3. Unfortunately, there are about 2,000 files per table, so it's like
users1.csv.gz, users2.csv.gz, users3.csv.gz, users4.csv.gz, etc
I don't know if that matters or not, because the files are loaded with a manifest, and the manifest is supposed to parallelize this. That being said, it is really slow to load a table, and I need to speed it up.
What are some things I could do to speed this up?
In my case, I was importing lots of small tables (~100 tables of fewer than 1k rows each), and adding the following options did help:
COMPUPDATE OFF
and
STATUPDATE OFF
See the documentation for COPY, COMPUPDATE, and STATUPDATE.
Keep in mind that these options skip automatic compression and statistics updates; refer to the documentation for the exact consequences.
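For reference, a COPY along these lines (the bucket path and IAM role are placeholders, not from the original post) would look roughly like:
-- Skip automatic compression analysis and statistics computation during the load.
COPY users
FROM 's3://my-bucket/users/users.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST
GZIP
CSV
COMPUPDATE OFF
STATUPDATE OFF;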
If each user*.csv.gz file is very small, then Redshift might be spending some compute effort on decompression. If that is the case, you may consider uploading the CSV files directly without compressing them.
If you want only specific columns from the CSV, you may use a column list to ignore the others. The link below describes column lists.
https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-column-mapping.html#copy-column-list
You may disable the COMPUPDATE option during the load if it is unnecessary.
Is it an empty table, or does the table already contain data? If it has data, execute VACUUM and ANALYZE before/after the load. VACUUM and ANALYZE are time-consuming activities as well; if there is a sort key and the data in your CSV is already in the same sorted order, those operations should be faster.
Define relevant sort keys, which have an impact on disk I/O and columnar compression, and load data in sort key order. https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key-order.html
Define relevant distribution styles, which distribute data across multiple slices and affect disk I/O across the cluster.
Specify compression types for columns, which reduces disk size and, subsequently, disk I/O.
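Putting the sort key, distribution style, compression, and column list points together, a rough sketch (all names and the IAM role are placeholders) could look like:
-- Table defined with distribution key, sort key and column encodings up front.
CREATE TABLE users (
    user_id    BIGINT,
    created_at TIMESTAMP,
    email      VARCHAR(256) ENCODE lzo,
    country    VARCHAR(2)   ENCODE bytedict
)
DISTKEY (user_id)
SORTKEY (created_at);
-- Load only the columns present in the CSV, in their file order.
COPY users (user_id, created_at, email, country)
FROM 's3://my-bucket/users/users.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST
GZIP
CSV;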
May I know the numbers: how many records in total, and how long does the load currently take?
Hope the above points help.
