Speeding up Redshift COPY loading

I am loading files into Redshift from S3 with the COPY command, using a manifest. Unfortunately, there are about 2,000 files per table, named like
users1.csv.gz, users2.csv.gz, users3.csv.gz, users4.csv.gz, etc
I don't know whether that matters, since the files are loaded with a manifest and the manifest is supposed to parallelize the load. Even so, loading a table is really slow and I need to speed it up.
What are some things I could do to speed this up?

In my case, I was importing lots of small tables (~100 tables of fewer than 1k rows each), and adding the following options did help:
COMPUPDATE OFF
and
STATUPDATE OFF
See the documentation for COPY, COMPUPDATE, and STATUPDATE.
Keep in mind that this does skip automatic compression and stats update. Refer to the documentation for the exact consequences of this.
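For reference, a minimal sketch of what such a manifest-based load could look like, issued here through psycopg2; the cluster endpoint, credentials, bucket, manifest path, IAM role, and table name are all placeholders:

import psycopg2

# Placeholder connection details -- substitute your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="loader", password="change-me",
)

copy_sql = """
    COPY users
    FROM 's3://my-bucket/manifests/users.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP CSV
    MANIFEST
    COMPUPDATE OFF
    STATUPDATE OFF;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # skips automatic compression analysis and statistics update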

If each user*.csv.gz file is very small, Redshift may be spending some compute effort on decompressing them. If they are small, you could consider uploading the CSV files directly without compressing them.
If you only want specific columns from the CSV, you can use a column list to skip the ones you don't need. The link below describes column lists.
https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-column-mapping.html#copy-column-list
You may disable the COMPUPDATE option during load if it is unnecessary.
Is the table empty, or does it already contain data? If it contains data, run the VACUUM and ANALYZE commands before/after the load. VACUUM and ANALYZE are time-consuming activities as well; if there is a sort key and the data in your CSV files is already in that sort order, these operations should be faster.
Define relevant sort keys, which have an impact on disk I/O and columnar compression, and load the data in sort key order (see the sketch after this answer). https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key-order.html
Define relevant distribution styles, which distribute data across multiple slices and affect disk I/O across the cluster.
Specify compression encodings for columns, which reduces storage size and, subsequently, disk I/O.
May I ask for the numbers: how many records in total, and how long does the load currently take?
Hope the above points help.
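As a rough illustration of the sort key, distribution key, and column compression points above, a table definition might look like the following; the table, column names, connection string, and specific encodings are made up for the example:

import psycopg2

# Placeholder connection string -- substitute your own cluster endpoint and credentials.
conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com port=5439 "
                        "dbname=mydb user=loader password=change-me")

ddl = """
    CREATE TABLE users (
        user_id    BIGINT      ENCODE az64,   -- explicit column compression
        created_at TIMESTAMP   ENCODE az64,
        country    VARCHAR(2)  ENCODE lzo
    )
    DISTSTYLE KEY
    DISTKEY (user_id)        -- spreads rows across slices
    SORTKEY (created_at);    -- load the CSV files pre-sorted on this column
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)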

extremely high SSD write rate with multiple concurrent writers

I'm using QuestDB as a backend for storing collected data, using the same script for different data sources.
My problem is the extremely high disk (SSD) usage: over 4 days it has written an average of 335 MB per second.
What am I doing wrong?
I'm inserting data using the ILP interface:
sender.row(
    metric,              # table name
    symbols=symbols,     # symbol (indexed string) columns
    columns=data,        # regular columns
    at=row['ts']         # designated timestamp
)
I don't know how much data you are ingesting, so I'm not sure whether 335 MB per second is a lot or not. But since you are surprised by it, I am going to assume your ingestion throughput is lower than that. It might be that your data is arriving out of order, especially if you are ingesting from multiple data sources.
QuestDB always keeps the data in each table in incremental order by the designated timestamp. If data arrives out of order, the whole partition needs to be rewritten. This can lead to write amplification, where you see your data being rewritten very often.
Until literally a few days ago, fine-tuning this required changing the default config, but since version 6.6.1 it is adjusted dynamically.
You might want to give version 6.6.1 a try, or alternatively, if data from different sources is arriving out of order relative to each other, you might want to create separate tables for the different sources, so that data is always in order for each table.
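A minimal sketch of the per-source-table idea, using the QuestDB Python client (questdb.ingress); the table naming scheme and field names are assumptions, and the exact Sender constructor signature depends on the client version:

from questdb.ingress import Sender

def ingest(readings):
    # Route each reading to a per-client table so every table receives
    # timestamps in order (hypothetical field names).
    with Sender('localhost', 9009) as sender:
        for row in readings:
            sender.row(
                f"metrics_{row['client_id']}",        # e.g. metrics_client01
                symbols={'sensor': row['sensor']},
                columns={'value': row['value']},
                at=row['ts'],                         # datetime or TimestampNanos
            )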
I have been experimenting a lot and it seems that you're absolutely right. I was ingesting 14 different clients into a single table. After splitting this into 14 tables, one per client, the problem disappeared.
Another advantage is that I need one symbol less, since I no longer have to distinguish the rows by client.
By the way, thank you and your team for this marvellous tool you gave us! It makes my work so much easier!!
Regards

Collecting large statistical sets with pg_stat_statements?

According to Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be, say at 100k or 500k (the default is 5k). Is it safe to set it that high, and what could be the negative repercussions of such high values? Would aggregating statistics into an external database via logstash/fluentd be a preferred approach above a certain size?
1.
From what I have read, it hashes the query and keeps the hash in the database, saving the query text to the filesystem. So the next concern is more likely than overloaded shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
The hash of the text is so much smaller than the text itself that I think you should not worry about the extension's memory consumption, even for long queries. Especially knowing that the extension uses the query analyser (which runs for every query anyway):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max 10 times higher should take 10 times more shared memory, I believe. The growth should be linear. The documentation does not say so explicitly, but logically it should be so.
There is no definitive answer on whether a particular value is safe, because it depends on your other configuration values and hardware. But since the growth should be linear, consider this answer: "if you set it to 5k and query runtime has grown by almost nothing, then setting it to 50k will prolong it by almost nothing times ten". By the way, my question is: who is going to dig through 50,000 slow statements? :)
2.
This extension already pre-aggregates statistics per normalized statement (with the literal values stripped out). You can query them straight on the database, so moving the data to another database and querying it there will only give you the benefit of unloading the original DB and loading another one. In other words, you save 50 MB worth of querying on the original but spend the same on the other. Does it make sense? For me, yes; this is what I do myself. But I also save execution plans per statement (which is not part of the pg_stat_statements extension). It depends on what you have. There is definitely no need to do it just because of the number of queries. Again, unless the external query-text file grows so large that the extension has to discard it:
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields
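For completeness, a small sketch of querying the pre-aggregated statistics directly (for example from a job that ships them elsewhere); the connection string is a placeholder, and the column names assume PostgreSQL 13 or later (older versions use total_time/mean_time):

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")   # placeholder DSN

query = """
    SELECT queryid, calls, total_exec_time, mean_exec_time,
           left(query, 200) AS query_text
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 50;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)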

Column Store Strategies

We have an app that requires very fast access to columns from delimited flat files, without knowing in advance which columns will be in the files. My initial approach was to store each column in a MongoDB document as an array, but it does not scale as the files get bigger, because I hit the 16 MB document limit.
My second approach is to split the file column-wise and essentially treat the columns as blobs that can be served off disk to the client app. I'd intuitively think that storing the locations in a database and the files on disk is the best approach, and that storing them in a database as blobs (or in Mongo GridFS) adds unnecessary overhead; however, there may be advantages that are not apparent to me at the moment. So my question is: what would be the advantage of storing them as blobs in a database such as Oracle or Mongo, and are there any databases that are particularly well suited to this task?
Thanks,
Vackar
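For what it's worth, a minimal sketch of the second approach described above: splitting a delimited file column-wise, storing each column as a gzip blob on disk, and recording the locations in a small index (a JSON file standing in for a database table); the paths, delimiter, and index format are assumptions:

import csv
import gzip
import json
import os

def split_columns(src_path, out_dir, delimiter='\t'):
    # Read the delimited file once and write each column out as its own gzip blob.
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, newline='') as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    header, data = rows[0], rows[1:]
    index = {}
    for i, name in enumerate(header):
        col_path = os.path.join(out_dir, f"{name}.gz")
        with gzip.open(col_path, 'wt') as out:
            out.write('\n'.join(row[i] for row in data))
        index[name] = col_path              # location record, stand-in for a DB row
    with open(os.path.join(out_dir, 'index.json'), 'w') as f:
        json.dump(index, f)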

Optimizing file reading from HD

I have the following loop:
for fileName in fileList:
    with open(fileName) as f:   # opening the file is what dominates the run time
        txt = f.read()
    analyze(txt)
The fileList is a list of more than 1 million small files. Empirically, I have found that the call to open(fileName) takes more than 90% of the loop's running time. What would you do in order to optimize this loop? This is a "software only" question; buying new hardware is not an option.
Some information about this file collection:
Each file name is a 9-13 digit ID. The files are arranged in subfolders according to the first 4 digits of the ID. The files are stored on an NTFS disk, and I'd rather not change the disk format for reasons I won't get into, unless someone here has a strong belief that such a change will make a huge difference.
Solution
Thank you all for the answers.
My solution was to pass over all the files, parsing them and putting the results in an SQLite database. Now the analyses that I perform on the data (select several entries, do the math) take only seconds. As already said, the reading part took about 90% of the time, so parsing the XML files in advance had little effect on the performance compared to the effect of not having to read the actual files from the disk.
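A minimal sketch of that one-off pass, under the assumption that a parse() helper extracts whatever values the analysis needs (parse(), the directory layout, and the schema are all made up here):

import glob
import os
import sqlite3

def parse(txt):
    # Stand-in for the original XML parsing; returns whatever value the analysis needs.
    return float(len(txt))

conn = sqlite3.connect("analysis.db")
conn.execute("CREATE TABLE IF NOT EXISTS entries (file_id TEXT, value REAL)")

for path in glob.iglob("data/*/*"):           # hypothetical layout: data/<4-digit prefix>/<id>
    with open(path) as f:
        txt = f.read()
    conn.execute("INSERT INTO entries VALUES (?, ?)",
                 (os.path.basename(path), parse(txt)))

conn.commit()
conn.close()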
Hardware solution
You should really benefit from using a solid state drive (SSD). These are a lot faster than traditional hard disk drives, because they don't have any hardware components that need to spin and move around.
Software solution
Are these files under your control, or are they coming from an external system? If you're in control, I'd suggest you use a database to store the information.
If a database is too much of a hassle for you, try to store the information in a single file and read from that. If the file isn't fragmented too much, you'll have much better performance compared to having millions of small files.
If opening and closing files is taking most of your time, a good idea would be to use a database or data store for your storage rather than a collection of flat files.
To address your final point:
unless someone here has a strong belief that such a change will make a huge difference
If we're really talking about 1 million small files, merging them into one large file (or a small number of files) will almost certainly make a huge difference. Try it as an experiment.
Store the files in a single .zip archive and read them from that. You are just reading these files, right?
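Along the lines of the .zip suggestion, reading the texts back out of a single archive could look roughly like this (the archive name is a placeholder, and analyze() stands in for the function from the question):

import zipfile

def analyze(txt):
    pass   # the analysis from the question

with zipfile.ZipFile("corpus.zip") as archive:
    for name in archive.namelist():
        txt = archive.read(name).decode("utf-8")   # one small file, read from the archive
        analyze(txt)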
So, let's get this straight: you have sound empirical data that shows that your bottleneck is the filesystem, but you don't want to change your file structure? Look up Amdahl's law. If opening the files takes 90% of the time, then without changing that part of the program, you will not be able to speed things up by more than 10%.
Take a look at the properties dialog box for the directory containing all those files. I'd imagine the "size on disk" value is much larger than the total size of the files, because of the overhead of the filesystem (things like per-file metadata that is probably very redundant, and files being stored with an integer number of 4k blocks).
Since what you have here is essentially a large hash table, you should store it in a file format that is better suited to that kind of usage. Depending on whether you will need to modify these files and whether the data set will fit in RAM, you should look into using a full-fledged database, a lightweight embeddable one like SQLite, your language's hash table/dictionary serialization format, a tar archive, or a key-value store program that has good persistence support.

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list would have Write Once Read Many (WORM) characteristics.
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you could optimize the datastructure for your most common operations (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file, and read the data during startup.
SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to care about file system storage, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), create the list and store it as a binary file, then read the binary file on application startup (see the sketch after this answer). Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (an extra app dependency and design expertise); fast, but not as fast as holding the whole list in memory.
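A minimal sketch of the serialization option, assuming the lookup structure has already been built in memory (the file name is a placeholder):

import pickle

def save_lookup(numbers, path="numbers.bin"):
    # Serialize the already-built lookup structure to a binary file, once.
    with open(path, "wb") as out:
        pickle.dump(numbers, out)

def load_lookup(path="numbers.bin"):
    # Deserialize at application startup; avoids re-parsing the source data.
    with open(path, "rb") as f:
        return pickle.load(f)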
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list to it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work: store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
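That simplest-thing approach might look like this (the file name is a placeholder):

# Read the list once at application startup ...
with open("numbers.txt") as f:
    numbers = {line.strip() for line in f}

# ... then each user input is a constant-time set membership test.
def matches(user_input):
    return user_input.strip() in numbers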
