Merging dataset which was split to comply with upload size limits

Merging dataset which was split to comply with upload size limits - amazon-quicksight

I have about 5GB of CSV data I've split into parts to comply with QuickSight's 1GB file size limit.
Now that it's in QuickSight SPICE, how do I merge this data back to its original form?
The obvious feature to use is "Add dataset" to an existing dataset. But this only supports joins, which merge rows together horizontally. I need to merge the data vertically.
My hope is the solution is supported by QuickSight and this is a stupid question...

You can always extend the capacity of your spice.
Go to account setting Manage Quicksight>SPICE capacity and it will show you the option of "purchase more capacity", but remember, you will be leaving the free tier layer.

Related

Maximum size of database for DQ mode in Power BI

I am using a database worth of 500 GBs. I want to visualize different columns to study the relationship between them using Power BI. However, there are performance issues while loading graphs.
I am using in DQ mode.
Its annoying to wait for 10 minutes for each visual to load.
Could anyone tell me if its a good idea to use Power BI for visualisation/making dashboard out of 500GBs of data?
What is the maximum limit of database we can use in DQ mode to create visuals efficiently?

DQ doesn't have a defined limit, MS have shown demos using a Petabyte database in this case for long running queries on a database, you have a few options.
Understand what queries are being run, and optimise your indexing strategy, maybe for example add a covering index
Optimise your data source, by using a column store index to move it in memory
Create database or table(s) with a the necessary subset of data from your main data.
Examine what objects are being used, and remove nested logic, views on top of views etc, with scalar conditions etc
The petabyte example by MS also used aggregation mode (Mentioned by WB in their answer) to store a subset of the data
I have used Direct Query to sit over data sources that have been around the 200GB range, however these have been mostly standard Star Schema data warehouses, or a defined reporting table, both which had the relevant indexes, covering indexes, or Column Store Indexes to allow more efficient retrieval of data. Direct Query Mode will slow down due to the number of query's that it has the do on the data source based on the measure, relationships and the connection overhead. Another can be the number of visuals on page, as each visual is a query and each one has to run on the data source.

You might want to look at aggregates in Power BI. You can basically import aggregate tables to Power BI that would satisfy needs for most of your visuals and resort to Direct Query for details that you might rarely need. When properly configured, aggregations will be cached and visuals that hit the aggregation will make use of that while those that don't will seamlessly query the DQ source.
Also, VertiPaq engine with its columnar store is quite efficient at compresses data. So given some smart modelling (get rid of unneeded high cardinality columns), you might actually end up with a much smaller model than your original data for all import.
Your mileage may vary.
As to the dataset limit itself, I believe it's 1GB/dataset when uploading to the service.

vertica how restrict database size

Could you please help me with the following issue?
I have installed vertica cluster. I can't understand how I can restrict database size in time or in size. For example, data in a database must be deleted older than 30 days or when a database size of 100 GB is reached (what comes first)

There is no automated way of doing this, and no logical way of "restricting database size". You can't just trim "data" from a "database".
What are you talking about (in terms of limiting data outside of 30 days old) needs to be done on the table level. You would need some kind of date field and delete anything older than 30 days. However, I would advice against deleting rows in this way. It is non-performant and can cause queries against the table to be slow: see DELETE and UPDATE Performance Considerations. The best way of doing this would be to partition the table by day, and create an automated script (bash, python, etc) to each day drop the partition that corresponds with the date 30 days ago: see Dropping Partitions.
As for deleting data—if the size of the "database" goes above 100GB—this requirement is extremely vague and would be impossible to enforce. Let's say you have 50 tables, and the size of several of those tables grows so that the total size of the database is over 100GB, how would you decide which table to prune? This also must be done on a table by table level (or in this case—technically—on a projection level, since that is where the data is actually stored).
To see the compressed size (size on disk) of the database you can use this query:
SELECT SUM(used_bytes) / ( 1024^3 ) AS database_size_gb
FROM projection_storage;
However, since data can only be deleted with a DELETE or DROP PARTITION statement on a table, it would also be helpful to see the size of each table. You can do this by using this query:
SELECT projection_schema, anchor_table_name, SUM(used_bytes) / ( 1024^3 ) AS table_size_gb
FROM projection_storage
GROUP BY 1, 2
ORDER BY 3 DESC;
From the results you can decide which tables you want to prune.
A couple of notes (as a Vertica DBA):
Data is stored in projections. Having too many projections on a single table can not only cause queries to be slow but will also increase the overall data footprint. Avoid using too many projections (especially too many superprojections, don't have more than two per table, and most tables will only need one). Use the database designer or follow the guidelines in the documentation for creating custom projections: Design Fundamentals.
Also, another trick to keep database size down is to use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function. Unless your projections are created with the database designer, they will likely only contain the auto encoding. Using the DESIGNER_DESIGN_PROJECTION_ENCODINGS function will help you to pick the most optimal encoding for each column. I have seen properly encoded projections take up a mere 2% disk size compared to the previously un-optimized projection. That is rare, but in my experience you will still see at least a 20-40% reduction in size. Do not be afraid to use this function liberally. It is one of my favorite tools as a Vertica DBA.
Also

Speeding up Redshift COPY loading

I am loading files into Redshift with the COPY command using a manifest. The files are in S3. Unfortunately, there's about 2,000 files per table, so it's like
users1.csv.gz, users2.csv.gz, users3.csv.gz, users4.csv.gz, etc
I don't know if that matters or not, because the files are loaded with a manifest, and the manifest is supposed to parallelize this. That being said, it is really slow to load a table, and I need to speed it up.
What are some things I could do to speed this up?

In my case, I was importing lots of small tables (~100 tables of less than 1k rows each). In this case, adding the following options did help:
COMPUPDATE OFF
and
STATUPDATE OFF
documentation for COPY
documentation for COMPUPDATE.
documentation for STATUPDATE
Keep in mind that this does skip automatic compression and stats update. Refer to the documentation for the exact consequences of this.

If the size of the each user*.csv.gz file is very small, then Redshift might be spending some compute effort in uncompressing. If it is small, you may consider, uploading the csv files directly without compressing.
If you may want only specific columns from the CSV, you may use the column list to ignore a few columns. The below link describes column lists.
https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-column-mapping.html#copy-column-list
You may disable the COMPUPDATE option during load if it is unnecessary.
Is it an empty table or does the table possess any data. If so, please execute VACUUM and ANALYSE commands before/after the load. VACUUM & ANALYSE are time consuming activities as well, if thr is any sort key and the data in your csv is also in the same sorted order, the above operation should be faster.
Define relevant sort keys which will have an impact on disk I/O and columnar compression & Load data in the sort key order. https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key-order.html
Define relevant distribution styles, which will distribute data across multiple slices and will impact disk I/O across the cluster.
Specify compression types for columns which reduces disk size and disk I/O subsequenty.
May I know the numbers, how many records in total and how long does it take to load?
Hope the above points help

Column Store Strategies

We have an app that requires very fast access to columns from delimited flat files, whithout knowing in advance what columns will be in the files. My initial approach was to store each column in a MongoDB document as a array, but it does not scale as the files get bigger as I hit the 16MB document limit.
My second approach is to split the file columnwise and essentially treat them as blobs that can be served off disk to the client app. I'd intuitively think that storing the location in database and the files on disk is the best approach, and that storing them in a database as blobs (or in mongo gridfs) is adding unnessasary overhead, however there may be advatages that are not apparant to me at the moment. So my question is: what would be the advantage to storing them as blobs in a database such as (Oracle/Mongo) and are there any databases that are particularly well suited to this task.
Thanks,
Vackar

max number of couchbase views per bucket

How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?

Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had an fairly large amount of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry

As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rules to give you an exact member.
But based on the best practice document that you have used and some discussion(here) you should be able to design your database/views properly.
Let's start with the last question:
YES the reason why Couchbase advice to have a small number of bucket is for performance - and more importantly resources consumption- reason. I am inviting you to read these blog posts that help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
yes most the time your will organize the design document/and views by type of document.
It is NOT a problem to have all the document "types" in a single(few) buckets, this is in fact the way your work with Couchbase
The most important part to look is, the size of your doc (to see how "long" will be the parsing of the JSON) and how often the document will be created/updated, and also deleted, since the JS code of the view is ONLY executed when you create/change the document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how any views in each document you will have?
In fact the most expensive part is not during the indexing or quering it is more when you have to rebalance the data and indices between nodes (add, remove , failure of nodes)
Finally, but it looks like you already know it, this chapter is quite good to understand how views works (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio