How does Greenplum handle multiple large joins and simultaneous workloads?

Our product consists of extracts from our database, which can be 300 GB+ in file size. To produce them we join multiple large tables (close to 1 TB in size in some cases). We do not aggregate the data at all; these are pure extracts. How does Greenplum handle data sets like this? The join keys are composite keys of three or more columns, and not every table has the same keys to join on: the only common key is the first one, and if the data were distributed by that column alone there would be a lot of skew, because the data itself is not balanced.

You should use writable external tables for these kinds of large data extracts, because they can leverage gpfdist and write data in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
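For reference, a minimal sketch of that pattern in Python with psycopg2; the host names, table names and join keys below are made up for illustration, and it assumes a gpfdist process is already serving a directory on an ETL host (e.g. started with gpfdist -d /data/extracts -p 8081):

import psycopg2

# Assumed names: gp-master, analytics, big_fact_table, other_large_table, etl-host.
conn = psycopg2.connect(host="gp-master", dbname="analytics", user="gpadmin")
conn.autocommit = True
cur = conn.cursor()

# Writable external table: every segment streams its own rows to gpfdist in
# parallel instead of funnelling everything through the master.
cur.execute("""
    CREATE WRITABLE EXTERNAL TABLE ext_extract (LIKE big_fact_table)
    LOCATION ('gpfdist://etl-host:8081/extract.txt')
    FORMAT 'TEXT' (DELIMITER '|')
    DISTRIBUTED RANDOMLY
""")

# The join result is written straight from the segments; no ORDER BY needed.
cur.execute("""
    INSERT INTO ext_extract
    SELECT f.*
    FROM   big_fact_table f
    JOIN   other_large_table o
      ON   f.key1 = o.key1
      AND  f.key2 = o.key2
      AND  f.key3 = o.key3
""")

cur.close()
conn.close()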
Also, your use case doesn't really indicate skew. Skew would be either data skew from a poor distribution column choice (such as gender_code), or processing skew, where you filter on a column or columns for which only a few segments have the data.

In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database: if you use SQL (or COPY), everything has to funnel through the master to the client, and that is slow.
As Jon pointed out, consider using a writable external table and writing out the data as it comes out of the query. Also avoid any kind of sort operation in your query, if possible; a sort is wasted effort, because the data lands in the external table files unsorted anyway.

Related

Proper way to populate cache from Cassandra

I want to have a memory cache layer in my application. To populate the cache with items, I have to get data from a large Cassandra table. Selecting everything is not recommended, because without partition keys it is a slow read operation. Before that, I can "predict" the partition keys using another Cassandra table, which I would also have to read in full, but it is a relatively small table. After reading the user table I would build a list of potential partition keys (userX, userY) that may or may not be present in the initial table, and then try to populate the cache by executing a select query for each potential key. That also doesn't sound like a really good idea.
So the question is: how do I properly populate a cache layer with data from a Cassandra database?
The second option is preferred for warming up or pre-loading your cache.
Single-partition asynchronous queries from multiple client/app instances are much better than doing a full table scan. Asynchronous queries from lots of clients distribute the load efficiently across all nodes in the cluster, which is why they perform better.
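A minimal sketch of that approach with the DataStax Python driver; the keyspace, tables user_index (the smaller "prediction" table) and user_data (the large table), and column names are hypothetical:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# Single-partition query, prepared once and reused for every candidate key.
stmt = session.prepare("SELECT user_id, payload FROM user_data WHERE user_id = ?")

# Candidate partition keys come from the smaller table; in production you would
# bound the number of in-flight futures rather than launching them all at once.
futures = [session.execute_async(stmt, [row.user_id])
           for row in session.execute("SELECT user_id FROM user_index")]

cache = {}
for future in futures:
    for row in future.result():      # keys not present simply return no rows
        cache[row.user_id] = row.payload

print("warmed cache with", len(cache), "entries")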
It should be said that if you've got your data model right and you've sized your cluster correctly, you can achieve single-digit millisecond latencies. I work with a lot of large organisations who have a 95% SLA for 6-8ms reads. Cheers!

In Spark, if one column contains a lot of data, does this affect the performance of queries on other columns?

Let's assume that I'm using Spark and storing data in Parquet format. I have a table with multiple columns, one of which may contain a relatively large amount of data (for example, a 10,000-word string).
Now I want to run a simple query counting the rows in a given partition of the data. I would expect that it wouldn't matter how much data is in any one column. As I understand it, Parquet is a columnar format, so it stores and operates on data column-wise; it could just load one of the columns with a small amount of data and count the rows. But I'm finding that when a column contains a lot of data, query performance suffers, and in the Spark UI I can see that the input size is much larger. This is true even if the query explicitly excludes that column.
Is this the expected behavior? Isn't the advantage of columnar data specifically to improve performance in cases like this?
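One way to see what Parquet is actually being asked to read is to select only the small column and check the ReadSchema of the FileScan node in the physical plan. A minimal PySpark sketch, with a hypothetical path and column name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-check").getOrCreate()

df = spark.read.parquet("/data/events")

# Project down to the small column before counting and inspect the plan;
# the ReadSchema entry shows exactly which columns will be read from Parquet.
pruned = df.select("event_id")
pruned.explain()        # look for ReadSchema: struct<event_id:...>
print(pruned.count())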

HBase: Create multiple tables or single table with many columns?

When does it make sense to create multiple tables as opposed to a single table with a large number of columns? I understand that tables typically have only a few column families (1-2) and that each column family can support 1000+ columns.
When does it make sense to create separate tables when HBase seems to perform well with a potentially large number of columns within a single table?
Before answering the question itself, let me first state some of the major factors that come into play. I am going to assume that the file system in use is HDFS.
A table is divided into non-overlapping partitions of the keyspace called regions.
The key-range -> region mapping is stored in a special single region table called meta.
The data in one HBase column family for a region is stored in a single HDFS directory. It's usually several files but for all intents and purposes, we can assume that a region's data for a column family is stored in a single file on HDFS called a StoreFile / HFile.
A StoreFile is essentially a sorted file containing KeyValues. A KeyValue logically represents the following, in order: (RowLength, RowKey, FamilyLength, FamilyName, Qualifier, Timestamp, Type). For example, if you have only two KVs in your region for a CF, where the row key is the same but the values are in two different columns, this is how the StoreFile will look (except that it is actually byte-encoded, and metadata such as lengths is also stored, as mentioned above):
Key1:Family1:Qualifier1:Timestamp1:Value1:Put
Key1:Family1:Qualifier2:Timestamp2:Value2:Put
The StoreFile is divided into blocks (default 64KB) and the key range contained in each data block is indexed by multi-level indexes. A random lookup inside a single block can be done using index + binary search. However, the scans have to go serially through a particular block after locating the starting position in the first block needed for scan.
HBase is a LSM-tree based database which means that it has an in-memory log (called Memstore) that is periodically flushed to the filesystem creating the StoreFiles. The Memstore is shared for all columns inside a single region for a particular column family.
There are several optimizations involved in reading and writing data from/to HBase, but the information above holds true conceptually. Given these points, here are the pros of each approach:
Single Table with multiple columns
Better on-disk compression due to prefix encoding, since all data for a key is stored together rather than in multiple files across tables. This also results in reduced disk activity due to the smaller data size.
Lighter load on the meta table, because the total number of regions is smaller: you'll have N regions for one table rather than N*M regions for M tables. This means faster region lookups and lower contention on the meta table, which is a concern for large clusters.
Faster reads and lower IO amplification (and therefore less disk activity) when you need to read several columns for a single row key.
You get the advantage of row-level transactions, batching and other performance optimizations when writing to multiple columns for a single row key (a minimal sketch follows the list below).
When to use this:
If you want to perform row level transactions across multiple columns, you have to put them in a single table.
Even when you don't need row-level transactions, but you often write to or query multiple columns for the same row key. A good rule of thumb: if, on average, more than 20% of your columns have values for a single row, you should try to put them together in a single table.
When you have too many columns.
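A minimal sketch of a multi-column write to a single row using the happybase client (table, row key and column names are hypothetical); all columns in one put() for the same row key are applied atomically, which is the row-level transaction behaviour described above:

import happybase

connection = happybase.Connection("hbase-thrift-host")   # requires the HBase Thrift server
table = connection.table("orders")

# One put() to one row key: the three columns below are written atomically.
table.put(b"customer42#2024-01-15", {
    b"d:status": b"SHIPPED",
    b"d:total": b"129.90",
    b"d:currency": b"EUR",
})

connection.close()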
Multiple Tables
Faster scans for each table and lower IO amplification if the scans are mostly concerned with only one column (remember that scans read sequentially through blocks, so they also pull in columns they don't need).
Good logical separation of data, especially when you don't need to share row keys across columns. Have one table for one type of row keys.
When to use:
When there is a clear logical separation of data. For example, if your row key schema differs across different sets of columns, put those sets of columns in separate tables.
When only a small percentage of columns have values for a row key (Look below for a better approach).
When you want different storage configs for different sets of columns, e.g. TTL, compaction rate, blocking file counts, memstore size, etc. (look below for a better approach to this use case).
An alternative of sorts: Multiple CFs in single table
As you can see from the above, there are pros to both approaches. The choice becomes really difficult when several sets of columns share the same row key structure (so you want to share the row key for storage efficiency, or you need transactions across those columns) but the data is very sparse (meaning you write/read only a small percentage of the columns for any given row key).
It seems like you need the best of both worlds in this case. That's where column families come in. If you can partition your column set into logical subsets where you mostly access/read/write only a single subset, or you need storage-level configs per subset (like TTL, storage class, a write-heavy compaction schedule, etc.), then you can make each subset a column family.
Since the data for a particular column family is stored in a single file (or set of files), you get better locality while reading a subset of columns, without slowing down the scans.
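As a minimal sketch (again with happybase, and with hypothetical table name, family names and settings), two column families with different storage configs could be created in a single table like this:

import happybase

connection = happybase.Connection("hbase-thrift-host")

connection.create_table("user_profile", {
    "core":   dict(max_versions=1),                         # small, frequently read columns
    "events": dict(max_versions=1, time_to_live=604800),    # sparse columns, 7-day TTL
})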
However, there is a catch:
Do not try to unnecessarily use column families. There is a cost associated with them, and HBase does not do well with 10+ CFs due to how region level write locks, monitoring etc. work in HBase. Use CFs only if you have a logical relationship between columns across CFs but you don't generally perform operations across CFs or need to have different storage configs for different CFs.
It's perfectly fine to use only a single CF containing all your columns if you share row key schema across them, unless you have a very sparse data set, in which case you might need different CFs or different tables based on above mentioned points.

Is it generally better to transform semi-structured into structured data on Hadoop if the possibility exists?

I have large and growing datasets of semi-structured data in JSON files on a Hadoop cluster. The data is fairly benign, but one of the keys holds a list of maps that varies heavily in size: anywhere from zero up to a few thousand of those maps, each with a few dozen keys of its own.
However the data could be transformed into two separate tables of structured data linked by foreign keys. Both would be narrow tables, one of them would roughly be ten times as long as the other.
I could either keep the data in a semi-structured format and use a wide-column store like HBase to store it or alternatively use a columnar storage like Parquet to store the data in two large relational tables.
It is unlikely the data format will change, but it can't be ruled out.
I'm new to Hadoop and Big Data, so which of the two possibilities is generally preferable? Should semi-structured data be changed into structured data if the possibility exists and the data format is fairly constant?
EDIT: Additional info as requested by Rahul Sharma.
The data consists of shopping carts from shopping software; the variable length comes from the variable number of items in the carts. Initially the data is in XML format and is then transformed into JSON, but not by me: I have no control over that step.
No realtime analytics planned, only batch analytics.
The relationship between the two tables would be that one holds the customer/purchase info while the other holds the purchased items, linked by a suitable key.
I hope this helps.
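For what it's worth, a minimal PySpark sketch of the two-table split described above; the field names are hypothetical, and it assumes Spark reads the item list as an array of structs:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cart-normalisation").getOrCreate()

carts_raw = spark.read.json("/data/carts/*.json")

# Parent table: one row per cart with the customer/purchase info.
carts = carts_raw.select("cart_id", "customer_id", "purchase_ts")

# Child table: explode the variable-length item list into one row per item,
# carrying cart_id along as the foreign key.
items = (carts_raw
         .select("cart_id", F.explode("items").alias("item"))
         .select("cart_id", "item.*"))

carts.write.mode("overwrite").parquet("/warehouse/carts")
items.write.mode("overwrite").parquet("/warehouse/cart_items")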

Apache Spark data modelling - should I prefer denormalisation or joins for query performance?

I have two logical data tables. The first contains raw financial data, with about 500 million rows per day. The second is a reference data table with about 10,000 rows per day and some 100 columns. Each row in the raw data has a corresponding row in the reference data.
I need to run ad-hoc Spark queries that cut, slice and aggregate the raw financial data depending on supplied reference data parameters.
Question - is it better to denormalise the data such that each raw data row also contains all 100 reference data items, or instead do a join from raw to reference by some key?
Performance will be best if you completely denormalise the data and write it to Parquet. With Parquet's columnar compression you will get great compression on your dimension attributes.
If you do join the two data sets, make sure to use a broadcast join. A broadcast join ships your smaller table to every executor, so the join can be done map-side without a shuffle.
By the way, it should be quick to test the two scenarios. You can then compare the performance and data size of the two approaches and make an informed decision.
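A minimal PySpark sketch of both variants, with hypothetical paths, join key and column names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("denorm-vs-join").getOrCreate()

raw = spark.read.parquet("/data/raw_financial")    # ~500 million rows per day
ref = spark.read.parquet("/data/reference")        # ~10,000 rows per day, ~100 columns

# Variant 1: denormalise once and persist; Parquet's columnar encoding keeps
# the repeated reference attributes cheap on disk.
raw.join(broadcast(ref), on="ref_key") \
   .write.mode("overwrite").parquet("/data/raw_denormalised")

# Variant 2: keep the tables separate and broadcast the small one at query time,
# so the join happens map-side without shuffling the large table.
result = (raw.join(broadcast(ref), on="ref_key")
             .filter("sector = 'ENERGY'")
             .groupBy("trade_date")
             .count())
result.show()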
