Efficiently store daily dumps in Hadoop HDFS

I believe a common usage pattern for Hadoop is to build a "data lake" by loading regular (e.g. daily) snapshots of data from operational systems. For many systems, the rate of change from day to day is typically less than 5% of rows (and even when a row is updated, only a few fields may change).
Q: How can such historical data be structured on HDFS so that it is both economical in space consumption and efficient to access?
Of course, the answer will depend on how the data is commonly accessed. On our Hadoop cluster:
Most jobs only read and process the most recent version of the data
A few jobs process a period of historical data (e.g. 1 - 3 months)
A few jobs process all available historical data
This implies that, while keeping historical data is important, it shouldn't come at the cost of severely slowing down those jobs that only want to know what the data looked like at close-of-business yesterday.
I know of a few options, none of which seem quite satisfactory:
Store each full dump independently as a new subdirectory. This is the most obvious design, simple, and very compatible with the MapReduce paradigm. I'm sure some people use this approach, but I have to wonder how they justify the cost of storage? Supposing 1 TB is loaded each day, that's 365 TB added to the cluster per year of mostly duplicated data. I know disks are cheap these days, but most budget-makers are accustomed to infrastructure expanding in proportion to business growth, as opposed to growing linearly over time.
Store only the differences (delta) from the previous day. This is a natural choice when the source systems prefer to send updates in the form of deltas (a mindset which seems to date from the time when data was passed between systems in the form of CD-ROMs). It is more space efficient, but harder to get right (for example, how do you represent deletion?), and, even worse, it implies the need for consumers to scan the whole of history, "event sourcing"-style, in order to arrive at the current state of the system.
Store each version of a row once, with a start and end date. Known by terms such as "time variant data", this pattern pops up very frequently in data warehousing, and more generally in relational database design when there is a need to store historical values. When a row changes, update the previous version to set the "end date", then insert the new version with today as the "start date". Unfortunately, this doesn't translate well to the Hadoop paradigm, where append-only datasets are favoured, and there is no native concept of updating a row (although that effect can be achieved by overwriting the existing data files). This approach requires quite complicated logic to load the data, but admittedly it can be quite convenient to consume data with this structure.
(It's worth noting that all it takes is one particularly volatile field changing every day to make the latter options degrade to the same space efficiency as option 1).
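For concreteness, option 3 might look something like the following Hive-style table; this is only an illustrative sketch, and the table and column names are invented for the example:

-- One row per version of each record; the current version has an open end_date.
CREATE TABLE customer_history (
    customer_id  BIGINT,
    name         STRING,
    status       STRING,
    start_date   DATE,   -- day this version became current
    end_date     DATE    -- NULL while this version is still current
)
STORED AS PARQUET;

-- "What did the data look like at close-of-business on a given day?"
SELECT *
FROM customer_history
WHERE start_date <= DATE '2017-06-01'
  AND (end_date IS NULL OR end_date > DATE '2017-06-01');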
So...is there another option that combines space efficiency with ease of use?

I'd suggest a variant of option 3 that respects the append-only nature of HDFS.
Instead of one data set, we keep two with different kinds of information, stored separately:
The history of expired rows, most likely partitioned by the end date (perhaps monthly). This only has rows added to it when their end dates become known.
A collection of snapshots for particular days, including at least the most recent day, most likely partitioned by the snapshot date. New snapshots can be added each day, and old snapshots can be deleted after a couple of days since they can be reconstructed from the current snapshot and the history of expired records.
The difference from option 3 is just that we consider the unexpired rows to be a different kind of information from the expired ones.
Pro: Consistent with the append-only nature of HDFS.
Pro: Queries using the current snapshot can run safely while a new day is added as long as we retain snapshots for a few days (longer than the longest query takes to run).
Pro: Queries using history can similarly run safely as long as they explicitly give a bound on the latest "end-date" that excludes any subsequent additions of expired rows while they are running.
Con: It is not just a simple "update" or "overwrite" each day. In practice, on HDFS this generally needs to be implemented via copying and filtering anyway, so this isn't really a con.
Con: Many queries need to combine the two data sets. To ease this we can create views or similar that appropriately union the two to produce something that looks exactly like option 3 (see the sketch below).
Con: Finding the latest snapshot requires finding the right partition. This can be eased by having a view that "rolls over" to the latest snapshot each time a new one is available.
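To make the last two cons concrete, here is a minimal sketch in Hive-style SQL, assuming the two data sets are exposed as tables named history (partitioned by end_date) and snapshots (partitioned by snapshot_date); the names and columns are illustrative, not prescriptive:

-- Union the expired history with the latest snapshot so consumers see
-- something shaped exactly like option 3.
CREATE VIEW full_history AS
SELECT customer_id, name, status, start_date, end_date
FROM history
UNION ALL
SELECT customer_id, name, status, start_date, CAST(NULL AS DATE) AS end_date
FROM snapshots
WHERE snapshot_date = '2017-06-01';    -- the latest snapshot partition

-- A "current" view that simply points at the latest snapshot; recreate it
-- (roll it over) each day when a new snapshot partition lands.
CREATE VIEW current_snapshot AS
SELECT customer_id, name, status, start_date
FROM snapshots
WHERE snapshot_date = '2017-06-01';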

What is the actual use of partitions in clickhouse?

It says partitions make it easier to drop or move data so that only a limited amount of data is affected. In various blogs it is suggested to use the month as a partitioning key (toYYYYMM(date)). In many places it is also suggested not to have more than a couple of partitions. I am using ClickHouse as a database to store time-series data which do not undergo frequent deletions. What would be an advisable partitioning key for high-volume time-series data? Does there have to be one if I do not want to perform deletes frequently?
In production I noticed that startup was very slow, and I suspected that having too many partitions was the culprit. So I decided to test it by inserting time-series data fresh into a table (which created >2300 partitions for ~20 billion rows) by selecting data from another table (so that it doesn't have an opportunity to optimize the table). Immediately afterwards I dropped the original table and tried a restart. It finished fast, in about 10s. This is the complete opposite of what I observed in production with 800GB+ of data (with many databases and tables, as opposed to my test node which had only one table).
Edit: As was pointed out, I mixed up parts and partitions. Regarding the startup time of ClickHouse being affected, I'd better post another question.
This is a pretty common question, and for disclosure, I work at ClickHouse.
Partitions are particularly useful when you have time-series data, as you noted. When deciding whether and how to partition, we often recommend thinking about a couple of questions about why you're using partitions:
are you generally going to query only a single partition? For example, if your queries are often for results within a one-day or one-month period, it could make sense to partition at that granularity
are you wanting to "tier" or set a TTL on your data, such that once a partition reaches an age of X (e.g., 91 days old, 7 months old), you want to do something special with it? (e.g., TTL to lower-cost tier storage, back up and delete from ClickHouse, etc.)
We generally recommend keeping the number of partitions to less than around 100. Up to 1000 partitions can work, but it is suboptimal and will have some performance impact on the filesystem and on index/memory sizes, which can affect startup time and insert/query time.
Hopefully that helps with your question. It is probably most common to partition by day or by month, but since ClickHouse can manage large tables quite easily, you might want to move towards fewer partitions if possible; partitioning by month is probably the most common choice (see the example table definition below).
I didn't fully understand your test results so please feel free to expand. 2300 partitions sounds like too many but might work, just with some performance implications. Reducing your number of partitions (and therefore increasing the partition size) seems like a good recommendation.
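As a concrete illustration of partitioning by month and using TTL for tiering (the table name, columns, and storage volume are assumptions; the TO VOLUME clause also assumes a storage policy with a 'cold' volume is configured):

CREATE TABLE metrics
(
    ts     DateTime,
    series LowCardinality(String),
    value  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)                     -- month granularity keeps the partition count low
ORDER BY (series, ts)
TTL ts + INTERVAL 7 MONTH TO VOLUME 'cold';   -- optional: move older partitions to cheaper storage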

extremely high SSD write rate with multiple concurrent writers

I'm using QuestDB as a backend for storing collected data, using the same script for different data sources.
My problem is the extremely high disk (SSD) usage. Over 4 days it has written 335 MB per second.
What am I doing wrong?
Inserting data using the ILP interface:
# Assuming QuestDB's Python ILP client; metric, symbols, data and row['ts']
# come from the surrounding ingestion loop (host and port are placeholders).
from questdb.ingress import Sender

with Sender('localhost', 9009) as sender:
    sender.row(
        metric,            # table name
        symbols=symbols,   # symbol (indexed) columns
        columns=data,      # regular columns
        at=row['ts'],      # designated timestamp
    )
I don't know how much data you are ingesting, so I'm not sure whether 335 MB per second is a lot or not. But since you are surprised by it, I am going to assume your throughput is lower than that. It might be the case that your data is arriving out of order, especially if you are ingesting from multiple data sources.
QuestDB always keeps the data in each table ordered by the designated timestamp. If data arrives out of order, the whole partition needs to be rewritten. This can lead to write amplification, where you see your data being rewritten very often.
Until literally a few days ago, you would need to change the default config to fine-tune this, but since version 6.6.1 it is adjusted dynamically.
Maybe you want to give version 6.6.1 a try, or alternatively, if data from different sources arrives out of order relative to each other, you might want to create separate tables for the different sources, so that data is always in order for each table.
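If you do go with separate tables per source, a per-source table could be declared along these lines (illustrative names; ts is the designated timestamp):

-- One table per data source, so each table receives rows in timestamp order.
CREATE TABLE metrics_client_a (
    ts     TIMESTAMP,
    metric SYMBOL,
    value  DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY;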
I have been experimenting a lot, and it seems that you're absolutely right. I was ingesting 14 different clients into a single table. After splitting this into 14 tables, one for each client, the problem disappeared.
Another advantage is that I need one symbol fewer, as I no longer have to distinguish the rows by client.
By the way - thank you and your team for this marvellous tool you gave us! It makes my work so much easier!!
Regards

loading method for the stage in business intelligence

Good evening,
A question: when data is moved from the source to the staging database in a business intelligence system, should the loading method be a full load, or full + incremental?
I'm thinking of deleting all the data and reloading it, but with a very large database and many records that would not be optimal. What do good practices suggest?
I would appreciate your opinions,
thank you very much,
I'm thinking of deleting all the data and reloading it, but with a very large database and many records that would not be optimal. What do good practices suggest?
It depends.
Many companies are more comfortable with the truncate-and-reload (full load) pattern because it is easy to implement, and the amount of data isn't a problem as long as certain conditions are met (hardware, DBA support, etc.).
Incremental loads (or the upsert pattern) are often used to keep the data in two systems in sync with one another. They are used when source data is being loaded into the destination on a repeating basis, such as every night or throughout the day.
Benefits of incremental data loads:
They typically run considerably faster since they touch less data. Assuming no bottlenecks, the time to move and transform data is proportional to the amount of data being touched. If you touch half as much data, the run time is often reduced at a similar scale.
Disadvantages of incremental data loads:
Maintainability: With a full load, if there's an error you can re-run the entire load without having to do much else in the way of cleanup / preparation. With an incremental load, the files generally need to be loaded in order. So if you have a problem with one batch, others queue up behind it until you correct it.
TRUNCATE and then INSERT is two operations, whereas UPDATE is one, making the truncate-and-insert approach take (theoretically) more time.
There's also the ease-of-use factor. If you TRUNCATE then INSERT, you have to manually keep track of every column value. If you UPDATE, you just need to know what you want to change.
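For illustration, an incremental (upsert) load from a staging table into the target could look like the following ANSI-style MERGE; the table and column names are hypothetical, and the exact syntax varies by database:

MERGE INTO dw.customer AS tgt
USING stage.customer AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET tgt.name       = src.name,
               tgt.status     = src.status,
               tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, status, updated_at)
    VALUES (src.customer_id, src.name, src.status, src.updated_at);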

Hazelcast: What would be the implications of adding indexes to huge existing IMaps?

Given 4-5 nodes having many IMaps with lots of data in them, some of the predicate queries have started to become significantly slow. One solution to this performance issue, I think, could be adding indexes. However, this data is part of a sensitive system which is currently being used in production.
Before adding indexes, I was wondering what the consequences of doing this on huge IMaps would be (would it lock the entire map? would it bring down the entire system? etc.). The Hazelcast documentation includes information about how to do it, but doesn't give any other explanation.
If you want to add the index at runtime, this is what will happen:
the AddIndexOperation will be executed on every partition
during the execution of the AddIndexOperation, the partition will be blocked until all of the partition's data has been iterated over and added to the index.
Queries won't be blocked in this timeframe - but get/put operations will.
I would recommend doing it in the "maintenance window" where you have the smallest load.
"Lots of data" is relative - just run a test in your dev environment with exactly the same amount of data to see how long adding an index will take in your environment.

Doing analytical queries on large dynamic sets of data

I have a requirement where I have large sets of data coming into a system I own.
A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time.
The requirements are as follows -
Large sets of data can experience state changes. Updates need to be fast.
I should be able to aggregate data pivoted on various attributes.
Ideally, there should be a way to correlate individual data units to aggregated results, i.e. I want to drill down into the specific transactions that produced a certain aggregation.
(I am aware of the race conditions here, like the state of a data unit changing after an aggregation is performed ; but this is expected).
All aggregations are time based - i.e. sum of x on pivot y over a day, 2 days, week, month etc.
I am evaluating different technologies to meet these use cases, and would like to hear your suggestions. I have taken a look at Hive/Pig, which fit the analytics/aggregation use case. However, I am concerned about the large bursts of updates that can come into the system at any time. I am not sure how this performs on HDFS files when compared to an indexed database (SQL or NoSQL).
You'll probably arrive at the optimal solution only by stress testing actual scenarios in your environment, but here are some suggestions. First, if write speed is a bottleneck, it might make sense to write the changing state to an append-only store, separate from the immutable data, then join the data again for queries. Append-only writing (e.g., like log files) will be faster than updating existing records, primarily because it minimizes disk seeks. This strategy can also help with the problem of data changing underneath you during queries. You can query against a "snapshot" in time. For example, HBase keeps several timestamped updates to a record. (The number is configurable.)
This is a special case of the persistence strategy called Multiversion Concurrency Control - MVCC. Based on your description, MVCC is probably the most important underlying strategy for you to perform queries for a moment in time and get consistent state information returned, even while updates are happening simultaneously.
Of course, doing joins over split data like this will slow down query performance. So, if query performance is more important, then consider writing whole records where the immutable data is repeated along with the changing state. That will consume more space, as a tradeoff.
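A minimal relational sketch of that split, using hypothetical tables units (immutable attributes) and unit_state (append-only state changes):

-- Immutable attributes live in one table; state changes are only ever appended
-- to the other, which allows consistent "as of" queries.
CREATE TABLE units (
    unit_id    BIGINT PRIMARY KEY,
    created_at TIMESTAMP,
    category   VARCHAR(50)      -- an immutable attribute used for pivoting
);

CREATE TABLE unit_state (
    unit_id    BIGINT,
    state      VARCHAR(20),
    changed_at TIMESTAMP        -- rows are appended, never updated
);

-- Aggregate on an immutable attribute using each unit's latest state
-- as of a chosen snapshot time.
SELECT u.category, s.state, COUNT(*) AS unit_count
FROM units u
JOIN unit_state s
  ON s.unit_id = u.unit_id
 AND s.changed_at = (
       SELECT MAX(s2.changed_at)
       FROM unit_state s2
       WHERE s2.unit_id = u.unit_id
         AND s2.changed_at <= TIMESTAMP '2013-06-01 00:00:00'   -- snapshot cut-off
     )
GROUP BY u.category, s.state;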
You might consider looking at Flexviews. It supports creating incrementally refreshable materialized views for MySQL. A materialized view is like a snapshot of a query that is updated periodically with the data which has changed. You can use materialized views to summarize on multiple attributes in different summary tables and keep these views transactionally consistent with each other. You can find some slides describing the functionality on slideshare.net
There is also Shard-Query which can be used in combination with InnoDB and MySQL partitioning, as well as supporting spreading data over many machines. This will satisfy both high update rates and will provide query parallelism for fast aggregation.
Of course, you can combine the two approaches.
