How to append the data to existing hive table without partition - hadoop

I have created a Hive table which contains historical stock data for the past 10 years. From now on I have to append data on a daily basis.
I thought of creating partitions based on date, but that leads to many partitions, approximately 3000, plus a new partition for every new date, and I don't think this is feasible.
Can anyone suggest the best approach to store all the historical data in the table and append the new data as it comes?

As for every partitioned table, the decision on how to partition your table depends primarily on how you are going to query the table.
Another consideration is how much data you're going to have per partition, as partitions should not be too small. As an absolute minimum, each one should be at least as big as one HDFS block, since otherwise you end up with too many directories full of tiny files.
This said, I don't think 3000 partitions would be a problem. At a previous job we had a huge table with one partition per hour, each hour was about 20Gbytes, and we had 6 months of data, so about 4000 partitions, and it worked just fine.
In our case, most people care the most about the last week and the last day.
I suggest that, as a first step, you research how the table is going to be used: will all 10 years be queried, or mostly just the most recent data?
As a second step, study how big the data is, consider whether it may grow with the new loads, and estimate how big each partition is going to be.
Once you've determined these two points, you can make a decision: you could just use daily partitions (which could be fine, as 3000 partitions is not bad), or you could partition weekly or monthly.
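As a rough illustration of the daily-partition option, here is a minimal HiveQL sketch; the table name, columns, and staging table are assumptions made up for the example, not part of the original question.
-- Hypothetical schema, for illustration only
CREATE TABLE stock_history (
    ticker      STRING,
    close_price DOUBLE,
    volume      BIGINT
)
PARTITIONED BY (trade_date STRING)
STORED AS ORC;

-- Each day's load goes into its own partition
INSERT INTO TABLE stock_history PARTITION (trade_date = '2024-01-15')
SELECT ticker, close_price, volume
FROM   staging_daily_load;   -- assumed staging table holding the day's file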

You can use this command
LOAD DATA LOCAL INPATH '<FILE_PATH>' INTO TABLE <TABLE_NAME>;
It will create new files under the HDFS directory mapped to the table name. Even without many partitions, you will still run into the too-many-small-files issue.
Periodically, you need to do this (a sketch follows the steps):
Create a stage table
Move the data by running a LOAD command from the target table's HDFS location into the stage table
Run an INSERT into the target table, selecting from the stage table
Now it will load the data with a number of files equal to the number of reducers
Drop the stage table
You can run this process at regular intervals (probably once a month).
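A rough HiveQL sketch of those steps, assuming a hypothetical unpartitioned table called stocks and an HDFS warehouse path that you would replace with your own:
-- Hypothetical names and path, for illustration only
CREATE TABLE stocks_stage LIKE stocks;

-- Move the accumulated small files from the target table's HDFS location into the stage table
LOAD DATA INPATH '/user/hive/warehouse/stocks' INTO TABLE stocks_stage;

-- Rewriting the data produces one file per reducer instead of one file per daily load
INSERT OVERWRITE TABLE stocks
SELECT * FROM stocks_stage;

DROP TABLE stocks_stage;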


What is the actual use of partitions in clickhouse?

It says partitions make it easier to drop or move data so that there is hit only on limited data. In various blogs it is suggested to use month as a partitioning key (toYYYYMM(date)). In many places it is also suggested to not have more than a couple of partitions. I am using clickhouse as a database to store time series data which do not undergo frequent deletions. What would be the advisable partitioning key for timeseries data of high volume? Does there have to be one if I do not want to perform deletes frequently?
In production I noticed that startup was very slow and I suspected that having too many partitions was the culprit. So I decided to test it by inserting time-series data fresh into a table (which created >2300 partitions for ~20 billion rows) by selecting data from another table (so that it doesn't have an opportunity to optimize the table). Immediately afterwards I dropped the original table and tried a restart. It finished quickly, in about 10s. This is the complete opposite of what I observed in production with 800GB+ of data (with many databases and tables, as opposed to my test node which had only one table).
Edit: As it was pointed out, I mixed up parts and partitions. Regarding startup time of clickhouse being affected, I'd better post another question.
This is a pretty common question, and for disclosure, I work at ClickHouse.
Partitions are particularly useful when you have timeseries data, as you noted. When determining the number of partitions, we often recommend a few guidelines:
The use of partitioning should be determined by a couple of questions about why you're using it:
are you generally going to query only a single partition? For example, if your queries are often for results within a one-day or one-month period, it could make sense to partition at that duration
do you want to "tier" or set a TTL on your data such that once a partition reaches an age of X (e.g., 91 days or 7 months), you want to do something special with it (e.g., TTL to lower-cost tier storage, back up and delete from ClickHouse, etc.)?
We often recommend keeping the number of partitions below around 100. Up to 1000 partitions can work, but it is suboptimal and will have some performance impact on the filesystem and on index/memory sizes, which can affect startup time as well as insert/query time.
Given these guidelines, I hope that helps with your question. It is most common to partition by day or by month, but since ClickHouse can manage large tables quite easily, you might want to move towards fewer partitions if possible; partitioning by month is probably the most common choice.
I didn't fully understand your test results so please feel free to expand. 2300 partitions sounds like too many but might work, just with some performance implications. Reducing your number of partitions (and therefore increasing the partition size) seems like a good recommendation.
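For concreteness, here is a minimal ClickHouse sketch of a monthly-partitioned time-series table with a TTL; the table name, columns, and retention period are assumptions for the example only.
-- Hypothetical table and columns, for illustration only
CREATE TABLE metrics
(
    ts     DateTime,
    device String,
    value  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)           -- monthly partitions keep the partition count low
ORDER BY (device, ts)
TTL ts + INTERVAL 13 MONTH DELETE;  -- rows older than 13 months are removed automatically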

vertica how restrict database size

Could you please help me with the following issue?
I have installed a Vertica cluster. I can't understand how I can restrict the database size by time or by size. For example, data older than 30 days must be deleted, or data must be deleted once the database reaches 100 GB (whichever comes first).
There is no automated way of doing this, and no logical way of "restricting database size". You can't just trim "data" from a "database".
What you are talking about (limiting data to the last 30 days) needs to be done at the table level. You would need some kind of date field and delete anything older than 30 days. However, I would advise against deleting rows in this way: it is non-performant and can cause queries against the table to be slow (see DELETE and UPDATE Performance Considerations). The best way of doing this would be to partition the table by day and create an automated script (bash, Python, etc.) that each day drops the partition corresponding to the date 30 days ago: see Dropping Partitions.
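A minimal Vertica sketch of that approach, assuming a hypothetical table events with a date column event_date; the DROP_PARTITIONS call is what the daily script would issue for the date that has just passed the 30-day mark (older Vertica versions use DROP_PARTITION instead).
-- Hypothetical table, for illustration only
CREATE TABLE public.events (
    event_date DATE NOT NULL,
    payload    VARCHAR(1000)
)
PARTITION BY event_date::DATE;

-- Run from a daily cron/script: drop the partition that is now 30 days old
SELECT DROP_PARTITIONS('public.events', '2024-01-01', '2024-01-01');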
As for deleting data—if the size of the "database" goes above 100GB—this requirement is extremely vague and would be impossible to enforce. Let's say you have 50 tables, and the size of several of those tables grows so that the total size of the database is over 100GB, how would you decide which table to prune? This also must be done on a table by table level (or in this case—technically—on a projection level, since that is where the data is actually stored).
To see the compressed size (size on disk) of the database you can use this query:
SELECT SUM(used_bytes) / ( 1024^3 ) AS database_size_gb
FROM projection_storage;
However, since data can only be deleted with a DELETE or DROP PARTITION statement on a table, it would also be helpful to see the size of each table. You can do this by using this query:
SELECT projection_schema, anchor_table_name, SUM(used_bytes) / ( 1024^3 ) AS table_size_gb
FROM projection_storage
GROUP BY 1, 2
ORDER BY 3 DESC;
From the results you can decide which tables you want to prune.
A couple of notes (as a Vertica DBA):
Data is stored in projections. Having too many projections on a single table can not only cause queries to be slow but will also increase the overall data footprint. Avoid using too many projections (especially superprojections: don't have more than two per table, and most tables will only need one). Use the database designer or follow the guidelines in the documentation for creating custom projections: Design Fundamentals.
Also, another trick to keep database size down is to use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function. Unless your projections were created with the database designer, they will likely only use the auto encoding. Using the DESIGNER_DESIGN_PROJECTION_ENCODINGS function will help you pick the most optimal encoding for each column. I have seen properly encoded projections take up a mere 2% of the disk size of the previously un-optimized projection. That is rare, but in my experience you will still see at least a 20-40% reduction in size. Do not be afraid to use this function liberally. It is one of my favorite tools as a Vertica DBA.

Cassandra lookup query is quite slow after deleting large bundle of data

Currently, I have a Cassandra column family with a large number of rows, say more than 100,000. Now I'd like to remove all data in this column family, and a problem came up:
After all the data is removed, I execute a lookup query on this column family, and Cassandra takes tens of seconds to return an empty result. The time cost increases linearly with the amount of original data.
It is caused by the tombstones created when deleting data from the Cassandra database. The lookup speed won't recover to normal until the next GC fires. See Cassandra Distributed Deletes.
Because such query operations are used frequently in my system, I cannot bear this huge latency of up to a few seconds.
Would you please give me a solution to this problem?
This sounds like a very bad way to use a database: populate it, empty it, repeat. One way you can solve your problem is by using different column family names each time: when you empty the data and start repopulating it, create a new column family, use that, and just drop the old column family. However, this is hacky.
I'd suggest using compaction (which gets rid of all the tombstones it can detect) to solve your problem. It is CPU intensive, but it's better than waiting tens of seconds for queries to respond. You can make the task less intensive on your machine by providing the specific keyspace and column family you want to compact:
./nodetool compact <ks_name> <cf_name>
Ritchard's point is a good one: gc_grace_seconds is set to 10 days by default, so you will probably have to tweak this to allow compaction to get rid of the tombstones.
@Fify
If your column family is frequently modified (read, then update, then read the update again...), you should use the leveled compaction strategy.
To make deleted columns get removed more quickly, lower the gc_grace_seconds property of your column family.
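If you are on a CQL-capable version, here is a CQL sketch of both suggestions; the keyspace/table names and the grace period are assumptions for the example only.
-- Hypothetical keyspace and table, for illustration only
ALTER TABLE my_ks.my_cf
WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
 AND gc_grace_seconds = 3600;   -- default is 864000 (10 days)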

Oracle Partitioned table is taking long time to fetch

I have a partitioned table based on date in an Oracle DB, where each partition has crores of records. The front-end application is built to search the data based on a date range (meaning it scans through multiple partitions). What is the best logic to get the data in the quickest time?
You should create local indexes, which work on partitions.
Normally we go for global indexes, which work on the whole table, while a local index is specific to a partition, which makes partition searches faster.
Check this link to see how local indexes work: http://docs.oracle.com/cd/E11882_01/server.112/e25523/partition.htm#i461446
If local indexes don't work, then query tuning might help. If that doesn't help, then you should look to redesign the schema.
EDIT:
Having said all that, one basic check is to ensure that your query is not scanning all partitions. This can be achieved by including the partition criteria [the date, in your case] as part of the WHERE clause.
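A small Oracle sketch of the two points above (the table, column, and index names are made up for the example): a local index is equipartitioned with the table, and including the partition key in the WHERE clause lets the optimizer prune partitions.
-- Hypothetical names, for illustration only
CREATE INDEX stock_hist_dt_ix
    ON stock_history (trade_date, ticker)
    LOCAL;

-- Including the partition key (the date) in the predicate enables partition pruning
SELECT *
  FROM stock_history
 WHERE trade_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-07';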
Interval partitioning may help. It makes partition management much easier, which then makes it reasonable to have thousands of partitions instead of just dozens or hundreds.
For example, if the current table is partitioned by month, a query for a week will need to read a lot of extra data. But if the table is partitioned by day, then almost no extra data will be scanned.
create table partition_test(a number primary key, b date)
partition by range (b) interval (interval '1' day)
(
partition p1 values less than (date '2000-01-01')
);
But even if this reduces the data per partition from crores to lakhs, that's still a lot of data for an application. Local indexes, as @loki suggested, may help.

TSQL Merge Performance

Scenario:
I have a table with roughly 24 million records. The table has pricing history related to individual customers and is computed daily. There are on average 6 million records for each day. Every morning the price list is generated and a merge statement is run to reflect the changes in pricing.
The merge statement begins with the previous day's data being inserted into a table variable; that table is then merged into the actual table. The main problem is that the merge statement takes quite a long time.
My real question centers on the performance of using a table variable vs. a physical table vs. a temp table. What is the best practice for large merges like this?
Thoughts
I'd consider a temp table: these have statistics, which will help. A table variable is always assumed to have one row. Also, the I/O can be shunted onto separate drives (assuming tempdb is on separate storage).
If a single transaction is not required, I'd also split the MERGE into a DELETE, UPDATE, INSERT sequence to reduce the amount of work needed in each action (which reduces the amount of rollback information needed, the amount of locking, etc.).
Temp tables often perform better than table variables for large data sets. Additionally you can put the data into the temp table and then index it.
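To make that concrete, here is a T-SQL sketch of staging the daily feed in an indexed temp table and applying it as separate UPDATE and INSERT steps instead of one large MERGE; all table and column names are assumptions for the example.
-- Hypothetical names, for illustration only
CREATE TABLE #daily_prices (
    customer_id INT   NOT NULL,
    product_id  INT   NOT NULL,
    price       MONEY NOT NULL
);

INSERT INTO #daily_prices (customer_id, product_id, price)
SELECT customer_id, product_id, price
FROM   dbo.daily_price_feed;          -- assumed source of the morning price list

-- Unlike a table variable, the temp table can be indexed and has statistics
CREATE CLUSTERED INDEX ix_daily ON #daily_prices (customer_id, product_id);

-- Update rows whose price changed
UPDATE t
SET    t.price = s.price
FROM   dbo.customer_prices AS t
JOIN   #daily_prices       AS s
  ON   s.customer_id = t.customer_id
 AND   s.product_id  = t.product_id
WHERE  t.price <> s.price;

-- Insert rows that do not exist yet
INSERT INTO dbo.customer_prices (customer_id, product_id, price)
SELECT s.customer_id, s.product_id, s.price
FROM   #daily_prices AS s
WHERE  NOT EXISTS (
    SELECT 1
    FROM   dbo.customer_prices AS t
    WHERE  t.customer_id = s.customer_id
      AND  t.product_id  = s.product_id
);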
Check if you have indexes on the tables. Indexes are updated every time you add or delete records in the table.
Try removing the indexes before merging the records and then re-creating them after the merge.
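A small T-SQL sketch of that idea, using hypothetical index/table names; note that only nonclustered indexes should be disabled this way, since disabling the clustered index makes the table inaccessible.
-- Hypothetical names, for illustration only
ALTER INDEX ix_customer_prices_product ON dbo.customer_prices DISABLE;

-- ... run the large MERGE / UPDATE / INSERT workload here ...

ALTER INDEX ix_customer_prices_product ON dbo.customer_prices REBUILD;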
