I have a huge amount of data to process every month. Until now I have been creating a range partition each month, followed by indexes on that partition. I want to eliminate the task of creating a partition every month by using range interval partitioning, but how can I also eliminate creating indexes every time for these monthly partitions?
I am using an Oracle database.
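If you combine interval partitioning with LOCAL indexes, Oracle maintains one index partition per table partition and adds the new index partitions automatically as each monthly partition is created, so neither step has to be done by hand. A minimal sketch, with made-up table, column, and index names:

-- Sketch only; names are illustrative.
-- With a LOCAL index, Oracle creates the matching index partition automatically
-- whenever interval partitioning creates a new monthly table partition.
create table monthly_data (
    id      number,
    load_dt date,
    payload varchar2(100)
)
partition by range (load_dt) interval (numtoyminterval(1, 'MONTH'))
(
    partition p_initial values less than (date '2015-01-01')
);

create index monthly_data_dt_ix on monthly_data (load_dt) local;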
I want to calculate the partition overhead for very small tables. I have an Oracle DB with 5,000 tables, ranging in size from roughly 10 KB to 1 TB. All of them are range partitioned on a DATE column. What I want to calculate is the difference in table size if I store all the data in 1 partition versus, say, 30 partitions. The block size is 16 KB.
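Each partition is its own segment, so for tiny tables the overhead is mostly the minimum extent each segment allocates (commonly 64 KB with autoallocate, or nothing at all while a partition stays empty under deferred segment creation). One way to measure it is to compare allocated bytes from the data dictionary; a rough sketch, assuming access to DBA_SEGMENTS and substituting your own schema and table name:

-- Allocated space per partition of one table
select partition_name, bytes, blocks
from   dba_segments
where  owner = 'MY_SCHEMA'
and    segment_name = 'MY_PARTITIONED_TABLE'
and    segment_type = 'TABLE PARTITION'
order  by partition_name;

-- Total for the partitioned table, to compare against a single-partition copy of the same data
select sum(bytes) / 1024 / 1024 as mb
from   dba_segments
where  owner = 'MY_SCHEMA'
and    segment_name = 'MY_PARTITIONED_TABLE';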
I have created a Hive table which contains historical stock data for the past 10 years. From now on, I have to append data on a daily basis.
I thought of partitioning based on date, but that leads to many partitions, approximately 3,000, plus a new partition for every new date; I don't think this is feasible.
Can anyone suggest the best approach to store all the historical data in the table and append the new data as it comes in?
As with every partitioned table, the decision on how to partition depends primarily on how you are going to query the table.
Another consideration is how much data you're going to have per partition, as partitions should not be too small. As an absolute minimum, each one should be at least as big as one HDFS block, since otherwise you end up with too many directories.
That said, I don't think 3,000 partitions would be a problem. At a previous job we had a huge table with one partition per hour; each hour was about 20 GB, and we had 6 months of data, so about 4,000 partitions, and it worked just fine.
In our case, most people care the most about the last week and the last day.
I suggest that, as a first step, you research how the table is going to be used: will all 10 years be queried, or mostly just the most recent data?
As a second step, study how big the data is, consider whether it may grow with the new loads, and see how big each partition is going to be.
Once you've determined these two points, you can make a decision: you could just use daily partitions (which could be fine, since 3,000 partitions is not bad), or you could go weekly or monthly.
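For instance, a monthly layout could look like the sketch below. The table and column names, and the stock_staging source table, are made up for illustration; the daily append routes rows into the right month via dynamic partitioning:

-- Illustrative only: one partition per month instead of one per day
CREATE TABLE stock_history (
    symbol      STRING,
    trade_date  DATE,
    close_price DOUBLE
)
PARTITIONED BY (trade_month STRING)
STORED AS ORC;

-- Daily append from a (hypothetical) staging table
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE stock_history PARTITION (trade_month)
SELECT symbol, trade_date, close_price,
       date_format(trade_date, 'yyyy-MM') AS trade_month
FROM   stock_staging;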
You can use this command
LOAD DATA LOCAL INPATH '<FILE_PATH>' INTO TABLE <TABLE_NAME>;
It will create new files under the HDFS directory mapped to the table name. Even if you do not have too many partitions, you will still run into a too-many-files issue.
Periodically, you need to do this:
1. Create a stage table.
2. Move the data from the target table into the stage table by running a LOAD command (LOAD DATA INPATH moves the target's files).
3. Run an INSERT into the target table selecting from the stage table.
4. Now the data is loaded back with the number of files equal to the number of reducers.
5. Drop the stage table.
You can run this process at regular intervals (probably once a month).
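A rough sketch of that periodic compaction cycle, assuming an unpartitioned target table called stock_history (the table names and warehouse path are illustrative):

-- 1. Create a stage table with the same layout as the target
CREATE TABLE stock_history_stage LIKE stock_history;

-- 2. Move the accumulated small files out of the target table's directory
--    (LOAD DATA INPATH moves the files, leaving the target empty)
LOAD DATA INPATH '/user/hive/warehouse/stock_history' INTO TABLE stock_history_stage;

-- 3. Write the data back; the insert rewrites it as a small number of larger files
INSERT INTO TABLE stock_history
SELECT * FROM stock_history_stage;

-- 4. Drop the stage table
DROP TABLE stock_history_stage;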
I would like to scan the entire HBase table and get the count of records added on a particular day, on a daily basis.
Since we do not have multiple versions of the columns, I can use the timestamp of the latest version (which will always be the only one).
One approach is to use MapReduce: the map scans all the rows and emits the timestamp (the actual date) as the key and 1 as the value; the reducer then counts per timestamp. The approach is essentially a group count keyed by timestamp.
Is there a better way of doing this? Once implemented, this job would run on a daily basis to verify the counts against other modules (the Hive table row count and the Solr document count). I use this as the starting point to identify any errors during the flow at the different integration points in the application.
I am using PostgreSQL 9.1 and I have a table consisting of 36 columns and almost 10.5 crore (105 million) records with a datetime stamp. On this table we have one composite primary key (DEVICE ID, text, and DT_DATETIME, timestamp without time zone).
Now, to improve query performance, we have partitioned the table day-wise on the DT_DATETIME field. After partitioning, I see that data retrieval takes more time than with the unpartitioned table. I have turned on the constraint_exclusion parameter in the config file.
Is there any solution for this?
Let me explain a little further.
I have 45 days of GPS data in a table of size 40 GB. Every second we insert at least 27 new records (about 2.5 million records a day). To keep the table at a steady 45 days, we delete the 45th day's data every night. This poses a problem for vacuuming the table, due to locking. If we had a partitioned table, we could simply drop the 45th day's child table.
So by partitioning we wanted to increase query performance as well as solve the locking problem. We have tried pg_repack, but twice the system load factor increased to 21 and we had to reboot the server.
Ours is a 24x7 system, so there is no downtime window.
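For reference, a minimal sketch of the day-wise inheritance setup described above, with assumed names (gps_data as the parent table, device_id and dt_datetime as the columns). Each child carries a CHECK constraint on dt_datetime so that constraint_exclusion can prune it, and retention becomes a DROP TABLE instead of DELETE plus VACUUM:

-- One child table per day, inheriting from the parent
CREATE TABLE gps_data_2014_06_01 (
    CHECK (dt_datetime >= DATE '2014-06-01'
       AND dt_datetime <  DATE '2014-06-02')
) INHERITS (gps_data);

-- Index each child on the columns the queries filter on
CREATE INDEX gps_data_2014_06_01_idx
    ON gps_data_2014_06_01 (device_id, dt_datetime);

-- Prune non-matching children at plan time (postgresql.conf or per session)
SET constraint_exclusion = partition;

-- Nightly retention: drop the oldest (45th) day's child instead of deleting rows
DROP TABLE gps_data_2014_04_17;

Note that exclusion only kicks in when the query's WHERE clause compares dt_datetime against constants the planner can check against the CHECK constraints; otherwise every child table is scanned, which is one common reason a partitioned setup ends up slower than the original table.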
Try using PgBouncer for connection management and memory management, or increase the RAM in your server.
I have a table partitioned by date in an Oracle DB, where each partition has crores (tens of millions) of records. The front-end application is built to search the data based on a date range (meaning it scans through multiple partitions). What is the best logic to get the data in the quickest time?
You should create local indexes, which work at the partition level.
Normally we go for global indexes, which cover the whole table, while a local index is specific to one partition, which makes searching within a partition faster.
Check this link to see how local indexes work: http://docs.oracle.com/cd/E11882_01/server.112/e25523/partition.htm#i461446
If local indexes don't help, then query tuning might. If that doesn't help either, you should look at redesigning the schema.
EDIT:
Having said all that, do one basic check: make sure your query is not scanning all partitions. This is achieved by including the partition criterion (the date, in your case) as part of the WHERE clause.
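A minimal sketch of both points, using made-up names (a table txn partitioned by txn_date):

-- A local index is equipartitioned with the table, so each partition gets its own small index
create index txn_date_lix on txn (txn_date) local;

-- Keep the partition key in the WHERE clause so only the relevant partitions are touched
select *
from   txn
where  txn_date between date '2014-01-01' and date '2014-01-31';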
Interval partitioning may help. It makes partition management much easier, which then makes it reasonable to have thousands of partitions instead of just dozens or hundreds.
For example, if the current table is partitioned by month, a query for a week will need to read a lot of extra data. But if the table is partitioned by day, then almost no extra data will be scanned.
create table partition_test(a number primary key, b date)
partition by range (b) interval (interval '1' day)
(
partition p1 values less than (date '2000-01-01')
);
But even if this reduces the data per partition from crores to lakhs (tens of millions of rows down to hundreds of thousands), that's still a lot of data for an application. Local indexes, as #loki suggested, may help.