I have to partition a table by date and hour, both derived from the resultdate field, which is in the format 2/5/2013 9:24:00 AM.
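I am using dynamic partitioning on date and hour and running a statement of this shape (<target_table> stands in for the destination table, which is not named here):
insert overwrite table <target_table> partition(date, hour)
select x, y, z, date, hour
from table1;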
I have about 1.5 million records, and it takes about 4 hours to complete. Is this normal, and what would be some ways to optimize it?
Increase the cluster size; otherwise it will take a long time.
This is not normal, unless you are working in a virtual machine with 1 node :). Try setting this flag:
set hive.optimize.sort.dynamic.partition=false;
I am not sure why it is set to true by default in some distros.
There are several things to check:
Check whether the Tez engine can be used to improve your execution time.
Check whether the file storage format can be changed; the RCFile format might help.
Tune hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode to suitable values.
Increasing the cluster size is also good (if feasible).
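To judge whether the dynamic-partition limits matter here, it can help to estimate how many (date, hour) partitions the data produces. The following is a minimal Python sketch, not Hive code. It assumes resultdate is month/day/year (the question does not say whether 2/5/2013 is February 5 or May 2), and partition_key and count_partitions are hypothetical helper names.

```python
from datetime import datetime

# Assumption: resultdate is month/day/year with a 12-hour clock,
# e.g. "2/5/2013 9:24:00 AM". Swap %m and %d if the data is day/month.
RESULTDATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def partition_key(resultdate):
    """Return the (date, hour) pair a row would be partitioned by."""
    parsed = datetime.strptime(resultdate, RESULTDATE_FORMAT)
    return parsed.strftime("%Y-%m-%d"), parsed.hour


def count_partitions(resultdates):
    """Count the distinct (date, hour) partitions in a collection of rows."""
    return len({partition_key(value) for value in resultdates})


sample = [
    "2/5/2013 9:24:00 AM",
    "2/5/2013 9:59:59 AM",   # same date and hour as the first row
    "2/5/2013 9:24:00 PM",   # same date, hour 21
    "2/6/2013 12:00:00 AM",  # next date, hour 0
]
print(partition_key(sample[0]))
print(count_partitions(sample))
```

Every distinct key becomes its own partition, so a year of hourly data would yield several thousand partitions.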
I've got a 3GB SQLite database file with a single table with 40 million rows and 14 fields (mostly integers and very short strings and one longer string), no indexes or keys or other constraints -- so really nothing fancy. I want to check if there are entries where a specific integer field has a specific value. So of course I'm using
SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)
I haven't got much experience with SQLite and databases in general and on my first test query, I was shocked that this simple query took about 30 seconds. Subsequent tests showed that it is much faster if a matching row occurs at the beginning, which of course makes sense.
Now I'm thinking of doing an initial SELECT DISTINCT barField FROM FooTable at application startup and caching the results in software. But I'm sure there must be a cleaner SQLite way to do this. I mean, that should be part of a DBMS's job, right?
But so far, I've only created primary keys for speeding up queries, which doesn't work here because the field values are non-unique. So how can I speed up this query so that it runs in roughly constant time? (It doesn't have to be lightning fast; I'd be completely fine if it took under one second.)
Thanks in advance for answering!
P.S. Oh, and there will be about 500K new rows every month for an indefinite period of time, and it would be great if that doesn't significantly increase query time.
Adding an index on barField should speed up the subquery inside the EXISTS clause:
CREATE INDEX barIdx ON FooTable (barField);
To satisfy the query, SQLite would only have to seek the index once and detect that there is at least one matching value.
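A quick way to check this is to ask SQLite for its query plan. The following is a minimal sketch using Python's built-in sqlite3 module with a small in-memory stand-in for FooTable. The note column, the row counts, and the helper names bar_exists and query_plan are assumptions made for illustration.

```python
import sqlite3

# Small in-memory stand-in for the real 40-million-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FooTable (barField INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO FooTable (barField, note) VALUES (?, ?)",
    [(i % 1000, "row") for i in range(10000)],  # barField values 0..999
)
conn.execute("CREATE INDEX barIdx ON FooTable (barField)")

QUERY = "SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)"


def bar_exists(value):
    """Return True if at least one row has barField equal to value."""
    return bool(conn.execute(QUERY, (value,)).fetchone()[0])


def query_plan(value):
    """Return the human-readable plan SQLite chooses for the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (value,)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


print(bar_exists(42))
print(bar_exists(5000))
print(query_plan(42))
```

If the plan text mentions barIdx, the lookup is an index search rather than a full table scan. The exact wording of the plan varies between SQLite versions.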
I cannot figure out how to increase the maximum number of entries per query. I would like to insert a thousand entries per query, and the default value is 100.
According to the doc, the parameter max_partitions_per_insert_block defines the limit of simultaneous entries.
I've tried to modify it from the ClickHouse client, but my insert still fails:
$ clickhouse-client
my-virtual-machine :) set max_partitions_per_insert_block=1000
SET max_partitions_per_insert_block = 1000
Ok.
0 rows in set. Elapsed: 0.001 sec.
Moreover, there is no max_partitions_per_insert_block field in the /etc/clickhouse-server/config.xml file.
After modifying max_partitions_per_insert_block, I tried to insert my data, but I'm stuck with this error:
infi.clickhouse_orm.database.ServerError: Code: 252, e.displayText() = DB::Exception: Too many partitions for single INSERT block (more than 100). The limit is controlled by 'max_partitions_per_insert_block' setting. Large number of partitions is a common misconception. It will lead to severe negative performance impact, including slow server startup, slow INSERT queries and slow SELECT queries. Recommended total number of partitions for a table is under 1000..10000. Please note, that partitioning is not intended to speed up SELECT queries (ORDER BY key is sufficient to make range queries fast). Partitions are intended for data manipulation (DROP PARTITION, etc). (version 19.5.3.8 (official build))
EDIT: I'm still stuck with this. I cannot even manually set the parameter to the value I want with SET max_partitions_per_insert_block = 1000: the value is changed but goes back to 100 after exiting and reopening clickhouse-client (even with sudo, so it does not look like a permission problem).
I figured it out after rereading the documentation, especially this document. In the web profile shown there, I recognized settings I had seen in the system.settings table. I inserted the following into my default profile, reloaded, and my insert of a thousand entries went well: <max_partitions_per_insert_block>1000</max_partitions_per_insert_block>
I guess it was obvious to some, but probably not to inexperienced people.
Most likely you should change the partitioning scheme. Each partition generates several files on the file system, which can put a heavy load on the OS. In addition, this may be the cause of long merges.
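To illustrate why the partitioning scheme matters, here is a minimal Python sketch that counts how many partitions a single insert block would touch under two partition keys. The question does not show the table definition, so this assumes a per-day key and a batch of 1000 rows with one row per day. daily_key, monthly_key, and partitions_in_block are hypothetical helpers, not ClickHouse functions.

```python
from datetime import date, timedelta


def daily_key(day):
    """One partition per calendar day (assumed to be the current scheme)."""
    return day.isoformat()


def monthly_key(day):
    """One partition per calendar month, e.g. 201901."""
    return day.year * 100 + day.month


def partitions_in_block(days, key):
    """Count the distinct partitions that one insert block would touch."""
    return len({key(day) for day in days})


# Hypothetical batch: 1000 rows, one per consecutive day from 2019-01-01.
start = date(2019, 1, 1)
block = [start + timedelta(days=offset) for offset in range(1000)]

print(partitions_in_block(block, daily_key))    # 1000
print(partitions_in_block(block, monthly_key))  # 33
```

Under these assumptions the same batch stays well under the default limit of 100 with a monthly key, without raising max_partitions_per_insert_block.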
Recently I used Oracle 11g database to do my homework. I had 12 tables, like trip_data_11 and trip_data_12.
They have same structure and the number of records is almost the same. I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
I did the same for trip_data_12.
Then I used the following select statement to count the number of taxis per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It took 38 seconds. When I changed the table name to SYS.TRIP_DATA_12, the problem appeared: the query ran for more than 2 hours.
What's more, it did not finish. I don't know why.
Today I asked my classmate and he said to clear the cache. So I used the following statements to do it.
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I use the same select statement for SYS.TRIP_DATA_11, I get the same poor performance as for SYS.TRIP_DATA_12. Why?
It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fall between -74.2593 and -73.7011? It might be a lot more than, say, between -71.00 and -68.59, even though that's a broader range. Understanding your data (its volume, its distribution and its skew) is crucial.
As a first step, learn how to use EXPLAIN PLAN. To get better plans, gather statistics on your tables and their indexes using the DBMS_STATS package.
One tip. Oracle typically uses only one index to access a table. So it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from an index designed to service this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add medallion to the index too.) Experiment with the column order. As latitude has a narrower range than longitude, it should probably go first; maybe the drop-off values should appear before the pick-up values. You want an index in which the greatest number of related records are clustered together.
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and @Justin's right: don't use SYS for doing application work. Even for a school assignment you should create a fresh schema and create your tables, etc. in that.
We have a database design in which a table uses 1-day interval partitioning on a column named "5mintime", and we have also created an index on the same column.
The "5mintime" column holds values such as 1-Mar-2011 and 2-Mar-2011; in short, there is no time component in it, and from the UI the user can select a period of one day at minimum.
My question is: when running select queries, is there any advantage gained from the index, given that the partitioning is already there? On the flip side, if I remove the index, inserts will become faster. Any help on this would be greatly appreciated.
If I understand you right, then I think there's no need for the index:
A local index is built separately for every partition, which in your case has the same value in all rows (i.e. 1-Mar-2011 in the 1-Mar-2011 partition, 2-Mar-2011 in the 2-Mar-2011 partition, and so on).
A global index will index the whole table, but each lookup will return a whole partition, which is also not useful since you already have partitions.
But, why not check it?
If each day's data goes into its own partition and you can never search within days, but only for entire days' worth of data, then, no, I don't see this index adding any value.
You can confirm whether or not SQL queries are using this index by enabling monitoring:
alter index myindex monitoring usage;
And then check to see if it's been used by querying v$object_usage for it some time later.
I have a function with 1 argument (date) which encapsulates 1 query like
SELECT COUNT(*)
FROM tbl
WHERE some_date_field BETWEEN param_date - INTERVAL '0 1:00:00' DAY TO SECOND
AND param_date
What I want to do is cache the result of this query somewhere with a TTL of 1 minute. The cached result should be shared across all sessions, not just the current one.
Any proposals?
PS: Yes, I know about the Oracle function result cache, but it doesn't fit the requirements.
PPS: Yes, we could add a second, artificial argument with a value such as the date in yyyymmddhh24mi format, so that it changes each minute and lets us use the function result cache, but I am hoping for a solution that lets me hide the caching dependencies inside the function.
I'd use a global application context, and a job with a refresh interval of 1 minute to set the context.
PS: INTERVAL '1' HOUR is shorter and more meaningful than INTERVAL '0 1:00:00' DAY TO SECOND
You want to cache the result of this query and share the cache across all sessions. The only way I can think of is to wrap the query in a function call and store the result in a small table. The function will query the small table to see if the count has already been stored within the last 1 minute, and if so, return it.
You would keep the table small by running a job periodically to delete rows in the "cache table" that are older than 1 minute - or better still, perhaps truncate it.
However, I can only see this being of benefit if the original SELECT COUNT(*) is a relatively expensive query.
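The cache-table idea can be sketched as follows. This is a minimal illustration in Python, with the built-in sqlite3 module standing in for the shared database; it is not Oracle PL/SQL. The table name count_cache, its columns, and the helpers make_cache, cached_count, and purge_expired are all hypothetical.

```python
import sqlite3

TTL_SECONDS = 60  # the 1-minute TTL from the question


def make_cache(conn):
    """Create the small shared cache table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS count_cache ("
        "cache_key TEXT PRIMARY KEY, cnt INTEGER NOT NULL, stored_at REAL NOT NULL)"
    )


def cached_count(conn, cache_key, compute, now):
    """Return the cached count if it is younger than the TTL, else recompute it."""
    row = conn.execute(
        "SELECT cnt, stored_at FROM count_cache WHERE cache_key = ?", (cache_key,)
    ).fetchone()
    if row is not None and now - row[1] < TTL_SECONDS:
        return row[0]  # fresh enough: skip the expensive query
    value = compute()  # run the expensive SELECT COUNT(*)
    conn.execute(
        "INSERT OR REPLACE INTO count_cache (cache_key, cnt, stored_at) VALUES (?, ?, ?)",
        (cache_key, value, now),
    )
    conn.commit()
    return value


def purge_expired(conn, now):
    """Periodic cleanup job: delete entries older than the TTL."""
    conn.execute("DELETE FROM count_cache WHERE ? - stored_at >= ?", (now, TTL_SECONDS))
    conn.commit()
```

Passing the current time in as now, rather than reading the clock inside the function, keeps the sketch easy to test. In a real deployment the timestamp would come from the database clock so that all sessions agree on it.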