I'm testing MonetDB as a solution for a data-science project. I have a table with 21K columns (all but three are features stored as 32-bit floats) and 6.5M rows, which may grow, perhaps up to 20M rows.
My aim is to use MonetDB's integrated Python to train models without exporting the data from the DB every time. In addition, I need to query specific columns, so the columnar storage can be a significant advantage.
I have compiled MonetDB 11.31.13 to gain the embedded Python support. OS is CentOS 7. Storage is not SSD. 48 core server with ~300GB of memory. I created a (unique) index on the table (without analyze).
I noticed that when I run
SELECT * FROM [TABLE_NAME] SAMPLE 50;
it takes a long long time to complete.
I then tried:
SELECT f1, f2, ..., f501 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f1001 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f2001 from [TABLE_NAME] SAMPLE 50;
...
SELECT * from [TABLE_NAME] SAMPLE 50;
I ran the queries locally with mclient, used time to measure how long each one took, and noticed two things:
There is a period where a single core is at 100% CPU. The more columns, the longer it takes to complete. Only when it finishes can I see all cores working, data being consumed, etc. In addition, during that time, the query does not appear in the result of
select * from sys.queue();
Eventually, the time needed to get 50 rows from the table was almost 4 hours.
Between successive steps in the test the number of columns doubles, but the time it takes to get a result triples.
So my question is:
Is this behaviour expected or does it reflect something I did wrong?
The data requested from the table should be around 4MB (50 * 21000 * 4Bytes), so this reflects a significant time waiting for such a small amount of data.
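For reference, the two figures quoted above can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic on the numbers in the question (50 rows, roughly 21,000 columns, 4 bytes per value, time tripling when columns double); it says nothing about MonetDB internals, and the function names are made up for illustration.

```python
import math

ROWS = 50             # SAMPLE 50
COLUMNS = 21_000      # approximate column count from the question
BYTES_PER_VALUE = 4   # 32-bit float

def sample_size_bytes(rows, columns, bytes_per_value):
    """Raw payload size of the sampled rows, ignoring any protocol overhead."""
    return rows * columns * bytes_per_value

def scaling_exponent(column_factor, time_factor):
    """Exponent k such that time grows like columns**k between two test steps."""
    return math.log(time_factor) / math.log(column_factor)

size = sample_size_bytes(ROWS, COLUMNS, BYTES_PER_VALUE)
exponent = scaling_exponent(column_factor=2, time_factor=3)

print(f"payload: {size / 1e6:.1f} MB")
print(f"time grows roughly like columns^{exponent:.2f}")
```

Doubling the columns and tripling the time corresponds to an exponent of about 1.58, i.e. the cost grows faster than linearly in the number of columns.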
Help is appreciated!
I want to ask about the number of Hive partitions and how it impacts performance.
Let me illustrate with a real example:
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions, and with the expected 1 year retention that comes to around 75K partitions, which I suppose is a huge number. When I checked, Hive can go up to 10K, but beyond that performance gets bad (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table are mainly:
50% of them will use the exact order of partitions.
25% will use only partition columns 1-3 and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with 1 month retention? Or would partitioning by start date alone be enough, assuming a normal distribution over the other 4 columns (say 500M / 250 partitions, which gives 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
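To make the partition arithmetic concrete, here is a small sketch in Python. It only multiplies the figures given in this thread (250 partitions per day, a 364-day year, a one-month retention); the function name is made up, and the per-day count for a 3-column scheme is not given in the question, so it is left as a parameter rather than guessed.

```python
def total_partitions(partitions_per_day, retention_days):
    """Partitions the metastore has to track for a given retention window."""
    return partitions_per_day * retention_days

# 5 partition columns: 250 partitions per day (figure from the question)
one_year = total_partitions(250, 364)
one_month = total_partitions(250, 30)

print(f"1 year : {one_year:,} partitions")
print(f"1 month: {one_month:,} partitions")
```

With fewer partition columns the first argument shrinks, and the total shrinks proportionally.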
Since the time a 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, which provides the motivation to keep that number low and describes how high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that the partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.
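The "names first, specs second" idea from the JIRA quote can be illustrated with a toy Python sketch. This is not Hive's code (Hive is Java and the real logic lives inside HMS and the PartitionPruner); the dictionary standing in for the metastore, the function name, and the exception class are all hypothetical.

```python
class TooManyPartitionsError(Exception):
    """Raised when a query would touch more partitions than allowed."""

def fetch_partition_specs(specs_by_name, name_filter, max_partitions):
    # Step 1: look only at the (cheap) partition names.
    names = [name for name in specs_by_name if name_filter(name)]
    # Step 2: refuse before loading any (expensive) partition spec.
    if len(names) > max_partitions:
        raise TooManyPartitionsError(
            f"{len(names)} partitions requested, limit is {max_partitions}"
        )
    # Step 3: only now fetch the full specs.
    return [specs_by_name[name] for name in names]

# Toy "metastore": 30 daily partitions with a made-up location each.
toy_metastore = {
    f"dt=2020-01-{day:02d}": {"location": f"/warehouse/t/dt=2020-01-{day:02d}"}
    for day in range(1, 31)
}

one_day = fetch_partition_specs(
    toy_metastore, lambda name: name.endswith("-01"), max_partitions=10
)
print(one_day)
```

A filter that matches all 30 partitions with the same limit of 10 raises TooManyPartitionsError instead of loading the specs.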
I'm fairly new to using Hadoop in production. I used Sqoop to bring a large table from a database into Hive. Sqoop created a comma-delimited text file and the corresponding table in Hive.
I then executed a create table new_table_orc stored as orc as select * from old_table_csv
Since a text file is about as inefficient as can be compared to ORC (binary data, column-wise storage for fat tables, compression, etc.), I expected a huge, orders-of-magnitude improvement, but the query execution time doesn't seem to have changed at all!
I used the same simple query on every version (text, ORC and even Parquet) and did the same thing when several of these tables were used in a join.
Additional info:
The main table I'm testing has around 430 million rows and around 50 columns.
I'm running a couple of queries:
select sum(col1) from my_table; <= 40 sec
select sum(col1) from my_table_orc; <= 31 sec
And
select distinct col2 from my_table where col3 = someval; <= 53 sec
select distinct col2 from my_table_orc where col3 = someval; <= 35 sec
I also enabled vectorization, as #sahil desai suggested, but it doesn't seem to have made a huge difference (it did reduce the time by a couple of seconds).
What is going on here, why am I not seeing orders of magnitude speedup? What more detail do you need?
In my experience ORC is faster. Using ORC files for every Hive table should be extremely beneficial for getting fast response times from your Hive queries. I think you have to enable vectorization. Vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at once instead of a single row each time.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
There are many other ways to improve Hive performance, such as Tez execution, cost-based query optimization (CBO), etc.
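The "batches of 1024 rows" idea mentioned above can be pictured with a small Python sketch. It is purely conceptual: Hive's vectorized execution is implemented in Java over column vectors, and the function names here are invented for illustration. The point is only that the same aggregate can be computed one row per step or one batch per step.

```python
BATCH_SIZE = 1024  # batch size quoted in the answer

def batches(values, batch_size=BATCH_SIZE):
    """Yield consecutive slices of at most batch_size values."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]

def sum_row_at_a_time(values):
    """One operator invocation per row."""
    total = 0
    for value in values:
        total += value
    return total

def sum_batched(values):
    """One operator invocation per batch; the inner sum runs over a whole batch."""
    return sum(sum(batch) for batch in batches(values))

data = list(range(5000))
number_of_batches = len(list(batches(data)))

print(sum_row_at_a_time(data) == sum_batched(data))
print(number_of_batches, "batches instead of", len(data), "row-level calls")
```

Both functions return the same result; what changes is how many times the per-step overhead is paid.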
I have a Postgres 9.6 installation and I am running into a weird case: if I rerun the same query (with multiple joins) after 10 to 15 minutes, the query cost has increased by a few hundred, and it keeps on increasing.
I do understand what vacuum and analyze do, but I am worried about the query cost, which starts increasing within a few minutes of running vacuum and analyze. I am afraid this might lead to future performance bottlenecks.
PS: I have two tables, one of which is heavily written to (about 5 million records) and the other heavily updated (70K records with PostGIS; this table mostly has updates on the lat, lon and geom columns).
Does this mean I should have autovacuum run every few hours?
Make autovacuum more aggressive; but if you think autovacuum is using up too many resources (judging by CPU and I/O usage), you could tweak the autovacuum_vacuum_cost_delay and autovacuum_vacuum_threshold parameters at the table level.
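To see what "aggressive" means in numbers: PostgreSQL triggers an autovacuum on a table once the number of dead rows exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (number of rows). The sketch below just evaluates that formula in Python for the two table sizes from the question. Note that autovacuum_vacuum_scale_factor is not mentioned in the answer above; it is brought in here because it is part of the formula, and the 0.01 value is only an example setting, not a recommendation.

```python
def autovacuum_trigger(row_count, threshold=50, scale_factor=0.2):
    """Dead-row count above which autovacuum vacuums a table.

    The defaults are PostgreSQL's stock values for
    autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
    """
    return threshold + scale_factor * row_count

updated_table_rows = 70_000      # heavily updated PostGIS table
written_table_rows = 5_000_000   # heavily written table

default_small = autovacuum_trigger(updated_table_rows)
default_large = autovacuum_trigger(written_table_rows)
# Example per-table override (assumed value): scale factor lowered to 1%
aggressive_small = autovacuum_trigger(updated_table_rows, scale_factor=0.01)

print(f"70K-row table, defaults     : {default_small:,.0f} dead rows")
print(f"5M-row table, defaults      : {default_large:,.0f} dead rows")
print(f"70K-row table, scale 0.01   : {aggressive_small:,.0f} dead rows")
```

With the defaults, the 70K-row table accumulates about 14,000 dead rows before autovacuum kicks in, which is a fifth of the table.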
I used to have a PostgreSQL 9.2 database with 3 tables:
A - contains 12 million records
B - contains 24 million records
C - contains 20 million records
Tables are connected like:
A (one to many) B
B (one to zero/one) C
I decided to archive/migrate older data to a 2nd database to speed up my main database (less data = better performance).
After migrating about 20% of the data from every table, I ran VACUUM ANALYZE on my main database tables to clean up a little.
I thought that the next 20% would be much faster to migrate... I was wrong. Each further percent of data takes longer and longer to archive.
I thought maybe VACUUM FULL is needed here, but I have read that it is not recommended. What is more, it is very slow and requires almost double the disk space (it creates a new table, then deletes the old one).
What can be the reason for slower processing despite less data being left? Am I missing some step that would speed up my database after migration? Some kind of cleanup other than VACUUM ANALYZE?
I should add that I measured the processing time of 3 steps: selecting the data to copy from the main database, inserting it into the 2nd database, and deleting it from the main database.
Selecting the data is the problem.
About archiving process:
I select the A rows older than x days, copy them, then remove them.
Then I select the B rows connected to the A rows selected before, copy them, then remove them.
Last, I select the C rows connected to the B rows selected before, copy them, then remove them.
Conf:
8GB RAM.
max_connections = 100
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 32MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 2.0
Try to figure out where the time is spent. Is it the SELECT to find the rows in B and C? Is it the DELETE?
Once you have found the problematic statement, look at the EXPLAIN (ANALYZE) output for it; it will tell you where the time is spent.
Deleting rows from a table does not make it smaller and does not necessarily speed up queries on the table. What may help is VACUUM (FULL), particularly if there are sequential scans. You don't have to run it on all tables in the database; if space is a problem, you can run it on one table after the other.
But first look at the execution plans to see if that will help at all.
oracle version:10.2.0.4.0
table: va_edges_detail_temp
The fields are the following:
source_label: varchar2
target_label: varchar2
edge_weight: number
The following query:
select v.*, level
from va_edges_detail_temp v
start with v.source_label = 'smith'
connect by nocycle prior v.target_label = v.source_label
order by level;
When there are 552 rows in the table it only takes 0.005 seconds.
When there are 6600 rows in the table, execution never finishes. I waited for hours, but it does not finish, returns no result but shows no error either.
What's the matter?
Well, it's too broad a question.
In general it depends on your data, and on the number of rows produced by connecting the rows in va_edges_detail_temp. It may be n^2 or n^4 or even n!.
In any case, it may or may not increase dramatically.
Another part of performance is memory size. If the resulting row set fits into RAM, Oracle does the work in memory. If not, Oracle will try to spill the data to the hard drive, which is generally a time-expensive operation.
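To illustrate how the result size can explode, here is a small Python sketch that counts how many rows a path-walking query would return on a made-up graph. It is a simplification, not Oracle's algorithm: it assumes an acyclic, layered graph (so NOCYCLE never has anything to cut) and counts one result row per edge path starting at 'smith'. The graph shape and the function names are hypothetical and not taken from the actual table.

```python
from collections import defaultdict
from functools import lru_cache

def count_result_rows(edges, start):
    """Count edge paths starting at `start` in an acyclic graph."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    @lru_cache(maxsize=None)
    def rows_from(node):
        # One result row per outgoing edge, plus every row reachable through it.
        return sum(1 + rows_from(child) for child in children[node])

    return rows_from(start)

def layered_edges(width, depth):
    """'smith' links to layer 0; every node links to every node of the next layer."""
    edges = [("smith", f"n0_{i}") for i in range(width)]
    for layer in range(depth - 1):
        edges += [
            (f"n{layer}_{i}", f"n{layer + 1}_{j}")
            for i in range(width)
            for j in range(width)
        ]
    return edges

small = layered_edges(width=10, depth=6)
large = layered_edges(width=10, depth=12)

print(len(small), "table rows ->", count_result_rows(small, "smith"), "result rows")
print(len(large), "table rows ->", count_result_rows(large, "smith"), "result rows")
```

In this toy graph, roughly doubling the number of table rows multiplies the number of result rows by a factor of about a million, which is the kind of growth that can turn milliseconds into hours.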