ORC not faster than CSV in Hive? - hadoop

I'm fairly new to using Hadoop in production. I used Sqoop to bring a large table from a database into Hive. Sqoop created a comma-delimited text file and the corresponding table in Hive.
I then executed a create table new_table_orc stored as orc as select * from old_table_csv
Since a text file is about as inefficient as it gets compared to ORC (binary data, column-wise storage for wide tables, compression, etc.), I expected a huge, orders-of-magnitude improvement, but the query execution time doesn't seem to have changed at all!
I ran the same simple query on each version (text, ORC, and even Parquet), and did the same thing when several of these tables were used in a join.
Additional info:
The main table I'm testing has around 430 million rows and around 50 columns.
I'm running a couple of queries:
select sum(col1) from my_table; <= 40 sec
select sum(col1) from my_table_orc; <= 31 sec
And
select distinct col2 from my_table where col3 = someval; <= 53 sec
select distinct col2 from my_table_orc where col3 = someval; <= 35 sec
I also enabled vectorization, as sahil desai suggested, but it doesn't seem to have made a huge difference (it did reduce the time by a couple of seconds).
What is going on here, why am I not seeing orders of magnitude speedup? What more detail do you need?

In my experience ORC is faster. Using ORC files for every Hive table should be extremely beneficial for getting fast response times on your Hive queries. I think you have to enable vectorization: vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at a time instead of a single row at a time.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
There are many other ways to improve Hive performance, like Tez execution, cost-based query optimization (CBO), etc.
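For reference, a minimal sketch of the session-level Hive settings involved (the vectorization settings are the ones quoted above; the Tez and CBO settings are standard Hive properties, shown here as an illustration):

```sql
-- Vectorized execution: process rows in batches of 1024
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

-- Use Tez instead of classic MapReduce as the execution engine
set hive.execution.engine = tez;

-- Enable cost-based optimization and let it use column statistics
set hive.cbo.enable = true;
set hive.compute.query.using.stats = true;
set hive.stats.fetch.column.stats = true;
```

Note that CBO only helps if table and column statistics have actually been gathered (e.g. via `ANALYZE TABLE ... COMPUTE STATISTICS`).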

Related

What to do when the SQL index does not solve the problem of performance?

This query ONE
SELECT * FROM TEST_RANDOM WHERE EMPNO >= '236400' AND EMPNO <= '456000';
in the Oracle Database is running with cost 1927.
And this query TWO:
SELECT * FROM TEST_RANDOM WHERE EMPNO = '236400';
is running with cost 1924.
The table TEST_RANDOM has 1,000,000 rows. I created it like this:
Create table test_normal (empno varchar2(10), ename varchar2(30), sal number(10), faixa varchar2(10));
Begin
For i in 1..1000000
Loop
Insert into test_normal values(
to_char(i), dbms_random.string('U',30),
dbms_random.value(1000,7000), 'ND'
);
If mod(i, 10000) = 0 then
Commit;
End if;
End loop;
End;
Create table test_random
as
select /*+ append */ * from test_normal order by dbms_random.random;
I created a B-Tree index in the field EMPNO so:
CREATE INDEX IDX_RANDOM_1 ON TEST_RANDOM (EMPNO);
After this, the query TWO improved, and the cost changed to 4.
But query ONE did not improve, because Oracle ignored the index: for some reason, Oracle decided that an execution plan using the index was not worth it for this query...
My question is: what could we do to improve the performance of query ONE? The index did not solve the problem, and the cost remains expensive...
For this query, Oracle does not use an index because the optimizer correctly estimated the number of rows and correctly decided that a full table scan would be faster or more efficient.
B-Tree indexes are generally only useful when they can be used to return a small percentage of rows, and your first query returns about 25% of the rows. It's hard to say what the ideal percentage of rows is, but 25% is almost always too large. On my system, the execution plan changes from full table scan to index range scan when the query returns 1723 rows - but that number will likely be different for you.
There are several reasons why full table scans are better than indexes for retrieving a large percentage of rows:
Single-block versus multi-block: In Oracle, like in almost all computer systems, it can be significantly faster to retrieve multiple chunks of data at a time (sequential access) instead of retrieving one random chunk of data at a time (random access).
Clustering factor: Oracle stores rows in blocks, which are usually 8KB and are analogous to pages. If the index is very inefficient (for example, built on randomly ordered data, so that two sequential index reads rarely hit the same block), then reading 25% of the rows through the index may still require reading 100% of the table's blocks.
Algorithmic complexity: A full table scan reads the data as a simple heap, which is O(N). A single index access is much faster, at O(log N). But as the number of index accesses grows, the benefit wears off, until eventually using the index costs O(N log N).
Some things you can do to improve performance without indexes:
Partitioning: Partitioning is the ideal solution for retrieving a large percentage of data from a table (but the option must be licensed). With partitioning, Oracle splits the logical table into multiple physical segments, and the query reads only from the required partitions. This keeps the benefit of multi-block reads while still limiting the amount of data scanned.
Parallelism: Make Oracle work harder instead of smarter. But parallelism probably isn't worth the trouble for such a small table.
Materialized views: Create tables that only store exactly what you need.
Ordering the data: Improve the index clustering factor by sorting the table data by the relevant column instead of doing it randomly. In your case, replace order by dbms_random.random with order by empno. Depending on your version and platform, you may be able to use a materialized zone map to keep the table sorted.
Compression: Shrink the table to make it faster to read the whole thing.
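The "ordering the data" idea above could be sketched like this (plain Oracle SQL; `test_sorted` and `idx_sorted_1` are illustrative names, the source table and column come from the question):

```sql
-- Rebuild the table sorted by EMPNO instead of randomly, so that an index
-- range scan on EMPNO touches far fewer distinct table blocks
-- (i.e. a much better clustering factor)
CREATE TABLE test_sorted AS
SELECT * FROM test_normal ORDER BY empno;

CREATE INDEX idx_sorted_1 ON test_sorted (empno);

-- Gather fresh statistics so the optimizer sees the new clustering factor
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TEST_SORTED');
```

With the data physically clustered by EMPNO, the optimizer is far more likely to choose the index for range predicates like `EMPNO >= '236400' AND EMPNO <= '456000'`, because the rows it needs sit in a small contiguous set of blocks.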
That's quite a lot of information for what is possibly a minor performance problem. Before you go down this rabbit hole, it might be worth asking if you actually have an important performance problem as measured by a clock or by resource consumption, or are you just fishing for performance problems by looking at the somewhat meaningless cost metric?

ClickHouse: Should I optimize a MergeTree table manually?

I have a table like:
create table test (id String, timestamp DateTime, somestring String) ENGINE = MergeTree ORDER BY (id, timestamp)
I inserted 100 records, then inserted another 100 records, and ran a select query.
select * from test returned 2 parts; each had 100 rows and was ordered within itself. Then I ran optimize table test, and the query started returning 1 part with 200 ordered rows. So should I run an optimize query after every insert, and does it improve the performance of selects like select count(*) from test where id = 'foo'?
Merges are eventual and may never happen. It depends on the number of inserts that happened afterwards, the number of parts in the partition, and the sizes of the parts. If the total size of the input parts is greater than the maximum part size, they will never be merged.
It is very unreasonable to constantly merge down to a single part.
The merger does not have that goal. On the contrary, the goal is to reach the minimum number of parts with the smallest number of merges. Merges consume a huge amount of disk and processor resources.
It makes no sense to spend 3 hours merging two 300GB parts into one 600GB part. The merger has to read and decompress 600GB, merge it, compress it, and write it back, and after all that the performance of selects will not grow at all, or will grow only minimally.
Usually not, you can rely on Clickhouse background merges.
Also, ClickHouse does not try to merge all the data in a partition into one part, because such "over-optimization" can hurt performance too.
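If you want to watch what the background merges have actually done, you can inspect the parts directly rather than forcing a merge (`system.parts` is ClickHouse's parts metadata table; the table name comes from the question):

```sql
-- List the active parts of the table; fewer, larger parts appear over time
-- as background merges complete
SELECT partition, name, rows, bytes_on_disk
FROM system.parts
WHERE table = 'test' AND active
ORDER BY partition, name;
```

Running this after a few inserts and again some minutes later usually shows the part count shrinking on its own, without any manual OPTIMIZE.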

MonetDb performance on a large table with 20K columns

I'm testing MonetDB as a solution for a data-science project. I have a table of 21K columns and 6.5M rows (which may or may not grow, perhaps up to 20M rows); all but three of the columns are features stored as 32-bit floats.
My aim is to use the integrated Python on MonetDB to achieve the ability to train without exporting the data from the DB every time. In addition, queries on specific columns are necessary so the columnar storage can be a significant advantage.
I compiled MonetDB 11.31.13 to get the embedded Python support. The OS is CentOS 7, storage is not SSD, and it is a 48-core server with ~300GB of memory. I created a unique index on the table (without ANALYZE).
I noticed that when I
SELECT * FROM [TABLE_NAME] SAMPLE 50;
it takes a very long time to complete.
I then tried:
SELECT f1, f2, ..., f501 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f1001 from [TABLE_NAME] SAMPLE 50;
SELECT f1, f2, ..., f2001 from [TABLE_NAME] SAMPLE 50;
...
SELECT * from [TABLE_NAME] SAMPLE 50;
I ran the queries locally with mclient and used time to measure the amount of time it took and I noticed two things:
There is a period where a single core sits at 100% CPU, and the more columns there are, the longer this phase takes. Only when it finishes do I see all the cores working, data being consumed, etc. In addition, during that time the query does not appear in the result of
select * from sys.queue();
Eventually, the time needed to get 50 rows from the table was almost 4 hours.
Between each step in the test the number of columns doubles, but the time it takes to get a result triples.
So my question is:
Is this behaviour expected or does it reflect something I did wrong?
The data requested from the table should be around 4MB (50 * 21000 * 4 bytes), so this is a very long time to wait for such a small amount of data.
Help is appreciated!

How to modify this hive query to increase number of reducers allocated?

I am trying to execute this query in Hive, but it takes forever to run, especially after reaching the reducer step. It says mappers: 451, reducers: 1.
create table mb.ref201501_nontarget as select * from adv.raf_201501 where target=0 limit 200000;
My motivation to change the query came from this answer:
Hive unable to manually set number of reducers
I tried changing the query to:
create table mb.ref201501_nontarget as select * from (select * from adv.raf_201501 limit 200000) where target=0;
but it's throwing an error.
This question is very vague. If you think the last query produces the proper result (note that it is not the same as the first one!), this should do the trick:
create table mytmptbl as select * from adv.raf_201501 limit 200000;
create table mb.ref201501_nontarget as select * from mytmptbl where target=0;
After which you probably want to delete the temporary table again.
Hadoop is a framework for distributed computing. Some data processing actions are a good fit because they are "embarrassingly parallel". Some data processing actions are a bad fit because they cannot be distributed. Most real-life cases are somewhere in between.
I strongly suspect that what you want to do is get a sample of the raw data with approximately 200k items. But your query requires exactly 200k items.
The simplest way for Hive to do that would be to run the WHERE clause in parallel (451 Mappers on 451+ file blocks), then dump all partial results into a single "sink" (1 Reducer) that lets the first 200k rows pass through and ignores the rest. But all records will be processed, even the ones to be discarded.
Bottom line: you have a very inefficient sampler, and the result will probably have a strong bias -- smaller file blocks will be Mapped faster and processed earlier by the Reducer, hence larger file blocks have almost no chance to be represented in the sample.
I guess you know how many records match the WHERE clause, so you would be better off with some kind of random sampling that retrieves approx. 500K or 1M records -- that can be done up front, inside each Mapper -- then a second query with the LIMIT if you really want an arbitrary number of records -- a single Reducer will be OK for this kind of smallish volume.
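That two-step approach could be sketched as follows (table and column names reused from the question; the intermediate table name and the 0.005 sampling fraction are illustrative, and this assumes rand() is evaluated per row on the map side):

```sql
-- Step 1: map-side random sample of the matching rows, roughly 0.5% of them;
-- no exact count is required, so no single-reducer bottleneck
create table mb.ref201501_sample as
select * from adv.raf_201501
where target = 0 and rand() < 0.005;

-- Step 2 (only if an exact row count is really needed): the single reducer
-- now sees only the small sample, so the LIMIT is cheap
create table mb.ref201501_nontarget as
select * from mb.ref201501_sample limit 200000;
```

The key point is that step 1 filters and samples in parallel, so only a smallish result set ever reaches a reducer.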
OK, this is what worked for me. It now takes only 2-5 minutes for about 27M records:
create table mb.ref201501_nontarget as SELECT * FROM adv.raf_201501 TABLESAMPLE(0.02 PERCENT) where target=0;
When using limit or rand(), it uses at least 1 reducer, the process takes more than 2 hours, and it kind of freezes at the 33% mark of the reduce step.
With TABLESAMPLE and no limit, it assigned only 1 mapper and 0 reducers.

Performance tuning - Query on partitioned vs non-partitioned table

I have two queries that are identical except that one involves a partitioned table and the other involves its non-partitioned equivalent. The original (non-partitioned) query performs better than the partitioned counterpart, and I am not sure how to isolate the problem. Looking at the execution plans, I find that the indexes used are the same between the two queries, and the new query shows a PARTITION RANGE step in its plan, suggesting that partition pruning is taking place. The query is of the following form:
Select rownum, <some columns>
from partTabA
inner join tabB on condition1
inner join tabC on condition2
where partTabA.column1=<value> and <other conditions>
and partTabA.column2 in (select columns from tabD where conditions)
where partTabA is the partitioned table and partTabA.column1 is the partitioning key (range partitioned). In the original query, this table is replaced by its non-partitioned equivalent. What should I look at to find out why the new query performs badly? The only tool I have is Oracle SQL Developer.
PARTITION RANGE ITERATOR does not necessarily mean that partition pruning is happening.
You'll also want to look at the Pstart and Pstop in the explain plan, to see which partitions are being used.
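A quick way to get a plan that includes those columns (standard Oracle, using DBMS_XPLAN; the query text here is just a placeholder for the real one from the question):

```sql
-- Generate and display a plan showing the Pstart/Pstop columns
EXPLAIN PLAN FOR
SELECT * FROM partTabA WHERE column1 = :val;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- In the output: literal partition numbers in Pstart/Pstop mean the optimizer
-- pruned at parse time; KEY means pruning is deferred to run time; a range
-- spanning every partition means no pruning is happening at all.
```

SQL Developer's "Explain Plan" button produces the same information in its plan grid.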
There are several potential reasons the partitioned query will be slower, even though it's reading the same data. (Assuming that the partitioned query isn't properly pruning, and is reading from the whole table.)
Reading from multiple local indexes may be much less efficient than reading from a single, larger index.
There may be a lot of wasted space from large initial segment sizes, a large number of partitions, etc. Compare the segment sizes with this: select * from dba_segments where segment_name in ('PARTTABA', 'TABA'); If that's the issue, you may want to look into your tablespace settings, or using deferred segment creation.
I believe you're dealing with partitioning overhead: if you have a partitioned table, Oracle first has to work out which partitions to scan.
Could you paste both execution plans here? How large are the tables? How selective are the indexes used?
Did you try to gather statistics?
You may also try to look into trace file to see what's going on.