How to modify this hive query to increase number of reducers allocated? - hadoop

I am trying to execute this query using hive, but it takes forever to run, especially after going to the reducer step. It say mappers:451, reducers:1.
create table mb.ref201501_nontarget as select * from adv.raf_201501 where target=0 limit 200000;
My motivation to change the query came from this answer:
Hive unable to manually set number of reducers
I tried changing the query to:
create table mb.ref201501_nontarget as select * from (select * from adv.raf_201501 limit 200000) where target=0;
but its throwing error.

This question is very vague, if you think the last query produces the proper result (note that it is not the same as the first one!!) this should do the trick:
create table mytmptbl = select * from advanl.raf_201501 limit 200000;
create table mbansa001c.ref201501_nontarget as select * from (mytmptbl ) where target=0;
After which you probably want to delete the temporary table again.

Hadoop is a framework for distributed computing. Some data processing actions are a good fit because they are "embarrassingly parallel". Some data processing actions are a bad fit because they cannot be distributed. Most real-life cases are somewhere in between.
I strongly suspect that what you want to do is get a sample of the raw data with approximately 200k items. But your query requires exactly 200k items.
The simplest way for Hive to do that would be to run the WHERE clause in parallel (451 Mappers on 451+ file blocks) then dump all partial results in a single "sink" (1 Reducer) that lets the first 200k rows to pass through and ignore the rest. But all records will be processed, even the ones to be ignored.
Bottom line: you have a very inefficient sampler, and the result will probably have a strong bias -- smaller file blocks will be Mapped faster and processed earlier by the Reducer, hence larger file blocks have almost no chance to be represented in the sample.
I guess you know how many records match the WHERE clause, so you would be better off with some kind of random sampling that retrieves approx. 500K or 1M records -- that can be done up front, inside each Mapper -- then a second query with the LIMIT if you really want an arbitrary number of records -- a single Reducer will be OK for this kind of smallish volume.

Ok. this is what worked for me. now taking only 2-5 minutes for about 27m records:
create table mb.ref201501_nontarget as SELECT * FROM adv.raf_201501 TABLESAMPLE(0.02 PERCENT) where target=0;
When using limit or rand(), it uses at least 1 reducers and the process takes more than 2 hours and kinda freezes at 33% reducing step.
In Tablesample without limit it assigned only 1 mapper and 0 reducer.

Related

Clickhouse: Should i optimize MergeTree table manually?

I have a table like:
create table test (id String, timestamp DateTime, somestring String) ENGINE = MergeTree ORDER BY (id, timestamp)
i inserted 100 records then inserted another 100 records and i run select query
select * from test clickhouse returning with 2 parts their lengths are 100 and they are ordered in themselves. Then i run the query optimize table test and it started to return with 1 part and its length is 200 and ordered. So should i run optimize query after all insert and does it increase select query performance like select count(*) from test where id = 'foo' ?
Merges are eventual and may never happen. It depends on the number of inserts that happened after, the number of parts in the partition, size of parts. If the total size of input parts are greater than the maximum part size then they will never be merged.
It is very unreasonable to constantly merge up to one part.
Merger does not have such goal. In the contrary the goal is to have the minimum number of parts withing smallest number of merges. Merges consume the huge amount of disk and processor resources.
It makes no sense to merge two 300GB parts into one 600GB part for 3 hours. Merger have to read, decompress 600GB, merge, compress, write them back, after that the performance of the selects will not grow at all or will grow minimally.
Usually not, you can rely on Clickhouse background merges.
Also, Clickhouse has no intention to merge all the data from the partition into one part file, because "over-optimization" can affect performance too

Hive partition scenario and how it impacts performance

I want to ask regarding the hive partitions numbers and how they will impact performance.
let me reflect this on a real example;
I have am external table that is expecting to have around 500M rows per day from multiple sources, and it shall have 5 partition columns.
for one day, that resulted in 250 partitions and expecting to have 1 year retention that will get around 75K.. which i suppose it is a huge number as when i checked, hive can go to 10K but after that the performance is going to be bad.. (and some one told me that partitions should not exceed 1K per table).
Mainly the queries that will select from this table
50% of them shall use the exact order of partitions..
25% shall use only 1-3 partitions and not using the other 2.
25% only using 1st partition
So do you think even with 1 month retention this may work well? or only start date can be enough.. assuming normal distribution the other 4 columns ( let's say 500M/250 partitions, for which we shall have 2M row for each partition).
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
Since the time a 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, that provides the motivation to keep that number low, and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and statistics gathered per partition (always recommended to have for efficient querying), is what constitutes the bulk of data HMS should store and cache for good performance.

JDBC Update millions of records on Oracle database

I have a design issue. I have a database with millions of records which I need to update.
We will use JDBC because we have to do some processing to calculate new fields values.
It is a one go, and I will not need it any more. So I was thinking about something simple. I wanted to create new tables and delete the old ones, but the DBA do not want to, because the need for storage would be huge.
I will have to process about 80 millions rows, and for each row to update 3 fields.
Would a simple jdbc approach, with a setFetchSize(1000) for example, would work?
I mean select a, b, c from mutable for update;
then the update ...
Would a JDBC program be able to support the workload?
I was also thinking about using SpringBatch or EasyBatch. But I am wondering if it is worth investigating time in this for just one go (and some very short timelines).
What is your experience with this?
I think you can do this in JDBC. I would suggest something like the following:
Create two or three threads. Each thread does the following
Create a connection.
Create a prepared statement that retrieves a disjoint subset of the rows
Set the fetch size to 100 or so. Definitely less than 1000.
Create an update statement
Execute the query
Iterate over the result set
For each row add batch to update the row
After fetch size rows execute the batch
Lets assume the fetch size is 100. The first execute will do a round trip which takes time. While that's happening go run another thread. When an execute returns processing the next 100 rows does not do a database round trip. The rows have already been fetched and the updates are being batched so this will not do a database round trip. After 100 rows execute the batch which will do a round trip and so will switch threads. Then it will fetch 100 more rows which will switch threads. I'm not sure whether two or three threads would be optimal but if I had to guess I'd try three.
But the above assumes the machine only has a single hardware thread which is not true. Most CPUs support 12 or more hardware threads so I would actually use 30 or so threads depending on what the hardware can support. Even with multiple CPUs you probably don't want more than 50 or so threads as that will start to introduce contention in the database.
The above assumes the external service is fast, much faster than the database. If not then processing each row is going to wait for the external service. In that case more threads. Since the updates will hit the database more slowly thread contention in the database is less of a concern.
One way to partition the query results into disjoint subsets is as follows:
SELECT c1, c2, etc, row
FROM (SELECT c1, c2, etc, ROWNUM FROM ...)
WHERE MOD(row, number_of_partitions) = ?
Then set the query param from 0 to number_of_partitions - 1, one for each thread. You have to do this as a subquery to get ROWNUM to work right.
Do not use updatable result sets. The performance will be abysmal, guaranteed.

pig script to sample 10 chunks of training data, pig script is jammed

BACKGROUND
I have a binary classification task where the data is highly imbalanced. Specifically, there are
way more data with label 0 than that with label 1. In order to solve this problem, I plan to subsampling
data with label 0 to roughly match the size of data with label 1. I did this in a pig script. Instead of
only sampling one chunk of training data, I did this 10 times to generate 10 data chunks to train 10 classifiers
similar to bagging to reduce variance.
SAMPLE PIG SCRIPT
---------------------------------
-- generate training chunk i
---------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData '$RATIO';
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunk1,labelOneTrainingData;
-- join two tables to get all the features back from table 'dataFeatures'
trainingChunkiFeatures = JOIN trainingChunkiRaw BY id, dataFeatures BY id;
-- in order to shuffle data, I give a random number to each data
trainingChunki = FOREACH trainingChunkiFeatures GENERATE
trainingChunkiRaw::id AS id,
trainingChunkiRaw::label AS label,
dataFeatures::features AS features,
RANDOM() AS r;
-- shuffle the data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
-- store this chunk of data into s3
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
In my real pig script, I do this 10 times to generate 10 data chunks.
PROBLEM
The problem I have is that if I choose to generate 10 chunks of data, there are so many mapper/reducer tasks, more than 10K. The majority of
mappers do very little things (runs less 1 min). And at some point, the whole pig script is jammed. Only one mapper/reducer task could run and all other mapper/reducer tasks are blocked.
WHAT I'VE TRIED
In order to figure out what happens, I first reduced the number of chunks to generate to 3. The situation was less severe.
There were roughly 7 or 8 mappers running at the same time. Again these mappers did very little things (runs about
1 min).
Then, I increased the number of chunks to 5, at this point, I observed the the same problem I have when I set the number of chunks
to be 10. At some point, there was only one mapper or reducer running and all other mappers and reducers were blocked.
I removed some part of script to only store id, label without features
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
STORE trainingChunkiRaw INTO '$training_data_i_s3_path' USING PigStorage(',');
This worked without any problem.
Then I added the shuffling back
--------------------------------------------------------------------------
-- generate training chunk i
--------------------------------------------------------------------------
-- subsampling data with label 0
labelZeroTrainingDataChunki = SAMPLE labelZeroTrainingData $RATIO;
-- combine data with label 0 and label 1
trainingChunkiRaw = UNION labelZeroTrainingDataChunki, labelOneTrainingData;
trainingChunki = FOREACH trainingChunkiRaw GENERATE
id,
label,
features,
RANDOM() AS r;
-- shuffle data
trainingChunkiShuffledRandom = ORDER trainingChunki BY r;
trainingChunkiToStore = FOREACH trainingChunkiShuffledRandom GENERATE
id AS id,
label AS label,
features AS features;
STORE trainingChunkiToStore INTO '$training_data_i_s3_path' USING PigStorage(',');
The same problem reappears. Even worse, at some point, there was no mapper/reducer running. The whole program hanged without making any progress. I added another machine and the program ran for a few minutes before it jammed again. Looks like there are some dependency issues here.
WHAT'S THE PROBLEM
I suspect there are some dependency which leads to deadlock. The confusing thing is that before shuffling, I already
generate the data chunks. I was expecting the shuffling could be executed in parallel since these data chunks are independent
with each other.
Also I noticed there are many mappers/reducers do very little thing (exists less than 1 min). In such case, I would
imagine the overhead to launch mappers/reducers would be high, is there any way to control this?
What's the problem, any suggestions?
Is there standard way to do this sampling. I would imagine there are many cases where we need to do these subsampling like bootstrapping or bagging. So, there might be some standard way to do this in pig. I couldn't find anything useful online.
Thanks a lot
ADDITIONAL INFO
The size of table 'labelZeroTrainingData' is really small, around 16MB gziped.
table 'labelZeroTrainingData' is also generated in the same pig script by filtering.
I ran the pig script on 3 aws c3.2xlarge machines.
table 'dataFeatures' could be large, around 15GB gziped.
I didn't modify any default configuration of hadoop.
I checked the disk space and memory usage. Disk space usage is around 40%. Memory usage is around 90%. I'm not sure memory is the problem. Since
I was told if the memory is the issue, the whole task should fail.
After a while, I think I figure out something. The problem is likely to be the multiple STORE statements there. Looks like pig script will be running in batch by default. So, for each chunk of the data, there is a job running which leads to lack of resource, e.g. slots for mapper and reducer. None of the job could finish because each needs more mapper/reducer slots.
SOLUTION
use piggybank. There is a storage function called MultiStorage which might be useful in this case. I had some version incompatible issue between piggybank and hadoop. But it might work.
Disable pig executing operations in batch. Pig tries to optimize the execution. I simply disable this multiquery feature by adding -M. So, when you run pig script, it looks like something pig -M -f pig_script.pg which executes one statement at a time without any optimization. This might not be ideal because no optimization is done. For me, it's acceptable.
Use EXEC in pig to enforce certain execution order which is helpful in this case.

Which Hadoop product is more appropriate for a quick query on a large data set?

I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set)
The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear below shows a sample record in the data set.
one line (record) looks like:
1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 ***103570835*** F .. 23G 24C
The highlighted field is called "position of match" and the query we are interested in is the # of sequences in a certain range of this "position of match". For instance the range can be "position of match" > 200 and "position of match" + 36 < 200,000.
Any suggestions on the Hadoop product I should start with to accomplish the task? HBase,Pig,Hive, or ...?
Rough guideline: If you need lots of queries that return fast and do not need to aggregate data, you want to use HBase. If you are looking at tasks that are more analysis and aggregation-focused, you want Pig or Hive.
HBase allows you to specify start and end rows for scans, meaning it should be satisfy the query example you provide, and seems most appropriate for your use case.
For posterity, here's the answer Xueling received on the Hadoop mailing list:
First, further detail from Xueling:
The datasets wont be updated often.
But the query against a data set is
frequent. The quicker the query, the
better. For example we have done
testing on a Mysql database (5 billion
records randomly scattered into 24
tables) and the slowest query against
the biggest table (400,000,000
records) is around 12 mins. So if
using any Hadoop product can speed up
the search then the product is what we
are looking for.
The response, from Cloudera's Todd Lipcon:
In that case, I would recommend the
following:
Put all of your data on HDFS
Write a MapReduce job that sorts the data by position of match
As a second output of this job, you can write a "sparse index" -
basically a set of entries like this:
where you're basically giving offsets
into every 10K records or so. If you
index every 10K records, then 5
billion total will mean 100,000 index
entries. Each index entry shouldn't be
more than 20 bytes, so 100,000 entries
will be 2MB. This is super easy to fit
into memory. (you could probably index
every 100th record instead and end up
with 200MB, still easy to fit in
memory)
Then to satisfy your count-range
query, you can simply scan your
in-memory sparse index. Some of the
indexed blocks will be completely
included in the range, in which case
you just add up the "number of entries
following" column. The start and
finish block will be partially
covered, so you can use the file
offset info to load that file off
HDFS, start reading at that offset,
and finish the count.
Total time per query should be <100ms
no problem.
A few subsequent replies suggested HBase.
You could also take a short look at JAQL (http://code.google.com/p/jaql/), but unfortunately it's for querying JSON data. But maybe this helps anyway.
You may need to look at No-SQL Database approaches like HBase or Cassandra. I would prefer HBase, as it has a growing community.

Resources