How to speed up sort in Hive - Hadoop

I would like to speed up a Hive process, but I do not know how to do it.
The data is about 200 GB, roughly 300,000,000 lines of text, and I split it into 50 files in advance, so each file is about 4 GB.
I would like to get one file as the result of the sort, so I set the number of reducers to 1 and the number of mappers to 50.
Each line of the data consists of a word and its frequency.
The same words should be grouped and their frequencies summed.
All of the files are gzip files.
It takes a few days to complete the process, and I would like to speed it up to a few hours if I can.
Which parameters should I change to speed up the process?

Thank you for your reply.
Yes, I define an external Hive table pointing to an HDFS location.
Here is my pseudo code:
create external table A (`count` int, word string)
row format delimited fields terminated by '\t'
location 'HDFS path';
select sum(`count`) as total, word from A group by word sort by total desc;
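Not a definitive tuning recipe, but a sketch of the kind of changes that usually matter here. Gzip files are not splittable, so one mapper per 4 GB file is already what that layout gives you, and the single reducer serializes the entire sort. Aggregating with many reducers first, and only sorting the much smaller aggregated result at the end, typically helps far more than any single parameter. The property names below are the standard Hive/Hadoop ones (older clusters use mapred.reduce.tasks instead of mapreduce.job.reduces); table and column names follow the pseudo code above.
set hive.exec.parallel=true;               -- run independent stages in parallel
set hive.exec.compress.intermediate=true;  -- compress data shuffled between stages
set mapreduce.job.reduces=50;              -- many reducers for the heavy GROUP BY
-- Stage 1: aggregate in parallel; this is where the 200 GB shrinks.
create table A_agg as
select word, sum(`count`) as total
from A
group by word;
-- Stage 2: only the small aggregated table passes through a single reducer.
select word, total
from A_agg
order by total desc;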

Related

sum() function gives wrong answer in hiveql

I was playing around with a simple dataset that you can find here.
No matter what I do, calling the SUM() aggregate function on the 4th column of the given data set returns the wrong answer.
Here is the exact code that I have used:
create database beep_boop;
use beep_boop;
create table cause (year INT, sex STRING, cause STRING, value INT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
tblproperties("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause;
select sum(value) from cause;
The answer that I get is 11478567 as shown in the screenshot here.
But using SUM() in MS Excel gives an answer of 12745563.
I tried deleting the table/database and recreating them from scratch. I tried uploading the csv file again. I tried using different datatypes like INT and BIGINT for the value column. I tried skipping and not skipping the header line. Nothing works. I also know that the file is being read completely because select count(*) from cause; returns a correct answer of 1016.
P.S.: I am new to Hadoop, Hive and big data in general.
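One thing worth checking (a guess, not a confirmed diagnosis): Hive silently turns any field it cannot parse as an INT into NULL, and SUM() skips NULLs, which would produce exactly this kind of too-small total even though count(*) still sees every row. A quick check:
-- Rows whose 4th field failed to parse as INT show up as NULL in Hive.
select count(*) from cause where value is null;
-- Inspect a few offending rows to see what the raw values looked like.
select year, sex, cause from cause where value is null limit 10;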

How to insert data into table in following scenario?

I am a newbie in Hadoop and I have to add data into a table in Hive.
I have data from the FIX 4.4 protocol, something like this...
8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>
8=FIX.4.4<SHO>9=69<SHO>35=A<SHO>34=1<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>98=0<SHO>108=30<SHO>10=093<SHO>
8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>
8=FIX.4.4<SHO>9=110<SHO>35=5<SHO>34=525<SHO>49=SSGMdemo<SHO>52=20150410-15:25:58.164<SHO>56=Trumid<SHO>58=MsgSeqNum too low, expecting 361 but received 1<SHO>10=195<SHO>
Firstly, what I want is: in 8=FIX.4.4, 8 should be the column name and FIX.4.4 the value of that column; in 9=66, 9 should be the column name and 66 the value of that column, and so on. There are many rows like this in the raw file.
Secondly, the same thing for the next row, and that data should be appended as the next row of the table in Hive.
Now I cannot think of what I should do.
Any help would be appreciated.
I would first create a tab-separated file containing this data. I suggested using a regex in the comments, but if that is not your strong suit you can just split on the <SHO> tag and =. Since you did not specify the language you want to use, I will suggest a 'solution' in Python.
The code below shows you how to write one of your input lines to a TSV file.
It can easily be extended to support multiple lines, or to append lines to the file once it has been created.
import csv

# One raw FIX line; fields are separated by the <SHO> tag.
input_line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"

l = input_line.split('<SHO>')[:-1]  # Don't include the last element since it's empty
list_of_pairs = [tuple(x.split('=', 1)) for x in l]  # Split on the first '=' only
d = dict(list_of_pairs)

with open('test.tsv', 'w', newline='') as c:
    cw = csv.writer(c, delimiter='\t')
    cw.writerow(d.keys())    # Comment this out if you don't want a header
    cw.writerow(d.values())
What this code does is first split the input line on <SHO>, producing a list of col=val strings. It then creates a list of tuple pairs where each tuple is (col, val).
From this it builds a dictionary, which is not strictly necessary but might help you if you want to extend the code to more lines.
Next it creates a tab-separated-value file test.tsv containing a header line and the values on the next line.
This means you now have a file which Hive can understand.
I am sure you can find a lot of articles on importing CSV or tab-separated-value files, but I will give you an example of a generic Hive query you can use to import this file once it is in HDFS.
CREATE TABLE IF NOT EXISTS [database].[table]
([Col1] INT, [Col2] INT, [Col3] STRING, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH '[HDFS path]'
OVERWRITE INTO TABLE [database].[table];
Hope this gives you a better idea on how to proceed.
Copy the file to HDFS and create an external table with a single column (c8), then use the select statement below to extract each column.
create external table tablename (
  c8 string)
stored as textfile
location 'HDFS path';
select regexp_extract(c8,'8=(.*?)<SHO>',1) as c8,
regexp_extract(c8,'9=(.*?)<SHO>',1) as c9,
regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
regexp_extract(c8,'34=(.*?)<SHO>',1) as c34,
regexp_extract(c8,'49=(.*?)<SHO>',1) as c49,
regexp_extract(c8,'52=(.*?)<SHO>',1) as c52,
regexp_extract(c8,'56=(.*?)<SHO>',1) as c56,
regexp_extract(c8,'98=(.*?)<SHO>',1) as c98,
regexp_extract(c8,'108=(.*?)<SHO>',1) as c108,
regexp_extract(c8,'554=(.*?)<SHO>',1) as c554,
regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename;
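If you want to persist the parsed fields rather than just select them, the same select works inside a CREATE TABLE AS statement (a sketch; fix_parsed is a placeholder name, and the remaining regexp_extract columns follow the same pattern):
create table fix_parsed as
select regexp_extract(c8,'8=(.*?)<SHO>',1)  as c8,
       regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
       regexp_extract(c8,'34=(.*?)<SHO>',1) as c34
from tablename;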

SQL LOADER ERROR_0102

Below are my table structure, .CTL file, and .CSV file. While loading data I always get an error on the first row; all the other rows load fine. If I leave a completely blank line before the first record, all the data gets inserted.
Can you please help me understand why I am getting an error on the first record?
TABLE_STRUCTURE
ING_DATA
(
ING_COMPONENT_ID NUMBER NOT NULL,
PARENT_ING_ID NUMBER NOT NULL,
CHILD_ING_ID NUMBER NOT NULL,
PERCENTAGE NUMBER(7,4) NOT NULL
);
CTL FILE
LOAD DATA
INFILE 'C:\Users\pramod.uthkam\Desktop\Apex\Database\SQL LOADER-PROD\ING_COMPONENT\ingc.csv'
BADFILE 'D:\SQl Loader\bad_orders.txt'
INTO TABLE ING_data
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
ING_Component_ID ,
Parent_ING_ID ,
Child_ING_ID ,
Percentage
)
CSV FILE
1,3,4,95.0000
2,3,5,5.0000
3,6,7,5.0000
4,6,4,95.0000
5,18,19,19.0000
6,18,20,80.0000
7,18,21,1.0000
8,34,35,85.0000
LOG FILE
Record 1: Rejected - Error on table ING_COMPONENT, column ING_COMPONENT_ID.
ORA-01722: invalid number
Table ING_COMPONENT:
7 Rows successfully loaded.
1 Row not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
Space allocated for bind array: 66048 bytes(64 rows)
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 7
Total logical records rejected: 1
Total logical records discarded: 0
BAD FILE
1,3,4,95.0000
I tried loading your file by creating it as is. For me it runs fine: all 8 rows got loaded, no issues. I was trying on Red Hat Linux.
Then I tried two things.
dos2unix myctl.ctl
Ran SQLLDR. All rows got inserted.
Then I tried:
unix2dos myctl.ctl
Ran SQLLDR again; all 8 records were rejected. So what I believe is that your first record's line ending is not in a format SQL*Loader can read. When you enter a blank line manually, your default environment (in my case UNIX) creates the correct line ending, and the records get read. I'm not sure, but I assume this based on my own attempt above.
So let's say you are loading this file on Windows (I think this because of how your path looks). In your CSV file, add a blank line at the beginning, then remove it, and do the same thing after the first record (add a blank line after the first record, then remove it). Then try again. Maybe it works, if the issue is what I suspect.

Sequence Number UDF in Hive

I have tried this UDF in Hive: UDFRowSequence.
But it is not generating unique values, i.e. it repeats the sequence depending on the mappers.
Suppose I have one file (having 4 records) available in HDFS. One mapper will be created for that job, and the result will be like
1
2
3
4
but when there are multiple (large) files at the HDFS location, multiple mappers get created for the job, and each mapper generates the same repetitive sequence, like below:
1
2
3
4
1
2
3
4
1
2
.
Is there any solution for this, so that a unique number is generated for each record?
I think you are looking for ROW_NUMBER(). You can read about it and other "windowing" functions here.
Example:
SELECT t.*, ROW_NUMBER() OVER () AS row_num
FROM some_database.some_table t;
@GoBrewers14: Yes, I did try that. We tried to use the ROW_NUMBER function, and when we query small data, e.g. a file containing 500 rows, it works perfectly. But when it comes to large data, the query runs for a couple of hours and finally fails to generate output.
I have since come to know the following about this:
Generating a sequential order in a distributed processing query is not possible with simple UDFs, because the approach would require some centralised entity to keep track of the counter, which would also cause severe inefficiency for distributed queries; it is not recommended.
If you want to work with multiple mappers and a large dataset, try using this UDF: https://github.com/manojkumarvohra/hive-hilo
It makes use of ZooKeeper as a central repository to maintain the state of the sequence.
A query to generate sequences. We can use this as a surrogate key in a dimension table as well:
WITH temp AS
  (SELECT if(max(seq) IS NULL, 0, max(seq)) AS max_seq
   FROM seq_test)
SELECT col_id,
       col_val,
       row_number() over () + max_seq AS seq
FROM source_table
INNER JOIN temp ON 1 = 1;
seq_test: your target table.
source_table: your source table.
seq: surrogate key / sequence number / key column.
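To actually load the target table with the generated keys, the same pattern can sit inside an INSERT (a sketch under the same naming assumptions as above; adjust the column list to your target table's schema):
INSERT INTO TABLE seq_test
SELECT row_number() over () + t.max_seq AS seq,
       s.col_id,
       s.col_val
FROM source_table s
CROSS JOIN (SELECT coalesce(max(seq), 0) AS max_seq FROM seq_test) t;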

Hadoop Remove unnecessary \n in the input files

I have a large input file whose values are pipe-delimited, with 20 values in a row. A newline character after the 19th pipe marks the end of a record.
But my input file has \n not only after the 19th pipe but also inside the other values. A sample line looks like this...
101101|this\nis my sample|12547|sample\nxyz|......(19th pipe)|end of record\n
I am new to Hadoop and I don't know how to divide lines to create key-value pairs based on this condition.
Another related question I have: input splitting happens on the client side, and if I have to split the input file conditionally on the client side (one machine), won't that be very slow for such a large file? Please help.
In Hive, NULL column values are represented as "\N"; that is Hive's default behaviour. This is done to differentiate NULL from "NULL" (the string "NULL").
If you don't want \N to appear in your export, you can use the COALESCE UDF.
Roughly, your query may look like this:
SELECT
  COALESCE(my_column, '') AS my_column
FROM
  my_table;
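If the export itself is written by Hive, the same COALESCE wrapping can go directly into an INSERT OVERWRITE DIRECTORY statement (a sketch; the output path is a placeholder, and specifying ROW FORMAT here requires Hive 0.11 or later):
INSERT OVERWRITE DIRECTORY '/tmp/my_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT COALESCE(my_column, '') AS my_column
FROM my_table;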
