I have a log file "sample.log" which looks like this:
41 Texas 2000
42 Louisiana4 3211
43 Texas 5000
22 Iowa 4998p
In the log file the first column is the id, the second the state name, and the third the sales amount. As you can see, the state name column contains "Louisiana4" and the sales column contains "4998p". How can I cleanse the data so I can insert it into Hive (using Python or some other way)? Could you please show the steps?
I want to insert it into the Hive table tblSample:
The table schema is:
CREATE TABLE tblSample(
id int,
state string,
sales int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/cloudera/Staging'
;
To load the data into the Hive table I could do:
load data local inpath '/home/cloudera/sample.log' into table tblSample;
Thank you!
You could load the data as is into a Hive table and then use UDFs to cleanse it and load it into another table. This would be far more efficient than Python, since it will run as a MapReduce job.
I would rather store the data as it is and do the cleansing while fetching it. That would be much simpler, and no external code is required. For example:
hive> CREATE TABLE tblSample(
> id string,
> state string,
> sales string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/user/cloudera/Staging';
hive> select regexp_replace(state, "[0-9]", ""), regexp_replace(sales, "[a-z]", "") from tblSample;
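If you still want the Python route mentioned in the question, a minimal pre-processing sketch could look like the following. The paths, the whitespace splitting and the output delimiter are assumptions based on the sample and the tab-delimited tblSample DDL above; this is a sketch, not a definitive implementation.
import re

# Sketch only: strip digits from the state and non-digits from the sales
# column, then write a tab-delimited file matching tblSample (id, state, sales).
with open('/home/cloudera/sample.log') as src, \
        open('/home/cloudera/sample_clean.log', 'w') as dst:
    for line in src:
        parts = line.split()
        if len(parts) != 3:
            continue                              # skip malformed lines
        rec_id, state, sales = parts
        state = re.sub(r'[0-9]', '', state)       # Louisiana4 -> Louisiana
        sales = re.sub(r'[^0-9]', '', sales)      # 4998p      -> 4998
        dst.write('\t'.join([rec_id, state, sales]) + '\n')
The cleansed file can then be loaded with the same load data local inpath statement, pointing at sample_clean.log.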
HTH
I am looking to encode columns of a table in Hive.
I tried:
hive> create table encode_test(id int, name STRING, phone STRING, address STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES ('column.encode.columns'='phone,address', 'column.encode.classname'='org.apache.hadoop.hive.serde2.Base64WriteOnly') STORED AS TEXTFILE;
Say I have a CSV file with the following row:
100,'navis','010-0000-0000','Seoul Seocho'
Now I tried to use:
LOAD DATA LOCAL INPATH
'/home/path/to/csv/test.csv'
INTO TABLE encode_test;
But when doing select * from encode_test I am getting all columns as NULL,
whereas the result should have been:
100 navis MDEwLTAwMDAtMDAwMA== U2VvdWwsIFNlb2Nobw==
Also I want to specify FIELDS TERMINATED BY ',' in the create table encode_test query,
but I am getting an error: EOF error near FIELDS.
I also tried creating another table, sample:
create table sample(id int, name STRING, phone STRING, address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
I then imported the CSV file into the sample table, and it was imported successfully.
Then I tried:
insert into encode_test select * from sample;
But I am getting this new error:
Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
I'm new to Hadoop.
Please refer to this link, which is where I tried this from.
In Hive DDL, ROW FORMAT SERDE and FIELDS TERMINATED BY cannot co-exist. Instead, you can use the field.delim SerDe property:
create table encode_test(id int, name STRING, phone STRING, address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'column.encode.columns'='phone,address',
'column.encode.classname'='org.apache.hadoop.hive.serde2.Base64WriteOnly')
STORED AS TEXTFILE;
And for the Permission denied exception, run the Hive queries as either the hdfs or the hive user, since the root user does not have WRITE access to HDFS.
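As a side note, the expected values shown in the question are simply the Base64 encodings of the raw column values, which you can verify with plain Python (standard library only):
import base64

# Phone value from the sample CSV row in the question
print(base64.b64encode(b'010-0000-0000').decode())        # MDEwLTAwMDAtMDAwMA==
# Decoding the expected address value from the question
print(base64.b64decode('U2VvdWwsIFNlb2Nobw==').decode())  # Seoul, Seocho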
While loading data from a file into Hive tables, NULL values are getting inserted.
sqlCon.sql("create table hive_6(id Int,name String) partitioned by (date String) row format delimited fields terminated by ','");
sqlCon.sql("load data local inpath '/home/cloudera/file.txt' into table hive_6 partition(date='19July')");
sqlCon.sql("select * from hive_6").show()
+----+----+------+
| id|name| date|
+----+----+------+
|null|null|19July|
|null|null|19July|
|null|null|19July|
|null|null|19July|
|null|null|19July|
|null|null|19July|
|null|null|19July|
+----+----+------+
I was facing the same issue when I was reading data from Parquet files.
The Hive queries will return the correct data, although Spark SQL will show NULL values.
The reason is the schema; you should have the following:
Firstly, the column names in the file (txt/parquet) you are reading should all be in lowercase.
Secondly, the column names in the Hive table you have created should be exactly the same as those in the file you are reading.
Thirdly, the data types in the txt/parquet file and the Hive table should be the same.
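As a quick way to compare the two sides, here is a hedged PySpark sketch. It assumes a Spark 2.x SparkSession with Hive support (the question uses the older sqlCon, where the same two statements apply); hive_6 is the table name from the question.
from pyspark.sql import SparkSession

# Sketch only: print the schema Hive declares and the schema Spark resolves
# for the same table, so name/type mismatches become visible.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("DESCRIBE hive_6").show()      # what the Hive DDL declares
spark.table("hive_6").printSchema()      # what Spark SQL sees for the same table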
How can I transfer an HBase table into Hive correctly?
You can read what I tried before in this question:
How to insert overwrite table in Hive with different where clauses?
(I made one table to import all the data. The problem here is that the data is still in rows and not in columns. So I made 3 tables for news, social and all, each with a specific where clause. After that I made 2 joins on those tables, which gives me the result table. So I had 6 tables in total, which is not really performant!)
To sum my problem up: in HBase there are column families which are saved as rows like this:
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a data structure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table:
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. The resultant output of the above query will be an array like ["all,1","social,0","news,1"] for each name.
Now I wrote a custom mapper which takes that input as a parameter and emits the values:
from (select '["all,1","social,0","news,1"]' input from TESTTABLE group by name) d MAP d.input Using 'python test.py' as
all,social,news
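The test.py script itself is not shown above; a minimal sketch of what such a mapper could look like follows. Hive's MAP/TRANSFORM streams rows to the script on stdin as tab-separated text and reads tab-separated rows back from stdout; the parsing of the collect_set string is an assumption.
import sys

# Sketch of the mapper used in the MAP query above: each input row is a single
# column holding a string like ["all,1","social,0","news,1"].
for line in sys.stdin:
    pairs = line.strip().strip('[]').replace('"', '').split(',')
    # pairs is now ['all', '1', 'social', '0', 'news', '1']
    values = dict(zip(pairs[0::2], pairs[1::2]))
    print('\t'.join([values.get('all', ''),
                     values.get('social', ''),
                     values.get('news', '')]))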
Alternatively, you can use the output to insert into another table which has the column names name, all, social, news.
Hope this helps
I am new to Hadoop/Hive and have just started to do basic querying in Hive.
I have an input text file (which has a large number of fields per line). The format of the file is something like this:
1;23;0;;;;1;3;2;1;1;4;5;6;;;;
1;43;6;;;;1;3;2;1;1;4;5;5;;;;
1;53;7;;;;1;3;2;1;1;4;5;2;;;;
(Each integer before a ";" has a meaning, which I intend to map to a column in the Hive table - and each line contains about 400 fields.)
So to insert this I have created a table "test" using the following query:
CREATE TABLE test (field1 INT, field2 INT, field3 INT, field4 INT, ... field390 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\073";
And I load my text file with the records using the LOAD query below:
LOAD DATA LOCAL INPATH '/tmp/test.txt'
OVERWRITE INTO TABLE test;
For now, all the fields up to field 50 are getting inserted into the table accurately. After that I have mismatches.
In my input format, the 50th field in test.txt is an INT that decides how many of the following fields to take.
Example:
50th field: 2 -> Hive has to take the next 2*10 INT field values and insert them into the table.
50th field: 1 -> Hive has to take the next 1*10 INT field values and insert them into the table, and the remaining 10 fields can be set to NULL.
(The maximum value of the 50th field is 2, so I have reserved 2*10 fields for this in the table.)
After the 50th+(2*10) fields, the data should be read normally, in sequence, as it was before the 50th field.
Is there a way to apply a condition on the input so that the data gets inserted accordingly in Hive?
Any help would be appreciated. I need a solution that does not require pre-processing test.txt before supplying it to the table.
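To make the rule concrete, here is a sketch of the intended parsing as a Python script that could be plugged into a Hive TRANSFORM/MAP clause, so the file on disk is not pre-processed. The field positions, the 2*10 reservation and the NULL padding follow the description above; everything else (names, the tab-separated output) is an assumption.
import sys

GROUP_SIZE = 10             # each group following the control field has 10 INT values
RESERVED = 2 * GROUP_SIZE   # slots reserved in the table for the variable block

for line in sys.stdin:
    fields = line.rstrip('\n').split(';')
    head = fields[:50]                              # fields 1..50, including the control field
    count = int(fields[49]) if fields[49] else 0    # 50th field: how many groups follow
    variable = fields[50:50 + count * GROUP_SIZE]
    variable += [''] * (RESERVED - len(variable))   # unused reserved slots become NULL in Hive
    tail = fields[50 + count * GROUP_SIZE:]         # the rest is read normally, in sequence
    print('\t'.join(head + variable + tail))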
I have tried to answer it at http://www.knowbigdata.com/page/hive-hadoop-need-load-data-table-based-conditions-input-file#comment-85
Does it make sense?
You can use a WHERE clause in Hive.
First load the data into a raw Hive table or HDFS, then create another table and load the data based on a WHERE clause.
That is:
SELECT * FROM table_reference
WHERE name like "%venu%"
GROUP BY City;
Resource: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
How do I store Pig output as Ctrl-A delimited output for loading into Hive?
To get the expected result you can follow the process below.
Store your relation using the command below:
STORE <Relation> INTO '<file_path>' USING PigStorage('\u0001');
Expose a Hive table referring to the generated file:
hive>CREATE EXTERNAL TABLE TEMP(
c1 INT,
c2 INT,
c3 INT,
c4 INT
.....
)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '<file_path>';
If the output file is present in a local Linux directory, then create the table:
hive>CREATE TABLE TEMP(
c1 INT,
c2 INT,
c3 INT,
c4 INT
.....
)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
and load the data into the table:
hive> load data local inpath '<file_path>' into table temp;
Can you try it like this?
STORE <OutpuRelation> INTO '<Outputfile>' USING PigStorage('\u0001');
Example:
input.txt
1,2,3,4
5,6,7,8
9,10,11,12
PigScript:
A = LOAD 'input.txt' USING PigStorage(',');
STORE A INTO 'out' USING PigStorage('\u0001');
Output:
1^A2^A3^A4
5^A6^A7^A8
9^A10^A11^A12
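If you want to confirm that the Ctrl-A bytes really made it into the part file, a quick Python check (assuming the 'out' directory from the Pig script is on the local filesystem):
# \x01 is the Ctrl-A delimiter that Hive's default text SerDe expects.
with open('out/part-m-00000', 'rb') as f:
    first = f.readline()
print(first)                   # e.g. b'1\x012\x013\x014\n'
print(first.split(b'\x01'))    # [b'1', b'2', b'3', b'4\n']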
UPDATE:
The above Pig script output is stored in the file 'part-m-00000', and I tried loading this file into Hive. Everything works fine and I didn't see any issue.
hive> create table test_hive(f1 INT,f2 INT,f3 INT,f4 INT);
OK
Time taken: 0.154 seconds
hive> load data local inpath 'part-m-00000' overwrite into table test_hive;
OK
Time taken: 0.216 seconds
hive> select *from test_hive;
OK
1 2 3 4
5 6 7 8
9 10 11 12
Time taken: 0.076 seconds
hive>