I am new to HBase. I am using HBase version 1.1.2 on Microsoft Azure. I have data that looks like this:
id num1 rating
1 254 2
2 40 3
3 83 1
4 120 1
5 91 5
6 101 2
7 17 1
8 10 2
9 11 3
10 31 1
I tried to create a table with two column families of the form
create 'table1', 'family1', 'family2'
When I loaded my table with
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.columns="HBASE_ROW_KEY,family1:num1, family2:rating" table1 /metric.csv
I got the error
Error: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 5560 actions: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family family2 does not exist in region table1
When I modified my table to use a single column family, it worked:
create 'table1', 'family1'
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.columns="HBASE_ROW_KEY,family1:num1, family1:rating" table1 /metric.csv
How do I adjust my table creation to account for multiple column families?
HBase ImportTsv internally uses Put operations to load the data into HBase tables.
Put only supports loading into a single column family at a time.
See the HBase ImportTsv documentation for details.
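One detail worth double-checking regardless: the failing -Dimporttsv.columns value contains a space before family2:rating, and ImportTsv takes each comma-separated entry verbatim, so the family name it looks up may not match the family that was created. A minimal sketch with the whitespace removed, assuming the table is created with both families as in the original create statement:

create 'table1', 'family1', 'family2'

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.columns="HBASE_ROW_KEY,family1:num1,family2:rating" table1 /metric.csv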
I have the following question.
The following table is the input:
table_1: student

id   ETL_create_date   student_marks
1    2023-02-10        85
2    2023-02-10        75
3    2023-02-10        80
4    2023-02-09        90
5    2023-02-09        65
6    2023-02-09        100
The expected output should be as below.
Consider that the system date is 2023-02-10.
The STATUS column should check whether the max etl_create_date of the student table is equal to the system date. If the dates are the same, the status should be LOADED, else NOT LOADED.
The etl_create_date column is the max etl_create_date of the student table.
The count column is the number of records in the student table for the given max etl_create_date.
output_table:

table_name      ETL_create_date   STATUS   count
student_table   2023-02-10        LOADED   3
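A query along these lines could produce that row (a minimal sketch, assuming a dialect with CURRENT_DATE such as Hive or Spark SQL and that etl_create_date is a DATE; the output column is named record_count here to avoid the reserved word count):

SELECT
  'student_table' AS table_name,
  etl_create_date,
  CASE WHEN etl_create_date = CURRENT_DATE THEN 'LOADED'
       ELSE 'NOT LOADED' END AS status,
  COUNT(*) AS record_count        -- rows for the max etl_create_date
FROM student
GROUP BY etl_create_date
ORDER BY etl_create_date DESC     -- keep only the latest load date
LIMIT 1;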
I have a set of 5 tables that have approximately 2 million rows and 450 columns each.
My job looks like this:
tDBInput 1 ---tMap-----
tDBInput 1 ---tMap-----
tDBInput 1 ---tMap---tUnite---tDBOutput
tDBInput 1 ---tMap-----
tDBInput 1 ---tMap-----
These are the 5 tables I'm trying to union; in each tMap I add an id to trace which table the data comes from and reduce the number of columns (from 450 to 20).
Then I unite the 5 flows in one tUnite that loads a table in Truncate - Insert mode.
I'm trying to make it work but I always get the same error: "The code of method tDBInput-10Process is exceeding the 65535 bytes limit"
If you use only 20 of the 450 columns, you could select only those columns in each of your tDBInput components, instead of extracting all columns and filtering them in the tMap.
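For example, each tDBInput's query could do the projection itself and even add the trace id, so there is far less code for Talend to generate per flow; a sketch with hypothetical table and column names:

SELECT
  'TABLE_1' AS source_table,   -- trace id added here instead of in tMap
  col_a,
  col_b,
  col_c                        -- ...only the ~20 columns needed downstream
FROM table_1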
For data like below
Col1
----
1
23
34
124
Output should be like below
Out
1
2
3
4
I tried the hierarchical query below, but it's giving repeated data:
select substr(col1, level, 1)
from table1
connect by level <= length(col1);
I can't use DISTINCT, as this is a sample and the main table where I have to use this query has quite a lot of data.
Thanks
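One way around the repeated rows (a sketch, assuming Oracle 12c or later for CROSS APPLY) is to generate the digits per source row and only then deduplicate, so the DISTINCT works on at most ten digit values instead of on the multiplied CONNECT BY output:

SELECT DISTINCT d.dgt
  FROM table1 t
 CROSS APPLY (SELECT SUBSTR(t.col1, LEVEL, 1) AS dgt   -- one row per digit of col1
                FROM dual
             CONNECT BY LEVEL <= LENGTH(t.col1)) d
 ORDER BY d.dgt;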
I have the following data table.
ID salary occupation
1 5000 Engineer
2 6000 Doctor
3 8000 Pilot
4 1000 Army
1 3000 Engineer
2 4000 Teacher
3 2000 Engineer
1 1000 Teacher
3 1000 Engineer
1 5000 Doctor
Now I want to add another column, flag, to this table so that it looks like the following.
ID salary occupation Flag
1 5000 Engineer 0
2 6000 Doctor 0
3 8000 Pilot 0
4 1000 Army 0
1 3000 Engineer 1
2 4000 Teacher 1
3 2000 Engineer 1
1 1000 Teacher 2
3 1000 Engineer 2
1 5000 Doctor 3
Now how can I update my original table to the above format using HIVE?
Kindly help me.
Provided that you have data in your files for the additional column, you can use the ADD COLUMNS clause of ALTER TABLE.
In your example do something like this:
Alter table Test ADD COLUMNS (flag TINYINT);
Or you can try REPLACE COLUMNS as well:
Alter Table test REPLACE COLUMNS (id int, salary int, occupation String, flag tinyint);
You might need to load (overwrite) your dataset again, though (just a speculation!).
You can definitely add new columns to a Hive table using the ALTER command, as shown above:
hive>Alter table Test ADD COLUMNS (flag TINYINT);
In Hive 0.13 and earlier releases the new column will have NULL values, but in Hive 0.14.0 and later releases you can update the column values using the UPDATE command.
Another way is, after adding the column using the ALTER command, to overwrite the existing data with new data that already has the flag column:
hive> LOAD DATA LOCAL INPATH 'flagfile.txt' OVERWRITE INTO TABLE <tablename>;
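If the flag has to be derived in Hive rather than supplied in the file, another option (a sketch, not from the answers above) is to compute it with a window function and write the result into a new table. It assumes the flag is the 0-based occurrence number of each id and that a hypothetical column ord preserves the original row order, since Hive tables have no implicit ordering:

CREATE TABLE test_with_flag AS
SELECT id,
       salary,
       occupation,
       -- 0-based occurrence number of each id, in the order given by ord
       CAST(ROW_NUMBER() OVER (PARTITION BY id ORDER BY ord) - 1 AS TINYINT) AS flag
FROM test;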
I have a table which holds more than 2 million records. I am trying to update the table using the following query:
UPDATE toc T
   SET RANK = 65535 - (SELECT COUNT(*)
                         FROM toc T2
                        WHERE S_KEY LIKE '00010001%'
                          AND A_ID IS NOT NULL
                          AND T2.TARGET = T.TARGET
                          AND T2.RANK > T.RANK)
 WHERE S_KEY LIKE '00010001%'
   AND A_ID IS NOT NULL
Usually this query takes 5 minutes to update 50,000 rows in our staging DB, which is an exact replica of the production DB, but in our production DB it is taking 6 hours to execute.
I tried the Oracle advisor to select the correct execution plan, but nothing is working.
Plan
UPDATE STATEMENT ALL_ROWSCost: 329,471
6 UPDATE TT.TOC
2 TABLE ACCESS BY INDEX ROWID TABLE TT.TOC Cost: 5 Bytes: 4,173,236 Cardinality: 54,911
1 INDEX SKIP SCAN INDEX TT.DATASTAT_SORTKEY_IDX Cost: 4 Cardinality: 1
5 SORT AGGREGATE Bytes: 76 Cardinality: 1
4 TABLE ACCESS BY INDEX ROWID TABLE TT.TOC Cost: 5 Bytes: 76 Cardinality: 1
3 INDEX SKIP SCAN INDEX TT.DATASTAT_SORTKEY_IDX Cost: 4 Cardinality: 1
I can see the following wait events
1,066 db file sequential read 10,267 0 3,993 0 6 39,933,580
1,066 db file scattered read 413 0 188 0 6 1,876,464
Any help will be greatly appreciated.
Here is the current list of indexes:
DSTAT_SKEY_IDX D_STATUS 1
DSTAT_SKEY_IDX S_KEY 2
IDX$$_165A0002 N_LABEL 1
S_KEY_IDX S_KEY 1
XAK1_TOC N_RELATIONSHIP 1
XAK2_TOC TARGET 1
XAK2_TOC N_LABEL 2
XAK2_TOC D_STATUS 3
XAK2_TOC A_ID 4
XIE1_TOC N_RELBASE 1
XIF4_TOC SOURCE_FILE_ID 1
XIF5_TOC A_ID 1
XPK_TOC N_ID 1
Atif
You're doing a skip scan where you presumably want a range scan.
A range scan is only possible when the index columns are ordered by descending selectivity; in your case it seems that the order should be S_KEY, TARGET, RANK.
Update: rewriting the query in a different order wouldn't make any difference. What matters is the sequence of the columns in the indexes of that table.
First, show us the current index columns for that table:
select index_name, column_name, column_position from all_ind_columns where table_name = 'TOC'
Then you could create a new index, e.g.:
create index toc_i_s_key_target_rank on toc (s_key, target, rank) compress;
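After creating it, one quick way to check whether the optimizer now chooses a range scan on the new index instead of the skip scan (a sketch using the query from the question unchanged):

EXPLAIN PLAN FOR
UPDATE toc T
   SET RANK = 65535 - (SELECT COUNT(*)
                         FROM toc T2
                        WHERE S_KEY LIKE '00010001%'
                          AND A_ID IS NOT NULL
                          AND T2.TARGET = T.TARGET
                          AND T2.RANK > T.RANK)
 WHERE S_KEY LIKE '00010001%'
   AND A_ID IS NOT NULL;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);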