How does Hive store SequenceFiles? - hadoop

There is a Hive internal table stored as a SequenceFile; the first column type is string and the field separator is '\1'. I want to process it with MapReduce directly, and I found that the input key is BytesWritable. My question is: how does Hive store data in a SequenceFile? Is the reason I get a BytesWritable key that the first column type is string? I didn't configure the map's key separator as '\1', so this second point puzzles me.

Hive does not treat the first column as a key for a SequenceFile. Rather, the key is ignored completely. [1] [2]. So when you write your Mapper to operate on a Hive SequenceFile, you should also disregard the key; all of your columns will be part of the value.
Just in case your value is also a BytesWritable and you want it as Text, try SequenceFileAsTextInputFormat (docs). The answer to this similar question may help you get set up. You should be able to get a String from the Text with a simple toString(). Your separator '\1' comes in here: split the String on '\1' to separate it into your Hive columns.
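For illustration, here is a minimal sketch of such a Mapper (the class name and output are hypothetical; it assumes SequenceFileAsTextInputFormat is set on the job, so both key and value arrive as Text):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HiveSeqFileMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key Hive wrote carries no data, so ignore it entirely.
        // Hive's default field delimiter is \001 (Ctrl-A); -1 keeps trailing empty fields.
        String[] columns = value.toString().split("\u0001", -1);
        // columns[0] is the first Hive column (the string column from the question)
        context.write(new Text(columns[0]), value);
    }
}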

Related

Informatica - Concatenate Max value from each column present in multiple rows for same Primary Key

I have tried the traditional approach of using an Aggregator (Group By: ID, Store Name) and Max(Each Object) columns separately.
Then, in the next Expression, Concat(Val1 || Val2 || Val3 || Val4).
However, I'm getting the output as '0100'.
But the REQUIRED OUTPUT is: 1100
Please let me know how this can be done in IICS.
IICS is similar to PowerCenter on-prem.
First use an Aggregator:
in the Group By tab, add ID and Store Name;
in the Aggregate tab, add max(object1), and so on. Please note to set the data type and length correctly.
Then use an Expression transformation:
link ID and Store Name first;
then concat the max_* columns using the pipe operator:
out_max = max_col1 || max_col2 || ... Again, set the data type and length correctly.
This should generate the correct output. I think you are getting the wrong output because of the data length or data type of the object fields. Make sure you trim spaces from the object data before the Aggregator, as sketched below.
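For illustration (port names are hypothetical): a trimming port in an Expression placed before the Aggregator, then the concatenation afterwards:
object1_clean = LTRIM(RTRIM(object1))
out_max = max_col1 || max_col2 || max_col3 || max_col4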

Querying in HBase can't find key because it's got a hexadecimal prefix in it

Not much of an HBase guy, so bear with me; I'm just a data analyst trying to do his job.
Let's say, for the sake of simplicity, there's an HBase table called Student with the following info:
Key - Student ID
Value - SSN
So I'm trying to run the following command:
get 'Student','88812'
I'm trying to produce the following:
COLUMN CELL
H:00_ETAG timestamp=1525760141144, value=1234567891
However, nothing is returned. After scanning the table, I've discovered that the key has some sort of hexadecimal value in front of it. So the key is actually something like:
\x80\x00\x02F188812
I understand that, in order to execute the get command, I'd just need to use double quotes, like this:
get 'Student',"\x80\x00\x02F188812"
Now, where the real issue arises for me is the fact that I have NO clue what the hexadecimal prefix for each of these keys should be. It seems like the table I'm working out of has a different hexadecimal prefix for each key. Is there a way I can execute the get command without the hexadecimal, or at least find out what the hexadecimal should be? How about doing a reverse search, where I instead try to find the key by searching by value?
And no, I can't scan the entire table, since there are millions of records.
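For the reverse search by value, one option is a filtered scan with the HBase Java client. A sketch, assuming the HBase 2.x API and the Student example above; note that the filter is still evaluated against every row server-side, so it will be slow on millions of records:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FindStudentByValue {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("Student"))) {
            Scan scan = new Scan();
            // Keep only rows whose H:00_ETAG cell equals the value we already know.
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("H"), Bytes.toBytes("00_ETAG"),
                    CompareOperator.EQUAL, Bytes.toBytes("1234567891")));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    // toStringBinary renders the binary prefix as \x.. escapes,
                    // exactly the form the shell's get accepts in double quotes.
                    System.out.println(Bytes.toStringBinary(r.getRow()));
                }
            }
        }
    }
}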

How to load nested collections in Hive with more than 3 levels

I'm struggling to load data into Hive, defined like this:
CREATE TABLE complexstructure (
id STRING,
date DATE,
day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
) row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';
The day_data field contains a complex structure that is difficult to load with load data inpath...
I've tried with '\004', ^D... a lot of options, but the data inside the map doesn't get loaded.
Here is my last try:
id_3054,2012-09-22,3600000:TOT'\005'0.716'\004'PI'\005'0.093'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|7200000:TOT'\005'0.367'\004'PI'\005'0.066'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|10800000:TOT'\005'0.268'\004'PI'\005'0.02'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.159'\004'RES'\005'0.0|14400000:TOT'\005'0.417'\004'PI'\005'0.002'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.165'\004'RES'\005'0.0
Before posting here, I've tried (many many) options, and this example doesn't work:
HIVE nested ARRAY in MAP data type
I'm using the image from HDP 2.2
Any help would be much appreciated
Thanks
Carlos
So finally I found a nice way to generate the file from Java. The trick is that Hive uses the first 8 ASCII characters as separators, but you can only override the first three. From the fourth on, you need to generate the actual ASCII characters.
After many tests, I ended up editing my file with a hex editor, and inserting the right values worked. But how can I do that in Java? It couldn't be simpler: just cast an int to char, and that generates the corresponding ASCII character:
ASCII 4 -> ((char)4)
ASCII 5 -> ((char)5)
...
And so on.
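Putting that together, a minimal sketch in Java (values taken from the sample row above, shortened to two map entries per struct):
char entrySep = (char) 4;  // ASCII 4: separates entries of the innermost map
char kvSep    = (char) 5;  // ASCII 5: separates a map key from its value
StringBuilder row = new StringBuilder();
row.append("id_3054").append(',')             // ',' separates top-level fields
   .append("2012-09-22").append(',');
row.append("3600000").append(':')             // ':' separates the struct members
   .append("TOT").append(kvSep).append("0.716").append(entrySep)
   .append("PI").append(kvSep).append("0.093");
row.append('|');                              // '|' separates the array items
row.append("7200000").append(':')
   .append("TOT").append(kvSep).append("0.367").append(entrySep)
   .append("PI").append(kvSep).append("0.066");
System.out.println(row.toString());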
Hope this helps!!
Carlos
You could store the Hive table in Parquet or ORC format, both of which support nested structures natively and more efficiently.
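For example, a CTAS copy of the table above into Parquet (illustrative):
CREATE TABLE complexstructure_parquet STORED AS PARQUET AS SELECT * FROM complexstructure;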

Sqoop Increment with string column

I'm trying to use an incremental Sqoop job across all tables in a database. Some of the databases only have string values in the columns. Is there a way to increment on a string value? There is a common string column across all tables.
After my initial comment, I was wondering whether the question you asked even makes sense. It would if your database forced you to store either the record date or the incrementing number in a text column, but the odds of that are very slim.
If you have a date field you can actually use, you can just use 'lastmodified' mode instead of 'append' mode.
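A sketch of such an import (the connection string, table, and column names are placeholders):
sqoop import --connect jdbc:mysql://dbhost/mydb --table orders \
  --incremental lastmodified --check-column last_update_ts \
  --merge-key order_id --target-dir /warehouse/orders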

How to replace characters in Hive?

I have a string column description in a Hive table which may contain tab characters ('\t'); these characters, however, mess up some views when connecting Hive to an external application.
Is there a simple way to get rid of all tab characters in that column? I could run a simple Python program to do it, but I want to find a better solution.
The regexp_replace UDF performs this task. Below are its definition and usage from the Apache wiki:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
This returns the string resulting from replacing all substrings in INITIAL_STRING
that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT,
e.g.: regexp_replace("foobar", "oo|ar", "") returns fb
A custom SerDe might be a way to do it. Or you could use some kind of mediation process with regexp_replace:
create table tableB as
select
  columnA,
  regexp_replace(description, '\\t', '') as description
from tableA
;
select translate(description,'\t','') from myTable;
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well. (Available as of Hive 0.10.0, for string types)
Char/varchar support added as of Hive 0.14.0
You can also use translate(). If the third argument is too short, the corresponding characters from the second argument are deleted. Unlike regexp_replace() you don't need to worry about special characters.
Source code.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
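For instance, removing a pipe character shows the difference (same description column as above):
select translate(description, '|', '') from myTable;        -- literal character
select regexp_replace(description, '\\|', '') from myTable; -- '|' must be escaped in a regex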
There is no OOTB feature at the moment which allows this. One way to achieve it could be to write a custom InputFormat and/or SerDe that will do it for you. You might find this JIRA useful: https://issues.apache.org/jira/browse/HIVE-3751 (not directly related to your problem, though).
