RCFile - emitting GZip compressed int columns - hadoop

For some reason, Hive is not recognizing columns emitted as integers, but does recognize columns emitted as strings.
Is there something about Hive or RCFile or GZ that is preventing proper rendering of int?
My Hive DDL looks like:
create external table if not exists db.table (intField int, strField string) stored as rcfile location '/path/to/my/data';
And the relevant portion of my Java looks like:
BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(2);
byte[] byteArray;
BytesRefWritable bytesRefWritable = new BytesRefWritable(); intWritable.set(myObj.getIntField());
byteArray = WritableUtils.toByteArray(intWritable.get());
bytesRefWritable.set(byteArray, 0, byteArray.length);
dataWrite.set(0, bytesRefWritable); // sets int field as column 0
bytesRefWritable = new BytesRefWritable();
textWritable.set(myObj.getStrField());
bytesRefWritable.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(1, bytesRefWritable); // sets str field as column 1
The code runs fine, and through logging I can see the various Writables have bytes within them.
Hive can read the external table as well, but the int field shows up as NULL, indicating some error.
SELECT * from db.table;
OK
NULL my string field
Time taken: 0.647 seconds
Any idea what might be going on here?

So, I'm not sure exactly why this is the case, but I got it working using the following method:
In the code that writes the byte array representing the integer value, instead of using WritableUtils.toByteArray(), I instead Text.set(Integer.toString(intVal)).getBytes().
In other words, I convert the integer to its String representation, and use the Text writable object to get the byte array as if it were a string.
Then, in my Hive DDL, I can call the column an int and it interprets it correctly.
I'm not sure what was initially causing the problem, be it a bug in WritableUtils, some incompatibility with compressed integer byte arrays, or a faulty understanding of how this stuff works on my part. In any event, the solution described above successfully meets the task's needs.

Related

If Statement error: Expressions that yield variant data-type cannot be used to define calculated columns

I'm new in Power BI and I'm more used to work with Excel. I try to translate following Excel formula:
=IF(A2="UPL";0;IF(MID(D2;FIND("OTP";D2)+3;1)=" ";"1";(MID(D2;FIND("OTP";D2)+3;1))))
in Power Bi as follows:
Algo =
VAR FindIT = FIND("OTP",Fixed_onTop_Data[Delivery Date],1,0)
RETURN
IF(Fixed_onTop_Data[Delivery Type] = "UPL", 0,
IF(FindIT = BLANK(), 1, MID(Fixed_onTop_Data[Delivery Date],FindIT+3,1))
)
Unfortunately I receive following error message:
Expressions that yield variant data-type cannot be used to define calculated columns.
My values are as follows:
Thank you so much for your help!
You cant mix Two datatypes in your output; In one part of if, you return an INT (literally 0/1), and is second you return a STRING
MID(Fixed_onTop_Data[Delivery Date],FindIT+3,1)
You must unify your output datatype -> everything to string or everything to INT
Your code must be returning BLANK in some cells therefore PowerBI isn't able to choose a data type for the column, wrap your code inside CONVERT(,INTEGER).

Ruby: how to insert Hash value into Cassandra map

Am attempting to store values provided as int, time, hash into Cassandra using Datastax driver.
Hash appears as { "Q17.1_4"=>"val", "Q17.2"=>"other", ...}
Have defined table as:
int
timestamp
map
PK( int, timestamp )
I can get the PK inserted ok but am having trouble coercing the hash values into something that Cassandra can cope with.
Have created a prepared statement and using it while (trying to) looping through values:
stmt = session.prepare( "insert into forms( id, stamp, questions ) values ( ?,?,? )" )
...
json_val.each{ |key,val|
result = session.execute( stmt, val[ 'ID' ].to_i, Time.parse( val[ 'tDate' ]).to_i, val )
}
If I insert 'val' as a hash I get this error:
/lib/cassandra/statements/prepared.rb:53:in `bind': expecting exactly 3 bind parameters, 2 given (ArgumentError)
This says (to me) that there isn't a method to convert the hash to what Cassandra wants.
If I insert the val as a string (using hash.to_s) I get this error:
/lib/cassandra/protocol/type_converter.rb:331:in varchar_to_bytes': undefined methodencode' for 1:Fixnum (NoMethodError)
... same goes for converting it to a json string, changing double quotes to single and so on.
I can insert the values using the cqlsh (command line) ( EG insert into forms (id, stamp, questions) values ( 123, 12345678, { 'one':'two','three':'four'});)
So the question is - how do I get this Ruby hash into a format that the Cassandra driver will accept?
Using:
Ruby 2.1.3
Cassandra 2.0 (with latest Datastax driver)
EDIT:
JSON value of 'questions' hash is { "Q17.1_4":"val", "Q17.2":"other", ...}
Also - I can insert the first two columns just fine. Omitting the third value from the insert statement works so I know those two values aren't the culprit here.
Found two problems:
There is a note in the docs:
Last argument will be treated as options if it is a Hash. Therefore, make sure to pass empty options when executing a statement with the last parameter required to be a map datatype.
When I either put an empty hash as the last parameter or swapped my hash value for a different param the error message changed to:
type_converter.rb:331:in varchar_to_bytes': undefined methodencode' for 1:Fixnum (NoMethodError)
This lead me to believe that there was an invalid value in there somewhere. To fix this I created a temp hash and checked every value to make sure it was a string before adding it to this temp hash. Then using this temp value I was able to add to the DB using the stored procedure.

Can we aggregate dynamic number of rows using Talend Open Studio

I'm a beginner in Talend Open Studio, and I'm trying to do the transformation below.
From a SQL Table that contains:
DeltaStock Date
------------------------
+50 (initial stock) J0
+80 J1
-30 J2
... ...
I want to produce this table:
Stock Date
-----------
50 J0
130 J1
100 J2
... ...
Do you think this could be possible using TOS? I thought of using tAggregateRow, but I didn't find it appropriate to my issue.
There's probably an easier way to do this using the tMemorizeRows component but the first thought that comes to mind is to use the globalMap to store a rolling sum.
In Talend it is possible to store an object (any value or any type) in the globalMap so that it can be retrieved later on in the job. This is used automatically if you ever use a tFlowToIterate component which allows you to retrieve the values for that row that is being iterated on from the globalMap.
A very basic sample job might look like this:
In this we have a tJava component that only initialises the rolling sum in the globalMap with the following code:
//Initialise the rollingSum global variable
globalMap.put("rollingSum", 0);
After this we connect this component onSubjobOk to make sure we only carry on if we've managed to put the rollingSum into the globalMap.
I then provide my data using a tFixedFlowInput component which allows me to easily hardcode some values for this example job. You could easily replace this with any input. I have used your sample input data from the question:
We then process the data using a tJavaRow which will do some transformations on the data row by row. I've used the following code which works for this example:
//Initialise the operator and the value variables
String operator = "";
Integer value = 0;
//Get the current rolling sum
Integer rollingSum = (Integer) globalMap.get("rollingSum");
//Extract the operator
Pattern p = Pattern.compile("^([+-])([0-9]+)$");
Matcher m = p.matcher(input_row.deltaStock);
//If we have any matches from the regular expression search then extract the operator and the value
if (m.find()) {
operator = m.group(1);
value = Integer.parseInt(m.group(2));
}
//Conditional to use the operator
if ("+".equals(operator)) {
rollingSum += value;
} else if ("-".equals(operator)) {
rollingSum -= value;
} else {
System.out.println("The operator provided wasn't a + or a -");
}
//Put the new rollingSum back into the globalMap
globalMap.put("rollingSum", rollingSum);
//Output the data
output_row.stock = rollingSum;
output_row.date = input_row.date;
There's quite a lot going on there but basically it starts by getting the current rollingSum from the globalMap.
Next, it uses a regular expression to split up the deltaStock string into an operator and a value. From this it uses the operator provided (plus or minus) to either add the deltaStock to the rollingSum or subtract the deltaStock from the rollingSum.
After this it then adds the new rollingSum back into the globalMap and outputs the 2 columns of stock and date (unchanged).
In my sample job I then output the data using a tLogRow which will print the values of the data to the console. I typically select the table formatting option in it and in this case I get the following output:
.-----+----.
|tLogRow_8 |
|=----+---=|
|stock|date|
|=----+---=|
|50 |J0 |
|130 |J1 |
|100 |J2 |
'-----+----'
Which should be what you were looking for.
You should be able to do it in Talend Open Studio.
I attach here an image with the JOB, the content of the tJavaRow and the execution result.
I left under the tFixedFlowInput used to simulate the input a tJDBCInput that you should use to read the data from your DB. Hopefully you can use a specific tXXXInput for your DB instead of the generic JDBC one.
Here is some simple code in the tJavaRow.
//Code generated according to input schema and output schema
output_row.delta = input_row.delta;
output_row.date = input_row.date;
output_row.rollingSum =
Integer.parseInt(globalMap.get("rollingSum").toString());
int delta = Integer.parseInt(input_row.delta);
output_row.rollingSum += delta;
// Save rolling SUM for next round
globalMap.put("rollingSum", output_row.rollingSum);
Beware of the exceptions in the parseInt(). You should handle them the way you feel right.
In my projects I usually have a SafeParse library that does not throws exceptions but returns a default value I can pass together with the vale to be parsed.

Does varchar perform better than string in Hive?

Since version 0.12 Hive supports the VARCHAR data type.
Will VARCHAR provide better performance than STRING in a typical analytical Hive query?
In hive by default String is mapped to VARCHAR(32762) so this means
if value exceed 32762 then the value is truncated
if data does not require the maximum VARCHAR length for storage (for example, if the column never exceeds 100 characters), then it allocates unnecessary resources for the handling of that column
The default behavior for the STRING data type is to map the type to SQL data type of VARCHAR(32762), the default behavior can lead to performance issues
This explanation is on the basis of IBM BIG SQL which uses Hive implictly
IBM BIGINSIGHTS doc reference
varchar datatype is also saved internally as a String. The only difference I see is String is unbounded with a max value of 32,767 bytes and Varchar is bounded with a max value of 65,535 bytes. I don't think we will have any performance gain because the internal implementation for both the cases is String. I don't know much about hive internals but I could see the additional processing done by hive for truncating the varchar values. Below is the code (org.apache.hadoop.hive.common.type.HiveVarchar) :-
public static String enforceMaxLength(String val, int maxLength) {
String value = val;
if (maxLength > 0) {
int valLength = val.codePointCount(0, val.length());
if (valLength > maxLength) {
// Truncate the excess chars to fit the character length.
// Also make sure we take supplementary chars into account.
value = val.substring(0, val.offsetByCodePoints(0, maxLength));
}
}
return value;
}
If anyone has done performance analysis/benchmarking please share.

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all they keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently pig maps need the key to a chararray (string) that you supply and not a variable which contains a string. so in map#key the key has to be constant string that you supply (eg: map#'keyvalue').
The typical use case of this is to load a complex data structure one of the element being a key value pair and later in a foreach statement you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and even parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for your constructing a real solution.
Great question!
I personally do not like Maps in Pig. They have a place in traditional programming languages like Java, C# etc, wherein its really handy and fast to lookup a key in the map. On the other hand, Maps in Pig have very limited features.
As you rightly pointed, one can not lookup variable key in the Map in Pig. The key needs to be Constant. e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not allowed.
If you think about it, you do not need Map in Pig. One usually loads the data from some source, transforms it, joins it with some other dataset, filter it, transform it and so on. JOIN actually does a good job of looking up the variable keys in the data.
e.g. data1 has 2 columns A and B and data2 has 3 columns X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a Map (from traditional language) which maps value of column Z to value of column B (via column A). So data1 essentially represents a Map A -> B.
So why do we need Map in Pig?
Usually Hadoop data are the dumps of different data sources from Traditional languages. If original data sources contain Maps, the HDFS data would contain a corresponding Map.
How can one handle the Map data?
There are really 2 use cases:
Map keys are constants.
e.g. HttpRequest Header data contains time, server, clientIp as the keys in Map. to access value of a particular key, one case access them with Constant key.
e.g. header#'clientIp'.
Map keys are variables.
In these cases, you would most probably would want to JOIN the Map keys with some other data set. I usually convert the Map to Bag using UDF MapToBag, which converts map data into Bag of 2 field tuples (key, value). Once map data is converted to Bag of tuples, its easy to join it with other data sets.
I hope this helps.
1)If you want to load map data it should be like "[programming#SQL,rdbms#Oracle]"
2)If you want to load tuple data it should be like "(first_name_1234,middle_initial_1234,last_name_1234)"
3)If you want to load bag data it should be like"{(project_4567_1),(project_4567_2),(project_4567_3)}"
my file pigtest.csv like this
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
my schema:
a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;
dump b;
I think you need to think in term of relations and the map is just one field of one record. Then you can apply some operations on the relations, like joining the two sets data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple look up:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data and then convert and store in key value format to read for later use
data = load 'somedata.csv' using PigStorage(',')
STORE data into 'folder' using PigStorage('#')
and then read as a mapped data.

Resources