Does varchar perform better than string in Hive? - hadoop

Since version 0.12 Hive supports the VARCHAR data type.
Will VARCHAR provide better performance than STRING in a typical analytical Hive query?

In Hive, STRING is mapped by default to VARCHAR(32762), which means:
if a value exceeds 32762 characters, it is truncated;
if the data does not require the maximum VARCHAR length for storage (for example, if the column never exceeds 100 characters), unnecessary resources are allocated for handling that column.
So the default behavior of mapping the STRING data type to the SQL data type VARCHAR(32762) can lead to performance issues.
This explanation is based on IBM Big SQL, which uses Hive implicitly.
IBM BIGINSIGHTS doc reference
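As a rough illustration (not taken from the IBM docs), this is how you would declare an explicit bound instead of relying on the default mapping. The sketch assumes a HiveServer2 endpoint reachable through the standard Hive JDBC driver; the connection URL and table names are made up:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VarcharVsStringDdl {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint; org.apache.hive.jdbc.HiveDriver must be on the classpath.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      // Relies on the default STRING mapping described above (VARCHAR(32762) under Big SQL).
      stmt.execute("CREATE TABLE msgs_string (msg STRING)");
      // Declares an explicit bound when values are known to stay short, so the engine
      // does not have to assume the 32762-character maximum for this column.
      stmt.execute("CREATE TABLE msgs_varchar (msg VARCHAR(100))");
    }
  }
}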

The VARCHAR data type is also stored internally as a String. The only difference I see is that STRING is unbounded, with a max value of 32,767 bytes, while VARCHAR is bounded, with a max value of 65,535 bytes. I don't think we will get any performance gain, because the internal implementation in both cases is String. I don't know much about Hive internals, but I can see the additional processing Hive does to truncate VARCHAR values. Below is the relevant code (org.apache.hadoop.hive.common.type.HiveVarchar):
public static String enforceMaxLength(String val, int maxLength) {
  String value = val;
  if (maxLength > 0) {
    int valLength = val.codePointCount(0, val.length());
    if (valLength > maxLength) {
      // Truncate the excess chars to fit the character length.
      // Also make sure we take supplementary chars into account.
      value = val.substring(0, val.offsetByCodePoints(0, maxLength));
    }
  }
  return value;
}
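As a quick sketch of what that truncation does (the value and length below are arbitrary):
import org.apache.hadoop.hive.common.type.HiveVarchar;

public class EnforceMaxLengthDemo {
  public static void main(String[] args) {
    String original = "performance analysis";                       // 20 characters
    // As if the column were declared VARCHAR(11): excess characters are dropped.
    System.out.println(HiveVarchar.enforceMaxLength(original, 11));  // prints "performance"
    // A non-positive max length leaves the value untouched, per the check above.
    System.out.println(HiveVarchar.enforceMaxLength(original, 0));   // prints the full string
  }
}
So the extra work per value is just the code-point count plus a possible substring; any measurable difference should come from that check rather than from how the bytes are stored.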
If anyone has done performance analysis/benchmarking please share.

Related

What is the memory difference between serializing a value as a string and serializing it as another data type in protobuf?

Let's say I have a variable a to which I want to assign a decimal value.
If my proto file is
syntax = "proto3";
message Test{
string a = 1;
}
How much memory will that take, and how much difference will it make if I change string a = 1 to float a = 1?
Is there documentation that shows how much memory is assigned to the different data types?
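One way to check is to build a message and measure it. A minimal sketch, assuming Java classes generated from the proto above with protoc (the import/package name below is made up):
import com.example.proto.Test;  // hypothetical generated class

public class ProtoFieldSize {
  public static void main(String[] args) {
    Test asString = Test.newBuilder().setA("3.14159").build();
    // string field: 1 tag byte + 1 length byte + 7 UTF-8 bytes = 9 bytes on the wire
    System.out.println(asString.getSerializedSize());

    // If the field were declared `float a = 1;`, the generated setter would take a float and
    // the wire size would be fixed: 1 tag byte + 4 bytes (IEEE 754) = 5 bytes, regardless of
    // the value. In proto3, fields set to their default (empty string, 0.0f) are omitted entirely.
  }
}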

Does Oracle's ADO.NET data provider report wrong data types for NUMBER(x,y) and FLOAT or REAL?

I suspect that Oracle's data provider for ADO.NET reports wrong data types for columns whose type is NUMBER(x,y), FLOAT or REAL.
The following simple program creates a table with these data types and then prints the data types as reported by the data provider. When executed, it prints:
NUM: Double - System.Double - Double
FLT: Decimal - System.Decimal - Decimal
REL: Decimal - System.Decimal - Decimal
However, I feel it should be the other way round: a Decimal type for NUM and a Double type for FLT and REL.
Can someone confirm my suspicion?
using System;
using Oracle.DataAccess.Client;

class Prg {
  static void Main() {
    OracleConnection ora = new OracleConnection($"user Id=rene;password=rene;data source=ORA18");
    ora.Open();

    OracleCommand stmt = ora.CreateCommand();
    stmt.CommandText = "begin execute immediate 'drop table DataTypeTest'; exception when others then null; end;";
    stmt.ExecuteNonQuery();

    stmt.CommandText = "create table DataTypeTest (num number(10,3), flt float, rel real)";
    stmt.ExecuteNonQuery();

    stmt.CommandText = "select * from DataTypeTest";
    OracleDataReader res = stmt.ExecuteReader();

    for (int fld = 0; fld < res.FieldCount; fld++) {
      Console.WriteLine($"{res.GetName(fld)}: {res.GetDataTypeName(fld)} - {res.GetFieldType(fld)} - {Type.GetTypeCode(res.GetFieldType(fld))}");
    }
  }
}
From Oracle Datatypes Documentation:
FLOAT [(p)]
A subtype of the NUMBER datatype having precision p. A FLOAT value is represented internally as NUMBER. The precision p can range from 1 to 126 binary digits. A FLOAT value requires from 1 to 22 bytes.
And further down that documentation page:
ANSI SQL Datatype: REAL (Note d)
Oracle Datatype: FLOAT(63)
Notes 1d: The REAL datatype is a floating-point number with a binary precision of 63, or 18 decimal.
So REAL is a sub-type of FLOAT which, in turn, is a sub-type of NUMBER; NUMBER should allow the greatest precision and REAL the least precision of these data types.
Looking at:
ODBC Data Type Mappings:
ODBC type      .NET Framework type
-----------    -------------------
SQL_REAL       Single
SQL_NUMERIC    Decimal
SQL_DOUBLE     Double
(SQL_FLOAT was not listed in the table and may default to SQL_NUMERIC's mapping since one is a sub-type of the other.)
It does appear that your test results do not match these mappings.
However:
Since both REAL and FLOAT are sub-types of NUMBER in the Oracle database, if they are being reported as their NUMBER super-type rather than as their specific sub-types, then the mapping to Decimal does match the type mapping.
Your NUM column is of type NUMBER(10,3), so it can have at most 7 whole digits and 3 decimal places; the data provider may have determined that this can be accurately stored in a Double and that a Decimal would be overkill. You can check whether a different data type is returned by using NUMBER (without precision or scale) or NUMBER(38,3).

Data structure for occurrence counting in long tail distribution

I have a big list of elements (tens of millions).
I am trying to count the number of occurrences of several subsets of these elements.
The occurrence distribution is long-tailed.
The data structure currently looks like this (in an OCaml-ish flavor):
type element_key
type element_aggr_key

type raw_data = element_key list

type element_stat = {
  occurrence : (element_key, int) Hashtbl.t;
}

type stat = {
  element_stat_hashtable : (element_aggr_key, element_stat) Hashtbl.t;
}
element_stat currently uses a hashtable where the key is an element and the value is an integer count. However, this is inefficient because, when many elements occur only once, the occurrence hashtable is resized many times.
I cannot avoid resizing the occurrence hashtable by setting a big initial size, because there are many element_stat instances (the hashtable in stat is large).
I would like to know if there is a more efficient (memory-wise and/or insertion-wise) data structure for this use case. I have found many existing data structures such as tries, radix trees, and Judy arrays, but I have trouble understanding their differences and whether they fit my problem.
What you have here is a table mapping element_aggr_key to tables that in turn map element_key to int. For all practical purposes, this is equivalent to a single table that maps element_aggr_key * element_key to int, so you could do:
type stat = (element_aggr_key * element_key, int) Hashtbl.t
Then you have a single hash table, and you can give it a huge initial size.

RCFile - emitting GZip compressed int columns

For some reason, Hive is not recognizing columns emitted as integers, but does recognize columns emitted as strings.
Is there something about Hive or RCFile or GZ that is preventing proper rendering of int?
My Hive DDL looks like:
create external table if not exists db.table (intField int, strField string) stored as rcfile location '/path/to/my/data';
And the relevant portion of my Java looks like:
BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(2);
byte[] byteArray;

BytesRefWritable bytesRefWritable = new BytesRefWritable();
intWritable.set(myObj.getIntField());
byteArray = WritableUtils.toByteArray(intWritable); // toByteArray takes the Writable itself
bytesRefWritable.set(byteArray, 0, byteArray.length);
dataWrite.set(0, bytesRefWritable); // sets int field as column 0

bytesRefWritable = new BytesRefWritable();
textWritable.set(myObj.getStrField());
bytesRefWritable.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(1, bytesRefWritable); // sets str field as column 1
The code runs fine, and through logging I can see the various Writables have bytes within them.
Hive can read the external table as well, but the int field shows up as NULL, indicating some error.
SELECT * from db.table;
OK
NULL my string field
Time taken: 0.647 seconds
Any idea what might be going on here?
So, I'm not sure exactly why this is the case, but I got it working using the following method:
In the code that writes the byte array representing the integer value, instead of using WritableUtils.toByteArray(), I call Text.set(Integer.toString(intVal)) and then take the bytes of that Text object.
In other words, I convert the integer to its String representation, and use the Text writable object to get the byte array as if it were a string.
Then, in my Hive DDL, I can call the column an int and it interprets it correctly.
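In code, the change looks roughly like this (a sketch based on the snippet in the question; intAsText is just a reused org.apache.hadoop.io.Text instance):
Text intAsText = new Text();
intAsText.set(Integer.toString(myObj.getIntField()));       // e.g. 42 -> "42"

BytesRefWritable intColumn = new BytesRefWritable();
intColumn.set(intAsText.getBytes(), 0, intAsText.getLength());
dataWrite.set(0, intColumn);                                 // column 0 is declared int in the DDL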
I'm not sure what was initially causing the problem, be it a bug in WritableUtils, some incompatibility with compressed integer byte arrays, or a faulty understanding of how this stuff works on my part. In any event, the solution described above successfully meets the task's needs.

Generating integer within range from unique string in ruby

I have code that should take a unique string (for example, "d86c52ec8b7e8a2ea315109627888fe6228d") from a client and return an integer greater than 2200000000 and less than 5800000000. It is important that this generated int is not random; it should always be the same for a given unique string. What is the best way to generate it without using a DB?
Now it looks like this:
did = "d86c52ec8b7e8a2ea315109627888fe6228d"
min_cid = 2200000000
max_cid = 5800000000
cid = did.hash.abs.to_s.split.last(10).to_s.to_i
if cid < min_cid
cid += min_cid
else
while cid > max_cid
cid -= 1000000000
end
end
Here's the problem: your range of numbers has only about 3.6x10^9 possible values, whereas your sample unique string (which looks like a hex integer with 36 digits) has 16^36 possible values, i.e. vastly more. So when mapping your strings into your integer range there will be collisions.
The mapping function itself can be pretty straightforward; I would do something like the following (also, consider using only part of the input string for the integer conversion, e.g. the first seven digits, if performance becomes critical):
def my_hash(str, min, max)
  range = (max - min).abs
  (str.to_i(16) % range) + min
end

my_hash(did, min_cid, max_cid) # => 2461595789
[Edit] If you are using Ruby 1.8 and your adjusted range can be represented as a Fixnum, just use the hash value of the input string object instead of parsing it as a big integer. Note that this strategy might not be safe in Ruby 1.9 (per the comment by @DataWraith), as object hash values may be randomized between invocations of the interpreter, so you would not get the same hash number for the same input string when you restart your application:
def hash_range(obj, min, max)
  (obj.hash % (max - min).abs) + [min, max].min
end

hash_range(did, min_cid, max_cid) # => 3886226395
And, of course, you'll have to decide what to do about collisions. You'll likely have to persist a bucket of input strings which map to the same value and decide how to resolve the conflicts if you are looking up by the mapped value.
You could generate a 32-bit CRC, drop one bit, and add the result to the lower bound of 2,200,000,000. That gives you a maximum value of about 4,350,000,000, which stays inside the required range.
Alternatively, you could use all 32 bits of the CRC, but when the result is too large, append a zero to the input string and recalculate, repeating until you get a value in range.
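A sketch of the drop-one-bit variant (shown here in Java for illustration; Ruby's Zlib.crc32 computes the same standard CRC-32, and the constants are just the bounds from the question):
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class StableIdFromString {
  static long stableId(String did) {
    CRC32 crc = new CRC32();
    crc.update(did.getBytes(StandardCharsets.UTF_8));
    long low31 = crc.getValue() & 0x7FFF_FFFFL;  // keep 31 bits: 0 .. 2,147,483,647
    return 2_200_000_000L + low31;               // 2,200,000,000 .. 4,347,483,647, inside the range
  }

  public static void main(String[] args) {
    // The same input always yields the same id; different inputs can still collide.
    System.out.println(stableId("d86c52ec8b7e8a2ea315109627888fe6228d"));
  }
}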
