How to choose data type for creating table in hive - hadoop

I have to ingest data into a Hive table from HDFS, but I don't know how to choose the correct data types for the data shown below:
$34740$#$Disrupt Worldwide LLC$#$40425$#$null$#$13$#$6$#$317903$#$null$#$Scott Bodily$#$+$#$null$#$10$#$0$#$1$#$0$#$disruptcentral.com$#$null$#$null$#$1$#$null$#$null$#$null$#$Scott Bodily$#$1220DB56-56D7-E411-80D6-005056A451E3$#$true$
$34741$#$The Top Tipster Leagues Limited$#$35605$#$null$#$13$#$7$#$317902$#$null$#$AM Support Team$#$+447886 027371$#$null$#$1$#$1$#$1$#$0$#$www.toptipsterleagues.com, www.toptipsterleagues.co.uk, http://test.toptipsterleague.com$#$Jamil Johnson$#$Cheng Liem Li$#$1$#$0.70$#$1.50$#$1.30$#$Bono van Nijnatten$#$0B758BF9-F1D6-E411-80D7-005056A44C5C$#$true$

Refer to the Hive documentation for the list of available data types.
For everything other than the numeric and decimal fields you can use the STRING data type. For the numeric fields, choose INT or DECIMAL based on the range and precision.
Note that STRING, VARCHAR, or any other string type will read the null in your data as the literal string "null". To handle nulls properly you should set the table property shown below:
ALTER TABLE tablename SET
SERDEPROPERTIES ('serialization.null.format' = 'null');
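As a rough sketch, a matching table definition could look like the one below. The column names are invented for illustration and only a few of the fields are shown; the multi-character '$#$' delimiter is assumed to be handled by MultiDelimitSerDe (its package name varies between Hive versions):
CREATE TABLE company_data (
  id            INT,
  company_name  STRING,
  account_id    INT,
  contact_name  STRING,
  phone         STRING,
  website       STRING,
  score         DECIMAL(5,2),
  record_guid   STRING,
  is_active     BOOLEAN
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = '$#$',
  -- as discussed above, read the literal string 'null' as SQL NULL
  'serialization.null.format' = 'null'
);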
Let me know if anything more is needed on this.


Query/flatten a serialized protobuf stored as a string column in Clickhouse

TLDR
I have a column of serialized protobufs in a table in Clickhouse, and I would like to flatten those protobufs into another table via a materialized view. How do I format the materialized view to do this?
Details
I have a RabbitMQ queue which serves messages to my Clickhouse server. The outer message, defined in outer.proto, consists of a service name and a serialized protobuf payload:
//outer.proto
syntax = "proto2";

message WrappedMessage {
  required string svc_id = 1;
  required bytes msg = 2;
}
This service name and payload are then stored in Clickhouse as such:
CREATE TABLE raw_records_rmq
(
    svc_id String NOT NULL
    , msg String NOT NULL
) ENGINE = RabbitMQ SETTINGS
    -- skip rabbitmq connection settings, this part works
    rabbitmq_format = 'ProtobufSingle',
    rabbitmq_schema = 'outer:WrappedMessage'
;
CREATE TABLE raw_records
(
svc_id String NOT NULL
, msg String NOT NULL
) ENGINE = MergeTree ORDER BY tuple()
;
CREATE MATERIALIZED VIEW raw_mv
TO raw_records AS
SELECT * FROM raw_records_rmq
This process works as expected, with the serialized message stored in raw_records.msg. The message is defined as such:
//inner.proto
syntax = "proto2";

message Person {
  optional uint32 id = 1;
  optional string name = 2;
  optional bool arb_bool = 3;
}
I would now like to query the contents of the stored message; to simplify this, I create a destination table:
CREATE TABLE people
(
id UInt32
, name String
, arb_bool UInt8
) ENGINE = MergeTree ORDER BY tuple()
But this is where my success stops. My attempt so far has been to select the column in a subquery and then parse the results as protobufs using the Clickhouse FORMAT and SETTINGS clauses, as described in their documentation:
CREATE MATERIALIZED VIEW mv
TO people AS
SELECT * FROM (SELECT proto FROM raw_records)
FORMAT ProtobufSingle
SETTINGS format_schema='inner:Person'
However, this fails to unpack the protobuf message. Changing from a Materialized View to a standard View shows that Clickhouse is returning only the single column specified in the subquery with the entire protobuf message as each result.
Any advice on how to properly format this materialized view, or alternatives for processing protobufs-inside-protobufs would be greatly appreciated!
You can't expand a single column into multiple columns this way.
You can use the Protobuf format, but it applies to the entire protobuf message, not to one column: https://clickhouse.com/docs/en/interfaces/formats/#protobuf
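If the producer can be changed to publish the inner Person message directly instead of the wrapped one (an assumption, not something the question states is possible), the RabbitMQ engine table can deserialize it on its own and feed the people table without any nested parsing, e.g.:
CREATE TABLE people_rmq
(
    id UInt32
    , name String
    , arb_bool UInt8
) ENGINE = RabbitMQ SETTINGS
    -- rabbitmq connection settings omitted, as in the question
    rabbitmq_format = 'ProtobufSingle',
    rabbitmq_schema = 'inner:Person'
;

CREATE MATERIALIZED VIEW people_mv TO people AS
SELECT id, name, arb_bool FROM people_rmq;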

How to update table from JSON flowfile

I have flowfiles with the structure below:
{
  "PN" : "U0-WH",
  "INPUT_DATE" : "44252.699895833335",
  "LABEL" : "Marker",
  "STATUS" : "Approved"
}
and I need to execute an update statement using some fields
update table1 set column1 = 'value' where pn=${PN}
I found ConvertJSONToSQL but I'm not sure how to use it in this case.
You can use the ConvertJSONToSQL processor to convert your JSON into an UPDATE query.
It takes the following properties:
1. JDBC Connection Pool: a JDBC connection pool that holds the DB connection information.
2. Statement Type: the type of statement you want to create. In your case it's 'UPDATE'.
3. Table Name: the name of the table for which the update query needs to be created.
4. Schema Name: the name of the schema in your database.
5. Translate Field Names: if true, the processor will attempt to translate JSON field names into the appropriate column names for the specified table. If false, the JSON field names must match the column names exactly, or the column will not be updated.
6. Unmatched Field Behavior: if an incoming JSON element has a field that does not map to any of the database table's columns, this property specifies how to handle the situation.
7. Unmatched Column Behavior: if an incoming JSON element does not have a field mapping for all of the database table's columns, this property specifies how to handle the situation.
8. Update Keys: a comma-separated list of column names that uniquely identifies a row in the database for UPDATE statements. If the Statement Type is UPDATE and this property is not set, the table's primary keys are used. In that case, if no primary key exists, the conversion to SQL will fail if Unmatched Column Behavior is set to FAIL. This property is ignored if the Statement Type is INSERT. It supports Expression Language (evaluated using flowfile attributes and the variable registry).
Read the descriptions above and try to set the properties accordingly. A detailed description of the processor is given in the ConvertJSONToSQL documentation.
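For example, with Statement Type = UPDATE, Table Name = table1 and Update Keys = PN, the sample flowfile above would be converted into a parameterized statement roughly like the one below (assuming the other fields map to columns of table1); the values end up in sql.args.N.value attributes, and a downstream PutSQL processor binds and executes the statement:
UPDATE table1 SET INPUT_DATE = ?, LABEL = ?, STATUS = ? WHERE PN = ?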

How to return query results from Cosmos DB ordered by a date string?

I have a Cosmos DB collection. I need to query all documents and return them in order of creation date. Creation date is a defined field, but for historical reasons it is stored as a string in MM/dd/yyyy format, for example 02/09/2019. If I just order by this string, the ordering is wrong.
I am using LINQ lambdas to write my query in my Web API. I have tried to parse the string and to convert it; both returned "method not supported".
Here is my query:
var query = Client.CreateDocumentQuery<MyModel>(CollectionLink)
    .Where(f => f.ModelType == typeof(MyModel).Name.ToLower() && f.Language == getMyModelsRequestModel.Language)
    .OrderByDescending(f => f.CreationDate)
    .AsDocumentQuery();
I'd appreciate any advice. Thanks. It would be a huge effort to go back and modify the format of the field (which affects many other things), so I'd like to avoid that if possible.
Chen Wang, since ORDER BY does not support derived values or subqueries, you need to sort the derived values yourself, I think.
You could convert MM/dd/yyyy to yyyyMMdd with a UDF in Cosmos DB.
udf:
function getValue(datetime){
return datetime.substring(6,10)+datetime.substring(0,2)+datetime.substring(3,5);
}
sql:
SELECT udf.getValue(c.time) as time from c
Then you could sort the array by a property value of the class in your C# code. Please follow this case: How to sort an array containing class objects by a property value of a class instance?
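Putting the pieces together, a sketch of the query (assuming the stored JSON property is called CreationDate; sortKey is just an illustrative alias) that returns each document together with the computed key you then sort by on the client:
SELECT c.id, c.CreationDate, udf.getValue(c.CreationDate) AS sortKey FROM c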

Can I control how Oracle maps the integer types in ADO.NET?

I've got a legacy database that was created with the database type INTEGER for many (1,000+) Oracle columns. A database with the same structure exists for MS SQL. As I was told, the original definitions were created using a tool that generated the DBMS-specific scripts for MS SQL and Oracle from a logical model.
Using C++ and MFC, the columns were mapped nicely to the integer type for both DBMSs.
I am porting this application to .NET and C#. The same C# codebase is used to access both MS SQL and Oracle. We use the same DataSets and logic and we need the same types (int32 for both).
The ODP.NET driver from Oracle maps them to Decimal. This is logical as Oracle created the integer columns as NUMBER(37) automatically. The columns in MS SQL map to int32.
Can I somehow control how to map the types in the ODP.NET driver? I would like to say something like "map NUMBER(37) to int32". The columns will never hold values bigger than the limits of an int32. We know this because it is being used in the MS SQL version.
Alternatively, can I modify all columns from NUMBER(37) to NUMBER(8) or SIMPLE_INTEGER so that they map to the right type for us? Many of these columns are used as primary keys (think autoincrement).
Regarding type mapping, I hope this is what you need:
http://docs.oracle.com/cd/E51173_01/win.122/e17732/entityDataTypeMapping.htm#ODPNT8300
Regarding the type change: if the table is empty, you may use the following script (just replace [YOUR_TABLE_NAME] with the table name in upper case):
DECLARE
    v_table_name CONSTANT VARCHAR2(30) := '[YOUR_TABLE_NAME]';
BEGIN
    -- data_precision (not data_length) holds the declared precision of NUMBER columns
    FOR col IN (SELECT * FROM user_tab_columns
                WHERE table_name = v_table_name
                  AND data_type = 'NUMBER'
                  AND data_precision = 37)
    LOOP
        EXECUTE IMMEDIATE 'ALTER TABLE '||v_table_name||' MODIFY '||col.column_name||' NUMBER(8)';
    END LOOP;
END;
If some of these columns are not empty, you can't decrease their precision in place.
If you don't have too much data, you may move it to a temp table:
create table temp_table as select * from [YOUR_TABLE_NAME];
then truncate the original table:
truncate table [YOUR_TABLE_NAME];
then run the script above, and then move the data back:
insert /*+ append */ into [YOUR_TABLE_NAME] select * from temp_table;
commit;
If the amount of data is substantial, it is better to move it only once. In that case it is faster to create a new table with the correct data types and all indexes, constraints and so on, then move the data, and then rename both tables so that the new table gets the proper name.
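A sketch of that approach (table and column names here are placeholders; indexes, constraints, grants and triggers still have to be recreated on the new table before the switch):
create table your_table_new as
  select cast(id as number(8)) as id, other_column
  from your_table;
alter table your_table rename to your_table_old;
alter table your_table_new rename to your_table;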
Unfortunately the mapping of numeric types between .NET and Oracle is hardcoded in the OracleDataReader class.
In general I prefer to set up appropriate data types in the database, so if possible I would change the column data types, because they better represent the actual values and their constraints.
Another option is to wrap the tables in views that cast the columns to NUMBER(8), but this will negatively impact execution plans because it prevents index lookups.
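A minimal sketch of such a view (table and column names are placeholders):
CREATE OR REPLACE VIEW your_table_v AS
SELECT CAST(id AS NUMBER(8)) AS id,
       name
FROM your_table;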
Then you also have some application-level options:
Implement your own data reader or a subset of the ADO.NET classes (inheriting from DbProviderFactory, DbConnection, DbCommand, DbDataReader, etc. and wrapping the Oracle classes), depending on how complex your implementation needs to be. Oracle.DataAccess, Devart and all other providers do exactly the same, because it gives total control over everything, including any magic with the data types. If the data type conversion is the only thing you want to achieve, most of the implementation would just call the wrapped class methods/properties.
If you have access to the OracleDataReader after the command is executed and before you start reading it, you can do a simple hack and set the resulting numeric type using reflection (the following implementation is just a simplified demonstration).
However, this will not work with ExecuteScalar, as that method never exposes the underlying data reader.
// Namespaces needed for the snippet and the extension class below:
// using System;
// using System.Collections;
// using System.Reflection;
// using Oracle.DataAccess.Client;

var connection = new OracleConnection("DATA SOURCE=HQ_PDB_TCP;PASSWORD=oracle;USER ID=HUSQVIK");
connection.Open();
var command = connection.CreateCommand();
command.CommandText = "SELECT 1 FROM DUAL";
var reader = command.ExecuteDatabaseReader();
reader.Read();
Console.WriteLine(reader[0].GetType().FullName);
Console.WriteLine(reader.GetFieldType(0).FullName);
public static class DataReaderExtensions
{
    // Private ODP.NET members accessed via reflection
    private static readonly FieldInfo NumericAccessorField = typeof(OracleDataReader).GetField("m_dotNetNumericAccessor", BindingFlags.NonPublic | BindingFlags.Instance);
    private static readonly object Int32DotNetNumericAccessor = Enum.Parse(typeof(OracleDataReader).Assembly.GetType("Oracle.DataAccess.Client.DotNetNumericAccessor"), "GetInt32");
    private static readonly FieldInfo MetadataField = typeof(OracleDataReader).GetField("m_metaData", BindingFlags.NonPublic | BindingFlags.Instance);
    private static readonly FieldInfo FieldTypesField = typeof(OracleDataReader).Assembly.GetType("Oracle.DataAccess.Client.MetaData").GetField("m_fieldTypes", BindingFlags.NonPublic | BindingFlags.Instance);

    public static OracleDataReader ExecuteDatabaseReader(this OracleCommand command)
    {
        var reader = command.ExecuteReader();

        // Force the first column's numeric accessor to Int32 instead of the default Decimal
        var columnNumericAccessors = (IList)NumericAccessorField.GetValue(reader);
        columnNumericAccessors[0] = Int32DotNetNumericAccessor;

        // Also patch the metadata so GetFieldType reports Int32
        var metadata = MetadataField.GetValue(reader);
        var fieldTypes = (Type[])FieldTypesField.GetValue(metadata);
        fieldTypes[0] = typeof(Int32);

        return reader;
    }
}
I implemented an extension method for command execution that returns the reader with the desired column numeric types set up. Without setting the numeric accessor (it's just the internal enum Oracle.DataAccess.Client.DotNetNumericAccessor) you will get System.Decimal; with the accessor set you get Int32. Using this you can get Int16, Int32, Int64, Float or Double.
The columnNumericAccessors index is a column index, and it is applied only to numeric types; if the column is DATE or VARCHAR the numeric accessor is simply ignored. If your implementation doesn't expose the provider-specific type, make the extension method work on IDbCommand or DbCommand and then safe-cast the DbDataReader to OracleDataReader.
EDIT: Added the hack for the GetFieldType method. It is possible that the static mapping hashtable gets updated, which could have unwanted side effects, so you need to test it properly. The fieldTypes array holds the types returned for all columns of the data reader.

Extracting an Array of Structs in Hive

I have an external table in hive
CREATE EXTERNAL TABLE FOO (
TS string,
customerId string,
products array< struct <productCategory:string, productId:string> >
)
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'some.serde'
WITH SERDEPROPERTIES ('error.ignore'='true')
LOCATION 'some_locations'
;
A record of the table may hold data such as:
1340321132000, 'some_company', [{"productCategory":"footwear","productId":"nik3756"},{"productCategory":"eyewear","productId":"oak2449"}]
Does anyone know if there is a way to simply extract all the productCategory values from this record and return them as an array of product categories, without using explode? Something like the following:
["footwear", "eyewear"]
Or do I need to write my own GenericUDF? If so, I do not know much Java (I'm a Ruby person); can someone give me some hints? I read some instructions on UDFs from Apache Hive, but I do not know which collection type is best for handling arrays, and which one for handling structs.
===
I have somewhat answered this question by writing a GenericUDF, but I ran into two other problems. It is in this SO question.
You can use a JSON SerDe or the built-in functions get_json_object and json_tuple.
With rcongiu's Hive-JSON-Serde the usage would be:
Define the table:
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>)
Load sample JSON into it (it is important for this data to be on a single line):
{"DocId":"ABC","Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}
Then fetching the order ids is as easy as:
SELECT Orders.ItemId FROM complex_json LIMIT 100;
It will return the list of ids for you:
itemid
[1111,2222]
This is proven to return correct results in my environment. Full listing:
add jar hdfs:///tmp/json-serde-1.3.6.jar;
CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA INPATH '/tmp/test.json' OVERWRITE INTO TABLE complex_json;
SELECT Orders.ItemId FROM complex_json LIMIT 100;
Read more here:
http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html
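Applied to the FOO table from the original question, the same dot projection over the array of structs should return exactly the array asked for, without explode:
SELECT products.productCategory FROM FOO;
-- e.g. ["footwear","eyewear"]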
One way would be to use either the inline or explode functions, like so:
SELECT
    TS,
    customerId,
    pCat,
    pId
FROM FOO
LATERAL VIEW inline(products) p AS pCat, pId
Otherwise you can write a UDF. Check out these two posts for that, along with the following resources:
Matthew Rathbone's guide to writing generic UDFs
Mark Grover's how to guide
the baynote blog post on generic UDFs
If the size of the array is fixed (like 2), please try:
products[0].productCategory, products[1].productCategory
But if not, a UDF should be the right solution. I guess you could do it in JRuby. Good luck!
