Query/flatten a serialized protobuf stored as a string column in ClickHouse - protocol-buffers

TLDR
I have a column of serialized protobufs in a table in ClickHouse, and I would like to flatten those protobufs into another table via a materialized view. How do I format the materialized view to do this?
Details
I have a RabbitMQ queue which serves messages to my ClickHouse server. The outer message, defined in outer.proto, consists of a service name and a serialized protobuf message payload:
//outer.proto
syntax = "proto2";
message WrappedMessage {
required string svc_id = 1;
required bytes msg = 2;
}
The service name and payload are then stored in ClickHouse like this:
CREATE TABLE raw_records_rmq
(
    svc_id String NOT NULL
    , msg String NOT NULL
) ENGINE = RabbitMQ SETTINGS
    -- skip rabbitmq settings, this part works
    rabbitmq_format = 'ProtobufSingle',
    rabbitmq_schema = 'outer:WrappedMessage'
;
CREATE TABLE raw_records
(
    svc_id String NOT NULL
    , msg String NOT NULL
) ENGINE = MergeTree ORDER BY tuple()
;
CREATE MATERIALIZED VIEW raw_mv
TO raw_records AS
SELECT * FROM raw_records_rmq
This process works as expected, with the serialized inner message stored in raw_records.msg. That message is defined as follows:
//inner.proto
syntax = "proto2";
message Person {
optional uint32 id = 1;
optional string name = 2;
optional bool arb_bool = 3;
}
I would now like to query the contents of the stored message; to simplify this, I create a destination table:
CREATE TABLE people
(
    id UInt32
    , name String
    , arb_bool UInt8
) ENGINE = MergeTree ORDER BY tuple()
But this is where my success stops. My attempt so far has been to select the column in a subquery and then parse the results as protobufs using ClickHouse's FORMAT and SETTINGS clauses, as described in the documentation:
CREATE MATERIALIZED VIEW mv
TO people AS
SELECT * FROM (SELECT msg FROM raw_records)
FORMAT ProtobufSingle
SETTINGS format_schema = 'inner:Person'
However, this fails to unpack the protobuf message. Changing from a materialized view to a standard view shows that ClickHouse returns only the single column specified in the subquery, with the entire serialized protobuf message as each value.
Any advice on how to properly format this materialized view, or alternatives for processing protobufs-inside-protobufs would be greatly appreciated!

You can't convert a single column into multiple values.
You can use the protobuf format but it will be the entire protobuf message: https://clickhouse.com/docs/en/interfaces/formats/#protobuf

Related

How to upload Query Result from Snowflake to S3 Directly?

I have a query interface where the user writes a SQL query and gets the result; the warehouse we use to query the data and display the result is Snowflake. We use the Snowflake JDBC driver to establish a connection, asynchronously queue the query, get a query ID (UUID) from Snowflake, and use that query ID to poll the status and fetch the result.
Sample Code:
try {
    ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
    int numColumns = resultSetMetaData.getColumnCount();
    for (int i = 1; i <= numColumns; i++) {
        arrayNode.add(objectMapper.createObjectNode()
                .put("name", resultSetMetaData.getColumnName(i))
                .put("attribute_number", i)
                .put("data_type", resultSetMetaData.getColumnTypeName(i))
                .put("type_modifier", (Short) null)
                .put("scale", resultSetMetaData.getScale(i))
                .put("precision", resultSetMetaData.getPrecision(i)));
    }
    rootNode.set("metadata", arrayNode);

    arrayNode = objectMapper.createArrayNode();
    while (resultSet.next()) {
        ObjectNode resultObjectNode = objectMapper.createObjectNode();
        for (int i = 1; i <= numColumns; i++) {
            String columnName = resultSetMetaData.getColumnName(i);
            resultObjectNode.put(columnName, resultSet.getString(i));
        }
        arrayNode.add(resultObjectNode);
    }
    rootNode.set("results", arrayNode);

    // TODO: Instead of returning the entire result string, send it in chunk to S3 utility class for upload
    resultSet.close();
    jsonString = objectMapper.writeValueAsString(rootNode);
}
As you can see, our use case requires sending the metadata (column details) along with the result. The result set is then uploaded to S3 and users are given an S3 link to view the results.
I am trying to figure out whether this scenario can be handled by Snowflake itself, i.e. whether Snowflake can generate the metadata for the query and upload the result set to a user-defined bucket so that consumers of Snowflake don't have to do this themselves. I have read about Snowflake streams and COPY from stages. Can someone help me understand whether this is feasible and, if so, how it can be achieved?
Is there any way to upload the result of a query, identified by its query ID, from Snowflake directly to S3 without fetching it and uploading it myself?
You can store the results in an S3 bucket using the COPY command. This is a simplified example showing the process on a temporary internal stage. For your use case, you would create and use an external stage in S3:
create temp stage FOO;
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION";
copy into @FOO from (select * from table(result_scan(last_query_id())));
The reason you want to use COPY from a previous select is that the COPY command is somewhat limited in what it can use for the query. By running the query as a regular select first and then running a select * from that result, you get past those limitations.
The COPY command supports other file formats. As written above, it will use the default CSV format; you can also specify JSON, Parquet, or a custom delimited format using a named file format.
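For the asker's scenario, a hedged sketch of the external-stage variant might look like the following; the stage name, bucket path, credentials, and output prefix are placeholders, not part of the original answer:
-- Sketch only: stage name, bucket, credentials and file prefix are placeholders.
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/query-results/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Run the query normally first (or pass a known query ID to RESULT_SCAN instead of LAST_QUERY_ID()).
SELECT * FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION";

-- Unload the result set of the previous query to the external stage as CSV with a header row.
COPY INTO @my_s3_stage/nation_
  FROM (SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())))
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  HEADER = TRUE;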
https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html

Maximo MAXINTMSGTRK table: How to extract text from MSGDATA column? (HUGEBLOB)

I'm attempting to extract the text from the MSGDATA column (HUGEBLOB) in the MAXINTMSGTRK table:
I've tried the options outlined here: How to query hugeblob data:
select
msg.*,
utl_raw.cast_to_varchar2(dbms_lob.substr(msgdata,1000,1)) msgdata_expanded,
dbms_lob.substr(msgdata, 1000,1) msgdata_expanded_2
from
maxintmsgtrk msg
where
rownum = 1
However, the output is not text.
How can I extract text from MSGDATA column?
It is possible to do this using an automation script, uncompressing the data with the psdi.iface.jms.MessageUtil class:
from psdi.iface.jms import MessageUtil
...
msgdata_blob = maxintmsgtrkMbo.getBytes("msgdata")
byteArray = MessageUtil.uncompressMessage(msgdata_blob, maxintmsgtrkMbo.getLong("msglength"))
msgdata_clob = ""
for symb1 in byteArray:
    msgdata_clob = msgdata_clob + chr(symb1)
It sounds like it's not possible because the value is compressed:
Starting in Maximo 7.6, the messages written by the Message Tracking
application are stored in the database. They are no longer written as
xml files as in previous versions.
Customers have asked how to search and view MSGDATA data from the
MAXINTMSGTRK table.
It is not possible to search or retrieve the data in the maxintmsgtrk
table in 7.6 using SQL. The BLOB field is stored compressed.
MIF 7.6 Message tracking changes

How to choose data type for creating table in hive

I have to ingest data into a Hive table from HDFS, but I don't know how to choose the correct data types for the data shown below:
$34740$#$Disrupt Worldwide LLC$#$40425$#$null$#$13$#$6$#$317903$#$null$#$Scott Bodily$#$+$#$null$#$10$#$0$#$1$#$0$#$disruptcentral.com$#$null$#$null$#$1$#$null$#$null$#$null$#$Scott Bodily$#$1220DB56-56D7-E411-80D6-005056A451E3$#$true$
$34741$#$The Top Tipster Leagues Limited$#$35605$#$null$#$13$#$7$#$317902$#$null$#$AM Support Team$#$+447886 027371$#$null$#$1$#$1$#$1$#$0$#$www.toptipsterleagues.com, www.toptipsterleagues.co.uk, http://test.toptipsterleague.com$#$Jamil Johnson$#$Cheng Liem Li$#$1$#$0.70$#$1.50$#$1.30$#$Bono van Nijnatten$#$0B758BF9-F1D6-E411-80D7-005056A44C5C$#$true$
Refer to the Hive documentation for the list of available data types.
Other than the numeric and decimal fields, you can use the STRING data type. For the numeric fields, choose INT or DECIMAL based on the range and precision.
Using STRING, VARCHAR, or any other string data type will read the null in your data as the literal string "null". To handle nulls you should set the table property shown below:
ALTER TABLE tablename SET
SERDEPROPERTIES ('serialization.null.format' = 'null');
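Putting the type advice together, here is a minimal sketch of what the table definition could look like. The table and column names are invented, only the first few of the $#$-delimited fields are mapped, and MultiDelimitSerDe (whose class path varies by Hive version) is assumed for the multi-character delimiter; none of this is from the original answer:
-- Sketch only: illustrative names; the real table needs one column per delimited field.
CREATE TABLE accounts (
    record_id    INT,
    company_name STRING,
    parent_id    INT,
    region_id    INT,
    priority     INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
    'field.delim' = '$#$',
    'serialization.null.format' = 'null'
)
STORED AS TEXTFILE;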
Let me know if anything else is needed on this.

Can I control how Oracle maps the integer types in ADO.NET?

I've got a legacy database that was created with the database type INTEGER for many (1,000+) Oracle columns. A database with the same structure exists for MS SQL. As I was told, the original definitions were generated by a tool that turned a logical model into DBMS-specific scripts for MS SQL and Oracle.
Using C++ and MFC, the columns were mapped nicely to the integer type for both DBMSs.
I am porting this application to .NET and C#. The same C# codebase is used to access both MS SQL and Oracle. We use the same DataSets and logic, and we need the same types (Int32 for both).
The ODP.NET driver from Oracle maps them to Decimal. This is logical as Oracle created the integer columns as NUMBER(37) automatically. The columns in MS SQL map to int32.
Can I somehow control how to map the types in the ODP.NET driver? I would like to say something like "map NUMBER(37) to int32". The columns will never hold values bigger than the limits of an int32. We know this because it is being used in the MS SQL version.
Alternatively, can I modify all columns from NUMBER(37) to NUMBER(8) or SIMPLE_INTEGER so that they map to the right type for us? Many of these columns are used as primary keys (think autoincrement).
Regarding type mapping, I hope this is what you need:
http://docs.oracle.com/cd/E51173_01/win.122/e17732/entityDataTypeMapping.htm#ODPNT8300
Regarding the type change, if the table is empty you may use the following script (just replace [YOUR_TABLE_NAME] with the table name in upper case):
DECLARE
    v_table_name CONSTANT VARCHAR2(30) := '[YOUR_TABLE_NAME]';
BEGIN
    FOR col IN (SELECT * FROM user_tab_columns
                WHERE table_name = v_table_name
                  AND data_type = 'NUMBER' AND data_precision = 37)
    LOOP
        EXECUTE IMMEDIATE 'ALTER TABLE '||v_table_name||' MODIFY '||col.column_name||' NUMBER(8)';
    END LOOP;
END;
If some of these columns are not empty, then you can't decrease the precision for them.
If you don't have too much data, you can move it to a temp table:
create table temp_table as select * from [YOUR_TABLE_NAME];
then truncate the original table:
truncate table [YOUR_TABLE_NAME];
then run the script above, and then move the data back:
insert /*+ append */ into [YOUR_TABLE_NAME] select * from temp_table;
commit;
If the amount of data is substantial, it is better to move it only once. In that case it is faster to create a new table with the correct data types and all the indexes, constraints and so on, then move the data, and finally rename both tables so the new table ends up with the proper name, as sketched below.
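A hedged sketch of that approach follows; the table and column names are placeholders, and indexes, constraints, grants and triggers would still need to be recreated on the new table:
-- Sketch only: names are placeholders.
CREATE TABLE orders_new AS
SELECT CAST(order_id    AS NUMBER(8)) AS order_id,
       CAST(customer_id AS NUMBER(8)) AS customer_id,
       order_date
FROM   orders
WHERE  1 = 0;                       -- copy the structure only, with the new precision

INSERT /*+ append */ INTO orders_new
SELECT order_id, customer_id, order_date FROM orders;
COMMIT;

ALTER TABLE orders     RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;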
Unfortunately, the mapping of numeric types between .NET and Oracle is hardcoded in the OracleDataReader class.
In general I prefer to set up appropriate data types in the database, so if possible I would change the column data types, because they better represent the actual values and their constraints.
Another option is to wrap the tables with views that cast to NUMBER(8), but this will negatively impact execution plans because it prohibits index lookups.
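For example, a minimal sketch of such a wrapper view (table, column and view names are placeholders):
-- Sketch only: the wide NUMBER columns are cast down so the provider can map them to Int32.
CREATE OR REPLACE VIEW orders_v AS
SELECT CAST(order_id    AS NUMBER(8)) AS order_id,
       CAST(customer_id AS NUMBER(8)) AS customer_id,
       order_date
FROM   orders;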
Then there are also some application-level implementation options:
Implement your own data reader or a subset of the ADO.NET classes (inheriting from DbProviderFactory, DbConnection, DbCommand, DbDataReader, etc. and wrapping the Oracle classes), depending on how complex your implementation is. Oracle.DataAccess, Devart and all providers do exactly the same, because it gives total control over everything, including any magic with the data types. If the data type conversion is the only thing you want to achieve, most of the implementation would just be calls to the wrapped class's methods/properties.
If you have access to the OracleDataReader after the command is executed and before you start reading it, you can do a simple hack and set the resulting numeric type using reflection (the following implementation is just a simplified demonstration).
However, this will not work with ExecuteScalar, as that method never exposes the underlying data reader.
var connection = new OracleConnection("DATA SOURCE=HQ_PDB_TCP;PASSWORD=oracle;USER ID=HUSQVIK");
connection.Open();
var command = connection.CreateCommand();
command.CommandText = "SELECT 1 FROM DUAL";
var reader = command.ExecuteDatabaseReader();
reader.Read();
Console.WriteLine(reader[0].GetType().FullName);
Console.WriteLine(reader.GetFieldType(0).FullName);

public static class DataReaderExtensions
{
    // Private ODP.NET fields accessed via reflection.
    private static readonly FieldInfo NumericAccessorField = typeof(OracleDataReader).GetField("m_dotNetNumericAccessor", BindingFlags.NonPublic | BindingFlags.Instance);
    private static readonly object Int32DotNetNumericAccessor = Enum.Parse(typeof(OracleDataReader).Assembly.GetType("Oracle.DataAccess.Client.DotNetNumericAccessor"), "GetInt32");
    private static readonly FieldInfo MetadataField = typeof(OracleDataReader).GetField("m_metaData", BindingFlags.NonPublic | BindingFlags.Instance);
    private static readonly FieldInfo FieldTypesField = typeof(OracleDataReader).Assembly.GetType("Oracle.DataAccess.Client.MetaData").GetField("m_fieldTypes", BindingFlags.NonPublic | BindingFlags.Instance);

    public static OracleDataReader ExecuteDatabaseReader(this OracleCommand command)
    {
        var reader = command.ExecuteReader();

        // Force the numeric accessor of column 0 to GetInt32 so its values are returned as Int32 instead of Decimal.
        var columnNumericAccessors = (IList)NumericAccessorField.GetValue(reader);
        columnNumericAccessors[0] = Int32DotNetNumericAccessor;

        // Patch the reader metadata so GetFieldType(0) also reports Int32.
        var metadata = MetadataField.GetValue(reader);
        var fieldTypes = (Type[])FieldTypesField.GetValue(metadata);
        fieldTypes[0] = typeof(Int32);

        return reader;
    }
}
I implemented an extension method for command execution that returns the reader with the desired column numeric types set up. Without setting the numeric accessor (it's just the internal enum Oracle.DataAccess.Client.DotNetNumericAccessor) you will get System.Decimal; with the accessor set you get Int32. Using this approach you can get Int16, Int32, Int64, Float or Double.
The columnNumericAccessors index is a column index and it is applied only to numeric types; if the column is DATE or VARCHAR, the numeric accessor is simply ignored. If your implementation doesn't expose the provider-specific type, make the extension method work on IDbCommand or DbCommand and then safe-cast the DbDataReader to OracleDataReader.
EDIT: Added the hack for the GetFieldType method. It might happen that a static mapping hashtable gets updated, so this could have unwanted effects; you need to test it properly. The fieldTypes array holds the types returned for all columns of the data reader.

Insert image into SQL Server 2008 Express database without front end application

I am working with test data in Visual Studio's server explorer. How would I put images into the database just as test images? I have no front end component built to take care of image uploading.
This will work for SQL Server 2008 R2, but first you have to create a FILESTREAM-enabled database.
-- create the database
CREATE DATABASE Archive
ON
PRIMARY ( NAME = Arch1,FILENAME = 'c:\data\archdat1.mdf'),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM( NAME = Arch3,FILENAME = 'c:\data\filestream1')
LOG ON ( NAME = Archlog1,FILENAME = 'c:\data\archlog1.ldf')
GO
-- table creation
Use Archive
GO
CREATE TABLE [FileStreamDataStorage]
(
[ID] [INT] IDENTITY(1,1) NOT NULL,
[FileStreamData] VARBINARY(MAX) FILESTREAM NULL,
[FileStreamDataGUID] UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWSEQUENTIALID(),
[DateTime] DATETIME DEFAULT GETDATE()
)
ON [PRIMARY]
FILESTREAM_ON FileStreamGroup1
GO
-- insert a value
Use Archive
GO
INSERT INTO [FileStreamDataStorage] (FileStreamData)
SELECT * FROM
OPENROWSET(BULK N'C:\Users\Public\Pictures\Sample Pictures\image1.jpg' ,SINGLE_BLOB) AS Document
GO
You can upload an image into a database (and retrieve it) using byte[] as the data type, assuming that the corresponding column in your database is a BLOB-style column (VARBINARY(MAX) in SQL Server).
So if you load your image with byte[] img = File.ReadAllBytes(your_file), you can then run a parameterized query such as INSERT INTO your_table (image_col) VALUES (@par), where @par is a parameter whose value is img.
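If you only need a few test images and would rather skip FILESTREAM, a minimal T-SQL-only sketch is to load each file server-side with OPENROWSET into an ordinary VARBINARY(MAX) column; the table name, column names, and file path below are placeholders:
-- Sketch only: table name, column names, and file path are placeholders.
CREATE TABLE TestImages
(
    Id        INT IDENTITY(1,1) PRIMARY KEY,
    ImageData VARBINARY(MAX) NOT NULL
);

-- Load a file from disk server-side; the path must be readable by the SQL Server service account.
INSERT INTO TestImages (ImageData)
SELECT BulkColumn
FROM OPENROWSET(BULK N'C:\Users\Public\Pictures\Sample Pictures\image1.jpg', SINGLE_BLOB) AS img;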
