How should a protobuf message with repeated fields be converted to Parquet so it can be queried by Athena?

We write Parquet files to S3 and then use Athena to query that data. We use the parquet-protobuf library to convert proto messages into Parquet records. We recently added a repeated field to our proto message definition and expected to be able to fetch/query it using Athena.
message LedgerRecord {
....
repeated uint64 shared_account_ids = 168; //newly added field
}
Resulting Parquet schema, as seen by the parquet-tools inspect command:
############ Column(shared_account_ids) ############
name: shared_account_ids
path: shared_account_ids
max_definition_level: 1
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
Athena table alteration:
alter TABLE order_transactions_with_projections add columns (shared_account_ids array<bigint>);
Simple query that fails:
With T1 as (
  SELECT DISTINCT record_id, msg_type, rec, time_sent, shared_account_ids
  FROM ledger_int_dev.order_transactions_with_projections
  WHERE environment='int-dev-cert'
    AND (epoch_day>=19082 AND epoch_day<=19172)
    AND company='12'
)
select * from T1 ORDER BY time_sent DESC LIMIT 100
Error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<file_name>.snappy.parquet (offset=0, length=48634): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
How do I convert a protobuf message with a repeated field into Parquet so that Athena understands the field as an ARRAY instead of a primitive type? Is an intermediate conversion to Avro necessary, as mentioned in https://github.com/rdblue/parquet-avro-protobuf/blob/master/README.md, or can the parquet-mr library be used directly?
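One commonly cited cause of this cast error is that parquet-protobuf writes repeated fields as bare repeated primitives by default (for backward compatibility), which Presto/Athena cannot map to an ARRAY; asking the writer to emit the standard three-level LIST structure avoids it. A minimal sketch, assuming a recent parquet-protobuf where ProtoWriteSupport.setWriteSpecsCompliant and the ProtoParquetWriter builder are available (roughly 1.11+ and 1.12+ respectively); the output path and sample values are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;
import org.apache.parquet.proto.ProtoWriteSupport;

public class SpecsCompliantLedgerWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Write repeated fields as the standard three-level LIST structure
        // (shared_account_ids.list.element) instead of a bare repeated INT64,
        // so Presto/Athena can read the column as array<bigint>.
        // Equivalent configuration key: parquet.proto.writeSpecsCompliant = true
        ProtoWriteSupport.setWriteSpecsCompliant(conf, true);

        // LedgerRecord is the protobuf-generated class from the message above.
        try (ParquetWriter<LedgerRecord> writer =
                ProtoParquetWriter.<LedgerRecord>builder(new Path("s3a://my-bucket/ledger.snappy.parquet"))
                        .withMessage(LedgerRecord.class)
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        .withConf(conf)
                        .build()) {
            writer.write(LedgerRecord.newBuilder()
                    .addSharedAccountIds(42L)
                    .addSharedAccountIds(43L)
                    .build());
        }
    }
}

If that is the issue, the intermediate Avro conversion from the linked README should not be necessary, and the Athena column can be declared as array<bigint>.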

Related

Hive get_json_object(): How to check if JSON field exists?

I am using Hive and the get_json_object() function to query data stored as JSON. The JSON has a coordinate key and two fields (latitude and longitude) that look like the following:
"coordinate":{
"center":{
"lat":36.123413127558536,
"lng":-115.17381648045654
},
"precision":10
}
I am running my Hive query to retrieve data within some geocoordinate box, like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user.name/sample/sample1.txt'
SELECT * FROM mytable
WHERE
get_json_object(mytable.`value`, '$.coordinate.center.lat') > 36.115767
AND get_json_object(mytable.`value`, '$.coordinate.center.lng') > -115.314051
AND get_json_object(mytable.`value`, '$.coordinate.center.lat') < 36.285595
AND get_json_object(mytable.`value`, '$.coordinate.center.lng') < -115.085399
DISTRIBUTE BY rand()
SORT by rand()
LIMIT 10000;
However, the problem is that for some rows, the coordinate field is missing, or the center field is missing, or the lat and/or lng field is missing. How can I modify my Hive SELECT query to only get rows that have a complete valid coordinate entry with existing lat and lng?
I would make a separate VIEW over the table where you do
WHERE get_json_object(...) IS NOT NULL
for each field you're interested in, then run the given query over that view.
Alternatively, fix your input source to generate consistent data, for example by using Avro instead.
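A minimal sketch of that view, using the table name and JSON paths from the question (the view name is made up):

-- Hypothetical view: keep only rows whose JSON has both lat and lng present.
CREATE VIEW mytable_with_coords AS
SELECT *
FROM mytable
WHERE get_json_object(mytable.`value`, '$.coordinate.center.lat') IS NOT NULL
  AND get_json_object(mytable.`value`, '$.coordinate.center.lng') IS NOT NULL;

The bounding-box query can then select from mytable_with_coords instead of mytable, so the range comparisons only ever see rows with both coordinates populated.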

ElasticSearch / Kibana / Logstash: how to split a string in data table?

I have a data table in Kibana with data (ingested using Logstash).
The log format is something like this:
"myvals": ["k1:v1", "k2:v3"],
The data in the data table looks like this:
k1:v1 10
k1:v2 8
k2:v3 5
k2:v4 1
Is it possible to split on ":" to create two columns and aggregate by k1, k2, etc.? Or what format should I provide in the log to do that?
Thanks

Sampling of records inside GROUP BY throwing error

Sample data (tsv file: sampl):
1 a
2 b
3 c
raw= load 'sampl' using PigStorage() as (f1:chararray,f2:chararray);
grouped = group raw by f1;
describe grouped;
fields = foreach grouped {
x = sample raw 1;
generate x;
}
When I run this, I get an error at the line x = sample raw 1;
ERROR 1200: mismatched input 'raw' expecting LEFT_PAREN
Is sampling not allowed for a grouped record?
You can't use the SAMPLE command inside a nested block; this is not supported in Pig.
Only a few operations (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY) are allowed in a nested block. You have to use the SAMPLE command outside of the nested block, as sketched below.
The other problem is that you are loading your input data with the default delimiter, i.e. tab, but your input data is delimited with spaces, so you need to change the LOAD statement like this:
raw= load 'sampl' using PigStorage(' ') as (f1:chararray,f2:chararray);
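A hypothetical rework along those lines (the 0.5 fraction is arbitrary): SAMPLE is applied to the relation before grouping, since it cannot appear inside the nested FOREACH block.

raw = load 'sampl' using PigStorage(' ') as (f1:chararray, f2:chararray);
sampled = sample raw 0.5; -- keeps roughly half the records; use a fraction between 0 and 1
grouped = group sampled by f1;
dump grouped;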

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main data source and the rest are dimension tables. I am trying to filter the main data source relation (or alias) with respect to some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 instances where I need to match some value between multiple data sources and produce a new relation, but I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, with no luck. Alternatively, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and unnecessarily dump huge amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and get it done from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
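For the single-key case from the question, a hypothetical join-based rewrite of filtered_data1/filtered_data2 (a sketch, not the answerer's exact code):

-- The cross-relation condition becomes a JOIN; in the joined schema the
-- fields of main_data come first, so the original positional projection still works.
joined = JOIN main_data BY deptID, dept_data BY departmentID;
filtered_data2 = FOREACH joined GENERATE $0, $1, $3, $7;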

Pig: What is the correct syntax to flatten a nested bag (2-levels deep)

I'm loading this data:
data6 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
Using this command:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
I'm attempting to flatten the whole thing with this command:
A_flattened = FOREACH A GENERATE item, d, things::thing AS thing; things::d1 AS d1, FLATTEN(things::values) AS value;
But I just get this error:
Invalid field projection. Projected field [things::thing] does not exist in schema: item:chararray,d:int,things:bag{:tuple(thing:chararray,d1:int,values:bag{:tuple(v:chararray)})}
I tried naming the inner things tuple, but I get a similar error. Can someone help me with the right syntax here?
You need to use things.thing, things.d1, things.values, since you are projecting fields out of a bag. (The # operator is used for projection on a map.)
Here is an introduction to bag projection (search for "Bag projection" on this page): http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html
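For the flattening itself, one hedged rewrite of the failing statement (a sketch, not taken from the answer): FLATTEN unrolls only one level of nesting per FOREACH, so the inner values bag needs a second pass, and positional references sidestep the generated field names.

B = FOREACH A GENERATE item, d, FLATTEN(things); -- gives item, d, thing, d1, values (values is still a bag)
A_flattened = FOREACH B GENERATE $0 AS item, $1 AS d, $2 AS thing, $3 AS d1, FLATTEN($4) AS value;
DUMP A_flattened;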
:: is used to avoid a name conflict when you join inputs that have the same field names. JOIN preserves the names of the fields of the inputs passed to it, and it prepends the name of the relation each field came from, followed by ::. For example,
-- join2key.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
The description of jnd is:
jnd: {daily::exchange: bytearray,daily::symbol: bytearray,daily::date: bytearray,
daily::open: bytearray,daily::high: bytearray,daily::low: bytearray,
daily::close: bytearray,daily::volume: bytearray,daily::adj_close: bytearray,
divs::exchange: bytearray,divs::symbol: bytearray,divs::date: bytearray,
divs::dividends: bytearray}
The daily:: prefix does not need to be used unless the field name is no longer unique in the record. In this example, you will need to use daily::date or divs::date if you wish to refer to one of the date fields after the join, but for fields such as open or dividends you do not, because there is no ambiguity.
