Loading JSON file with serde in Cloudera - hadoop

I am trying to work with a JSON file with this bag structure:
{
  "user_id": "kim95",
  "type": "Book",
  "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
  "year": "1995",
  "publisher": "ACM Press and Addison-Wesley",
  "authors": [
    {
      "name": "null"
    }
  ],
  "source": "DBLP"
}
{
  "user_id": "marshallo79",
  "type": "Book",
  "title": "Inequalities: Theory of Majorization and Its Application.",
  "year": "1979",
  "publisher": "Academic Press",
  "authors": [
    {
      "name": "Albert W. Marshall"
    },
    {
      "name": "Ingram Olkin"
    }
  ],
  "source": "DBLP"
}
I tried to use a SerDe to load the JSON data into Hive. I followed both approaches that I saw here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
With this code:
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
user_id:string,
type:string,
title:string,
year:string,
publisher:string,
authors:array<struct<name:string>>,
source:string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';
I got this error:
error while compiling statement: failed: parseexception line 2:17 cannot recognize input near ':' 'string' ',' in column type
I also tried this version: https://github.com/rcongiu/Hive-JSON-Serde
which gave a different error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde
Any idea?
I also want to know what the alternatives are for working with JSON like this so that I can query the 'name' field inside 'authors'. Should it be Pig or Hive?
I have already converted the data into a "tsv" file. But since my authors column is a tuple, I don't know how to query 'name' with Hive if I build a table from this file. Should I change my script for the "tsv" conversion or keep it? Or are there alternatives with Hive or Pig?

Hive does not have built-in support for JSON, so to use JSON with Hive we need a third-party SerDe JAR such as:
https://github.com/rcongiu/Hive-JSON-Serde
You have a couple of issues with the CREATE TABLE statement: column names and their types must be separated by a space, not a colon. It should look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
  user_id string,
  type string,
  title string,
  year string,
  publisher string,
  authors array<struct<name:string>>,
  source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION...
Also, the JSON records need to keep each record on a single line, like this:
{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"}
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press","authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
After downloading the project from Git you need to compile the project, which will create a JAR. You need to add this JAR to the Hive session before running the CREATE TABLE statement.
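For example, a session might look like the following. This is only a minimal sketch: the JAR path is hypothetical (the file name matches the one used in this CDH distribution), and LATERAL VIEW explode is shown as one way to query the name field inside authors once the table exists:
-- make the SerDe available to the current Hive session (path is hypothetical)
ADD JAR /path/to/json-serde-1.3.6-jar-with-dependencies.jar;

-- one row per author: explode the authors array of structs, then read the name field
SELECT b.title, a.name
FROM serd b
LATERAL VIEW explode(b.authors) author_view AS a;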
Hope it helps...!!!

ADD JAR only adds the JAR to the current session, so it won't be available afterwards and you eventually get the error again.
Get the JAR loaded on all the nodes in the Hive and MapReduce library paths, like the locations below, so that Hive and the MapReduce components pick it up whenever they are called:
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar
Note: this path varies from cluster to cluster.

Related

Oracle Table-valued Functions returns erroneous decimals in Data Factory

I am working on a cloud datawarehouse using Azure Data Factory v2.
Quite a few of my data sources are on-prem Oracle 12g databases.
Extracting tables 1-1 is not a problem.
However, from time to time I need to extract data generated by parametrized computations on the fly in my Copy Activities.
Since I cannot use PL/SQL stored procedures as sources in ADF, I instead use table-valued functions in the source database and query them in the copy activity.
In the majority of cases this works fine. However, when my table-valued function returns a decimal-type column, ADF sometimes returns erroneous values. That is: executing the TVF on the source db and previewing/copying through ADF yield different results.
I have run some experiments to see whether the absolute value or the sign of the decimal number matters, but I cannot find any pattern in which decimals are returned correctly and which are not.
Here are a few examples of the erroneously mapped numbers:
Value in Oracle db    Value in ADF
-658388.5681          188344991.6319
-205668.1648          58835420.6352
10255676.84           188213627.97348
Has any of you experienced similar problems?
Do you know if this is a bug in ADF (which is not integrating well to PL/SQL in the first place)?
First hypothesis
At first I thought the issue was related to NLS, casting or something similar.
I tested this hypothesis by creating a table on the Oracle db side, persisting the output from the TVF there, and then extracting from that table in ADF.
Using this method, the decimals were returned correctly in ADF. Thus the hypothesis does not hold.
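A minimal sketch of that check, using the names from the reproducible example below (the staging table name is made up):
-- persist the TVF output in a plain table on the Oracle side (hypothetical table name)
CREATE TABLE tvf_decimal_check AS
  SELECT dec_col FROM TABLE(test_fct(SYSDATE));

-- then point the ADF dataset / copy activity at tvf_decimal_check instead of the TVF
SELECT dec_col FROM tvf_decimal_check;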
Second hypothesis
It might have to do with user access rights.
However, the linked service used in ADF uses the same db credentials as the ones used to log in to the database and execute the TVF there.
Observation
The error seems to happen more often when a lot of aggregate functions are involved in the TVF's logic.
Minimum reproducible example
Oracle db:
CREATE OR REPLACE TYPE test_col AS OBJECT
(
  dec_col NUMBER(20,5)
)
/
CREATE OR REPLACE TYPE test_tbl AS TABLE OF test_col;

create or replace function test_fct(param date) return test_tbl
AS
  ret_tbl test_tbl;
begin
  select
    test_col(
      <"some complex logic which return a decimal">
    )
  bulk collect into ret_tbl
  from <"some complex joins and group by's">;
  return ret_tbl;
end test_fct;

select dec_col from table(test_fct(sysdate));
ADF:
Dataset:
{
  "name": "test_dataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "some_name",
      "type": "LinkedServiceReference"
    },
    "folder": {
      "name": "some_name"
    },
    "annotations": [],
    "type": "OracleTable",
    "structure": [
      {
        "name": "dec_col",
        "type": "Decimal"
      }
    ]
  }
}
Pipeline:
{
  "name": "pipeline1",
  "properties": {
    "activities": [
      {
        "name": "Copy data1",
        "type": "Copy",
        "dependsOn": [],
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "userProperties": [],
        "typeProperties": {
          "source": {
            "type": "OracleSource",
            "oracleReaderQuery": "select * from table(test_fct(sysdate))",
            "partitionOption": "None",
            "queryTimeout": "02:00:00"
          },
          "enableStaging": false
        },
        "inputs": [
          {
            "referenceName": "test_dataset",
            "type": "DatasetReference"
          }
        ]
      }
    ],
    "annotations": []
  }
}

Pipeline on Oracle cdap to BigQuery Multitables

I am building a pipeline in CDAP where I have an Oracle database that I connect to and read a table from, and then I connect this data to the BigQuery Multitables component.
Both components were validated individually by the CDAP tool itself, but when I tested the execution of the complete pipeline I received the error:
ERROR Spark program 'phase1' failed with error: BQ_TEST has no outputs.Please check that the sink calls addOutput at some point.
To use the bigquery multi sink, you will need to set some runtime arguments to tell the sink which table to write to. The key of the arguments will be like multisink.{dataset-name}.{table-name}, and the value of the arguments will be the json string representation of the table schema.
It sounds like the source might not have any records.
Adding to the response of @Yaojie Feng: the sink needs the schema in Avro format; the Multiple Database Tables source plugin, however, would produce the schema required by the BigQuery Multi Table plugin. Example below.
Sample pipeline runtime arguments with schema in Avro format:
Key: multisink.NEW_TABLE_NAME
Value:
{
  "name": "NEW_TABLE_NAME",
  "type": "record",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": "string" }
  ]
}

Kafka Connect JDBC sink - write Avro field into PG JSONB

I'm trying to build a pipeline where Avro data is written into a Postgres DB. Everything works fine with simple schemas and the AvroConverter for the values. However, I would like to have a nested field written into a JSONB column. There are a couple of problems with this. First, it seems that the Connect plugin does not support STRUCT data. Second, the plugin cannot write directly into the JSONB column.
The second problem can be avoided by adding a cast in PG, as described in this issue (a sketch of such a cast is at the end of this post). The first problem is proving more difficult. I have tried different transformations but have not been able to get the Connect plugin to interpret one complex field as a string. The schema in question looks something like this (in practice there would be more fields on the first level besides the timestamp):
{
  "namespace": "test.schema",
  "name": "nested_message",
  "type": "record",
  "fields": [
    {
      "name": "timestamp",
      "type": "long"
    },
    {
      "name": "nested_field",
      "type": {
        "name": "nested_field_record",
        "type": "record",
        "fields": [
          {
            "name": "name",
            "type": "string"
          },
          {
            "name": "prop",
            "type": "float",
            "doc": "Some property"
          }
        ]
      }
    }
  ]
}
The message is written in Kafka as
{"timestamp":1599493668741396400,"nested_field":{"name":"myname","prop":377.93887}}
In order to write the contents of nested_field into a single DB column, I would like to interpret this entire field as a string. Is this possible? I have tried the Cast transformation, but this only supports primitive Avro types. Something along the lines of HoistField could work, but I don't see a way to limit this to a single field. Any ideas or advice would be greatly appreciated.
A completely different approach would be to use two connect plugins and UPSERT into the table. One plugin would use the AvroConverter for all fields save the nested one, while the second plugin uses the StringConverter for the nested field. This feels wrong in all kinds of ways though.
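For reference, the PG-side cast mentioned above is usually set up with something like the following. This is only a sketch of the commonly cited workaround, not a quote from the linked issue, and it assumes the sink sends the nested value as character varying:
-- Postgres: let varchar values be assigned to jsonb columns implicitly,
-- using the two types' I/O functions for the conversion
CREATE CAST (character varying AS jsonb) WITH INOUT AS IMPLICIT;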

How do I use FreeFormTextRecordSetWriter

In my NiFi controller I want to configure the FreeFormTextRecordSetWriter, but I have no idea what I should put in the "Text" field. I'm getting the text from my source (in my case GetSolr), and just want to write this out, period.
Documentation and mailing list do not seem to tell me how this is done; any help appreciated.
EDIT: Here is the sample input + output I want to achieve (as you can see: no transformation needed, plain text, no JSON input)
EDIT: I now realize that I can't tell GetSolr to return just CSV data - I have to use JSON.
So referencing with attributes seems to be fine. What the documentation omits is that the ${flowFile} attribute should contain the complete flowfile that is returned.
Sample input:
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "*:*",
      "_": "1553686715465"
    }
  },
  "response": {
    "numFound": 3194,
    "start": 0,
    "docs": [
      {
        "id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}",
        "MimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "FileName": "Test.docx",
        "DateLastModified": "2019-03-27T08:05:00.103Z",
        "_version_": 1629145864291221504,
        "LAST_UPDATE": "2019-03-27T08:16:08.451Z"
      }
    ]
  }
}
Wanted output
{402EBE69-0000-CD1D-8FFF-D07756271B4E}
BTW: The documentation says this:
The text to use when writing the results. This property will evaluate the Expression Language using any of the fields available in a Record.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
I want to use my source's text, so I'm confused
You need to use expression language as if the record's fields are the FlowFile's attributes.
Example:
Input:
{
  "t1": "test",
  "t2": "ttt",
  "hello": true,
  "testN": 1
}
Text property in FreeFormTextRecordSetWriter:
${t1} k!${t2} ${hello}:boolean
${testN}Num
Output(using ConvertRecord):
test k!ttt true:boolean
1Num
EDIT:
Seems like what you needed was to read from Solr and write a single-column CSV. You need to use CSVRecordSetWriter. As for the schema:
I would suggest considering an upgrade to 1.9.1; starting from 1.9.0, the schema can be inferred for you.
Otherwise, you can set Schema Access Strategy to Use 'Schema Text' Property
and then use the following schema in Schema Text:
{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "id",
      "type": "int"
    }
  ]
}
this should work
I'll edit it into my answer. If it works for you, please choose my answer :)

Count and flatten in pig

Hi, I have data like this:
{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"}
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press", "authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
{"user_id": "knuth86a", "type": "Book", "title": "TeX: The Program", "year": "1986", "publisher": "Addison-Wesley", "authors": [{"name":"Donald E. Knuth"}], "source": "DBLP"}
...
And I would like to get the publisher and title and then apply a count on the group, but I got the error 'a column needs to be...' with this script:
books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
doc = group books by publisher;
res = foreach doc generate group,books.title,count(books.publisher);
DUMP res;
For a second query I would like to have a structure like this: (name,year),title
So I tried this one:
books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
flat =group books by (generate FLATTEN((authors.name),year);
tab = foreach flat generate group, books.title;
DUMP tab;
But it also doesn't work...
Any idea please?
What is the error you are getting on trying out the first query?
COUNT, being a built-in function, has to be in all caps. Also, you cannot invoke COUNT(group); group is an internal identifier generated by Pig.
I get the following result on running your first query:
(Academic Press,{(Inequalities: Theory of Majorization and Its Application.)},1)
(Addison-Wesley,{(TeX: The Program)},1)
(ACM Press and Addison-Wesley,{(Modern Database Systems: The Object Model, Interoperability, and Beyond.)},1)
The expected format of (name,year), title can also be achieved this way:
flat = foreach books generate FLATTEN(authors.name) as authorName, year, title; -- one row per author, keeping year and title
tab = group flat by (authorName, year);                                         -- group on the (name, year) pair
finaltab = foreach tab generate group, flat.title;                              -- (name, year) plus the bag of titles
The only problem I could see in your first script is count instead of COUNT (it must be in caps).
If you use count without caps, you will get an error like:
Could not resolve count using imports:
