Issue Hive AvroSerDe tblProperties max length - hadoop

I am trying to create a table with AvroSerDe.
I have already tried the following command to create the table:
CREATE EXTERNAL TABLE gaSession
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs://<<url>>:<<port>>/<<path>>/<<file>>.avsc');
The creation seems to work, but the following table is generated:
hive> show create table gaSession;
OK
CREATE EXTERNAL TABLE `gaSession`(
`error_error_error_error_error_error_error` string COMMENT 'from deserializer',
`cannot_determine_schema` string COMMENT 'from deserializer',
`check` string COMMENT 'from deserializer',
`schema` string COMMENT 'from deserializer',
`url` string COMMENT 'from deserializer',
`and` string COMMENT 'from deserializer',
`literal` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
...
After that, I copied the definition and replaced 'avro.schema.url' with 'avro.schema.literal', but the table still doesn't work.
But when I delete some (random) fields, it works (e.g. with the following definition).
CREATE TABLE gaSession
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{"type": "record",
"name": "root",
"fields": [
{
"name": "visitorId",
"type": [
"long",
"null"
]
},
{
"name": "visitNumber",
"type": [
"long",
"null"
]
},
{
"name": "visitId",
"type": [
"long",
"null"
]
},
{
"name": "visitStartTime",
"type": [
"long",
"null"
]
},
{
"name": "date",
"type": [
"string",
"null"
]
},
{
"name": "totals",
"type": [
{
"type": "record",
"name": "totals",
"fields": [
{
"name": "visits",
"type": [
"long",
"null"
]
},
{
"name": "hits",
"type": [
"long",
"null"
]
},
{
"name": "pageviews",
"type": [
"long",
"null"
]
},
{
"name": "timeOnSite",
"type": [
"long",
"null"
]
},
{
"name": "bounces",
"type": [
"long",
"null"
]
},
{
"name": "transactions",
"type": [
"long",
"null"
]
},
{
"name": "transactionRevenue",
"type": [
"long",
"null"
]
},
{
"name": "newVisits",
"type": [
"long",
"null"
]
},
{
"name": "screenviews",
"type": [
"long",
"null"
]
},
{
"name": "uniqueScreenviews",
"type": [
"long",
"null"
]
},
{
"name": "timeOnScreen",
"type": [
"long",
"null"
]
},
{
"name": "totalTransactionRevenue",
"type": [
"long",
"null"
]
}
]
},
"null"
]
}
]
}');
Does TBLPROPERTIES/avro.schema.literal have a maximum length or other limitations?
Hive version: 0.14.0

The Hortonworks support team confirmed that there is a 4000-character limit for TBLPROPERTIES values.
So by removing whitespace you can fit a larger schema definition inline. Otherwise, you have to work with 'avro.schema.url'.
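Note that the column names Hive generated in the first attempt literally spell out the SerDe's error message ("cannot determine schema, check schema url and literal"), which suggests the schema at the given HDFS URL could not be read when the table was created. A hedged sketch of the url-based variant (host, port, paths and location are placeholders, not values from the original post):
CREATE EXTERNAL TABLE gaSession
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/gaSession'
TBLPROPERTIES ('avro.schema.url'='hdfs://namenode:8020/schemas/gaSession.avsc');
If the .avsc file is readable at that path, Hive derives the columns from it instead of the error placeholders, and the 4000-character limit no longer matters because only the URL is stored in TBLPROPERTIES.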

Related

Is there a better/faster way to insert data to a database from an external api in Laravel?

I am currently getting data from an external API for use in my Laravel API. I have everything working, but I feel like it is slow.
I'm getting the data from the API with Http::get('url'), and that part is fast. It is only when I start looping through the data and making edits that things slow down.
I don't need all the data, but it would still be nice to edit it before entering it into the database if possible, as things aren't very consistent. I also have a few columns that use the data and some logic to build new columns, so that each app/site doesn't need to do it.
I am saving to the database on each foreach iteration with the Eloquent Model::updateOrCreate() method, which works, but these JSON files can easily be 6000 lines long or more, so it obviously takes time to loop through each set, modify values, and then save to the database each time. There usually aren't more than 200 or so entries, but it still takes time. I will probably eventually switch to the newer upsert() method to make fewer queries to the database. Running on my localhost it currently takes about a minute and a half, which just seems way too long.
Here is a shortened version of how I was looping through the data.
$json = json_decode($contents, true);
$features = $json['features'];
foreach ($features as $feature) {
    // Get ID
    $id = $feature['id'];
    // Get primary condition data
    $geometry = $feature['geometry'];
    $properties = $feature['properties'];
    // Get secondary geometry data
    $geometryType = $geometry['type'];
    $coordinates = $geometry['coordinates'];
    Model::updateOrCreate(
        [
            'id' => $id,
        ],
        [
            'coordinates' => $coordinates,
            'geometry_type' => $geometryType,
        ]);
}
Most of what I'm doing to the data behind the scenes before it goes into the database is cleaning up some text strings, but there is a bit of logic to normalize or prep the data for websites and apps.
Is there a more efficient way to get the same result? This will ultimately be used in a scheduler and run on an interval.
Example Data structure from API documentation
{
"$schema": "http://json-schema.org/draft-04/schema#",
"additionalProperties": false,
"properties": {
"features": {
"items": {
"additionalProperties": false,
"properties": {
"attributes": {
"type": [
"object",
"null"
]
},
"geometry": {
"additionalProperties": false,
"properties": {
"coordinates": {
"items": {
"items": {
"type": "number"
},
"type": "array"
},
"type": "array"
},
"type": {
"type": "string"
}
},
"required": [
"coordinates",
"type"
],
"type": "object"
},
"properties": {
"additionalProperties": false,
"properties": {
"currentConditions": {
"items": {
"properties": {
"additionalData": {
"type": "string"
},
"conditionDescription": {
"type": "string"
},
"conditionId": {
"type": "integer"
},
"confirmationTime": {
"type": "integer"
},
"confirmationUserName": {
"type": "string"
},
"endTime": {
"type": "integer"
},
"id": {
"type": "integer"
},
"sourceType": {
"type": "string"
},
"startTime": {
"type": "integer"
},
"updateTime": {
"type": "integer"
}
},
"required": [
"id",
"userName",
"updateTime",
"startTime",
"conditionId",
"conditionDescription",
"confirmationUserName",
"confirmationTime",
"sourceType",
"endTime"
],
"type": "object"
},
"type": "array"
},
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"nameId": {
"type": "string"
},
"parentAreaId": {
"type": "integer"
},
"parentSubAreaId": {
"type": "integer"
},
"primaryLatitude": {
"type": "number"
},
"primaryLongitude": {
"type": "number"
},
"primaryMP": {
"type": "number"
},
"routeId": {
"type": "integer"
},
"routeName": {
"type": "string"
},
"routeSegmentIndex": {
"type": "integer"
},
"secondaryLatitude": {
"type": "number"
},
"secondaryLongitude": {
"type": "number"
},
"secondaryMP": {
"type": "number"
},
"sortOrder": {
"type": "integer"
}
},
"required": [
"id",
"name",
"nameId",
"routeId",
"routeName",
"primaryMP",
"secondaryMP",
"primaryLatitude",
"primaryLongitude",
"secondaryLatitude",
"secondaryLongitude",
"sortOrder",
"parentAreaId",
"parentSubAreaId",
"routeSegmentIndex",
"currentConditions"
],
"type": "object"
},
"type": {
"type": "string"
}
},
"required": [
"type",
"geometry",
"properties",
"attributes"
],
"type": "object"
},
"type": "array"
},
"type": {
"type": "string"
}
},
"required": [
"type",
"features"
],
"type": "object"
}
Second, related question.
Since this is being updated on an interval, I have it updating and creating records from the JSON data, but is there an efficient way to delete old records that are no longer in the JSON file? I currently get an array of current IDs, compare them to the new IDs, and then loop through each and delete them. There has to be a better way.
I have no idea what to say about your first question, but regarding the second one, you could try something like this:
SomeModel::query()->whereNotIn('id', $newIds)->delete();
You can collect $newIds during the first loop.
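For the first question, a minimal sketch of a batched approach, assuming Laravel 8+ where Eloquent's upsert() is available (the model and column names are taken from your snippet, everything else is an assumption): build the rows in memory first, then write them in chunks, so you issue a handful of queries instead of one per feature.
$json = json_decode($contents, true);

$rows = [];
foreach ($json['features'] as $feature) {
    $rows[] = [
        'id'            => $feature['id'],
        // upsert() goes through the query builder, so encode array values
        // yourself rather than relying on model casts
        'coordinates'   => json_encode($feature['geometry']['coordinates']),
        'geometry_type' => $feature['geometry']['type'],
        // any cleanup / derived columns go here, before the write
    ];
}

// One bulk query per chunk of 500 rows instead of one query per record
foreach (array_chunk($rows, 500) as $chunk) {
    Model::upsert($chunk, ['id'], ['coordinates', 'geometry_type']);
}
This keeps your per-record cleanup logic but moves the database work out of the inner loop.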

Nifi JoltTransformRecord UUID in default transform not working as expected

I have a NiFi workflow which uses JoltTransformRecord to do some record-based manipulation of the data. I have to add a default uuid value to each message in the flow file.
My JoltTransformRecord configuration is as below.
Jolt specification:
[{
"operation": "shift",
"spec": {
"payload": "data.payload"
}
}, {
"operation": "default",
"spec": {
"header": {
"source": "${source}",
"client_id": "${client_id}",
"uuid": "${UUID()}",
"payload_type":"${payload_type}"
}
}
}]
The shift operation and all the other default operations work as expected, but the UUID comes out the same for every message. I need a different UUID for each message. I don't want to add another processor just for this.
My workflow (screenshot omitted):
Reader and writer configurations for the JoltTransformRecord processor (screenshots omitted):
IngestionSchemaJsonTreeReader (from JsonTreeReader):
IngestionSchemaAvroRecordSetWriter (from AvroRecordSetWriter):
The configured schema registry has the schemas below defined in it.
com.xyz.ingestion.pre_json
{
"type": "record",
"name": "event",
"namespace": "com.xyz.ingestion.raw",
"doc": "Event ingested to kafka",
"fields": [
{
"name": "payload",
"type": [
"null",
"string"
],
"default": "null"
}
]
}
com.xyz.ingestion.raw -
{
"type": "record",
"name": "event",
"namespace": "com.xyz.ingestion.raw",
"doc": "Event ingested to kafka",
"fields": [
{
"type": {
"name": "header",
"type": "record",
"namespace": "com.xyz.ingestion.raw.header",
"doc": "Header data for event ingested",
"fields": [
{
"name": "payload_type",
"type": "string"
},
{
"name": "uuid",
"type": "string",
"size": "36"
},
{
"name": "client_id",
"type": "string"
},
{
"name": "source",
"type": "string"
}
]
},
"name": "header"
},
{
"type": {
"name": "data",
"type": "record",
"namespace": "com.xyz.ingestion.raw.data",
"doc": "Payload for event ingested",
"fields": [
{
"name": "payload",
"type": [
"null",
"string"
],
"default": "null"
}
]
},
"name": "data"
}
]
}
The expression language is evaluated per record, and UUID() is executed for each evaluation, so the uuid should be unique for each record. From the information you provided I cannot see why you are getting duplicate UUIDs.
I tried to reproduce your problem with the following flow (processor screenshots omitted):
GenerateFlowFile:
SplitJson: configured with $ as the JsonPath Expression, to split the JSON array into records.
JoltTransformRecord:
As you can see, the way I am adding the UUID is no different from how you do it, but I am getting different UUIDs, as expected.
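For reference, a minimal default spec of the kind used in the reproduction, essentially the same shape as the one in the question but reduced to the uuid field (the field name here is illustrative only):
[
  {
    "operation": "default",
    "spec": {
      "uuid": "${UUID()}"
    }
  }
]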

kafka connect JDBC sink. Error flattening JSON records

I'm using the Kafka Connect JDBC Sink Connector to store data from topics in a SQL Server table. The data needs to be flattened. I've created a SQL Server table and a JSON record based on the example provided by Confluent.
So my record is this one:
{
"payload":{
"id": 42,
"name": {
"first": "David"
}
},
"schema": {
"fields": [
{
"field": "id",
"optional": true,
"type": "int32"
},
{
"name": "name",
"optional": "false",
"type": "struct",
"fields": [
{
"field": "first",
"optional": true,
"type": "string"
}
]
}
],
"name": "Test",
"optional": false,
"type": "struct"
}
}
As you can see, I want to flatten the nested fields, joining the names with the delimiter "_". So my sink connector configuration is as follows:
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
table.name.format=MyTable
transforms.flatten.type=org.apache.kafka.connect.transforms.Flatten$Value
topics=myTopic
tasks.max=1
transforms=flatten
value.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.json.JsonConverter
connection.url=jdbc:sqlserver:[url]
transforms.flatten.delimiter=_
When I write that record to the topic, I get the following exception:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:464)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:320)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Struct schema's field name not specified properly
at org.apache.kafka.connect.json.JsonConverter.asConnectSchema(JsonConverter.java:512)
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:360)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
With records that don't require flattening, the sink connector works fine. Is there anything wrong with the configuration? Is it possible to flatten JSON records that carry a schema?
P.S. Kafka Connect version: 5.3.0-css
Any help would be greatly appreciated.
OK, the problem was the key used for the nested field's name. The correct key is "field", not "name":
{
"payload":{
"id": 42,
"name": {
"first": "David"
}
},
"schema": {
"fields": [
{
"field": "id",
"optional": true,
"type": "int32"
},
{
"field": "name",
"optional": "false",
"type": "struct",
"fields": [
{
"field": "first",
"optional": true,
"type": "string"
}
]
}
],
"name": "Test",
"optional": false,
"type": "struct"
}
}
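For reference, with delimiter "_" the Flatten transform turns the nested name.first into a single field name_first, so the SQL Server table should have columns along these lines (a sketch only; the table name comes from table.name.format above and the column types are assumptions):
CREATE TABLE MyTable (
  id INT,
  name_first VARCHAR(255)
);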

How do I parse a date field and generate a date in string format in NiFi

Each of my flow files contains 2000 records. I would like to parse a date like 01/01/2000 into a column year = 2000, a column month = Jan, and a column day = 01,
i.e. turn the input value 01/01/2000 into 3 comma-separated values: 01,Jan,2000.
Let's say you have a schema like this for a person with a birthday, and you want to split out the birthday:
{
"name": "person",
"namespace": "nifi",
"type": "record",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" },
{ "name": "birthday", "type": "string" }
]
}
You would need to modify the schema so it had the fields you want to add:
{
"name": "person",
"namespace": "nifi",
"type": "record",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" },
{ "name": "birthday", "type": "string" },
{ "name": "birthday_year", "type": ["null", "string"] },
{ "name": "birthday_month", "type": ["null", "string"] },
{ "name": "birthday_day", "type": ["null", "string"] }
]
}
Let's say the input record has the following text:
bryan,bende,1980-01-01
We can use UpdateRecord with a CsvReader and CsvWriter, and UpdateRecord can populate the three fields we want by parsing the original birthday field.
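The UpdateRecord user-defined properties might look something like this, using the toDate and format record-path functions with Replacement Value Strategy set to Record Path Value (a sketch; the date pattern is assumed from the sample row):
/birthday_year   format( toDate(/birthday, "yyyy-MM-dd"), "yyyy" )
/birthday_month  format( toDate(/birthday, "yyyy-MM-dd"), "MM" )
/birthday_day    format( toDate(/birthday, "yyyy-MM-dd"), "dd" )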
If we send the output to LogAttribute we should see the following now:
first_name,last_name,birthday,birthday_year,birthday_month,birthday_day
bryan,bende,1980-01-01,1980,01,01
Here is the link to the record path guide for details on the toDate and format functions:
https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
You can use UpdateRecord for this as well. Assuming your input record has a date column called "myDate", you'd set the Replacement Value Strategy to Record Path Value, and your user-defined properties might look something like:
/day format(/myDate, "dd")
/month format(/myDate, "MMM")
/year format(/myDate, "yyyy")
Your output schema would look like this:
{
"namespace": "nifi",
"name": "myRecord",
"type": "record",
"fields": [
{"name": "day","type": "int"},
{"name": "month","type": "string"},
{"name": "year","type": "int"}
]
}

Load data from text file in hive table for Nested records having complex data types

I am trying to bulk load data from a text file into a Hive table with the schema mentioned below.
{
"type": "record",
"name": "logs",
"namespace": "com.xx.a.log",
"fields": [{
"name": "request_id",
"type": ["null", "string"]
},
{
"name": "results",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "result",
"fields": [{
"name": "d_type",
"type": ["null", "string"],
"default": null
}, {
"name": "p_type",
"type": ["null", "string"]
}]
}
}
}, {
"name": "filtered_c",
"type": {
"type": "array",
"items": "string"
}
}, {
"name": "filtered_rejected",
"type": {
"type": "map",
"values": "int"
}
}]
}
Table creation query:
create table log_table(
request_id string ,
results array<struct<d_type:string,p_type:string>> ,
filtered_c array<string> ,
filtered_rejected map<string,int>
) row format delimited fields terminated by ',' collection items terminated by '|' map keys terminated by ':' stored as textfile
TBLPROPERTIES("serialization.null.format"="NULL");
But I am unable to work out proper delimiters for the nested record elements inside the array.
If I use ',' to separate the child entries of the (nested) results elements, then only one entry is taken for the inner object; the rest spills over into the parent-level columns.
But if I use '|' as the delimiter for the inner columns of the row element, then multiple result objects get created, with only the first entry initialized and the rest null.
Please suggest any query parameters or criteria I am missing, or an alternate way to solve this problem.
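One thing worth checking (an assumption based on how LazySimpleSerDe handles nesting, not a verified answer): with ROW FORMAT DELIMITED, separators are assigned by nesting depth, so the struct fields inside the results array sit at the third level and are split by the third-level separator (the one set by MAP KEYS TERMINATED BY here), not by the COLLECTION ITEMS delimiter. Under that assumption, a sample line for this table would look like:
req_1,dt1:pt1|dt2:pt2,c1|c2,reason1:5|reason2:10
Here ',' separates the four top-level columns, '|' separates the array items and the map entries, and ':' separates the struct fields and the map keys from their values. If that clashes with your data, storing the file as Avro (matching the schema above) or as JSON instead of delimited text avoids the delimiter juggling entirely.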
