I am building a pipeline in CDAP where I connect to an Oracle database, read a table, and then feed that data into the BigQuery Multi Table sink.
Both components were validated individually by the CDAP tool itself, but when I tested the execution of the complete pipeline I received the error:
ERROR Spark program 'phase1' failed with error: BQ_TEST has no outputs. Please check that the sink calls addOutput at some point.
To use the BigQuery multi sink, you will need to set some runtime arguments to tell the sink which table to write to. The key of each argument takes the form multisink.{dataset-name}.{table-name}, and its value is the JSON string representation of the table schema.
It sounds like the source might not have any records.
Adding to the response of Yaojie Feng: the sink needs the schema in Avro format; note, however, that the Multiple Database Tables source plugin would produce the schema required by the BigQuery Multi Table plugin. Example below.
Sample pipeline runtime arguments with schema in Avro format:
Key: multisink.NEW_TABLE_NAME
Value:
{
  "name": "NEW_TABLE_NAME",
  "type": "record",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": "string" }
  ]
}
While developing a pipeline that uses Elasticsearch as a source, I ran into an issue related to paging. I am using the Elasticsearch SQL API. Basically, I started by making the request in Postman, and it works well. The request body looks like this:
{
  "query": "SELECT Id,name,ownership,modifiedDate FROM \"core\" ORDER BY Id",
  "fetch_size": 20,
  "cursor": ""
}
After the first run, the response body contains a cursor string, which is a pointer to the next page. If I send the request again in Postman with the cursor value from the previous response, it returns the data for the second page, and so on. I am trying to achieve the same result in Azure Data Factory. For this I use a Copy activity, which stores the response in Azure Blob storage. The setup for the source is the following.
[Screenshot: copy activity source configuration]
This is the expression for the body:
{
  "query": "SELECT Id,name,ownership,modifiedDate FROM \"#{variables('TableName')}\" WHERE ORDER BY Id",
  "fetch_size": #{variables('Rows')},
  "cursor": ""
}
I have no idea how to correctly set up the pagination rule. The pipeline works properly, but only for the first request. I've tried setting Headers.cursor with the expression $.cursor, but this setup leads to an infinite loop, and the pipeline fails when it hits the Elasticsearch restriction.
I've also tried to read the documentation at https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#pagination-support, but it seems pretty limited in terms of usage examples and difficult to understand.
Could somebody help me understand how to build a pipeline that makes use of paging?
The response with the cursor looks like:
{
"columns": [
{
"name": "companyId",
"type": "integer"
},
{
"name": "name",
"type": "text"
},
{
"name": "ownership",
"type": "keyword"
},
{
"name": "modifiedDate",
"type": "datetime"
}
],
"rows": [
[
2,
"mic Inc.",
"manufacture",
"2021-03-31T12:57:51.000Z"
]
],
"cursor": "g/WuAwFaAXNoRG5GMVpYSjVWR2hsYmtabGRHTm9BZ0FBQUFBRUp6VGxGbUpIZWxWaVMzcGhVWEJITUhkbmJsRlhlUzFtWjNjQUFBQUFCQ2MwNWhaaVIzcFZZa3Q2WVZGd1J6QjNaMjVSVjNrdFptZDP/////DwQBZgljb21wYW55SWQBCWNvbXBhbnlJZAEHaW50ZWdlcgAAAAFmBG5hbWUBBG5hbWUBBHRleHQAAAABZglvd25lcnNoaXABCW93bmVyc2hpcAEHa2V5d29yZAEAAAFmDG1vZGlmaWVkRGF0ZQEMbW9kaWZpZWREYXRlAQhkYXRldGltZQEAAAEP"
}
I finally found the solution; hopefully it will be useful for the community.
Basically, what needs to be done is to split the solution into a few steps.
Step 1: Make the first request as in the question description and stage the response file in blob storage.
Step 2: Read the blob file, get the cursor value, and set it to a variable (see the Set Variable sketch after these steps).
Step 3: Keep requesting data with a changed body:
{"cursor" : "#{variables('cursor')}" }
The pipeline looks like this:
[Screenshot: pipeline]
The configuration of the pagination rule looks like this:
[Screenshot: pagination]
It is a workaround, as the server ignores this header, but we need something that allows sending the request in a loop.
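The pagination screenshot is not reproduced here, but based on the description above, the header-based rule would look roughly like this in the Copy activity's REST source JSON (a sketch, assuming the REST connector is used as the source):
"source": {
    "type": "RestSource",
    "paginationRules": {
        "Headers.cursor": "$.cursor"
    }
}
As noted above, the server ignores the cursor header; the rule simply gives the copy activity a reason to keep issuing requests.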
I am working on a cloud data warehouse using Azure Data Factory v2.
Quite a few of my data sources are on-prem Oracle 12c databases.
Extracting tables 1-1 is not a problem.
However, from time to time I need to extract data generated by parametrized computations on the fly in my Copy Activities.
Since I cannot use PL/SQL stored procedures as sources in ADF, I instead use table-valued functions in the source database and query them in the copy activity.
In the majority of cases this works fine. However, when my table-valued function returns a decimal column, ADF sometimes returns erroneous values. That is: executing the TVF on the source db and previewing/copying through ADF yields different results.
I have done some experiments to see whether the absolute value or the sign of the decimal number matters, but I cannot find any pattern in which decimals are returned correctly and which are not.
Here are a few examples of the erroneously mapped numbers:
Value in Oracle db    Value in ADF
-658388.5681          188344991.6319
-205668.1648          58835420.6352
10255676.84           188213627.97348
Has any of you experienced similar problems?
Do you know if this is a bug in ADF (which does not integrate well with PL/SQL in the first place)?
First hypothesis
At first I thought the issue was related to NLS, casting or something similar.
I tested this hypothesis by creating a table on the Oracle db side, persisting the output from the TVF there, and then extracting from that table in ADF.
Using this method, the decimals were returned correctly in ADF. Thus the hypothesis does not hold.
Second hypothesis
It might have to do with user accesses.
However, the linked service used in ADF uses the same db credentials as the ones used to log in to the database and execute the TVF there.
Observation
The error seems to happen more often when a lot of aggregate functions are involved in the TVF's logic.
Minimum reproducible example
Oracle db:
CREATE OR REPLACE TYPE test_col AS OBJECT
(
  dec_col NUMBER(20,5)
)
/

CREATE OR REPLACE TYPE test_tbl AS TABLE OF test_col;

create or replace function test_fct(param date) return test_tbl
AS
  ret_tbl test_tbl;
begin
  select
    test_col(
      <"some complex logic which return a decimal">
    )
  bulk collect into ret_tbl
  from <"some complex joins and group by's">;

  return ret_tbl;
end test_fct;

select dec_col from table(test_fct(sysdate));
ADF:
Dataset:
{
"name": "test_dataset",
"properties": {
"linkedServiceName": {
"referenceName": "some_name",
"type": "LinkedServiceReference"
},
"folder": {
"name": "some_name"
},
"annotations": [],
"type": "OracleTable",
"structure": [
{
"name": "dec_col",
"type": "Decimal"
}
]
}
}
Pipeline:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "OracleSource",
"oracleReaderQuery": "select * from table(test_fct(sysdate))",
"partitionOption": "None",
"queryTimeout": "02:00:00"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "test_dataset",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
I'm trying to build a pipeline where Avro data is written into a Postgres DB. Everything works fine with simple schemas and the AvroConverter for the values. However, I would like to have a nested field written into a JSONB column. There are a couple of problems with this. First, it seems that the Connect plugin does not support STRUCT data. Second, the plugin cannot write directly into the JSONB column.
The second problem should be avoided by adding a cast in PG, as described in this issue. The first problem is proving more difficult. I have tried different transformations but have not been able to get the Connect plugin to interpret one complex field as a string. The schema in question looks something like this (in practice there would be more fields on the first level besides the timestamp):
{
"namespace": "test.schema",
"name": "nested_message",
"type": "record",
"fields": [
{
"name": "timestamp",
"type": "long"
},
{
"name": "nested_field",
"type": {
"name": "nested_field_record",
"type": "record",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "prop",
"type": "float",
"doc": "Some property"
}
]
}
}
]
}
The message is written in Kafka as
{"timestamp":1599493668741396400,"nested_field":{"name":"myname","prop":377.93887}}
In order to write the contents of nested_field into a single DB column, I would like to interpret this entire field as a string. Is this possible? I have tried the Cast transformation, but it only supports primitive Avro types. Something along the lines of HoistField could work, but I don't see a way to limit it to a single field. Any ideas or advice would be greatly appreciated.
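To make the goal concrete, the sink would ideally receive the record with nested_field already flattened to its JSON string representation, something like this hand-written sketch (not the output of any existing transform):
{
  "timestamp": 1599493668741396400,
  "nested_field": "{\"name\":\"myname\",\"prop\":377.93887}"
}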
A completely different approach would be to use two connect plugins and UPSERT into the table. One plugin would use the AvroConverter for all fields save the nested one, while the second plugin uses the StringConverter for the nested field. This feels wrong in all kinds of ways though.
I have a Kafka Elasticsearch sink connector properties file like the following:
name=elasticsearch.sink.direct
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=16
topics=data.my_setting
connection.url=http://dev-elastic-search01:9200
type.name=logs
topic.index.map=data.my_setting:direct_my_setting_index
batch.size=2048
max.buffered.records=32768
flush.timeout.ms=60000
max.retries=10
retry.backoff.ms=1000
schema.ignore=true
transforms=InsertKey,ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=MY_SETTING_ID
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=MY_SETTING_ID
This works perfectly for a single topic (data.my_setting). I would like to use the same connector for data coming in from more than one topic. A message in a different topic will have a different key, which I'll need to transform. I was wondering if there's a way to use if-else logic with a condition on the topic name, or on a single field in the message, so that I can transform the key differently. All the incoming messages are JSON with schema and payload.
UPDATE based on the answer:
In my JDBC connector I add the key as follows:
name=data.my_setting
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
poll.interval.ms=500
tasks.max=4
mode=timestamp
query=SELECT * FROM MY_TABLE with (nolock)
timestamp.column.name=LAST_MOD_DATE
topic.prefix=investment.ed.data.app_setting
transforms=ValueToKey
transforms.ValueToKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.ValueToKey.fields=MY_SETTING_ID
However, I still get the following error when a message produced by this connector is read by the Elasticsearch sink:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
Caused by: org.apache.kafka.connect.errors.DataException: STRUCT is not supported as the document id
The payload looks like this:
{
"schema": {
"type": "struct",
"fields": [{
"type": "int32",
"optional": false,
"field": "MY_SETTING_ID"
}, {
"type": "string",
"optional": true,
"field": "MY_SETTING_NAME"
}
],
"optional": false
},
"payload": {
"MY_SETTING_ID": 9,
"MY_SETTING_NAME": "setting_name"
}
}
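For reference, with only the ValueToKey transform applied on the source side, the record key (as opposed to the value shown above) becomes a one-field struct built from the value. Serialized by the JsonConverter with schemas enabled, it would look roughly like this sketch:
{
  "schema": {
    "type": "struct",
    "fields": [{
      "type": "int32",
      "optional": false,
      "field": "MY_SETTING_ID"
    }],
    "optional": false
  },
  "payload": {
    "MY_SETTING_ID": 9
  }
}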
The Connect standalone properties file looks like this:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/apps/{env}/logs/infrastructure/offsets/connect.offsets
rest.port=8084
plugin.path=/usr/share/java
Is there a way to achieve my goal, which is to have messages from multiple topics (in my case, DB tables), each with its own unique id (which will also be the id of the document in ES), sent to a single ES sink?
Can I use Avro for this task? Is there a way to define the key in the Schema Registry, or will I run into the same problem?
This isn't possible. You'd need multiple Connectors if the key fields are different.
One option to think about is pre-processing your Kafka topics through a stream processor (e.g. Kafka Streams, KSQL, Spark Streaming, etc.) to standardise the key fields, so that you can then use a single connector. It depends on what you're building as to whether this would be worth doing, or overkill.
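If you do go the multiple-connector route, the second connector is essentially a copy of the first with its own topic and its own key field. A rough sketch in Connect REST API JSON form (the topic data.other_setting and field OTHER_SETTING_ID are made up for illustration; the same keys work in a second standalone .properties file):
{
  "name": "elasticsearch.sink.other_setting",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "16",
    "topics": "data.other_setting",
    "connection.url": "http://dev-elastic-search01:9200",
    "type.name": "logs",
    "schema.ignore": "true",
    "transforms": "InsertKey,ExtractId",
    "transforms.InsertKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.InsertKey.fields": "OTHER_SETTING_ID",
    "transforms.ExtractId.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.ExtractId.field": "OTHER_SETTING_ID"
  }
}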
I am trying to work with a JSON file with this bag structure:
{
"user_id": "kim95",
"type": "Book",
"title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
"year": "1995",
"publisher": "ACM Press and Addison-Wesley",
"authors": [
{
"name": "null"
}
],
"source": "DBLP"
}
{
"user_id": "marshallo79",
"type": "Book",
"title": "Inequalities: Theory of Majorization and Its Application.",
"year": "1979",
"publisher": "Academic Press",
"authors": [
{
"name": "Albert W. Marshall"
},
{
"name": "Ingram Olkin"
}
],
"source": "DBLP"
}
I tried to use a SerDe to load JSON data into Hive. I followed both approaches that I saw here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
With this code:
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
user_id:string,
type:string,
title:string,
year:string,
publisher:string,
authors:array<struct<name:string>>,
source:string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';
I got this error:
error while compiling statement: failed: parseexception line 2:17 cannot recognize input near ':' 'string' ',' in column type
I also tried this version: https://github.com/rcongiu/Hive-JSON-Serde
which gave a different error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde
Any idea?
I also want to know what the alternatives are for working with JSON like this to query the 'name' field in 'authors', whether in Pig or Hive.
I have already converted it into a TSV file. But since my authors column is a tuple, I don't know how to query 'name' with Hive if I build a table from this file. Should I change my TSV conversion script or keep it? Or are there any alternatives with Hive or Pig?
Hive does not have built-in support for JSON, so to use JSON with Hive we need a third-party JAR such as:
https://github.com/rcongiu/Hive-JSON-Serde
There are a couple of issues with the CREATE TABLE statement. It should look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
  user_id string,
  type string,
  title string,
  year string,
  publisher string,
  authors array<struct<name:string>>,
  source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION...
The JSON records you are using must keep each record on a single line, like this:
{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"}
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press","authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
After downloading the project from Git, you need to compile it, which will create a JAR. You need to add this JAR to the Hive session before running the CREATE TABLE statement.
Hope it helps...!!!
ADD JAR only adds the JAR to the current session, so it won't be available later, and you still end up with the error.
Get the JAR loaded on all the nodes into the Hive and MapReduce library paths, like the locations below, so that the Hive and MapReduce components will pick it up whenever it is called:
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar
Note: this path varies from cluster to cluster.