Rule-based mapping on Copy Activity in Azure Data Factory - parquet

I'm trying to create a dynamic mapping when I use copy data activity on Azure Data Factory.
I want to create a Parquet file that contains the same data that I'm reading from the source, but I want to modify some column names to remove the white space in them (white space in column names is not supported by the Parquet format), and I want to do that automatically.
I have seen that this is possible in Mapping Data Flows, but I don't see any such functionality in the Copy activity (Mapping Data Flows are limited to a few source connectors, so I can't use them).
As you can see in the image, it seems that I can only modify individual columns, not a set of them that fulfil certain conditions.
How can I do that?
Thanks in advance

Here is a solution for applying a dynamic column-name mapping in ADF so that you can still use the Copy Data activity with the Parquet format, even when the source column names contain white-space characters, which are not supported.
The solution involves three parts:
1. Dynamically generate your list of mapped column names. The example below demonstrates how you could encode the white space from a SQL database table source dataset dynamically with a Lookup activity (referred to as 'lookup column mapping' below):
;with cols as (
    select
        REPLACE(column_name, ' ', '__wspc__') as new_name,
        column_name as old_name
    from INFORMATION_SCHEMA.columns
    where table_name = '@{pipeline().parameters.SOURCE_TABLE}'
      and table_schema = '@{pipeline().parameters.SOURCE_SCHEMA}'
)
select ' |||' + old_name + '||' + new_name + '|' as mapping
from cols;
2. Use an expression to repack the column mapping derived in the Lookup activity in step 1 into the JSON syntax expected by the Copy Data activity template. You can put this in a Set Variable activity with an Array-type variable (referred to as 'column_mapping_list' below):
@json(
    concat(
        '[ ',
        join(
            split(
                join(
                    split(
                        join(
                            split(
                                join(
                                    xpath(
                                        xml(
                                            json(
                                                concat(
                                                    '{\"root_xml_node\": ',
                                                    string(activity('lookup column mapping').output),
                                                    '}'
                                                )
                                            )
                                        ),
                                        '/root_xml_node/value/mapping/text()'
                                    ),
                                    ','
                                ),
                                '|||'
                            ),
                            '{\"source\": { \"name\": \"'
                        ),
                        '||'
                    ),
                    '\" },\"sink\": { \"name\": \"'
                ),
                '|'
            ),
            '\" }}'
        ),
        ' ]'
    )
)
Unfortunately the expression is more convoluted than we would like, as the xpath function requires a single root node, which the lookup activity output does not provide, and the string escaping of the ADF JSON templates presents some challenges to simplifying this.
3. Lastly, use the column_mapping_list variable as "dynamic content" in the mapping section of the Copy Data activity, with the following expression:
@json(
    concat(
        '{ \"type\": \"TabularTranslator\", \"mappings\":',
        string(variables('column_mapping_list')),
        '}'
    )
)
Expected results:
Step 1.
'my wspccol' -> '|||my wspccol||my__wspc__wspccol|'
Step 2.
'|||my wspccol||my__wspc__wspccol|' -> ['{ "source": { "name": "my wspccol" }, "sink": { "name": "my__wspc__wspccol" } }']
Step 3.
{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "my wspccol" },
            "sink": { "name": "my__wspc__wspccol" }
        }
    ]
}
Additionally:
Keep in mind that the solution can just as easily be reversed, so if you want to load that Parquet file back into a SQL table with the original column names, you can use the same expressions to build your dynamic copy data mapping; just swap old_name and new_name in step 1 to map back to the original names, for example:
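A minimal sketch of the reversed lookup query, assuming the target SQL table still defines the original column names (the SINK_TABLE and SINK_SCHEMA parameter names are hypothetical; adjust them to your pipeline):
;with cols as (
    select
        -- encoded name as it appears in the parquet file (source side)
        REPLACE(column_name, ' ', '__wspc__') as old_name,
        -- original name in the SQL table (sink side)
        column_name as new_name
    from INFORMATION_SCHEMA.columns
    where table_name = '@{pipeline().parameters.SINK_TABLE}'
      and table_schema = '@{pipeline().parameters.SINK_SCHEMA}'
)
select ' |||' + old_name + '||' + new_name + '|' as mapping
from cols;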
A data type can also be specified in the mapping where needed. Adjust the syntax accordingly, following the documentation here: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.management.datafactory.models.tabulartranslator.mappings?view=azure-dotnet
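As an illustrative sketch only (the column names and types below are examples, not taken from the original answer), a typed TabularTranslator mapping would look like:
{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "my wspccol", "type": "String" },
            "sink": { "name": "my__wspc__wspccol", "type": "String" }
        }
    ]
}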

The Copy activity can change from one file type to another, e.g. CSV to JSON or Parquet to a database table, but it does not inherently allow any transforms, such as changing the content of columns or adding additional columns.
Alternatively, consider using ADF to call a Databricks notebook for these complex rule-based transforms, for example:
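A minimal PySpark sketch of what such a notebook could do; the paths, file format and the __wspc__ naming convention are illustrative assumptions, not part of the original answer:
# Runs inside a Databricks notebook, where the `spark` session is already available.
# Read the source data, encode white space in column names, then write Parquet.
df = spark.read.format("csv").option("header", "true").load("/mnt/source/my_table")

renamed = df.toDF(*[c.replace(" ", "__wspc__") for c in df.columns])

renamed.write.mode("overwrite").parquet("/mnt/sink/my_table")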

Related

Add column to nested tables using outer value in Power Query

I have an outer table with "Name" and "Content" columns; I also have nested tables contained in the "Content" column.
How do I add a new column to the nested tables using the value of "Name" from the outer one?
If I add a new column in the outer table using
= Table.AddColumn(Step-1,"NewColOut", each Table.AddColumn([Content],"FileName", (x)=> [Name]))
it works fine, but what if I want to transform "Content" without adding a new column to the outer table?
I tried Table.TransformColumns, but to no avail; I am not able to bring in the "Name" value at the nested-table level.
Any help would be greatly appreciated.
You don't really need to do this, since if you expand the embedded table it will automatically copy down the file name, but if you wanted to, you could use this simple line:
#"Added Custom1" = Table.AddColumn(#"PriorStepNameGoesHere", "NewColOut", each let name=[Name] in Table.AddColumn([Content],"Filename",each name))
or with a transform:
#"Added Custom1" = Table.FromRecords(Table.TransformRows(#"PriorStepNameGoesHere",
    (r) => Record.TransformFields(r,
        {"Content", each Table.AddColumn(_, "NewColOut", each r[Name])})))

Prepared statements with TBS

I'm trying to merge the results of a prepared statement into TBS. Here's my code:
$s = $link->prepare("SELECT * FROM newsletters WHERE newsletter_id = :newsletter_id");
$s->bindParam(":newsletter_id",$newsletter_id);
$s->execute();
$newsletter = $s->fetch(PDO::FETCH_ASSOC);
$tbs->MergeBlock('$newsletter ', $newsletter );
But I can't get the result fields. I get errors like the following:
TinyButStrong Error in field [newsletter.title...]: item 'title' is not an existing key in the array.
I can't find my error.
MergeBlock() is for merging a recordset, so you should use $s->fetchAll() instead of $s->fetch(). The section of the template will be repeated for each record.
But if you have to merge a standalone record, use MergeField() instead of MergeBlock(). The single fields will be merged one by one, without repeating.
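A short, untested sketch of both approaches, assuming the template block/field name is newsletter (as suggested by the error message):
// Repeating block: fetch all rows and merge them as a block.
$rows = $s->fetchAll(PDO::FETCH_ASSOC);
$tbs->MergeBlock('newsletter', $rows);

// Standalone record: fetch a single row and merge its fields one by one.
$row = $s->fetch(PDO::FETCH_ASSOC);
$tbs->MergeField('newsletter', $row);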

Full-text search a JSON string

I have a question: in my DB I have a table with a field that contains a JSON string, like:
field "description"
{
    solve_what: "Add project problem",
    solve_where: "In project CRUD",
    shortname: "Add error"
}
How can I full-text search this string? For example, I need to find all records that have "project" in description.solve_what. In my sphinx.conf I have
sql_attr_json = description
P.S. Maybe I can do this with Elasticsearch?
I've just answered a very similar question here:
http://sphinxsearch.com/forum/view.html?id=13861
Note there is no support for extracting them as FIELDs at this time, so you can't full-text search the text within the JSON elements.
(To do that you would have to use MySQL string-manipulation functions to create a new column to index as a normal field. Something like:
SELECT id,
       SUBSTR(json_column,
              LOCATE('"tag":"', json_column) + 7,
              LOCATE('"', json_column, LOCATE('"tag":"', json_column) + 7)
                - LOCATE('"tag":"', json_column) - 7) AS tag, ...
is messy but should work...)
The code is untested.
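For the field in the question, an untested sketch of the same idea, assuming the description column is stored exactly as shown above and you are indexing from MySQL (my_table is a placeholder). The extracted solve_what_text column can then be indexed as a normal full-text field, while description can still be declared with sql_attr_json:
SELECT id,
       SUBSTR(description,
              LOCATE('solve_what: "', description) + 13,
              LOCATE('"', description, LOCATE('solve_what: "', description) + 13)
                - LOCATE('solve_what: "', description) - 13) AS solve_what_text,
       description
FROM my_table;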

Couchbase view with composite keys : How to set the right startkey/endkey range?

Having this document structure:
"document":{
"title":"Cheese ",
"start_date": "2010-02-17T15:35:00",
"source": "s_source1"
}
I would like to create a view that returns all document ids between two dates and for a certain source:
function (doc, meta) {
    emit([doc.start_date, doc.source], null);
}
I tried using this range of keys to get all documents of source s_source1 between "2014-04-04" and "2014-04-05":
startkey=["2014-04-04","s_source1"]&endkey=["2014-04-05","s_source1"]
But this doesn't work for the sources. It retrieves all documents in the date range, but for all sources (s_source1, s_source2, ...).
I guess the underscore is the source of the problem (some encoding issue)?
How should I set my key range to get only documents of a unique source for a certain date range?
If you reverse your compound key then you'll be able to do the select; keys sort from left to right in Couchbase.
function (doc, meta) {
    if (meta.type == "json") {
        if (doc.start_date && doc.source) {
            emit([doc.source, dateToArray(doc.start_date)], null);
        }
    }
}
To select all documents with a source value of "s_source1" from a date in 2010 until the present day, you'd have your keys like so:
Start_Key: ["s_source1",[2010,2,18,15,35,0]]
End_key: ["s_source1",[2014,2,18,15,35,0]]
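For illustration, the corresponding view query over the REST API might look like this (the bucket, design document and view names are placeholders, and in practice the key arrays need to be URL-encoded):
GET http://localhost:8092/mybucket/_design/dev_views/_view/by_source_date?startkey=["s_source1",[2010,2,18,15,35,0]]&endkey=["s_source1",[2014,2,18,15,35,0]]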
This question on the Couchbase website has some fantastic explanations of compound key sorting, I'd thoroughly recommend reading it: http://www.couchbase.com/communities/q-and-a/couchbase-view-composite-keys
Plus here is an informative section from the official documentation: http://docs.couchbase.com/couchbase-manual-2.0/#selecting-information

With Oracle XML Tables do XQuery selects use XmlIndexes?

I am trying to retrieve keys and parent keys from some structured XML stored as binary XML in Oracle. I have tried creating an unstructured index and also an index with a structured component. The structured component works fine when doing a SELECT against XMLTABLE(), but I cannot retrieve values of the parent node using XMLTable. I am therefore trying the following XQuery to retrieve the parent values, but it is not using the index at all. Does this style of query support using XMLIndexes? I can't find anything in the docs that says either way.
SELECT y.*
FROM xml_data x, XMLTABLE(xmlnamespaces( DEFAULT 'namespace'),
'for $i in /foo/bar
return element r {
$i/someKey
,element parentKey { $i/../someKey }
}'
PASSING x.import_xml
COLUMNS
someKey VARCHAR2(100) PATH 'someKey'
,parentKey VARCHAR2(100) PATH 'parentKey'
) y
Thanks, Tom
