Using Azure Data Factory to import a column of type NUMBER in Oracle results in a strange precision error

We're going from Oracle to SQL in Azure.
AFAIK we have to use pipelines and datasets, with a variety of Copy operations.
There does not seem to be a way to import the data from Oracle and manipulate it via Data Flows without putting it into a staging database first, and even then it would be too late for this issue.
The issue is that a column of type NUMBER in Oracle might have a value of 1.1234 or 2.23423485.
I set the SQL data type to DECIMAL(12, 8), which should cover all the scenarios, with a COPY TABLE operation.
I've tried doing the copy as a number, and even as varchar:
{
    "source": {
        "name": "MYDECIMALVALUE",
        "type": "String"
    },
    "sink": {
        "name": "MyDecimalValue",
        "type": "String",
        "physicalType": "varchar"
    }
},
However, the result for the above two numbers would be:
2.23423485 stays as 2.23423485
1.1234 becomes 1.12340001
There are some strange precision issues pulling NUMBER out of Oracle.
The same happens with the config above set to
{
    "source": {
        "name": "MYDECIMALVALUE",
        "type": "Decimal"
    },
    "sink": {
        "name": "MyDecimalValue",
        "type": "Decimal",
        "physicalType": "decimal",
        "precision": 12,
        "scale": 8
    }
},
Is there any way around this strange quirk?

I also tried to reproduce your scenario and got a similar result when the DECIMAL(p, s) data type is already set on the SQL table.
DECIMAL(precision, scale)
precision -- the maximum total number of digits the decimal can store, counting both sides of the decimal point. It accepts values from 1 to 38; the default is 18.
scale -- optional; the number of digits after the decimal point. Scale must be between 0 and the value of precision.
In your scenario the scale is 8 and the number coming from Oracle has a scale of 4, so SQL adds 4 zeros at the end of the number to reach the required scale.
Decimal has a fixed precision while float has variable precision.
To avoid this, a workaround is to change the data type to float in a pre-copy script:
ALTER TABLE MY_TABLE ALTER COLUMN MY_COLUMN float;
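As a quick illustration of the difference (a minimal T-SQL sketch; the literal value is just an example): DECIMAL pads the value out to the declared scale, while float keeps it as entered.
SELECT CAST(1.1234 AS DECIMAL(12, 8)) AS AsDecimal; -- returns 1.12340000
SELECT CAST(1.1234 AS FLOAT) AS AsFloat;            -- returns 1.1234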

Related

Binning Data With Two Timestamps

I'm posting because I have found no content surrounding this topic.
My goal is essentially to produce a time-binned graph that plots some aggregated value. Usually this would be a doddle, since there would be a single timestamp for each value, making it relatively straightforward to bin.
However, my problem lies in having two timestamps for each value - a start and an end, similar to a Gantt chart. I essentially want to bin the values (average) for the bins in which the timelines fall (bin boundaries could be where a new/old task starts/ends).
I'm looking for a basic example, or an answer to whether this is even supported in Vega-Lite. My current working example would yield no benefit to this discussion.
I see that you found a Vega solution, but I think in Vega-Lite what you were looking for was something like the following. You put the start field in "x" and the end field in "x2", add "bin" and "type" to "x", and all should work.
"encoding": {
"x": {
"field": "start_time",
"bin": { "binned": true },
"type": "temporal",
"title": "Time"
},
"x2": {
"field": "end_time"
}
}
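For context, here is a minimal self-contained spec built around that encoding (a sketch only - the field names, mark, and sample values are assumptions, not the asker's actual data):
{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {
        "values": [
            { "task": "A", "start_time": "2023-01-01T00:00:00", "end_time": "2023-01-01T06:00:00" },
            { "task": "B", "start_time": "2023-01-01T03:00:00", "end_time": "2023-01-01T09:00:00" }
        ]
    },
    "mark": "bar",
    "encoding": {
        "x": { "field": "start_time", "bin": { "binned": true }, "type": "temporal", "title": "Time" },
        "x2": { "field": "end_time" },
        "y": { "field": "task", "type": "nominal" }
    }
}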
I lost my old account, but I was the person who posted this. Here is my solution to my question. The value I am aggregating here is the sum, per bin, of the time each data point's timeline spends inside that bin.
First you want to use a joinaggregate transform to get the max and min times your data extends to. You could also hardcode these.
{
    "type": "joinaggregate",
    "fields": ["startTime", "endTime"],
    "ops": ["min", "max"],
    "as": ["min", "max"]
}
Next you want to find a step size for your bins; you can hardcode this or use a formula and write it into a new field, for example as shown below.
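Something like the formula transform below works (a sketch - dividing the range into 20 bins is an arbitrary choice, not part of the original solution):
{
    "type": "formula",
    "expr": "(datum.max - datum.min) / 20",
    "as": "step"
}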
Then you want to create two new fields in your data: one a sequence from the min to the max, and the other the same sequence offset by your step.
{
    "type": "formula",
    "expr": "sequence(datum.min, datum.max, datum.step)",
    "as": "startBin"
}
{
    "type": "formula",
    "expr": "sequence(datum.min + datum.step, datum.max + datum.step, datum.step)",
    "as": "endBin"
}
The new fields will be arrays, so if we go ahead and use a flatten transform we will get a row for each data value in each bin.
{
    "type": "flatten",
    "fields": ["startBin", "endBin"]
}
You then want to calculate the total time your data spans within each specific bin. To do this, clamp the start and end times into the bin's boundaries, then take the difference between the two.
{
    "type": "formula",
    "expr": "if(datum.startTime < datum.startBin, datum.startBin, if(datum.startTime > datum.endBin, datum.endBin, datum.startTime))",
    "as": "startBinTime"
}
{
    "type": "formula",
    "expr": "if(datum.endTime < datum.startBin, datum.startBin, if(datum.endTime > datum.endBin, datum.endBin, datum.endTime))",
    "as": "endBinTime"
}
{
    "type": "formula",
    "expr": "datum.endBinTime - datum.startBinTime",
    "as": "timeInBin"
}
Finally, you just need to aggregate the data by the bins and sum up these times. Then your data is ready to be plotted.
{
    "type": "aggregate",
    "groupby": ["startBin", "endBin"],
    "fields": ["timeInBin"],
    "ops": ["sum"],
    "as": ["timeInBin"]
}
Although this solution is long, it is relatively easy to implement in the transform section of your data. In my experience it runs fast and shows how versatile Vega can be. Freedom to visualisations!

How can I show a table with the sum of value x of all children within Kibana

I have an Elasticsearch database with documents stored the following way (a comma separates the documents):
{
    "path": "path/to/data",
    "kind": "type1"
},
{
    "path": "path/to/data/values1",
    "kind": "type2",
    "x": 2
},
{
    "path": "path/to/data/values2",
    "kind": "type2",
    "x": 2
},
{
    "path": "path/to/data/datasub",
    "kind": "type1"
},
{
    "path": "path/to/data/datasub/values1",
    "kind": "type2",
    "x": 1
}
Now I want to create a table view/chart that shows all type2's with the sum of x of all their children.
So I expect the total of path/to/data to be 5 and the total of path/to/data/datasub to be 1.
To consider: the depth of this structure could theoretically be unlimited.
I'm running Elasticsearch 7 and Kibana 7, and I want to use the table visualisation to start with, but I would like to be able to use this kind of aggregation throughout multiple visualisations. I have Googled a lot and found all kinds of Elasticsearch queries, but nothing on how to achieve this in Kibana.
All help is much appreciated
For those who run into the same question:
The solution I ended up using is to split the path into tokens prior to importing it into Elasticsearch. So consider a document having a path like "/this/is/a/path". This becomes the following array in the document:
[
"/this",
"/this/is",
"/this/is/a",
"/this/is/a/path"
]
You can then use a terms aggregation on it with various metrics to calculate your desired measurements.
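For instance, a query along these lines (a sketch only - the index name my-index and field name path_tokens are placeholders, and the field is assumed to be mapped as keyword):
GET my-index/_search
{
    "size": 0,
    "aggs": {
        "per_path": {
            "terms": { "field": "path_tokens", "size": 100 },
            "aggs": {
                "total_x": { "sum": { "field": "x" } }
            }
        }
    }
}
This is essentially what a Kibana data table builds when you split rows with a terms aggregation on the token field and add a Sum metric on x.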

Aggregating nested fields of varying datatypes in Elasticsearch

I have an index based on Products, and one of the fields declared in the mapping is Attributes. This field is a nested type as it will contain two values - key and value. The problem I have is that, depending on the context of the attribute, the datatype of value can vary between an integer and a string.
For example:
{"attributes":[{"key":"StrEx","value":"Red"},{"key":"IntEx","value":2}]}
It seems the datatype for every instance of 'value' within all future nested documents within Attributes is decided based on the first data entered. I need to be able to store it as an integer/long datatype so I can perform range queries.
Any help or alternative ideas would be greatly appreciated.
You need a mapping like this one, for the value field:
"value": {
"type": "string",
"fields": {
"as_number": {
"type": "integer",
"ignore_malformed": true
}
}
}
Basically, your field stays a string, but using multi-fields you can also index it as a numeric field (values that can't be parsed as numbers are simply ignored for that sub-field).
When you want to use range queries then use value.as_number, for anything else use value.
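For example, a range query over the numeric sub-field might look like this (a sketch - it assumes the nested path is called attributes, matching the document shown above):
{
    "query": {
        "nested": {
            "path": "attributes",
            "query": {
                "range": { "attributes.value.as_number": { "gte": 1, "lte": 5 } }
            }
        }
    }
}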

Azure data factory copy activity from Storage to SQL: hangs at 70000 rows

I have a data factory with a pipeline copy activity like this:
{
    "type": "Copy",
    "name": "Copy from storage to SQL",
    "inputs": [
        {
            "name": "storageDatasetName"
        }
    ],
    "outputs": [
        {
            "name": "sqlOutputDatasetName"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    },
    "policy": {
        "concurrency": 1,
        "retry": 3
    },
    "scheduler": {
        "frequency": "Month",
        "interval": 1
    }
}
The input data is approx 90MB in size, about 1.5 million rows, broken into approx. 20 x 4.5MB block blob files in Azure Storage. Here's an example of the data (CSV):
A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895
The sink is an Azure SQL Server of type S2, which is rated at 50 DTUs. I've created a simple table with sensible data types, and no keys, indexes or anything fancy, just columns:
CREATE TABLE [dbo].[Prescriptions](
[Practice] [char](6) NOT NULL,
[BnfChapter] [tinyint] NOT NULL,
[BnfSection] [tinyint] NOT NULL,
[BnfParagraph] [tinyint] NOT NULL,
[TotalItems] [int] NOT NULL,
[TotalQty] [int] NOT NULL,
[TotalActCost] [float] NOT NULL,
[TotalItemsPerThousand] [float] NOT NULL,
[TotalQtyPerThousand] [float] NOT NULL,
[TotalActCostPerThousand] [float] NOT NULL
)
The source, sink and data factory are all in the same region (North Europe).
According to Microsoft's 'Copy activity performance and tuning guide', for an Azure Storage source and an Azure SQL S2 sink, I should be getting about 0.4 MBps. By my calculation, that means 90MB should transfer in about half an hour (is that right?).
For some reason it copies 70,000 rows very quickly, then seems to hang. Using SQL Management Studio I can see the count of rows in the database table is exactly 70,000 and hasn't increased at all in 7 hours, yet the copy task is still running with no errors.
Any ideas why this is hanging at 70,000 rows? I can't see anything unusual about the 70,001st data row which would cause a problem. I've tried completely trashing the data factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows), which completes in 1 minute.
Just to answer my own question in case it helps anyone else:
The issue was with null values. The reason that my run was hanging at 70,000 rows was that at row 76560 of my blob source file, there was a null value in one of the columns. The HIVE script I had used to generate this blob file had written the null value as '\N'. Also, my sink SQL table specified 'NOT NULL' as part of the column, and the column was a FLOAT value.
So I made two changes: added the following property to my blob dataset definition:
"nullValue": "\\N"
And made my SQL table column nullable. It now runs completely and doesn't hang! :)
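For context, in an ADF (v1) blob dataset the nullValue property sits inside the TextFormat section of the dataset definition, roughly like this (the linked service name, folder path and schedule below are placeholders, not my actual definition):
{
    "name": "storageDatasetName",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "myStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "nullValue": "\\N"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Month",
            "interval": 1
        }
    }
}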
The problem is that the Data Factory did not error; it just got stuck. It would be nice if the job had failed with a useful error message and told me which row of the data was the problem. I think it got stuck at 70,000 rather than at 76,560 because the write batch size is 10,000 by default.
Here is another workaround: explicitly set the write batch size on the copy activity's sink rather than relying on the default value (10,000); a sketch of the sink config is below.
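For illustration, the sink section of the copy activity might then look something like this (the batch size value here is just an example):
"sink": {
    "type": "SqlSink",
    "writeBatchSize": 100000
}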

couchdb view/reduce. sometimes you can return values, sometimes you can't..?

This is on a recent version of Couchbase Server.
The end goal is for the reduce/group by to aggregate the values of the duplicate keys into a single row with an array value.
View result with no reduce/grouping (in reality there are maybe 50 rows like this emitted):
{
    "total_rows": 3,
    "offset": 0,
    "rows": [
        {
            "id": "1806a62a75b82aa6071a8a7a95d1741d",
            "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
            "value": "1806a62a75b82aa6071a8a7a95d1741d"
        },
        {
            "id": "47abb54bf31d39946117f6bfd1b088af",
            "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
            "value": "47abb54bf31d39946117f6bfd1b088af"
        },
        {
            "id": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f",
            "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
            "value": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f"
        }
    ]
}
with reduce + group_level=1:
function (keys, values, re) {
    return values;
}
yields an error from Couch with the actual 50 or so rows from the real view (it even fails with fewer view rows). Couch says something about the data not shrinking rapidly enough. However, this same type of thing works JUST FINE when the view keys are integers and there is a small amount of data.
Can someone please explain the difference to me?
Reduce values need to remain as small as possible, due to the nature of how they are stored in the internal b-tree data format. There's a little bit of information in the wiki about why this is.
If you want to identify unique values, this needs to be done in your map function. This section on the same wiki page shows you one method you can use to do so. (I'm sure there are others)
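To give a rough idea of that approach (a sketch of one possible method, not necessarily the exact one the wiki describes - the field name some_uuid is a placeholder): emit the value as part of the key, reduce with the built-in _count, and query with group=true so each distinct (key, value) pair comes back as its own row; collecting the values into an array per key then happens client-side.
// Map: emit a compound key so each distinct (key, value) pair gets its own row
function (doc) {
    if (doc.some_uuid) {
        emit([doc.some_uuid, doc._id], null);
    }
}
// Reduce: use the built-in _count, and query the view with ?group=true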
I am almost always going to be querying this view with a "key" parameter, so there really is no need to aggregate values via Couch; it can be done easily and efficiently in the app.
