DMN rule for checking reference data - validation

New to dmn rule and trying to evaluate a condition.
Suppose I have 2 input tables.
Input A has keys - keyA1, keyA2
Input B has keys - keyB1, keyB2
I want to find out if each row in Input A has corresponding row in Input B based on keys.
Can it be achieved via DMN rule engine?

Assuming Input A is a (DMN) Relation with column keys with the example entries you provided,
Assuming Input B is another Relation with column keys with the example entries you provided, you want to check each element of Input A.keys is contained in the Input B.keys.
You can achieve it with the following expression:
all( for aKey in Input A.keys return list contains( Input B.keys, aKey ) )
refs:
all()
list contains()

Related

How to get an array of values where columns match multiple criteria

I have a table of data similar to:
where I'd like to get just the shapes which match a set of given criteria (in this case week=2 and colour=blue).
I can return the first result using index and match like:
=ArrayFormula(INDEX(C2:C14,MATCH($F$1&$F$2,A2:A14&B2:B14,0)))
but I'd like to return the all matching values (eg square and triangle) in to the range F3:Fsomething. This would preferably be done using a formula that returns a range and isn't "copied-down", as a list of all possible shapes isn't known beforehand.
How can I modify this formula to achieve this?
See if this works:
=FILTER (C2:C14, B2:B14=F2, A2:A14=F1)
to do multiple criteria you want to use * like so
=FILTER (C2:C14, (B2:B14=F2) * (A2:A14=F1))
and if you want the results all in the same cell with a delimiter, use TEXTJOIN
=TEXTJOIN([DELIMETER],[IGNORE EMPTY TEXT],text1)
=TEXTJOIN(", ",TRUE,FILTER(C2:C14,(B2:B14=F2)*(A2:A14=F1)))

sort values of an orddict

In order to extract the values (records) of an orddict as a sorted list, tried this:
-module(test).
-compile(export_all).
-record(node, {name="", cost=0}).
test() ->
List = orddict:append("A",#node{name="A",cost=1},
orddict:append("B",#node{name="B",cost=2},
orddict:new())),
lists:sort(fun({_,A},{_,B}) -> A#node.cost =< B#node.cost end,
orddict:to_list(List)).
The sort fails with exception error: {badrecord,node}.
What would be the correct syntax?
Solved:
The correct insertion method is orddict:store/2 instead of orddict:append/2. Then the pattern {_,A} matches for the comparison function.
The correct syntax is:
lists:sort(fun({_,[A]},{_,[B]}) -> A#node.cost =< B#node.cost end,
orddict:to_list(List)).
I not found note about this in documentation,but you can look in source code of module.
As #Pascal write in comments the reason is that orddict:append/3 is a function provided to append a value to an existing Key/Value pair where Value must be a list. In the use case, the key doesn't exist, so the pair is created and the Value append to an empty list.
Btw, you always can print and compare real and expected result.
io:format("~p~n",[orddict:to_list(List)])
For your example that is:
[{"A",[{node,"A",1}]},{"B",[{node,"B",2}]}]

Join of two datasets in Mapreduce/Hadoop

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here's what I exactly need to do:
I am having two sets of data:
point information which is stored as (tile_number, point_id:point_info) , this is a 1:n key-value pairs. This means for every tile_number, there might be several point_id:point_info
Line information which is stored as (tile_number, line_id:line_info) , this is again a 1:m key-value pairs and for every tile_number, there might be more than one line_id:line_info
As you can see the tile_numbers are the same between the two datasets. now what I really need is to join these two datasets based on each tile_number. In other words for every tile_number, we have n point_id:point_info and m line_id:line_info. What I want to do is to join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number
In order to clarify, here's an example:
For point pairs:
(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)
for line pairs:
(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)
what I want is as following:
for tile 0:
(tile0, point0:line0)
(tile0, point0:line1)
(tile0, point1:line0)
(tile0, point1:line1)
for tile 1:
(tile1, point1:line2)
(tile1, point1:line3)
(tile1, point2:line2)
(tile1, point2:line3)
Use a mapper that outputs titles as keys and points/lines as values. You have to differentiate between the point output values and line output values. For instance you can use a special character (even though a binary approach would be much better).
So the map output will be something like:
tile0, _point0
tile1, _point0
tile2, _point1
...
tileX, *lineL
tileY, *lineK
...
Then, at the reducer, your input will have this structure:
tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
and you will have to take the values separate the points and the lines, do a cross product and output each pair of the cross-product , like this:
tileX (lineK, pointP)
tileX (lineK, pointR)
...
If you can already easily differentiate between the point values and the line values (depending on your application specifications) you don't need the special characters (*,_)
Regarding the cross-product which you have to do in the reducer:
You first iterate through the entire values List, separate them into 2 list:
List<String> points;
List<String> lines;
Then do the cross-product using 2 nested for loops.
Then iterate through the resulting list and for each element output:
tile(current key), element_of_the_resulting_cross_product_list
So basically you have two options here.Reduce side join or Map Side Join .
Here your group key is "tile". In a single reducer you are going to get all the output from point pair and line pair. But you you will have to either cache point pair or line pair in the array. If either of the pairs(point or line) are very large that neither can fit in your temporary array memory for single group key(each unique tile) then this method will not work for you. Remember you don't have to hold both of key pairs for single group key("tile") in memory, one will be sufficient.
If both key pairs for single group key are large , then you will have to try map-side join.But it has some peculiar requirements. However you can fulfill those requirement by doing some pre-processing your data through some map/reduce jobs running equal number of reducers for both data.

Querying with multiple predicates on sub-objects in a collection with MongoDB (official c# driver)

I've the following data structure:
A
_id
B[]
_id
C[]
_id
UserId
I'm trying to run the following query:
where a.B._id == 'some-id' and a.B.C.UserId=='some-user-id'.
That means I need to find a B document that has a C document within with the relevant UserId, something like:
Query.And(Query.EQ("B._id", id), Query.EQ("B.C.UserId", userId));
This is not good, of course, as it may find B with that id and another different B that has C with that UserId. Not good.
How can I write it with the official driver?
If the problem is only that your two predicates on B are evaluated on different B instances ("it may find B with that ID and another different B that has C with that UserId"), then the solution is to use an operator that says "find me an item in the collection that satisfies both these predicates together".
Seems like the $elemMatch operator does exactly that.
From the docs:
Use $elemMatch to check if an element in an array matches the specified match expression. [...]
Note that a single array element must
match all the criteria specified; [...]
You only need to use this when more than 1 field must be matched in the array element.
Try this:
Query.ElemMatch("B", Query.And(
Query.EQ("_id", id),
Query.EQ("C.UserId", userId)
));
Here's a good explanation of $elemMatch and dot notation, which matches this scenario exactly.

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all they keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently pig maps need the key to a chararray (string) that you supply and not a variable which contains a string. so in map#key the key has to be constant string that you supply (eg: map#'keyvalue').
The typical use case of this is to load a complex data structure one of the element being a key value pair and later in a foreach statement you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and even parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for your constructing a real solution.
Great question!
I personally do not like Maps in Pig. They have a place in traditional programming languages like Java, C# etc, wherein its really handy and fast to lookup a key in the map. On the other hand, Maps in Pig have very limited features.
As you rightly pointed, one can not lookup variable key in the Map in Pig. The key needs to be Constant. e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not allowed.
If you think about it, you do not need Map in Pig. One usually loads the data from some source, transforms it, joins it with some other dataset, filter it, transform it and so on. JOIN actually does a good job of looking up the variable keys in the data.
e.g. data1 has 2 columns A and B and data2 has 3 columns X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a Map (from traditional language) which maps value of column Z to value of column B (via column A). So data1 essentially represents a Map A -> B.
So why do we need Map in Pig?
Usually Hadoop data are the dumps of different data sources from Traditional languages. If original data sources contain Maps, the HDFS data would contain a corresponding Map.
How can one handle the Map data?
There are really 2 use cases:
Map keys are constants.
e.g. HttpRequest Header data contains time, server, clientIp as the keys in Map. to access value of a particular key, one case access them with Constant key.
e.g. header#'clientIp'.
Map keys are variables.
In these cases, you would most probably would want to JOIN the Map keys with some other data set. I usually convert the Map to Bag using UDF MapToBag, which converts map data into Bag of 2 field tuples (key, value). Once map data is converted to Bag of tuples, its easy to join it with other data sets.
I hope this helps.
1)If you want to load map data it should be like "[programming#SQL,rdbms#Oracle]"
2)If you want to load tuple data it should be like "(first_name_1234,middle_initial_1234,last_name_1234)"
3)If you want to load bag data it should be like"{(project_4567_1),(project_4567_2),(project_4567_3)}"
my file pigtest.csv like this
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
my schema:
a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;
dump b;
I think you need to think in term of relations and the map is just one field of one record. Then you can apply some operations on the relations, like joining the two sets data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple look up:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data and then convert and store in key value format to read for later use
data = load 'somedata.csv' using PigStorage(',')
STORE data into 'folder' using PigStorage('#')
and then read as a mapped data.

Resources