How to fuse rows in Pentaho Kettle - Oracle

So, I'm moving an Oracle DB to a MongoDB. I have a collection called Work, where all films, paintings and the rest are stored. I also have a collection called Authority, which holds all the people who ever interacted with a work (actors, painters, etc.). I'm trying to link Authorities to Works inside the Work collection this way:
"workCS": {
"casting": [
{
"authority": ObjectID("anID"),
"role": [
"actor",
"realisator"
]
}
],
[
{
"authority": ObjectID("otherID"),
"role": [
"actor"
]
}
]
}
I know how to make a many-to-many join in Pentaho Kettle, so I had no problem building the basic structure of the collection. However, I can't find a way to build the role array inside the casting array, and I end up with something like this:
"workCS": {
"casting": [
{
"authority": ObjectID("anID"),
"role": [
"actor"
]
}
],
[
{
"authority": ObjectID("anID"),
"role": [
"realisator"
]
}
],
[
{
"authority": ObjectID("otherID"),
"role": [
"actor"
]
}
]
}
which is inconsistent with the post-processing we apply to our data.
When I run my SQL query against the Oracle DB to get the data, I get something like this:
"id"; "LastName"; "FirstName"; "Role";
1; "Radcliffe"; "Daniel"; "Actor";
1; "Radcliffe"; "Daniel"; "Writer";
2; "Grint"; "Rupert"; "Actor";
Is there a way to fuse rows in Pentaho, so that this example comes out like this?
"id"; "LastName"; "FirstName"; "Roles";
1; "Radcliffe"; "Daniel"; "Actor, Writer";
2; "Grint"; "Rupert"; "Actor";

The step you are looking for is Group by, with an Aggregation type of "Concatenate strings separated by ," on the Role field.
You need to specify the other three columns (id, LastName, FirstName) as keys in the Group fields: even though the only real key is the Authority id, any column that is not listed as a group field will disappear from the output.
Also, use the Memory Group by step unless you have a really, really large number of rows, in which case use the Group by step and make sure the data is sorted by id (the rows will then automatically also be sorted by the names).
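Alternatively, if you would rather do the concatenation before the rows reach Kettle, Oracle can produce the same result in the SQL itself. A minimal sketch, assuming Oracle 11gR2 or later and a made-up source name your_query (replace it with your actual table or join):
-- LISTAGG collapses the Role values of each group into one comma-separated string
SELECT id,
       LastName,
       FirstName,
       LISTAGG(Role, ', ') WITHIN GROUP (ORDER BY Role) AS Roles
FROM your_query
GROUP BY id, LastName, FirstName;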

Related

Match keys with sibling object (JSONata)

I have a JSON object with the structure below. When looping over key_two I want to create a new object that I will return. The returned object should contain a title with the value of key_one's name where the id of key_one matches the node currently being looped over in key_two.
Both objects contain other keys that will also be included, but the first step I can't figure out is how to grab data from a sibling object while looping and match it against the current value.
{
"key_one": [
{
"name": "some_cool_title",
"id": "value_one",
...
}
],
"key_two": [
{
"node": "value_one",
...
}
],
}
This is a good example of a 'join' operation (in SQL terms). JSONata supports this in a path expression. See https://docs.jsonata.org/path-operators#-context-variable-binding
So in your example, you could write:
key_one#$k1.key_two[node = $k1.id].{
"title": $k1.name
}
You can then add extra fields into the resulting object by referencing items from either of the original objects. E.g.:
key_one#$k1.key_two[node = $k1.id].{
"title": $k1.name,
"other_one": $k1.other_data,
"other_two": other_data
}
See https://try.jsonata.org/--2aRZvSL
I seem to have found a solution for this.
[key_two].$filter($$.key_one, function($v, $k){
$v.id = node
}).{"title": name ? name : id}
Gives:
[
{
"title": "value_one"
},
{
"title": "value_two"
},
{
"title": "value_three"
}
]
Leaving this here in case someone has a similar issue in the future.

Filter with complex key does not work (using startkey and endkey)

I created a view with this map function:
function(doc) {
if (doc.market == "m_warehouse") {
emit([doc.logTime,doc.dbName,doc.tableName], 1);
}
}
I want to filter the data with a complex (multi-part) key:
_design/select_data/_view/new-view/?limit=10&skip=0&include_docs=false&reduce=false&descending=true&startkey=["2018-06-19T09:16:47,527","stage"]&endkey=["2018-06-19T09:16:43,717","stage"]
but I still got:
{
"total_rows": 248133,
"offset": 248129,
"rows": [
{
"id": "01CGBPYVXVD88FPDVR3NP50VJW",
"key": [
"2018-06-19T09:16:47,527",
"ods",
"o_ad_dsp_pvlog_realtime"
],
"value": 1
},
{
"id": "01CGBQ6JMEBR8KBMB8T7Q7CZY3",
"key": [
"2018-06-19T09:16:44,824",
"stage",
"s_ad_ztc_realpv_base_indirect"
],
"value": 1
},
{
"id": "01CGBQ4BKT8S2VDMT2RGH1FQ71",
"key": [
"2018-06-19T09:16:44,707",
"stage",
"s_ad_ztc_realpv_base_indirect"
],
"value": 1
},
{
"id": "01CGBQ18CBHQX3F28649YH66B9",
"key": [
"2018-06-19T09:16:43,717",
"stage",
"s_ad_ztc_realpv_base_indirect"
],
"value": 1
}
]
}
the key "ods" should not in the results.
What did I do wrong?
Your query is not a multi-key query; it is just a startkey/endkey range.
If you want results for one dbName within a specific time range, you need to change the emit to [doc.dbName, doc.logTime, doc.tableName],
and then query with startkey=["stage","2018-06-19T09:16:43,717"]&endkey=["stage","2018-06-19T09:16:47,527"]
(By the way, are you sure your timestamps are in the right order? In your example the startkey timestamp is larger than the endkey one.)
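For illustration, the revised map function would then look something like this (a sketch based on the view in the question, untested):
function(doc) {
  if (doc.market == "m_warehouse") {
    // dbName first, so that a startkey/endkey range can select one dbName over a time range
    emit([doc.dbName, doc.logTime, doc.tableName], 1);
  }
}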
As you have chosen a full date/time stamp as the first level of your key, down to millisecond precision, there are unlikely to be any repeating values in the first level of your compound key. If you indexed just the date, say, as the first key, your data would be grouped by date, dbName and table name in a more predictable way,
e.g.
["2018-06-19","ods","o_ad_dsp_pvlog_realtime"]
["2018-06-19","stage","s_ad_ztc_realpv_base_indirect"]
["2018-06-19",stage","s_ad_ztc_realpv_base_indirect"
["2018-06-19","stage","s_ad_ztc_realpv_base_indirect"
With this key structure, the hierarchical grouping of keys works in your favour i.e. all the data from "2018-06-19" is together in the index, with all the data matching ["2018-06-19","stage"] adjacent to each other.
If you need to get to millisecond precision, you could index the data as follows:
function(doc) {
if (doc.market == "m_warehouse") {
emit([doc.dbName,doc.logTime], 1);
}
}
This would create an index organised by dbName, but with a secondary sort on time. You can then extract the data for a specified dbName between two timestamps.
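For example, a range request against such a view might look like the following (a sketch reusing the design document and view names from the question; adjust them to match the view that uses this emit):
/_design/select_data/_view/new-view?reduce=false&startkey=["stage","2018-06-19T09:16:43,717"]&endkey=["stage","2018-06-19T09:16:47,527"]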

Index main object and sub-objects, and do a search on sub-objects (that returns sub-objects)

I have an object like this (simplified here). Each strain has many chromosomes, which have many locus entries, which have many features, which have many products, ... Here I just put one of each.
The structure in json is:
{
"name": "my strain",
"public": false,
"authorized_users": [1, 23, 51],
"chromosomes": [
{
"name": "C1",
"locus": [
{
"name": "locus1",
"features": [
{
"name": "feature1",
"products": [
{
"name": "product1"
//...
}
]
}
]
}
]
}
]
}
I want to add this object to Elasticsearch. For the moment I have added the objects separately: locus, features and products. That works for searching (I want to type a keyword and look at the name of the locus, the name of the features, and the name of the products), but I have to duplicate data like public and authorized_users in each subobject.
Can I index the whole object in Elasticsearch and still search at each level (locus, features and products)? And get them back individually (not return the whole Strain object)?
Yes, you can search at any level (i.e. with a query on a field like "chromosomes.locus.name").
But as you have arrays at each level, you will have to use nested objects (and nested queries) to get exactly what you want, which is a bit more complex:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.3/query-dsl-nested-query.html
For your last question: no, you cannot get subobjects individually; Elasticsearch returns the whole JSON source object.
If you want only data from the subobjects, you will have to use nested aggregations.
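To make the idea concrete, here is a minimal sketch of a nested mapping and query down to the locus level (field names taken from the question; assuming a recent Elasticsearch version without mapping types, and an index name strains chosen just for this example; deeper levels such as features and products would be mapped and queried the same way):
PUT strains
{
  "mappings": {
    "properties": {
      "chromosomes": {
        "type": "nested",
        "properties": {
          "locus": {
            "type": "nested",
            "properties": {
              "name": { "type": "text" }
            }
          }
        }
      }
    }
  }
}

GET strains/_search
{
  "query": {
    "nested": {
      "path": "chromosomes",
      "query": {
        "nested": {
          "path": "chromosomes.locus",
          "query": {
            "match": { "chromosomes.locus.name": "locus1" }
          }
        }
      }
    }
  }
}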

Rethinkdb: Including a subdocument for nested doc

I am performing an operation, and it works, but I want to know if there is a better or more efficient way to do what I want.
I have an object in my db that looks like this:
{
"id": "testId",
"name": "testName",
"products": [
{
"name": "product1"
"info": "sampleInfo",
"templateIds": [
"asdf-1",
"asdf-2"
]
},
{
"name": "product2"
"info": "sampleInfo",
"templateIds": [
"asdf-1",
"asdf-2"
]
}
]
}
As you can see, each "product" in the "products" array has a sub-array of templateIds. These match templates stored in another table. What I want to do is create a query that merges those templates onto each product object before I send it all back.
Currently I am doing this with sub-merges:
r.table('suites').get('testId').merge(function(suite){
return {
products: suite('products').merge(function(product){
return {
templates: r.expr(product('templateIds')).map(function(id) {
return r.table('templates').get(id)
})
}
})
}
})
My question is: is there a more efficient way to do this? Or is there a completely different way of thinking I should employ to do this?
Thanks guys!
That looks right to me. The only thing I can think of is that r.table('templates').get_all(r.args(product('templateIds'))) is shorter than product('templateIds').map(function(id){ return r.table('templates').get(id);}) and might well be faster.
EDIT: If you have a small number of templates, another thing that would make this run faster would be to do the substitution in the client instead and cache the retrieved templates by ID. RethinkDB will have to do a separate read for each template ID, even if it sees the same one over and over again, because it doesn't know enough to know whether or not caching those values is safe.
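Put together in the JavaScript driver (where get_all is getAll), the query might look like this; a sketch only, untested. The coerceTo('array') is there because getAll returns a stream and the merged document needs an embedded array:
r.table('suites').get('testId').merge(function(suite) {
  return {
    products: suite('products').merge(function(product) {
      return {
        // fetch all referenced templates with one getAll instead of one get per id
        templates: r.table('templates')
          .getAll(r.args(product('templateIds')))
          .coerceTo('array')
      };
    })
  };
})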

How to use two conditions in one array?

I have a list of tasks stored in Mongo, like below:
{
"name": "task1",
"requiredOS": [
{
"name": "linux",
"version": [
"6.0"
]
},
{
"name": "windows",
"version": [
"2008",
"2008R2"
]
}
],
"requiredSW": [
{
"name": "MySQL",
"version": [
"1.0"
]
}
]
}
My purpose is to filter the tasks by OS and software. For example, the user gives me the filter condition below:
{
"keyword": [
{
"OS": [
{
"name": "linux",
"version": [
"6.0"
]
},
{
"name": "windows",
"version": [
"2008"
]
}
]
},
{
"SW": [ ]
}
]
}
I need to filter out all the tasks that can run on both Windows 2008 and Linux 6.0 by searching the "requiredOS" and "requiredSW" fields. As you can see, the search condition is an array (the "OS" part). I have trouble using an array as a search condition. I expect the query to return a list of tasks that satisfy the condition.
A challenging thing is that I need to integrate the query into Spring Data using @Query, so the query must be parameterized.
Can anyone give me a hand?
I have tried a query but it returns nothing. My intention was to use $all to combine the two conditions and then use $elemMatch to search the "requiredOS" field:
{"requiredOS":{"$elemMatch":{"$all":[{"name":"linux","version":"5.0"},{"name":"windows","version":"2008"}]}}}
If I understood correctly what you are trying to do, you need to use the $elemMatch operator:
http://docs.mongodb.org/manual/reference/operator/query/elemMatch/#op._S_elemMatch
Taking your example, the query should be like:
@Query("{'requiredOS':{$elemMatch:{name:'linux', version:'6.0'},$elemMatch:{name:'windows', version:'2008'}}}")
It matches the document you provided.
You basically seem to need to translate your "parameters" into a query form that produces results, rather than passing them straight through. Here is an example "translation" where the "empty" array is considered to match "anything".
Also, the other conditions do not "literally" go straight through. The reason is that, in that form, MongoDB considers them to mean an "exact match". So what you want is a combination of the $elemMatch operator for the multiple array conditions, and the $and operator, which combines the conditions on the same property.
This is a bit longer than $all, essentially because that operator is a "shortened" form of $and, just as $in is a shortened form of $or:
db.collection.find({
"$and": [
{
"requiredOS": {
"$elemMatch": {
"name": "linux",
"version": "6.0"
}
}
},
{
"requiredOS": {
"$elemMatch": {
"name": "windows",
"version": "2008"
}
}
}
]
})
So it is just a matter of transforming the properties in the request into the required query form.
Building this into a query object can be done in a number of ways, such as using the Query builder:
DBObject query = new Query(
new Criteria().andOperator(
Criteria.where("requiredOS").elemMatch(
Criteria.where("name").is("linux").and("version").is("6.0")
),
Criteria.where("requiredOS").elemMatch(
Criteria.where("name").is("windows").and("version").is("2008")
)
)
).getQueryObject();
You can then pass this to a mongoOperations method, or to any other method that accepts a query object. For example:
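As a minimal sketch (assuming a Task document class mapped to this collection and an injected MongoOperations/MongoTemplate; both names are placeholders here), the Query can also be executed directly without converting it to a DBObject:
import java.util.List;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// Build the same criteria as above, but keep the Query object...
Query query = new Query(
    new Criteria().andOperator(
        Criteria.where("requiredOS").elemMatch(
            Criteria.where("name").is("linux").and("version").is("6.0")),
        Criteria.where("requiredOS").elemMatch(
            Criteria.where("name").is("windows").and("version").is("2008"))
    )
);
// ...and hand it to MongoOperations/MongoTemplate directly
List<Task> matchingTasks = mongoOperations.find(query, Task.class);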
