Merging and Aggregating data held in JSON objects - ruby

I have two JSON objects, created using JSON.parse, that I would like to merge and aggregate.
I do not have the ability to store the data in a Mongo database and am unclear how to proceed.
The first JSON file contains the raw data:
[
{
"sector": {
"url": "http://TestUrl/api/sectors/11110",
"code": "11110",
"name": "Education policy and administrative management"
},
"budget": 5742
},
{
"sector": {
"url": "http://TestUrl/api/sectors/11110",
"code": "11110",
"name": "Education policy and administrative management"
},
"budget": 5620
},
{
"sector": {
"url": "http://TestUrl/api/sectors/12110",
"code": "12110",
"name": "Health policy and administrative management"
},
"budget": 5524
}
]
The second JSON file contains the mappings that I require for the data merge operation:
{
"Code (L3)":11110,
"High Level Code (L1)":1,
"High Level Sector Description":"Education",
"Name":"Education policy and administrative management",
"Description":"Education sector policy, planning and programmes; aid to education ministries, administration and management systems; institution capacity building and advice; school management and governance; curriculum and materials development; unspecified education activities.",
"Category (L2)":111,
"Category Name":"Education, level unspecified",
"Category Description":"The codes in this category are to be used only when level of education is unspecified or unknown (e.g. training of primary school teachers should be coded under 11220)."
},
{
"Code (L3)":12110,
"High Level Code (L1)":2,
"High Level Sector Description":"Health",
"Name":"Health policy and administrative management",
"Description":"Health sector policy, planning and programmes; aid to health ministries, public health administration; institution capacity building and advice; medical insurance programmes; unspecified health activities.",
"Category (L2)":121,
"Category Name":"Health, general",
"Category Description":""
},
{
"Code (L3)":99999,
"High Level Code (L1)":9,
"High Level Sector Description":"Unused Code",
"Name":"Extra Code",
"Description":"Shows Data Issue",
"Category (L2)":998,
"Category Name":"Extra, Code",
"Category Description":""
},
I would like to connect the data in the two files using the "code" value in the first file and the "Code (L3)" value in the second file. In SQL terms I would like to do an "inner join" on the files using these values as the connection point.
I would then like to aggregate all of the budget values from the first file for the "High Level Code (L1)" value from the second file to produce the following JSON object:
{
"High Level Code (L1)":1,
"High Level Sector Description":"Education",
"Budget": 11362
},
{
"High Level Code (L1)":2,
"High Level Sector Description":"Health",
"Budget": 5524
}
This would be a very simple task with a database but I am afraid that this option is not available. We are running our site on Sinatra so any Rails-specific helper methods are not available to me.
Update: I am now using real data for the inputs and I have found that there are multiple JSON objects in the mappings file that have "Code (L3)" values that do not map to any of the [Sector][code] values in the raw data file.
I have tried a number of workarounds (breaking the data into 2D arrays then trying to bring the resultant array back as a hash table) but I have been unable to get anything to work.
I have come back to the answer that I accepted for this question as it is a very elegant solution and I don't want to ask the same question twice - I just can't figure out how to make it ignore items from the mappings file when they don't match anything from the raw data file.

This is quite easy. Imagine your first list is named sources, while the second is named values, or whatever. We will loop through values, extract the required fields, and, for each one, find in sources the values needed:
values.map do |elem|
  {
    "High Level Code (L1)" => elem["High Level Code (L1)"],
    "High Level Sector Description" => elem["High Level Sector Description"],
    "Budget" => sources.select { |source| source["sector"]["code"] == elem["Code (L3)"].to_s }
                       .map { |source| source["budget"] }
                       .sum
  }
end
The equivalent of a database "join" is done here with the select operation: we filter the sources array for entries whose sector code matches "Code (L3)", extract each "budget" value, and sum all the extracted values.
The result is the following:
[{"High Level Code (L1)"=>1,
"High Level Sector Description"=>"Education",
"Budget"=>11362},
{"High Level Code (L1)"=>2,
"High Level Sector Description"=>"Health",
"Budget"=>5524}]

How about just going through the first dataset and indexing it into a hash using the code as the key, then going through the second dataset and looking up the appropriate data for every key in the hash? Sort of brute force, but...
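That could look roughly like the sketch below (reusing the sources and values names from the accepted answer). Indexing by code also makes it natural to skip mapping entries that never occur in the raw data, which addresses the update in the question:
# Index the raw data: total budget per sector code.
budgets_by_code = Hash.new(0)
sources.each { |source| budgets_by_code[source["sector"]["code"]] += source["budget"] }

# Walk the mappings, aggregating per "High Level Code (L1)" and skipping
# codes that never appear in the raw data.
totals = {}
values.each do |elem|
  code = elem["Code (L3)"].to_s
  next unless budgets_by_code.key?(code)
  entry = totals[elem["High Level Code (L1)"]] ||= {
    "High Level Code (L1)" => elem["High Level Code (L1)"],
    "High Level Sector Description" => elem["High Level Sector Description"],
    "Budget" => 0
  }
  entry["Budget"] += budgets_by_code[code]
end
result = totals.values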

Related

Elasticsearch re-index all vs join

I'm pretty new to Elasticsearch and all its concepts. I would like to understand how I could accomplish in an Elasticsearch architecture what I have in my relational DB.
The scenario is the following
I have an index "data":
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories": ["A", "A1", "B"]
}
The requirement says that data can be queried by:
some text search in the content field
that belongs to a specific type or category
So far, so simple, so good.
This data will not be complete at creation time. It might happen that new categories are added to or removed from the data later. So, many data uploads/re-indexes might happen along the way.
For example:
create the data
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories": ["A"]
}
Then it was decided that all data with type=T1 must belong to both A & B categories.
{
"id": "00001",
"content" : "some text here ..",
"type": "T1",
"categories": ["A", "B"]
}
If I have a billion hits for type=T1 I would have to update/re-index a billion entries. Maybe that is how things should work, and this is where my question lies.
Is it OK to re-index all the data just to add/remove a category, or would it be possible to have a second, much smaller index just for this association and somehow join both indexes at query time?
Something like it:
Data:
{
"id": "00001",
"content" : "some text here ..",
"type": "T1"
}
DataCategories:
{
"type": "T1",
"categories" : ["A", "B"]
}
Is it acceptable/possible?
This is a common scenario - but unfortunately, there is no 1:1 mapping for RDBMS features in text search engines like Lucene/elasticsearch.
Possible options:
1 - For the best performance, reindex. It may not be practical depending on the velocity of your changes.
2 - Consider parent-child; though it's a slower option, it will often meet performance requirements. The category could be a parent document, each having several thousand children (a rough mapping is sketched after the reading link below).
3 - If it's category renaming, consider using IDs for the categories and translating them to text in the application.
4 - Updating documents depends on the number of documents to be updated; for a few thousand, run an update query; for more, reindex.
Suggested reading - https://www.elastic.co/blog/managing-relations-inside-elasticsearch
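For option 2, parent-child in current Elasticsearch versions is modelled with a join field in the mapping. A rough sketch, where a category document is the parent and each data item is a child (the field and relation names are made up for illustration):
PUT data
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "type": { "type": "keyword" },
      "category_relation": {
        "type": "join",
        "relations": { "category": "item" }
      }
    }
  }
}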

SSAS: The way to hide certain fields in a table from certain users

For a Microsoft Analysis Services Tabular (1500) data cube, given a Sales table:
CREATE TABLE SalesActual (
Id Int,
InvoiceNumber Char(10),
InvoiceLineNumber Char(3),
DateKey Date,
SalesAmount money,
CostAmount money )
Where the GP Calculation in DAX would be
GP := SUM('SalesActual'[SalesAmount]) - SUM('SalesActual'[CostAmount])
I want to limit some users from accessing cost / GP data. Which approach would you recommend?
I can think of the following:
Split all the Sales and Cost amounts into separate rows, create a MetricType flag ('C', 'S', etc.), and set up Row-Level Security so that some people won't be able to see the cost lines.
Separate them into two different tables and handle it through OLS.
Any other recommendations?
I am leaning towards approach 1 as I have some other RLS set-up and OLS doesn't mix well with RLS, but I also want to hear from the experts what other approach could fulfill such requirements.
Thanks!
UPDATE: I ended up going with the first approach.
Tabular DB is fast for this kind of split
OLS renders the field invalid, and I'd have to create and maintain two reports... which is undesirable.
RLS is easier to control, and I think cost / GP is the only thing I need to exclude for now, but it also gives me some flexibility in the filter if I need to restrict other fields. My data will grow vertically, but I can also add other data types such as sales budget, sales forecast, expenses and other costs into the model in the future, all easily controlled by RLS.
The accepted answer works and would work for many scenarios. I appreciate the answerer's sharing; it just doesn't solve my particular situation.
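For reference, approach 1 (the RLS route) boils down to a row filter on the flag column. A minimal sketch of such a role in the same TMSL style as the answer below, assuming the split table keeps the name SalesActual and the flag column is called MetricType (both names are illustrative):
{
  "createOrReplace": {
    "object": { "database": "YourDatabase", "role": "RLS_NoCost" },
    "role": {
      "name": "RLS_NoCost",
      "modelPermission": "read",
      "tablePermissions": [
        {
          "name": "SalesActual",
          "filterExpression": "'SalesActual'[MetricType] <> \"C\""
        }
      ]
    }
  }
}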
You can create a role where CLS (column-level security) does the job. There is no GUI for CLS, but we can use a script (you can script your current role from SSMS via "Script Role As" and modify it - better to test this on a new role):
{
"createOrReplace": {
"object": {
"database": "YourDatabase",
"role": "CLS1"
},
"role": {
"name": "CLS1",
"modelPermission": "read",
"members": [
{
"memberName": "YourOrganization\\userName"
}
],
"tablePermissions": [
{
"name": "Sales",
"columnPermissions": [
{
"name": "SalesBonus",
"metadataPermission": "none"
},
{
"name": "CostAmount",
"metadataPermission": "none"
}
]
}
]
}
}
}
The key elements are tablePermissions and columnPermissions, in which we define which column or columns the user cannot use.

Dynamic Achievement System algorithm / design

I'm developing an Achievement System and it must have a CRUD that admins access to create new achievements and their rules. I need some help with the design & algorithm for this so it can easily evolve as admins ask for new rules.
Rules sample
Medal one: must complete any 5 courses with a score of at least 90
Medal two: must complete two specific courses with a score of at least 85
Medal three: must be top 5 in general ranking at least once
Medal four: must have more than 5000 points
I'll basically store that as metadata in a relational database, probably with these columns below:
action
action quantity
course quantity
score
id course
ranking
position
points
I want to know: is there any known algorithm / design for this kind of problem? Or perhaps I should store the rules differently to make it easier? I don't know; I want suggestions.
Your doubts may be right. In my opinion, a database is the wrong way to organize this data. Every new kind of achievement you want to create would add extra columns to your database, and most achievements wouldn't use most of the columns. A more flexible data structure, one that doesn't expect for every entry to use all of the possible achievement criteria at once by default, would probably be more useful. Most languages support JSON, so I suggest you use that. The structure could be something like this:
[
{
"name": "Medal One",
"requirements": {
"coursesCompleted": 5,
"scoreMin": 90
}
},
{
"name": "Medal Two",
"requirements": {
"specificCoursesCompleted": [
"Course 1",
"Course 2"
],
"scoreMin": 85
}
},
{
"name": "Medal Three",
"requirements": {
"generalRankingMin": 5
}
},
{
"name": "Medal Four",
"requirements": {
"scoreMin": 5000
}
}
]
You can see here how the criteria types are sometimes reused, but they can be omitted when not needed and new ones can be added to a few achievements without bloating the rest of the dataset as well.
PS: I made the criteria names very verbose for demonstration purposes; shortening them or not in actual use is up to preference.
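To illustrate how such a structure might be consumed, here is a rough Ruby sketch of an achievement check; the user-stats fields, the mapping from requirement keys to checks, and the achievements.json filename are all made up for the example:
require 'json'

# Hypothetical user stats; the field names are illustrative only.
user = {
  "courses_completed" => ["Course 1", "Course 3", "Course 4", "Course 5", "Course 7"],
  "best_score"        => 92,
  "best_ranking"      => 4,
  "points"            => 4100
}

# One check per requirement key from the JSON above.
CHECKS = {
  "coursesCompleted"         => ->(u, v) { u["courses_completed"].size >= v },
  "specificCoursesCompleted" => ->(u, v) { (v - u["courses_completed"]).empty? },
  "scoreMin"                 => ->(u, v) { u["best_score"] >= v },
  "generalRankingMin"        => ->(u, v) { u["best_ranking"] <= v }
}

achievements = JSON.parse(File.read("achievements.json"))
earned = achievements.select do |a|
  a["requirements"].all? { |key, value| CHECKS.fetch(key).call(user, value) }
end
puts earned.map { |a| a["name"] }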

How can I influence an Elasticsearch autosuggest query to return an exact match first?

Consider the following job titles that are indexed into 3 separate documents:
[ "Software Developer Analyst, Senior",
"Software Developer and Analyst - iOS, iPad, . Net",
"Software Developer" ]
In the real world, we have hundreds of variations of "software developer", so if the autocomplete only returns 10 documents, the exact match is likely buried in the noise.
Is it possible to do any sort of ordering to prefer exact matches?
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
The completion suggester uses an FST (a special in-memory data structure built at index time) for fast searches, so it is only possible to influence the results of future queries at index time (see the issue related to your question on GitHub).
In your case you can add a context to your suggester. For example, it could be a category field containing the features of a particular software developer job. Such features would have to be extracted somehow from the data being indexed:
PUT jobs/_doc/1
{
"suggest": "Software Developer and Analyst - iOS, iPad, . Net",
"category": ["apple", "ios", "ipad", "dotnet"]
}
And at query time you should try to extract such features from the user input before sending it to ES. For example, if the user types "java senior software developer", you transform this input into the query
POST jobs/_search
{
"suggest": {
"jobs_suggestion" : {
"prefix" : "java senior software developer",
"completion" : {
"field" : "suggest",
"size": 10,
"contexts": {
"category_context": [ "java", "senior" ]
}
}
}
}
}
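Note that for the context query above to work, the suggest field has to be mapped as a completion field with a declared category context; a minimal mapping sketch (the index, field and context names simply mirror the examples above):
PUT jobs
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion",
        "contexts": [
          {
            "name": "category_context",
            "type": "category",
            "path": "category"
          }
        ]
      }
    }
  }
}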
Of course, this approach requires complex preliminary analysis of index data and search queries.
Another option is to assign weights to job titles at index time, but in my opinion it does not fit your case.
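For completeness, index-time weighting looks roughly like this; a higher weight ranks the suggestion higher (the category value is only there to satisfy the context mapping sketched earlier and is made up):
PUT jobs/_doc/3
{
  "suggest": {
    "input": "Software Developer",
    "weight": 30
  },
  "category": ["general"]
}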

Joining logstash with parent record

I'm using Logstash to analyze my web server access logs. At this time, it works pretty well. I used a configuration file that produces this kind of data:
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "end"
},
"object_id": "boreal:12345"
...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
"type": "publication",
"id": "boreal:12345",
"sm_title": "The title of the publication",
"sm_type": "thesis",
"sm_creator": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
"sm_departement": [
"UCL/CORE - Center for Operations Research and Econometrics",
],
"sm_date": "2001",
"ss_state": "A"
...
}
And I would like to create a query like "give me all access for 'Smith, John' publications".
As all my data is not in the same index, I can't use a parent-child relation (am I right?)
I read this on a forum but it's an old post :
By limiting itself to parent/child type relationships elasticsearch makes life
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.
Using Logstash, I can't place all the data in a single index named logstash. Per month I have more than 1M accesses... In 1 year, I will have more than 15M records in 1 index... and I need to store the web access data for a minimum of 5 years (1M * 12 * 15 = 180M).
I don't think it's a good idea to deal with a single index containing more than 18M records (if I'm wrong, please let me know).
Does a solution to my problem exist? I haven't found an elegant one.
The only one I have at this time, in my Python script, is: a first query to collect all IDs of 'Smith, John' publications, then a loop over each publication to get all web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES, and the response time is not acceptable (more than 7 seconds; not so bad when you know the number of records in ES, but not acceptable for the final user).
Thanks for your help; sorry for my English
Renaud
An idea would be to use the elasticsearch logstash filter in order to get a given publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field in the publications index having the same object_id and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* index.
elasticsearch {
hosts => ["localhost:9200"]
index => "publications"
query => "id:%{object_id}"
fields => {"sm_creator" => "author"}
}
As a result of this, your access log documents will look like the one below, and for "give me all access for 'Smith, John' publications" you can simply query the author field (populated from sm_creator) in all your logstash indices:
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "end"
},
"object_id": "boreal:12345",
"author": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
...
}
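A query for "all access for 'Smith, John' publications" could then look something like this (assuming the enriched field is named author, as configured in the filter above):
POST logstash-*/_search
{
  "query": {
    "match": {
      "author": "Smith, John"
    }
  }
}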
