Aggregate only the newest document - Elasticsearch

I have an Elasticsearch index that holds user state history documents. The data looks like this:
{
  "session_id": "yunus",
  "state_name": "start",
  "entry_time": "2016-11-09 15:27:03"
},
{
  "session_id": "yunus",
  "state_name": "end",
  "entry_time": "2016-11-09 16:30:00"
},
{
  "session_id": "can",
  "state_name": "start",
  "entry_time": "2016-11-09 12:01:00"
},
{
  "session_id": "rick",
  "state_name": "start",
  "entry_time": "2016-11-09 09:00:00"
},
{
  "session_id": "rick",
  "state_name": "end",
  "entry_time": "2016-11-10 10:00:00"
}
I want to aggregate by state name with a date histogram, but count only each session's most recent state as of that point in time. So the result could be:
2016-11-08
start = 0
end = 0
2016-11-09
start = 2
end = 1
2016-11-10
start = 1
end = 2
The plan is to generate a grouped bar chart over a timeline to show how the states change over time.
I tried several things, such as pipeline aggregations and top hits, but couldn't make any progress.
Any help is appreciated.

For anyone interested, I solved it with Spark. I used the elasticsearch-spark connector to read from Elasticsearch and then write the results back to Elasticsearch.
Here is the read from ES as an RDD:
val allData = sc.esRDD(s"states_${id}/log", query)
Then I group by session ID and reduce by timestamp, to keep only the latest state of each session:
val latestStates = allData
  .groupBy(k => k._2.get("session_id").get)
  .map(k => k._2.reduceLeft { (d1, d2) =>
    // keep whichever document has the larger (more recent) timestamp
    if (d1._2.get("timestamp").get.asInstanceOf[Long] > d2._2.get("timestamp").get.asInstanceOf[Long]) d1
    else d2
  })
  .map(_._2)
Once I have the latest state of each session, I filter out the exit states and then count by value:
val stateSummary = latestStates
  .filter(s => s.isDefinedAt("state_id") && s("state_id").asInstanceOf[Long] != -1)
  .map(s => (s("state_id"), s("state_name")))
  .countByValue()
  .map(d => Map(
    "state_id"   -> d._1._1.asInstanceOf[Long],
    "state_name" -> d._1._2.asInstanceOf[String],
    "count"      -> d._2))
  .toList
Now we have the current number of sessions in each state ("current" is configurable, so we can set it to a specific time). The only thing left is to write back to Elasticsearch:
sc.makeRDD(Seq(finalElasticDoc)).saveToEs(s"states_${id}/analytic_daily")


Group user events into activity sessions

You’re in charge of implementing a new analytics “sessions” view. You’re given a data set consisting of individual web page visits, each with a visitorId generated by a tracking cookie that uniquely identifies the visitor. From this data we need to generate a list of sessions for each visitor.
The data set looks like this:
"events": [
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754583000
},
{
"url": "/pages/a-small-dog",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754631000
},
{
"url": "/pages/a-big-talk",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709065294
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512711000000
},
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754436000
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709024000
}
]
}
Given this input data, we want to create a set of sessions from the incoming events. A session is defined as a group of events from a single visitor with no more than 10 minutes between consecutive events. A visitor can have multiple sessions. So given the example input data above, we would expect output which looks like:
{
  "sessionsByUser": {
    "f877b96c-9969-4abc-bbe2-54b17d030f8b": [
      {
        "duration": 41294,
        "pages": [
          "/pages/a-sad-story",
          "/pages/a-big-talk"
        ],
        "startTime": 1512709024000
      },
      {
        "duration": 0,
        "pages": [
          "/pages/a-sad-story"
        ],
        "startTime": 1512711000000
      }
    ],
    "d1177368-2310-11e8-9e2a-9b860a0d9039": [
      {
        "duration": 195000,
        "pages": [
          "/pages/a-big-river",
          "/pages/a-big-river",
          "/pages/a-small-dog"
        ],
        "startTime": 1512754436000
      }
    ]
  }
}
Notes
Timestamps are in milliseconds.
Events may not be given in chronological order.
The visitors in sessionsByUser can be in any order.
For each visitor, sessions should be in chronological order.
For each session, the URLs should be sorted in chronological order.
For a session with only one event, the duration should be zero.
Each event in a session (except the first event) must have occurred within 10 minutes of the preceding event in the session. This means that there can be more than 10 minutes between the first and the last event in the session.
Note: I am not going to show you how to produce the exact output format you need, but I will show you the general approach to solving this problem; hopefully you can figure out how to adapt it to your requirements.
You are going to want to start by grouping the events by user:
events_by_user = events.group_by { |event| event[:visitorId] }
Now, for each user, you need to sort their events by timestamp:
events_by_user.transform_values! do |events|
  events.sort_by { |event| event[:timestamp] }
end
Now, loop through each user's events and compare them in sequential order, putting them into groups based on timestamp proximity:
session_length = 10 * 60 * 1000 # 10 minutes, in milliseconds (the timestamps are in ms)

sessions = {}
events_by_user.each do |visitor_id, events|
  sessions[visitor_id] = []
  events.each do |event|
    if sessions[visitor_id].empty?
      sessions[visitor_id].push([event])
    else
      last_session = sessions[visitor_id].last
      last_timestamp = last_session.last[:timestamp]
      if (event[:timestamp] - last_timestamp) <= session_length
        last_session.push(event)
      else
        sessions[visitor_id].push([event])
      end
    end
  end
end
Now sessions will contain a hash like this:
{
  <visitor_id> => [
    [<list of events in session 1>],
    [<list of events in session 2>]
  ],
  ...
}
You can then extract the start time (the first event's timestamp), the total duration (last timestamp minus first), the page list, and so on from each session group.
Group the "events" array by the property "visitorId" first. You can use in JavaScript the
Array.prototype.reduce(): The reduce() method executes a user-supplied
“reducer” callback function on each element of the array, in order,
passing in the return value from the calculation on the preceding
element. The final result of running the reducer across all elements
of the array is a single value. So, set {} as the initial value for the reducer function, at each pass use the visitorId as key to an array that will hold events of the visitor, push the current event to respective position in the array.
The a variable || [] statement is used to make an 'undefined'
value as [], empty array.
Now sort the events array that is built just now by timestamp in ascending order.
Loop through it and compare timestamps pairwise (previous and current array element), if difference is below given session length(eg 10 min), merge the two sessions and push it in an array with 'visitorId' as key. Use a variable to keep track of the index of the session to be merged together.
let data = require('d:\\dataset.json');

// Group by visitorId
let sessions = {
  sessionsByUser: data.events.reduce(function (events, event) {
    (events[event['visitorId']] = events[event['visitorId']] || []).push(event);
    return events;
  }, {})
};

// Sort each visitor's events by timestamp, ascending (sort() works in place)
for (let key in sessions.sessionsByUser) {
  sessions.sessionsByUser[key].sort((a, b) => a.timestamp - b.timestamp);
}

let userSessions = {};
for (let key in sessions.sessionsByUser) {
  let events = sessions.sessionsByUser[key];
  let lastIndex = 0;
  for (let i = 0; i < events.length; i++) {
    if (i == 0) {
      userSessions[key] = [{
        duration: 0,
        pages: [events[i].url],
        startTime: events[i].timestamp
      }];
    } else {
      // Check difference: "no more than 10 minutes" between consecutive events (600000 ms)
      if (events[i].timestamp - events[i - 1].timestamp <= 600000) {
        let session = userSessions[key][lastIndex];
        session.duration += (events[i].timestamp - events[i - 1].timestamp);
        session.pages.push(events[i].url);
      } else {
        userSessions[key].push({
          duration: 0,
          pages: [events[i].url],
          startTime: events[i].timestamp
        });
        lastIndex++;
      }
    }
  }
}

let soln = {
  sessionsByUser: userSessions
};
console.log(JSON.stringify(soln));
Run it with `node <filename>.js` from the file's directory (change the dataset path first; Node.js must be installed on the system).

IndexedDB "updates" every browser restart and erases data

I wrote a Firefox WebExtension that downloads data files from a website and uses IndexedDB to store/update the data. The .sqlite file that is created is ~2 GB in size. Whenever I restart Firefox, the extension runs the onupgradeneeded handler, even though I always open version 1. Since I create the database's object stores and indexes in that handler, all my data ends up getting deleted.
The only time this doesn't happen is when I close Firefox while data is being downloaded or stored. The next time I start Firefox, the handler does not run (as should be the case), and the extension continues updating the database as it was programmed to do.
I installed the SQLite Manager extension in hopes of identifying something wrong with the database, but nothing was obvious to me.
Here is part of my background script:
init().then(fetchData).then(addData).catch(dberror);

function init() {
  req = indexedDB.open("db", 1);
  req.onupgradeneeded = e => {
    var name;
    var key;
    console.log("Upgrading database...", e.oldVersion, e.newVersion);
    db = e.currentTarget.result;
    var store = db.createObjectStore("db", { keyPath: "KEY" });
    db.createObjectStore("version", { keyPath: "version" });
    for (name in indexes) {
      key = ...
      store.createIndex(name, key);
    }
  };
  return new Promise((resolve, reject) => {
    req.onsuccess = e => {
      db = e.currentTarget.result;
      db.onerror = dberror;
      var cursor = db.transaction("MECs").objectStore("MECs").index("STATUS_DATE").openCursor(null, 'prev');
      cursor.onsuccess = e => {
        if (e.target.result) {
          lastMod = e.target.result.key;
          fileYear = lastMod.getFullYear();
        }
        else lastMod = new Date(startingfileYear, 0);
        resolve(lastMod);
      };
      cursor.onerror = reject;
    };
    req.onerror = e => {
      dberror(e);
      reject(e);
    };
  });
}

function fetchData(param) {
  // Get data based on the param and return it
  return fetchFile(filename);
}

function addData(data) {
  var trans = db.transaction("db", "readwrite");
  var store = trans.objectStore("db");
  var req;
  var n = 0;
  var data2 = [];
  var addPromise;
  trans.onerror = event => console.log("Error! Error! ", event.target.error);
  trans.onabort = event => console.log("Abort! Abort! ", event.target.error);
  data.forEach((row, index) => {
    // process data here
    data2 = ...
  });
  // I'm storing one row at a time because the transaction is failing when I queue too many rows.
  (function storeRegData(n) {
    var row = data[n];
    if (!row) return;
    req = store.put(row);
    req.onsuccess = event => {
      numUpdated++;
      storeRegData(++n);
    };
    req.onabort = event => console.log("Abort! Abort! ", event.target.error);
    req.onerror = event => console.log("Error! Error! ", event.target.error);
  })(0);
  addPromise = fetchData(data2).then(
    response => {
      var trans2 = db.transaction("db", "readwrite");
      var store2 = trans2.objectStore("db");
      var req2;
      response.forEach(row => {
        req2 = store2.put(row);
        req2.onsuccess = event => numUpdated++;
        req2.onerror = console.log;
      });
      return new Promise((resolve, reject) => trans2.oncomplete = e => resolve(response));
    },
    console.log);
  return new Promise((resolve, reject) => trans.oncomplete = e => {
    if (noMoreData)
      resolve(addPromise);
    else if (moreData)
      resolve(addPromise.then(fetchData).then(addData));
  });
}
And here is my manifest:
{
  "author": "Name",
  "manifest_version": 2,
  "name": "Extension",
  "description": "Extension",
  "version": "3.0",
  "applications": {
    "gecko": {
      "strict_min_version": "50.0",
      "id": "myID",
      "update_url": "https://update.me"
    }
  },
  "background": {
    "scripts": [
      "js/background.js"
    ]
  },
  "content_scripts": [
    {
      "matches": [ "https://match.me/*" ],
      "js": [ "script.js" ],
      "css": [ "style.css" ]
    }
  ],
  "icons": {
    "48": "icon.png"
  },
  "options_ui": {
    "page": "options.html"
  },
  "page_action": {
    "browser_style": true,
    "default_icon": {
      "19": "icon-19.png",
      "38": "icon-38.png"
    },
    "default_title": "Extension",
    "default_popup": "popup.html"
  },
  "permissions": [
    "https://web.address/*",
    "downloads",
    "notifications",
    "storage",
    "tabs",
    "webRequest",
    "webNavigation"
  ],
  "web_accessible_resources": [
    "pictures.png"
  ]
}
Why does Firefox think the database is at version 0 when I restart the browser? I can use the stored data after I download it, so why is it wiped on every restart? I could work around this by only creating the stores and indexes on extension installation or update, but that doesn't address the actual issue.
UPDATE: I tried the following, to no avail:
Closing the database and re-opening it after storing each data file
Creating a new object store for each data file
UPDATE 2: It appears this is related to a storage issue. Apparently, 2 GB is the storage limit for non-persistent storage. In Firefox you can bypass this by making the storage persistent with the following command:
indexedDB.open("db", { version: 1, storage: "persistent" })
See the bugzilla report here.
Unfortunately, when run from a background page, the popup asking for confirmation is not handled, so you can never acknowledge it. Supposedly, when Firefox 56 comes out, you'll be able to use the "unlimitedStorage" permission, which will bypass the confirmation popup, so it should work from the background page.
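Presumably, enabling that would just mean adding the permission to the manifest's existing "permissions" array, something like this (unverified, since that Firefox version isn't out yet):
  "permissions": [
    "unlimitedStorage",
    "https://web.address/*",
    ...
  ],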
Update 3: It looks like the limit is actually ~1.5 GB. I just spent over a week re-coding the extension to create and use a separate database for each year of data, making each database no larger than 150 MB. Still, onupgradeneeded executes when I restart the browser and wipes all my data. If, however, I limit the total amount of data across all the databases to the limit above, it works. Unfortunately, I'm still in the same boat.
Does no one have any ideas?
As I mentioned in the updates to my question, there appears to be a limit of ~1.5 GB on the "default" storage for IndexedDB. Changing the storage to "persistent" removes that limit. Because persistent storage currently requires user confirmation, however, the database has to be opened from a window that can handle a UI response.
This can be done from the background script by creating a new window with browser.windows.create() and opening the database from there. Security restrictions prevent inline scripts from running in the new page, so I had to link to local JavaScript files instead (i.e. <script src="db.js"></script>). I think you can also change the content security policy with a manifest instruction, but I didn't do that.
Hopefully, the unlimitedStorage permission will be supported in Firefox 56, which will remove the popup, allowing a persistent database to be created/accessed directly from the background script.
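For reference, here is a minimal sketch of that window-based workaround, assuming a helper page named db.html that loads db.js (both file names are hypothetical, for illustration only):
// background.js -- open a page that can host the confirmation prompt
browser.windows.create({
  url: browser.runtime.getURL("db.html"),
  type: "popup"
});

// db.js -- loaded by db.html via <script src="db.js"></script>
// The { storage: "persistent" } options object is the Gecko-specific form shown above;
// the user must accept the persistent-storage prompt shown for this page.
var req = indexedDB.open("db", { version: 1, storage: "persistent" });
req.onupgradeneeded = e => {
  var db = e.target.result;
  db.createObjectStore("db", { keyPath: "KEY" });
};
req.onsuccess = e => {
  console.log("Persistent database opened");
  // hand the work off from here, e.g. via runtime messaging to the background script
};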

Iterate and search a JSON array for an element

I have a JSON array that looks like this:
response = {
  "items" => [
    {
      "tags" => ["random"],
      "timestamp" => 12345,
      "storage" => {
        "url" => "https://example.com/example",
        "key" => "mykeys"
      },
      "envelope" => {},
      "log-level" => "info",
      "id" => "random_id_test_1",
      "campaigns" => [],
      "user-variables" => {},
      "flags" => {
        "is-test-mode" => false
      },
      "message" => {
        "headers" => {
          "to" => "random#example.com",
          "message-id" => "foobar#example.com",
          "from" => "noreply#example.com",
          "subject" => "new subject"
        },
        "attachments" => [],
        "recipients" => ["result#example.com"],
        "size" => 4444
      },
      "event" => "stored"
    },
    {
      "tags" => ["flowerPower"],
      "timestamp" => 567890,
      "storage" => {
        "url" => "https://yahoo.com",
        "key" => "some_really_cool_keys_go_here"
      },
      "envelope" => {},
      "log-level" => "info",
      "id" => "some_really_cool_ids_go_here",
      "campaigns" => [],
      "user-variables" => {},
      "flags" => {
        "is-test-mode" => false
      },
      "message" => {
        "headers" => {
          "to" => "another_great#example.com",
          "message-id" => "email_id#example.com",
          "from" => "from#example.com",
          "subject" => "email_looks_good"
        },
        "attachments" => [],
        "recipients" => ["example#example.com"],
        "size" => 2222
      },
      "event" => "stored"
    }
  ]
}
I am trying to obtain the "storage" "url" based on the "to" email. How do I iterate through this array, where x is just the element's index:
response['items'][x]["message"]["headers"]["to"]
Once I find the specific email I need, it should stop and return the value of x, the element number. I was then going to use that value of x to call:
response['items'][x]['storage']['url']
which returns the URL string.
I thought about doing this, but there's gotta be a better way:
x = 0
user_email = "another_great#example.com"
while user_email != response['items'][x]["message"]["headers"]["to"] do
  x += 1
  value = x
  puts value
end
target =
  response['items'].detect do |i|
    i['message']['headers']['to'] == 'another_great#example.com'
  end
then
target['storage']['url']
This is another option: create a Hash keyed by the "to" email, and then fetch the required information from it like this:
email_hash = Hash.new
response["items"].each do |i|
  email_hash[i["message"]["headers"]["to"]] = i
end
Now if you want to fetch the "storage" "url", simply do:
user_email = "another_great#example.com"
puts email_hash[user_email]["storage"]["url"] if email_hash[user_email]
#=> "https://yahoo.com"
You can use it as @Satoru suggested. As a further suggestion: if your use case involves more complex queries on the JSON data, you can store the data in MongoDB and elegantly query anything.

Parsing Multilevel JavaScript Objects in Grails 2.1

I am trying to send data to my controller from an AJAX call; the data needs to have multiple levels, something like this:
{
  "lob": {
    "TESTING": [
      { "name": "color", "value": "1" },
      { "name": "time", "value": "2" },
      { "name": "jeremy", "value": "3" },
      { "name": "fourtytwo", "value": "4" },
      { "name": "owl", "value": "5" },
      { "name": "why", "value": "6" },
      { "name": "derp", "value": "7" },
      { "name": "where", "value": "8" }
    ]
  }
}
but when it arrives in Grails, this is what I get when I print out the params:
[lob[TESTING][4][value]:5,
lob[TESTING][3][name]:fourtytwo,
lob[TESTING][6][name]:derp,
lob[TESTING][5][name]:why,
lob[TESTING][3][value]:4,
lob[TESTING][1][value]:2,
lob[TESTING][2][value]:3,
lob[TESTING][5][value]:6,
lob[TESTING][1][name]:time,
lob[TESTING][0][value]:1,
lob[TESTING][6][value]:7,
lob[TESTING][0][name]:color,
lob[TESTING][7][value]:8,
lob[TESTING][4][name]:owl,
lob[TESTING][7][name]:where,
lob[TESTING][2][name]:jeremy,
action:save,
controller:LOB]
The data I am sending from JavaScript:
{
  lob: {
    TESTING: $form.serializeArray()
  }
}
I have read multiple forum posts saying to use JSON.parse or request.JSON, but these solutions do not seem to fix my problem. I want to be able to access the data like:
params.lob.testing.each { a ->
  println a
}
I will be doing a lot more than just that, but it would be nice to access the data in that fashion. I am using Grails 2.1 and jQuery 1.7.2.
Actually, Grails makes this very easy. I took your test data and ran it through the following:
import grails.converters.JSON

class LobController {
    def save = {
        def json = request.JSON
        json.lob.TESTING.each { item ->
            println "Name: ${item.name} - Value: ${item.value}"
        }
        // render something back if you need to here
    }
}
And it outputs:
Name: color - Value: 1
Name: time - Value: 2
Name: jeremy - Value: 3
Name: fourtytwo - Value: 4
Name: owl - Value: 5
Name: why - Value: 6
Name: derp - Value: 7
Name: where - Value: 8
I created a UrlMapping entry like this (you probably already have this):
"/myApi"(controller: "lob", parseRequest: true) {
action = [POST: "save"]
}
The parseRequest: true will automatically parse the incoming JSON.
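Note that for request.JSON to be populated, the request body must actually be JSON rather than form-encoded parameters (form encoding is what produces the flattened lob[TESTING][0][name]-style params you saw). A minimal sketch of the jQuery call, assuming the "/myApi" mapping above:
$.ajax({
  url: "/myApi",
  type: "POST",
  contentType: "application/json; charset=utf-8",
  // JSON.stringify sends the nested structure as a JSON body,
  // which Grails then parses into request.JSON
  data: JSON.stringify({
    lob: {
      TESTING: $form.serializeArray()
    }
  }),
  success: function (resp) {
    console.log(resp);
  }
});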
I found a `serializeJSON` function that might replace serializeArray() to format the form data for JSON. The following was provided by Arjen Oosterkamp on the jQuery serializeArray page:
(function ($) {
  $.fn.serializeJSON = function () {
    var json = {};
    jQuery.map($(this).serializeArray(), function (n, i) {
      json[n['name']] = n['value'];
    });
    return json;
  };
})(jQuery);
Simply use it as $('form').serializeJSON(); (note that it returns a single object of name/value pairs rather than an array of {name, value} entries, so code that consumes it would change accordingly).
All credit for that function goes to Arjen Oosterkamp...

Map reduce to count tags

I am developing a web app using CodeIgniter and MongoDB, and I am trying to get map/reduce to work.
I have a file document with the structure below. I would like to run a map/reduce that
checks how many times each tag is used and outputs the result to the collection files.tags.
{
  "_id": {
    "$id": "4f26f21f09ab66c1030d0000e"
  },
  "basic": {
    "name": "The filename"
  },
  "tags": [
    "lorry",
    "house",
    "car",
    "bicycle"
  ],
  "updated_at": "2012-02-09 11:08:03"
}
I tried this map/reduce command, but it does not count each individual tag:
$map = new MongoCode("function() {
    emit({tags: this.tags}, {count: 1});
}");
$reduce = new MongoCode("function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v['count'];
    });
    return {count: count};
}");
$this->mongo_db->command(array(
    "mapreduce" => "files",
    "map" => $map,
    "reduce" => $reduce,
    "out" => "files.tags"
));
Change your map function to:
function map() {
  if (!this.tags) return;
  this.tags.forEach(function(tag) {
    emit(tag, {count: 1});
  });
}
Yeah, your map/reduce simply calculates the total count of tags. The MongoDB cookbook has the example you are looking for. You have to emit each tag instead of the entire collection of tags:
map = function() {
  if (!this.tags) {
    return;
  }
  for (index in this.tags) {
    emit(this.tags[index], 1);
  }
}
You'll need to call emit once for each tag in the input documents. The MongoDB documentation, for example, says:
A map function calls emit(key, value) any number of times to feed data to the reducer. In most cases you will emit once per input document, but in some cases, such as counting tags, a given document may have one, many, or even zero tags.
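Putting it together, the reduce function has to match the shape the map emits. A complete sketch in the mongo shell, using the first suggested map above (which emits {count: 1} per tag, so it pairs with a reduce like the one from the question):
// Map: emit each tag once, with a count of 1
var map = function () {
  if (!this.tags) return;
  this.tags.forEach(function (tag) {
    emit(tag, { count: 1 });
  });
};

// Reduce: sum the per-tag counts; the return shape matches the emitted values
var reduce = function (key, values) {
  var count = 0;
  values.forEach(function (v) {
    count += v.count;
  });
  return { count: count };
};

db.files.mapReduce(map, reduce, { out: "files.tags" });

// Each result document looks like { _id: "car", value: { count: 1 } }
db.getCollection("files.tags").find();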
