You’re in charge of implementing a new analytics “sessions” view. You’re given a set of data that consists of individual web page visits, along with a visitorId which is generated by a tracking cookie that uniquely identifies each visitor. From this data we need to generate a list of sessions for each visitor.
The data set looks like this:
"events": [
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754583000
},
{
"url": "/pages/a-small-dog",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754631000
},
{
"url": "/pages/a-big-talk",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709065294
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512711000000
},
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754436000
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709024000
}
]
}
Given this input data, we want to create a set of sessions from the incoming events. A session is defined as a group of events from a single visitor with no more than 10 minutes between each consecutive event. A visitor can have multiple sessions. So given the example input data above, we would expect output which looks like:
{
"sessionsByUser": {
"f877b96c-9969-4abc-bbe2-54b17d030f8b": [
{
"duration": 41294,
"pages": [
"/pages/a-sad-story",
"/pages/a-big-talk"
],
"startTime": 1512709024000
},
{
"duration": 0,
"pages": [
"/pages/a-sad-story"
],
"startTime": 1512711000000
}
],
"d1177368-2310-11e8-9e2a-9b860a0d9039": [
{
"duration": 195000,
"pages": [
"/pages/a-big-river",
"/pages/a-big-river",
"/pages/a-small-dog"
],
"startTime": 1512754436000
}
]
}
}
Notes
Timestamps are in milliseconds.
Events may not be given in chronological order.
The visitors in sessionsByUser can be in any order.
For each visitor, sessions should be in chronological order.
For each session, the URLs should be sorted in chronological order.
For a session with only one event, the duration should be zero (see the worked example after these notes).
Each event in a session (except the first event) must have occurred
within 10 minutes of the preceding event in the session. This means
that there can be more than 10 minutes between the first and the last
event in the session.
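For example, in the expected output above, the single session for visitor d1177368-2310-11e8-9e2a-9b860a0d9039 starts at 1512754436000 and its last event is at 1512754631000, so its duration is 1512754631000 - 1512754436000 = 195000 ms, while the one-event session for the other visitor has a duration of 0.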
Note: I'm not going to show you how to produce the exact output format you need, but I will show the general approach for solving this problem; hopefully you can adapt it to your requirements.
You'll want to start by grouping the events by user:
events_by_user = events.group_by { |event| event[:visitorId] }
Now, for each user, you need to sort their events by timestamp:
events_by_user.transform_values! do |events|
  events.sort_by { |event| event[:timestamp] }
end
Now, you need to loop through each user's events in sequential order, starting a new group whenever the gap between consecutive timestamps exceeds the session length:
session_length = 10 * 60 * 1000 # 10 minutes, in milliseconds (the timestamps are in ms)
sessions = {}
events_by_user.each do |visitor_id, events|
  sessions[visitor_id] = []
  events.each do |event|
    if sessions[visitor_id].empty?
      sessions[visitor_id].push([event])
    else
      last_session = sessions[visitor_id].last
      last_timestamp = last_session.last[:timestamp]
      if (event[:timestamp] - last_timestamp) <= session_length
        last_session.push(event)
      else
        sessions[visitor_id].push([event])
      end
    end
  end
end
Now sessions will contain a hash like this:
{
<visitor_id>: [
[<list of events in session 1>],
[<list of events in session 2>]
],
etc.
}
You can then extract the start time, total duration, etc. for each group.
Group the "events" array by the "visitorId" property first. In JavaScript you can use Array.prototype.reduce(): the reduce() method executes a user-supplied "reducer" callback function on each element of the array, in order, passing in the return value from the calculation on the preceding element; the final result of running the reducer across all elements is a single value. So, set {} as the initial value for the reducer, at each pass use the visitorId as the key of an array that holds that visitor's events, and push the current event onto that array.
The "variable || []" expression is used to turn an undefined value into an empty array, so there is always an array to push onto.
Now sort each visitor's events array by timestamp in ascending order.
Then loop through it and compare timestamps pairwise (previous and current element): if the difference is within the given session length (e.g. 10 minutes), add the event to the current session for that visitorId, otherwise start a new session. Use a variable to keep track of the index of the session currently being extended.
let data = require('d:\\dataset.json');

// Group by visitorId
let sessions = {
  sessionsByUser: data.events.reduce(function (events, session) {
    (events[session['visitorId']] = events[session['visitorId']] || []).push(session);
    return events;
  }, {})
};

// Sort each visitor's events by timestamp, ascending (sort() works in place)
for (let key in sessions.sessionsByUser) {
  sessions.sessionsByUser[key].sort((a, b) => a.timestamp - b.timestamp);
}

let userSessions = {};
for (let key in sessions.sessionsByUser) {
  let events = sessions.sessionsByUser[key];
  let lastIndex = 0;
  for (let i = 0; i < events.length; i++) {
    if (i == 0) {
      // The first event always starts a new session
      userSessions[key] = [{
        duration: 0,
        pages: [events[i].url],
        startTime: events[i].timestamp
      }];
    } else {
      // "No more than 10 minutes" between consecutive events, so use <=
      if (events[i].timestamp - events[i - 1].timestamp <= 600000) {
        let session = userSessions[key][lastIndex];
        session.duration += (events[i].timestamp - events[i - 1].timestamp);
        session.pages.push(events[i].url);
      } else {
        // Gap is too large: start a new session for this visitor
        userSessions[key].push({
          duration: 0,
          pages: [events[i].url],
          startTime: events[i].timestamp
        });
        lastIndex++;
      }
    }
  }
}

let soln = {
  sessionsByUser: userSessions
};
console.log(JSON.stringify(soln));
Run the command in cmd: node <filename>.js
Change the dataset path and navigate to the directory of the file in cmd. Node must be installed on the system.
Related
I have an array of objects from which I need to pass each object separately into an async method (the process behind it is handled with a Promise and then converted back to an Observable via Observable.fromPromise(...); this approach is needed because the same method is also used when just a single object is passed; the process saves objects into a database). For example, this is an array of objects:
[
{
"name": "John",
...
},
{
"name": "Anna",
...
},
{
"name": "Joe",,
...
},
{
"name": "Alexandra",
...
},
...
]
Now I have a method called insert which inserts an object into the database. The store method on the database instance returns the newly created id. At the end the initial object is copied and given its new id:
insert(user: User): Observable<User> {
return Observable.fromPromise(this.database.store(user)).map(
id => {
let storedUser = Object.assign({}, user);
storedUser.id = id;
return storedUser;
}
);
}
This works well when I insert a single object. However, I would like to add support for inserting multiple objects by just calling the single-insert method for each one. Currently this is what I have, but it doesn't work:
insertAll(users: User[]): Observable<User[]> {
return Observable.forkJoin(
users.map(user => this.insert(user))
);
}
The insertAll method does insert the users as expected (unless something else filled the database with those users), but I don't get any response back from it. While debugging it seems that forkJoin gets a response only from the first mapped user and the others are ignored. Subscribing to insertAll does nothing, and there is no error either via catch on insertAll or via the second parameter of subscribe.
So I'm looking for a solution where the Observable (in insertAll) would emit back an array of new objects with the users in this form:
[
{
"id": 1,
"name": "John",
...
},
{
"id": 2,
"name": "Anna",
...
},
{
"id": 3,
"name": "Joe",,
...
},
{
"id": 4,
"name": "Alexandra",
...
},
...
]
I would be very happy for any suggestion pointing in the right direction. Thanks in advance!
To convert from array to observable you can use Rx.Observable.from(array).
To convert from observable to array, use obs.toArray(). Notice this does return an observable of an array, so you still need to .subscribe(arr => ...) to get it out.
That said, your code with forkJoin does look correct. But if you do want to try from, write the code like this:
insertAll(users: User[]): Observable<User[]> {
return Observable.from(users)
.mergeMap(user => this.insert(user))
.toArray();
}
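For completeness, here is a small usage sketch of the toArray variant (usersService and users are just placeholder names for your service instance and input array): the subscriber receives a single array once every insert has completed.
usersService.insertAll(users).subscribe(storedUsers => {
  // storedUsers is a User[] with the ids assigned by the database
  console.log(storedUsers.map(u => u.id));
});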
Another, more Rx-like way to do this is to emit values as they complete, rather than waiting for all of them like forkJoin or toArray do. We can just omit the toArray from the previous example:
insertAll(users: User[]): Observable<User> {
return Observable.from(users)
.mergeMap(user => this.insert(user));
}
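With this variant the subscriber gets each stored user as soon as its insert finishes, and the complete callback tells you when they are all done (again, usersService and users are placeholder names):
usersService.insertAll(users).subscribe(
  storedUser => console.log('saved', storedUser.id),   // fires once per user
  err => console.error(err),
  () => console.log('all inserts finished')
);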
As @cartant mentioned, the problem might not be in Rx; it might be that your database does not support multiple connections. In that case, you can replace the mergeMap with concatMap to make Rx send only 1 concurrent request:
insertAll(users: User[]): Observable<User[]> {
return Observable.from(users)
.concatMap(user => this.insert(user))
.toArray(); // still optional
}
Given an array of objects which contain a message payload and time parameter like this:
var data = [
{ message:"Deliver me after 1000ms", time:1000 },
{ message:"Deliver me after 2000ms", time:2000 },
{ message:"Deliver me after 3000ms", time:3000 }
];
I would like to create an observable sequence which returns the message part of each element of the array and then waits for the corresponding amount of time specified in the object. I'm open to reorganising the data structure of the array if that is necessary.
I've seen Observable.delay but can't see how it could be used with a dynamic value in this way. I'm working in RxJS 5.
You could use delayWhen:
var data = [
{ message:"Deliver me after 1000ms", time:1000 },
{ message:"Deliver me after 2000ms", time:2000 },
{ message:"Deliver me after 3000ms", time:3000 }
];
Rx.Observable
.from(data)
.delayWhen(datum => Rx.Observable.timer(datum.time))
.do(datum => console.log(datum.message))
.subscribe();
<script src="https://unpkg.com/@reactivex/rxjs@5.0.3/dist/global/Rx.js"></script>
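Note that delayWhen in the snippet above delays each item relative to when it was emitted by from(data), so the three timers run in parallel. If the intention is for each delay to elapse before the next message is emitted, a sequential sketch using the same RxJS 5 API could use concatMap with delay:
Rx.Observable
  .from(data)
  .concatMap(datum => Rx.Observable.of(datum).delay(datum.time))
  .do(datum => console.log(datum.message))
  .subscribe();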
I have an Elasticsearch index which holds documents for user state history. The data looks like this:
{
"session_id": "yunus",
"state_name": "start",
"entry_time": "2016-11-09 15:27:03"
},
{
"session_id": "yunus",
"state_name": "end",
"entry_time": "2016-11-09 16:30:00"
},
{
"session_id": "can",
"state_name": "start",
"entry_time": "2016-11-09 12:01:00"
},
{
"session_id": "rick",
"state_name": "start",
"entry_time": "2016-11-09 09:00:00"
},
{
"session_id": "rick",
"state_name": "end",
"entry_time": "2016-11-10 10:00:00"
}
I want to aggregate by state name with a date histogram, but counting only each session's latest state as of that time. So the result could be:
2016-11-08
  start = 0
  end = 0
2016-11-09
  start = 2
  end = 1
2016-11-10
  start = 1
  end = 2
The plan is to generate a grouped bar chart over a timeline to show how states change over time.
I tried several things like pipeline aggregations and top hits but couldn't make any progress.
Any help appreciated.
For anyone interested, I solved it with Spark. I used elasticsearch-spark to read from Elasticsearch and then write back to Elasticsearch.
Here is the read from ES as an RDD:
val allData = sc.esRDD(s"states_${id}/log", query)
Then I group by session id and reduce by timestamp to keep only the latest state of each session:
val latestStates = allData
  .groupBy(k => k._2.get("session_id").get)              // group events by session_id
  .map(k => (k._2).reduceLeft((d1, d2) => {
    // keep whichever of the two events has the larger timestamp
    d1._2.get("timestamp").get.asInstanceOf[Long] > d2._2.get("timestamp").get.asInstanceOf[Long] match {
      case true => d1
      case false => d2
    }
  }))
  .map(_._2)
Once I have the latest state of each session, I filter out the exit states and then count by value:
val stateSummary = latestStates
  .filter(s => s.isDefinedAt("state_id") && s("state_id").asInstanceOf[Long] != -1)
  .map(s => (s("state_id"), s("state_name")))
  .countByValue()
  .map(d => Map(
    "state_id" -> d._1._1.asInstanceOf[Long],
    "state_name" -> d._1._2.asInstanceOf[String],
    "count" -> d._2))
  .toList
Now we have the current number of sessions in each state ("current" is configurable, so we can set it for a specific time). The only thing left is to write back to Elasticsearch:
sc.makeRDD(Seq(finalElasticDoc)).saveToEs(s"states_${id}/analytic_daily")
I have two databases with similar data (organized differently) and I've created a view for each one that returns the same response. I have noticed that the response time of the query differs even though both return the same result: one takes about 3182 ms and the other about 217 ms, over 5 queries.
I query both using:
curl -X GET ...db1/_design/query1/view/q1?group=true and
curl -X GET ...db2/_design/query1/view/q1?group=true.
I have checked the data sizes of the design documents using curl -X GET ...db1/_design/query1/_info. The design data size of the first is 146073878 bytes and the second is 3739596 bytes.
I thought both should have the same size, because they return the same view and I haven't used any filters, both views being equal.
Can somebody explain why the same view created on different databases has different sizes?
My data is organized using two different roots, with the same data, changing only the root:
Customer data in the root:
{
"c_customer_sk": 65836,
"c_first_name": "Frank",
"c_last_name": "White",
"store_sales": [
{
"ss_sales_price": 20.24,
"ss_ext_sales_price": 1012,
"ss_coupon_amt": 0,
"date": [
{
"d_month_seq": 1187,
"d_year": 1998
}
],
"item": [
{
"i_item_sk": 10454,
"i_item_id": "AAAAAAAAGNICAAAA",
"i_item_desc": "Results highlight as patterns; so right years show. Sometimes suitable lips move with the critics. English, old mothers ought to lift now perhaps future managers. Active, single ch",
"i_current_price": 2.88,
"i_class": "romance",
"i_category_id": 9,
"i_category": "Books"
}
]
},
{
"ss_sales_price": 225,
"ss_ext_sales_price": 1023,
"ss_coupon_amt": 0,...
View function for customer in the root:
function(doc) {
  for each (store_sales in doc.store_sales) {
    var s = store_sales.ss_ext_sales_price;
    if (s == null) { s = 0; }
    for each (item in store_sales.item) {
      var item_id = item.i_item_id;
      var item_desc = item.i_item_desc;
      var category = item.i_category;
      var item_class = item.i_class;   // "class" is a reserved word in JavaScript
      var price = item.i_current_price;
    }
    if (category == "Music" || category == "Home" || category == "Sports") {
      for each (date in store_sales.date) {
        var g = date.d_month_seq;
      }
      if (g >= 1200 && g <= 1211) {
        emit({item_id: item_id, item_desc: item_desc, category: category, class: item_class, current_price: price}, s);
      }
    }
  }
}
reduce:_sum
Example of answer:
key:
{"item_id": "AAAAAAAAAAAEAAAA", "item_desc": "Rates expect probably necessary events. Circumstan", "category": "Sports", "class": "optics", "current_price": 3.99}
Value:
106079.49999999999
Item data in the root:
{
"i_item_sk": 10454,
"i_item_id": "AAAAAAAAGNICAAAA",
"i_item_desc": "Results highlight as patterns; so right years show. Sometimes suitable lips move with the critics. English, old mothers ought to lift now perhaps future managers. Active, single ch",
"i_current_price": 2.88,
"i_class": "romance",
"i_category_id": 9,
"i_category": "Books",
"store_sales": [
{
"ss_sales_price": 20.24,
"ss_ext_sales_price": 1012,
"ss_coupon_amt": 0,
"date": [
{
"d_month_seq": 1187,
"d_year": 1998
}
],
"customer": [
{
"c_customer_sk": 65836,
"c_first_name": "Frank",
"c_last_name": "White",
}
]
},
{
"ss_sales_price": 225,
"ss_ext_sales_price": 1023,
"ss_coupon_amt": 0,...
View for item on root:
function(doc) {
  var item_id = doc.i_item_id;
  var item_desc = doc.i_item_desc;
  var category = doc.i_category;
  var item_class = doc.i_class;   // "class" is a reserved word in JavaScript
  var price = doc.i_current_price;
  if (category == "Music" || category == "Home" || category == "Sports") {
    for each (store_sales in doc.store_sales) {
      var s = store_sales.ss_ext_sales_price;
      if (s == null) { s = 0; }
      for each (date in store_sales.date) {
        var g = date.d_month_seq;
      }
      if (g >= 1200 && g <= 1211) {
        emit({item_id: item_id, item_desc: item_desc, category: category, class: item_class, current_price: price}, s);
      }
    }
  }
}
reduce:_sum
Both return the same answer.
I have run cleanup and compaction on the design documents; the database with the item data at the root responds much faster and its data size is smaller too, but I don't know why.
Can someone explain this to me?
Could it be a difference in database compaction? When you replicate an existing database to an empty one, only the last revision of each document is sent to the new one, making it potentially much lighter. The same applies to views.
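One hedged way to test that theory is to replicate db1 into a fresh database and compare the view sizes again; only the latest revisions get copied. This sketch uses CouchDB's /_replicate endpoint via Node's built-in fetch; the host, target database name, and the missing credentials are assumptions:
// Hypothetical host and target database; add authentication as needed.
fetch('http://localhost:5984/_replicate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    source: 'db1',
    target: 'db1_fresh',
    create_target: true   // create the target database if it does not exist
  })
})
  .then(res => res.json())
  .then(result => console.log(result));
After the replication finishes, rebuild and query the view on db1_fresh and compare its _info sizes with the original.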
I am developing a web app using CodeIgniter and MongoDB.
I am trying to get map-reduce to work.
I have a file document with the structure below. I would like to run a map-reduce that
checks how many times each tag is used and writes the output to the collection files.tags.
{
"_id": {
"$id": "4f26f21f09ab66c1030d0000e"
},
"basic": {
"name": "The filename"
},
"tags": [
"lorry",
"house",
"car",
"bicycle"
],
"updated_at": "2012-02-09 11:08:03"
}
I tried this map reduce command but it does not count each individual tag:
$map = new MongoCode ("function() {
emit({tags: this.tags}, {count: 1});
}");
$reduce = new MongoCode ("function( key , values ) {
var count = 0;
values.forEach(function(v) {
count += v['count'];
});
return {count: count};
}");
$this->mongo_db->command (array (
"mapreduce" => "files",
"map" => $map,
"reduce" => $reduce,
"out" => "files.tags"
)
);
Change your Map function to:
function map() {
  if (!this.tags) return;
  this.tags.forEach(function(tag) {
    emit(tag, {count: 1});
  });
}
Yeah, your map/reduce simply calculates the total count per whole tags array.
The MongoDB cookbook has the example you are looking for.
You have to emit each tag instead of the entire collection of tags:
map = function() {
  if (!this.tags) {
    return;
  }
  for (index in this.tags) {
    emit(this.tags[index], 1);
  }
}
You'll need to call emit once for each tag in the input documents. The MongoDB documentation, for example, says:
A map function calls emit(key,value) any
number of times to feed data to the reducer. In most cases you will
emit once per input document, but in some cases such as counting tags,
a given document may have one, many, or even zero tags.
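As an illustrative sketch (not the exact CodeIgniter/MongoCode call), the corrected map can be paired with a matching reduce and run from the mongo shell; the collection name files and the output collection files.tags come from the question:
var map = function () {
  if (!this.tags) return;
  this.tags.forEach(function (tag) {
    emit(tag, 1);   // one emit per tag
  });
};

var reduce = function (key, values) {
  // values is an array of 1s (or partially reduced counts); sum them
  return values.reduce(function (a, b) { return a + b; }, 0);
};

db.files.mapReduce(map, reduce, { out: "files.tags" });
In the PHP version, the same map and reduce bodies would go into the MongoCode strings shown earlier in the question.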