Map reduce to count tags - codeigniter

I am developing a web app using Codeigniter and MongoDB.
I am trying to get the map reduce to work.
I have a file document with the structure below. I would like to run a map reduce to
count how many times each tag is used and output the result to the collection files.tags.
{
"_id": {
"$id": "4f26f21f09ab66c1030d0000e"
},
"basic": {
"name": "The filename"
},
"tags": [
"lorry",
"house",
"car",
"bicycle"
],
"updated_at": "2012-02-09 11:08:03"
}
I tried this map reduce command but it does not count each individual tag:
$map = new MongoCode ("function() {
emit({tags: this.tags}, {count: 1});
}");
$reduce = new MongoCode ("function( key , values ) {
var count = 0;
values.forEach(function(v) {
count += v['count'];
});
return {count: count};
}");
$this->mongo_db->command (array (
"mapreduce" => "files",
"map" => $map,
"reduce" => $reduce,
"out" => "files.tags"
)
);

Change your Map function to:
function map(){
if(!this.tags) return;
this.tags.forEach(function(tag){
emit(tag, {count: 1});
});
}

Yes, this map/reduce simply calculates the total count of tags.
The MongoDB cookbook has the example you are looking for.
You have to emit each tag instead of the entire array of tags:
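map = function() {
if (!this.tags) {
return;
}
// Emit the same {count: 1} shape that the reduce function in the question sums and returns.
for (var index in this.tags) {
emit(this.tags[index], {count: 1});
}
}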

You'll need to call emit once for each tag in the input documents.
The MongoDB documentation, for example, says:
A map function calls emit(key,value) any
number of times to feed data to the reducer. In most cases you will
emit once per input document, but in some cases such as counting tags,
a given document may have one, many, or even zero tags.
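The effect of emitting once per tag can be checked outside MongoDB. The following is a minimal in-memory sketch, not the server's implementation: the helper mapReduce and the sample documents are made up for illustration, and emit is passed in as an argument here, whereas in MongoDB it is a global available inside the map function.

```javascript
// Hypothetical in-memory stand-in for mapReduce (illustration only).
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  // emit() collects every value under its key.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Run map with `this` bound to each document, as MongoDB does.
  for (const doc of docs) map.call(doc, emit);
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

// Sample documents (assumed data, shaped like the question's files).
const files = [
  { tags: ['lorry', 'house', 'car', 'bicycle'] },
  { tags: ['car', 'house'] },
  { basic: { name: 'no tags here' } },
];

// One emit per tag; documents without tags emit nothing.
function mapTags(emit) {
  if (!this.tags) return;
  this.tags.forEach((tag) => emit(tag, { count: 1 }));
}

// Same reducer logic as in the question.
function reduceCounts(key, values) {
  let count = 0;
  values.forEach((v) => { count += v.count; });
  return { count };
}

const tagCounts = mapReduce(files, mapTags, reduceCounts);
console.log(tagCounts);
```

Because the emitted value and the reducer's return value have the same {count: n} shape, the reducer can safely be applied to its own output.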


Group user events into activity sessions

You’re in charge of implementing a new analytics “sessions” view. You’re given a set of data that consists of individual web page visits, along with a visitorId which is generated by a tracking cookie that uniquely identifies each visitor. From this data we need to generate a list of sessions for each visitor.
The data set looks like this:
{
"events": [
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754583000
},
{
"url": "/pages/a-small-dog",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754631000
},
{
"url": "/pages/a-big-talk",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709065294
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512711000000
},
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754436000
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709024000
}
]
}
Given this input data, we want to create a set of sessions from the incoming data. A session is defined as a group of events from a single visitor with no more than 10 minutes between each consecutive event. A visitor can have multiple sessions. So given the example input data above, we would expect output which looks like:
{
"sessionsByUser": {
"f877b96c-9969-4abc-bbe2-54b17d030f8b": [
{
"duration": 41294,
"pages": [
"/pages/a-sad-story",
"/pages/a-big-talk"
],
"startTime": 1512709024000
},
{
"duration": 0,
"pages": [
"/pages/a-sad-story"
],
"startTime": 1512711000000
}
],
"d1177368-2310-11e8-9e2a-9b860a0d9039": [
{
"duration": 195000,
"pages": [
"/pages/a-big-river",
"/pages/a-big-river",
"/pages/a-small-dog"
],
"startTime": 1512754436000
}
]
}
}
Notes
Timestamps are in milliseconds.
Events may not be given in chronological order.
The visitors in sessionsByUser can be in any order.
For each visitor, sessions should be in chronological order.
For each session, the URLs should be sorted in chronological order.
For a session with only one event, the duration should be zero.
Each event in a session (except the first event) must have occurred
within 10 minutes of the preceding event in the session. This means
that there can be more than 10 minutes between the first and the last
event in the session.
Note: I am not gonna show you how to make the exact output format you need, but I will show you the general approach for solving this problem, and hopefully you can figure out how to change it for your requirements.
You are gonna want to start out by grouping the events by user:
events_by_user = events.group_by { |event| event[:visitorId] }
Now, for each user, you need to sort their events by timestamp:
events_by_user.transform_values! do |events|
events.sort_by { |event| event[:timestamp] }
end
Now, you need to loop through each user's events and compare them in sequential order, putting them in groups based on timestamp similarity:
session_length = 10 * 60 * 1000 # 10 minutes, in milliseconds (timestamps are in ms)
sessions = {}
events_by_user.each do |visitor_id, events|
sessions[visitor_id] = []
events.each do |event|
if sessions[visitor_id].empty?
sessions[visitor_id].push([event])
else
last_session = sessions[visitor_id].last
last_timestamp = last_session.last[:timestamp]
if (event[:timestamp] - last_timestamp) <= session_length
last_session.push(event)
else
sessions[visitor_id].push([event])
end
end
end
end
Now sessions will contain a hash like this:
{
<visitor_id>: [
[<list of events in session 1>],
[<list of events in session 2>]
],
etc.
}
You can then extract the start time, total duration, etc. from each session.
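As a rough sketch of that last extraction step, written in JavaScript rather than Ruby: given sessions that are already grouped per visitor and sorted by timestamp, each one can be reduced to the fields from the expected output. The function names summarizeSession and summarizeAll are made up for illustration.

```javascript
// Summarize one session: an array of events already sorted by timestamp (ms).
function summarizeSession(events) {
  const first = events[0];
  const last = events[events.length - 1];
  return {
    duration: last.timestamp - first.timestamp, // 0 for a single-event session
    pages: events.map((e) => e.url),
    startTime: first.timestamp,
  };
}

// Summarize every session of every visitor.
function summarizeAll(sessionsByVisitor) {
  const out = {};
  for (const [visitorId, sessions] of Object.entries(sessionsByVisitor)) {
    out[visitorId] = sessions.map(summarizeSession);
  }
  return { sessionsByUser: out };
}

// Grouped input for one visitor, taken from the sample data in the question.
const grouped = {
  'f877b96c-9969-4abc-bbe2-54b17d030f8b': [
    [
      { url: '/pages/a-sad-story', timestamp: 1512709024000 },
      { url: '/pages/a-big-talk', timestamp: 1512709065294 },
    ],
    [{ url: '/pages/a-sad-story', timestamp: 1512711000000 }],
  ],
};

const summary = summarizeAll(grouped);
console.log(JSON.stringify(summary, null, 2));
```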
First, group the "events" array by the "visitorId" property. In JavaScript you can use
Array.prototype.reduce(): The reduce() method executes a user-supplied
“reducer” callback function on each element of the array, in order,
passing in the return value from the calculation on the preceding
element. The final result of running the reducer across all elements
of the array is a single value. So, set {} as the initial value for the reducer function; at each pass, use the visitorId as the key to an array that holds that visitor's events, and push the current event onto that array.
The variable || [] expression replaces an undefined
value with [], an empty array.
Next, sort each visitor's events array by timestamp in ascending order.
Loop through it and compare timestamps pairwise (previous and current array element). If the difference is within the session length (e.g. 10 min), add the current event to the current session; otherwise start a new session for that visitor. Use a variable to keep track of the index of the current session.
let data = require('d:\\dataset.json');
//Group by visitorId
let sessions = {
sessionsByUser: data.events.reduce(function (events, session) {
(events[session['visitorId']] = events[session['visitorId']] || []).push(session);
return events;
}, {})
};
//Sort events by timestamp ascending
for(let key in sessions.sessionsByUser){
let events = sessions.sessionsByUser[key];
events = events.sort((a, b) => {
return a.timestamp - b.timestamp;
});
}
let userSessions = {};
for(let key in sessions.sessionsByUser){
let events = sessions.sessionsByUser[key];
let lastIndex = 0;
for(let i = 0; i < events.length; i++)
{
if(i == 0) {
userSessions[key] = [{
duration: 0,
pages: [events[i].url],
startTime: events[i].timestamp
}]
}
else {
//Check difference (no more than 10 min, so exactly 10 min still counts)
if(events[i].timestamp - events[i-1].timestamp <= 600000) {
let session = userSessions[key][lastIndex];
session.duration += (events[i].timestamp - events[i-1].timestamp);
session.pages.push(events[i].url);
}
else {
userSessions[key].push({
duration: 0,
pages: [events[i].url],
startTime: events[i].timestamp
});
lastIndex++;
}
}
}
}
let soln = {
sessionsByUser: userSessions
};
console.log(JSON.stringify(soln));
To run it, open a command prompt, navigate to the directory containing the file, and run: node <filename>.js
Change the dataset path to match your system. Node must
be installed on the system.
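The boundary of the 10-minute rule can be checked in isolation. The sketch below is only an illustration with made-up timestamps; splitByGap is a hypothetical helper, not part of the answer's code. It shows that a gap of exactly 600000 ms keeps an event in the session, and that a session can span more than 10 minutes overall.

```javascript
// Split timestamps (ms) into sessions wherever the gap to the
// previous event exceeds maxGapMs. Input order does not matter.
function splitByGap(timestamps, maxGapMs) {
  const sorted = [...timestamps].sort((a, b) => a - b);
  const sessions = [];
  for (const t of sorted) {
    const current = sessions[sessions.length - 1];
    if (current && t - current[current.length - 1] <= maxGapMs) {
      current.push(t); // within the allowed gap: same session
    } else {
      sessions.push([t]); // first event, or gap too large: new session
    }
  }
  return sessions;
}

const TEN_MINUTES = 10 * 60 * 1000;

// Made-up timestamps: three events exactly 10 minutes apart,
// then one event 10 minutes and 1 ms later.
const result = splitByGap([1200000, 0, 1800001, 600000], TEN_MINUTES);
console.log(result);
```

The first session here covers 20 minutes from first to last event, which is allowed because only consecutive gaps are compared.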

RxJS Map array to observable and back to plain object in array

I have an array of objects, and I need to pass each object separately into an async method. (Behind the scenes the work is handled with a Promise and then converted back to an Observable via Observable.fromPromise(...); this is needed because the same method is also used when just a single object is passed. The method saves objects into a database.) For example, this is an array of objects:
[
{
"name": "John",
...
},
{
"name": "Anna",
...
},
{
"name": "Joe",
...
},
{
"name": "Alexandra",
...
},
...
]
Now I have a method called insert which inserts an object into the database. The store method of the database instance returns the newly created id. At the end, the initial object is copied and given its new id:
insert(user: User): Observable<User> {
return Observable.fromPromise(this.database.store(user)).map(
id => {
let storedUser = Object.assign({}, user);
storedUser.id = id;
return storedUser;
}
);
}
This works well when I insert a single object. However, I would like to add support for inserting multiple objects by calling the single-insert method for each of them. Currently this is what I have, but it doesn't work:
insertAll(users: User[]): Observable<User[]> {
return Observable.forkJoin(
users.map(user => this.insert(user))
);
}
The insertAll method inserts the users as expected (or something else filled the database with those users), but I don't get any response back from it. While debugging, it seems that forkJoin gets a response from only the first mapped user and the others are ignored. Subscribing to insertAll does nothing, and no error is reported either via catch on insertAll or via the second parameter of subscribe.
So I'm looking for a solution where the Observable (in insertAll) would emit back an array of new objects with users in that form:
[
{
"id": 1,
"name": "John",
...
},
{
"id": 2,
"name": "Anna",
...
},
{
"id": 3,
"name": "Joe",
...
},
{
"id": 4,
"name": "Alexandra",
...
},
...
]
I would be very happy for any suggestion pointing in the right direction. Thanks in advance!
To convert from array to observable you can use Rx.Observable.from(array).
To convert from observable to array, use obs.toArray(). Notice this does return an observable of an array, so you still need to .subscribe(arr => ...) to get it out.
That said, your code with forkJoin does look correct. But if you do want to try from, write the code like this:
insertAll(users: User[]): Observable<User[]> {
return Observable.from(users)
.mergeMap(user => this.insert(user))
.toArray();
}
Another, more Rx-like way to do this is to emit values as they complete instead of waiting for all of them, as forkJoin and toArray do. Just omit the toArray from the previous example:
insertAll(users: User[]): Observable<User> {
return Observable.from(users)
.mergeMap(user => this.insert(user));
}
As @cartant mentioned, the problem might not be in Rx; your database might not support multiple connections. In that case, you can replace the mergeMap with concatMap to make Rx send only one request at a time:
insertAll(users: User[]): Observable<User[]> {
return Observable.from(users)
.concatMap(user => this.insert(user))
.toArray(); // still optional
}
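One practical difference between these variants is result order: forkJoin and concatMap keep the order of the input array, while mergeMap delivers results in the order the inserts finish. The following is a plain JavaScript simulation of that timing, not RxJS code; the latency values are invented stand-ins for database round trips, and mergeOrder and concatFinishTimes are hypothetical helper names.

```javascript
// Each task has an assumed latency in ms (made-up numbers).
const tasks = [
  { name: 'John', latency: 30 },
  { name: 'Anna', latency: 10 },
  { name: 'Joe', latency: 20 },
];

// mergeMap-style: all inserts start together, so results arrive
// in order of completion (ties keep input order).
function mergeOrder(items) {
  return items
    .map((t, i) => ({ name: t.name, done: t.latency, i }))
    .sort((a, b) => a.done - b.done || a.i - b.i)
    .map((t) => t.name);
}

// concatMap-style: each insert starts only after the previous one
// finished, so input order is kept and finish times accumulate.
function concatFinishTimes(items) {
  let clock = 0;
  return items.map((t) => {
    clock += t.latency;
    return { name: t.name, done: clock };
  });
}

const merged = mergeOrder(tasks);
const sequential = concatFinishTimes(tasks);
console.log(merged);
console.log(sequential);
```

If the caller relies on the returned array matching the input order, that is a reason to prefer forkJoin or concatMap over mergeMap followed by toArray.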

Produce a stream of values with data-driven delays in RxJS

Given an array of objects which contain a message payload and time parameter like this:
var data = [
{ message:"Deliver me after 1000ms", time:1000 },
{ message:"Deliver me after 2000ms", time:2000 },
{ message:"Deliver me after 3000ms", time:3000 }
];
I would like to create an observable sequence which returns the message part of each element of the array and then waits for the corresponding amount of time specified in the object. I'm open to reorganising the data structure of the array if that is necessary.
I've seen Observable.delay but can't see how it could be used with a dynamic value in this way. I'm working in RxJS 5.
You could use delayWhen:
var data = [
{ message:"Deliver me after 1000ms", time:1000 },
{ message:"Deliver me after 2000ms", time:2000 },
{ message:"Deliver me after 3000ms", time:3000 }
];
Rx.Observable
.from(data)
.delayWhen(datum => Rx.Observable.timer(datum.time))
.do(datum => console.log(datum.message))
.subscribe();
<script src="https://unpkg.com/@reactivex/rxjs@5.0.3/dist/global/Rx.js"></script>
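It may help to spell out the timing this produces. Because from(data) emits all three items synchronously, every timer starts at roughly the same moment, so each message arrives at its own time value measured from subscription. If the intent is instead "deliver a message, then pause for its time before the next one", the delivery times would be cumulative. The sketch below only does that arithmetic in plain JavaScript; it is not RxJS code, and absoluteSchedule and sequentialSchedule are hypothetical names.

```javascript
const data = [
  { message: 'Deliver me after 1000ms', time: 1000 },
  { message: 'Deliver me after 2000ms', time: 2000 },
  { message: 'Deliver me after 3000ms', time: 3000 },
];

// All timers start together: each item arrives at its own `time`.
function absoluteSchedule(items) {
  return items.map((d) => ({ message: d.message, at: d.time }));
}

// Deliver an item, then wait its `time` before delivering the next.
function sequentialSchedule(items) {
  let clock = 0;
  return items.map((d) => {
    const at = clock;
    clock += d.time;
    return { message: d.message, at };
  });
}

const absolute = absoluteSchedule(data).map((d) => d.at);
const sequentialTimes = sequentialSchedule(data).map((d) => d.at);
console.log(absolute);
console.log(sequentialTimes);
```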

Perform sequential api calls with RxJs?

Is there a way in RxJs to perform two api calls where the second requires data from the first and return a combined result as a stream? What I'm trying to do is call the facebook API to get a list of groups and the cover image in various sizes. Facebook returns something like this:
// call to facebook /1234 to get the group 1234, cover object has an
// image in it, but only one size
{ id: '1234', cover: { id: '9999' } }
// call to facebook /9999 to get the image 9999 with an array
// with multiple sizes, omitted for simplicity
{ images: [ <image1>, <image2>, ... ] }
// desired result:
{ id: '1234', images: [ <image1>, <image2>, ... ] }
So I have this:
var result = undefined;
rxGroup = fbService.observe('/1234');
rxGroup.subscribe(group => {
rxImage = fbService.observe(`/${group.cover.id}`);
rxImage.subscribe(images => {
// the image response has the shape { images: [...] }
group.images = images.images;
result = group;
});
});
I want to create a method that accepts a group id and returns an Observable that will have the combined group + images (result here) in the stream. I know I can create my own observable and call the next() function in there where I set 'result' above, but I'm thinking there has to be an rx-way to do this. select/map lets me transform, but I don't know how to shoehorn in the results from another call. when/and/then seems promising, but also doesn't look like it supports something like that. I could map and return an observable, but the caller would then have to do two subscribes.
Looks like flatMap is the way to go. Its callback is invoked like a subscribe callback, receiving each value from the stream. You return an observable from it, and flatMap outputs the values from all the created observables (one for each element in the base stream) into the resulting stream.
var sourceGroup = { // result of calling api /1234
id: '1234',
cover: {
id: '9999'
}
};
var sourceCover = { // result of calling api /9999
id: '9999',
images: [{
src: 'image1x80.png'
}, {
src: 'image1x320.png'
}]
};
var rxGroup = Rx.Observable.just(sourceGroup);
var rxCombined = rxGroup.flatMap(group =>
Rx.Observable.just(sourceCover)
.map(images => ({
id: group.id,
images: images.images
}))
)
rxCombined.subscribe(x =>
console.log(JSON.stringify(x, null, 2)));
<script src="https://cdnjs.cloudflare.com/ajax/libs/rxjs/4.1.0/rx.all.min.js"></script>
Result:
{
"id": "1234",
"images": [
{
"src": "image1x80.png"
},
{
"src": "image1x320.png"
}
]
}
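The same "one inner collection per source element, flattened into one result" idea can be tried without Rx by using arrays as a stand-in for streams. This is only an analogy built on Array.prototype.flatMap, not observable code; the api lookup table and the fetchAll helper are made up for illustration.

```javascript
// Hypothetical canned responses keyed by path (assumed data).
const api = {
  '/1234': { id: '1234', cover: { id: '9999' } },
  '/9999': {
    id: '9999',
    images: [{ src: 'image1x80.png' }, { src: 'image1x320.png' }],
  },
};

// A "stream" with a single response, modeled as a one-element array.
const fetchAll = (path) => [api[path]];

// For each group, make the dependent second lookup and flatten
// the inner results into the outer array.
const combined = fetchAll('/1234').flatMap((group) =>
  fetchAll(`/${group.cover.id}`).map((cover) => ({
    id: group.id,
    images: cover.images,
  }))
);

console.log(JSON.stringify(combined, null, 2));
```

The caller sees a single flat collection of combined objects, which mirrors needing only one subscribe in the observable version.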
You should use concatMap instead of flatMap; it preserves the order of the source emissions.

How can I validate DBRefs in a MongoDB collection?

Assuming I've got a MongoDB instance with 2 collections - places and people.
A typical places document looks like:
{
"_id": "someID",
"name": "Broadway Center",
"url": "bc.example.net"
}
And a people document looks like:
{
"name": "Erin",
"place": DBRef("places", "someID"),
"url": "bc.example.net/Erin"
}
Is there any way to validate the place DBRef of every document in the people collection?
There's no official/built-in method to test the validity of DBRefs, so the validation must be performed manually.
I wrote a small script - validateDBRefs.js:
var returnIdFunc = function(doc) { return doc._id; };
var allPlaceIds = db.places.find({}, {_id: 1} ).map(returnIdFunc);
var peopleWithInvalidRefs = db.people.find({"place.$id": {$nin: allPlaceIds}}).map(returnIdFunc);
print("Found the following documents with invalid DBRefs");
var length = peopleWithInvalidRefs.length;
for (var i = 0; i < length; i++) {
print(peopleWithInvalidRefs[i]);
}
When run with:
mongo DB_NAME validateDBRefs.js
it outputs:
Found the following documents with invalid DBRefs
513c4c25589446268f62f487
513c4c26589446268f62f48a
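The logic of the script is a set difference, and it can be tried on plain in-memory arrays. The sketch below assumes string ids and made-up sample documents; findInvalidRefs is a hypothetical helper, and with real ObjectId values a Set lookup by identity would not work as written. Like the $nin query, it also reports documents that have no place field at all.

```javascript
// Return the _id of every person whose place reference does not
// point at an existing place (or who has no place field).
function findInvalidRefs(places, people) {
  const placeIds = new Set(places.map((p) => p._id));
  return people
    .filter((person) => !person.place || !placeIds.has(person.place.$id))
    .map((person) => person._id);
}

// Assumed sample data; DBRefs are modeled as { $ref, $id } objects.
const places = [{ _id: 'someID', name: 'Broadway Center' }];
const people = [
  { _id: 'p1', name: 'Erin', place: { $ref: 'places', $id: 'someID' } },
  { _id: 'p2', name: 'Sam', place: { $ref: 'places', $id: 'gone' } },
  { _id: 'p3', name: 'Kim' },
];

const invalid = findInvalidRefs(places, people);
console.log(invalid);
```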
You could add a stored function for that. Please note that the MongoDB documentation discourages the use of stored functions.
In essence you create a function:
db.system.js.save(
{
_id : "myAddFunction" ,
value : function (x, y){ return x + y; }
}
);
Once the function is created, you can use it in your $where clauses. So you could write a function that checks whether the id in the DBRef exists.
