How to prevent crossfold re-generation on every iteration in LensKit 3.0-M2

My gradle crossfold task looks like this:
    task crossfold(type: Crossfold, group: 'evaluate') {
        input 'data/mt-500k.yml'
        // test on the same 1/5 of each user's ratings
        holdoutFraction(0.2, 'timestamp')
        // use 3-fold cross-validation
        partitionCount 3
        // use the partition-users method
        method 'partition-users'
    }
The mt-500k dataset contains all the ratings. Because of my limited amount of RAM, I need to run my algorithms separately. This means that even though my data doesn't change, the crossfolds get regenerated on each run, which puts different users in the training/test folds and makes the results incomparable. How can I keep the same crossfold, or prevent LensKit from regenerating it?

Turns out that it's as simple as changing the dataSet variable in the evaluate task to:
dataSet 'build/crossfold.out/datasets.yaml'
Lenskit takes care of everything else.

Concurrency in Redis for flash sale in distributed system

I am going to build a flash-sale system that will run on 15 servers at a time, all sharing the same Redis instance.
The flash-sale algorithm will be:
Set the maximum inventory for a product id in Redis using redisTemplate.opsForValue().set(key, 400L);
For every request:
Get the current inventory using Long val = redisTemplate.opsForValue().get(key);
Check whether it is non-zero:
    if (val == null || val == 0) {
        System.out.println("not taking order....");
    } else {
        // put the order in Kafka
        // and decrement using redisTemplate.opsForValue().decrement(key)
    }
But the problem here is concurrency:
If I set the inventory to 400 and test it with 500 request threads, the inventory becomes negative.
If I make the function synchronized, that does not help across distributed servers.
So what is the best approach?
Note: I cannot use an RDBMS and set an isolation level because of the high request count.
Redis executes commands on a single thread, so a Lua script run on it is always atomic.
You can therefore define a Lua script on your Redis instance and run it from your Spring instances.
Your Lua script would just be a sequence of operations to execute against your Redis instance (the only place that holds the correct value of your stock); it returns, for instance, the new value, or an error if the value would become negative.
A Lua script is basically a Redis transaction. There are other ways to achieve a Redis transaction, but IMHO Lua is the simplest of all (maybe the least performant, but I have found that in most cases it is fast enough).
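A minimal sketch of that check-and-decrement, shown in Python for brevity. The Lua text in RESERVE_LUA is what would run inside Redis. The names RESERVE_LUA, InMemoryStock and run_flash_sale are hypothetical, and the in-memory class only models the script's all-or-nothing behaviour with a lock, so the logic can be tried without a server. With a real client the script would be sent with EVAL (from Spring, for example, through RedisTemplate.execute with a DefaultRedisScript); that wiring is not shown here.

```python
import threading

# Lua source that would run atomically inside Redis (sent with EVAL).
# Returns the remaining stock after the decrement, or -1 when sold out.
RESERVE_LUA = """
local stock = tonumber(redis.call('GET', KEYS[1]))
if stock == nil or stock <= 0 then
  return -1
end
return redis.call('DECR', KEYS[1])
"""


class InMemoryStock:
    """Hypothetical stand-in for Redis: a lock models single-threaded script execution."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._values[key] = value

    def reserve(self, key):
        # Same steps as RESERVE_LUA: read, check, decrement, all under one lock.
        with self._lock:
            stock = self._values.get(key)
            if stock is None or stock <= 0:
                return -1
            self._values[key] = stock - 1
            return stock - 1


def run_flash_sale(stock, key, requests):
    """Fire `requests` concurrent reservations and return how many succeeded."""
    results = []
    results_lock = threading.Lock()

    def worker():
        remaining = stock.reserve(key)
        with results_lock:
            results.append(remaining)

    threads = [threading.Thread(target=worker) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(1 for r in results if r >= 0)
```

Only requests that get a non-negative result would go on to publish the order to Kafka; the rest are rejected without touching the counter, so it never drops below zero.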

JMeter and random variable along with if controller

I want to control my sampler execution by using a random variable. I have a sequence of requests: login, welcome, Bla, log out. I want the log out to be performed for 6 out of 10 requests and skipped for the others (so to speak, 6 requests will perform the whole sequence including log out, and 4 of them will not perform log out). How can I achieve this in JMeter?
I have added a random variable rand and set it between 1 and 10 at the beginning of the thread group. Then, just above the Logout sampler, I placed an If Controller where I check ${rand}>4. However, the whole sequence is always executed. Please suggest what I am doing wrong.
Your approach is a little bit weird, my expectation is that the problem is in the following areas:
Your IF Controller condition is wrong (check jmeter.log file for any suspicious entries)
Your random variable setting is wrong, i.e. it has the same value for all virtual users (threads) so they will either be executed all or none
So I would recommend using Throughput Controller or Switch Controller in order to set up this 60/40 distribution.
See Running JMeter Samplers with Defined Percentage Probability article for more details.
The Random Variable in JMeter is saved in long format by default, so
${rand} > 4 won't work. You need to change the
condition to ${rand} > 4.0
or change the Random Variable output format to 00 (2 digits).
See the manual.
This was accomplished by combining a Random Variable config element with an If Controller:
1) a random variable was created with minimum and maximum values that meet the condition above
2) an If Controller was used to check ${myrand}>4
This produced the desired result. Thank you all.

Big task or multiple small tasks with Sidekiq

I'm writing a worker to add lots of users to a group. I'm wondering whether it's better to run one big task that handles all users, to batch them (100 users per task, say), or to run one task per user.
For the moment, here is my code:
    class AddUsersToGroupWorker
      include Sidekiq::Worker
      sidekiq_options :queue => :group_utility
      def perform(store_id, group_id, user_ids_to_add)
        begin
          store = Store.find store_id
          group = Group.find group_id
        rescue ActiveRecord::RecordNotFound => e
          Airbrake.notify e
          return
        end
        users_to_process = store.users.where(id: user_ids_to_add)
                                .where.not(id: group.user_ids)
        group.users += users_to_process
        users_to_process.map(&:id).each do |user_to_process_id|
          UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
        end
      end
    end
Maybe it's better to have something like this in my method:
    def add_users
      users_to_process = store.users.where(id: user_ids_to_add)
                              .where.not(id: group.user_ids)
      users_to_process.map(&:id).each do |user_to_process_id|
        AddUserToGroupWorker.perform_async group_id, user_to_process_id
        UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
      end
    end
But that means so many find requests. What do you think?
I have a Sidekiq Pro license if needed (for batches, for example).
Here are my thoughts.
1. Do a single SQL query instead of N queries
This line: group.users += users_to_process is likely to produce N SQL queries (where N is users_to_process.count). I assume that you have a many-to-many association between users and groups (with a user_groups join table/model), so you should use a mass-insert technique:
    users_to_process_ids = store.users.where(id: user_ids_to_add)
                                .where.not(id: group.user_ids)
                                .pluck(:id)
    sql_values = users_to_process_ids.map{|i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())"}
    Group.connection.execute("
      INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
      VALUES #{sql_values.join(",")}
    ")
Yes, it's raw SQL. And it's fast.
2. Use pluck(:id) instead of map(&:id)
pluck is much quicker, because:
It will select only the 'id' column, so less data is transferred from the DB
More importantly, it won't create an ActiveRecord object for each row
Doing SQL is cheap. Creating Ruby objects is really expensive.
3. Use horizontal parallelization instead of vertical parallelization
What I mean here is that if you need to do sequential tasks A -> B -> C for a dozen records, there are two major ways to split the work:
Vertical segmentation. AWorker does A(1), A(2), A(3); BWorker does B(1), etc.; CWorker does all C(i) jobs;
Horizontal segmentation. UniversalWorker does A(1)+B(1)+C(1).
Use the latter (horizontal) way.
It's a statement from experience, not from some theoretical point of view (where both ways are feasible).
Why should you do that?
When you use vertical segmentation, you will likely get errors when you pass a job from one worker down to another. You will pull your hair out if you bump into such errors, because they aren't persistent or easily reproducible: sometimes they happen and sometimes they don't. Is it possible to write code that passes the work down the chain without errors? Sure, it is. But it's better to keep it simple.
Imagine that your server is at rest. And then suddenly new jobs arrive. Your B and C workers will just waste the RAM, while your A workers do the job. And then your A and C will waste the RAM, while B's are at work. And so on. If you make horizontal segmentation, your resource drain will even itself out.
Applying that advice to your specific case: for starters, don't call perform_async in another async task.
4. Process in batches
Answering your original question – yes, do process in batches. Creating and managing async task takes some resources by itself, so there's no need to create too many of them.
TL;DR So in the end, your code could look something like this:
    # model code
    BATCH_SIZE = 100
    def add_users
      users_to_process_ids = store.users.where(id: user_ids_to_add)
                                  .where.not(id: group.user_ids)
                                  .pluck(:id)
      # An INSERT with an empty VALUES list is a SQL error, so stop early
      return if users_to_process_ids.empty?
      # With 100,000 users performance of this query should be acceptable
      # to make it in a synchronous fashion
      sql_values = users_to_process_ids.map{|i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())"}
      Group.connection.execute("
        INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
        VALUES #{sql_values.join(",")}
      ")
      users_to_process_ids.each_slice(BATCH_SIZE) do |batch|
        # pass store.id as well, because the worker needs it below
        AddUserToGroupWorker.perform_async store.id, group.id, batch
      end
    end
    # add_user_to_group_worker.rb
    def perform(store_id, group_id, user_ids_to_add)
      group = Group.find group_id
      # Do some heavy load with a batch as a whole
      # ...
      # ...
      # If nothing here is left, call UpdateLastUpdatesForUserWorker from the model instead
      user_ids_to_add.each do |id|
        # do it synchronously: we already parallelized the job
        # by splitting it in slices in the model above
        UpdateLastUpdatesForUserWorker.new.perform store_id, id
      end
    end
There's no silver bullet. It depends on your goals and your application. General questions to ask yourself:
How many user ids could you pass to a worker? Is it possible to pass 100? What about 1000000?
How long can your workers run? Should there be any restriction on their running time? Can they get stuck?
For big applications it's necessary to split the passed arguments into smaller chunks, to avoid creating long-running jobs. Creating a lot of small jobs allows you to scale easily: you can always add more workers.
It might also be a good idea to define some kind of timeout for workers, to stop stuck workers from processing forever.
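The chunking idea itself is independent of Sidekiq. A minimal Python sketch, where in_batches and enqueue_all are hypothetical helper names and enqueue is a stand-in callback for something like perform_async:

```python
from typing import Callable, Iterator, List, Sequence


def in_batches(ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def enqueue_all(ids: Sequence[int], batch_size: int,
                enqueue: Callable[[List[int]], None]) -> int:
    """Call `enqueue(batch)` once per batch and return the number of jobs created."""
    jobs = 0
    for batch in in_batches(ids, batch_size):
        enqueue(batch)
        jobs += 1
    return jobs
```

With 250 ids and a batch size of 100 this creates three jobs (100, 100 and 50 ids), so the batch size directly trades the number of jobs against how long each one runs.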

How do we improve a MongoDB MapReduce function that takes too long to retrieve data and gives out of memory errors?

Retrieving data from mongo takes too long, even for small datasets. For bigger datasets we get out of memory errors of the javascript engine. We've tried several schema designs and several ways to retrieve data. How do we optimize mongoDB/mapReduce function/MongoWire to retrieve more data quicker?
We're not very experienced with MongoDB yet and are therefore not sure whether we're missing optimization steps or if we're just using the wrong tools.
1. Background
For graphing and playback purposes we want to store changes for several objects over time. Currently we have tens of objects per project, but expectations are we need to store thousands of objects. The objects may change every second or not change for long periods of time. A Delphi backend writes to and reads from MongoDB through MongoWire and SuperObjects, the data is displayed in a web frontend.
2. Schema design
We're storing the object changes in minute-second-millisecond objects in a record per hour. The schema design is like described here. Sample:
o: object1,
dt: $date,
v: {0: {0:{0: {speed: 8, rate: 0.8}}}, 1: {0:{0: {speed: 9}}}, …}
We've put indexes on {dt: -1, o: 1} and {o:1}.
3. Retrieving data
We use a mapReduce to construct a new date based on the minute-second-millisecond objects and to put the object back in v:
o: object1,
dt: $date,
v: {speed: 8, rate:0.8}
An average document is about 525 kB before the mapReduce function and has had ~29000 updates. After mapReduce of such a document, the result is about 746 kB.
3.1 Retrieving data through the mongo shell with mapReduce
We're using the following map function:
    function mapF(){
        for (var i = 0; i < 3600; i++){
            var imin = Math.floor(i / 60);
            var isec = (i % 60);
            var min = ''+imin;
            var sec = ''+isec;
            if (this.v.hasOwnProperty(min) && this.v[min].hasOwnProperty(sec)) {
                for (var ms in this.v[min][sec]) {
                    if (imin !== 0 && isec !== 0 && ms !== '0' && this.v[min][sec].hasOwnProperty(ms)) {// is our keyframe
                        var currentV = this.v[min][sec][ms];
                        //newT is new date computed by the min, sec, ms above
                        if (toDate > newT && newT > fromDate) {
                            if (fields && fields.length > 0) {
                                for (var p = 0, length = fields.length; p < length; p++){
                                    //check if field is present and put it in newV
                                }
                                if (newV) {
                                    emit(this.o, {vs: [{o: this.o, dt: newT, v: newV}]});
                                }
                            } else {
                                emit(this.o, {vs: [{o: this.o, dt: newT, v: currentV}]});
                            }
                        }
                    }
                }
            }
        }
    };
The reduce function basically just passes the data on. The call to mapReduce:
    db.collection.mapReduce(mapF, reduceF,
        {out: {inline: 1},
         query: {o: {$in: objectNames}, dt: {$gte: keyframeFromDate, $lt: keyframeToDate}},
         sort: {dt: 1},
         scope: {toDate: toDateWithinKeyframe, fromDate: fromDateWithinKeyframe, fields: []},
         jsMode: true});
Retrieving 2 objects over 1 hour: 2.4 seconds.
Retrieving 2 objects over 5 hours: 8.3 seconds.
For this method we would have to write js and bat files at runtime and read the JSON data back in. We have not measured times for this yet because, frankly, we don't like the idea very much.
Another problem with this method is that we get out of memory errors from the V8 JavaScript engine when we try to retrieve data for longer periods and/or more objects. Using a PC with more RAM helps to some extent in preventing out of memory errors, but it doesn't make retrieving data faster.
This article mentions splitVector, which we might use to divide the workload. But we're not sure how to use the keyPattern and maxChunkSizeBytes options. Can we use a keyPattern for both o and dt?
We might use multiple collections, but our dataset isn't that big to start with at the moment, so we're worried about how many collections we'd need.
3.2 Retrieving data through mongoWire with mapReduce
For retrieving data through mongoWire with mapReduce, we use the same mapReduce functions as above. We use the following Delphi code to start the query:
    FMongoWire.Get('$cmd',BSON([
      'mapreduce', 'collection',
      'map', bsonJavaScriptCodePrefix + FMapVCRFunction.Text,
      'reduce', bsonJavaScriptCodePrefix + FReduceVCRFunction.Text,
      'out', BSON(['inline', 1]),
      'query', mapquery,
      'sort', BSON(['dt', -1]),
      'scope', scope
    ]));
Retrieving data with this method is about 3-4 times (!) slower. On top of that, the data has to be translated from BSON (IBSONDocument) to JSON (SuperObject), which is a major time-consuming part of this method. For retrieving raw data we use TMongoWireQuery, which translates the BSON document in parts, while this mapReduce call uses TMongoWire directly and tries to translate the complete result at once. This might explain why it takes so long, while the translation is normally quite fast. If we can reduce the time it takes for the mapReduce to return results, this might be a next step for us to focus on.
3.3 Retrieving raw data and parsing in Delphi
Retrieving raw data to Delphi takes a bit longer than the previous method but, probably because of the use of TMongoWireQuery, the translation from BSON to JSON is much quicker.
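To make the client-side parsing step concrete, here is a minimal sketch in Python (not the Delphi code used in the project) of the reshaping described in section 3. The function name flatten_hour is hypothetical, and it assumes that dt holds the start of the hour and that v is nested as v[minute][second][millisecond] with string keys, as in the sample document. Unlike the map function above, it does not skip any keyframe entries and it walks only the keys that exist instead of probing all 3600 minute/second combinations.

```python
from datetime import datetime, timedelta


def flatten_hour(doc, from_date, to_date, fields=None):
    """Turn one per-hour document into a flat, time-sorted list of {o, dt, v} records.

    Assumes doc["dt"] is the start of the hour and doc["v"] is nested as
    v[minute][second][millisecond] with string keys.
    """
    records = []
    for minute, seconds in doc["v"].items():
        for second, millis in seconds.items():
            for ms, values in millis.items():
                dt = doc["dt"] + timedelta(
                    minutes=int(minute), seconds=int(second), milliseconds=int(ms)
                )
                # Keep only changes strictly inside the requested window.
                if not (from_date < dt < to_date):
                    continue
                if fields:
                    # Keep only the requested fields; drop the record if none is present.
                    values = {k: v for k, v in values.items() if k in fields}
                    if not values:
                        continue
                records.append({"o": doc["o"], "dt": dt, "v": values})
    records.sort(key=lambda r: r["dt"])
    return records
```

The work per document is proportional to the number of stored changes, which matters when a document holds tens of thousands of updates.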
4. Questions
Can we do further optimizations on our schema design?
How can we make the mapReduce function faster?
How can we prevent the out of memory errors of the V8 engine? Can someone give more information on the splitVector function?
How can we best use mapReduce from Delphi? Can we use MongoWireQuery instead of MongoWire?
5. Specs
MongoDB 3.0.3
MongoWire from 2015 (recently updated)
Delphi 2010 (got XE5 as well)
4GB RAM (tried on 8GB RAM as well, less out of memory, but reading times are about the same)
Phew what a question! First up: I'm not an expert at MongoDB. I wrote TMongoWire as a way to get to know MongoDB a little. Also I really (really) dislike when wrappers have a plethora of overloads to do the same thing but for all kinds of specific types. A long time ago programmers didn't have generics, but we did have Variant. So I built a MongoDB wrapper (and IBSONDocument) based around variants. That said, I apparently made something people like to use, and by keeping it simple performs quite well. (I haven't been putting much time in it lately, but on the top of the list is catering for the new authentication schemes since version 3.)
Now, about your specific setup. You say you use mapreduce to get from 500KB to 700KB? I think there's a hint there you're using the wrong tool for the job. I'm not sure what the default mongo shell does differently than when you do the same over TMongoWire.Get, but if I assume mapReduce assembles the response first before sending it over the wire, that's where the performance gets lost.
So here's my advice: you're right with thinking about using TMongoWireQuery. It offers a way to process data faster as the server will be streaming it in, but there's more.
I strongly suggest to use an array to store the list of seconds. Even if not all seconds have data, store null on the seconds without data so each minute array has 60 items. This is why:
One nicety that turned up in designing TMongoWireQuery, is the assumption you'll be processing a single (BSON) document at a time, and that the contents of the documents will be roughly similar, at least in the value names. So by using the same IBSONDocument instance when enumerating the response, you actually save a lot of time by not having to de-allocate and re-allocate all those variants.
That goes for simple documents, but would actually be nice to have on arrays as well. That's why I created IBSONDocumentEnumerator. You need to pre-load an IBSONDocument instance with an IBSONDocumentEnumerator in the place where you're expecting the array of documents, and you need to process the array in roughly the same way as with TMongoWireQuery: enumerate it using the same IBSONDocument instance, so when subsequent documents have the same keys, time is saved not having to re-allocate them.
In your case though, you would still need to pull the data of an entire hour through the wire just to select the seconds you need. As I said before, I'm not a MongoDB expert, but I suspect there could be a better way to store data like this. Either with a separate document per second (I guess this would let the indexes do more of the work, and MongoDB can take that insert rate), or with a specific query construction so that MongoDB knows to shorten the seconds array to just the data you're requesting (is that what $slice does?)
Here's an example of how to use IBSONDocumentEnumerator on documents like {name:"fruit",items:[{name:"apple"},{name:"pear"}]}
    q:=TMongoWireQuery.Create(db);
    try
      q.Query('test',BSON([]));
      e:=BSONEnum;
      d:=BSON(['items',e]);
      d1:=BSON;
      while q.Next(d) do
      begin
        i:=0;
        while e.Next(d1) do
        begin
          Memo1.Lines.Add(d['name']+'#'+IntToStr(i)+d1['name']);
          inc(i);
        end;
      end;
    finally
      q.Free;
    end;

Minimizing shuffle when mapper output is mostly sorted

I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
The Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning = true. However, that will not directly help in your case, because it requires that the keys are not changed.
    /**
     * Return a new RDD by applying a function to each partition of this RDD.
     *
     * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
     * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
     */
    def mapPartitions[U: ClassTag](
        f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
      val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
      new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
    }
If you were able to know definitively that none of the keys would move outside of their original partitions the above would work. But the values on the boundaries would likely not cooperate.
What is the scale of the data compared to the migrating keys? You may consider adding a postprocessing step. First construct a partition for all migrating keys. Your mapper would output a special key value for keys needing to migrate. Then postprocess the results to do some sort of append to the standard partitions. That is extra hassle, so you would need to weigh the saved shuffle against the extra step and the added pipeline complexity.
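One way to judge that scale up front is to count how many records would leave their mapper's node under a range partitioner. A minimal Python sketch using the example from the question; range_partition and count_migrating are hypothetical names, and the sketch assumes that reducer i is scheduled on the same node as mapper i, which neither Hadoop nor Spark guarantees by default:

```python
from bisect import bisect_left


def range_partition(key, upper_bounds):
    """Return the reducer index for `key`, given sorted inclusive upper bounds.

    With upper_bounds = ["d"], keys up to "d" go to reducer 0 and all later keys to reducer 1.
    """
    return bisect_left(upper_bounds, key)


def count_migrating(mapper_outputs, upper_bounds):
    """Count records whose reducer index differs from the index of the mapper that emitted them.

    Assumes reducer i runs on the same node as mapper i (hypothetical co-location).
    """
    migrating = 0
    for mapper_index, keys in enumerate(mapper_outputs):
        for key in keys:
            if range_partition(key, upper_bounds) != mapper_index:
                migrating += 1
    return migrating
```

For the mapper output a, c, c, d, e, d, e with the split a-d / e-g, only the two e records migrate, which matches the estimate in the question.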
