Algorithm - how to get the count of active sessions in a time series?

I have a table in which we maintain users' login and logout times. Now I want to display a table to the admin with the number of active users at each time, like:
00:00 - 250
00:15 - 225
00:30 - 240
00:45 - 190
01:00 - 240
....
..
What algorithm should we use?
Thanks in advance :)

Make a list/array of pairs (time; incr = +1 for login, -1 for logout)
Sort the list by the time key
Set ActiveCount = 0
Traverse the list, adding incr to ActiveCount.
The value of ActiveCount at any moment is the number of active users.
Example with two sessions:
login: 0; logout: 4
login: 2; logout: 6
Sorted list: (0;+1), (2;+1), (4;-1), (6;-1)
ActiveCount: 0 1 2 1 0 (initial value, then after each event)
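
A minimal Python sketch of these steps, using the two-session example above (the variable names are just for illustration):

# Sweep over login/logout events and keep a running count of active sessions.
sessions = [(0, 4), (2, 6)]          # (login_time, logout_time) pairs from your table

events = []
for login, logout in sessions:
    events.append((login, +1))       # a login increments the count
    events.append((logout, -1))      # a logout decrements it

events.sort()                        # sort by time

active = 0
for time, incr in events:
    active += incr
    print(time, active)              # active count right after this event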

You can iterate over the list of login-logout pairs for all the users and put each user (or increment the count for that user) in the appropriate bucket. If a particular user's session spans multiple buckets, you'll have to put that user (or increment the count) in each of those buckets.
That's all about the algorithm. Probably the simplest one.
If you want to go into implementation detail, you can use a HashMap or unordered_map whose keys are the times at which you want to report the number of users; each value starts from zero and is incremented every time a user falls into that bucket.
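
A rough Python sketch of this bucketing approach, assuming 15-minute buckets and that a user is counted in every bucket their session overlaps (all names and numbers here are illustrative):

from collections import defaultdict

BUCKET = 15                                   # bucket size in minutes
sessions = [(0, 40), (20, 75)]                # (login_minute, logout_minute) pairs

counts = defaultdict(int)                     # bucket start minute -> active user count
for login, logout in sessions:
    first = (login // BUCKET) * BUCKET
    last = (logout // BUCKET) * BUCKET
    for start in range(first, last + 1, BUCKET):
        counts[start] += 1                    # this user overlaps this bucket

for start in sorted(counts):
    print(f"{start // 60:02d}:{start % 60:02d} - {counts[start]}")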

Related

How to slice a dataset in Python at specific intervals

I have a dataset with n rows; how can I access a specific number of rows every specific number of rows throughout the whole dataset using Python?
For example, in a 100-row dataset I want to access 10 rows every 10 rows, like 1:10, 20:30, 40:50, 60:70, 80:90.
I could think of something like this
df.iloc[np.array([int(x/10) for x in df.index]) % 2 == 0]
It takes the index of the dataframe, divides it by 10 and casts it to an int. This basically just removes the last digit in this example.
With the modulo statement the first 10 rows are True, the next 10 are False, and so on. This is then used with iloc to get just the rows with a True value.
This requires a continuously increasing index. If, for example, some rows were already filtered out, this is not the case; reset_index can be used to reset the index.
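
A self-contained version of that approach, with the index reset first in case rows were filtered out earlier (the column name is made up for the example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(100)})
df = df.reset_index(drop=True)        # ensure a continuously increasing index

# True for indices 0-9, False for 10-19, True for 20-29, ...
mask = np.array([int(x / 10) for x in df.index]) % 2 == 0
sliced = df.iloc[mask]
print(sliced)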

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given (user, product) input like the above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 1 of the matrix, userB is mapped to row 2, productX is mapped to column 1, and so on.
Given the size of the data, I would have to use Hadoop MapReduce, but I can't think of a foolproof way of doing this efficiently.
This can be solved if we can do the following:
Dump unique userIds.
Dump unique productIds.
Map each unique userId in (1) to a row index.
Map each unique productId in (2) to a column index.
I can do (1) and (2) easily, but I'm having trouble coming up with an efficient approach to solve (3) ((4) will be solved if we solve (3)).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
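
Stripped of the Hadoop plumbing, the single-reducer idea amounts to this small Python sketch (illustrative only):

# One reducer sees every unique userId under the same key and assigns
# row ids from a single global counter.
user_ids = ["userA", "userB", "userC"]        # values arriving at the lone reducer

counter = 0
for user_id in user_ids:
    print(user_id, counter)                   # i.e. context.write(userId, counter)
    counter += 1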
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1, 2, 3, ..., N (where N is configurable; N = 100, for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate the matrix rowId as start_of_partition + counter.
context.write(userId, matrixRowId)
This method should work, but I am not sure how to handle cases where reducer tasks fail or are killed.
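
The offset arithmetic behind Solution 2 can be sketched in plain Python, leaving out the actual Hadoop plumbing (the partition counts would come from the mapper-side counters; the numbers here are examples):

# Number of userIds that landed in each random partition, as reported
# by the mapper-side counters.
partition_counts = {1: 3, 2: 2, 3: 4}

# Prefix sums give each partition its starting row index.
starts = {}
offset = 0
for p in sorted(partition_counts):
    starts[p] = offset
    offset += partition_counts[p]
# starts == {1: 0, 2: 3, 3: 5}

# Inside the reducer for partition p: rowId = start_of_partition + local counter
def assign_rows(partition, user_ids):
    return {uid: starts[partition] + i for i, uid in enumerate(user_ids)}

print(assign_rows(2, ["userX", "userY"]))     # {'userX': 3, 'userY': 4}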
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

Retrieve 15 random children from Firebase

I have a Firebase database with Items in it. There could potentially be up to 1000 items in the database.
I am looking to pull 45 random children out of the database to use.
Any idea how I can do this without pulling them all out first and then weeding them down to what I need?
Assign each item an index, 1-1000
-Jhsu498984
    item_name: "my item 0"
    item_index: 0
-Ynkkj93ov9
    item_name: "my item 24"
    item_index: 24
then, with a random number generator, generate 45 random numbers (which match the item_index) and query for those specific items.
or
create all of the items and, in a separate node, keep their node refs
item_refs
    -Jhsu498984: true
    -Ynkkj93ov9: true
Then you just need to load the item_refs into an array, randomly pick 45 from the array, and query for those items.
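
Whichever option you pick, the random selection itself is simple; here is a small Python sketch (the Firebase query is left out and depends on your SDK, and the counts and keys are just placeholders):

import random

needed = 45

# Option 1: pick 45 distinct item_index values and query for those items
total_items = 1000
indices = random.sample(range(total_items), needed)

# Option 2: load the item_refs node into an array and pick 45 keys from it
item_refs = ["-Jhsu498984", "-Ynkkj93ov9"]    # ...all refs loaded from /item_refs
chosen = random.sample(item_refs, min(needed, len(item_refs)))

print(indices[:5], chosen)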

Sum Formula Crystal Reports Inquiry

OK, say I have a subreport that populates a chart from data in a table. I have a summary sum field that adds up the total of each row displayed. I am about to add two new rows that need to be displayed but not totaled up in the sum. There is a field in the table that has a number from 1-7 in it. If I added these new rows to the database, I would assign them a negative number, like -1 and -2, to differentiate them from the other records. How can I set up a formula so that it will sum up all of the amount fields except for the records that have an 'order' number (as we'll call it) of either -1 or -2? Thanks!
Use a Running Total Field and set its evaluation formula to something like {new_field} >= 0, so it will only sum the value when a record passes that test.
The way to accomplish this without a running total is with a formula like this:
if {OrderNum} >= 0 Then {Amount}

Azure Table Storage - PartitionKey and RowKey selection to use between query

I am a total newbie with Azure! The purpose is to return rows based on the timestamp stored in the RowKey. As there is a transaction cost with each query, I want to minimize the number of transactions/queries whilst maintaining performance.
These are the proposed Partition and Row Keys:
Partition Key: TextCache_(AccountID)_(ParentMessageId)
Row Key: (DateOfMessage)_(MessageId)
Legend:
AccountId - is an integer
ParentMessageId - The parent messageId if there is one, blank if it is the parent
DateOfMessage - Date the message was created - format will be DateTime.Ticks.ToString("d19")
MessageId - the unique Id of the message
I would like a single query to return the rows and any child rows that are > or < DateOfMessage_MessageId.
Can this be done via my proposed PartitionKeys and RowKeys?
i.e. (in pseudo code)
var results = ctx.PartitionKey.StartsWith(TextCache_AccountId)
&& ctx.RowKey > (TimeStamp)_MessageId
Secondly, if I have a number of accounts and only want to return the first 10, could it be done via a single query?
i.e. (in pseudo code)
var results = (
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId1))
        && ctx.RowKey > (TimeStamp1)_MessageId1
    )
    ||
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId2))
        && ctx.RowKey > (TimeStamp2)_MessageId2
    )
    ...
).Take(10)
The short answer to your questions is yes, but there are some things you need to watch for.
Azure table storage doesn't have a direct equivalent of .StartsWith(). If you're using the storage library in combination with LINQ, you can use .CompareTo() (> and < don't translate properly). This means that if you run a search for account 1 and ask the query to return 1000 results, but there are only 600 results for account 1, the last 400 results will be for account 10 (the next account number lexically). So you'll need to be a bit smart about how you deal with your results.
If you padded out the account id with leading 0s, you could do something like this (pseudo code here as well):
ctx.PartitionKey > "TextCache_0000000001"
&& ctx.PartitionKey < "TextCache_0000000002"
&& ctx.RowKey > "123465798"
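
For illustration, a range query along those lines could look like this with the azure-data-tables Python SDK; the connection string, table name and RowKey value here are assumptions for the sketch, not details from the question:

from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    conn_str="<connection string>",           # placeholder
    table_name="TextCache")                   # assumed table name

account_id = 1
pk_lower = f"TextCache_{account_id:010d}"     # zero-padded account id
pk_upper = f"TextCache_{account_id + 1:010d}"

# OData filter equivalent of the pseudo code above
query = (f"PartitionKey gt '{pk_lower}' and PartitionKey lt '{pk_upper}' "
         f"and RowKey gt '123465798'")

for entity in table.query_entities(query):
    print(entity["RowKey"])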
Something else to bear in mind is that queries to Azure Tables return their results in PartitionKey then RowKey order. So in your case messages without a ParentMessageId will be returned before messages with a ParentMessageId. If you're never going to query this table by ParentMessageId I'd move this to a property.
If TextCache_ is just a string constant, it's not adding anything by being included in the PartitionKey unless this will actually mean something to your code when it's returned.
While your second query will run, I don't think it will produce what you're after. If you want the first ten rows in DateOfMessage order, then it won't work (see my point above about sort orders). If you ran this query as it is and account 1 had 11 messages, it would return only the first 10 messages related to account 1, regardless of whether account 2 had an earlier message.
While trying to minimise the number of transactions you use is good practice, don't be too concerned about it. The cost of running your worker/web roles will dwarf your transaction costs. 1,000,000 transactions will cost you $1, which is less than the cost of running one small instance for 9 hours.
