Query Pandas DataFrame distinctly with multiple conditions by unique row values - performance

I have a DataFrame with event logs:
eventtime, eventname, user, execution_in_s, delta_event_time
The eventname e.g. can be "new_order", "login" or "update_order".
My problem is that I want to know if there is eventname == "error" in the periods between login and update_order by distinct user. A period for me has a start time and an end time.
That all sounded easy until I tried it this morning.
For the time frame of the 24h logs I might not have a pair, because the login might have happened yesterday. I am not sure how to deal with something like that.
delta_event_time is a computed column of the eventtime minus the execution_in_s. I consider these the real timestamps. I computed them like this:
event_frame["delta_event_time"] = event_frame["eventtime"] - pandas.to_timedelta(event_frame["execution_in_s"], unit='s')
I tried something like this:
events_keys = numpy.array(["login", "new_order"])
users = numpy.unique(event_frame["user"])
for user in users:
    event_name = event_frame[event_frame["eventname"].isin(events_keys) & (event_frame["user"] == user)]["eventname"]
But this is not using the time periods.
I know that Pandas has between_time() but I don't know how to query a DataFrame with periods, by user.
Do I need to iterate over the DataFrame with .iterrows() to calculate the start and end time tuples? In my tries that takes a lot of time, even for basic things. I somehow think that this would make Pandas useless for this task.
I tried event_frame.sort(["user", "eventname"]) which works nicely so that I can see the relevant lines already. I did not have any luck with .groupby("user"), because it mixed users although they are unique row values.
Maybe a better workflow solution is to dump the DataFrame into a MongoDB instead of pursuing a solution with Pandas to perform the analysis in this case. I am not sure, because I am new to the framework.

Here is pseudocode for what I think will solve your problem. I will update it if you share a sample of your data.
grouped = event_frame.groupby('user')  # This should work.
# I cannot believe that it didn't work for you! I won't buy it till you show us proof!
for name, group in grouped:
    group = group.set_index('eventtime')  # This will make it easier to work with time series.
    # I am changing the index here because different users may have similar or
    # overlapping times, and it is a pain in the neck to resolve indexing conflicts.
    login_ind = group[group['eventname'] == 'login'].index
    error_ind = group[group['eventname'] == 'error'].index
    update_ind = group[group['eventname'] == 'update_order'].index
    # Here you can compare the indexes login_ind, error_ind and update_ind however you wish.
    # Note that each index can even have a length of 0.
    # The user name is stored in the variable name, so you can get it from there.
The best way might be to create a function that does the comparing, because then you can create a dict by declaring error_user = {} and, inside for name, group in grouped:, call your function like so: error_user[name] = function_which_checks_when_user_saw_error(login_ind, error_ind, update_ind), as sketched below.
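To make that concrete, here is a minimal sketch of what such a comparing function could look like. It pairs each login with the earliest later update_order and checks for errors in between; adjust the pairing rule to your own definition of a period (logins whose update_order falls outside the 24h log window are simply skipped here).

def function_which_checks_when_user_saw_error(login_ind, error_ind, update_ind):
    # Return True if any error timestamp falls between a login and the next update_order.
    for login_time in login_ind:
        later_updates = update_ind[update_ind > login_time]
        if len(later_updates) == 0:
            continue  # the closing update_order may lie outside the 24h log window
        end_time = later_updates.min()
        if ((error_ind > login_time) & (error_ind < end_time)).any():
            return True
    return False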

Related

How to include an array of weights to adjust importance of observed data in sm.tsa.UnobservedComponents?

I have used the following five lines to build a Kalman filter with your work for a smoothed pricing model, and it worked great.
import statsmodels.api as sm  # import assumed; obs, xlm and obs_noise_level are defined elsewhere in the script

mod = sm.tsa.UnobservedComponents(obs, 'local level')
lm = sm.OLS(obs, xlm, missing='drop').fit()
obs_noise = abs(lm.resid).mean()
params = [obs_noise, obs_noise / obs_noise_level]
mod_filter, mod_smooth = mod.filter(params), mod.smooth(params)
However, I would now like to adjust the filtering smoothness at certain times. For example, when the unemployment rate or an interest rate makes a big surge, I would like the output (Kalman filtered/smoothed) value to be closer to the observed value, while at most other times I keep what the model produces. So I have created an array in which a few items are greater than 1 and the others are exactly 1.
e.g.: ir_coeff = np.array([1,1,1,1,1.345,1.23,1.78,1,1,1])
What could be the best approach to achieve this? Thank you a lot in advance.
I have tried to include it in the output with a dot product operation, but that does not seem like a reasonable way to do this.

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over 1 month. I want to reduce this data to have one sample for every hour. The problem is: some of my runs have "NA" values, so I delete those rows. There are not exactly 60 points for every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column which has the same value if the data set has the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1,:]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index,:]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, make sure to either pass the data to your function as arguments or, if the data stays in global scope, qualify it with const;
Layer two: recommendations to your algorithm
A statement like [] creates an array of type Any, which is slow; you should use a type qualifier like index = Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient, it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
If I am not mistaken, your code will produce incorrect results, as it will not catch the last entry of the uniquedatehour vector (assume it has only one element and check what happens - avedata will have zero rows);
Line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not Matrix{Float64} it is better to convert it at the beginning with Matrix{Float64}(alldata);
You can change line rows=alldata[index,:] to a view like view(alldata, index, :) to avoid allocation;
In general you can avoid creating the index vector at all, as it is enough to remember the start position s and end position e of the run of equal values and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but it requires a slightly different algorithmic approach (though maybe you will prefer the option below for simplicity). A sketch combining these points follows.
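For reference, here is a rough sketch of what layers one and two look like combined. It assumes alldata is already a Matrix{Float64}, the rows are sorted by datehour, and the NA rows have been dropped beforehand as you describe; NaN is used for preallocation instead of NA since we stay with plain Float64 matrices, and it keeps the mean(x, 1) form of your code (on Julia 1.0+ that becomes mean(x, dims=1) with using Statistics):

function hourly_means(alldata::Matrix{Float64}, datehour)
    hours = unique(datehour)                              # one output row per distinct hour
    avedata = fill(NaN, length(hours), size(alldata, 2))  # preallocate instead of vcat
    s = 1                                                 # start of the current run of equal values
    for (k, hour) in enumerate(hours)
        e = s
        while e < length(datehour) && datehour[e + 1] == hour
            e += 1                                        # extend the run
        end
        avedata[k, :] = vec(mean(view(alldata, s:e, :), 1))
        s = e + 1
    end
    return avedata
end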
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.

Logic to compare rows in pig

I need logic for the scenario below, which needs to be implemented using Pig scripts. Can anyone please provide some ideas on how to do this?
The input contains a column groupName with some values like others and unknown. These values need to be replaced by the value from the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown needs to be replaced with the value from 125 (sale0001); 130,others needs to be replaced with the value from 129 (sale0002); and 132,unknown, 133,unknown and 134,others need to be replaced with the value from 131 (casc0004).
--Edit--
I tried the LEAD function in Pig, but it is used only to compare n rows at a time, which cannot solve this completely.
Here is another approach that works, but I am looking for a more optimized one:
Cogroup for the same data set (like Dataset and Dataset_self)
-Filter Dataset.id=Dataset_self.id or Dataset_self.groupname='others' or Dataset_self.groupname='unknown'
-Generate IdDiff like (Dataset_self.id-Dataset.id), CASE when id=id then ( id, group) else (id_self,group)
-Foreach (group id){
ordered = order by id,diff,group;
limited = ordered limit 1;
generate limited ;
}
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to trace back to the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as you have LEAD or LAG functions readily available which could be used to do exactly what you need.
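As a sketch of that last point in Oracle-style SQL (the table name input_data is hypothetical): a plain LAG of one row would not handle the consecutive bad rows 132-134, so LAST_VALUE ... IGNORE NULLS is used to carry the last good value forward:

select id,
       coalesce(
         case when groupName not in ('others', 'unknown') then groupName end,
         last_value(case when groupName not in ('others', 'unknown') then groupName end ignore nulls)
           over (order by id)
       ) as groupName
from input_data;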

What's the 'Rails 4 Way' of finding some number of random records?

User.find(:all, :order => "RANDOM()", :limit => 10) was the way I did it in Rails 3.
User.all(:order => "RANDOM()", :limit => 10) is how I thought Rails 4 would do it, but this is still giving me a Deprecation warning:
DEPRECATION WARNING: Relation#all is deprecated. If you want to eager-load a relation, you can call #load (e.g. `Post.where(published: true).load`). If you want to get an array of records from a relation, you can call #to_a (e.g. `Post.where(published: true).to_a`).
You'll want to use the order and limit methods instead. You can get rid of the all.
For PostgreSQL and SQLite:
User.order("RANDOM()").limit(10)
Or for MySQL:
User.order("RAND()").limit(10)
As the random function could differ between databases, I would recommend using the following code:
User.offset(rand(User.count)).first
Of course, this is useful only if you're looking for only one record.
If you want to get more than one, you could do something like:
User.offset(rand(User.count) - 10).limit(10)
The - 10 is to ensure you get 10 records in case rand returns a number greater than count - 10.
Keep in mind you'll always get 10 consecutive records.
I think the best solution is really to order randomly in the database.
But if you need to avoid a database-specific random function, you can use the pluck and shuffle approach.
For one record:
User.find(User.pluck(:id).shuffle.first)
For more than one record:
User.where(id: User.pluck(:id).sample(10))
I would suggest making this a scope as you can then chain it:
class User < ActiveRecord::Base
scope :random, -> { order(Arel::Nodes::NamedFunction.new('RANDOM', [])) }
end
User.random.limit(10)
User.active.random.limit(10)
While not the fastest solution, I like the brevity of:
User.ids.sample(10)
The .ids method yields an array of User IDs and .sample(10) picks 10 random values from this array.
I strongly recommend this gem for random records, which is specially designed for tables with lots of data rows:
https://github.com/haopingfan/quick_random_records
All other answers perform badly with a large database, except this gem:
quick_random_records only costs 4.6ms in total.
the accepted answer User.order('RAND()').limit(10) costs 733.0ms.
the offset approach costs 245.4ms in total.
the User.all.sample(10) approach costs 573.4ms.
Note: My table only has 120,000 users. The more records you have, the bigger the performance difference will be.
UPDATE:
Performance on a table with 550,000 rows:
Model.where(id: Model.pluck(:id).sample(10)) costs 1384.0ms
gem: quick_random_records only costs 6.4ms in total
For MySQL this worked for me:
User.order("RAND()").limit(10)
You could call .sample on the records, like: User.all.sample(10)
The answer from @maurimiranda, User.offset(rand(User.count)).first, is not good if we need to get 10 random records, because User.offset(rand(User.count) - 10).limit(10) will return a sequence of 10 records starting from a random position; they are not totally random. So we would need to call that function 10 times to get 10 truly random records.
Besides that, offset is also not good if the random function returns a high value. If your query looks like offset: 10000 and limit: 20, it generates 10,020 rows and throws away the first 10,000 of them,
which is very expensive. So calling offset/limit 10 times is not efficient.
So I thought that in case we just want to get one random user, User.offset(rand(User.count)).first may be better (at least we can improve it by caching User.count).
But if we want 10 random users or more then User.order("RAND()").limit(10) should be better.
Here's a quick solution. I'm currently using it with over 1.5 million records and getting decent performance. The best solution would be to cache one or more random record sets, and then refresh them with a background worker at a desired interval (a sketch of that caching idea follows after the helper below).
I created a random_records_helper.rb file:
module RandomRecordsHelper
  def random_user_ids(n)
    user_ids = []
    user_count = User.count
    # Note: this assumes ids are roughly contiguous from 1..user_count (no large gaps from deletions)
    # and may draw the same id twice, so you can get back fewer than n users.
    n.times { user_ids << rand(1..user_count) }
    return user_ids
  end
end
in the controller:
@users = User.where(id: random_user_ids(10))
This is much quicker than the .order("RANDOM()").limit(10) method - I went from a 13 sec load time down to 500ms.
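A minimal sketch of the caching idea mentioned above, assuming Rails.cache is configured (here the pool simply expires after 10 minutes instead of being refreshed by a background worker, and the pool size of 100 is arbitrary):

def cached_random_user_ids(n)
  pool = Rails.cache.fetch("random_user_ids_pool", expires_in: 10.minutes) do
    User.order("RANDOM()").limit(100).ids   # rebuild the pool when the cache entry expires
  end
  pool.sample(n)
end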

Create Random Integer Based on Id in Ruby

I have a scenario where I need to generate 4 digit confirmation codes for individual orders. I don't want to just do random codes due to the off chance that two exact codes would be generated near the same time. Is there a way to use the id of each order and generate a 4 digit code from that? I know I am going to eventually have repetitive codes with this but it will be ok because they will not be generated around the same time.
Do you really need to base the code on the ID? Four digits only gives you ten thousand possible values so you could generate them all with a script and toss them in a database table. Then just pull a random one out of the database when you need it and put it back in when you're done with it.
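For example, seeding the full set of four-digit codes is a one-liner in Ruby (a sketch; how you load the codes into the table is up to you):

codes = (0..9999).map { |n| format('%04d', n) }   # "0000" .. "9999"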
Your code table would look like this:
code: The code
uuid: A UUID, a NULL value here indicates that this code is free.
Then, to grab a code, first generate a UUID, uuid, and do this:
update code_table
set uuid = ?
where code = (
select code
from code_table
where uuid is null
order by random()
limit 1
)
-- Depending on how your database handles transactions
-- you might want to add "and uuid is null" to the outer
-- WHERE clause and loop until it works
(where ? would be your uuid) to reserve the code in a safe manner and then this:
select code
from code_table
where uuid = ?
(where ? is again your uuid) to pull the code out of the database.
Later on, someone will use the code for something and then you just:
update code_table
set uuid = null
where code = ?
(where code is the code) to release the code back into the pool.
You only have ten thousand possible codes, that's pretty small for a database even if you are using order by random().
A nice advantage of this approach is that you can easily see how many codes are free; this lets you automatically check the code pool every day/week/month/... and complain if the number of free codes falls below, say, 20% of the entire code space.
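Checking the free pool is then a trivial query (a sketch against the same code_table):

select count(*) as free_codes
from code_table
where uuid is null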
You have to track the in-use codes anyway if you want to avoid duplicates so why not manage it all in one place?
If your order id has more than 4 digits, it is theoretically impossible to guarantee unique codes without checking the generated value against an array of already generated values. You can do something like this:
# Mutex is available in core Ruby, so no require is needed.
$confirmation_code_mutex = Mutex.new
$confirmation_codes_in_use = []

def generate_confirmation_code
  $confirmation_code_mutex.synchronize do
    # Keep drawing 4-digit codes (1000..9999) until we find one that is not in use.
    nil while $confirmation_codes_in_use.include?(code = rand(9000) + 1000)
    $confirmation_codes_in_use << code
    return code
  end
end
Remember to clean up $confirmation_codes_in_use after using the code.
