Currently I'm working with relatively large data files, and my computer is not a supercomputer. I'm temporarily creating many subsets of these data sets and not removing them from the workspace. Obviously this clutters the workspace with many variables. But does having many unused variables affect R's performance (i.e., does the computer's memory fill up at some point)?
When writing code, should I get into the habit of removing unused variables? Is it worth it?
x <- rnorm(1e8)
y <- mean(x)
# After this point I will not use x anymore, but I will use y
# Should I add the following line to my code, or
# will there be no performance lag if I skip it?
rm(x)
I don't want to add another line to my code. I'd rather have a cluttered workspace than cluttered code (if there is no performance improvement to be had).
Yes, having unused objects will affect your performance, since R stores all its objects in memory. Obviously small objects will have negligible impact, and you mostly need to remove only the really big ones (data frames with millions of rows, etc.), but an uncluttered workspace won't hurt anything.
The only risk is removing something that you need later. Even when using a repo, as suggested, breaking stuff accidentally is something you want to avoid.
One way to get around these issues is to make extensive use of local(). When you do a computation that scatters lots of temporary objects around, you can wrap it inside a local() call, which will effectively dispose of those objects for you afterward. No more having to clean up lots of i, j, x, temp.var, and whatnot.
local({
    x <- something
    for(i in seq_along(obj))
        temp <- some_unvectorised_function(obj[[i]], x)
    for(j in 1:temp)
        temp2 <- some_other_unvectorised_function(temp, j)
    # x, i, j, temp, temp2 only exist for the duration of local(...)
})
Adding to the suggestions above, for the benefit of beginners like me, here are the steps to check on R memory usage:
List the objects in the workspace using ls().
Check the size of objects of interest using object.size(object_name)
Remove unused/unnecessary objects using rm("Object_name")
Use gc()
Check the memory freed using memory.size() (Windows only)
To clear the workspace as if starting a new session, use rm(list=ls()) followed by gc().
If you feel that the habit of removing unused variables can be dangerous, it is always good practice to save your objects to an R image (.RData file) occasionally.
I think it's a good programming practice to remove unused code, regardless of language.
It's also a good practice to use a version control system like Subversion or Git to track your change history. If you do that you can remove code without fear, because it's always possible to roll back to earlier versions if you need to.
That's fundamental to professional coding.
Show the distribution of the largest objects and return their names, based on @Peter Raynham's answer:
memory.biggest.objects <- function(n = 10) {
  # Show the distribution of the largest objects and return their names
  Sizes.of.objects.in.mem <- sapply(ls(envir = .GlobalEnv),
                                    FUN = function(name) object.size(get(name)))
  topX <- sort(Sizes.of.objects.in.mem, decreasing = TRUE)[1:n]
  Memory.usage.stat <- c(topX, 'Other' = sum(sort(Sizes.of.objects.in.mem, decreasing = TRUE)[-(1:n)]))
  pie(Memory.usage.stat, cex = .5, sub = make.names(date()))
  # wpie(Memory.usage.stat, cex = .5)
  # Use wpie if you have MarkdownReports, from https://github.com/vertesy/MarkdownReports
  print(topX)
  print("rm(list = c('objectA', 'objectB'))")
  # inline_vec.char(names(topX))
  # Use inline_vec.char if you have DataInCode, from https://github.com/vertesy/DataInCode
}
Related
I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over one month. I want to reduce this data to one sample for every hour. The problem is: some of my rows have "NA" values, so I delete those rows. Also, there are not exactly 60 points in every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column, which has the same value for rows with the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1, :]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index, :]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, make sure either to pass the data to your function as arguments or, if the data remains in global scope, to declare it const;
Layer two: recommendations for your algorithm
A statement like index=[] creates an array with element type Any, which is slow; use a type qualifier like index=Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient, it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
Your code will produce incorrect results, if I am not mistaken, as it will not process the last entry of the uniquedatehour vector (assume it has only one element and check what happens - avedata will have zero rows);
Line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not Matrix{Float64} it is better to convert it at the beginning with Matrix{Float64}(alldata);
You can change line rows=alldata[index,:] to a view like view(alldata, index, :) to avoid allocation;
In general you can avoid creating the index vector at all, as it is enough to remember the start position s and end position e of the range of equal values and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but it requires a somewhat different algorithmic approach (though maybe you will prefer the option below for simplicity).
Layer three: how I would do it
I would use the DataFrames package to handle this problem, like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all you need if a DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.
I have a Ruby 1.8.7 script to parse iOS localization files:
singleline_comment = /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m
string_line = /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;\s*/xm
out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)
It used to work fine, but today we tested it with a file that is 800Kb, and that doesn't have ; at the end of each line. The result was a high CPU load and no response from the Rails server. My assumption is that it took the whole file as a single string in the capturing group and that blocked the server.
The solution was to add ? (the regex quantifier for 0 or 1 occurrences) to the ; literal character:
/\s*"(.*?)"\s*=\s*"(.*?)"\s*\;?\s*/xm
Now it works fine again, even with those files in the old iOS format, but my fear now is: what if a user submits a malformed file, like one with no closing "? Will my server get blocked again?
And how do I prevent this? Is there any way to try to run this for only five seconds? What can I do to avoid halting my whole Rails application?
It looks like you're trying to parse an entire configuration as if it was a string. While that is doable, it's error-prone. Regular expression engines have to do a lot of looking forward and backward, and poorly written patterns can end up wasting a huge amount of CPU time. Sometimes a minor tweak will fix the problem, but the more text being processed, and the more complex the expression, the higher the chance of something happening that will mess you up.
From benchmarking different ways of getting at data for my own work, I've learned that anchoring regexp patterns can make a huge difference in speed. If you can't anchor a pattern somehow, then you are going to suffer from the backtracking and greediness of patterns unless you can limit what the engine wants to do by default.
I have to parse a lot of device configurations, but instead of trying to treat them as a single string, I break them down into logical blocks consisting of arrays of lines, and then I can provide logic to extract data from those blocks based on knowledge that blocks contain certain types of information. Small blocks are faster to search, and it's a lot easier to write patterns that can be anchored, providing huge speedups.
Also, don't hesitate to use Ruby's String methods, like split to tear apart lines, and sub-string matching to find lines containing what you want. They're very fast and less likely to induce slowdowns.
If I had a string like:
config = "name:\n foo\ntype:\n thingie\nlast update:\n tomorrow\n"
chunks = config.split("\n").slice_before(/^\w/).to_a
# => [["name:", " foo"], ["type:", " thingie"], ["last update:", " tomorrow"]]
command_blocks = chunks.map{ |k, v| [k[0..-2], v.strip] }.to_h
command_blocks['name'] # => "foo"
command_blocks['last update'] # => "tomorrow"
slice_before is a very useful method for this sort of task as it lets us define a pattern that is then used to test for breaks in the master array, and group by those. The Enumerable module has lots of useful methods in it, so be sure to look through it.
The same data could be parsed in other ways too, of course. Without sample data for what you're trying to do it's difficult to suggest something that works better, but the idea is: break down your input into small, manageable chunks and go from there.
A comment on how you're defining your patterns:
Instead of using /\/.../ (which leads to what is known as "leaning toothpick syndrome"), use %r, which allows you to define a different delimiter:
singleline_comment = /\/\/(.*)$/ # => /\/\/(.*)$/
singleline_comment = %r#//(.*)$# # => /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m # => /\/\*(.*?)\*\//m
multiline_comment = %r#/\*(.*?)\*/#m # => /\/\*(.*?)\*\//m
The first line in each sample above is how you're doing it, and the second is how I'd do it. They result in identical regexp objects, but the second ones are easier to understand.
You can even have Regexp help you by escaping things for you:
NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*?)'
GREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*)'
EOL = '$'
Regexp.new(Regexp.escape('//') + GREEDY_CAPTURE_NONE_TO_ALL_CHARS + EOL) # => /\/\/(.*)$/
Regexp.new(Regexp.escape('/*') + NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS + Regexp.escape('*/'), Regexp::MULTILINE) # => /\/\*(.*?)\*\//m
Doing this you can iteratively build up extremely complex expressions while keeping them relatively easy to maintain.
As far as halting your Rails app goes, don't try to process the files in the same Ruby process. Run a separate job that watches for the files, processes them, and stores whatever you're looking for so it can be accessed as needed later. That way your server will continue to respond rather than lock up. I wouldn't do it in a thread, but would write a separate Ruby script that looks for incoming data and, if nothing is found, sleeps for some interval of time and then looks again. Ruby's sleep method will help with that, or you could use the cron capability of your OS.
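The language doesn't really matter for that watcher pattern; here is a rough sketch of the polling loop (shown in Python for brevity; the directory name and the process() helper are placeholders, not part of the original setup):
import os
import time

INCOMING_DIR = "incoming"   # placeholder: wherever uploaded localization files land
POLL_SECONDS = 5

def process(path):
    # Placeholder: parse the file and store the results for the web app to read later.
    pass

def watch():
    seen = set()
    while True:
        for name in os.listdir(INCOMING_DIR):
            path = os.path.join(INCOMING_DIR, name)
            if os.path.isfile(path) and path not in seen:
                process(path)
                seen.add(path)
        time.sleep(POLL_SECONDS)   # sleep for a while, then look again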
This is something that has puzzled me for some time and I have yet to find an answer.
I am in a situation where I am applying a standardized data cleaning process to (supposedly) similarly structured files, one file for each year. I have a statement such as the following:
replace field="Plant" if field=="Plant & Machinery"
This was a result of the original code written against the year 1 data file. I then generalize the code to loop through the years of data. The problem arises if, say, in year 3 the analogous value in that variable was coded as "Plant and MachInery ", such that the code line above would not make the intended change due to the difference in the text string, yet would not raise any error alerting me that the change was not made.
What I am after is some sort of confirmation that >0 observations actually satisfied the condition each time the code is executed in the loop, and otherwise an error. Any combination of trimming, removing spaces, and standardizing the text case is not a workaround option. At the same time, I don't want to add a count if and an assert statement before every conditional replace, as that becomes quite bulky.
Aside from going to the raw files to ensure the variable values are standardized, is there any way to do this validation "on the fly" as I have tried to describe? Maybe just write a custom program that combines a count if, assert and replace?
The idea has surfaced occasionally that replace should return the number of observations changed, but there are good reasons why it doesn't, notably that it is not an r-class or e-class command anyway, and it's quite important not to change the way it works because that could break innumerable programs and do-files.
So, I think the essence of any answer is that you have to set up your own monitoring process counting how many values have (or would be) changed.
One pattern, when working on a variable called current, is:
gen was = .
foreach ... {
    ...
    replace was = current
    replace current = ...
    qui count if was != current
    <use the result>
}
I would like to accomplish the following: upon evaluation of an input cell, it should self-destruct (i.e. delete itself). I tried to hack something together with SelectionMove and NotebookDelete, but didn't quite get what I wanted.
Here are potential use cases:
the command might be a shorthand for a series of other commands that will be generated dynamically and inserted into the notebook
the command might only be used for side effects (e.g. to set a notebook option or to open a new notebook); leaving the command in the notebook after evaluation serves no purpose and creates clutter
Edit: As per Mr. Wizard, the answer is SelectionMove[EvaluationNotebook[], Previous, Cell]; NotebookDelete[];. I don't know why it wasn't working for me before. Here is some code that uses this idiom.
writeAndEval[nb_, boxExpr_] := (NotebookWrite[nb,
CellGroupData[{Cell[BoxData[boxExpr], "Input"]}]];
SelectionMove[nb, Previous, Cell];
SelectionMove[nb, Next, Cell];
SelectionEvaluate[nb]);
addTwoAndTwo[] := Module[{boxExpr},
boxExpr = RowBox[{"2", "+", "2"}];
SelectionMove[EvaluationNotebook[], Previous, Cell];
NotebookDelete[];
writeAndEval[EvaluationNotebook[], boxExpr];
]
Now, running addTwoAndTwo[] deletes the original input and makes it look as if you've evaluated "2+2". Of course, you can do all sorts of things instead and not necessarily print to the notebook.
Edit 2: Sasha's abstraction is quite elegant. If you are curious about "real-world" usage of this, check out the code I posted in the "what's in your toolbag" question: What is in your Mathematica tool bag?
To affect all Input cells, evaluate this in the notebook:
SetOptions[EvaluationNotebook[], CellEvaluationFunction ->
  ((
     SelectionMove[EvaluationNotebook[], All, EvaluationCell];
     NotebookDelete[];
     ToExpression@##
   ) &)
]
If you only want to affect one cell, then select the cell and use the Options Inspector to set CellEvaluationFunction as above.
Or, building on Mr. Wizard's solution, you can create a function SelfDestruct, which will delete the input cell, if you intend to only do this occasionally:
SetAttributes[SelfDestruct, HoldAllComplete];
SelfDestruct[e_] := (If[$FrontEnd =!= $Failed,
SelectionMove[EvaluationNotebook[], All, EvaluationCell];
NotebookDelete[]]; e)
Then evaluating 2+3//SelfDestruct outputs 5 and deletes the input cell. This usage scenario seems more appealing to me.
I'm pretty sure this must be in some kind of text book (or more likely in all of them) but I seem to be using the wrong keywords to search for it... :(
A recurring task I'm facing while programming is that I am dealing with lists of objects from different sources which I need to keep in sync somehow. Typically there's some sort of "master list" e.g. returned by some external API and then a list of objects I create myself each of which corresponds to an object in the master list (think "wrappers" or "adapters" - they typically contain extended information about the external objects specific to my application and/or they simplify access to the external objects).
Hard characteristics of all instances of the problem:
the implementation of the master list is hidden from me; its interface is fixed
the elements in the two lists are not assignment-compatible
I have full control over the implementation of the slave list
I cannot control the order of elements in the master list (i.e. it's not sortable)
the master list does either not provide notification about added or removed elements at all or notification is unreliable, i.e. the sync can only happen on-demand, not live
simply clearing and rebuilding the slave list from scratch whenever it's needed is not an option:
initializing the wrapper objects should be considered expensive
other objects will hold references to the wrappers
Additional characteristics in some instances of the problem:
elements in the master list can only be identified by reading their properties rather than accessing them directly by index or memory address:
after a refresh, the master list might return a completely new set of instances even though they still represent the same information
the only interface for accessing elements in the master list might be a sequential enumerator
most of the time, the order of elements in the master list is stable, i.e. new elements are always added either at the beginning or at the end, never in the middle; however, deletion can usually occur at any position
So how would I typically tackle this? What's the name of the algorithm I should google for?
In the past I have implemented this in various ways (see below for an example) but it always felt like there should be a cleaner and more efficient way, especially one that did not require two iterations (one over each list).
Here's an example approach:
1. Iterate over the master list.
2. Look up each item in the "slave list".
3. Add items that do not yet exist.
4. Somehow keep track of items that already exist in both lists (e.g. by tagging them or keeping yet another list).
5. When done, iterate over the slave list, remove all objects that have not been tagged (see 4.), and clear the tag again from all others (a rough sketch of this approach follows below).
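To make that concrete, here is a minimal Python sketch of the approach, under assumptions not stated above: each wrapper stores the identifying value of its master item in a key attribute, key() reads that value from a master item's properties, and make_wrapper() is the (expensive) wrapper factory.
def sync_slave_list(master, slaves, key, make_wrapper):
    # Look up existing wrappers by the identity of the item they wrap.
    existing = {w.key: w for w in slaves}
    seen = set()

    # Pass 1: over the master list - add wrappers for new items and
    # "tag" (remember) every identity that is still present.
    for item in master:
        k = key(item)
        if k not in existing:
            existing[k] = make_wrapper(item)
            slaves.append(existing[k])
        seen.add(k)

    # Pass 2: over the slave list - drop wrappers whose master item is gone.
    slaves[:] = [w for w in slaves if w.key in seen]
    return slaves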
The 2 typical solutions are:
1. Copy the master list to the sync list.
2. Do an O(N*N) comparison between all element pairs.
You've excluded the smart options already: shared identity, sorting and change notifications.
Note that it's not relevant whether the lists can be sorted in a meaningful way, or even completely. For instance, when comparing two string lists, it would be ideal to sort alphabetically. But the list comparison would still be more efficient if you sorted both lists by string length! You'd still have to do a full pairwise comparison of strings of the same length, but that will probably be a much smaller number of pairs.
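As a toy Python illustration of that point (the function and variable names are made up for the example), bucketing one list by a cheap key such as string length confines the expensive pairwise comparison to much smaller groups:
from collections import defaultdict

def common_strings(list_a, list_b):
    # Group list_b by length once (the cheap "sort key").
    by_length = defaultdict(list)
    for s in list_b:
        by_length[len(s)].append(s)

    # The full comparison is now confined to strings of equal length.
    matches = []
    for s in list_a:
        if any(s == t for t in by_length[len(s)]):
            matches.append(s)
    return matches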
This looks like the set reconciliation problem i.e. the problem of synchronizing unordered data. A question on SO was asked on this: Implementation of set reconciliation algorithm.
Most of the references on Google are to technical paper abstracts.
Often the best solution to such problems is to not solve them directly.
If you really can't use a sorted, binary-searchable container in your part of the code (like a set, or even a sorted vector), then...
Are you very memory-bound? If not, then I'd just create a dictionary (an std::set, for example) containing the contents of one of the lists and then iterate over the second one, which I want to sync with the first.
This way you're doing n·log(n) work to create the dictionary (or n·X for a hash dictionary, depending on which will be more efficient), plus m·log(n) operations to go over the second list and sync it (or just m·X). That is hard to beat if you really have to use lists in the first place. It's also good that you do it only once, when and if you need it, and it's much better than keeping the lists sorted all the time, which would be an n² task for both of them.
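A rough Python sketch of that dictionary approach, assuming the elements (or their identifying keys) are hashable; the function and variable names are only illustrative:
def sync_with_dictionary(master, slave):
    # Build the dictionary (a hash set here) from the master list once.
    master_keys = set(master)

    # Remove slave entries that no longer exist in the master list.
    slave[:] = [item for item in slave if item in master_keys]

    # Add master entries that are missing from the slave list.
    slave_keys = set(slave)
    slave.extend(item for item in master if item not in slave_keys)
    return slave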
It looks like a fellow named Michael Heyeck has a good, O(n) solution to this problem. Check out that blog post for an explanation and some code.
Essentially, the solution tracks both the master and slave lists in a single pass, tracking indices into each. Two data structures are then managed: a list of insertions to be replayed on the slave list, and a list of deletions.
It looks straightforward and also has the benefit of a proof of minimalism, which Heyeck followed up with in a subsequent post. The code snippet in this post is more compact, as well:
def sync_ordered_list(a, b):
    x = 0; y = 0; i = []; d = []
    while (x < len(a)) or (y < len(b)):
        if y >= len(b): d.append(x); x += 1
        elif x >= len(a): i.append((y, b[y])); y += 1
        elif a[x] < b[y]: d.append(x); x += 1
        elif a[x] > b[y]: i.append((y, b[y])); y += 1
        else: x += 1; y += 1
    return (i, d)
Again, credit to Michael Heyeck.
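For what it's worth, the (inserts, deletes) pair returned by sync_ordered_list above can be replayed on the subject list with a small helper like the following (my addition, not Heyeck's); deletions are applied from the highest index down so earlier indices stay valid:
def apply_sync(a, inserts, deletes):
    # Apply deletions back-to-front so positions don't shift.
    for idx in sorted(deletes, reverse=True):
        del a[idx]
    # Insert positions are indices into the target list, already in ascending order.
    for pos, value in inserts:
        a.insert(pos, value)
    return a

a = [1, 2, 8, 9, 19, 22, 23, 26]
b = [1, 3, 8, 12, 16, 19, 22, 24, 26]
inserts, deletes = sync_ordered_list(a, b)
print(apply_sync(a, inserts, deletes) == b)   # True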
In the C++ STL the algorithm is called set_union. Also, implementing the algorithm is likely to be a lot simpler if you do the union into a 3rd list.
I had such a problem in one project in the past.
That project had one master data source and several clients that updated the data independently, and in the end all of them had to have the latest, unified set of data, i.e. the sum of all their changes.
What I did was build something similar to the SVN protocol: every time I wanted to update the master database (which was accessible through a web service), I got the revision number, updated my local data store to that revision, and then committed the entities that weren't covered by any revision number to the database.
Every client has the ability to update its local data store to the latest revision.
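A loose Python sketch of that revision-based flow; the web-service calls (fetch_changes, commit) and the entity layout are hypothetical stand-ins for whatever the master data source actually exposes:
class RevisionedClient:
    def __init__(self, service):
        self.service = service   # hypothetical master web service
        self.revision = 0        # last revision this client has seen
        self.store = {}          # local data store: id -> entity
        self.pending = []        # local entities not yet covered by any revision

    def sync(self):
        # 1. Pull everything that changed on the master since our revision.
        changes, latest = self.service.fetch_changes(since=self.revision)
        for entity_id, entity in changes:
            if entity is None:
                self.store.pop(entity_id, None)   # deleted upstream
            else:
                self.store[entity_id] = entity
        self.revision = latest

        # 2. Push local entities that are not covered by any revision yet.
        for entity in self.pending:
            self.service.commit(entity)
        self.pending.clear()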
Here is a JavaScript version of Michael Heyeck's Python code.
var b = [1, 3, 8, 12, 16, 19, 22, 24, 26]; // new situation
var a = [1, 2, 8, 9, 19, 22, 23, 26];      // previous situation
var result = sync_ordered_lists(a, b);
console.log(result);

function sync_ordered_lists(a, b) {
  // by Michael Heyeck, see http://www.mlsite.net/blog/?p=2250
  // a is the subject list
  // b is the target list
  // x is the "current position" in the subject list
  // y is the "current position" in the target list
  // i is the list of inserts
  // d is the list of deletes
  var x = 0;
  var y = 0;
  var i = [];
  var d = [];
  var acc = {}; // object containing the inserts and deletes arrays
  while (x < a.length || y < b.length) {
    if (y >= b.length) {
      d.push(x);
      x++;
    } else if (x >= a.length) {
      i.push([y, b[y]]);
      y++;
    } else if (a[x] < b[y]) {
      d.push(x);
      x++;
    } else if (a[x] > b[y]) {
      i.push([y, b[y]]);
      y++;
    } else {
      x++; y++;
    }
  }
  acc.inserts = i;
  acc.deletes = d;
  return acc;
}
A very brute-force and purely technical approach:
Inherit from your List class (sorry, I don't know what your language is). Override the add/remove methods in your child list class and use your class instead of the base one. Now you can track changes with your own methods and synchronize the lists on-line.
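As a sketch of what that could look like (shown in Python here; the class name and the bookkeeping are deliberately minimal and made up for illustration):
class TrackingList(list):
    """A list that records its own additions and removals,
    so a later sync pass can replay them elsewhere."""

    def __init__(self, *args):
        super().__init__(*args)
        self.added = []
        self.removed = []

    def append(self, item):
        self.added.append(item)
        super().append(item)

    def remove(self, item):
        self.removed.append(item)
        super().remove(item)

    def consume_changes(self):
        # Hand the accumulated changes to the synchronizer and reset.
        changes = (self.added, self.removed)
        self.added, self.removed = [], []
        return changes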