R csv.bz2 Shell Windows counting number of lines - windows

I'm having problems counting the number of lines in a messy csv.bz2 file.
Since this is a huge file, I want to be able to preallocate a data frame before reading the bzip2 file with the read.csv() function.
As you can see in the following tests, my results vary widely, and none of them corresponds to the actual number of rows in the csv.bz2 file.
> system.time(nrec1 <- as.numeric(shell('type "MyFile.csv" | find /c ","', intern=T)))
user system elapsed
0.02 0.00 53.50
> nrec1
[1] 1060906
> system.time(nrec2 <- as.numeric(shell('type "MyFile.csv.bz2" | find /c ","', intern=T)))
user system elapsed
0.00 0.02 10.15
> nrec2
[1] 126715
> system.time(nrec3 <- as.numeric(shell('type "MyFile.csv" | find /v /c ""', intern=T)))
user system elapsed
0.00 0.02 53.10
> nrec3
[1] 1232705
> system.time(nrec4 <- as.numeric(shell('type "MyFile.csv.bz2" | find /v /c ""', intern=T)))
user system elapsed
0.00 0.01 4.96
> nrec4
[1] 533062
The most interesting result is the one I called nrec4, since it takes the least time and returns roughly half the number of rows of nrec1, but I'm totally unsure whether naively multiplying by 2 will be OK.
I have tried several other methods, including fread() and hsTableReader(), but the former crashes and the latter is so slow that I won't even consider it further.
My questions are:
Which reliable method can I use for counting the number of rows in a csv.bz2 file?
Is it OK to use a formula for calculating the number of rows directly from a csv.bz2 file without decompressing it?
Thanks in advance,
Diego

Roland was right from the beginning.
Even when calling the garbage collector, the illusion of improved performance remained.
I had to close and restart R to run an accurate test.
Yes, the process is still a few seconds faster (red line), and the increase in RAM consumption is more uniform when using nrows.
But at least in this case, it is not worth the effort of trying to optimize the read.csv() call.
It is slow, but it is what it is.
If anyone knows of a faster approach, I'm interested.
Just in case it matters: fread() crashes.
Thanks.
Without nrows (Blue Line)
Sys.time()
system.time(storm.data <- read.csv(fileZip,
header = TRUE,
stringsAsFactors = F,
comment.char = "",
colClasses = "character"))
Sys.time()
rm(storm.data)
gc()
With nrows (Red Line)
Sys.time()
system.time(nrec12 <- as.numeric(
shell('type "MyFile.csv.bz2" | find /v /c ""',
intern=T)))
nrec12 <- nrec12 * 2
system.time(storm.data <- read.csv(fileZip,
stringsAsFactors = F,
comment.char = "",
colClasses = "character",
nrows = nrec12))
Sys.time()
rm(storm.data)
gc()
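On the row-count question itself: piping the .bz2 file through find counts lines (or lines containing a comma) in the compressed bytes, which says nothing about the rows of the decompressed CSV, so no fixed multiplier can be relied on. Below is a minimal sketch in Python of counting after streaming decompression. The function names, the chunk size and the UTF-8 encoding are assumptions made for illustration, not anything from the post. The second function counts CSV records rather than raw lines, which may matter for a messy file with line breaks inside quoted fields.

```python
import bz2
import csv


def count_lines_bz2(path, chunk_size=1 << 20):
    """Count newline bytes in the decompressed stream, one chunk at a time."""
    count = 0
    with bz2.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_csv_records_bz2(path, encoding="utf-8"):
    """Count CSV records; a quoted field containing a newline is one record."""
    with bz2.open(path, "rt", encoding=encoding, newline="") as fh:
        return sum(1 for _ in csv.reader(fh))
```

Both functions read the file as a stream, so memory use stays flat regardless of file size. The record count includes the header row, so subtract one before passing it as nrows.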


Is it possible to vectorize annotation for matplotlib?

As a part of a large QC benchmark I am creating a large number (approx 100K) of scatter plots in a single PDF using PdfPages backend. (See further down for the code)
The issue I am having is that the plotting takes too much time, see output from a custom profiling/debugging effort:
Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
The plotting time increases 2-3x if I annotate the points, which is necessary for the use case. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not change the times significantly, as far as I can see.
def annotate(row, ax):
    ax.annotate(row.name, (row.exp, row.model),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')
def plot2File(df, file, seq, z, p, s):
    """ Plot predictions vs experimental """
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)
    # for row in df.itertuples():
    #     ax.annotate(row.Index, (row.exp, row.model),
    #                 xytext=(10, 20), textcoords='offset points',
    #                 arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
    #                 family='sans-serif', fontsize=8, color='darkslategrey')
    plt.savefig(file, bbox_inches='tight', format='pdf')
    plt.close()
Given the nice explanation by Jeff on a question regarding iterrows() I was wondering if it would be possible to vectorize the annotation process? Or should I ditch using a data frame altogether?

cplex prints a lot to terminal although corresponding parameters are set

I am using CPLEX in C++.
After googling, I found which parameters need to be set to stop CPLEX from printing to the terminal, and I use them like this:
IloCplex cplex(model);
std::ofstream logfile("cplex.log");
cplex.setOut(logfile);
cplex.setWarning(logfile);
cplex.setError(logfile);
cplex.setParam(IloCplex::MIPInterval, 1000);//Controls the frequency of node logging when MIPDISPLAY is set higher than 1.
cplex.setParam(IloCplex::MIPDisplay, 0);//MIP node log display information-No display until optimal solution has been found
cplex.setParam(IloCplex::SimDisplay, 0);//No iteration messages until solution
cplex.setParam(IloCplex::BarDisplay, 0);//No progress information
cplex.setParam(IloCplex::NetDisplay, 0);//Network logging display indicator
if ( !cplex.solve() ) {
....
}
Yet CPLEX still prints things like this:
Warning: Bound infeasibility column 'x11'.
Presolve time = 0.00 sec. (0.00 ticks)
Root node processing (before b&c):
Real time = 0.00 sec. (0.01 ticks)
Parallel b&c, 4 threads:
Real time = 0.00 sec. (0.00 ticks)
Sync time (average) = 0.00 sec.
Wait time (average) = 0.00 sec.
------------
Total (root+branch&cut) = 0.00 sec. (0.01 ticks)
Is there any way to avoid printing them?
Use setOut method from IloAlgorithm class (IloCplex inherits from IloAlgorithm). You can set a null output stream as a parameter and prevent logging the message on the screen.
This is what works in C++ according to cplex parameters doc:
cplex.setOut(env.getNullStream());
cplex.setWarning(env.getNullStream());
cplex.setError(env.getNullStream());

Rolling list over unequal times in XTS

I have stock data at the tick level and would like to create a rolling list of all ticks for the previous 10 seconds. The code below works, but takes a very long time for large amounts of data. I'd like to vectorize this process or otherwise make it faster, but I'm not coming up with anything. Any suggestions or nudges in the right direction would be appreciated.
library(quantmod)
set.seed(150)
# Create five minutes of xts example data at .1 second intervals
mins <- 5
ticks <- mins * 60 * 10 + 1
times <- xts(runif(seq_len(ticks),1,100), order.by=seq(as.POSIXct("1973-03-17 09:00:00"),
as.POSIXct("1973-03-17 09:05:00"), length = ticks))
# Randomly remove some ticks to create unequal intervals
times <- times[runif(seq_along(times))>.3]
# Number of seconds to look back
lookback <- 10
dist.list <- list(rep(NA, nrow(times)))
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
> user system elapsed
6.12 0.00 5.85
You should check out the window function, it will make your subselection of dates a lot easier. The following code uses lapply to do the work of the for loop.
# Your code
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
# user system elapsed
# 10.09 0.00 10.11
# My code
system.time(dist.list<-lapply(index(times),
function(x) window(times,start=x-lookback-1,end=x))
)
# user system elapsed
# 3.02 0.00 3.03
So it runs in about a third of the time.
But, if you really want to speed things up, and you are willing to forgo millisecond accuracy (which I think your original method implicitly does), you could just run the loop on unique date-hour-second combinations, because they will all return the same time window. This should speed things up roughly twenty or thirty times:
dat.time=unique(as.POSIXct(as.character(index(times)))) # Cheesy method to drop the ms.
system.time(dist.list.2<-lapply(dat.time,function(x) window(times,start=x-lookback-1,end=x)))
# user system elapsed
# 0.37 0.00 0.39
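Because the timestamps are already sorted, the window for each tick can also be found in a single pass with two moving pointers instead of a fresh lookup per tick. Here is a minimal sketch of that idea in Python rather than xts. The function name and the half-open window [t - lookback, t) are assumptions chosen for illustration; they do not reproduce the exact boundaries of either snippet above.

```python
def rolling_window_bounds(timestamps, lookback):
    """For each tick time t, return (start, end) such that
    timestamps[start:end] holds the ticks with t - lookback <= time < t.

    timestamps must be sorted in ascending order (numbers, e.g. epoch seconds).
    """
    bounds = []
    start = 0  # first index still inside the lookback window
    end = 0    # first index with a time that is not strictly before t
    for t in timestamps:
        # Both pointers only ever move forward, so the whole pass is O(n).
        while timestamps[start] < t - lookback:
            start += 1
        while timestamps[end] < t:
            end += 1
        bounds.append((start, end))
    return bounds
```

The function returns index pairs only, so the slices can be taken lazily when they are needed instead of materializing one list element per tick up front.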

faster way to create variable that aggregates a column by id [duplicate]

Is there a faster way to do this? I guess this is unnecessarily slow and that a task like this can be accomplished with base functions.
df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))
I'm quite new to R. I have looked at by(), aggregate() and tapply(), but didn't get them to work at all or in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original dataframe. What is the best way to do this?
Edit: Here is a speed comparison of the answers applied to my data.
> # My original solution
> system.time( ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc))) )
user system elapsed
14.405 0.000 14.479
> # Paul Hiemstra
> system.time( ddply(df, "id", transform, perc.total = sum(cand.perc)) )
user system elapsed
15.973 0.000 15.992
> # Richie Cotton
> system.time( with(df, tapply(df$cand.perc, df$id, sum))[df$id] )
user system elapsed
0.048 0.000 0.048
> # John
> system.time( with(df, ave(cand.perc, id, FUN = sum)) )
user system elapsed
0.032 0.000 0.030
> # Christoph_J
> system.time( df[ , list(perc.total = sum(cand.perc)), by="id"][df])
user system elapsed
0.028 0.000 0.028
Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:
library(data.table)
DT <- data.table(ID = rep(c(1:3), each=3),
cand.perc = 1:9,
key="ID")
DT <- DT[ , perc.total := sum(cand.perc), by = ID]
DT
      ID Perc.total cand.perc
 [1,]  1          6         1
 [2,]  1          6         2
 [3,]  1          6         3
 [4,]  2         15         4
 [5,]  2         15         5
 [6,]  2         15         6
 [7,]  3         24         7
 [8,]  3         24         8
 [9,]  3         24         9
Disclaimer: I'm not a data.table expert (yet ;-), so there might be faster ways to do that. Check out the package site to get you started if you are interested in using the package: http://datatable.r-forge.r-project.org/
For any kind of aggregation where you want a result vector the same length as the input vector, with each group's value replicated across the members of that group, ave is what you want.
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
Use tapply to get the group stats, then add them back into your dataset afterwards.
Reproducible example:
means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]
Untested solution for your scenario:
sum_by_id <- with(df, tapply(cand.perc, id, sum))
df$perc.total <- sum_by_id[df$id]
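The two-step pattern above (compute one total per group, then index back into the totals by each row's group) is what ave and tapply do under the hood in spirit. A language-neutral sketch of it in plain Python follows; the function name and the sample data are hypothetical.

```python
def group_sum_broadcast(ids, values):
    """Return a list the same length as the input, where each position holds
    the sum of `values` over all rows that share that row's id."""
    totals = {}
    # Step 1: one pass to accumulate a total per group.
    for key, value in zip(ids, values):
        totals[key] = totals.get(key, 0) + value
    # Step 2: look the group total up again for every row.
    return [totals[key] for key in ids]
```

Called as group_sum_broadcast([1, 1, 1, 2, 2, 2, 3, 3, 3], range(1, 10)), it gives 6, 15 and 24 repeated three times each, matching the perc.total column in the data.table output. Each row is touched twice, which is why these approaches are so much faster than splitting the data frame and recombining it per group.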
ilprincipe, if none of the above fits your needs, you could try transposing your data:
dft=t(df)
then use aggregate:
dfta=aggregate(dft,by=list(rownames(dft)),FUN=sum)
Next, restore your row names:
rownames(dfta)=dfta[,1]
dfta=dfta[,2:ncol(dfta)]
Transpose back to the original orientation:
df2=t(dfta)
and bind to the original data:
newdf=cbind(df,df2)
Why are you using cbind(x, ...)? The output of ddply will be appended automatically. This should work:
ddply(df, "id", transform, perc.total = sum(cand.perc))
Getting rid of the superfluous cbind should speed things up.
You can also load up your favorite foreach backend and try the .parallel=TRUE argument for ddply.

Is it faster to give Perl's print a list or a concatenated string?

option A:
print $fh $hr->{'something'}, "|", $hr->{'somethingelse'}, "\n";
option B:
print $fh $hr->{'something'} . "|" . $hr->{'somethingelse'} . "\n";
The answer is simple, it doesn't matter. As many folks have pointed out, this is not going to be your program's bottleneck. Optimizing this to even happen instantly is unlikely to have any effect on your performance. You must profile first, otherwise you are just guessing and wasting your time.
If we are going to waste time on it, let's at least do it right. Below is the code to do a realistic benchmark. It actually does the print and sends the benchmarking information to STDERR. You run it as perl benchmark.plx > /dev/null to keep the output from flooding your screen.
Here's 5 million iterations writing to STDOUT. By using both timethese() and cmpthese() we get all the benchmarking data.
$ perl ~/tmp/bench.plx 5000000 > /dev/null
Benchmark: timing 5000000 iterations of concat, list...
concat: 3 wallclock secs ( 3.84 usr + 0.12 sys = 3.96 CPU) # 1262626.26/s (n=5000000)
list: 4 wallclock secs ( 3.57 usr + 0.12 sys = 3.69 CPU) # 1355013.55/s (n=5000000)
Rate concat list
concat 1262626/s -- -7%
list 1355014/s 7% --
And here's 5 million writing to a temp file
$ perl ~/tmp/bench.plx 5000000
Benchmark: timing 5000000 iterations of concat, list...
concat: 6 wallclock secs ( 3.94 usr + 1.05 sys = 4.99 CPU) # 1002004.01/s (n=5000000)
list: 7 wallclock secs ( 3.64 usr + 1.06 sys = 4.70 CPU) # 1063829.79/s (n=5000000)
Rate concat list
concat 1002004/s -- -6%
list 1063830/s 6% --
Note the extra wallclock and sys time underscoring how what you're printing to matters as much as what you're printing.
The list version is about 5% faster (note this is counter to Pavel's logic underlining the futility of trying to just think this stuff through). You said you're doing tens of thousands of these? Let's see... 100k takes 146ms of wallclock time on my laptop (which has crappy I/O) so the best you can do here is to shave off about 7ms. Congratulations. If you spent even a minute thinking about this it will take you 40k iterations of that code before you've made up that time. This is not to mention the opportunity cost, in that minute you could have been optimizing something far more important.
Now, somebody's going to say "now that we know which way is faster we should write it the fast way and save that time in every program we write making the whole exercise worthwhile!" No. It will still add up to an insignificant portion of your program's run time, far less than the 5% you get measuring a single statement. Second, logic like that causes you to prioritize micro-optimizations over maintainability.
Oh, and it's different in 5.8.8 than in 5.10.0.
$ perl5.8.8 ~/tmp/bench.plx 5000000 > /dev/null
Benchmark: timing 5000000 iterations of concat, list...
concat: 3 wallclock secs ( 3.69 usr + 0.04 sys = 3.73 CPU) # 1340482.57/s (n=5000000)
list: 5 wallclock secs ( 3.97 usr + 0.06 sys = 4.03 CPU) # 1240694.79/s (n=5000000)
Rate list concat
list 1240695/s -- -7%
concat 1340483/s 8% --
It might even change depending on what Perl I/O layer you're using and operating system. So the whole exercise is futile.
Micro-optimization is a fool's game. Always profile first and look to optimizing your algorithm. Devel::NYTProf is an excellent profiler.
#!/usr/bin/perl -w
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
#open my $fh, ">", "/tmp/test.out" or die $!;
#open my $fh, ">", "/dev/null" or die $!;
my $fh = *STDOUT;
my $hash = {
    foo => "something and stuff",
    bar => "and some other stuff"
};
select *STDERR;
my $r = timethese(shift || -3, {
    list => sub {
        print $fh $hash->{foo}, "|", $hash->{bar};
    },
    concat => sub {
        print $fh $hash->{foo}. "|". $hash->{bar};
    },
});
cmpthese($r);
Unless you are executing millions of these statements, the performance difference will not matter. I really suggest concentrating on performance problems where they do exist - and the only way to find that out is to profile your application.
Premature optimization is something that Joel and Jeff had a podcast on, and whined about, for years. It's just a waste of time to try to optimize something until you KNOW that it's slow.
Perl is a high-level language, and as such the statements you see in the source code don't map directly to what the computer is actually going to do. You might find that a particular implementation of perl makes one thing faster than the other, but there's no guarantee that another implementation won't take away the advantage (although they try not to make things slower).
If you're worried about I/O speed, there are a lot more interesting and useful things to tweak before you start worrying about commas and periods. See, for instance, the discussion under Perl write speed mystery.
UPDATE:
I just ran my own test.
1,000,000 iterations of each version took less than 1 second each.
10 million iterations of each version took an average of 2.35 seconds for the list version vs. 2.1 seconds for the string concat version.
Have you actually tried profiling this? Only takes a few seconds.
On my machine, it appears that B is faster. However, you should really have a look at Pareto Analysis. You've already wasted far, far more time thinking about this question than you'd ever save in any program run. For problems as trivial as this (character substitution!), you should wait to care until you actually have a problem.
Of the three options, I would probably choose string interpolation first and switch to commas for expressions that cannot be interpolated. This, humorously enough, means that my default choice is the slowest of the bunch, but given that they are all so close to each other in speed and that disk speed is probably going to be slower than anything else, I don't believe changing the method has any real performance benefits.
As others have said, write the code, then profile the code, then examine the algorithms and data structures you have chosen that are in the slow parts of the code, and, finally, look at the implementation of the algorithms and data structures. Anything else is foolish micro-optimizing that wastes more time than it saves.
You may also want to read perldoc perlperf
           Rate string concat comma
string 803887/s     --    -0%   -7%
concat 803888/s     0%     --   -7%
comma  865570/s     8%     8%    --
#!/usr/bin/perl
use strict;
use warnings;
use Carp;
use List::Util qw/first/;
use Benchmark;
sub benchmark {
    my $subs = shift;
    my ($k, $sub) = each %$subs;
    my $value = $sub->();
    croak "bad" if first { $value ne $_->() and print "$value\n", $_->(), "\n" } values %$subs;
    Benchmark::cmpthese -1, $subs;
}
sub fake_print {
    # this, plus writing output to the screen, is what print does
    no warnings;
    my $output = join $,, @_;
    return $output;
}
my ($x, $y) = ("a", "b");
benchmark {
    comma => sub { return fake_print $x, "|", $y, "\n" },
    concat => sub { return fake_print $x . "|" . $y . "\n" },
    string => sub { return fake_print "$x|$y\n" },
};
