gnuplot: speeding up processing of data read from a file (performance)

OSX v10.10.5 and Gnuplot v5.0
I have a data file with three columns of numbers and I read the values that are stored inside to do some calculations. But it is time consuming!
Here is what I have so far:
#user defined function to read data in a file
#see stackoverflow: "Reading dataset value into a gnuplot variable (start of X series)"
at(file, row, col) = system( sprintf("awk -v row=%d -v col=%d 'NR == row {print $col}' %s", row, col, file) )
file="myFile"
do for [k=1:10] { #we read line by line and we want the ratio between column 2/1 and 3/1
f(k) = at(file,k,2)/at(file,k,1)
g(k) = at(file,k,3)/at(file,k,1)
# example of calculation: least square to find the best "i"
do for [i=1:10] {
f1(i) = (a*i**2 + b*i + c) #function for the least square. a,b,c: floats
g1(i) = (d*i**2 + e*i + f) #d,e,f: floats
h(i) = sqrt( (f1(i)-f(k))**2 + (g1(i)-g(k))**2 )
if (h(i)<hMin) {
hMin=h(i)
}
else {}
} #end loop i
print i," ",hMin
} #end loop k
It works, but as I said it takes time (around 2 min for each k). When I do no calculation and only print f(k), g(k), it takes well under 1 s. I therefore suspected that the division produces too many digits and makes the calculation inefficient, so I used a round2 function to keep only the first n=4 decimals:
#see stackoverflow: How to use floor function in gnuplot
round(x) = x - floor(x) < 0.5 ? floor(x) : ceil(x)
round2(x, n) = round(x*10**n)*10.0**(-n)
f(k) = round2((at(file,k,2)/at(file,k,1)),4)
g(k) = round2((at(file,k,3)/at(file,k,1)),4)
but it did not change the required time.
Any idea about what's going on?

You did not post the full code (definitions for a, b, ..., f are missing). But within the part of the code you have posted I think you can avoid calling awk that often. You can replace the functions f(k) and g(k) by simple variables fk and gk, because in fact they are constant within each k-iteration. There seems to be no need to recalculate them within each i-iteration.
#user defined function to read data in a file
#see stackoverflow: "Reading dataset value into a gnuplot variable (start of X series)"
at(file, row, col) = system( sprintf("awk -v row=%d -v col=%d 'NR == row {print $col}' %s", row, col, file) )
file="myFile"
do for [k=1:10] { #we read line by line and we want the ratio between column 2/1 and 3/1
at1 = at(file,k,1)
fk = at(file,k,2)/at1
gk = at(file,k,3)/at1
# example of calculation: least square to find the best "i"
do for [i=1:10] {
f1i = (a*i**2 + b*i + c) #function for the least square. a,b,c: floats
g1i = (d*i**2 + e*i + f) #d,e,f: floats
hi = sqrt( (f1i-fk)**2 + (g1i-gk)**2 )
if (hi<hMin) {
hMin=hi
} else {
}
} #end loop i
print i," ",hMin
} #end loop k
But there might be details in the missing code that prevent this solution from working.
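The same idea (compute loop-invariant values once, and read the file once instead of spawning awk per value) can be sketched outside gnuplot. The following is only an illustration in Python, not the asker's code: the function names `best_fit_index` and `read_rows`, the coefficient tuples and the three-column file layout are assumptions made for the example.

```python
import math

def read_rows(path):
    """Read a whitespace-separated three-column file once (assumed layout)."""
    with open(path) as fh:
        return [tuple(float(v) for v in line.split()[:3]) for line in fh if line.strip()]

def best_fit_index(rows, coef_f, coef_g, candidates=range(1, 11)):
    """For each (c1, c2, c3) row return (best_i, h_min), the i with the smallest distance h."""
    a, b, c = coef_f
    d, e, f = coef_g
    # The model values depend only on i, so they are computed once, outside the row loop.
    model = [(i, a * i**2 + b * i + c, d * i**2 + e * i + f) for i in candidates]
    results = []
    for c1, c2, c3 in rows:
        fk, gk = c2 / c1, c3 / c1  # constant for this row: hoisted out of the i-loop
        best_i, h_min = min(
            ((i, math.hypot(f1 - fk, g1 - gk)) for i, f1, g1 in model),
            key=lambda pair: pair[1],
        )
        results.append((best_i, h_min))
    return results
```

Unlike the loop in the question, this sketch records which i gave the minimum and starts a fresh minimum for every row.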


subscript indices must be either positive integers less than 2^31 or logicals

I keep getting errors in the loop when solving by the finite difference method.
I either get the following error when I start with i = 2 : N :
diffusion: A(I,J): row index out of bounds; value 2 out of bound 1
error: called from
diffusion at line 37 column 10 % note line change due to edit!
or I get the following error when I do i = 2 : N :
subscript indices must be either positive integers less than 2^31 or logicals
error: called from
diffusion at line 37 column 10 % note line change due to edit!
Please help
clear all; close all;
% mesh in space
dx = 0.1;
x = 0 : dx : 1;
% mesh in time
dt = 1 / 50;
t0 = 0;
tf = 10;
t = t0 : dt : tf;
% diffusivity
D = 0.5;
% number of nodes
N = 11;
% number of iterations
M = 10;
% initial conditions
if x <= .5 && x >= 0 % note, in octave, you don't need parentheses around the test expression
u0 = x;
elseif
u0 = 1-x;
endif
u = u0;
alpha = D * dt / (dx^2);
for j = 1 : M
for i = 1 : N
u(i, j+1) = u(i, j ) ...
+ alpha ...
* ( u(i-1, j) ...
+ u(i+1, j) ...
- 2 ...
* u(i, j) ...
) ;
end
u(N+1, j+1) = u(N+1, j) ...
+ alpha ...
* ( ...
u(N, j) ...
- 2 ...
* u(N+1, j) ...
+ u(N, j) ...
) ;
% boundary conditions
u(0, :) = u0;
u(1, :) = u1;
u1 = u0;
u0 = 0;
end
% exact solution with 14 terms
%k=14 % COMMENTED OUT
v = (4 / ((k * pi) .^ 2)) ...
* sin( (k * pi) / 2 ) ...
* sin( k * pi * x ) ...
* exp .^ (D * ((k * pi) ^ 2) * t) ;
exact = symsum( v, k, 1, 14 );
error = exact - u;
% plot stuff
plot( t, error );
xlabel( 'time' );
ylabel( 'error' );
legend( 't = 1 / 50' );
Have a look at the edited code I cleaned up for you above and study it.
Don't underestimate the importance of clean, readable code when hunting for bugs.
It will save you more time than it will cost. Especially a week from now when you will need to revisit this code and you will not remember at all what you were trying to do.
Now regarding your errors. (All references below are to statements in the cleaned-up code above.)
Scenario 1:
In the statement u = u0; you initialise u as a single value.
If you start the inner loop (for i = ...) with i = 2, then as soon as you try to do u(i, j+1), i.e. u(2,2), in the next statement, Octave will complain that you're trying to index the second row in an array that so far only contains one row. (In fact, the same applies to j, since at this point you only have one column as well.)
Scenario 2:
I assume the second scenario was a typo and you meant to say i = 1 : N.
If you start with i = 1 in the loop, then have a look at the term u(i-1, j) in the update: you are trying to get element u(0,1). Octave will therefore complain that you're trying to get the zero element, but in Octave array indices start from one and index zero is not defined. Attempting to index any array with zero will result in the error you see (try it in a terminal!).
UPDATE
Also, now that the code is clean, you can spot another bug, which Octave helpfully warns you about if you try to run the code.
Look at the elseif line in the initial conditions. There is NO condition in the elseif leg, so Octave takes the next statement as the test condition.
This means that the elseif condition will always succeed as long as the result of u0 = 1-x is non-zero.
This is clearly a bug. Either you forgot to put the condition for the elseif or, more likely, you just meant to say else rather than elseif.
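To make the indexing point concrete, here is a minimal sketch of an explicit finite-difference update that only touches indices that exist. It is written in Python (0-based indexing) rather than Octave, and the function name `diffuse`, the fixed zero boundary values and the time step used in the example are assumptions for illustration, not the asker's intended scheme.

```python
def diffuse(u0, alpha, steps):
    """Explicit finite-difference steps for u_t = D*u_xx, with u held at 0 on both ends."""
    u = list(u0)
    n = len(u)
    for _ in range(steps):
        new = u[:]                      # copy, so the update reads only old values
        for i in range(1, n - 1):       # interior nodes only: i-1 and i+1 always exist
            new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
        new[0] = new[-1] = 0.0          # boundary values set explicitly after the sweep
        u = new
    return u

# Example setup: triangular initial profile on 11 nodes, as in the question.
dx, D = 0.1, 0.5
dt = 1 / 200                            # hypothetical time step, chosen so alpha = 0.25
alpha = D * dt / dx**2
x = [i * dx for i in range(11)]
u_start = [xi if xi <= 0.5 else 1 - xi for xi in x]
u_end = diffuse(u_start, alpha, 10)
```

With the question's own values (D = 0.5, dt = 1/50, dx = 0.1), alpha comes out as 1.0. This explicit scheme is generally stable only for alpha <= 0.5, which is why the example uses a smaller time step.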

MapReduce fundamentals

1)
map(nr, txt)
    words = split(txt, ' ')
    for (i = 0; i < |words| - 1; i++)
        emit(words[i] + ' ' + words[i+1], 1)
reduce(key, vals)
    s = 0
    for v : vals
        s += v
    if (s = 5)
        emit(key, s)
2)
map(nr, txt)
    words = split(txt, ' ')
    for (i = 0; i < |words|; i++)
        emit(txt, length(words[i]))
reduce(key, vals)
    s = 0
    c = 0
    for v : vals
        s += v
        c += 1
    r = s/c
    emit(key, r)
I am new to MapReduce, and I am not able to tell whether the if condition in code (1) will ever be satisfied.
Q1: What does the MapReduce function do in each of the two code blocks?
Could you please give any input on the above question?
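The first block emits every bigram (pair of adjacent words) with a count of 1, and the reducer sums those counts. Reading s = 5 as an equality test, the if condition is satisfied only for bigrams that occur exactly 5 times in the input, so it can be satisfied. If "at least 5 times" was intended, the test would need to be s >= 5.
The second block emits, for every word, the whole input line (txt) as the key and the length of that word as the value. The reducer therefore computes the average word length for each distinct line of text. Had the key been the word itself (words[i]), the average would do nothing useful, because all values under one key would be identical (seeing "foo" 1000 times still gives a length of 3).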
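The behaviour of both jobs can be checked with a tiny in-memory simulation. This is a sketch in Python, not a real MapReduce framework: `run_job` is a hypothetical stand-in for the shuffle phase, and the pseudocode's `s = 5` is assumed to mean an equality test.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Group mapped values by key, then reduce each group (stand-in for the shuffle)."""
    groups = defaultdict(list)
    for nr, txt in enumerate(records):
        for key, value in mapper(nr, txt):
            groups[key].append(value)
    out = {}
    for key, vals in groups.items():
        for k, v in reducer(key, vals):
            out[k] = v
    return out

def bigram_map(nr, txt):
    words = txt.split(" ")
    for i in range(len(words) - 1):
        yield words[i] + " " + words[i + 1], 1

def exactly_five_reduce(key, vals):
    s = sum(vals)
    if s == 5:                # assumption: `s = 5` in the pseudocode is an equality test
        yield key, s

def word_length_map(nr, txt):
    for word in txt.split(" "):
        yield txt, len(word)  # the key is the whole line, as in the pseudocode

def average_reduce(key, vals):
    yield key, sum(vals) / len(vals)
```

For example, feeding job 1 five copies of the line "a b" and six copies of "c d" keeps only "a b", and job 2 maps the line "foo quux" to its average word length.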

How to use "column" to center a chart?

I was wondering what the best way is to format a table with the column command so that each column is centered instead of left aligned, which is the default. I have been using the column -t filename command.
Current Output:
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Desired Output: Something like this
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Basically, I want it to be centered instead of left aligned.
Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
data[NR] = $0
for (i = 1; i <= NF; ++i)
max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
for (i = 1; i <= NR; ++i) {
# Split record into array "arr"
split(data[i], arr)
# Loop over array
for (j = 1; j <= NF; ++j) {
# Calculate amount of padding required
pad = max[j] - length(arr[j])
# Print field with appropriate padding, see below
printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
j == NF ? "" : "  "
}
# Newline at end of record
print ""
}
}
Called like this:
$ awk -f centre.awk infile
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : "  ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but the formatting string can be assembled like width = 5; "%" width "s" in that case.
Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
1) count the number of columns
2) [optional] make sure each line has the same number of columns
3) figure out the maximum length of data for each column
4) print each line using the max lengths
#!/usr/bin/env python3
import sys

def column():
    # Read file and split each line into fields (by whitespace)
    with open(sys.argv[1]) as f:
        lines = [line.split() for line in f]
    # Check that each line has the same number of fields
    num_fields = len(lines[0])
    for n, line in enumerate(lines, start=1):  # report 1-based line numbers
        if len(line) != num_fields:
            print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
            sys.exit(1)
    # Calculate the maximum length of each field
    max_column_widths = [0] * num_fields
    for line in lines:
        line_widths = (len(field) for field in line)
        max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]
    # Now print them centered using the max_column_widths
    spacing = 4
    format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
    for line in lines:
        print(format_spec.format(*line))

if __name__ == '__main__':
    column()

Algorithm to find an interval with the highest summed weight of weighted overlapping intervals

Well, I think it's hard to explain, so I've made a figure to show that.
As we can see in this figure, there are 6 intervals of time. Each one has its own weight: the higher the opacity, the higher the weight. I want an algorithm to find the interval with the highest summed weight. In the case of the figure, it would be the overlap of intervals 5 and 6, which is the area with the highest opacity.
1) Split each interval into start and end points.
2) Sort the points.
3) Start with a sum of 0.
4) Iterate through the points using a sweep-line algorithm:
    If you get a start point:
        Increase the sum by the weight of the corresponding interval.
        If the sum is higher than the best sum so far, store this start point and set a flag.
    If you get an end point:
        If the flag is set, store the stored start point and this end point with the current sum as the best interval so far, and reset the flag.
        Decrease the sum by the weight of the corresponding interval.
This is derived from the answer I wrote here, which is based on the unweighted version, i.e. finding the maximum number of overlapping intervals, rather than the maximum summed weight.
Example:
For this example:
The start / end points will be sorted as: (S = start, E = end)
1S, 1E, 2S, 3S, 2E, 3E, 4S, 5S, 4E, 6S, 5E, 6E
Iterating through them, you'll set the flag on 1S, 5S and 6S, and you'll store the respective intervals at 1E, 4E and 5E (which are the first end points you get to after the above start points).
You won't set the flag on 2S, 3S or 4S, as the sum will be lower than the best sum so far.
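The sweep described above can be sketched in Python as follows. The function name `best_weighted_overlap` and the `(start, end, weight)` tuple layout are assumptions for illustration. The sketch also assumes positive weights and closed intervals, so at equal coordinates start points are processed before end points.

```python
def best_weighted_overlap(intervals):
    """intervals: list of (start, end, weight) with start <= end and weight > 0.
    Returns (best_sum, start, end) for the region with the highest summed weight."""
    points = []
    for start, end, weight in intervals:
        points.append((start, 0, weight))   # kind 0 = start, sorts before ends at ties
        points.append((end, 1, weight))     # kind 1 = end
    points.sort()
    current = best = 0
    best_start = best_end = pending_start = None
    flag = False
    for coord, kind, weight in points:
        if kind == 0:
            current += weight
            if current > best:              # new best sum: remember where it began
                best = current
                pending_start = coord
                flag = True
        else:
            if flag:                        # first end point after a new best start
                best_start, best_end = pending_start, coord
                flag = False
            current -= weight
    return best, best_start, best_end
```

For the hypothetical input (0, 10, 1), (5, 15, 2), (12, 20, 4), the heaviest region is where the last two intervals overlap, from 12 to 15, with a summed weight of 6.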
The algorithm logic can be derived from the figure. Assuming that resolution of time intervals is 1 min, then an array can be created and used for all the calculations:
create the array of 24 * 60 elements and fill it with 0 weights;
for each time interval add the weight of this interval to the corresponding part of the array;
find a maximum summed weight by iterating the array;
iterate over the array again and output array index (time) with the maximal summed weight.
This algorithm can be modified for a slightly different task, if you need to have interval indices in the output. In this case the array should contain list of the input time interval indices as a second dimension (or it can be a separate array, depending on particular language).
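A minimal Python sketch of this array-based variant is shown below. The function name `busiest_minutes` and the `(start, end, weight)` layout are assumptions, with times given as whole minutes within one day and both end points treated as inclusive.

```python
def busiest_minutes(intervals, slots=24 * 60):
    """Accumulate interval weights per minute and return (peak, minutes_at_peak)."""
    totals = [0] * slots
    for start, end, weight in intervals:
        for t in range(start, end + 1):     # inclusive end point
            totals[t] += weight
    peak = max(totals)
    return peak, [t for t, w in enumerate(totals) if w == peak]
```

The cost grows with the total length of all intervals rather than with their number, which is why this version is expected to be slower than a sweep over the sorted end points when intervals are long.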
Update: I was curious whether this simple algorithm is significantly slower than the more elegant one suggested by @Dukeling. I coded both algorithms and created an input generator to estimate their performance.
Generator:
#!/bin/sh
awk -v n=$1 '
BEGIN {
tmax = 24 * 60; wmax = 100;
for (i = 0; i < n; i++) {
t1 = int(rand() * tmax);
t2 = int(rand() * tmax);
w = int(rand() * wmax);
if (t2 >= t1) {print t1, t2, w} else {print t2, t1, w}
}
}' | sort -n > i.txt
Algorithm #1:
#!/bin/sh
awk '
{t1[++i] = $1; t2[i] = $2; w[i] = $3}
END {
for (i in t1) {
for (t = t1[i]; t <= t2[i]; t++) {
W[t] += w[i];
}
}
Wmax = 0.;
for (t in W){
if (W[t] > Wmax) {Wmax = W[t]}
}
print Wmax;
for (t in W){
if (W[t] == Wmax) {print t}
}
}
' i.txt > a1.txt
Algorithm #2:
#!/bin/sh
awk '
{t1[++i] = $1; t2[i] = $2; w[i] = $3}
END {
for (i in t1) {
p[t1[i] "a" i] = i "S";
p[t2[i] "b" i] = i "E";
}
n = asorti(p, psorted, "@ind_num_asc");
W = 0.; Wmax = 0.; f = 0;
for (i = 1; i <= n; i++){
P = p[psorted[i] ];
k = int(P);
if (index(P, "S") > 0) {
W += w[k];
if (W > Wmax) {
f = 1;
Wmax = W;
to1 = t1[k]
}
}
else {
if (f != 0) {
to2 = t2[k];
f = 0
}
W -= w[k];
}
}
print Wmax, to1 "-" to2
}
' i.txt > a2.txt
Results:
$ ./gen.sh 1000
$ time ./a1.sh
real 0m0.283s
$ time ./a2.sh
real 0m0.019s
$ cat a1.txt
24618
757
$ cat a2.txt
24618 757-757
$ ./gen.sh 10000
$ time ./a1.sh
real 0m3.026s
$ time ./a2.sh
real 0m0.144s
$ cat a1.txt
252452
746
$ cat a2.txt
252452 746-746
$ ./gen.sh 100000
$ time ./a1.sh
real 0m34.127s
$ time ./a2.sh
real 0m1.999s
$ cat a1.txt
2484719
714
$ cat a2.txt
2484719 714-714
The simple one is roughly 15 to 20 times slower.

Convert Excel Column Number to Column Name in Matlab

I am using Excel 2007, which supports up to 16,384 columns. I would like to obtain the column name corresponding to a column number.
Currently, I am using the following code. However, this code only supports up to 256 columns. Any idea how to obtain the column name if the column number is greater than 256?
function loc = xlcolumn(column)
if isnumeric(column)
if column>256
error('Excel is limited to 256 columns! Enter an integer number <256');
end
letters = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'};
count = 0;
if column-26<=0
loc = char(letters(column));
else
while column-26>0
count = count + 1;
column = column - 26;
end
loc = [char(letters(count)) char(letters(column))];
end
else
letters = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'];
if size(column,2)==1
loc =findstr(column,letters);
elseif size(column,2)==2
loc1 =findstr(column(1),letters);
loc2 =findstr(column(2),letters);
loc = (26 + 26*loc1)-(26-loc2);
end
end
Thanks
As a diversion, here is an all function handle example, with (almost) no file-based functions required. This is based on the dec2base function, since Excel column names are (almost) base 26 numbers, with the frustrating difference that there are no "0" characters.
Note: this is probably a terrible idea overall, but it works. Better solutions are probably found elsewhere in the file exchange.
First, the one file based function that I couldn't get around, to perform arbitrary depth function composition.
function result = compose( fnHandles )
%COMPOSE Compose a set of functions
% COMPOSE({fnHandles}) returns a function handle consisting of the
% composition of the cell array of input function handles.
%
% For example, if F, G, and H are function handles with one input and
% one output, then:
% FNCOMPOSED = COMPOSE({F,G,H});
% y = FNCOMPOSED(x);
% is equivalent to
% y = F(G(H(x)));
if isempty(fnHandles)
result = @(x)x;
elseif length(fnHandles)==1
result = fnHandles{1};
else
fnOuter = fnHandles{1};
fnRemainder = compose(fnHandles(2:end));
result = @(x)fnOuter(fnRemainder(x));
end
Then, the bizarre, contrived path to convert base26 values into the correct string
%Functions leading to "getNumeric", which creates a numeric, base26 array
remapUpper = @(rawBase)(rawBase + (rawBase>='A')*(-55)); %Map the letters 'A-P' to [10:25]
reMapLower = @(rawBase)(rawBase + (rawBase<'A')*(-48)); %Map characters '0123456789' to [0:9]
getRawBase = @(x)dec2base(x, 26);
getNumeric = @(x)remapUpper(reMapLower(getRawBase(x)));
%Functions leading to "correctNumeric"
% This replaces zeros with 26, and reduces the high values entry by 1.
% Similar to "borrowing" as we learned in longhand subtraction
borrowDownFrom = @(x, fromIndex) [x(1:(fromIndex-1)) (x(fromIndex)-1) (x(fromIndex+1)+26) (x((fromIndex+2):end))];
borrowToIfNeeded = @(x, toIndex) (x(toIndex)<=0)*borrowDownFrom(x,toIndex-1) + (x(toIndex)>0)*(x); %Ugly numeric switch
getAllConditionalBorrowFunctions = @(numeric)arrayfun(@(index)@(numeric)borrowToIfNeeded(numeric, index),(2:length(numeric)),'uniformoutput',false);
getComposedBorrowFunction = @(x)compose(getAllConditionalBorrowFunctions(x));
correctNumeric = @(x)feval(getComposedBorrowFunction(x),x);
%Function to replace numerics with letters, and remove leading '@' (leading
%zeros)
numeric2alpha = @(x)regexprep(char(x+'A'-1),'^@','');
%Compose complete function
num2ExcelName = @(x)arrayfun(@(x)numeric2alpha(correctNumeric(getNumeric(x))), x, 'uniformoutput',false)';
Now test using some stressing transitions:
>> num2ExcelName([1:5 23:28 700:704 727:729 1024:1026 1351:1355 16382:16384])
ans =
'A'
'B'
'C'
'D'
'E'
'W'
'X'
'Y'
'Z'
'AA'
'AB'
'ZX'
'ZY'
'ZZ'
'AAA'
'AAB'
'AAY'
'AAZ'
'ABA'
'AMJ'
'AMK'
'AML'
'AYY'
'AYZ'
'AZA'
'AZB'
'AZC'
'XFB'
'XFC'
'XFD'
This function I wrote works for any number of columns (until Excel runs out of columns). It just requires a column number input (e.g. 16368 will return a string 'XEN').
If you apply this concept differently from my function, it's important to note that the first name made of x A's appears at column 26^(x-1) + 26^(x-2) + ... + 26^2 + 26 + 1 (e.g. 'AAA' is column 26^2 + 26 + 1 = 703).
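function [col_str] = let_loc(num_loc)
% Find the number of letters: an x-letter name starts at 26^(x-1) + ... + 26 + 1
test = 2;
old = 0;
x = 0;
while test >= 1
    old = 26^x + old;
    test = num_loc/old;
    x = x + 1;
end
num_letters = x - 1;
str_array = zeros(1,num_letters);
for i = 1:num_letters
    place = 26^(num_letters-i);
    % There is no "0" letter, so reserve at least 1 for every remaining position
    % (without this, multiples of 26 such as 52 = 'AZ' come out wrong)
    reserve = (place - 1)/25;
    loc = floor((num_loc - reserve)/place);
    num_loc = num_loc - loc*place;
    str_array(i) = 65 + (loc - 1); % character code of the letter
end
col_str = char(str_array);
end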
Hope this saves someone some time!
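For comparison, the "base 26 without a zero digit" idea mentioned above can be written very compactly. The following is a sketch in Python rather than Matlab, and the function name `excel_column_name` is an assumption for illustration. The key step is subtracting 1 before each division, which shifts the digit range from 1..26 to 0..25.

```python
def excel_column_name(number):
    """Convert a 1-based column number to its Excel-style name (1 -> 'A', 27 -> 'AA')."""
    if number < 1:
        raise ValueError("column numbers start at 1")
    letters = []
    while number > 0:
        # Subtract 1 first: there is no zero digit, so 'A'..'Z' stand for 1..26.
        number, remainder = divmod(number - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))   # digits were produced least significant first
```

For example, column 16384 (the last column in Excel 2007) maps to 'XFD', and 16368 maps to 'XEN', matching the values quoted in the answers above.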
