Fast string splitting in MATLAB - performance

I've been using MATLAB to read through a bunch of output files and have noticed that it was reading the files fairly slowly in comparison to a reader that I wrote in Python for the same files (on the order of 120s for MATLAB, 4s for Python on the same set). The files have a combination of letters and numbers, where the numbers I actually want each have a unique string on the same line, but there is no real pattern to the rest of the file. Is there a faster way to read in non-uniformly formatted text files in MATLAB?
I tried using the code profiler in MATLAB to see what takes the most time, and it seemed to be the strfind and strsplit functions. Digging deeper, strfun\private\strescape, which is called by strsplit, seems to be the culprit, accounting for around 50% of the time.
I am currently using a combination of strfind and strsplit to search through a file for 5 specific strings and then convert the string after each one into a double.
lots of text before this
#### unique identifying text here
lots of text before this
sometext X = #####
Y = #####
Z = #####
more text = ######
I am iterating through the file with approximately the following code, repeated for each number that is being found.
fid = fopen(filename);
tline = fgets(fid);
while ischar(tline)
    if ~isempty(strfind(tline, 'X ='))          % look for the identifying label on this line
        tempstring = strsplit(tline(13:end), ' ');
        result = str2double(tempstring{2});     % number that follows the label
    end
    tline = fgets(fid);
end
fclose(fid);

I'm guessing this will be a bit faster, but maybe not by much.
s = fileread('texto');
[X,s] = strtok(strsplit(s, "X = "){2}); X = str2num(X);
[Y,s] = strtok(strsplit(s, "Y = "){2}); Y = str2num(Y);
[Z,s] = strtok(strsplit(s, "Z = "){2}); Z = str2num(Z);
Obviously this is highly specific to your text example. You haven't given any more info on how the variables might change, etc., so presumably you'll have to implement try/catch blocks if the files are not consistent.
PS. This is Octave syntax, which allows chaining operations. For MATLAB, split them into separate operations as appropriate.
EDIT: ach, never mind, here's the MATLAB-compatible one too. :)
s = fileread('texto');
C = strsplit(s, 'X = '); [X,s] = strtok(C{2}); X = str2num(X);
C = strsplit(s, 'Y = '); [Y,s] = strtok(C{2}); Y = str2num(Y);
C = strsplit(s, 'Z = '); [Z,s] = strtok(C{2}); Z = str2num(Z);
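If strsplit's per-call overhead is still a concern, another option is a single regexp pass over the whole file. A minimal sketch, assuming the labels are literally 'X = ', 'Y = ' and 'Z = ' followed by a plain number (adjust the pattern to your real files):
s = fileread(filename);                                  % read the whole file once
tok = regexp(s, '([XYZ]) = ([-+0-9.eE]+)', 'tokens');    % {label, number-as-text} pairs
for k = 1:numel(tok)
    val = str2double(tok{k}{2});
    switch tok{k}{1}
        case 'X', X = val;
        case 'Y', Y = val;
        case 'Z', Z = val;
    end
end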

Related

Convert structure fields to arrays efficiently in MATLAB

I have a structure called s in MATLAB. It is a structure with two fields, a and b. The structure size is 1 x 1,620,000.
It is a very large structure (it probably takes half of the RAM of my machine). This is what the structure looks like:
I am looking for an efficient way to concatenate each of the fields a and b into two separate arrays that I can then export to CSV. I built the code below to do so, but even after 12 hours of running it has not even reached a quarter of the loop. Is there a more efficient way of doing this?
a = [];
b = [];
total_n = size(s,2);
count = 1;
while size(s,2)>0
    if size(s(1).a,1)
        a = [a; s(1).a];
    end
    if size(s(1).b,1)
        b = [b; s(1).b];
    end
    s(1) = []; %to save memory
    if mod(count,1000) == 0
        fprintf('Done %2f \n', [count/total_n])
    end
    count = count+1;
end
s(1) = []; %to save memory
Ah, but that comment hides a big misunderstanding.
If size(s) is 1 x 1,620,000, that line forces every loop iteration to do, under the hood (you don't see it):
snew = zeros(1,size(s,2)-1); % now you use double the memory
snew = s(2:end);             % now you force an unnecessary copy
So not only does that line make your code require double the memory, but in each iteration you also make an unnecessary copy of a large array.
Just replace your while with a normal for loop, for ii = 1:size(s,2), and index into s!
Now you can hopefully see why the following is an equally big mistake (not only that, but any modern MATLAB version flags it as a bad idea right in your editor):
a=[]
a=[a;s(1).a]
Here, on each iteration you force MATLAB to allocate a new a that is larger than before and to copy the contents of the old a into it.
Instead, preallocate a.
As you don't know what you are going to put there, I suggest using a cell array, as each s(ii).a has a different length.
You can then, after the loop, remove all empty (isempty) cells if you want.
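A minimal sketch of the approach described above, assuming each s(ii).a and s(ii).b is a column block (possibly empty):
nTotal = numel(s);
aCells = cell(nTotal, 1);      % preallocate once
bCells = cell(nTotal, 1);
for ii = 1:nTotal              % plain for loop; s is only read, never shrunk
    aCells{ii} = s(ii).a;
    bCells{ii} = s(ii).b;
end
a = vertcat(aCells{:});        % empty cells contribute nothing to the result
b = vertcat(bCells{:});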
Managed to do it efficiently:
s = struct2cell(s);   % 2 x 1 x 1,620,000 cell array (fields a and b become rows)
s = squeeze(s);       % 2 x 1,620,000
a = s(1,:);
a = a';
a = vertcat(a{:});
b = s(2,:);
b = b';
b = vertcat(b{:});

How to rename many images after processing with different parameters

Hello dear programmers,
I have a sequence of images and I would like to perform dilation on each of them with different dilation parameters. Then, I would like to save the processed images under new names that include both the old name and the corresponding dilation parameter. My code is as follows.
Input_folder =
Output_folder =
D = dir([Input_folder '*.jpg']);
Inputs = {D.name}';
Outputs = Inputs; % preallocate
%print(length(Inputs));
for k = 1:length(Inputs)
    X = imread([Input_folder Inputs{k}]);
    dotLocations = find(Inputs{k} == '.');
    name_img = Inputs{k}(1:dotLocations(1)-1);
    image1 = im2bw(X);
    vec = [3;6;10];
    vec_l = length(vec);
    for i = 1:vec_l
        %se = strel('disk',2);
        fprintf(num2str(vec(i)));
        se = strel('diamond',vec(i)); %imdilate
        im = imdilate(image1,se);
        image2 = im - image1;
        Outputs{k} = regexprep(Outputs{k}, name_img, strcat(name_img,'_', num2str(vec(i))));
        imwrite(image2, [Output_folder Outputs{k}])
    end
end
As can be seen, I would like to apply dilation with parameters 3, 6 and 10. Let us assume an image is named "image1"; after processing it, I would like to have "image1_3", "image1_6" and "image1_10". However, I am getting "image1_3", "image1_6_3" and "image1_10_6_3" as results. Please, how can I modify my code to fix this problem?
This is because you rewrite each item of the Outputs variable three times, each time using the previous value to create the new one. Instead, you should build the new names from the values stored in Inputs. Another mistake in your code is that the sizes of Inputs and Outputs are equal, while for every file in the input folder three files must be written to the output folder.
I also suggest using the fileparts function, instead of manual string processing, to get the different parts of a file path.
vec = [3;6;10];
vec_l = length(vec);
Outputs = cell(size(Inputs, 1)*vec_l, 1); % preallocate
for k = 1:length(Inputs)
    [~,name,ext] = fileparts([Input_folder Inputs{k}]);
    % load image
    for i = 1:vec_l
        % modify image
        newName = sprintf('%s_%d%s', name, vec(i), ext);
        Outputs{(k-1)*vec_l+i} = newName;
        imwrite(image2, [Output_folder newName])
    end
end

word2vec recommendation system KeyError: "word '21883' not in vocabulary"

The code works absolutely fine for the data set containing 500,000+ instances, but whenever I reduce the data set to 5000/10000/15000 it throws a KeyError: word "***" not in vocabulary. Not for every data point, but it throws the error for most of them. The data set is in Excel format. [1]: https://i.stack.imgur.com/YCBiQ.png
I don't know how to fix this problem since I have very little knowledge about it; I am still learning. Please help me fix it!
purchases_train = []
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)

model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples = model.corpus_count,
            epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)
# extract all vectors
X = model[model.wv.vocab]
X.shape

products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

def similar_products(v, n = 6):
    ms = model.similar_by_vector(v, topn= n+1)[1:]
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms

similar_products(model['21883'])
If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking up was not in the training data fed to Word2Vec, or did not appear enough times (default min_count=5).
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors and instead just drag down the quality of surrounding words' vectors, so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.

MATLAB parfor slicing a 3D array

I'm trying to speed up my code using parfor. The purpose of the code is to slide a 3D window over a 3D image and apply a function to each m-by-m-by-m block.
I wrote this code:
function [ o_image ] = SlidingWindow( i_image, i_padSize, i_fun, i_options )
%SLIDINGWINDOW Summary of this function goes here
%   Detailed explanation goes here
o_image = zeros(size(i_image,1),size(i_image,2),size(i_image,3));
i_image = padarray(i_image,i_padSize,'symmetric');
i_padSize = num2cell(i_padSize);
[m,n,p] = deal(i_padSize{:});
[row,col,depth] = size(i_image);
windowShape = i_options.windowShape;
mask = i_options.mask;
parfor (i = m+1:row-m,i_options.cores)
    temp = i_image(i-m:i+m,:,:);
    for j = n+1:col-n
        for h = p+1:depth-p
            ii = i-m;
            jj = j-n;
            hh = h-p;
            temp = temp(:,j-n:j+n, h-p:h+p);
            o_image(ii,jj,hh) = parfeval(i_fun, temp, windowShape, mask);
        end
    end
end
end
I get one warning and one error that I don't understand how to solve.
The warning says:
the entire array or structure 'i_image' is a broadcast variable.
The error says:
the PARFOR loop can not run due to the way variable 'o_image' is used.
I don't understand how to fix these two things. Any help is greatly appreciated!
As far as I understand, parfeval takes care of running your function on the available number of workers, which is why it doesn't need to be surrounded by parfor. Assuming you already have an active parpool, changing the external parfor into for eliminates both problems.
Unfortunately, I can't support my answer with a benchmark or suggest a more fitting solution because your inputs are unknown.
It seems to me that the code can be optimized in other ways, mainly by vectorization. I would suggest you looked into the following resources:
This question, for additional info on parfeval.
Examples on how to use bsxfun and permute and benchmarks thereof: ex1, ex2, ex3.
P.S.: The 2nd part of (i = m+1:row-m,i_options.cores) seems out of place...
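If the goal is simply to run the outer loop in parallel, one alternative (not the original parfeval approach) is to call the window function directly inside parfor and write the output through an index that parfor can slice. A minimal sketch, assuming i_fun returns a scalar per window; the function name is hypothetical:
function o_image = slidingWindowSketch( i_image, i_padSize, i_fun, i_options )
o_image = zeros(size(i_image));
i_image = padarray(i_image, i_padSize, 'symmetric');
m = i_padSize(1); n = i_padSize(2); p = i_padSize(3);
[row, col, depth] = size(i_image);
windowShape = i_options.windowShape;
mask = i_options.mask;
maxWorkers = i_options.cores;
parfor (i = m+1:row-m, maxWorkers)
    slab = zeros(1, col-2*n, depth-2*p);               % per-iteration output slice
    for j = n+1:col-n
        for h = p+1:depth-p
            win = i_image(i-m:i+m, j-n:j+n, h-p:h+p);  % i_image is still broadcast (hence the warning), but the loop runs
            slab(1, j-n, h-p) = i_fun(win, windowShape, mask);  % direct call; assumes a scalar result
        end
    end
    o_image(i-m, :, :) = slab;                         % sliced output: exactly one index uses the loop variable
end
end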

Is there an alternative to the imread command to reduce delay in a MATLAB program?

I have 2900 images in the path G:\newdatabase\. It is taking too much time to read the images. The dot product is also taking too much time.
Questions:
1. Is there any alternative to the imread command that improves performance?
2. Is there any alternative to the dot command that improves performance?
Source code I tried:
srcFiles = dir('G:\newdatabase\*.jpg'); % the folder in which your images exist
for b = 1 : length(srcFiles)
    filename = strcat('G:\newdatabase\',srcFiles(b).name);
    Imgdata = imread(filename);
Source code I tried:
for i = 1:aa
    pare = dot(NormImage,u(:,i));
    p = [p; pare];
end
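For the second snippet, the growing p array is typically the bottleneck; for real-valued data, all the dot products can be computed at once with a single matrix product. A minimal sketch (assumed shapes: NormImage is N x 1, u is N x aa):
p = (NormImage' * u)';    % p(i) equals dot(NormImage, u(:,i)) for real-valued inputs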
