I have to deal with very large data sets (point clouds, generally more than 30,000,000 points) using MATLAB. I can read ASCII data using the textscan function. After reading, I need to detect invalid data (points with 0,0,0 coordinates) and then do some mathematical operations on each point or each line of the data. Currently, I first read the data with textscan and assign it to a matrix. Then I use for loops to detect the invalid points and to do the mathematical operations on each point or line. A sample of my code is shown below. According to MATLAB's profiler, textscan takes 37% and the line
transformed_list((i:i),(1:4)) = coordinate_list((i:i),(1:4))*t_matrix;
takes 35% of the total computation time.
I tried it with another point cloud (around 5,500,000 points) and the profiler reported the same results. Is there a way of avoiding the for loops, or another way of speeding up this computation?
fileID = fopen('C:\Users\Mustafa\Desktop\ptx_all_data\dede5.ptx');
original_data = textscan(fileID,'%f %f %f %f %f %f %f', 'delimiter',' ');
fclose(fileID);
column = original_data{1}(1);
row = original_data{1}(2);
t_matrix = [original_data{1}(7) original_data{2}(7) original_data{3}(7) original_data{4}(7)
original_data{1}(8) original_data{2}(8) original_data{3}(8) original_data{4}(8)
original_data{1}(9) original_data{2}(9) original_data{3}(9) original_data{4}(9)
original_data{1}(10) original_data{2}(10) original_data{3}(10) original_data{4}(10)];
coordinate_list(:,1) = original_data{1}(11:length(original_data{1}));
coordinate_list(:,2) = original_data{2}(11:length(original_data{2}));
coordinate_list(:,3) = original_data{3}(11:length(original_data{3}));
coordinate_list(:,4) = 0;
coordinate_list(:,5) = original_data{4}(11:length(original_data{4}));
transformed_list = zeros(length(coordinate_list),5);
for i = 1:length(coordinate_list)
if coordinate_list(i,1) == 0 && coordinate_list(i,2) == 0 && coordinate_list(i,3) == 0
transformed_list(i,:) = NaN;
else
%transformed_list(i,:) = coordinate_list(i,:)*t_matrix;
transformed_list((i:i),(1:4)) = coordinate_list((i:i),(1:4))*t_matrix;
transformed_list(i,5) = coordinate_list(i,5);
end
%i
end
Thanks in advance
for loops with conditional statements like those will take ages to run. But what MATLAB lacks in loop speed it makes up for with vectorization and indexing.
Let's try some logical indexing like this to solve the first step:
coordinate_list(coordinate_list(:,1) == 0 & ...
                coordinate_list(:,2) == 0 & ...
                coordinate_list(:,3) == 0, :) = nan;
And then vectorize the second statement:
transformed_list(:,(1:4)) = coordinate_list(:,(1:4))*t_matrix;
As EBH mentioned above, this might be a bit heavy on your RAM. If it's more than your computer can handle, ask yourself whether the coordinates really have to be doubles; maybe single precision will do. If that still isn't enough, try slicing the vector and performing the operation in parts, as sketched below.
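A minimal sketch of that chunked approach (the chunk size is arbitrary, tune it to your RAM):
% Transform the point list in chunks to limit peak memory use.
chunk_size = 1e6;
n_points = size(coordinate_list, 1);
for start_idx = 1:chunk_size:n_points
    stop_idx = min(start_idx + chunk_size - 1, n_points);
    transformed_list(start_idx:stop_idx, 1:4) = ...
        coordinate_list(start_idx:stop_idx, 1:4) * t_matrix;
end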
Small example to give you an idea, because I had a 2-million-point cloud around here:
In R2015a
transformed_list = zeros(length(coordinate_list),5);
tic
for i = 1:length(coordinate_list)
if coordinate_list(i,1) == 0 && coordinate_list(i,2) == 0 && coordinate_list(i,3) == 0
transformed_list(i,:) = NaN;
else
%transformed_list(i,:) = coordinate_list(i,:)*t_matrix;
transformed_list((i:i),(1:3)) = coordinate_list((i:i),(1:3))*t_matrix;
transformed_list(i,5) = 1;
end
%i
end
toc
Returns Elapsed time is 10.928142 seconds.
transformed_list=coordinate_list;
tic
coordinate_list(coordinate_list(:,1) == 0 & ...
                coordinate_list(:,2) == 0 & ...
                coordinate_list(:,3) == 0, :) = nan;
transformed_list(:,(1:3)) = coordinate_list(:,(1:3))*t_matrix;
toc
Returns Elapsed time is 0.101696 seconds.
Rather than reading the whole file, you'd be better off using a loop with
fscanf(fileID, '%f', 7)
and processing input as you read it.
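A minimal sketch of that streaming idea (header handling is elided; this only shows the read pattern):
% Stream the file instead of loading it all into memory at once.
fid = fopen('C:\Users\Mustafa\Desktop\ptx_all_data\dede5.ptx');
% ... read or skip the header values here first ...
while true
    vals = fscanf(fid, '%f', 7);   % next 7 numeric values (one point per line)
    if numel(vals) < 7
        break                      % end of file
    end
    % ... validity check and transform for this point go here ...
end
fclose(fid);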
For the main.py of the px2graph project, the training and validation part is shown below:
splits = [s for s in ['train', 'valid'] if opt.iters[s] > 0]
start_round = opt.last_round - opt.num_rounds
# Main training loop
for round_idx in range(start_round, opt.last_round):
for split in splits:
print("Round %d: %s" % (round_idx, split))
loader.start_epoch(sess, split, train_flag, opt.iters[split] * opt.batchsize)
flag_val = split == 'train'
for step in tqdm(range(opt.iters[split]), ascii=True):
global_step = step + round_idx * opt.iters[split]
to_run = [sample_idx, summaries[split], loss, accuracy]
if split == 'train': to_run += [optim]
# Do image summaries at the end of each round
do_image_summary = step == opt.iters[split] - 1
if do_image_summary: to_run[1] = image_summaries[split]
# Start with lower learning rate to prevent early divergence
t = 1/(1+np.exp(-(global_step-5000)/1000))
lr_start = opt.learning_rate / 15
lr_end = opt.learning_rate
tmp_lr = (1-t) * lr_start + t * lr_end
# Run computation graph
result = sess.run(to_run, feed_dict={train_flag:flag_val, lr:tmp_lr})
out_loss = result[2]
out_accuracy = result[3]
if sum(out_loss) > 1e5:
print("Loss diverging...exiting before code freezes due to NaN values.")
print("If this continues you may need to try a lower learning rate, a")
print("different optimizer, or a larger batch size.")
return
time_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print("{}: step {}, loss {:g}, acc {:g}".format(time_str, global_step, out_loss, out_accuracy))
# Log data
if split == 'valid' or (split == 'train' and step % 20 == 0) or do_image_summary:
writer.add_summary(result[1], global_step)
writer.flush()
# Save training snapshot
saver.save(sess, 'exp/' + opt.exp_id + '/snapshot')
with open('exp/' + opt.exp_id + '/last_round', 'w') as f:
f.write('%d\n' % round_idx)
It seems that the author only gets the result for each batch of the validation set. I am wondering: if I want to observe whether the model is improving or reaching its best performance, should I use the result on the whole validation set?
If the validation set is small enough, you can compute the loss and accuracy on the whole validation set during training to observe the performance. However, if the validation set is too large, it is better to compute batch-wise validation results over multiple steps and average them, as sketched below.
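A rough sketch of that averaging, reusing the names from the loop above (it assumes loss and accuracy evaluate to scalars here):
# Run through the whole validation split once and report averaged metrics.
loader.start_epoch(sess, 'valid', train_flag, opt.iters['valid'] * opt.batchsize)
val_losses, val_accs = [], []
for _ in range(opt.iters['valid']):
    out_loss, out_accuracy = sess.run([loss, accuracy],
                                      feed_dict={train_flag: False, lr: opt.learning_rate})
    val_losses.append(out_loss)
    val_accs.append(out_accuracy)
print("validation (whole split): loss {:g}, acc {:g}".format(
    np.mean(val_losses), np.mean(val_accs)))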
I am studying for this great Coursera course: https://www.coursera.org/learn/algorithmic-toolbox. In the fourth week, we have an assignment related to binary trees.
I think I did a good job. I created binary search code that solves this problem using recursion in Python 3. This is my code:
#python3
data_in_sequence = list(map(int,(input().split())))
data_in_keys = list(map(int,(input()).split()))
original_array = data_in_sequence[1:]
data_in_sequence = data_in_sequence[1:]
data_in_keys = data_in_keys[1:]
def binary_search(data_in_sequence,target):
answer = 0
sub_array = data_in_sequence
#print("sub_array",sub_array)
if not sub_array:
# print("sub_array",sub_array)
answer = -1
return answer
#print("target",target)
mid_point_index = (len(sub_array)//2)
#print("mid_point", sub_array[mid_point_index])
beg_point_index = 0
#print("beg_point_index",beg_point_index)
end_point_index = len(sub_array)-1
#print("end_point_index",end_point_index)
if sub_array[mid_point_index]==target:
#print ("final midpoint, ", sub_array[mid_point_index])
#print ("original_array",original_array)
#print("sub_array[mid_point_index]",sub_array[mid_point_index])
#print ("answer",answer)
answer = original_array.index(sub_array[mid_point_index])
return answer
elif target>sub_array[mid_point_index]:
#print("target num higher than current midpoint")
beg_point_index = mid_point_index+1
sub_array=sub_array[beg_point_index:]
end_point_index = len(sub_array)-1
#print("sub_array",sub_array)
return binary_search(sub_array,target)
elif target<sub_array[mid_point_index]:
#print("target num smaller than current midpoint")
sub_array = sub_array[:mid_point_index]
return binary_search(sub_array,target)
else:
return None
def bin_search_over_seq(data_in_sequence,data_in_keys):
final_output = ""
for key in data_in_keys:
final_output = final_output + " " + str(binary_search(data_in_sequence,key))
return final_output
print (bin_search_over_seq(data_in_sequence,data_in_keys))
I usually get the correct output. For instance, if I input:
5 1 5 8 12 13
5 8 1 23 1 11
I get the correct indexes within the sequence, or -1 if the term is not in the sequence (first line):
2 0 -1 0 -1
However, my code does not pass within the expected running time.
Failed case #4/22: time limit exceeded (Time used: 13.47/10.00, memory used: 36696064/536870912.)
I think this happens not because of the implementation of my binary search (I think it is right), but because of some inefficiency in a peripheral part of the code, like the way I am building the final output. However, the way I am presenting the final answer does not seem to be really "heavy"... I am lost.
Am I missing something? Is there another inefficiency I am not seeing? How can I solve this? Just by presenting the final result in a faster way?
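For comparison, a bounds-based binary search that avoids copying sub-arrays and the linear original_array.index() lookup could look like this (just an illustrative sketch, not the course's reference solution):
def binary_search_bounds(data, target):
    # Move index bounds instead of slicing, so each query is O(log n) and the
    # returned index already refers to the original array.
    lo, hi = 0, len(data) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if data[mid] == target:
            return mid
        elif data[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1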
I know that this question isn't new, but I haven't found anything useful. In my case I have a 20 GB file and I need to read random lines from it. Currently I have a simple file index which contains line numbers and the corresponding seek offsets. I also disabled buffering when reading, so that only the needed line is read.
And this is my code:
def create_random_file_gen(file_path, batch_size=0, dtype=np.float32, delimiter=','):
index = load_file_index(file_path)
if (batch_size > len(index)) or (batch_size == 0):
batch_size = len(index)
lines_indices = np.random.random_integers(0, len(index), batch_size)
with io.open(file_path, 'rb', buffering=0) as f:
for line_index in lines_indices:
f.seek(index[line_index])
line = f.readline(2048)
yield __get_features_from_line(line, delimiter, dtype)
The problem is that it's extremely slow: reading 5,000 lines takes 89 seconds on my Mac (here I point to the SSD drive). This is the code I used for testing:
features_gen = tedlium_random_speech_gen(5000) # just a wrapper for function given above
i = 0
for feature, cls in features_gen:
if i % 1000 == 0:
print("Got %d features" % i)
i += 1
print("Total %d features" % i)
I've read something about memory-mapping files, but I don't really understand how it works: how the mapping works in essence, and whether it will speed up the process or not.
So the main question is: what are the possible ways to speed up the process? The only way I see now is to read random blocks of lines rather than individual lines.
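For illustration, here is a minimal sketch of how the existing offset index could be combined with mmap, so the OS page cache serves repeated reads (load_file_index and __get_features_from_line are the helpers assumed above):
import mmap
import numpy as np

def create_random_file_gen_mmap(file_path, batch_size=0, dtype=np.float32, delimiter=','):
    # Same idea as the generator above, but reads go through a memory map.
    index = load_file_index(file_path)
    if (batch_size > len(index)) or (batch_size == 0):
        batch_size = len(index)
    lines_indices = np.random.randint(0, len(index), batch_size)
    with open(file_path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for line_index in lines_indices:
            start = index[line_index]
            end = mm.find(b'\n', start)   # locate the end of this line
            if end == -1:
                end = len(mm)
            yield __get_features_from_line(mm[start:end], delimiter, dtype)
        mm.close()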
I am running a function to extract some information from 100,000+ patient X-ray DICOM files. The files are stored within a VeraCrypt encryption container for security purposes.
When I run the function on a small sample of files it performs really quickly; however, when I run it on the entire dataset it is very slow in comparison, going from several files per second to approximately 1 file per second.
I was wondering why this is happening. I have tried storing the data on an SSD and on a normal hard drive and get the same sort of slowdown with the larger dataset compared to the small one.
I have added the code below for reference but haven't commented it fully yet; this is for my thesis, so I will do it once I get the extraction finished.
Thanks for any help.
function [ DB, corrupted_files ] = extract_from_dcm( folder_name )
%EXTRACT_FROM_DCM Summary of this function goes here
% Detailed explanation goes here
if nargin == 0
folder_name = 'I:\Find and Treat\MXU Old Backup\2005';
end
Database_Check = strcat(folder_name, '\DataBase.mat');
if exist(Database_Check, 'file')
load(Database_Check);
entry_start = length(DB) + 1;
else
entry_start = 1;
[ found_dicoms ] = recursive_search( folder_name );
end
mat_file_location = strcat(folder_name, '\DataBase.mat');
excel_DB_file = strcat(folder_name, '\DataBase.xlsx');
excel_Corrupted_file = strcat(folder_name, '\Corrupted_Files.xlsx');
% the recursive search creates a struct with the path for each
% dcm file found. the list is then recursivly used to locate
% the image and extract the relevant information from it.
fprintf('---------------------------------------------\n');
fprintf('Start Patient Data Extraction\n');
tic
h = waitbar(0,'','Name','Patient Data Extraction');
entry_end = length(found_dicoms);
if entry_end == 0
% set(handles.info_box, 'String', 'No Dicom Files Found in this Folder or its Subfolders');
else
% set(handles.info_box, 'String', 'Congratulations Dicom Files have been found Look Through the Data Base using the Buttons Below....Press Save Button to save the Database. (Database Save format is EXCEL SpreadSheet and MAT file');
for kk = entry_start : entry_end
progress = kk/entry_end;
progress_percent = round(progress * 100);
waitbar(progress,h, sprintf('%d%% %d/%d of images processed', progress_percent, kk, entry_end));
img_full_path = found_dicoms(kk).name;
% search_path = folder_name;
% img_full_path = strrep(img_full_path, search_path, '');
try %# Attempt to perform some computation
dicom_info = dicominfo(img_full_path); %# The operation you are trying to perform goes here
try %# Attempt to perform some computation
dicom_read = dicomread(dicom_info); %# The operation you are trying to perform goes here
old = dicominfo(img_full_path);
DB(kk).StudyDate = old.StudyDate;
DB(kk).StudyTime = old.StudyTime;
if isfield(old.PatientName, 'FamilyName')
DB(kk).Forename = old.PatientName.FamilyName;
else
DB(kk).Forename = 'NA';
end
if isfield(old.PatientName, 'GivenName')
DB(kk).LastName = old.PatientName.GivenName;
else
DB(kk).LastName = 'NA';
end
if isfield(old, 'PatientSex')
DB(kk).PatientSex = old.PatientSex;
else
DB(kk).PatientSex = 'NA';
end
if isempty(old.PatientBirthDate)
DB(kk).PatientBirthDate = '00000000';
else
DB(kk).PatientBirthDate = old.PatientBirthDate;
end
if strcmp(old.Manufacturer, 'Philips Medical Systems')
DB(kk).Van = '1';
else
DB(kk).Van = '0';% section to represent organising by different vans
end
DB(kk).img_Path = img_full_path;
save(mat_file_location,'DB','found_dicoms');
catch exception %# Catch the exception
fprintf('read - file %d corrupt.\n',kk);
continue %# Pass control to the next loop iteration
end
catch exception %# Catch the exception
fprintf('info - file %d corrupt.\n',kk);
continue %# Pass control to the next loop iteration
end
end
end
[ corrupted_files, DB ] = corruption_check( DB, found_dicoms, folder_name );
toc
fprintf('End Patient Data Extraction\n');
fprintf('---------------------------------------------\n');
fprintf('---------------------------------------------\n');
fprintf('Start Saving Extracted Data \n');
tic
save(mat_file_location,'DB','corrupted_files','found_dicoms');
if isempty(DB)
msg = sprintf('No Dicom Files Found');
msgbox(strcat(msg));
else
DB_table = struct2table(DB);
writetable(DB_table, excel_DB_file);
end
close(h);
toc
fprintf('End Saving Extracted Data \n');
fprintf('---------------------------------------------\n');
end
OK, thanks for all the help.
My problem was the saving at the end of each iteration, but the biggest problem was the line where I run the dicomread function. I changed the saving to occur only once every 20 images processed.
I also removed the preallocation suggested in the comments to see what difference it made without the dicomread and the saving as well; it was considerably slower than with the preallocation.
...I just need to find a solution for dicomread (which I was using as a way to check whether the file was corrupt or not).
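For reference, the saving change amounts to replacing the unconditional save inside the loop with something along these lines (a sketch of the idea, not the exact code):
% Only write the .mat file every 20 processed images (and once at the end),
% instead of after every single image.
if mod(kk, 20) == 0 || kk == entry_end
    save(mat_file_location, 'DB', 'found_dicoms');
end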
I am currently running training in MATLAB on a matrix of log-spectrum samples and I am constantly dealing with underflow problems. I understood that I need to work with logs in order to deal with the underflow.
I am still struggling with underflow though: when I calculate the mean (mue), because it is negative I can't work with logs, so I need the real values, which underflow.
These are the equations I am working with:
In the MATLAB code I calculate log_tau in order to avoid underflow, but when calculating mue I need exp(log(tau)), which goes to zero.
I am attaching the relevant MATLAB code.
**In the code, the variable I called alpha is tau...
for i = 1 : 50
log_c = Logsum(log_alpha,1) - log(N);
c = exp(log_c);
mue = DataMat*alpha./(repmat(exp(Logsum(log_alpha,1)),FrameSize,1));
log_abs_mue = log(abs(mue));
log_SigmaSqr = log((DataMat.^2)*alpha) - repmat(Logsum(log_alpha,1),FrameSize,1) - 2*log_abs_mue;
SigmaSqr = exp(log_SigmaSqr);
for j=1:N
rep_DataMat(:,:,j) = repmat(DataMat(:,j),1,M);
log_gamma(j,:) = log_c - 0.5*(FrameSize*log(2*pi)+sum(log_SigmaSqr)) + sum((rep_DataMat(:,:,j) - mue).^2./(2*SigmaSqr));
end
log_alpha = log_gamma - repmat(Logsum(log_gamma,2),1,M);
alpha = exp(log_alpha);
end
c = exp(log_c);
SigmaSqr = exp(log_SigmaSqr);
Does anyone see how I can avoid this, or what needs to be fixed in the code?
What I did was add this line to the MATLAB code:
mue(isnan(mue))=0; %fix 0/0 problem
and this one:
SigmaSqr(SigmaSqr==0)=1;%fix if mue_k = x_k
Not sure if this is the best solution, but it seems to work...
Does anyone have a better idea?
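One common trick, sketched below under the assumption that Logsum(x,1) computes a numerically stable log(sum(exp(x),1)), is to normalise the weights in the log domain before exponentiating, so mue never needs a division by an underflowed sum:
% Normalise the responsibilities per component in the log domain; the
% subtraction keeps the dominant entries near exp(0) = 1, so nothing underflows.
log_alpha_norm = log_alpha - repmat(Logsum(log_alpha, 1), N, 1);
mue = DataMat * exp(log_alpha_norm);   % weighted mean without the tiny denominator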