Groovy I/O performance issue

I'm not a Groovy expert; I just use it from time to time. One of my recent goals was to generate a very simple file containing some random data. I created the following script:
out = new File('sampledata.txt')
Random random = new Random();
java.util.Date dt = new java.util.Date();
for (int i=0; i<100000; ++i) {
    dt = new java.util.Date();
    out << dt.format('yyyMMdd HH:mm:ss.SSS') + '|box|process|||java.lang.Long|' + random.nextInt(100) + '|name\n'
}
Now, I'm really puzzled by its performance. It takes around 1.5 minutes to complete, whilst the same code written in Java or Ruby takes less than a second.
Similar code in Ruby (takes around 1 second to execute):
require "time"
File.open("output.txt", "w") do |file|
  100000.times do
    line = Time.now.strftime("%Y%m%d %H:%M:%S.%L") + '|box|process|||java.lang.Long|' + rand(100).to_s + '|name'
    file.puts line
  end
end
Any ideas how Groovy's processing speed could be improved?

The left-shift operator opens the file, jumps to the end, appends the text, and closes the file again... and it does that for every single line you write.
Instead, try:
Random random = new Random();
// Open the file and append to it.
// If you want a new file each time, use withWriter instead of withWriterAppend
new File('sampledata.txt').withWriterAppend { w ->
    100000.times {
        w.writeLine "${new Date().format('yyyMMdd HH:mm:ss.SSS')}|box|process|||java.lang.Long|${random.nextInt(100)}|name"
    }
}
(this is also much more like what the Ruby code is doing)

Related

Speed/pace calculation code; keep getting invalid syntax. Very new; what am I doing wrong?

def calculation(minutes, seconds, miles):
    pace = (int(minutes) + (int(seconds)/60)/miles)
    speed = (float(miles)/(int(minutes) + (int(seconds)/60)

minutes = raw_input("Minutes ==> ")
seconds = raw_input("Seconds ==> ")
miles = raw_input("Miles ==>" )
I'm attempting to take user input and calculate the pace and speed from it, but I keep getting syntax errors starting from the fourth line down. For the record, I'm very new to this, so it's probably something simple, but any help is appreciated!
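The unbalanced parentheses on the speed line are what trigger the syntax error reported on the following line; the function also never returns anything and is never called. A minimal corrected sketch (assuming pace is meant to be minutes per mile and speed miles per hour):
def calculation(minutes, seconds, miles):
    total_minutes = int(minutes) + int(seconds) / 60.0   # elapsed time in minutes
    pace = total_minutes / float(miles)                   # minutes per mile
    speed = float(miles) / (total_minutes / 60.0)         # miles per hour
    return pace, speed

minutes = raw_input("Minutes ==> ")
seconds = raw_input("Seconds ==> ")
miles = raw_input("Miles ==> ")

pace, speed = calculation(minutes, seconds, miles)
print "Pace: %.2f min/mile  Speed: %.2f mph" % (pace, speed)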

Improve genbank feature addition

I am trying to add more than 70,000 new features to a GenBank file using Biopython.
I have this code:
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

fi = "myoriginal.gbk"
fo = "mynewfile.gbk"

for result in results:
    start = 0
    end = 0
    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])
    for record in SeqIO.parse(original, "gb"):
        record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
        SeqIO.write(record, fo, "gb")
results is just a list of lists containing the start and end of each of the features I need to add to the original gbk file.
This solution is extremely costly for my computer and I do not know how to improve the performance. Any good ideas?
You should parse the GenBank file just once. Leaving aside what results contains (I cannot tell exactly, because some pieces of code are missing from your example), I would guess something like this, modifying your code, would improve performance:
fi = "myoriginal.gbk"
fo = "mynewfile.gbk"

# Parse the original file only once and keep the records in memory
original_records = list(SeqIO.parse(fi, "gb"))

for result in results:
    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])
    for record in original_records:
        record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))

# Write all the updated records once, after the loops, instead of rewriting
# the output file on every iteration
SeqIO.write(original_records, fo, "gb")

Why is MATLAB slow in a for loop with a large number of iterations but fast with a small number of iterations?

I am running a function to extract some information from 100,000+ patient X-ray DICOM files. The files are stored within a VeraCrypt encryption container for security purposes.
When I run the function on a small sample of files it performs really quickly, but when I run it on the entire dataset it is very slow in comparison, going from several files per second to roughly one file per second.
I was wondering why this is happening? I have tried storing the data on an SSD and on a normal hard drive and get the same sort of slowdown with the larger dataset compared to the small one.
I have added the code below for reference but haven't commented it fully yet; this is for my thesis, so I will do that once I get the extraction finished.
Thanks for any help.
function [ DB, corrupted_files ] = extract_from_dcm( folder_name )
%EXTRACT_FROM_DCM Summary of this function goes here
% Detailed explanation goes here
if nargin == 0
    folder_name = 'I:\Find and Treat\MXU Old Backup\2005';
end
Database_Check = strcat(folder_name, '\DataBase.mat');
if exist(Database_Check, 'file')
    load(Database_Check);
    entry_start = length(DB) + 1;
else
    entry_start = 1;
    [ found_dicoms ] = recursive_search( folder_name );
end
mat_file_location = strcat(folder_name, '\DataBase.mat');
excel_DB_file = strcat(folder_name, '\DataBase.xlsx');
excel_Corrupted_file = strcat(folder_name, '\Corrupted_Files.xlsx');
% the recursive search creates a struct with the path for each
% dcm file found. the list is then recursivly used to locate
% the image and extract the relevant information from it.
fprintf('---------------------------------------------\n');
fprintf('Start Patient Data Extraction\n');
tic
h = waitbar(0,'','Name','Patient Data Extraction');
entry_end = length(found_dicoms);
if entry_end == 0
    % set(handles.info_box, 'String', 'No Dicom Files Found in this Folder or its Subfolders');
else
    % set(handles.info_box, 'String', 'Congratulations Dicom Files have been found Look Through the Data Base using the Buttons Below....Press Save Button to save the Database. (Database Save format is EXCEL SpreadSheet and MAT file');
    for kk = entry_start : entry_end
        progress = kk/entry_end;
        progress_percent = round(progress * 100);
        waitbar(progress,h, sprintf('%d%% %d/%d of images processed', progress_percent, kk, entry_end));
        img_full_path = found_dicoms(kk).name;
        % search_path = folder_name;
        % img_full_path = strrep(img_full_path, search_path, '');
        try %# Attempt to perform some computation
            dicom_info = dicominfo(img_full_path); %# The operation you are trying to perform goes here
            try %# Attempt to perform some computation
                dicom_read = dicomread(dicom_info); %# The operation you are trying to perform goes here
                old = dicominfo(img_full_path);
                DB(kk).StudyDate = old.StudyDate;
                DB(kk).StudyTime = old.StudyTime;
                if isfield(old.PatientName, 'FamilyName')
                    DB(kk).Forename = old.PatientName.FamilyName;
                else
                    DB(kk).Forename = 'NA';
                end
                if isfield(old.PatientName, 'GivenName')
                    DB(kk).LastName = old.PatientName.GivenName;
                else
                    DB(kk).LastName = 'NA';
                end
                if isfield(old, 'PatientSex')
                    DB(kk).PatientSex = old.PatientSex;
                else
                    DB(kk).PatientSex = 'NA';
                end
                if isempty(old.PatientBirthDate)
                    DB(kk).PatientBirthDate = '00000000';
                else
                    DB(kk).PatientBirthDate = old.PatientBirthDate;
                end
                if strcmp(old.Manufacturer, 'Philips Medical Systems')
                    DB(kk).Van = '1';
                else
                    DB(kk).Van = '0'; % section to represent organising by different vans
                end
                DB(kk).img_Path = img_full_path;
                save(mat_file_location,'DB','found_dicoms');
            catch exception %# Catch the exception
                fprintf('read - file %d corrupt.\n',kk);
                continue %# Pass control to the next loop iteration
            end
        catch exception %# Catch the exception
            fprintf('info - file %d corrupt.\n',kk);
            continue %# Pass control to the next loop iteration
        end
    end
end
[ corrupted_files, DB ] = corruption_check( DB, found_dicoms, folder_name );
toc
fprintf('End Patient Data Extraction\n');
fprintf('---------------------------------------------\n');
fprintf('---------------------------------------------\n');
fprintf('Start Saving Extracted Data \n');
tic
save(mat_file_location,'DB','corrupted_files','found_dicoms');
if isempty(DB)
    msg = sprintf('No Dicom Files Found');
    msgbox(strcat(msg));
else
    DB_table = struct2table(DB);
    writetable(DB_table, excel_DB_file);
end
close(h);
toc
fprintf('End Saving Extracted Data \n');
fprintf('---------------------------------------------\n');
end
OK, thanks for all the help.
My problem was the saving at the end of each iteration, but the biggest problem was the line where I run the dicomread function. I changed the saving to occur once for every 20 images processed.
I also removed the preallocation suggested in the comments to see what difference it made without the dicomread and the saving as well; it was considerably slower than with the preallocation.
... I just need to find a solution for dicomread (which I was using as a way to check whether a file was corrupt or not).

Multiprocessing and shared multiprocessing manager lists for parsing large file

I am trying to parse a huge file (approx 23 MB) using the code below, wherein I populate a multiprocessing.manager.list with all the lines read from the file . In the target routine (parse_line) for each process, I pop a line and parse it to create a defaultdict object with certain parsed attributes and finally push each of these objects into another multiprocessing.manager.list.
import multiprocessing as mp
from collections import defaultdict
from pyparsing import (Word, Suppress, Literal, Combine, OneOrMore, Optional,
                       alphas, nums, hexnums)

class parser(object):
    def __init__(self):
        self.manager = mp.Manager()
        self.in_list = self.manager.list()
        self.out_list = self.manager.list()
        self.dict_list, self.lines, self.pcap_text = [], [], []
        self.last_timestamp = [[(999999,0)]*32]*2
        self.num = Word(nums)
        self.word = Word(alphas)
        self.open_brace = Suppress(Literal("["))
        self.close_brace = Suppress(Literal("]"))
        self.colon = Literal(":")
        self.stime = Combine(OneOrMore(self.num + self.colon) + self.num + Literal(".") + self.num)
        self.date = OneOrMore(self.word) + self.num + self.stime
        self.is_cavium = self.open_brace + (Suppress(self.word)) + self.close_brace
        self.oct_id = self.open_brace + Suppress(self.word) + Suppress(Literal("=")) \
                      + self.num + self.close_brace
        self.core_id = self.open_brace + Suppress(self.word) + Suppress(Literal("#")) \
                       + self.num + self.close_brace
        self.ppm_id = self.open_brace + self.num + self.close_brace
        self.oct_ts = self.open_brace + self.num + self.close_brace
        self.dump = Suppress(Word(hexnums) + Literal(":")) + OneOrMore(Word(hexnums))
        self.opening = Suppress(self.date) + Optional(self.is_cavium.setResultsName("cavium")) \
                       + self.oct_id.setResultsName("octeon").setParseAction(lambda toks: int(toks[0])) \
                       + self.core_id.setResultsName("core").setParseAction(lambda toks: int(toks[0])) \
                       + Optional(self.ppm_id.setResultsName("ppm").setParseAction(lambda toks: int(toks[0])) \
                       + self.oct_ts.setResultsName("timestamp").setParseAction(lambda toks: int(toks[0]))) \
                       + Optional(self.dump.setResultsName("pcap"))

    def parse_file(self, filepath):
        self.filepath = filepath
        with open(self.filepath, 'r') as f:
            self.lines = f.readlines()
            for lineno, line in enumerate(self.lines):
                self.in_list.append((lineno, line))
        processes = [mp.Process(target=self.parse_line) for i in range(mp.cpu_count())]
        [process.start() for process in processes]
        [process.join() for process in processes]

    def parse_line(self):  # target routine for each worker process
        while self.in_list:
            (lineno, line) = self.in_list.pop()
            print mp.current_process().name, "start"
            dic = defaultdict(int)
            result = self.opening.parseString(line)
            self.pcap_text.append("".join(result.pcap))
            if result.timestamp or result.ppm:
                dic['oct'], dic['core'], dic['ppm'], dic['timestamp'] = result[0:4]
                self.last_timestamp[result.octeon][result.core] = (result.ppm, result.timestamp)
            else:
                dic['oct'], dic['core'] = result[0:2]
                dic['ppm'] = (self.last_timestamp[result.octeon][result.core])[0]
                dic['ts'] = (self.last_timestamp[result.octeon][result.core])[1]
            dic['line'] = lineno
            self.out_list.append(dic)
However, this entire process takes approximately 3 minutes to complete.
My question is: is there a better way to make this faster?
I am using the pyparsing module to parse each line, if that makes any difference.
PS: Changes have been made to the routine per Paul McGuire's advice below.
Not a big performance issue, but learn to iterate over files directly, instead of using readlines(). In place of this code:
self.lines = f.readlines()
for lineno,line in enumerate(self.lines):
    self.in_list.append((lineno,line))
You can write:
self.in_list = list(enumerate(f))
A hidden performance killer is using while self.in_list: (lineno,line) = list.pop(). Each call to pop removes the 0'th element from the list. Unfortunately, Python's lists are implemented as arrays. To remove the 0'th element, the 1..n-1'th elements have to be moved up one slot in the array. You don't really have to destroy self.in_list as you go, just iterate over it:
for lineno, line in self.in_list:
    <Do something with line and line no. Parse each line and push into out_list>
If you are thinking that consuming self.in_list as you go is a memory-saving measure, you can avoid the array-shifting inefficiency of Python lists by using a deque instead (from Python's collections module). Deques are implemented internally as linked lists, so pushing or popping at either end is very fast, but indexed access is slow. To use a deque, replace the line:
self.in_list = list(enumerate(f))
with:
self.in_list = deque(enumerate(f))
Then replace the call in your code self.in_list.pop() with self.in_list.popleft().
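For illustration, here is a minimal sketch of the deque-based consumption pattern described above (the file name is just a placeholder):
from collections import deque

with open("input.log") as f:            # placeholder file name
    in_list = deque(enumerate(f))       # appends/pops at either end are cheap

while in_list:
    lineno, line = in_list.popleft()    # no array shifting, unlike list.pop(0)
    # parse the line here and push the parsed result onto out_list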
But MUCH more likely to be the performance issue is the pyparsing code you are using to process each line. But since you didn't post the parser code, there is not much help we can provide there.
To get an idea of where the time is going, leave all your code in place but comment out the <Do something with line and line no. Parse each line and push into out_list> code (you may have to add a pass statement in the for loop), and then run against your 23 MB file. This will give you a rough idea of how much of your 3 minutes is being spent reading and iterating over the file, and how much is being spent doing the actual parsing. Then post back in another question when you find where the real performance issues lie.
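For example, a rough sketch of that comparison (the file path is a placeholder; the point is simply to time iteration alone, then iteration plus parsing):
import time

start = time.time()
with open("input.log") as f:            # placeholder path for the 23 MB file
    for lineno, line in enumerate(f):
        pass                            # iteration only, parsing commented out
print "read/iterate only: %.1f s" % (time.time() - start)
# Re-run with the parsing code re-enabled and compare the two timings.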

Entity Framework SaveChanges() first call is very slow

I appreciate that this issue has been raised a couple of times before, but I can't find a definitive answer (maybe there isn't one!).
Anyway, the title says it all really: create a new context, add a new entity, and SaveChanges() takes 20 seconds. Add a second entity in the same context, and SaveChanges() is instant.
Any thoughts on this? :-)
============ UPDATE =============
I've created a very simple app running against my existing model to show the issue...
public void Go()
{
    ModelContainer context = new ModelContainer(DbHelper.GenerateConnectionString());
    for (int i = 1; i <= 5; i++)
    {
        DateTime start = DateTime.Now;
        Order order = context.Orders.Single(c => c.Reference == "AA05056");
        DateTime end = DateTime.Now;
        double millisecs = (end - start).TotalMilliseconds;
        Console.WriteLine("Query " + i + " = " + millisecs + "ms (" + millisecs / 1000 + "s)");
        start = DateTime.Now;
        order.Note = start.ToLongTimeString();
        context.SaveChanges();
        end = DateTime.Now;
        millisecs = (end - start).TotalMilliseconds;
        Console.WriteLine("SaveChanges " + i + " = " + millisecs + "ms (" + millisecs / 1000 + "s)");
        Thread.Sleep(1000);
    }
    Console.ReadKey();
}
Please do not comment on my code - unless it is an invalid test ;)
The results are:
Query 1 = 3999.2288ms (3.9992288s)
SaveChanges 1 = 3391.194ms (3.391194s)
Query 2 = 18.001ms (0.018001s)
SaveChanges 2 = 4.0002ms (0.0040002s)
Query 3 = 14.0008ms (0.0140008s)
SaveChanges 3 = 3.0002ms (0.0030002s)
Query 4 = 13.0008ms (0.0130008s)
SaveChanges 4 = 3.0002ms (0.0030002s)
Query 5 = 10.0005ms (0.0100005s)
SaveChanges 5 = 3.0002ms (0.0030002s)
The first query takes time, which I assume is the view generation? Or the DB connection?
The first save takes nearly 4 seconds here; the more complex save in my real app takes over 20 seconds, which is not acceptable.
Not sure where to go with this now :-(
UPDATE...
SQL Profiler shows that the first query and update are fast on the SQL side and no different from the later ones, so I know the delay is in Entity Framework, as suspected.
It might not be the SaveChanges call - the first time you make any call to the database in EF, it has to do some initial code generation from the metadata. You can pre-generate this though at compile-time: http://msdn.microsoft.com/en-us/library/bb896240.aspx
I would be surprised if that's the only problem, but it might help.
Also have a look here: http://msdn.microsoft.com/en-us/library/cc853327.aspx
I would run the following code on app start-up, see how long it takes, and check whether the first SaveChanges is fast after that.
public static void UpdateDatabase()
{
    //Note: Using SetInitializer is recommended by Ladislav Mrnka (reputation 275k)
    //http://stackoverflow.com/questions/9281423/entity-framework-4-3-run-migrations-at-application-start
    Database.SetInitializer<DAL.MyDbContext>(
        new MigrateDatabaseToLatestVersion<DAL.MyDbContext,
            Migrations.MyDbContext.Configuration>());
    using (var db = new DAL.MyDbContext()) {
        db.Database.Initialize(false); // Execute the migrations now, not at the first access
    }
}
