Perl efficiency in writing in file - performance

I am creating a database with some information about files.
e.g: file_name | size | modify_date ...
I was wondering which approach is more efficient in this situation:
1) For each file, get the info and print it to my file
foreach my $file (@listOfFiles) {
my %temporary_hash = get_info_for_file($file); # store the information for the current file in a temporary hash
print_info(\%temporary_hash, $output_file); # print the information to my output file
}
2) Store the info for every file in a hash and print the whole hash at once
foreach my $file (@listOfFiles) {
store_info_in_hash(get_info_for_file($file), \%hash); # for each file, store the information in a global hash
}
print_all_info(\%hash, $output_file); # once I have the information for every file, print the whole hash to my output file

You are wrong to consider efficiency before you have even got your program working.
You should write your code as clearly as possible and debug it. Only then, if it is not running fast enough for your purpose, should you put your code through a profiler to discover the bottlenecks that are taking the most time.
The two options you show will probably not be very different unless your files are enormous.
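Once the program works, one way to settle the question is to time both variants on realistic data. The sketch below is in Python rather than Perl, and everything in it is an assumption made for illustration: the record layout, the helper names, and the in-memory buffer that stands in for the output file.

```python
import io
import timeit


def make_records(n):
    # Hypothetical stand-in for get_info_for_file(): (name, size, mtime).
    return [(f"file_{i}.txt", i * 10, 1_600_000_000 + i) for i in range(n)]


def write_each(records, out):
    # Option 1: format and write every record as soon as it is available.
    for name, size, mtime in records:
        out.write(f"{name}|{size}|{mtime}\n")


def write_all_at_once(records, out):
    # Option 2: collect all formatted lines first, then write them in one call.
    lines = [f"{name}|{size}|{mtime}\n" for name, size, mtime in records]
    out.write("".join(lines))


def run(writer, records):
    # Write into an in-memory buffer and return what was written.
    buf = io.StringIO()
    writer(records, buf)
    return buf.getvalue()


if __name__ == "__main__":
    records = make_records(50_000)
    for writer in (write_each, write_all_at_once):
        seconds = timeit.timeit(lambda: run(writer, records), number=5)
        print(f"{writer.__name__}: {seconds:.3f}s for 5 runs")
```

Both variants produce identical output, so only the timing differs. With a real file handle the operating system and the I/O layer buffer writes anyway, which is one reason the gap is often small.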

Doing a benchmark test on the two options, I got these results (and if I increase the amount of information for each file, the difference between the two becomes even bigger).

Related

How to write for loop to run regression on multiple files contained in same folder?

I need to develop a script to run a simple OLS on multiple csv files stored in the same folder.
All have the same column names and regression will always be based upon the same columns ("x_var" and "y_var").
The below code is used to read in the csvs and rename them.
## Read in files from folder
file.List <- list.files(pattern = "*.csv")
for(i in 1:length(file.List))
{
assign(paste(gsub(".csv","", file.List[i])), read.csv(file.List[i]))
}
However, after this [very initial stage!] I've got a bit lost.
Each dataframe has 7 identical columns: a, b, c, d, x_var, e, y_var.
I need to run a simple OLS using lm(x_var ~ y_var, data = dataframes) and plot the result for each dataframe. I assumed a 'for loop' would be the best option, but I am not too sure how to write it.
After each regression is run, I want to extract the coefficients, R2, etc. into a csv and save the plot separately.
I tried the code below, but it has gone very wrong [and is not working at all]:
list <- list(gsub(" finalSIRTAnalysis.csv","", file.List))
for(i in length(file.List))
{
lm(x_var ~ y_var, data = [i])
}
I can't even make a start on this and need some advice, if anyone has any good ideas (such as creating an external function first).
I am not sure whether lm can compute results from multiple data sources at once. Try merging the data. I have a similar issue, because I have 5k files and it is computationally impossible to merge them all. But maybe this answer can help you.
https://stackoverflow.com/a/63770065/14744492

How to speed up XQuery string search?

I am using XQuery/BaseX to look through large XML files to find historical data for some counters. All the files are zipped and stored somewhere on drive. The important part of file looks as follows:
<measInfo xmlns="http://www.hehe.org/foo/" measInfoId="uplink speed">
<granPeriod duration="DS222S" endTime="2020-09-03T08:15:00+02:00"/>
<repPeriod duration="DS222S"/>
<measTypes>AFD123 AFD124 AFD125 AFD156</measTypes>
<measValue measObjLdn="PLDS-PLDS/STBHG-532632">
<measResults>23 42 12 43</measResults>
</measValue>
</measInfo>
I built the following query:
declare default element namespace "http://www.hehe.org/foo/";
let $sought := ["AFD124", "AFD125"]
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $filename := concat(convert:dateTime-to-integer(current-dateTime()), ".xml")
for $meas in $datasource/measCollecFile/measData/measInfo return
for $measType at $i in $meas/tokenize(measTypes)[. = $sought] return
file:append($filename,
<meas
measInfoId="{data($meas/@measInfoId)}"
measObjLdn="{data($meas/measValue/@measObjLdn)}"
>
{$meas/granPeriod}
{$meas/repPeriod}
<measType>{$measType}</measType>
<measValue>{$meas/measValue/tokenize(measResults, " ")[$i]}</measValue>
</meas>)
The script works, but it takes a lot of time for some counters (measType). I read the documentation about indexing, and my idea is to somehow index all the measTypes (parts of the string), so that once I need to look through the whole archive looking for a counter, it can be quickly accessed. I am not sure if it is possible when operating directly on archives? Would I have to create a new database of them? I would prefer not to, due to the size of files. How to create indexes for such case?
It is not the answer to my question, but I have noticed that the execution time is much longer when I write XML nodes to a file. It is much faster to append any other string to a file:
concat($measInfo/@measInfoId, ",", $measInfo/measValue/@measObjLdn, ",",
$measInfo/granPeriod, ",", $measInfo/repPeriod, ",", $measType, ",",
$tokenizedValues[$i], "&#10;"))
Why is that, and how can I speed up writing XML nodes to a file?
Also, I have noticed that appending a value to a file inside a for loop takes much longer, and I suspect that this is because the file has to be opened again in each iteration. Is there a way to keep the file open throughout the whole query?

Sampling 1000 lines from a bunch of gzipped files with PIG

I'm very new to Pig so I may be going about this the wrong way. I have a bunch of gzipped files in a directory in Hadoop. I'm trying to sample around 1000 lines from all of these files put together. It doesn't have to be exact, so I wanted to use SAMPLE. SAMPLE needs a probability of sampling a line, rather than the number of lines that I need, so I thought I should count up the number of lines among all these files and then simply divide 1000 by that count and use it as the probability. This will work, since I don't need to have exactly 1000 lines at the end. Here is what I've got so far:
raw = LOAD '/data_dir';
cnt = FOREACH (GROUP raw ALL) GENERATE COUNT_STAR(raw);
cntdiv = FOREACH cnt GENERATE (float)1000/cnt.$0;
Now I'm not sure how to use the value in cntdiv in SAMPLE. I tried SAMPLE raw cntdiv and SAMPLE raw cntdiv.$0, but they don't work. Can I even use that value in the call to SAMPLE? Maybe there is a much better way of accomplishing what I'm trying to do?
Check out the description in the ticket originally requesting this feature: https://issues.apache.org/jira/browse/PIG-1926
I haven't tested this, but it looks like this should work:
raw = LOAD '/data_dir';
samplerate = FOREACH (GROUP raw ALL) GENERATE 1000.0/COUNT_STAR(raw) AS rate;
thousand = SAMPLE raw samplerate.rate;
The important thing is to refer to your scalar by name (rate), not by position ($0).
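To make the underlying idea concrete outside Pig, here is a small Python sketch of the same technique: compute the rate as the target divided by the total count, then keep each line independently with that probability. The function name and the `seed` parameter are made up for this illustration and are not part of Pig.

```python
import random


def sample_lines(lines, target, seed=None):
    """Keep each line independently with probability target / len(lines).

    The result has roughly `target` lines, not exactly `target`.
    """
    rng = random.Random(seed)
    total = len(lines)
    if total == 0:
        return []
    # Cap the rate at 1.0 so small inputs are returned whole.
    rate = min(1.0, target / total)
    return [line for line in lines if rng.random() < rate]


if __name__ == "__main__":
    data = [f"line {i}" for i in range(100_000)]
    picked = sample_lines(data, 1000)
    print(len(picked))  # close to 1000, varies from run to run
```

As with SAMPLE, the size of the result fluctuates around the target, which is acceptable here because the count does not need to be exact.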

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I do that? I tried the following, but it doesn't pick up the data in the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = fread(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So, you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change the way the data is read, and the variable name depends on the file name and on the user.
With DLMREAD you can only read numerical data from the file. You can also skip some number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in kind of Excel style. See the DLMREAD documentation for more details.
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data. You will get "Filename must be a string", if it's not a string. Easy.
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.

sed optimization (large file modification based on smaller dataset)

I have to deal with very large plain text files (over 10 gigabytes; yes, I know it depends on what we call large) with very long lines.
My most recent task involves some line editing based on data from another file.
The data file (which should be modified) contains 1500000 lines, each about 800 characters long. Each line is unique and contains exactly one identity number; each identity number is unique.
The modifier file is about 1800 lines long; each line contains an identity number plus an amount and a date that should be modified in the data file.
I just transformed the modifier file into a sed script (with Vim regexes), but it's very inefficient.
Let's say I have a line like this in the data file:
(some 500 character)id_number(some 300 character)
And I need to modify data in the 300 char part.
Based on the modifier file, I come up with sed lines like this:
/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/
So I have 1800 lines like this.
But I know, that even on a very fast server, if I do a
sed -i.bak -f modifier.sed data.file
It's very slow, because it has to read every pattern x every line.
Isn't there a better way?
Note: I'm not a programmer, had never learnt (in school) about algorithms.
I can use awk, sed, an outdated version of perl on the server.
My suggested approaches (in order of desirability) would be to process this data as:
A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
A flat file containing fixed record lengths
A flat file containing variable record lengths
Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl.
If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable length). If you can't do that, perhaps existing data will never move around in the file? Then you can maintain the previously mentioned index and add new entries as necessary, the difference being that instead of pointing to a record number, the index now points to an absolute position in the file.
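The advantage of fixed record lengths is that a record's position can be computed instead of searched for: record N starts at byte N times the record length. A minimal Python sketch of that idea follows; the 16-byte record size, the helper names and the sample data are all hypothetical and chosen only to keep the example short.

```python
import os
import tempfile

RECORD_LENGTH = 16  # hypothetical fixed record size in bytes, newline included


def read_record(path, record_number):
    # Record N starts at byte N * RECORD_LENGTH, so no scanning is needed.
    with open(path, "rb") as fh:
        fh.seek(record_number * RECORD_LENGTH)
        return fh.read(RECORD_LENGTH)


def overwrite_field(path, record_number, field_offset, new_bytes):
    # Replace bytes in place; the record length must not change.
    if field_offset < 0 or field_offset + len(new_bytes) > RECORD_LENGTH:
        raise ValueError("field would cross the record boundary")
    with open(path, "r+b") as fh:
        fh.seek(record_number * RECORD_LENGTH + field_offset)
        fh.write(new_bytes)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "data.txt")
    with open(path, "wb") as fh:
        for i in range(3):
            # 5-byte id + 10 bytes of payload + newline = 16 bytes
            fh.write(f"id{i:03d}aaaaaaaaaa\n".encode("ascii"))
    overwrite_field(path, 1, 5, b"XYZ")
    print(read_record(path, 1))
```

With variable-length records the same approach works if the multiplication is replaced by a lookup in an index that maps each ID to its absolute byte offset.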
I suggest a program written in Perl (as I am not a sed/awk guru and I don't know what exactly they are capable of).
Your "algorithm" is simple: first of all, you need to construct a hash map that gives you the new data string to apply for each ID. This is done by reading the modifier file, of course.
Once this hash map is populated, you can go through each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.
I am not a Perl guru either, but I think the program is quite simple. If you need help writing it, ask for it :-)
With Perl you should use substr to get id_number, especially if id_number has constant width.
my $id_number = substr($str, 500, $id_number_length);
After that if $id_number is in range, you should use substr to replace remaining text.
substr($str, -300,300, $new_text);
Perl's regular expressions are very fast, but not in this case.
My suggestion is: don't use a database. A well-written Perl script will outperform a database by an order of magnitude in this sort of task. Trust me, I have a lot of practical experience with it. You will not even have finished importing the data into the database by the time Perl is done.
When you write 1500000 lines of 800 chars, that looks like 1.2GB to me. If you have a very slow disk (30MB/s), you will read it in 40 seconds. With better disks: 50MB/s -> 24s, 100MB/s -> 12s, and so on. But Perl hash lookup speed (like a DB join) on a 2GHz CPU is above 5M lookups/s. That means your CPU-bound work will take seconds and your IO-bound work will take tens of seconds. If it is really 10GB, the numbers will change but the proportion stays the same.
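The read-time arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up, and it assumes one byte per character and 1MB = 1,000,000 bytes.

```python
def read_seconds(lines, chars_per_line, mb_per_second):
    # Assumes 1 byte per character and 1 MB = 1,000,000 bytes.
    total_bytes = lines * chars_per_line
    return total_bytes / (mb_per_second * 1_000_000)


if __name__ == "__main__":
    for speed in (30, 50, 100):
        seconds = read_seconds(1_500_000, 800, speed)
        print(f"{speed} MB/s -> {seconds:.0f} s")
```

For 1500000 lines of 800 characters this gives 40, 24 and 12 seconds at 30, 50 and 100MB/s, matching the figures quoted above.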
You have not specified whether the modification changes the size of the data (that is, whether it can be done in place), so we will not assume in-place modification and will work as a filter. You have not specified the format of your "modifier file" or the sort of modification either. Assume that it is tab-separated, something like:
<id><tab><position_after_id><tab><amount><tab><data>
We will read data from stdin and write to stdout and script can be something like this:
my $modifier_filename = 'modifier_file.txt';
open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
chomp;
my ($id, $position, $amount, $data) = split /\t/;
$modifications{$id} = [$position, $amount, $data];
}
close $mf;
# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/; # compile regexp
while (<>) {
next unless m/$id_regexp/;
next unless $modifications{$1};
my ($position, $amount, $data) = @{$modifications{$1}};
substr $_, $+[1] + $position, $amount, $data;
}
continue { print }
On my laptop it takes about half a minute for 1.5 million rows, 1800 lookup IDs, and 1.2GB of data. For 10GB it should not take more than 5 minutes. Is that reasonably quick for you?
If you start to think you are not IO bound (for example, if you use a NAS) but CPU bound, you can sacrifice some readability and change it to this:
my $mod;
while (<>) {
next unless m/$id_regexp/;
$mod = $modifications{$1};
next unless $mod;
substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
You should almost certainly use a database, as MikeyB suggested.
If you don't want to use a database for some reason, then if the list of modifications will fit in memory (as it currently will at 1800 lines), the most efficient method is a hashtable populated with the modifications, as suggested by Yves Baumes.
If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge -- basically:
1. Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file
2. Adjust the record accordingly if they match
3. Write it out
4. Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file
5. Goto 1.
Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
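The list merge described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a drop-in tool: both inputs are taken to be iterables of (id, payload) pairs already sorted by ID, IDs are unique within each input, and `merge_update` and the `apply` callback are names invented for this example.

```python
def merge_update(records, modifications, apply):
    """Merge two streams of (id, payload) pairs, both sorted ascending by id.

    Yields every record once, updated via apply(old, new) when a modification
    with the same id exists. Neither input is loaded into memory as a whole.
    """
    mods = iter(modifications)
    mod = next(mods, None)
    for rec_id, payload in records:
        # Discard modifications whose id is lower than the current record's.
        while mod is not None and mod[0] < rec_id:
            mod = next(mods, None)
        if mod is not None and mod[0] == rec_id:
            payload = apply(payload, mod[1])
        yield rec_id, payload


if __name__ == "__main__":
    data = [(1, "old-1"), (2, "old-2"), (4, "old-4")]
    changes = [(2, "new-2"), (3, "new-3"), (4, "new-4")]
    for row in merge_update(data, changes, lambda old, new: new):
        print(row)
```

Because each stream is read exactly once from front to back, the memory use stays constant no matter how large the two files are; the cost is moved into sorting them beforehand.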
Good deal on the sqlloader or datadump decision. That's the way to go.
