Handle a huge number of files in a directory in Mathematica - for loop

I have a huge number of data files (.csv) saved in one directory, and I want to fit and evaluate several parameters for each file. Since there are over 300,000 files in the directory, Mathematica is not able to run my script. My first attempt was to set the directory to this folder and then import each file on its own through a For loop (For[i = 1, i <= imax, i++], where imax is the number of files), do the whole fitting and evaluation, continue the loop, import the next file, and so on, in order to save memory. Unfortunately, this approach didn't work at all and Mathematica crashed almost immediately.
So my question is: can I somehow handle such a huge number of files in a single directory without running out of memory?

The method below loads all the data from all the CSVs into a function variable called data. So if you have CSVs called file1.csv and file2.csv, their data will be loaded into variables called data["file1.csv"] and data["file2.csv"]. Then, for example, the data in the first column of every CSV is read into a variable called xvalues and the data in the second column of every CSV into a variable called yvalues.
SetDirectory["C:\\Users\\yourname\\datadirectory"];
files = FileNames["*.csv"];
(data[#] = Import[#]) & /@ files;
xvalues = yvalues = {};
(xvalues = Join[xvalues, data[#][[All, 1]]]) & /@ files;
xvalues = Flatten[xvalues];
(yvalues = Join[yvalues, data[#][[All, 2]]]) & /@ files;
yvalues = Flatten[yvalues];
A fit can then be calculated.
fit = Fit[Transpose[{xvalues, yvalues}], {1, x}, x]
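If holding all of the data in memory at once is the problem, a per-file loop that imports, fits, and discards each dataset may be gentler on memory. A minimal sketch, assuming (as above) that the x values are in column 1 and the y values in column 2; only the fitted expressions are kept, and each file's raw data goes out of scope after its iteration:
SetDirectory["C:\\Users\\yourname\\datadirectory"];
files = FileNames["*.csv"];
fits = Table[
   Module[{data = Import[files[[i]], "CSV"]},
    (* fit this file's data, then let it be released *)
    Fit[data[[All, {1, 2}]], {1, x}, x]],
   {i, Length[files]}];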

Related

How to write a for loop to run a regression on multiple files contained in the same folder?

I need to develop a script to run a simple OLS on multiple csv files stored in the same folder.
All have the same column names, and the regression will always be based upon the same columns ("x_var" and "y_var").
The code below is used to read in the csvs and rename them.
## Read in files from folder
file.List <- list.files(pattern = "*.csv")
for (i in 1:length(file.List)) {
  assign(gsub(".csv", "", file.List[i]), read.csv(file.List[i]))
}
However, after this (very initial!) stage I've got a bit lost.
Each dataframe has seven identical columns: a, b, c, d, x_var, e, y_var.
I need to run a simple OLS using lm(x_var ~ y_var, data = dataframe) on each dataframe and plot the result, and I assumed a for loop would be the best option, but I am not too sure how to do so.
After each regression is run I want to extract the coefficients/R2 etc. into a csv and save the plot separately.
I tried the code below, but it has gone very wrong and is not working at all:
list <- list(gsub(" finalSIRTAnalysis.csv", "", file.List))
for (i in length(file.List)) {
  lm(x_var ~ y_var, data = [i])
}
I can't even make a start on this and need some advice, if anyone has any good ideas (such as creating an external function first).
I am not sure whether the function lm can compute results from multiple data sources at once. Try merging the data first. I have a similar issue because I have 5k files and it is computationally impossible to merge them all. But maybe this answer can help you:
https://stackoverflow.com/a/63770065/14744492
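For the loop itself, a minimal sketch along these lines may work, assuming the columns are named x_var and y_var as described and that y_var is the response; it collects the coefficients and R-squared into one summary csv and saves each plot as a separate PNG:
## Fit an OLS per file, collect coefficients/R2, and save one plot per file
file.List <- list.files(pattern = "*.csv")
results <- data.frame()
for (f in file.List) {
  df  <- read.csv(f)
  fit <- lm(y_var ~ x_var, data = df)  # swap the formula if x_var is the response
  s   <- summary(fit)
  results <- rbind(results,
                   data.frame(file      = f,
                              intercept = coef(fit)[1],
                              slope     = coef(fit)[2],
                              r.squared = s$r.squared))
  png(paste0(gsub(".csv", "", f), "_plot.png"))
  plot(df$x_var, df$y_var, main = f)
  abline(fit)
  dev.off()
}
write.csv(results, "regression_results.csv", row.names = FALSE)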

Differential file saving algorithm or tool

I'm looking for any information or algorithms that allow differential file saving and merging.
To be more clear: when the content of a file is modified, the original file should stay the same and every modification must be saved in a separate file (the same idea as a differential backup, but for individual files); when the file is accessed, the latest version should be reconstructed from the original file and the last differential file.
What I need to do is described in the diagram below:
For calculating diffs you could use something like diff_match_patch.
For each file version you could store a series of DeltaDiffs.
A DeltaDiff would be a tuple of one of two types: INSERT or DELETE.
Then you could store the series of DeltaDiffs as follows:
Diff = [DeltaDiff_1, DeltaDiff_2, ..., DeltaDiff_n] = [
    (INSERT, byte offset relative to the initial file, bytes),
    (DELETE, byte offset relative to the initial file, length),
    ...
]
Applying the DeltaDiffs to the initial file would give you the next file version, and so on, for example:
FileVersion1 + Diff1 -> FileVersion2 + Diff2 -> FileVersion3 + ....
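As one concrete illustration (a sketch, not a complete tool), the diff_match_patch library mentioned above can serialize each modification as a patch and replay it later; this assumes text files and uses the Python port of the library:
# pip install diff-match-patch
from diff_match_patch import diff_match_patch

dmp = diff_match_patch()

def make_delta(old_text, new_text):
    """Serialize the changes between two versions as a patch string."""
    patches = dmp.patch_make(old_text, new_text)
    return dmp.patch_toText(patches)

def apply_deltas(original, deltas):
    """Rebuild the latest version from the original plus the saved deltas."""
    text = original
    for delta in deltas:
        patches = dmp.patch_fromText(delta)
        text, _results = dmp.patch_apply(patches, text)
    return text

# Usage: keep the original file untouched, store each make_delta() result as a
# separate diff file, and call apply_deltas() when the latest version is needed.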

How to speed up XQuery string search?

I am using XQuery/BaseX to look through large XML files to find historical data for some counters. All the files are zipped and stored somewhere on a drive. The important part of a file looks as follows:
<measInfo xmlns="http://www.hehe.org/foo/" measInfoId="uplink speed">
<granPeriod duration="DS222S" endTime="2020-09-03T08:15:00+02:00"/>
<repPeriod duration="DS222S"/>
<measTypes>AFD123 AFD124 AFD125 AFD156</measTypes>
<measValue measObjLdn="PLDS-PLDS/STBHG-532632">
<measResults>23 42 12 43</measResults>
</measValue>
</measInfo>
I built the following query:
declare default element namespace "http://www.hehe.org/foo/";
let $sought := ["AFD124", "AFD125"]
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $filename := concat(convert:dateTime-to-integer(current-dateTime()), ".xml")
for $meas in $datasource/measCollecFile/measData/measInfo return
for $measType at $i in $meas/tokenize(measTypes)[. = $sought] return
file:append($filename,
<meas
measInfoId="{data($meas/#measInfoId)}"
measObjLdn="{data($meas/measValue/#measObjLdn)}"
>
{$meas/granPeriod}
{$meas/repPeriod}
<measType>{$measType}</measType>
<measValue>{$meas/measValue/tokenize(measResults, " ")[$i]}</measValue>
</meas>)
The script works, but it takes a lot of time for some counters (measType). I read the documentation about indexing, and my idea is to somehow index all the measTypes (parts of the string) so that, when I need to look through the whole archive for a counter, it can be accessed quickly. Is that possible when operating directly on archives, or would I have to create a new database from them? I would prefer not to, due to the size of the files. How do I create indexes for such a case?
This is not the answer to my question, but I have noticed that the execution time is much longer when I write XML nodes to a file. It is much faster to append any other string to a file:
concat($measInfo/@measInfoId, ",", $measInfo/measValue/@measObjLdn, ",",
$measInfo/granPeriod, ",", $measInfo/repPeriod, ",", $measType, ",",
$tokenizedValues[$i], "
"))
Why is that, and how can writing XML nodes to a file be sped up?
Also, I have noticed that appending a value to a file inside the for loop takes much longer, and I suspect that is because the file has to be opened again in each iteration. Is there a way to keep the file open throughout the whole query?
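One way around the per-iteration append (a sketch, assuming the same input structure and BaseX's File module) is to build all of the result elements first and write the file once at the end with file:write, instead of calling file:append inside the loop:
declare default element namespace "http://www.hehe.org/foo/";
let $sought := ("AFD124", "AFD125")
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $filename := concat(convert:dateTime-to-integer(current-dateTime()), ".xml")
let $results :=
  for $meas in $datasource/measCollecFile/measData/measInfo
  for $measType at $i in $meas/tokenize(measTypes)[. = $sought]
  return
    <meas measInfoId="{data($meas/@measInfoId)}"
          measObjLdn="{data($meas/measValue/@measObjLdn)}">
      {$meas/granPeriod}
      {$meas/repPeriod}
      <measType>{$measType}</measType>
      <measValue>{$meas/measValue/tokenize(measResults, ' ')[$i]}</measValue>
    </meas>
return file:write($filename, <results>{$results}</results>)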

Access a file inside a folder, MATLAB, Mac

I have about 300 files I would like to access and import into MATLAB; each of these files sits inside its own folder.
The first file lies in the directory users/matt/Documents/folder_1 with the filename line.csv, the second file lies in users/matt/Documents/folder_2 with the filename line.csv, and so on.
I would like to import the data from the 300 line.csv files into MATLAB so I can take the average value. Is this possible? I am using Mac OS X, by the way.
I know what to do with the .csv files, but I have no clue how to access them efficiently.
This should work: All we are doing is generating the string for every file path using sprintf and the loop index i, and then reading the csv file using csvread and storing the data in a cell array.
data = cell(1, 300);  % Preallocate the cell array.
for i = 1:300         % Loop 300 times.
    % Full path pointing to the csv file.
    file_path = sprintf('users/matt/Documents/folder_%d/line.csv', i);
    % Read data from csv and store it in the cell array.
    data{i} = csvread(file_path);
end
% Do your computations here.
% ...
Remember to replace 300 with the actual number of folders that you have.
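Since the goal is an average, something along these lines could follow the loop (a sketch, assuming every line.csv contains only numeric values):
% Flatten each file's matrix into a column vector, stack them all, and average.
cols = cellfun(@(m) m(:), data, 'UniformOutput', false);
overall_mean = mean(vertcat(cols{:}));
% Or, if one average per file is wanted instead:
file_means = cellfun(@(m) mean(m(:)), data);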

Compressed array of file paths and random access

I'm developing a file management Windows application. The program should keep an array of paths to all files and folders that are on the disk. For example:
0 "C:"
1 "C:\abc"
2 "C:\abc\def"
3 "C:\ghi"
4 "C:\ghi\readme.txt"
The array "as is" will be very large, so it should be compressed and stored on the disk. However, I'd like to have random access to it:
to retrieve any path in the array by index (e.g., RetrievePath(2) = "C:\abc\def")
to find index of any path in the array (e.g., IndexOf("C:\ghi") = 3)
to add a new path to the array (indexes of any existing paths should not change), e.g., AddPath("C:\ghi\xyz\file.dat")
to rename some file or folder in the database;
to delete existing path (again, any other indexes should not change).
For example, delete path 1 "C:\abc" from the database and still have 4 "C:\ghi\readme.txt".
Can someone suggest some good algorithm/data structure/ideas to do these things?
Edit:
At the moment I've come up with the following solution:
0 "C:"
1 "[0]\abc"
2 "[1]\def"
3 "[0]\ghi"
4 "[3]\readme.txt"
That is, common prefixes are compressed.
RetrievePath(2) = "[1]\def" = RetrievePath(1) + "\def" = "[0]\abc\def" = RetrievePath(0) + "\abc\def" = "C:\abc\def"
IndexOf() also works iteratively, something like that:
IndexOf("C:") = 0
IndexOf("C:\abc") = IndexOf("[0]\abc") = 1
IndexOf("C:\abc\def") = IndexOf("[1]\def") = 2
To add new path, say AddPath("C:\ghi\xyz\file.dat"), one should first add its prefixes:
5 [3]\xyz
6 [5]\file.dat
Renaming/moving file/folder involves just one replacement (e.g., replacing [0]\ghi with [1]\klm will rename directory "ghi" to "klm" and move it to the directory "C:\abc")
DeletePath() involves setting it (and all subpaths) to empty strings. In future, they can be replaced with new paths.
After DeletePath("C:\abc"), the array will be:
0 "C:"
1 ""
2 ""
3 "[0]\ghi"
4 "[3]\readme.txt"
The whole array still needs to be loaded into RAM to perform fast operations. With, for example, 1,000,000 files and folders in total and an average name length of 10, the array will occupy over 10 MB.
Also, the function IndexOf() is forced to scan the array sequentially.
Edit (2): I just realised that my question can be reformulated:
How can I assign each file and each folder on the disk a unique integer index so that I can quickly find a file/folder by its index, find the index of a known file/folder, and perform basic file operations without changing many indices?
Edit (3): Here is a question about a similar but Linux-related problem. It is suggested to use filename and content hashing to identify the file. Are there any Windows-specific improvements?
Your solution seems decent. You could also try to compress more using ad-hoc tricks, such as using only a few bits for common characters like "\", drive letters, and maybe common file extensions. You could also have a look at tries (http://en.wikipedia.org/wiki/Trie).
Regarding your second edit, this seems to match the features of a hash table, but this is for indexing, not compressed storage.
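To make IndexOf() constant time while keeping the prefix-compressed entries and stable indices from the question, the table can be paired with a hash map from full path to index. A minimal sketch (in Python, purely to illustrate the structure; names such as PathTable are made up here):
class PathTable:
    def __init__(self):
        self.entries = []   # index -> (parent_index, name); parent_index -1 = root
        self.index_of = {}  # full path -> index (the hash map for fast IndexOf)

    def add_path(self, path):
        """Add a path (and any missing prefixes); return its index."""
        if path in self.index_of:
            return self.index_of[path]
        parent, sep, name = path.rpartition("\\")
        parent_idx = self.add_path(parent) if sep else -1
        idx = len(self.entries)
        self.entries.append((parent_idx, name if sep else path))
        self.index_of[path] = idx
        return idx

    def retrieve_path(self, idx):
        """Rebuild the full path by following parent links, as in the question."""
        parent_idx, name = self.entries[idx]
        if parent_idx == -1:
            return name
        return self.retrieve_path(parent_idx) + "\\" + name

# Example: indices stay stable as new paths are added.
t = PathTable()
t.add_path("C:")            # index 0
t.add_path("C:\\abc\\def")  # adds "C:\abc" (1) and "C:\abc\def" (2)
print(t.retrieve_path(2))   # C:\abc\def
print(t.index_of["C:\\abc"])  # 1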
