Why does Dask's map_partitions function use more memory than looping over partitions? - memory-management

I have a parquet file of position data for vehicles that is indexed by vehicle ID and sorted by timestamp. I want to read the parquet file, do some calculations on each partition (not aggregations) and then write the output directly to a new parquet file of similar size.
I organized my data and wrote my code (below) to use Dask's map_partitions, as I understood this would perform the operations one partition at a time, saving each result to disk sequentially and thereby minimizing memory usage. I was surprised to find that this exceeded my available memory. If I instead loop over the partitions, running my code on one partition at a time and appending the output to the new parquet file (see the second code block below), it easily fits within memory.
Is there something incorrect in the original way I used map_partitions? If not, why does it use so much more memory? What is the proper, most efficient way of achieving what I want?
Thanks in advance for any insight!!
Original (memory-hungry) code:
import dask.dataframe as dd

ddf = dd.read_parquet(input_file)
meta_dict = ddf.dtypes.to_dict()
(
    ddf
    .map_partitions(my_function, meta=meta_dict)
    .to_parquet(
        output_file,
        append=False,
        overwrite=True,
        engine='fastparquet'
    )
)
Awkward looped (but more memory-friendly) code:
ddf = dd.read_parquet(input_file)

for partition in range(ddf.npartitions):
    partition_df = ddf.partitions[partition]
    (
        my_function(partition_df)
        .to_parquet(
            output_file,
            append=True,
            overwrite=False,
            engine='fastparquet'
        )
    )
More hardware and data details:
The total input parquet file is around 5GB and is split into 11 partitions of up to 900MB. It is indexed by ID with divisions so I can do vehicle grouped operations without working across partitions. The laptop I'm using has 16GB RAM and 19GB swap. The original code uses all of both, while the looped version fits within RAM.

As @MichaelDelgado pointed out, by default Dask will spin up multiple workers/threads according to what is available on the machine. With partitions of this size, that maxes out the available memory when using the map_partitions approach. To avoid this, I limited the number of workers and the number of threads per worker to prevent automatic parallelization using the code below, and the task fit in memory.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=1
)
client = Client(cluster)
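For completeness, here is how the single-worker client fits together with the original pipeline. This is a minimal sketch that reuses the input_file, output_file and my_function names from the code above:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A single worker with a single thread makes Dask process one partition at a time,
# so only one partition (plus its result) needs to be held in memory at once.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

ddf = dd.read_parquet(input_file)
(
    ddf
    .map_partitions(my_function, meta=ddf.dtypes.to_dict())
    .to_parquet(
        output_file,
        append=False,
        overwrite=True,
        engine='fastparquet'
    )
)

If you don't need the distributed scheduler or its dashboard, forcing single-threaded execution with dask.config.set(scheduler='synchronous') before calling to_parquet should have a similar effect.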

Related

Incrementally writing Parquet dataset from Python

I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even though I increased the batch size as hinted here, I am facing these issues:
RAM usage grows heavily
The writer starts to slow down after a while (write throughput drops more than 5x)
My assumption is that this is because ParquetWriter metadata management becomes expensive as the number of rows increases. I am thinking that I should switch to datasets, which would allow the writer to close the file in the middle of processing and flush out the metadata.
My questions are:
Is there an example for writing incremental datasets with Python and Parquet?
Are my assumptions correct, and would using datasets help to maintain the writer throughput?
My distilled code:
with pq.ParquetWriter(
    fname,
    Candle.to_pyarrow_schema(small_candles),
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',  # Highest available schema
    data_page_version='2.0',  # Highest available schema
) as writer:

    def writeout():
        nonlocal data
        duration = time.time() - stats["started"]
        throughput = stats["candles_processed"] / duration
        logger.info("Writing Parquet table for candle %s, throughput is %s", "{:,}".format(stats["candles_processed"]), throughput)
        writer.write_table(
            pa.Table.from_pydict(
                data,
                writer.schema
            )
        )
        data = dict.fromkeys(data.keys(), [])
        process = psutil.Process(os.getpid())
        logger.info("Flushed %s writer, the memory usage is %s", bucket, process.memory_info())

    # Use massive yield_per() or otherwise we are leaking memory
    for item in query.yield_per(100_000):
        frame = construct_frame(row_type, item)
        for key, value in frame.items():
            data[key].append(value)
        stats["candles_processed"] += 1

        # Do regular checkpoints to avoid out of memory
        # and to log the progress to the console
        # For fine tuning Parquet writer see
        # https://issues.apache.org/jira/browse/ARROW-10052
        if stats["candles_processed"] % 100_000 == 0:
            writeout()
In this case, the reason was the incorrect use of Python lists and dicts as a working buffer, as pointed out by @0x26res.
After making sure the dictionary of lists is cleared correctly, the memory consumption issues became negligible.
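For illustration, here is a minimal sketch of that buffer pitfall; the column names are made up, and the exact reset used in the fixed code is an assumption:

# dict.fromkeys() gives every key the SAME empty list object, so appends made
# under one column show up under every column and the buffer never really empties.
data = dict.fromkeys(["open", "close"], [])  # hypothetical column names
data["open"].append(1.0)
print(data["close"])  # prints [1.0]: both keys share one list object

# Resetting with a fresh, independent list per key clears the buffer properly
# and lets the old lists be garbage collected after each writeout().
data = {key: [] for key in data}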

How to handle a for loop with large objects in RStudio?

I have a for loop with large objects. According to my trial and error, I can only load the large object once. If I load the object again, I get the error "Error: cannot allocate vector of size *** Mb". I tried to overcome this issue by removing the object at the end of the for loop, but I still get "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second iteration of the for loop.
My for loop has the following structure:
for (i in 1:22) {
  VeryLargeObject <- ...i...
  ...
  .
  .
  .
  ...
  rm(VeryLargeObject)
}
The VeryLargeObjects range from 2-3GB each. My PC has 16GB of RAM, 8 cores, and 64-bit Windows 10.
Any solution on how I can manage to complete the for loop?
The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[...,...] syntax like a normal matrix. Then, in a new R session (and a new R script to preserve reproducibility), you can load this matrix from disk and modify it.
The mmap package uses a similar approach, relying on your operating system's ability to map RAM pages to disk. So the data appears to a program as if it were in RAM, but it is read from disk. To improve speed, the operating system takes care of keeping the relevant parts in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable.
Transfer your data frame into a database like SQLite and then access the database with R. The package dbplyr enables you to treat a database as a tidyverse-style data set. Here is the RStudio help page. You can also use raw SQL commands with the package DBI.
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script, named, say, processBigObject.R, that gets the file name of your big object from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <FILENAME>
input_filename <- commandArgs(trailing = TRUE)[1]
output_filename <- commandArgs(trailing = TRUE)[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...

Windows (ReFS, NTFS) file preallocation hint

Assume I have multiple processes writing large files (20GB+). Each process writes its own file, and assume that a process writes x MB at a time, then does some processing and writes x MB again, and so on.
What happens is that this write pattern causes the files to become heavily fragmented, since the file blocks get allocated incrementally as the writes come in, so blocks from the different files end up interleaved on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then set the correct size before it is closed. But now an application accessing these files remotely, which is able to parse these in-progress files, obviously sees zeroes at the end of the file and takes much longer to parse the file.
I do not have control over this reading application, so I can't optimize it to take zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" which would process my files and consolidate their blocks.
Another more drastic solution would be to implement a minifilter driver which would report a "fake" filesize.
But both solutions listed above are far from optimal. So I would like to know whether there is a way to provide a file size hint to the filesystem so that it "reserves" the consecutive space on the drive, but still reports the right file size to applications.
Obviously, writing larger chunks at a time also helps with fragmentation, but it still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed, I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf='A';
DWORD written=0;
DWORD tstart;
std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);
printf("file extended, elapsed: %d\n",GetTickCount()-tstart);
getchar();
printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END);
WriteFile(f, &buf,1,&written,NULL);
printf("written: %d bytes, elapsed: %d\n",written,GetTickCount()-tstart);
When the application is executed and waits for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated.
When pressing Enter to allow the test to continue (and waiting for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would very much appreciate it if answers/comments also provided an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture).
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.

Neo4J tuning or just more RAM?

I have a Neo4j Enterprise database running on a DigitalOcean VPS with 8GB RAM and an 80GB SSD.
The performance of the Neo4J instance is awful at the moment:
match (n) where n.gram='0gram' AND n.word=~'a.' return n.word LIMIT 5 # 349ms
match (n) where n.gram='0gram' AND n.word=~'a.*' return n.word LIMIT 25 # 1588ms
I understand regexes are expensive, but on similar queries where I replace the 'a.' or 'a.*' part with any other letter, Neo4j simply crashes. I can see a huge build-up in memory before that (towards 90%), and the CPU sky-rocketing.
My Neo4j is populated as follows:
Number Of Relationship Type Ids In Use: 1,
Number Of Node Ids In Use: 172412046,
Number Of Relationship Ids In Use: 172219328,
Number Of Property Ids In Use: 344453742
The VPS only runs Neo4j (on Debian 7/amd64). I use the NUMA+parallelGC flags as they're supposed to be faster. I've been tweaking my RAM settings, and although it doesn't crash as often now, I have a feeling there are still gains to be made:
neostore.nodestore.db.mapped_memory=1024M
neostore.relationshipstore.db.mapped_memory=2048M
neostore.propertystore.db.mapped_memory=6144M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
# caching
cache_type=hpc
node_cache_array_fraction=7
relationship_cache_array_fraction=5
# node_cache_size=3G
# relationship_cache_size=1G --> these throw a not-enough-heap-mem error
The data is essentially a series of trees, where node0 requires only a full-text search and the nodes below it are searched by a property with floating-point values.
node0 -REL-> node0.1 -REL-> node0.1.1 ... node0.1.1.1.1
\
-REL-> node0.2 -REL-> node0.2.1 ... node0.2.1.1
There are approx. 5,000 top nodes like node0.
Should I reconfigure my memory/cache usage, or should I just add more RAM?
--- Edit on Indexes ---
Because all trees of nodes are always 4 levels deep, each level has a label for quick finding. In this case all node0 nodes have a label (called 0gram). The n.gram='0gram' predicate should use the index coupled to the label.
--- Edit on new Config ---
I upgraded the VPS to 16GB. The nodeStore takes 2.3GB (11%), the propertyStore 13.8GB (64%) and the relationshipStore 5.6GB (26%) on the SSD.
On this basis I created a new config (detailed above).
I'm waiting for the full set of queries and will do some additional testing in the meantime.
Yes, you need to create an index. What's your label called? Imagine it's called :NGram:
create index on :NGram(gram);
match (n:NGram) where n.gram='0gram' AND n.word=~'a.' return n.word LIMIT 5
match (n:NGram) where n.gram='0gram' AND n.word=~'a.*' return n.word LIMIT 25
What you're doing is not a graph search but just a lookup via a full scan plus a property comparison with a regexp. Not a very efficient operation. What you need is full-text search (which is not supported by the new schema indexes, but still is by the legacy indexes).
Could you run this query (after you created the index) and say how many nodes it returns?
match (n:NGram) where n.gram='0gram' return count(*)
which is equivalent to
match (n:NGram {gram:'0gram'}) return count(*)
I wrote a blog post about it a few days ago, please read it and see if it applies to your case.
How big is your Neo4j database on disk?
What is the configured heap size? (in neo4j-wrapper.conf?)
As you can see, you use more RAM than your machine has (not even counting OS or filesystem caches).
So you would have to reduce the mmio sizes, e.g. to 500M for nodes, 2G for rels and 1G for properties.
Look at your store-file sizes and set mmio accordingly.
Depending on the number of nodes having n.gram='0gram', you might benefit a lot from setting a label on them and an index for the gram property. If you have this in place, an index lookup will directly return all 0gram nodes and apply regex matching only on those. Your current statement will load each and every node from the db and inspect its properties.

MATLAB: accessing a loaded MAT file is very slow

I'm currently working on a project involving saving/loading quite big MAT files (around 150 MB), and I realized that it was much slower to access a loaded cell array than the equivalent version created inside a script or a function.
I created this example to simulate my code and show the difference:
clear; clc;
disp('Test for computing with loading');

if exist('data.mat', 'file')
    delete('data.mat');
end

n_tests = 10000;
data = {};
for i = 1:n_tests
    data{end+1} = rand(1, 4096);
end

% disp('Saving data');
% save('data.mat', 'data');
% clear('data');
%
% disp('Loading data');
% load('data.mat', '-mat');

for i = 1:n_tests
    tic;
    for j = 1:n_tests
        d = sum((data{i} - data{j}) .^ 2);
    end
    time = toc;
    disp(['#' num2str(i) ' computed in ' num2str(time) ' s']);
end
In this code, no MAT file is saved or loaded. The average time for one iteration over i is 0.75s. When I uncomment the lines to save/load the file, one iteration over i takes about 6.2s (the saving/loading time is not taken into consideration). That is about 8x slower!
I'm using MATLAB 7.12.0 (R2011a) 64 bits with Windows 7 64 bits, and the MAT files are saved with the version v7.3.
Can it be related to the compression of the MAT file? Or to caching of variables?
Is there any way to prevent/avoid this ?
I also know this problem. I think it's also related to MATLAB's inefficient memory management - as I remember, it doesn't do well with swapping.
A 150MB file can easily hold a lot of data - maybe more than can be quickly allocated.
I just made a quick calculation for your example using the information from MathWorks.
In your case, total_size = n_tests*121 + n_tests*(1*4096*8) is about 313MB.
First I would suggest saving the data in format 7 (instead of 7.3) - I noticed very poor performance when reading this newer format. That alone could be the reason for your slowdown.
Personally I solved this in two ways:
Split the data into smaller sets and then use functions that load the data when needed or create it on the fly (can be elegantly done with classes)
Move the data into a database. SQLite and MySQL are great. Both work efficiently with MUCH larger datasets (in the TBs instead of GBs). And the SQL language is quite efficient to quickly get subsets to manipulate.
I tested this code with 64-bit Windows and 64-bit MATLAB R2014b.
Without saving and loading, the computation takes around 0.22s.
Saving the data file with '-v7' and then loading, the computation takes around 0.2s.
Saving the data file with '-v7.3' and then loading, the computation takes around 4.1s.
So it is related to the compression of the MAT file.
