Elixir process taking up too much memory - caching

I am reading postcodes from a csv file, taking that data and caching it with ets.
The postcode file is quite large (95MB) as it contains about 1.8 million entries.
I am only caching the postcodes that are needed for look ups at the moment (about 200k) so the amount of data stored in ets should not be an issue. However no matter how small the number of inserts into ets is, the amount of memory taken up by the process is virtually unchanged. Doesn't seem to matter if I insert 1 row or all 1.8 million.
# not logging all functions defs so it is not too long.
# Comment if more into is needed.
defmodule PostcodeCache do
use GenServer
def cache_postcodes do
"path_to_postcode.csv"
|> File.read!()
|> function_to_parse()
|> function_to_filter()
|> function_to_format()
|> Enum.map(&(:ets.insert_new(:cache, &1)))
end
end
I am running this in the terminal with iex -S mix and running the command :observer.start. When I go to the processes tab, my postcodeCache memory is massive (over 600MB)
Even if I filter the file so I only end up storing 1 postcode in :ets it is still over 600MB.

I realised that the error I was making was when I was looking at the memory of the process and assuming that it was to do with the cache.
Because this is a GenServer it is holding onto all the information from csv file when it is read (File.read!) and also appears to be holding onto all changes made to that file as well.
How I have solved this is by changing the File.read! to a File.stream!. I then use Enum.each instead of mapping over the returned data.
In the each I check the postcode is what I want and if it is I then insert it into ets.
def cache_postcodes do
"path_to_postcode.csv"
|> File.stream!()
|> Enum.each(fn(line) ->
value_to_store = some_check_on_line(line)
:ets.insert_new(:cache, &1)
end)
end
With this approach my processes memory is now only about 2MB (not 632MB) and my ets memory is about 30MB. That is about what I would expect.

Related

Why does Dask's map_partitions function use more memory than looping over partitions?

I have a parquet file of position data for vehicles that is indexed by vehicle ID and sorted by timestamp. I want to read the parquet file, do some calculations on each partition (not aggregations) and then write the output directly to a new parquet file of similar size.
I organized my data and wrote my code (below) to use Dask's map_partitions, as I understood this would perform the operations one partition at a time, saving each result to disk sequentially and thereby minimizing memory usage. I was surprised to find that this was exceeding my available memory and I found that if I instead create a loop that runs my code on a single partition at a time and appends the output to the new parquet file (see second code block below), it easily fits within memory.
Is there something incorrect in the original way I used map_partitions? If not, why does it use so much more memory? What is the proper, most efficient way of achieving what I want?
Thanks in advance for any insight!!
Original (memory hungry) code:
ddf = dd.read_parquet(input_file)
meta_dict = ddf.dtypes.to_dict()
(
ddf
.map_partitions(my_function, meta = meta_dict)
.to_parquet(
output_file,
append = False,
overwrite = True,
engine = 'fastparquet'
)
)
Awkward looped (but more memory friendly) code:
ddf = dd.read_parquet(input_file)
for partition in range(0, ddf.npartitions, 1):
partition_df = ddf.partitions[partition]
(
my_function(partition_df)
.to_parquet(
output_file,
append = True,
overwrite = False,
engine = 'fastparquet'
)
)
More hardware and data details:
The total input parquet file is around 5GB and is split into 11 partitions of up to 900MB. It is indexed by ID with divisions so I can do vehicle grouped operations without working across partitions. The laptop I'm using has 16GB RAM and 19GB swap. The original code uses all of both, while the looped version fits within RAM.
As #MichaelDelgado pointed out, by default Dask will spin up multiple workers/threads according to what is available on the machine. With the size of the partitions I have, this maxes out the available memory when using the map_partitions approach. In order to avoid this, I limited the number of workers and the number of threads per worker to prevent automatic parellelization using the code below, and the task fit in memory.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
n_workers = 1,
threads_per_worker = 1)
client = Client(cluster)

Incrementally writing Parquet dataset from Python

I am writing a larger than RAM data out from my Python application - basically dumping data from SQLAlchemy to Parque. My solution was inspired by this question. Even though increasing the batch size as hinted here I am facing the issues:
RAM usage grows heavily
The writer starts to slow down after a while (write throughput speed drops more than 5x)
My assumption is that this is because the ParquetWriter metadata management becomes expensive when the number of rows increase. I am thinking that I should switch to datasets that would allow the writer to close the file in the middle of processing flush out the metadata.
My question is
Is there an example for writing incremental datasets with Python and Parquet
Are my assumptions correct or incorrect and using datasets would help to maintain the writer throughput?
My distilled code:
writer = pq.ParquetWriter(
fname,
Candle.to_pyarrow_schema(small_candles),
compression='snappy',
allow_truncated_timestamps=True,
version='2.0', # Highest available schema
data_page_version='2.0', # Highest available schema
) as writer:
def writeout():
nonlocal data
duration = time.time() - stats["started"]
throughout = stats["candles_processed"] / duration
logger.info("Writing Parquet table for candle %s, throughput is %s", "{:,}".format(stats["candles_processed"]), throughout)
writer.write_table(
pa.Table.from_pydict(
data,
writer.schema
)
)
data = dict.fromkeys(data.keys(), [])
process = psutil.Process(os.getpid())
logger.info("Flushed %s writer, the memory usage is %s", bucket, process.memory_info())
# Use massive yield_per() or otherwise we are leaking memory
for item in query.yield_per(100_000):
frame = construct_frame(row_type, item)
for key, value in frame.items():
data[key].append(value)
stats["candles_processed"] += 1
# Do regular checkopoints to avoid out of memory
# and to log the progress to the console
# For fine tuning Parquet writer see
# https://issues.apache.org/jira/browse/ARROW-10052
if stats["candles_processed"] % 100_000 == 0:
writeout()
In this case, the reason was the incorrect use of Python lists and dicts as a working buffer, as pointed out by #0x26res.
After making sure the dictionary of lists is cleared correctly, the memory consumption issues become negligible.

Storing the data using Parallel Programming

The requirement is as follows:
We have a third party client from where we need to take the data and store in the database(ultimately).
The way client is sharing data to us is through the dll's having functions(the dll is built out of C++ code) and we need to call those functions with appropriate parameters and we get the result.
Declare Function wcmo_deal Lib "E:\IGB\System\Intex\vcmowrap\vcmowr64.dll" (
ByRef WCMOarg_Handle As String,
ByRef WCMOarg_User As String,
ByRef WCMOarg_Options As String,
ByRef WCMOarg_Deal As String,
ByRef WCMOarg_DataOut As String,
ByRef WCMOarg_ErrOut As String) _
As Long
wcmo_deal(wcmo_deal_WCMOarg_Handle, WCMOarg_User, WCMOarg_Options, WCMOarg_Deal, WCMOarg_DataOut, WCMOarg_ErrOut)
Here WCMOarg_DataOut is the data we get and that needs to be stored.
Similar to the above method we have 10 more methods(so,total 11 methods) which pull the data and that data(string of around 500 KB to 1 MB each) is stored in the files using the below method :
File.WriteAllText(logPath & sDealName & ".txt", sDealName & " - " & WCMOarg_ErrOut & vbCrLf)
Now these method calls run for each deal. So for a single deal we get output in 11 different folders with the text file stored with the data received from the client.
There are 5000 deals totally for which we need to call these methods and the data gets stored in the files.
The way this functionality has been implemented is by using Parallel Programming with Master-Child relationship as follows:
Dim opts As New ParallelOptions
opts.MaxDegreeOfParallelism = System.Environment.ProcessorCount
Parallel.ForEach(dealList, opts, Sub(deal)
If Len(deal) > 0 Then
Dim dealPass As String = ""
Try
If dealPassDict.ContainsKey(deal.ToUpper) Then
dealPass = dealPassDict(deal.ToUpper)
End If
Dim p As New Process()
p.StartInfo.FileName = "E:\IGB_New\CMBS Intex Data Deal v2.0.exe"
p.StartInfo.Arguments = deal & "|" & keycode & "|" & dealPass & "|" & clArgs(1) & "|"
p.StartInfo.UseShellExecute = False
p.StartInfo.CreateNoWindow = True
p.Start()
p.WaitForExit()
Catch ex As Exception
exceptions.Enqueue(ex)
End Try
End If
End Sub)
where CMBS Intex Data Deal v2.0.exe is the child code which will execute 5000 times as deallist contains 5000 deals.
The CMBS Intex Data Deal v2.0.exe code contains the code of calling the dlls and storing the data in the files mentioned above.
Issues faced :
The code was run keeping the Master and Child code in one single place but we get out of memory exception after 3000 deals.[for 32 GB RAM,Processor Count =16]
The above code(Master-Child) is also taking up a lot of memory, it runs fine upto 4800 deals in one hour(the memory usage gradually reaches 100% at around 4800 deals) and then for the remaining 200 deals it takes close to 1 hour(so , totally 2 hours).[for 32 GB RAM,Processor Count =16]
The reason Master child was tried was on the assumption that GC will take care of the Memory disposal of all the objects in the Child.
After the data is stored in text files, a perl script runs and loads the data into the database.
Approach tried:
Instead of keeping the data in text files and then storing into the database, I tried storing the data into the DB directly without putting them in the files (assuming I/O operations consume a lot of memory),but this too didnt work as the DB crashed/Hangs everytime.
Note:
All the handles related to the DLL is properly being closed.
The call to the DLL's method consume a lot of memory,but nothing can be done to reduce it as it cant be controlled by us.
The reason to use Parallel approach is if we go with sequential approach, it would take many hours to fetch and load the data and we need to run this twice a day as the data keeps changing, so need to be up-to-date with the latest data from the client.
There was a CPU maxing out issue as well but that has been resolved by keeping the MaxDegreeOfParallelism = System.Environment.ProcessorCount.
Question :
Is there a way to reduce the time taken by the process to complete.
Currently it takes 2 hours to complete but that could be due to no memory remaining as it reaches 4800 deals, and without any memory it cannot process any further.
Is there a way to reduce memory consumption here by trying out a different way to execute this or there is something if changed in the same code could make it work?
Parallelism is most likely totally useless. You are bound by the IO, not by the CPU. The IO is the bottleneck and parallelization might even make it worse. You might experiment by using a RAM drive and then copy all the output to the actual storage.

How to handle for loop with large objects in Rstudio?

I have a for loop with large objects. According to my trial-and-error, I can only load the large object once. If I load the object again, I would be returned the error "Error: cannot allocate vector of size *** Mb". I tried to overcome this issue by removing the object at the end of the for loop. However, I am still returned the error "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second run of the for loop.
My for loop has the following structure:
for (i in 1:22) {
VeryLargeObject <- ...i...
...
.
.
.
...
rm(VeryLargeOjbect)
}
The VeryLargeObjects ranges from 2-3GB. My PC has RAM of 16Gb, 8 cores, 64-bit Win10.
Any solution on how I can manage to complete the for loop?
The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[...,...] syntax like a normal matrix. Then, in a new R session (and a new R script to preserve reproducibility), you can load this matrix from disk and modify it.
The mmap package uses a similar approach, using your operating system's ability to map RAM pages to disk. So they appear to a program like they are in ram, but are read from disk. To improve speed, the operating system takes care of keeping the relevant parts in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable.
Transfer your data frame into a data base like sqlite and then access the data base with R. The package dbplyr enables you to treat a data base as a tidyverse-style data set. Here is the RStudio help page. You can also use raw SQL commands with the package DBI
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script, named, say processBigObject.R that gets the file name of your big object from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <FILENAME>
input_filename <- commandArgs(trailing = TRUE)[1]
output_filename <- commandArgs(trailing = TRUE)[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...

Ruby Memory Management

I have been using Ruby for a while now and I find, for bigger projects, it can take up a fair amount of memory. What are some best practices for reducing memory usage in Ruby?
Please, let each answer have one "best practice" and let the community vote it up.
When working with huge arrays of ActiveRecord objects be very careful... When processing those objects in a loop if on each iteration you are loading their related objects using ActiveRecord's has_many, belongs_to, etc. - the memory usage grows a lot because each object that belongs to an array grows...
The following technique helped us a lot (simplified example):
students.each do |student|
cloned_student = student.clone
...
cloned_student.books.detect {...}
ca_teachers = cloned_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
ca_teachers.blah_blah
...
# Not sure if the following is necessary, but we have it just in case...
cloned_student = nil
end
In the code above "cloned_student" is the object that grows, but since it is "nullified" at the end of each iteration this is not a problem for huge array of students. If we didn't do "clone", the loop variable "student" would have grown, but since it belongs to an array - the memory used by it is never released as long as array object exists.
Different approach works too:
students.each do |student|
loop_student = Student.find(student.id) # just re-find the record into local variable.
...
loop_student.books.detect {...}
ca_teachers = loop_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
ca_teachers.blah_blah
...
end
In our production environment we had a background process that failed to finish once because 8Gb of RAM wasn't enough for it. After this small change it uses less than 1Gb to process the same amount of data...
Don't abuse symbols.
Each time you create a symbol, ruby puts an entry in it's symbol table. The symbol table is a global hash which never gets emptied.
This is not technically a memory leak, but it behaves like one. Symbols don't take up much memory so you don't need to be too paranoid, but it pays to be aware of this.
A general guideline: If you've actually typed the symbol in code, it's fine (you only have a finite amount of code after all), but don't call to_sym on dynamically generated or user-input strings, as this opens the door to a potentially ever-increasing number
Don't do this:
def method(x)
x.split( doesn't matter what the args are )
end
or this:
def method(x)
x.gsub( doesn't matter what the args are )
end
Both will permanently leak memory in ruby 1.8.5 and 1.8.6. (not sure about 1.8.7 as I haven't tried it, but I really hope it's fixed.) The workaround is stupid and involves creating a local variable. You don't have to use the local, just create one...
Things like this are why I have lots of love for the ruby language, but no respect for MRI
Beware of C extensions which allocate large chunks of memory themselves.
As an example, when you load an image using RMagick, the entire bitmap gets loaded into memory inside the ruby process. This may be 30 meg or so depending on the size of the image.
However, most of this memory has been allocated by RMagick itself. All ruby knows about is a wrapper object, which is tiny(1).
Ruby only thinks it's holding onto a tiny amount of memory, so it won't bother running the GC. In actual fact it's holding onto 30 meg.
If you loop over a say 10 images, you can run yourself out of memory really fast.
The preferred solution is to manually tell the C library to clean up the memory itself - RMagick has a destroy! method which does this. If your library doesn't however, you may need to forcibly run the GC yourself, even though this is generally discouraged.
(1): Ruby C extensions have callbacks which will get run when the ruby runtime decides to free them, so the memory will eventually be successfully freed at some point, just perhaps not soon enough.
Measure and detect which parts of your code are creating objects that cause memory usage to go up. Improve and modify your code then measure again. Sometimes, you're using gems or libraries that use up a lot of memory and creating a lot of objects as well.
There are many tools out there such as busy-administrator that allow you to check the memory size of objects (including those inside hashes and arrays).
$ gem install busy-administrator
Example # 1: MemorySize.of
require 'busy-administrator'
data = BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
puts BusyAdministrator::MemorySize.of(data)
# => 10 MiB
Example # 2: MemoryUtils.profile
Code
require 'busy-administrator'
results = BusyAdministrator::MemoryUtils.profile(gc_enabled: false) do |analyzer|
BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
end
BusyAdministrator::Display.debug(results)
Output:
{
memory_usage:
{
before: 12 MiB
after: 22 MiB
diff: 10 MiB
}
total_time: 0.406452
gc:
{
count: 0
enabled: false
}
specific:
{
}
object_count: 151
general:
{
String: 10 MiB
Hash: 8 KiB
BusyAdministrator::MemorySize: 0 Bytes
Process::Status: 0 Bytes
IO: 432 Bytes
Array: 326 KiB
Proc: 72 Bytes
RubyVM::Env: 96 Bytes
Time: 176 Bytes
Enumerator: 80 Bytes
}
}
You can also try ruby-prof and memory_profiler. It is better if you test and experiment different versions of your code so you can measure the memory usage and performance of each version. This will allow you to check if your optimization really worked or not. You usually use these tools in development / testing mode and turn them off in production.

Resources