mapred.child.java.opts parameter in Hadoop - reading a serialized HashMap

I have a 1.5 GB file that contains a serialized HashMap inside it.
I have a setup() method in the Mapper class where I am reading this into a HashMap variable.
It looks like it gets into the read method, but then immediately throws a Java heap space error for the tasks.
I read in many discussions that we may need to set the mapred.child.java.opts parameter, and I am doing that in the main program code.
I am using:
conf.set("mapred.child.java.opts.", "-Xmx1024M");
I even tried increasing the number. Why does it still keep throwing the same error when it tries to read the serialized file into a HashMap variable?
Here is the code in my setup() method:
try {
    test = "hello";
    Path pt = new Path("hdfs://localhost:9000/user/watsonuser/topic_dump.tsv");
    FileSystem fs = FileSystem.get(new Configuration());
    InputStream is = fs.open(pt);
    ObjectInputStream s = new ObjectInputStream(is);
    nameMap = (HashMap<String, String>) s.readObject();
    s.close();
} catch (Exception e) {
    System.out.println("Exception while reading the nameMap file.");
    e.printStackTrace();
}

As you're loading the serialized version of the hash map, and the file itself is 1.5 GB, I'm guessing the amount of memory your JVM is going to need is at least 1.5 GB.
You should be able to test this with a small program that just loads the file (as you already have), increasing the -Xmx value until you no longer see the memory error. That will be your baseline (you'll probably still need to add some more when running within a Hadoop mapper, as it has buffer requirements for spills, sorting, etc.).
Do you also know how many bins and items are represented in this hash map? The implementation of HashMap is just an array of bins with linked entry items that hash to that bin number. The number of bins has to be a power of two, so as you put more and more items into the map, the memory requirement for the backing array doubles each time the map reaches its threshold (capacity × load factor, 0.75 by default). With this in mind, I imagine the problem you're seeing is that such a large hash map (1.5 GB serialized) will require an as-large, if not larger, memory footprint once deserialized.
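The "small program" mentioned above might look something like this minimal sketch (the class name, the local file argument, and the String-to-String map types are assumptions, not the asker's actual code); run it with increasing -Xmx until it stops failing:

import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.HashMap;

public class MapLoadTest {
    public static void main(String[] args) throws Exception {
        // Deserialize the map from a local copy of the file passed as the first argument.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(args[0]))) {
            @SuppressWarnings("unchecked")
            HashMap<String, String> map = (HashMap<String, String>) in.readObject();
            // Rough report of how much heap the deserialized map ended up using.
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("Entries: " + map.size() + ", approx heap used: " + usedMb + " MB");
        }
    }
}

Once that baseline is known, the same value plus some headroom can be passed to the child tasks through mapred.child.java.opts (note the property name has no trailing period).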

Related

Why does Dask's map_partitions function use more memory than looping over partitions?

I have a parquet file of position data for vehicles that is indexed by vehicle ID and sorted by timestamp. I want to read the parquet file, do some calculations on each partition (not aggregations) and then write the output directly to a new parquet file of similar size.
I organized my data and wrote my code (below) to use Dask's map_partitions, as I understood this would perform the operations one partition at a time, saving each result to disk sequentially and thereby minimizing memory usage. I was surprised to find that this was exceeding my available memory and I found that if I instead create a loop that runs my code on a single partition at a time and appends the output to the new parquet file (see second code block below), it easily fits within memory.
Is there something incorrect in the original way I used map_partitions? If not, why does it use so much more memory? What is the proper, most efficient way of achieving what I want?
Thanks in advance for any insight!!
Original (memory hungry) code:
ddf = dd.read_parquet(input_file)
meta_dict = ddf.dtypes.to_dict()
(
ddf
.map_partitions(my_function, meta = meta_dict)
.to_parquet(
output_file,
append = False,
overwrite = True,
engine = 'fastparquet'
)
)
Awkward looped (but more memory friendly) code:
ddf = dd.read_parquet(input_file)
for partition in range(0, ddf.npartitions, 1):
partition_df = ddf.partitions[partition]
(
my_function(partition_df)
.to_parquet(
output_file,
append = True,
overwrite = False,
engine = 'fastparquet'
)
)
More hardware and data details:
The total input parquet file is around 5GB and is split into 11 partitions of up to 900MB. It is indexed by ID with divisions so I can do vehicle grouped operations without working across partitions. The laptop I'm using has 16GB RAM and 19GB swap. The original code uses all of both, while the looped version fits within RAM.
As @MichaelDelgado pointed out, by default Dask will spin up multiple workers/threads according to what is available on the machine. With partitions of the size I have, this maxes out the available memory when using the map_partitions approach. To avoid this, I limited the number of workers and the number of threads per worker to prevent automatic parallelization using the code below, and the task fit in memory.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
n_workers = 1,
threads_per_worker = 1)
client = Client(cluster)

Go performance penalty with a high number of calls to append

I'm writing an emulator in Go, and for debugging purposes I'm logging the CPU state on every emulator cycle so I can generate a log file later.
There's something I'm not doing properly, because while the logger is enabled performance drops enough to make the emulator unusable.
The profiler clearly shows that the culprit is the logging routine (the logStep method).
logStep is very simple: it calls CreateState to snapshot the current CPU state into a struct, and then appends it to a slice (in the Log method).
I call this method on every emulated CPU cycle (around 30,000 times per second), and I suspect either the garbage collector is slowing down execution or I'm doing something wrong with this data structure.
I gather the profile graph is pointing me to runtime.growslice, caused by the append in (*cpu6502Logger).Log, but I'm unable to find information on how to do this more efficiently.
I'm also scratching my head over why CreateState takes that long just to create a simple struct.
This is what CpuState looks like:
type CpuState struct {
Registers Cpu6502Registers
CurrentInstruction Instruction
RawOpcode [4]byte
EvaluatedAddress Address
CyclesSinceReset uint32
}
This is how I create a CPU Snapshot:
func CreateState(cpu Cpu6502) CpuState {
pc := cpu.Registers().Pc
var rawOpcode [4]byte
rawOpcode[0] = 0x00
pc++
instruction := cpu.instructions[rawOpcode[0]]
for i := byte(0); i < (instruction.Size() - 1); i++ {
rawOpcode[1+i] = cpu.memory.Read(pc+Address(i))
}
_, evaluatedAddress, _, _ := cpu.addressEvaluators[instruction.AddressMode()](pc)
state := CpuState{
*cpu.Registers(),
instruction,
rawOpcode,
evaluatedAddress,
cpu.cycle,
}
return state
}
And finally, here is how I add this snapshot to a collection (the Log method in the profile graph). I've also added how I initialize logger.snapshots:
func createCPULogger(outputPath string) cpu6502Logger {
return cpu6502Logger{
outputPath: outputPath,
snapshots: make([]CpuState, 0, 10024),
}
}
func (logger *cpu6502Logger) Log(state CpuState) {
logger.snapshots = append(logger.snapshots, state)
}
Why is it slow?
Maintaining one gigantic slice to hold all the data is very costly, mainly because it keeps growing. Each time appending a few elements exceeds the capacity, the whole backing array is copied to a bigger memory section to allow the expansion. As the slice grows, each reallocation copies more data and gets slower and slower. You told us that you emulate thousands of CPU states per second.
Solution
The best way to deal with this is to allocate a fixed-size buffer up front. We know it will eventually run out of space, and when that happens there are two options: either write all the data from the buffer to the file, truncate the buffer and start filling it again (then write again), or keep the filled buffers in a slice and allocate a new one. Choose whichever fits your machine (a slow disk or little RAM does not favour the second option).
Why does this help?
I think this also helps the emulator itself. There will be performance spikes when the buffer is flushed, but most of the time performance will be at its maximum. Allocating a very large block of memory is slow, as the allocator is less likely to find a fitting section on the first try, and the garbage collector is also unhappy with frequent allocations. By allocating one buffer and filling it, we do one big (but not too big) allocation and store the data in sections; sections that have already been written out can stay where they are. You could also say that in this case we are handling the memory ourselves more than the GC does (no garbage memory is produced).

Memory Leak using InternalSearchHit.sourceAsMap

I'm seeing some odd behavior in a project I'm currently working on: when I reference an InternalSearchHit.sourceAsMap directly, I get a memory leak. The code is doing some graph traversal, so there are references to other InternalSearchHit.sourceAsMap documents as well.
I'm currently writing groovy code, and so part of what I'm seeing is:
def docMap = hit.sourceAsMap()
//Add data into hash map
//Add other InternalSearchHit.sourceAsMap as child of map
I end up with a memory leak.
If I copy the map using the HashMap constructor:
def docMap = new HashMap(hit.sourceAsMap())
//Add data into hash map
//Add other InternalSearchHit.sourceAsMap as child of map
Then the memory leak goes away.
I took a look at the source for InternalSearchHit and I didn't see anything glaring. The best I can guess is that referencing the sourceAsMap object retains a hold on the SearchHit, which in turn retains a hold on something else.
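Since the graph also attaches other hits' sourceAsMap results as children, one hedged way to test that theory is to deep-copy the source maps (nested maps included) so nothing you keep points back at any hit. The helper below is only a sketch; the class name and the use of plain java.util types are assumptions:

import java.util.HashMap;
import java.util.Map;

public class SourceMapCopier {
    // Copies the map and any nested maps so nothing we keep points back at the
    // InternalSearchHit (or whatever the hit retains internally).
    @SuppressWarnings("unchecked")
    public static Map<String, Object> deepCopy(Map<String, Object> source) {
        Map<String, Object> copy = new HashMap<>(source.size());
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            Object value = entry.getValue();
            copy.put(entry.getKey(),
                    value instanceof Map ? deepCopy((Map<String, Object>) value) : value);
        }
        return copy;
    }
}

Called from the Groovy code as, for example, def docMap = SourceMapCopier.deepCopy(hit.sourceAsMap()). If the leak stays away with fully copied maps, that supports the idea that the retained reference lives in the hit rather than in the copied data.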

ITextRenderer.createPDF stops after creating multiple PDFs

I'm facing a problem I can't seem to fix and I need your help.
I'm generating a list of PDFs that I write to the hard drive. Everything works fine for a small number of files, but when I start generating more of them (via a for loop), the creation stops and the remaining PDF files aren't created.
I'm using the Play Framework with the PDF module, which relies on ITextRenderer to generate the PDFs.
I localized the problem (well, I believe it's here) by adding output statements to see where it stops, and the problem is when I call .createPDF(os);.
At first I was only able to create 16 files before it stopped, but then I created a singleton that builds the renderer once in the class instance and reuses it (to avoid adding the fonts and settings every time), and I got to 61 files created, but no more.
I thought about a memory leak blocking the process, but I can't see where it is or how to track it down properly.
Here's the relevant part of my code:
List<MyModel> models; // I got a list of MyModel from a db query; MyModel contains a path to a file
List<InputStream> files = new ArrayList<InputStream>();
for (MyModel model : models) {
if (!model.getFile().exists()) {
model.generatePdf();
}
files.add(new FileInputStream(model.getFile()));
}
// The generatePdf method:
public void generatePdf() {
byte[] bytes = PDF.toBytes(views.html.invoices.pdf.invoice.render(this, due));
FileOutputStream output;
try {
File file = getFile();
if (!file.getParentFile().exists()) {
file.getParentFile().mkdirs();
}
if (file.exists()) {
file.delete();
}
output = new FileOutputStream(file);
BufferedOutputStream bos = new BufferedOutputStream(output);
bos.write(bytes);
bos.flush();
bos.close();
output.flush();
output.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
As you can see, I do my best to avoid memory leaks but this isn't enough.
In order to locate the problem, I replaced PDF.toBytes and all subsequent calls from that class with a copy/pasted version inside my class, and added output statements. That's how I found that the thread hangs at the createPDF line.
Update 1:
I have two (identical) Play Framework applications running with these parameters:
-Xms1024m -Xmx1024m -XX:PermSize=512m -XX:MaxPermSize=512m
I tried stopping one instance and re-running the PDF generation, but it didn't change the number of files generated; it stops at the same count.
I also tried increasing the allocated memory:
-Xms1536m -Xmx1536m -XX:PermSize=1024m -XX:MaxPermSize=1024m
That didn't change anything either.
For reference, the server has 16 GB of RAM.
cat /proc/cpuinfo:
model name : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
cpu MHz : 3101.000
cpu cores : 4
cache size : 6144 KB
Hope that helps.
Well, I'm really surprised: the bug has absolutely nothing to do with memory, memory leaks or available memory.
I'm astonished.
It's related to an image that was loaded via a URL on the same server (locally) and was taking too long to load. Removing that image fixed the issue.
I will use a base64-encoded image instead, and that should fix the issue.
I still can't believe it!
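For what it's worth, a minimal sketch of that base64 approach might look like this (the class name and file path are hypothetical, and it assumes Java 8's java.util.Base64; on older JVMs Commons Codec does the same job):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class ImageInliner {
    // Reads a local image once and returns it as a data URI, so the template can
    // embed it directly and the renderer never issues a network request for it.
    public static String asDataUri(String path) throws IOException {
        byte[] imageBytes = Files.readAllBytes(Paths.get(path));
        return "data:image/png;base64," + Base64.getEncoder().encodeToString(imageBytes);
    }
}

The resulting string can then be handed to the template and used as the img src in place of the URL.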
The module is developed by Jörg Viola, so I think it's safe to assume everything is fine on that side. The same goes for the iText library.
The bottleneck, as you guessed, was in your code. The interesting part is that it wasn't memory being poorly managed, but a network request that was making the PDF rendering slower and slower every time, until it ultimately failed.
It's nice that you finally got it working.

Element size in EhCache

I've set up a simple cache, using an Integer for the key and a Double for the value. After populating the cache, the ratio cache.calculateInMemorySize() / cache.getMemoryStoreSize() is constant at 344 bytes per element. I expect overhead, but my payload is (32 + 64) 96 bits, or 12 bytes, so the overhead is a whopping 332 bytes - or am I completely misunderstanding how this works? If not, what, if anything, can I do to bring the overhead down?
The cache is meant to be a memory-only store. We want to fit everything in there, so overflow and expiry is not needed, and as we can populate fairly quickly from the external data source (just not fast enough to use it as the primary data source), persistence is not needed either.
Using version 2.4.0.
I'm assuming the payload is the values actually being cached. Are you including the size of the keys as well? I believe calculateInMemorySize includes the keys too.
Based on your requirements:
The cache is meant to be a memory-only store. We want to fit
everything in there, so overflow and expiry is not needed, and as we
can populate fairly quickly from the external data source (just not
fast enough to use it as the primary data source), persistence is not
needed either.
I conclude that you don't need a third-party caching framework at all. Instead you can get away with a simple HashMap, since your cached elements never expire and always fit in memory. You also won't need to include the EhCache jars in your classpath or load its classes!
Here is sample code:
public class MyCustomCache
{
private Map<Integer, Double> myMap = new HashMap<Integer, Double>();
public Double getCachedValueByKey(Integer key)
{
return myMap.get(key);
}
public void putValue2Cache(Integer key, Double value)
{
myMap.put(key, value);
}
public Double removeValueFromCache(Integer key)
{
return myMap.remove(key);
}
}
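A hypothetical usage of that class, just to show the shape of the API. Note that unlike Ehcache, a plain HashMap is not thread-safe, so if several threads share the cache you would want a ConcurrentHashMap instead:

MyCustomCache cache = new MyCustomCache();
cache.putValue2Cache(42, 3.14);               // store one entry
Double hit = cache.getCachedValueByKey(42);   // 3.14
Double miss = cache.getCachedValueByKey(7);   // null on a miss
cache.removeValueFromCache(42);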
