Errno::ENOMEM: Cannot allocate memory - cat - ruby

I have a job running in production which processes XML files.
There are around 4,000 XML files, 8 to 9 GB in size all together.
After processing we get CSV files as output. I have a cat command which merges all the CSV files into a single file, and I'm getting:
Errno::ENOMEM: Cannot allocate memory
on the cat (backtick) command.
Below are a few details:
System Memory - 4 GB
Swap - 2 GB
Ruby : 1.9.3p286
Files are processed using nokogiri and saxbuilder-0.0.8.
There is a block of code which processes the 4,000 XML files and saves the output as CSV (one per XML file); sorry, I'm not supposed to share it because of company policy.
Below is the code which merges the output files into a single file:
Dir["#{processing_directory}/*.csv"].sort_by {|file| [file.count("/"), file]}.each {|file|
  `cat #{file} >> #{final_output_file}`
}
I've taken memory consumption snapshots during processing. Processing consumes almost all of the memory, but it doesn't fail there.
It always fails on the cat command.
I guess that the backtick tries to fork a new process which doesn't get enough memory, so it fails.
Please let me know your opinion and any alternative to this.

So it seems that your system is running pretty low on memory, and spawning a shell plus calling cat is too much for the little memory left.
If you don't mind losing some speed, you can merge the files in Ruby with a small buffer.
This avoids spawning a shell, and you can control the buffer size.
This is untested, but you get the idea:
buffer_size = 4096
output_file = File.open(final_output_file, 'w')
Dir["#{processing_directory}/*.csv"].sort_by {|file| [file.count("/"), file]}.each do |file|
  f = File.open(file)
  # copy in small chunks instead of shelling out to cat
  while buffer = f.read(buffer_size)
    output_file.write(buffer)
  end
  f.close
end
output_file.close

You are probably out of physical memory, so double-check that and verify your swap (free -m). In case you don't have swap space, create one.
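If it turns out there is no swap at all, here is a minimal sketch for adding a 2 GB swap file on Linux (run as root; /swapfile and the size are just example values):
dd if=/dev/zero of=/swapfile bs=1M count=2048   # or: fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab   # optional: keep it across reboots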
Otherwise, if your memory is fine, the error is most likely caused by shell resource limits. You may check them with ulimit -a.
They can be changed with ulimit, which can modify shell resource limits (see help ulimit), e.g.
ulimit -Sn unlimited && ulimit -Sl unlimited
To make these limits persistent, you can configure them by creating a limits file with the following shell command:
sudo tee /etc/security/limits.d/01-${USER}.conf <<EOF
${USER} soft core unlimited
${USER} soft fsize unlimited
${USER} soft nofile 4096
${USER} soft nproc 30654
EOF
Or use /etc/sysctl.conf to change the limit globally (man sysctl.conf), e.g.
kern.maxprocperuid=1000
kern.maxproc=2000
kern.maxfilesperproc=20000
kern.maxfiles=50000
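Note that the kern.* keys above are the BSD/macOS spellings; on Linux the equivalent knobs live under fs.* and kernel.* instead. Either way, a sysctl change can be applied without a reboot; a small sketch (the key and value below are only illustrative):
sudo sysctl -w fs.file-max=50000   # Linux: raise the system-wide open-file limit immediately
sudo sysctl -p                     # reload everything declared in /etc/sysctl.conf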

I had the same problem, but instead of cat it was sendmail (the mail gem).
I found the problem and its solution: installing the posix-spawn gem, e.g.
gem install posix-spawn
and here is an example:
require 'posix/spawn'

a = (1..500_000_000).to_a   # balloon the Ruby process's memory first
POSIX::Spawn::spawn('ls')   # spawning the child still succeeds
This time, creating the child process should succeed.
See also: Minimizing Memory Usage for Creating Application Subprocesses at Oracle.

Related

Discrepancy between the size of file created and size displayed by du -sh [duplicate]

This question already has answers at: Size() vs ls -la vs du -h which one is correct size?
I had to create a random file of 10 GB, which I can do using dd or fallocate, but the size shown by du -sh is twice what I created:
$ dd bs=1MB count=10000 if=/dev/zero of=foo
10000+0 records in
10000+0 records out
10000000000 bytes (10 GB, 9.3 GiB) copied, 4.78419 s, 2.1 GB/s
$ du -sh foo
19G foo
$ ls -sh foo
19G foo
$ fallocate -l 10G bar
$ du -sh bar
20G bar
$ ls -sh bar
20G bar
Can someone please explain this apparent discrepancy to me?
Wikipedia mentions this about GPFS:
The system stores data on standard block storage volumes, but includes an internal RAID layer that can virtualize those volumes for redundancy and parallel access much like a RAID block storage system.
I conclude that there is at least one non-visible duplicate of every file, and therefore each file actually uses twice the space of its own content. So the underlying RAID imposes the double usage.
That would explain it: I have created a similarly massive file for other purposes, also using dd, on an ext4 filesystem, and there the OS reports a file size matching the dd creation size, as intended (no RAID in effect on that drive).
The fact that stat does report the correct file size, as per dd's actions, confirms what I put forward above.
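To see the two numbers side by side, GNU stat and du can report both the apparent (logical) size and the space actually allocated; a quick example using the foo file from above (GNU coreutils flags, they differ on BSD/macOS):
$ stat -c '%s bytes apparent, %b blocks of %B bytes allocated' foo
$ du -h --apparent-size foo    # logical size, what dd and ls -l report
$ du -h foo                    # blocks the filesystem actually allocated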

Filter int, float and character from a text file in a Shell Script

Suppose I have a text file which contains data like this.
The output below was generated from du -sh /home/*:
1.5G user1
2.5G user2
And so on...
Now I want those sizes to be stored in an array and compared against 5 GB, to check whether a user is consuming more than 5 GB. What can I do?
The du command shows the usage of each folder in the home directory. I want to be notified when some user is consuming more than 5 GB; since there is a long list of users, it would be tedious to check each user's disk usage by hand. I want a shell script that identifies the usage of each directory under /home, and then I will add a mail function to notify myself about exceeded limits.
Note: I don't want to implement quotas, as I just want to monitor the usage.
Use du's -t (--threshold) option to specify you only want to know about directories with more than a certain amount of data in them:
$ du -sh -t 5G /home/*
If you're picky about precisely how big a gigabyte is, note that 5G uses multiples of 1024; you may prefer -t 5GB for multiples of 1000, or even -t 5000M to mix them.
For lots of users, you're probably better off writing that using -d 1 instead of -s to avoid the shell having to expand the * into a very long list:
$ du -h -d 1 -t 5G /home/
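To get from there to the e-mail notification the question asks for, a rough sketch (untested; it assumes a working mail/mailx command, and admin@example.com is a placeholder address):
#!/bin/bash
# Report any directory directly under /home using more than 5 GB
report=$(du -h -d 1 -t 5G /home/ | grep -v '/home/$')   # drop the summary line for /home/ itself
if [ -n "$report" ]; then
    printf '%s\n' "$report" | mail -s "Home directories over 5 GB" admin@example.com
fi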

Hex Dump Specific Parts of File - Bash

I'm trying to write a bash script to audit hard drives that have been wiped to ensure the wiping system is working properly. I would like to find a way to hex dump specific parts of a drive without having to hex dump the entire drive and extract the parts I'd like (as this seems to run for too long to make the script worth writing). Ideally, I'd be able to grab parts from the beginning, middle, and end of the drive.
I would like to take the output of the hex dump and check it for the existence of only one character (indicating the drive has been successfully wiped). This part, I can handle, but I thought it may affect any advice I may get.
I've used head piped into xxd to get the beginning of the file which has worked, but I'm still stuck on the other parts. I've tried using tail to just get the end of the drive, but that doesn't seem to work quickly either. Is it possible to do this efficiently? Possibly using dd or something else and pipe it into a hex editor? I've looked through options for xxd as well as hexdump to no avail. If someone could point me in the right direction, it would be greatly appreciated!
xxd has options to skip a ways into the file (-s) and dump a limited length (-l). If you use its plain hex (-p) option, you may be able to use grep to find any anomalies:
$ xxd -s 8192 -l 256 -p /dev/disk3s2 | grep '[^0]'
000000010000000000000000000000000000000000000000000000000000
000000000000000000000000300000000000000800000000000000000000
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
od has similar options to skip ahead (-j) and limit the length (-N). Similarly, dd has skip= and count= (although these are counted in blocks, not bytes; you can change the block size with bs=).
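For example, the same 256-byte window as above with od instead of xxd (decimal offsets, one-byte hex columns):
$ od -A d -t x1 -j 8192 -N 256 /dev/disk3s2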
EDIT: Since xxd -p is giving weird results (not stopping at what should be the end of the device), I'd recommend running some tests to figure out what's going on. First, back up anything important on the computer, because if something is weird at the device access level, it's possible that some of these tests might overwrite something unexpected, possibly even on another disk.
Next, try dumping to the end of the device with different tools, and see if they all behave the same way:
xxd -s 65451982336 /dev/sdb | more # This *should* dump 512 bytes (32 lines) then stop, but apparently keeps going
od -xv -j 65451982336 /dev/sdb | more # This also *should* dump 512 bytes then stop
dd if=/dev/sdb skip=127835903 | xxd | more # This again should do the same thing (note that the skip value is in 512-byte blocks)
Do the other tools read past what fdisk reports as the end of the disk? If all three read more data, I'm going with the "fdisk is wrong/misleading" answer. You can test further by writing some nonzero data past the "end" and seeing what the results are:
dd if=/dev/random of=/dev/sdb seek=127835903 count=2
...then repeat the various dump commands. If they show two blocks (=64 lines) of random data followed by zeroes, I'm pretty sure the device is bigger than you think it is.
I am not near my shell, but something along these lines should get you started:
dd if=/dev/hda1 | hexdump -v -e '/1 "%02x\n"' | grep -v '^00$'
will print all non-zero bytes, one hex byte per line (the -e format emits one byte per line so grep can filter out the zero ones).
dd if=/dev/hda1 | od -x -j100
will give you a hexadecimal dump with offsets, starting 100 bytes in.
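Putting those pieces together, here is a rough sketch of auditing the start, middle, and end of a wiped drive for leftover non-zero bytes (untested; /dev/sdX is a placeholder, blockdev needs root, and the sample length is arbitrary):
#!/bin/bash
dev=/dev/sdX                          # drive under test (placeholder)
size=$(blockdev --getsize64 "$dev")   # device size in bytes
len=4096                              # bytes to sample at each point

for offset in 0 $((size / 2)) $((size - len)); do
    # xxd -p prints bare hex digits, so any character other than 0 means data survived
    if xxd -s "$offset" -l "$len" -p "$dev" | grep -q '[^0]'; then
        echo "non-zero data found at offset $offset"
    fi
done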

Couch has apparent limit of attachment sizes on Mac OS X

I have plain vanilla CouchDB from Apache, running as an app on Mac OS X 10.9. If I try to attach an attachment above 1 MB in size to a document, it just hangs and does nothing.
I have tried CouchDB on Linux, and there the sky is the limit.
I first thought it had to do with low limits on the Mac, but it doesn't seem so:
➜ ~ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-v: address space (kbytes) unlimited
-l: locked-in-memory size (kbytes) unlimited
-u: processes 709
-n: file descriptors 256
What is causing this? Why? And how do I fix it?
Check the config files given by couchdb -c. You probably have this somewhere in them (for some unknown reason):
[couchdb]
max_attachment_size = 1048576 ; bytes
Remove or comment the line and you should be fine.
Or maybe it was compiled with this hardcoded, in which case you could add this line to one of the config files and increase the value.
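If you'd rather not edit the .ini files by hand, CouchDB 1.x also exposes its configuration over HTTP, so you can inspect and (assuming the key is honoured at all, see the update below) override the value with curl; the port and the lack of authentication here assume a default single-node install:
# Show the whole [couchdb] section
curl http://127.0.0.1:5984/_config/couchdb
# Read the attachment limit, if one is set
curl http://127.0.0.1:5984/_config/couchdb/max_attachment_size
# Raise it to 100 MB (the value must be a JSON string)
curl -X PUT http://127.0.0.1:5984/_config/couchdb/max_attachment_size -d '"104857600"'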
Update
max_attachment_size is undocumented, so it is probably not safe to rely on. I leave the original answer as it seems to have solved the OP's problem, but according to the docs the attachment size should be unlimited. Also, attachment_stream_buffer_size is the config key controlling the chunk size of attachments, which might be relevant.

Why can I only open 2045 files with Tie::File on Windows?

I have the following code that tries to tie arrays to files. However, when I run it, it only creates 2045 files. What is the issue here?
#!/usr/bin/perl
use Tie::File;

for (my $i = 0; $i < 10000; $i++) {
    @files{$i} = ();
    # each element of %files holds an array tied to its own file
    tie @{$files{$i}}, 'Tie::File', "files//tiefile$i";
}
Edit: I am on Windows.
You are accumulating open file handles (see ulimit -n, setrlimit RLIMIT_NOFILE/RLIMIT_OFILE), and you ultimately hit a limit of 2048 open file descriptors (2045 + stdin + stdout + stderr).
Under Windows you will have to rewrite your application so that it has at most 2048 file handles open at any one time, since the 2048 limit is a hard limit (it cannot be modified) in MSVC's stdio.
On Linux machines go to /etc/security/limits.conf and add or modify these lines
* soft nofile 10003
* hard nofile 10003
This will increase the number of files each process can have open to 10003 (remember that you always start with three open: stdin, stdout, and stderr).
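After logging out and back in (PAM applies limits.conf only at session start), a quick check that the new ceiling took effect:
ulimit -Sn    # soft open-files limit for this shell
ulimit -Hn    # hard limit it can be raised up to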
Based on your comments it sounds like you are using a Win32 machine. I can't find a way to increase the number of open files per process, but you might, and I stress might, be able to handle this through forking (which is really threading on Win32).
