Optimize bash command to count all lines in HDFS txt files

Summary:
I need to count all unique lines in all .txt files in an HDFS instance.
The total size of the .txt files is ~450 GB.
I use this bash command:
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/.txt | cut -d , -f 1 | sort --parallel=<some-number> | uniq | wc -l
The problem is that this command takes all free RAM and the HDFS instance exits with code 137 (out of memory).
Question:
Is there any way I can limit the RAM usage of this entire command to, let's say, half of what's free in the HDFS, or somehow free memory while the command is still running?
Update:
I need to remove | sort | because it is a merge sort implementation and therefore has O(n) space complexity.
I can use only | uniq | without | sort |.

Some things you can try to limit sort's memory consumption:
Use sort -u instead of sort | uniq. That way sort has a chance to remove duplicates on the spot instead of having to keep them until the end. 🞵
Write the input to a file and sort the file instead of running sort in a pipe. Sorting pipes is slower than sorting files and I assume that sorting pipes requires more memory than sorting files:
hdfs ... | cut -d, -f1 > input && sort -u ... input | wc -l
Set the buffer size manually using, e.g., -S 2G. This buffer is shared between all threads, and the size specified here roughly equals the overall memory consumption when running sort.
Change the temporary directory using -T /some/dir/different/from/tmp. On many Linux systems /tmp is a RAM disk, so be sure to use an actual hard drive.
If the hard disk is not an option you could also try --compress-program=PROG to compress sort's temporary files. I'd recommend a fast compression algorithm like lz4.
Reduce parallelism using --parallel=N, as more threads need more memory. With a small buffer, too many threads are less efficient anyway.
Merge at most two temporary files at once using --batch-size=2.
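Putting the options above together, a hedged sketch of the full pipeline (the temp directory, buffer size, and thread count are placeholder assumptions to tune for your machine, and lz4 must be installed):
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/.txt | cut -d, -f1 |
  sort -u -S 2G -T /mnt/scratch/sorttmp --compress-program=lz4 --parallel=4 --batch-size=2 |
  wc -l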
🞵 I assumed that sort was smart enough to immediately remove sequential duplicates in the unsorted input. However, from my experiments it seems that (at least) sort (GNU coreutils) 8.31 does not.
If you know that your input contains a lot of sequential duplicates as in the input generated by the following commands …
yes a | head -c 10m > input
yes b | head -c 10m >> input
yes a | head -c 10m >> input
yes b | head -c 10m >> input
… then you can drastically save resources on sort by using uniq first:
# takes 6 seconds and 2'010'212 kB of memory
sort -u input
# takes less than 1 second and 3'904 kB of memory
uniq input > preprocessed-input &&
sort -u preprocessed-input
Times and memory usage were measured using GNU time 1.9-2 (often installed in /usr/bin/time) and its -v option. My system has an Intel Core i5 M 520 (two cores + hyper-threading) and 8 GB memory.
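To reproduce such measurements on your own data, something like this should work (assuming GNU time is installed at /usr/bin/time; look for the "Maximum resident set size" line in its output):
/usr/bin/time -v sort -u input > /dev/null
/usr/bin/time -v sh -c 'uniq input > preprocessed-input && sort -u preprocessed-input' > /dev/null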

Reduce number of sorts run in parallel.
From info sort:
--parallel=N: Set the number of sorts run in parallel to N. By default, N is set
to the number of available processors, but limited to 8, as there
are diminishing performance gains after that. Note also that using
N threads increases the memory usage by a factor of log N.

As for the command running out of memory, from man sort:
--batch-size=NMERGE
    merge at most NMERGE inputs at once; for more use temp files
--compress-program=PROG
    compress temporaries with PROG; decompress them with PROG -d
-S, --buffer-size=SIZE
    use SIZE for main memory buffer
-T, --temporary-directory=DIR
    use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories
These are the options you could be looking into. Specify a temporary directory on disk and a buffer size of, e.g., 1 GB. So something like sort -u -T "$HOME"/tmp -S 1G.
Also as advised in other answers, use sort -u instead of sort | uniq.
Is there any way I can limit the RAM usage of this entire command to, let's say, half of what's free in the HDFS
Kind of: use the -S option. You could use sort -S "$(free -t | awk '/Total/{print $4}')" to give sort all free memory (free reports KiB by default, which matches sort's default size unit).
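To honor the "half of what's free" part, a minimal sketch that halves that figure and makes the unit explicit:
hdfs ... | cut -d, -f1 | sort -u -S "$(( $(free -t | awk '/Total/{print $4}') / 2 ))K" | wc -l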

Related

How to convert SIMD bash commands into GPU processable commands?

Consider a SIMD kind of code which extracts all instances of a pattern match from a file like this:
grep -n <some_pattern> fileName
This can be made faster using GNU Parallel and some modifications like this
cat fileName | parallel -j{cores} --pipe --block {chunk_size}M --cat LC_ALL=C grep -n '/some_pattern/'
I can also use xargs to make the parallel execution if the single input file is split into multiple separate files:
xargs -P {cores} -L {line_per_process} bash -c grep {1}< fileID*
But this kind of parallelism is limited by the number of CPU cores that you can have.
I am interested in knowing whether there is any way to convert such commands into GPU(CUDA) threads?
Can the whole task be broken into chunks equal to the number of CPU cores, with each CPU core then processing its chunk as individual threads on the GPU?
I will be surprised if there is such a way. grep is not like a matrix multiplication where you do exactly the same machine code instruction for every byte. On the contrary, grep does a lot of optimization for different situations (e.g. if current byte does not match, skip this many bytes ahead).
So while you may call this Same Command Multiple Data (SCMD), it does not qualify as SIMD at the machine code level.
That does not mean that there is no way to convert grep into real SIMD, but this is not going to be automatic. You will have to rewrite grep using algorithms that are suitable for GPUs. And that can clearly be done: https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/competition/bkase.github.com/CUDA-grep/finalreport.html
If you want to convert another tool than grep you will again have to rewrite that tool. Possibly using some of the algorithms that you used for grep, but not necessarily: It might be that you have to use completely different algorithms.
Normally you will be limited by your disk (your disk is slow, grep is fast).
If you have really fast disks try:
parallel -a filename -k --pipepart --block -1 LC_ALL=C grep '/some_pattern/'
--pipe can deliver in the order of 100MB/s total. --pipepart can deliver in the order of 1 GB/s per core (and usually your disks cannot deliver 1 GB/s/core). --block -1 chops filename into one block per job on the fly.
Unfortunately you lose the ability to see the line number (so grep -n will give the wrong answer).
If your grep is still limited by CPU, then you should probably ask another question and elaborate on why your grep is so CPU intense.

GNU Parallel: split file into children

Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved?
Is there a faster way to accomplish this goal?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipepart, except --pipepart does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq file uncompressed we can do it even faster, but we will have to cheat a little: often in fastq files you will have seqnames that start with the same string:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipepart. The added benefit is that the order will remain the same. Here you would have to set the block size to 1/16 of the size of file1.fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipepart --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 or later, GNU Parallel can do that computation for you. --block -1 means: choose a block size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipepart --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipepart --block -1 -j16 \
  --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
  my_command
Here we assume that the lines will start like this:
@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '@', they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by a seqname line, which starts with @.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte has been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
# $FILE is set by split's --filter; strip the NUL separators again and recompress
doit() { perl -pe 's/\0//' | gzip > "$FILE".gz; }
export -f doit
# print a NUL before the first line of every 4-line record, then split NUL-separated records round robin
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
Filenames will be named big.aa.gz .. big.ap.gz.
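A quick sanity check on the children (a sketch; since every record is 4 lines, each remainder should be 0):
for f in big.a?.gz; do
  n=$(zcat "$f" | wc -l)
  printf '%s: %d lines (lines mod 4 = %d)\n' "$f" "$n" "$((n % 4))"
done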

How to generate a memory shortage using bash script

I need to write a bash script that would consume as much RAM as possible on my ESXi host and potentially generate a memory shortage.
I already checked here and tried to run the given script several times so that I could consume more than 500 MB of RAM.
However, I get a "sh: out of memory" error (of course), and I'd like to know if there is any way to configure the amount of memory allocated to my shell?
Note1: Another requirement is that I cannot enter a VM and run a greedy task.
Note2: I tried to script the creation of new, greedy VMs with huge RAM; however, I cannot get the ESXi host into a state where there is a memory shortage.
Note3 : I cannot use a C compiler and I only have very limited python library.
Thank you in advance for your help :)
From an earlier answer of mine: https://unix.stackexchange.com/a/254976/30731
If you have basic GNU tools (head and tail) or BusyBox on Linux, you can do this to fill a certain amount of free memory:
</dev/zero head -c BYTES | tail
# Protip: use $((1024**3*7)) to calculate 7GiB easily
</dev/zero head -c $((1024**3*7)) | tail
This works because tail needs to keep the current line in memory, in case it turns out to be the last line. The line, read from /dev/zero which outputs only null bytes and no newlines, will be infinitely long, but is limited by head to BYTES bytes, thus tail will use only that much memory. For a more precise amount, you will need to check how much RAM head and tail itself use on your system and subtract that.
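To check that overhead on your system, a sketch using GNU time's -v option (the 512 MiB figure is just an example):
/usr/bin/time -v sh -c '</dev/zero head -c $((512*1024*1024)) | tail > /dev/null' 2>&1 | grep 'Maximum resident'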
To just quickly run out of RAM completely, you can remove the limiting head part:
tail /dev/zero
If you want to also add a duration, this can be done quite easily in bash (will not work in sh):
cat <( </dev/zero head -c BYTES) <(sleep SECONDS) | tail
The <(command) thing seems to be little known but is often extremely useful, more info on it here: http://tldp.org/LDP/abs/html/process-sub.html
As for the use of cat: cat will wait for all inputs to complete before exiting, and by keeping one of the pipes open it keeps tail alive.
If you have pv and want to slowly increase RAM use:
</dev/zero head -c BYTES | pv -L BYTES_PER_SEC | tail
For example:
</dev/zero head -c $((1024**3)) | pv -L $((1024**2)) | tail
Will use up to a gigabyte at a rate of a megabyte per second. As an added bonus, pv will show you the current rate of use and the total use so far. Of course this can also be done with previous variants:
</dev/zero head -c BYTES | pv | tail
Just inserting the | pv | part will show you the current status (throughput and total by default).
Credits to falstaff for contributing a variant that is even simpler and more broadly compatible (like with BusyBox).
Bash has the ulimit command, which can be used to set the amount of virtual memory available to a bash process and the processes started by it.
There are two limits, a hard limit and a soft limit. You can lower both limits, but only raise the soft limit up to the hard limit.
ulimit -S -v unlimited sets the virtual memory size to unlimited (if the hard limit allows this).
If there is a hard limit set (see ulimit -H -v), check any initialisation scripts for lines setting it.
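A minimal sketch of inspecting and adjusting the limits (ulimit -v takes KiB, and the 2 GiB value is arbitrary):
ulimit -H -v                       # show the hard limit
ulimit -S -v                       # show the current soft limit
ulimit -S -v $((2 * 1024 * 1024))  # set the soft limit to 2 GiB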

Fastest possible grep

I'd like to know if there is any tip to make grep as fast as possible. I have a rather large base of text files to search in the quickest possible way. I've made them all lowercase, so that I could get rid of -i option. This makes the search much faster.
Also, I've found out that -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), the latter if regex is involved.
Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?
Try with GNU parallel, which includes an example of how to use it with grep:
grep -r greps recursively through directories. On multicore CPUs GNU
parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 job per core, and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
If you're searching very large files, then setting your locale can really help.
GNU grep goes a lot faster in the C locale than with UTF-8.
export LC_ALL=C
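For example, a hypothetical comparison on the same file:
time grep -F 'pattern' bigfile              # current locale (e.g. UTF-8)
time LC_ALL=C grep -F 'pattern' bigfile     # C locale, often much faster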
Ripgrep claims to now be the fastest.
https://github.com/BurntSushi/ripgrep
Also includes parallelism by default
-j, --threads ARG
The number of threads to use. Defaults to the number of logical CPUs (capped at 6). [default: 0]
From the README
It is built on top of Rust's regex engine. Rust's regex engine uses
finite automata, SIMD and aggressive literal optimizations to make
searching very fast.
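A couple of illustrative invocations (assuming rg is on your PATH):
rg -n 'some_pattern' /path/to/textbase    # recursive search with line numbers
rg -F -j 4 'plain text string' .          # fixed-string search with 4 threads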
Apparently using --mmap can help on some systems:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
Not strictly a code improvement but something I found helpful after running grep on 2+ million files.
I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.
If you don't care about which files contain the string, you might want to separate reading and grepping into two jobs, since it might be costly to spawn grep many times – once for each small file.
If you've one very large file:
parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>
Many small compressed files (sorted by inode)
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>
I usually compress my files with lz4 for maximum throughput.
If you want just the filename with the match:
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
Building on the response by Sandro, I looked at the reference he provided and played around with BSD grep vs. GNU grep. My quick benchmark results showed: GNU grep is way, way faster.
So my recommendation to the original question "fastest possible grep": Make sure you are using GNU grep rather than BSD grep (which is the default on MacOS for example).
I personally use ag (the silver searcher) instead of grep, and it's way faster; you can also combine it with parallel and pipe blocks.
https://github.com/ggreer/the_silver_searcher
Update:
I now use https://github.com/BurntSushi/ripgrep which is faster than ag depending on your use case.
One thing I've found that speeds up using grep to search (especially for changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:
Having a file of ids you want to search for in a big file called my_ids.txt
Name of bigfile bigfile.txt
Use split to split the file into parts:
# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]
# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep "$id" >> matches.txt ; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files
In my case this cut what would have been a 17-hour job into a 1-hour-20-minute job. I'm sure there's some sort of bell curve here on efficiency, and obviously going over the available cores won't do you any good, but this was a much better solution than any of the above for my requirements as stated. It has the added benefit over the parallel script of using mostly native (Linux) tools.
cgrep, if it's available, can be orders of magnitude faster than grep.
MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.
https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep
https://metacpan.org/release/MCE
One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for GNU parallel mentioned on this page. I was really hoping for GNU Parallel to work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.
Another alternate is the MCE::Grep module included with MCE.
A slight deviation from the original topic: the indexed search command line utilities from the Google codesearch project are way faster than grep: https://github.com/google/codesearch
Once you compile it (the golang package is needed), you can index a folder with:
# index current folder
cindex .
The index will be created under ~/.csearchindex
Now you can search:
# search folders previously indexed with cindex
csearch eggs
I'm still piping the results through grep to get colorized matches.

How could the UNIX sort command sort a very large file?

The UNIX sort command can sort a very large file like this:
sort large_file
How is the sort algorithm implemented?
How come it does not cause excessive consumption of memory?
The Algorithmic details of UNIX Sort command says Unix Sort uses an External R-Way merge sorting algorithm. The link goes into more details, but in essence it divides the input up into smaller portions (that fit into memory) and then merges each portion together at the end.
The sort command stores working data in temporary disk files (usually in /tmp).
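To illustrate the idea, the same strategy done by hand (a sketch; the chunk size is an arbitrary example). This is essentially what sort does for you internally with its temp files:
split -l 1000000 large_file chunk.            # 1. split into runs that fit in memory
for c in chunk.*; do sort "$c" -o "$c"; done  # 2. sort each run individually
sort -m chunk.* > large_file.sorted           # 3. merge the sorted runs
rm chunk.*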
#!/bin/bash

usage ()
{
    echo Parallel sort
    echo usage: psort file1 file2
    echo Sorts text file file1 and stores the output in file2
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

pv $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > $2
WARNING: This script starts one shell per chunk, for really large files, this could be hundreds.
Here is a script I wrote for this purpose. On a 4-processor machine it improved the sort performance by 100%!
#! /bin/ksh

MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

usage ()
{
    echo Parallel sort
    echo usage: psort file1 file2
    echo Sorts text file file1 and stores the output in file2
    echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
    echo and each chunk will be sorted in parallel
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

# Cleanup any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE

# Split $ORIGINAL_FILE into chunks
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX

# Sort each chunk in the background, then wait for all of them
for file in $CHUNK_FILE_PREFIX*
do
    sort $file > $file.sorted &
done
wait

# Merge the sorted chunks into $SORTED_FILE
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE

# Cleanup any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
See also:
"Sorting large files faster with a shell script"
I'm not familiar with the program, but I guess it is done by means of external sorting (most of the problem is held in temporary files while a relatively small part of the problem is held in memory at a time). See Donald Knuth's The Art of Computer Programming, Vol. 3 Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.
Look carefully at the options of sort to speed up performance and understand their impact on your machine and problem.
Key parameters on Ubuntu are:
Location of temporary files: -T directory_name
Amount of memory to use: -S N% (N% of all memory to use; the more the better, but avoid oversubscription, which causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)
The questioner asks "Why no high memory usage?" The answer to that comes from history: older Unix machines were small, so the default memory size was set small. Adjust this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 times the size of the file being sorted.
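For example (a sketch; the scratch directory is a placeholder for a fast disk with enough free space):
sort -S 80% -T /fastdisk/tmp large_file > large_file.sorted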
How to use -T option for sorting large file
I have to sort a large file's 7th column.
I was using:
grep vdd "file name" | sort -nk 7
I faced the below error:
sort: write failed: /tmp/sort1hc37c: No space left on device
Then I used the -T option as below and it worked:
grep vdda "file name" | sort -nk 7 -T /dev/null/
(Note, though, that /dev/null is not a directory; in practice, point -T at a directory on a disk with enough free space.)
Memory should not be a problem - sort already takes care of that. If you want to make optimal usage of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).
#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before being merged.
#
# psort largefile.txt 20m 4
#
# by h.p.

# -C splits at line boundaries (-b could cut a line in half and corrupt the sort)
split -C $2 $1 $1.part
suffix=sorttemp.`date +%s`
nthreads=$3
i=0
for fname in $1.part*
do
    let i++
    sort $fname > $fname.$suffix &
    mres=$(($i % $nthreads))
    test "$mres" -eq 0 && wait
done
wait
sort -m *.$suffix
rm $1.part*
