Speed up sed on a gz file - performance

I'm using a script to sed a file and remove text this way:
gzip -cd /data/file.gz | sed 's/WITH (appendonly=true, compresstype=quicklz)//' | gzip > file_seeded.gz
It takes a lot of time to perform the operation on big files (50 GB, for example). Is the way I'm doing this optimal, or are there alternatives to speed up the process?

Use the fact that you can concatenate gzip files:
mysed() {
sed 's/WITH (appendonly=true, compresstype=quicklz)//' | gzip
}
export -f mysed
gzip -cd /data/file.gz | parallel --pipe -k --block 50M mysed > file_seeded.gz
Adjust 50M until you find the value that works best. It depends on how fast I/O to /tmp is and how much RAM and CPU cache you have. The best value will most likely be between 1M and 1000M.
If time is more important than disk space, use gzip -1.

There is no way to avoid recompressing the edited data, and that recompression dominates the execution time. All I can suggest is to use gzip -1 or gzip -3 to speed up the compression at the cost of slightly larger output. You can also use pigz to make use of all of your cores.
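For instance, a minimal sketch of the same pipeline with pigz doing the recompression (the -3 level is just an illustrative choice):
gzip -cd /data/file.gz | sed 's/WITH (appendonly=true, compresstype=quicklz)//' | pigz -3 > file_seeded.gz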

Related

How to merge multiple large .gz files into one in a efficient way?

I'm trying to combine multiple (29) compressed files (.gz), one after the other, into one file.
The compressed files are around 500MB and in their uncompressed format ~30GB. All the files start with a header that I don't want in the final file.
I have tried to do it using zcat and gzip, but it takes a lot of time (more than 3 hours):
zcat file*.gz | tail -n +2 | gzip -c >> all_files.txt.gz
I have also tried it with pigz:
unpigz -c file*.gz | tail -n +2 | pigz -c >> all_files_pigz.txt.gz
In this case, I'm working on a cluster where this command isn't available and I can't install anything.
The last thing I have tried is to merge all with cat:
cat file*.gz > all_files_cat.txt.gz
It doesn't take a lot of time, but when I go to read it, at some point the following message appears:
gzip: unexpected end of file
How could I deal with this?
If you want to remove the first line of every uncompressed file, and concatenate them all into one compressed file, you'll need a loop. Something like
for f in file*.gz; do
zcat "$f" | tail -n +2
done | gzip -c > all_files_cat.txt.gz
If there are lots of big files, yes, it can take a while. Maybe use a lower compression level than the default (at the expense of a larger file size), or use a different compression program than gzip; there are lots of options, each with its own speed/compression-ratio trade-off.
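As a sketch, the same loop with a faster, lower compression level (gzip -1 here is just one possible choice):
for f in file*.gz; do
zcat "$f" | tail -n +2
done | gzip -1 -c > all_files_cat.txt.gz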

How to convert SIMD bash commands into GPU processable commands?

Consider a SIMD-style task that extracts all matches of a pattern from a file, like this:
grep -n <some_pattern> fileName
This can be made faster using GNU Parallel and some modifications, like this:
cat fileName | parallel -j{cores} --pipe --block {chunk_size}M --cat LC_ALL=C grep -n '/some_pattern/'
I can also use xargs for parallel execution if the single input file is split into multiple separate files:
xargs -P {cores} -L {line_per_process} bash -c grep {1}< fileID*
But this kind of parallelism is limited by the number of CPU cores that you can have.
I am interested in knowing whether there is any way to convert such commands into GPU(CUDA) threads?
Could the whole task be broken into chunks equal to the number of CPU cores, with each chunk then processed as individual threads on the GPU?
I will be surprised if there is such a way. grep is not like a matrix multiplication, where you do exactly the same machine-code instruction for every byte. On the contrary, grep does a lot of optimization for different situations (e.g. if the current byte does not match, skip this many bytes ahead).
So while you may call this Same Command Multiple Data (SCMD), it does not qualify as SIMD at the machine code level.
That does not mean that there is no way to convert grep into real SIMD, but this is not going to be automatic. You will have to rewrite grep using algorithms that are suitable for GPUs. And that can clearly be done: https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/competition/bkase.github.com/CUDA-grep/finalreport.html
If you want to convert a tool other than grep, you will again have to rewrite that tool, possibly using some of the algorithms you used for grep, but not necessarily: it might be that you have to use completely different algorithms.
Normally you will be limited by your disk (your disk is slow, grep is fast).
If you have really fast disks try:
parallel -a filename -k --pipepart --block -1 LC_ALL=C grep '/some_pattern/'
--pipe can deliver on the order of 100 MB/s in total. --pipepart can deliver on the order of 1 GB/s per core (and usually your disks cannot deliver 1 GB/s per core). --block -1 chops filename into one block per job on the fly.
Unfortunately you lose the ability to see the line number (so grep -n will give the wrong answer).
If your grep is still limited by CPU, then you should probably ask another question and elaborate on why your grep is so CPU intense.

Optimize bash command to count all lines in HDFS txt files

Summary:
I need to count all unique lines in all .txt files in an HDFS instance.
Total size of .txt files ~450GB.
I use this bash command:
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/.txt | cut -d , -f 1 | sort --parallel=<some-number> | uniq | wc -l
The problem is that this command takes all free ram and the HDFS instance exits with code 137 (out of memory).
Question:
Is there any way I can limit the ram usage of this entire command to let's say half of what's free in the hdfs OR somehow clean the memory while the command is still running?
Update:
I need to remove | sort | because it is a merge sort implementation and therefore has O(n) space complexity.
I can use only | uniq | without | sort |.
Some things you can try to limit sort's memory consumption (a combined example follows the list):
Use sort -u instead of sort | uniq. That way sort has a chance to remove duplicates on the spot instead of having to keep them until the end. 🞵
Write the input to a file and sort the file instead of running sort in a pipe. Sorting pipes is slower than sorting files and I assume that sorting pipes requires more memory than sorting files:
hdfs ... | cut -d, -f1 > input && sort -u ... input | wc -l
Set the buffer size manually using -S 2G. The buffer size is shared between all threads. The size specified here roughly equals the overall memory consumption when running sort.
Change the temporary directory using -T /some/dir/different/from/tmp. On many Linux systems /tmp is a ramdisk, so be sure to use an actual hard drive.
If the hard disk is not an option you could also try --compress-program=PROG to compress sort's temporary files. I'd recommend a fast compression algorithm like lz4.
Reduce parallelism using --parallel=N, as more threads need more memory. With a small buffer, too many threads are less efficient.
Merge at most two temporary files at once using --batch-size=2.
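A minimal sketch combining several of these options (the 2G buffer, the thread count of 4, the scratch directory and lz4 are assumptions you would tune for your system):
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/.txt | cut -d , -f 1 |
  sort -u -S 2G --parallel=4 -T /mnt/scratch --compress-program=lz4 --batch-size=2 | wc -l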
🞵 I assumed that sort was smart enough to immediately remove sequential duplicates in the unsorted input. However, from my experiments it seems that (at least) sort (GNU coreutils) 8.31 does not.
If you know that your input contains a lot of sequential duplicates as in the input generated by the following commands …
yes a | head -c 10m > input
yes b | head -c 10m >> input
yes a | head -c 10m >> input
yes b | head -c 10m >> input
… then you can drastically save resources on sort by using uniq first:
# takes 6 seconds and 2'010'212 kB of memory
sort -u input
# takes less than 1 second and 3'904 kB of memory
uniq input > preprocessed-input &&
sort -u preprocessed-input
Times and memory usage were measured using GNU time 1.9-2 (often installed in /usr/bin/time) and its -v option. My system has an Intel Core i5 M 520 (two cores + hyper-threading) and 8 GB memory.
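For reference, a measurement of this kind might look like the following (the sorted output is discarded so that only GNU time's report remains):
/usr/bin/time -v sort -u input > /dev/null
# peak memory appears as "Maximum resident set size (kbytes)" in the report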
Reduce the number of sorts run in parallel.
From info sort:
--parallel=N: Set the number of sorts run in parallel to N. By default, N is set
to the number of available processors, but limited to 8, as there
are diminishing performance gains after that. Note also that using
N threads increases the memory usage by a factor of log N.
Regarding the out-of-memory failure, from man sort:
--batch-size=NMERGE
      merge at most NMERGE inputs at once; for more use temp files
--compress-program=PROG
      compress temporaries with PROG; decompress them with PROG -d
-S, --buffer-size=SIZE
      use SIZE for main memory buffer
-T, --temporary-directory=DIR
      use DIR for temporaries, not $TMPDIR or /tmp; multiple options
      specify multiple directories
These are the options you could be looking into. Specify a temporary directory on disk and a buffer size, e.g. 1 GB, like so: sort -u -T "$HOME"/tmp -S 1G.
Also as advised in other answers, use sort -u instead of sort | uniq.
Is there any way I can limit the ram usage of this entire command to let's say half of what's free in the hdfs
Kind of: use the -S option. For example, you could run sort -S "$(free -t | awk '/Total/{print $4}')".

Grepping a huge file (80GB) any way to speed it up?

grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful linux server which is otherwise not overloaded.
Any alternative to grep? Anything about my syntax that can be improved (would egrep or fgrep be better)?
The file is actually in a directory that is shared with a mount to another server, but the actual disk space is local, so that shouldn't make any difference?
The grep is using up to 93% CPU.
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to a RAM disk.
If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include (see the combined command after this list):
Dropping the -i flag.
Using the -F flag for a fixed string
Disabling NLS with LANG=C
Setting a max number of matches with the -m flag.
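A sketch putting these together (the -m limit of 100 is just an assumed cut-off; drop it if you need every match):
LANG=C grep -F -m 100 -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql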
Some trivial improvements:
Remove the -i option if you can; case-insensitive matching is quite slow.
Replace the . with \.
A single dot is the regex symbol that matches any character, which is also slow.
Two lines of attack:
Are you sure you need the -i, or do you have a possibility to get rid of it?
Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.
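As an illustration only, a sketch of the -f approach (strings.txt and its second entry are hypothetical):
# strings.txt holds one fixed pattern per line (the second entry is made up)
printf '%s\n' 'db_pd.Clients' 'db_pd.Suppliers' > strings.txt
LC_ALL=C grep -F -f strings.txt -C 5 eightygigsfile.sql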
Try ripgrep
It provides much better results compared to grep.
All the above answers were great. What really helped me on my 111 GB file was using LC_ALL=C fgrep -m <maxnum> fixed_string filename.
However, sometimes there may be 0 or more repeating patterns, in which case calculating maxnum isn't possible. The workaround is to use the start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them, like so:
# line number of the first match of the start pattern
startline=$(grep -n -m 1 "$start_pattern" file | awk -F':' '{print $1}')
# line number of the first match of the end pattern
endline=$(grep -n -m 1 "$end_pattern" file | awk -F':' '{print $1}')
# extract only the lines between the two
logs=$(tail -n "+$startline" file | head -n $((endline - startline + 1)))
Then work on this subset of logs!
Hmm, what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
I then randomly selected rows at an average rate of 1 in every 3^5 (using rand(), not just NR % 243) and placed the string db_pd.Clients at a random position in the middle of the existing text, totaling 2.16 million rows where the regex pattern hits:
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372
And mawk2 took just 59 seconds to extract a list of the row ranges it needs. From there it should be relatively trivial; some overlap may exist.
At throughput rates of 1.3 GiB/s, as calculated above by pv, it might even be detrimental to use utilities like parallel to split the task.

Fastest possible grep

I'd like to know if there is any tip to make grep as fast as possible. I have a rather large base of text files to search in the quickest possible way. I've made them all lowercase, so that I could get rid of -i option. This makes the search much faster.
Also, I've found out that -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), the latter if regex is involved.
Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?
Try with GNU parallel, which includes an example of how to use it with grep:
grep -r greps recursively through directories. On multicore CPUs GNU
parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 jobs per core and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
If you're searching very large files, then setting your locale can really help.
GNU grep goes a lot faster in the C locale than with UTF-8.
export LC_ALL=C
Ripgrep claims to now be the fastest.
https://github.com/BurntSushi/ripgrep
Also includes parallelism by default
-j, --threads ARG
The number of threads to use. Defaults to the number of logical CPUs (capped at 6). [default: 0]
From the README:
It is built on top of Rust's regex engine. Rust's regex engine uses
finite automata, SIMD and aggressive literal optimizations to make
searching very fast.
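A basic invocation might look something like this (the path and string are placeholders, and -j 8 is only an example; by default rg picks a thread count automatically):
rg -F -j 8 'some fixed string' /path/to/textbase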
Apparently using --mmap can help on some systems:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
Not strictly a code improvement but something I found helpful after running grep on 2+ million files.
I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.
If you don't care about which files contain the string, you might want to separate reading and grepping into two jobs, since it might be costly to spawn grep many times (once for each small file).
If you have one very large file:
parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>
Many small compressed files (sorted by inode):
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>
I usually compress my files with lz4 for maximum throughput.
If you want just the filename with the match:
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
Building on the response by Sandro, I looked at the reference he provided here and played around with BSD grep vs. GNU grep. My quick benchmark results showed that GNU grep is way, way faster.
So my recommendation for the original question "fastest possible grep": make sure you are using GNU grep rather than BSD grep (which is the default on macOS, for example).
I personally use ag (the silver searcher) instead of grep, and it's way faster; you can also combine it with parallel and pipe blocks.
https://github.com/ggreer/the_silver_searcher
Update:
I now use https://github.com/BurntSushi/ripgrep which is faster than ag depending on your use case.
One thing I've found to be faster when using grep to search (especially for changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:
Say you have a file of IDs you want to search for called my_ids.txt, and the big file is named bigfile.txt.
Use split to split the file into parts:
# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]
# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep $id >> matches.txt ; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files
In my case this cut what would have been a 17-hour job down to 1 hour 20 minutes. I'm sure there's some sort of bell curve here on efficiency, and obviously going over the available cores won't do you any good, but this was a much better solution than any of the above for my requirements as stated above. An added benefit over GNU parallel is that this uses mostly native (Linux) tools.
cgrep, if it's available, can be orders of magnitude faster than grep.
MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.
https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep
https://metacpan.org/release/MCE
One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for GNU parallel mentioned on this page. I was really hoping for GNU Parallel to work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.
Another alternate is the MCE::Grep module included with MCE.
A slight deviation from the original topic: the indexed-search command-line utilities from the googlecodesearch project are way faster than grep: https://github.com/google/codesearch
Once you compile it (Go is needed), you can index a folder with:
# index current folder
cindex .
The index will be created under ~/.csearchindex
Now you can search:
# search folders previously indexed with cindex
csearch eggs
I'm still piping the results through grep to get colorized matches.
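For example, something along these lines (reusing the pattern from above):
csearch eggs | grep --color eggs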
