There are 81 million files (!) stored in one directory on a remote machine. All files end in ".paintedHaploDiversity". I would like to merge those files into one called allOutputs_3.5 in the parent directory. More specifically, each file contains two or three lines. The first line is a header that I can ignore. Among the remaining one or two lines, one has the value 2 in the fourth column. For each file, I want to copy the whole line that has a 2 in the fourth column and prepend to it the filename (excluding the extension ".paintedHaploDiversity"). I refer to this filename as "simID".
For information, the remote machine runs Mac OS X 10.11.6 (15G22010). It is a simple desktop, so there is no network involved (other than the ssh connection I use to reach it).
I first tried
for f in *;
do
simID=${f%.paintedHaploDiversity}
awk -v simID=${simID} 'NR>1{if ($4==2) {printf simID"\t"; print}}' $f >> ../allOutputs_3.5
done
but it was very slow. I estimated the required time at months or even years! Then, I tried
awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' * >> ../allOutputs
but it does not seem any faster. Just as a speed test, I also considered
find . -exec cat '{}' ';' > out
but it is again very slow. Thinking that the issue might come from the glob expansion of *, I tried to loop through each file by reconstructing its name with two C-style loops.
for ((bigID=1; bigID <= 9 ;++bigID)); do
for ((rep=1; rep <= 9000000 ;++rep)); do
awk -v simID=3.5_${bigID}_${rep} 'NR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_${rep}.paintedHaploDiversity >> ../allOutputs_3.5
done
done
The process is now quite a bit faster, but it would still take months to run! Finally, I figured I might as well remove the lines where the fourth column is not equal to 2 only later on (probably with a sed command) and do
for ((bigID=1; bigID <= 6 ;++bigID)); do
for ((r=1; r <= 9000000 ;++r)); do
printf "3.5_${bigID}_${r}\t" >> ../allOutputs_3.5
tail -n +2 3.5_${bigID}_${r}.paintedHaploDiversity >> ../allOutputs_3.5
done
done
This last process is expected to take about two weeks, which starts to be reasonable. I am still wondering what is making it so slow and whether it can be improved.
I suppose the bottleneck is likely the disk I/O. Or is it the filesystem that takes a lot of CPU time? Is the process so slow because there are so many files in the same directory and the directory index must be searched at every iteration of the loop? How can it be improved? Should I try writing the process in C++?
If it helps, here is the output of top -o MEM while the last command (the one using printf and tail) was running:
Processes: 254 total, 3 running, 12 stuck, 239 sleeping, 1721 threads 03:12:40
Load Avg: 2.04, 1.79, 1.60 CPU usage: 0.84% user, 4.33% sys, 94.81% idle
SharedLibs: 85M resident, 11M data, 10M linkedit.
MemRegions: 42324 total, 4006M resident, 63M private, 230M shared.
PhysMem: 14G used (2286M wired), 10G unused.
VM: 753G vsize, 535M framework vsize, 1206153(0) swapins, 2115303(0) swapouts.
Networks: packets: 413664671/284G in, 126210468/104G out.
Disks: 1539349069/12T read, 1401722156/7876G written.
PID COMMAND %CPU TIME #TH #WQ #PORTS MEM PURG CMPRS PGRP PPID STATE
0 kernel_task 42.1 1716 hrs 167/25 0 2- 1968M 0B 0B 0 0 running
366 SystemUIServ 0.4 24:42:03 5 2 345 1055M 0B 10M 366 1 sleeping
472 softwareupda 0.0 12:46:11 5 0 3760 340M 0B 18M 472 1 sleeping
54242 Sublime Text 0.0 03:55:44 12 0 237 233M 0B 68K 54242 1 sleeping
63 powerd 0.0 44:07:21 2 0 95 204M 0B 8932K 63 1 sleeping
34951 Finder 0.1 04:11:06 9 2 1665 166M 0B 68M 34951 1 sleeping
197 WindowServer 0.0 40:02:58 3 0 453 142M 0B 63M 197 1 sleeping
13248 Terminal 0.0 84:19.45 5 0 388 114M 0B 113M 13248 1 sleeping
29465 X11.bin 0.0 89:38.70 9 0 229 104M 0B 16M 29464 29464 sleeping
12372 system_insta 0.0 00:31.61 2 0 75 78M 0B 9996K 12372 1 sleeping
1588 sysmond 0.0 02:34:04 2 1 23 62M 0B 4536K 1588 1 sleeping
54245 plugin_host 0.0 00:03.88 5 0 56 51M 0B 0B 54242 54242 sleeping
554 spindump 0.0 00:36.51 2 1 164 44M 0B 33M 554 1 sleeping
20024 com.apple.GS 0.0 00:01.43 3 2 24 43M 0B 2200K 20024 1 sleeping
475 suhelperd 0.0 00:19.84 2 0 55 42M 0B 28M 475 1 sleeping
418 installd 0.0 01:21.89 2 0 69 40M 0B 12M 418 1 sleeping
57 fseventsd 0.1 13:03:20 10 0 241 39M 0B 2904K 57 1 sleeping
364 Dock 0.0 08:48.83 3 0 283 38M 0B 27M 364 1 sleeping
201 sandboxd 0.0 18:55.44 2 1 38 38M 0B 10M 201 1 sleeping
103 loginwindow 0.0 04:26.65 2 0 377 35M 0B 3400K 103 1 sleeping
897 systemstatsd 0.0 65:30.17 2 1 43 34M 0B 4928K 897 1 sleeping
367 fontd 0.0 11:35.30 2 0 77 32M 0B 5920K 367 1 sleeping
396 ScopedBookma 0.0 01:00.46 3 2 46 32M 0B 28M 396 1 sleeping
22752 cfbackd 0.4 32:18.73 9 1 84 30M 0B 0B 22752 1 sleeping
39760 Preview 0.0 00:03.75 3 0 209 29M 0B 0B 39760 1 sleeping
53 syslogd 0.0 05:33:59 4 3 186- 29M- 0B 1668K 53 1 sleeping
533 SmartDaemon 0.0 27:07.67 10 7 175 28M 128K 5192K 533 1 stuck
388 iconservices 0.0 00:08.85 2 1 66 27M 0B 157M 388 1 sleeping
7268 diskmanageme 0.0 00:40.14 888 0 8899 27M 0B 7352K 7268 1 sleeping
513 Notification 0.0 00:46.42 3 0 245 26M 0B 9852K 513 1 sleeping
83 opendirector 0.0 19:22:12 6 5 8827 26M 0B 2444K 83 1 sleeping
557 AppleSpell 0.0 03:12.61 2 0 57 26M 0B 10M 557 1 sleeping
422 com.apple.ge 0.0 01:50.41 5 0 83 25M 0B 1680K 422 1 sleeping
397 storeaccount 0.0 00:48.41 4 0 1333 21M 0B 2248K 397 1 sleeping
87 launchservic 0.0 64:26.85 3 2 306 20M 0B 5804K 87 1 sleeping
1 launchd 0.0 26:26:23 5 4 1802 20M 0B 6532K 1 0 stuck
222 taskgated 0.0 17:59:00 3 1 43 19M 0B 4528K 222 1 sleeping
54 UserEventAge 0.0 18:19.74 3 0 32605- 18M- 0B 2968K 54 1 sleeping
4527 com.apple.sp 0.0 00:13.01 2 0 48 17M 0B 7792K 4527 1 sleeping
79 coreduetd 0.0 05:40.06 2 0 95 17M 0B 4604K 79 1 sleepin
and here is the output of iostat
disk0 disk1 disk2 cpu load average
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us sy id 1m 5m 15m
7.19 152 1.07 8.10 0 0.00 8.22 0 0.00 15 50 35 1.68 1.74 1.59
Example:
Consider the following files
file_0:
first second third fourth fifth
bbb a a 2 r
file_1:
first second third fourth fifth
f o o 2 o
file_2:
first second third fourth fifth
f r e 1 e
x xxx x 2 x
file_3:
first second third fourth fifth
a a a 2 a
The expected output is
file_0 bbb a a 2 r
file_1 f o o 2 o
file_2 x xxx x 2 x
file_3 a a a 2 a
You can probably cope with two single calls to grep and sed. This should be pretty fast, maybe even faster than a self-written C program.
cd dir_with_all_the_files
grep -rE '^([^ ]+ +){3}2 ' . |
sed -En 's/^\.\/(.*)\.paintedHaploDiversity:/\1 /p' > ../allOutputs_3.5
Assumptions made:
The header line does not itself have a 2 in the fourth column.
The directory contains no subdirectories. (Otherwise the command may still produce correct results, but it could run needlessly long.)
The filenames contain no : or linebreaks.
Your grep implementation supports the non-POSIX -r option (usually the case).
Further improvements if your grep implementation supports it:
Add -m1 to speed up the search.
Try grep -P (usually not supported on Mac OS) or pcregrep. PCRE is sometimes faster. With PCRE you can also try the alternative regex '^(.*? ){3}2 '.
--exclude-dir \* (note that * is quoted) excludes subdirectories, so that you can use the command even without above assumption.
If you want the output to be sorted by filenames (as you would get when iterating *.paintedHaploDiversity), run sort -t ' ' -k 1,1 -o allOutputs_3.5{,} afterwards.
You might as well set export LC_ALL=C to speed up grep, sort, and maybe even sed.
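Putting several of those improvements together (a sketch; exact option support depends on your grep build):
cd dir_with_all_the_files
LC_ALL=C grep -rE -m1 --exclude-dir \* '^([^ ]+ +){3}2 ' . |
LC_ALL=C sed -En 's/^\.\/(.*)\.paintedHaploDiversity:/\1 /p' |
LC_ALL=C sort -t ' ' -k 1,1 > ../allOutputs_3.5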
Difficult problem. Might have painted yourself into a corner there...
If even the find command takes too long, which does nothing but open, read, and close every file, then the likely bottleneck is the seek time of an HDD. This is typically around 10 ms (source), so for 81 million files you're looking at almost 10 days, assuming a single seek per file. Due to the filesystem (directory accesses etc.) it may take more seeks, but if locality is good each seek might also be shorter.
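Spelled out (rough numbers, assuming a single seek per file): 81,000,000 seeks × 10 ms/seek ≈ 810,000 s ≈ 9.4 days.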
If you can afford to wait this long once, I'd recommend zipping up all those files into a single file. This will take a lot of time, but after that you can process the data set more quickly.
If zipping (or otherwise copying or accessing) each individual file is out of the question, a solution might be to take an image (snapshot) of the entire filesystem and copy that onto a faster drive. SSDs have seek times around 0.1 ms (source) so working off of an SSD you could be finished in slightly over two hours.
A more hardcore approach would be to write code that operates directly on the raw disk bytes, implementing the necessary parts of the filesystem and using large in-memory buffers to avoid disk seeks. Depending on how the files are scattered across the disk, this might give you a big speedup, but of course it's a nontrivial effort to program this.
The problem, apart from the obvious I/O load of processing a few GB of data, is that starting one or several processes 81 million times takes a long time. Even building a command line or expanding a file glob to, say, 300 MB (for f in *...) may need substantial time or exceed system and program limits.
One solution is to write a C program which opens the files and processes them, or pipes their contents to other programs. But that may take a couple of days to program and debug, and maybe your intern is on vacation. There are, however, already programs in the Unix toolbox which do part of what you need, except that the filenames are lost. We assume that all files are in a directory called bla.
Use tar to create a stream with the contents of the files, like this:
tar cf - bla | tar -xOf -
This writes the concatenated contents of the files to standard out, by default the console. Both tars (and the grep added below) are started only once. The first tar finds all the files in the directory and creates an archive (which is some sort of structured concatenation) which it writes to stdout; the second tar grabs that archive, extracts the files and writes them to stdout instead of creating files in the file system, thanks to -O.
After that, start processing:
tar cf - bla | tar -xOf - | grep '^whatever is before the 2 \<2\>' > out.txt
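With the concrete test from the question (a 2 in the fourth column), borrowing the pattern from the grep/sed answer above, the filter stage might look like:
tar cf - bla | tar -xOf - | grep -E '^([^ ]+ +){3}2 ' > out.txt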
If the presence of the filenames is a hard requirement, you may be able to repeat the processing chain but have the second tar emit the file names instead (the -t option), and pipe that into a shell script which reads one line from out.txt and one from the tar output, combines the two, and writes the combined lines to a new file.
If the printf/tail attempt is the fastest so far (about 2 weeks, based solely on the OP's comments), I'd want to eliminate the 81 million printf/tail command pairs in favour of a much smaller number of awk calls (using substr(FILENAME) for the simID), each working on a wildcard set that breaks processing into, say, ~10K files at a time, e.g.:
for bigID in {1..6}
do
    # poll first 99 files (r=1..99) explicitly
    awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_{1..99}.paintedHaploDiversity >> ../allOutputs
    # break the rest of the files into ~10K chunks based on the first 3 digits of the suffix
    for r in {100..999}
    do
        awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_${r}*.paintedHaploDiversity >> ../allOutputs
    done
done
NOTE: I'm only picking ~10K on the assumption that there's some sort of performance hit when awk is handed a much larger set of filenames; some testing of this size may find a sweet spot for the number of files awk can (quickly) handle.
Also, iostat is showing 3x disks. If these are 3x physically separate disks and they're attached as separate disks (ie, not part of a RAID config), then make sure the target file (allOutputs_3.5) resides on a different disk from the source files. This should cut down on the read->write->read->write thrashing (more so on HDDs, less so on SSDs).
NOTE: This (obviously) assumes there is room on the other disk(s) to hold the target file.
I'd probably want to test this idea (read from disk #1, write to disk #2) with a small subset of files (eg, 110K), using each of the previously mentioned coding attempts, to see if there's a (relatively) large diff in timings (thus pointing at the read/write thrashing as being one bottleneck).
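A sketch of such a test on a ~1,000-file subset (the glob and the second disk's mount point are made up; adjust them to your layout):
# same-disk baseline
time awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_1_10???.paintedHaploDiversity >> ../allOutputs_test
# cross-disk: same read, but write the output to a different physical disk
time awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_1_10???.paintedHaploDiversity >> /Volumes/OtherDisk/allOutputs_test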
Any solution with a bash loop, where you call one or more processes millions of times, will be very slow. Also, the attempt awk '{...}' * > output for me, on Linux, resulted in: bash: /usr/bin/awk: Argument list too long.
With find and xargs
find is what you have to use, but not with -exec, because that way you would again call millions of processes, one per file argument; use xargs instead, which lets you pass tons of arguments to a single process. You can also do the job in batches with xargs -n. In general it is possible to hit some limitation of your OS, bash arguments, etc., but I have not tested with such a huge number of files.
I executed the solution below on a very old box, slower than the desktop in question, and a sample of 800K files (about 1% of the total in question) took 3 minutes.
find . -type f -printf "%f\n" |\
xargs awk '$4==2{ print(substr(FILENAME, 1, length(FILENAME)-22), $0) }' >> output.txt
First, you have to avoid swap usage during execution or it will slow down dramatically, and second, you will probably hit some of the limits mentioned above. So it may need to be done in batches: run find once and save the results to a file, split that file into batches (e.g. 1M filenames each) and feed each chunk to awk through xargs.
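A sketch of that batching, sticking to tools available on the Mac (so no GNU-only options; the 1M chunk size is arbitrary):
find . -type f -name '*.paintedHaploDiversity' | sed 's|^\./||' > ../filenames.txt
split -l 1000000 ../filenames.txt ../chunk_
for c in ../chunk_*; do
    xargs awk '$4==2{ print(substr(FILENAME, 1, length(FILENAME)-22), $0) }' < "$c" >> ../output.txt
done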
Without find, creating the filenames with a loop: use xargs again
I see that you can create the filenames in a bash loop, as they follow a standard pattern, which could be faster than find, but I believe this is not the bottleneck anyway. Again, you should not execute one command per filename, but provide the whole list to awk through xargs.
For example, create the filenames with your loop and save them to a file:
for (( i=1;i<=9;i++ )); do
for (( j=1;j<=9000000;j++ )); do
printf "file_%s_%s\n" "$i" "$j" >> filenames.txt
done
done
and feed them once to awk:
cat filenames.txt | xargs awk '{...}'
or in batches, e.g. of 1M
split -l 1000000 -d filenames.txt chunk
for f in chunk*; do cat "$f" | xargs awk '{...}' ; done
Related
Using bash, how can I awk/grep from the middle of a given file, skipping the first 1 GiB for instance? In other words, I don't want awk/grep to search through the first 1 GiB of the file; I want to start my search in the middle of the file.
You can use dd like this:
# make a 10GB file of zeroes
dd if=/dev/zero bs=1G count=10 > file
# read it, skipping first 9GB and count what you get
dd if=file bs=1G skip=9 | wc -c
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.84402 s, 582 MB/s
1073741824
Note that I am just demonstrating the concept of how easily you can skip 9 GB. In practice, you may prefer to use a 100 MB memory buffer and skip 90 of them rather than allocating a whole gigabyte. So you might prefer:
dd if=file bs=100M skip=90 | wc -c
Note also that I am piping to wc rather than awk because my test data is not line oriented - it is just zeros.
Or, if your record size is 30kB and you want to skip a million records and discard diagnostic output:
dd if=file bs=30K skip=1000000 2> /dev/null | awk ...
Note that:
your line numbers will be "wrong" in awk (because awk didn't "see" them), and
your first line may be incomplete (because dd isn't "line oriented") but I guess that doesn't matter.
Note also that it is generally very advantageous to use a large block size. So, if you want 8 MB, you will do much better with bs=1M count=8 than with bs=8 count=1000000, which will cause a million writes of 8 bytes each.
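A quick way to see the cost of tiny blocks for yourself (both commands move the same 8 MB; GNU dd syntax shown, BSD/macOS dd wants lowercase suffixes such as bs=1m):
time dd if=/dev/zero of=/dev/null bs=1M count=8
time dd if=/dev/zero of=/dev/null bs=8 count=1000000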
Note also, that if you like processing very large files, you can get GNU Parallel to divide them up for processing in parallel by multiple subprocesses. So, for example, the following code takes the 10GB file we made at the start and starts 10 parallel jobs counting the bytes in each 1GB chunk:
parallel -a file --recend "" --pipepart --block 1G wc -c
If you know the full size of the file (let's say 5 million lines) you can do this:
tail -n 2000000 filename | grep "yourfilter"
This way you will do whatever editing or printing starting below the first 3 million lines.
I have not tested the performance on very large files compared to tail | grep, but you could try GNU sed:
sed -n '3000001,$ {/your regex/p}' file
This skips the first 3 million lines and then prints all lines matching the regular expression your regex. The same with awk:
awk 'NR>3000000 && /your regex/' file
I have a .txt file with bookmarks and all bookmarks above 100 have to be placed 2 pages down from where they are now, because I added two pages in the document. How do I write a bash script that adds 2 to all integers it finds in the document?
I'm new to writing code in general, but I already know that I should make a for loop to read each line, then determine if each word is an integer or not and then with an if statement add 2 to each integer above 100.
The problem is that I don't exactly know how to access (read and write) the file, and I also don't know how to determine whether something is a number or not.
Here is the link to the .txt file. A small sample:
The Tortle Package; 24
Tortle; 25
Elemental Evil Player's Companion; 27
Aarakocra; 28
Deep Gnome (gnome subrace); 30
Eberron\: Rising from the Last War; 84
Changelings; 85
Gnomes; 91
Goblinoids; 92
Bugbear; 93
Goblin; 94
Hobgoblin; 94
Half-Elves; 94
I did some research and this is the code I've come up with:
#!/bin/bash
cd /home/dexterdy/Documents/
i=$(grep -ho '[0-9]*' bookmarks.txt)
if [ "$i" -gt 100 ]; then
i += 2
fi
It seems that the grep variable outputs one large string with all the numbers. I also can't get the if-statement to work for some reason and I don't know how to actually write the numbers into the file.
From the shape of your input file, I suggest the following magic:
awk 'BEGIN{FS=OFS=";"}($NF>100){$NF+=2}1' input_file > output_file
This will remove the space just after the ;, which can be put back by doing:
awk 'BEGIN{FS=OFS=";"}($NF>100){$NF=" "($NF+2)}1' input_file > output_file
If you want to ensure that malformed lines such as
foo;20
bar\; car;105
are all correctly converted into
foo; 20
bar\; car; 107
you have to do:
awk 'BEGIN{FS=OFS=";"}{$NF=" "($NF+($NF>100?2:0))}1' input_file > output_file
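As a quick sanity check, here are a few sample lines piped through that last command (the "Some Chapter" line is made up just to have a number above 100):
$ printf 'foo;20\nGnomes; 91\nSome Chapter;105\n' | awk 'BEGIN{FS=OFS=";"}{$NF=" "($NF+($NF>100?2:0))}1'
foo; 20
Gnomes; 91
Some Chapter; 107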
Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved ?
Is there a faster way to accomplish this goal ?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq file uncompressed we could do it even faster, but we will have to cheat a little: often in fastq files you will have a seqname that starts with the same string:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as the record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to give the block size as 1/16 of the size of file1.fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
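One way to fill in that 1/16th figure (a sketch; stat -c %s is GNU, the BSD/macOS equivalent is stat -f %z):
blocksize=$(( $(stat -c %s file1.fastq) / 16 ))
parallel -a file1.fastq --block "$blocksize" -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"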
If you use GNU Parallel 20161222 or later, GNU Parallel can do that computation for you. --block -1 means: choose a block size so that you can give one block to each of the 16 job slots.
parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipe-part --block -1 -j16 \
  --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
  my_command
Here we assume that the lines will start like this:
@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '@', they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte has been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
# $FILE is set by split --filter to the name of the current output chunk
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
# prepend a NUL to every 4-line record, round-robin split on NUL, then strip the NULs again
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
The output files will be named big.aa.gz .. big.ap.gz.
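A sanity check that nothing was lost (both counts should come out identical; for the file in the question that would be 1,667,430,708):
zcat big.??.gz | wc -l
zcat big.gz | wc -l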
I have an ASCII file of a few thousand lines, processed one line at a time by a bash script. Because the processing is embarrassingly parallel, I'd like to split the file into parts of roughly the same size, preserving line breaks, one part per CPU core. Unfortunately the file suffixes made by split -n r/numberOfCores aren't easily iterated over.
split --numeric-suffixes=1 -n r/42 ... makes files foo.01, foo.02, ..., foo.42, which can be iterated over with for i in `seq -w 1 42` (because -w adds a leading zero). But if the 42 changes to something smaller than 10, the files still have the leading zero while the seq output doesn't, so it fails. This concern is valid, because nowadays some PCs have fewer than 10 cores and some have more than 10. A ghastly workaround:
[[ $numOfCores < 10 ]] && optionForSeq="" || optionForSeq="-w"
The naive solution for f in foo.* is risky: the wildcard might match files other than the ones that split made.
An ugly way to make the suffixes seq-friendly, but with the same risk:
split -n r/numOfCores infile foo.
for i in `seq 1 $numOfCores`; do
mv `ls foo.* | head -1` newPrefix.$i
done
for i in `seq 1 $numOfCores`; do
... newPrefix.$i ...
done
Is there a cleaner, robust way of splitting the file into n parts, where n (1 <= n <= 64) isn't known until runtime, and then iterating over those parts? Should split write only into a freshly created directory?
(Edit: To clarify "if the 42 changes to something smaller than 10," the same code should work on a PC with 8 cores and on another PC with 42 cores.)
A seq-based solution is clunky. A wildcard-based solution is risky. Is there an alternative to split? (csplit with line numbers would be even clunkier.) A gawk one-liner?
How about using a format string with seq?
$ seq -f '%02g' 1 4
01
02
03
04
$ seq -f '%02g' 1 12
01
02
03
...
09
10
11
12
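Applied to the question (assuming split's default two-digit numeric suffixes, i.e. numOfCores <= 99), that would give something like:
split --numeric-suffixes=1 -n r/$numOfCores infile foo.
for i in $(seq -f '%02g' 1 $numOfCores); do
    ... foo.$i ...
done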
With GNU bash 4:
Use printf to format your numbers:
for ((i=1;i<=4;i++)); do printf -v num "%02d" $i; echo "$num"; done
Output:
01
02
03
04
Are you sure this is not a job for GNU Parallel?
cat file | parallel --pipe -N1 myscript_that_reads_one_line_from_stdin
This way you do not need to have the temporary files at all.
If your script can read more than one line (so it is in practice a UNIX filter), then this should be very close to optimal:
parallel --pipepart -k --roundrobin -a file myscript_that_reads_from_stdin
It will spawn one job per core and split file into one part per core on the fly. If some lines are harder to process than others (i.e. you can get "stuck" for a while on a single line), then this solution might be better:
parallel --pipepart -k -a file myscript_that_reads_from_stdin
It will spawn one job per core and split file into 10 parts per core on the fly, thus running 10 jobs per core in total on average.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Get the filenames with ls and then use a regex:
for n in $(ls foo.* | grep "^foo\.[0-9][0-9]*$"); do
    ... "$n" ...
done
I'm trying to create a simple bash script to monitor the following: CPU Utilization, outbound network bandwidth, and inbound network bandwidth. The kicker, I have to use information from /proc/loadavg for the CPU and information from /proc for the bandwidth.
For the CPU Utilization, because it is supposed to be on a short time interval, I can use the first value from /proc/loadavg. Thing is, I'm not sure how to just get that one value so what I have so far is this:
CPU=sudo cat /proc/loadavg | sed 's///'
echo "CPU Utilization: $CPU %"
Where I'm not sure what the sed expression should be. Also, I'm not sure how to format the result so that it prints as "16.5%".
For the bandwidth monitors I haven't the slightest clue of what I could use in /proc to get that sort of information so I'm open to all suggestions.
Load average
You don't need sudo to read /proc/loadavg
In addition, sed is the wrong tool here; try using cut instead, for example:
$ cut -d' ' -f1 < /proc/loadavg
0.04
cut will cut lines by a delimiter (given with -d), in this case a space, and you can then use -f to select a field, in this case the first one.
Now, converting it to a percentage is actually fairly meaningless, since you'll often end up above 100% (see below); I've seen load averages in excess of 50 (that would be 5000%?).
In all my years of UNIX/Linux experience, I can't recall ever seeing the load average being expressed as a percentage, and if I would encounter such a thing, I would find it very odd.
But if you really want to (you don't!), just multiply by 100 with dc, like so:
$ dc -e "`cut -d' ' -f1 < /proc/loadavg` 100 * p"
29.00
For the CPU Utilization, because it is supposed to be on a short time
interval, I can use the first value from /proc/loadavg.
The load average is not the same thing as CPU usage.
A load average of 1 means there is one process waiting for something (usually the CPU or disk).
A load average of 2 means there are two processes waiting.
A load average of 0.5 (over the last minute), can mean that for 30 seconds, there was one process waiting, and for 30 seconds, there were no processes waiting. It can also mean that for 15 seconds there were two processes waiting, and for 45 seconds there were no processes waiting. The keyword here is average.
If you want to get the CPU utilization, then this is probably the most portable way:
$ top -bn2 | grep "Cpu(s)" | \
tail -n1 | \
sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | \
awk '{print 100 - $1"%"}'
Note you need to use -n2 to get fairly accurate results.
I've adapted this from this answer, which also lists some other possibilities, some simpler, but most tools mentioned aren't installed by default on most systems.
Network
For the bandwidth monitors I haven't the slightest clue of what I
could use in /proc to get that sort of information so I'm open to all
suggestions.
You can use the output of ifconfig, for example, on my system:
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.178.28 netmask 255.255.255.0 broadcast 192.168.178.255
inet6 2001:980:82cd:1:20c:29ff:fe9e:c84b prefixlen 128 scopeid 0x0<global>
inet6 fe80::20c:29ff:fe9e:c84b prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:9e:c8:4b txqueuelen 1000 (Ethernet)
RX packets 45891 bytes 36176865 (34.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 20802 bytes 2603821 (2.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
It's the RX packets & TX packets we want. Let's extract just those values:
$ ifconfig ens33 | grep -E '(R|T)X packets' | grep -Eo '\([0-9].*\)' | tr -d '()'
34.5 MiB
2.5 MiB
First we grep all the lines containing RX packets or TX packets.
With those lines, we then grep for a parenthesis \(, followed by a number [0-9], followed by any characters .*, followed by a closing parenthesis \). With the -o flag we show only the matching part, instead of the whole line.
With tr, we remove the unwanted parentheses.
This should be what you want. If you want the number of bytes instead, you can use a different grep pattern in the second grep. I'll leave it as an exercise for you to work out exactly what that is.
Here's how you can print the first number output by cat /proc/loadavg as a percent value (but see @Carpetsmoker's caveat regarding whether that makes sense), rounded to 1 decimal place:
printf "1-minute load average: %.1f%%\n" \
$(bc <<<"$(cut -d ' ' -f 1 /proc/loadavg) * 100")