There are 81 million files (!) stored in one directory on a remote machine. All files end in ".paintedHaploDiversity". I would like to merge those files into one called allOutputs_3.5 in the parent directory. More specifically, each file contains two or three lines. The first line is a header that I can ignore. Among the remaining one or two lines, one of them has the value 2 in the fourth column. For each file, I want to copy the whole line that has a 2 in the fourth column and prepend the filename (excluding the extension ".paintedHaploDiversity"). I refer to this filename as "simID".
For information, the remote machine runs Mac OS X 10.11.6 (15G22010). It is a simple desktop, so no network is involved (apart from the ssh session I use to reach it).
I first tried
for f in *;
do
simID=${f%.paintedHaploDiversity}
awk -v simID=${simID} 'NR>1{if ($4==2) {printf simID"\t"; print}}' $f >> ../allOutputs_3.5
done
but it was very slow. I estimated that it would take months or even years! Then, I tried
awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' * >> ../allOutputs
but it does not seem any faster. Just as a speed test, I also considered
find . -exec cat '{}' ';' > out
but it is again very slow. Thinking that the issue might come from the glob expansion of *, I tried to loop through each file by reconstructing its name with two C-style loops.
for ((bigID=1; bigID <= 9 ;++bigID)); do
for ((rep=1; rep <= 9000000 ;++rep)); do
awk -v simID=3.5.${bigID}_${rep} 'NR>1{if ($4==2) {printf simID"\t"; print}}' 3.5.${bigID}_${rep}.paintedHaploDiversity >> ../allOutputs_3.5
done
done
The process is now quite a bit faster but it would still take months to run! Finally, I figured I might as well remove the lines where the fourth column is not equal to 2 later on (probably with a sed command) and do
for ((bigID=1; bigID <= 6 ;++bigID)); do
for ((r=1; r <= 9000000 ;++r)); do
printf "3.5_${bigID}_${r}\t" >> ../allOutputs_3.5
tail -n +2 3.5_${bigID}_${r}.paintedHaploDiversity >> ../allOutputs_3.5
done
done
The process now is expected to take about two weeks. That starts to be reasonable. I am still wondering what is causing this process to be so slow and whether it can be improved.
I suppose the bottleneck is likely the disk IO. Or is it the filesystem that takes a lot of CPU time? Is the process so slow because there are so many files in the same directory and it requires searching through a binary tree of files at every iteration of the loop? How can it be improved? Should I try writing the process in c++?
If it helps here is the output of top -o MEM while the last command (the one using printf and tail) was running
Processes: 254 total, 3 running, 12 stuck, 239 sleeping, 1721 threads 03:12:40
Load Avg: 2.04, 1.79, 1.60 CPU usage: 0.84% user, 4.33% sys, 94.81% idle
SharedLibs: 85M resident, 11M data, 10M linkedit.
MemRegions: 42324 total, 4006M resident, 63M private, 230M shared.
PhysMem: 14G used (2286M wired), 10G unused.
VM: 753G vsize, 535M framework vsize, 1206153(0) swapins, 2115303(0) swapouts.
Networks: packets: 413664671/284G in, 126210468/104G out.
Disks: 1539349069/12T read, 1401722156/7876G written.
PID COMMAND %CPU TIME #TH #WQ #PORTS MEM PURG CMPRS PGRP PPID STATE
0 kernel_task 42.1 1716 hrs 167/25 0 2- 1968M 0B 0B 0 0 running
366 SystemUIServ 0.4 24:42:03 5 2 345 1055M 0B 10M 366 1 sleeping
472 softwareupda 0.0 12:46:11 5 0 3760 340M 0B 18M 472 1 sleeping
54242 Sublime Text 0.0 03:55:44 12 0 237 233M 0B 68K 54242 1 sleeping
63 powerd 0.0 44:07:21 2 0 95 204M 0B 8932K 63 1 sleeping
34951 Finder 0.1 04:11:06 9 2 1665 166M 0B 68M 34951 1 sleeping
197 WindowServer 0.0 40:02:58 3 0 453 142M 0B 63M 197 1 sleeping
13248 Terminal 0.0 84:19.45 5 0 388 114M 0B 113M 13248 1 sleeping
29465 X11.bin 0.0 89:38.70 9 0 229 104M 0B 16M 29464 29464 sleeping
12372 system_insta 0.0 00:31.61 2 0 75 78M 0B 9996K 12372 1 sleeping
1588 sysmond 0.0 02:34:04 2 1 23 62M 0B 4536K 1588 1 sleeping
54245 plugin_host 0.0 00:03.88 5 0 56 51M 0B 0B 54242 54242 sleeping
554 spindump 0.0 00:36.51 2 1 164 44M 0B 33M 554 1 sleeping
20024 com.apple.GS 0.0 00:01.43 3 2 24 43M 0B 2200K 20024 1 sleeping
475 suhelperd 0.0 00:19.84 2 0 55 42M 0B 28M 475 1 sleeping
418 installd 0.0 01:21.89 2 0 69 40M 0B 12M 418 1 sleeping
57 fseventsd 0.1 13:03:20 10 0 241 39M 0B 2904K 57 1 sleeping
364 Dock 0.0 08:48.83 3 0 283 38M 0B 27M 364 1 sleeping
201 sandboxd 0.0 18:55.44 2 1 38 38M 0B 10M 201 1 sleeping
103 loginwindow 0.0 04:26.65 2 0 377 35M 0B 3400K 103 1 sleeping
897 systemstatsd 0.0 65:30.17 2 1 43 34M 0B 4928K 897 1 sleeping
367 fontd 0.0 11:35.30 2 0 77 32M 0B 5920K 367 1 sleeping
396 ScopedBookma 0.0 01:00.46 3 2 46 32M 0B 28M 396 1 sleeping
22752 cfbackd 0.4 32:18.73 9 1 84 30M 0B 0B 22752 1 sleeping
39760 Preview 0.0 00:03.75 3 0 209 29M 0B 0B 39760 1 sleeping
53 syslogd 0.0 05:33:59 4 3 186- 29M- 0B 1668K 53 1 sleeping
533 SmartDaemon 0.0 27:07.67 10 7 175 28M 128K 5192K 533 1 stuck
388 iconservices 0.0 00:08.85 2 1 66 27M 0B 157M 388 1 sleeping
7268 diskmanageme 0.0 00:40.14 888 0 8899 27M 0B 7352K 7268 1 sleeping
513 Notification 0.0 00:46.42 3 0 245 26M 0B 9852K 513 1 sleeping
83 opendirector 0.0 19:22:12 6 5 8827 26M 0B 2444K 83 1 sleeping
557 AppleSpell 0.0 03:12.61 2 0 57 26M 0B 10M 557 1 sleeping
422 com.apple.ge 0.0 01:50.41 5 0 83 25M 0B 1680K 422 1 sleeping
397 storeaccount 0.0 00:48.41 4 0 1333 21M 0B 2248K 397 1 sleeping
87 launchservic 0.0 64:26.85 3 2 306 20M 0B 5804K 87 1 sleeping
1 launchd 0.0 26:26:23 5 4 1802 20M 0B 6532K 1 0 stuck
222 taskgated 0.0 17:59:00 3 1 43 19M 0B 4528K 222 1 sleeping
54 UserEventAge 0.0 18:19.74 3 0 32605- 18M- 0B 2968K 54 1 sleeping
4527 com.apple.sp 0.0 00:13.01 2 0 48 17M 0B 7792K 4527 1 sleeping
79 coreduetd 0.0 05:40.06 2 0 95 17M 0B 4604K 79 1 sleeping
and here is the output of iostat
disk0 disk1 disk2 cpu load average
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us sy id 1m 5m 15m
7.19 152 1.07 8.10 0 0.00 8.22 0 0.00 15 50 35 1.68 1.74 1.59
Example:
Consider the following files
file_0:
first second third fourth fifth
bbb a a 2 r
file_1:
first second third fourth fifth
f o o 2 o
file_2:
first second third fourth fifth
f r e 1 e
x xxx x 2 x
file_3:
first second third fourth fifth
a a a 2 a
The expected output is
file_0 bbb a a 2 r
file_1 f o o 2 o
file_2 x xxx x 2 x
file_3 a a a 2 a
You can probably get by with a single call each to grep and sed. This should be pretty fast. Maybe even faster than a self-written C program.
cd dir_with_all_the_files
grep -rE '^([^ ]+ +){3}2 ' . |
sed -En 's/^\.\/(.*)\.paintedHaploDiversity:/\1 /p' > ../allOutputs_3.5
Assumptions made:
The header of the column to be searched isn't 2 too.
The directory contains no subdirectories. (If it does, the command may still produce correct results but will run needlessly long.)
The filenames contain no : or linebreaks.
Your grep implementation supports the non-Posix -r option (usually the case).
Further improvements if your grep implementation supports it:
Add -m1 to speed up the search.
Try grep -P (usually not supported on Mac OS) or pcregrep. PCRE is sometimes faster. With PCRE you can also try the alternative regex '^(.*? ){3}2 '.
--exclude-dir \* (note that * is quoted) excludes subdirectories, so that you can use the command even without above assumption.
If you want the output to be sorted by filenames (as you would get when iterating *.paintedHaploDiversity), run sort -t ' ' -k 1,1 -o allOutputs_3.5{,} afterwards.
You might as well set export LC_ALL=C to speed up grep, sort, and maybe even sed.
Difficult problem. Might have painted yourself into a corner there...
If even the find command takes too long, which does nothing but open, read and close every file, then the likely bottleneck is the seek time on an HDD. This is typically around 10 ms, so for 81 million files you're looking at almost 10 days, assuming a single seek per file. Because of filesystem overhead (directory accesses etc.) it might take more seeks, but if locality is good each seek might also be shorter.
If you can afford to wait this long once, I'd recommend zipping up all those files into a single file. This will take a lot of time, but after that you can process the data set more quickly.
If zipping (or otherwise copying or accessing) each individual file is out of the question, a solution might be to take an image (snapshot) of the entire filesystem and copy that onto a faster drive. SSDs have seek times around 0.1 ms, so working off an SSD you could be finished in slightly over two hours.
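The two estimates above can be reproduced with a few lines of shell. This is only a back-of-the-envelope sketch: seek_estimate is a hypothetical helper name, and it assumes exactly one seek per file, as the answer does. For 81 million files it gives 225 hours (a little over 9 days) at 10 ms and 2.25 hours at 0.1 ms.

```shell
# Hypothetical helper: total seek time, assuming one seek per file.
#   $1 = number of files, $2 = seek time in milliseconds
seek_estimate() {
  awk -v n="$1" -v ms="$2" 'BEGIN {
    seconds = n * ms / 1000
    printf "%.3f hours\n", seconds / 3600
  }'
}

seek_estimate 81000000 10     # HDD, about 10 ms per seek
seek_estimate 81000000 0.1    # SSD, about 0.1 ms per seek
```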
A more hardcore approach would be to write code that operates directly on the raw disk bytes, implementing the necessary parts of the filesystem and using large in-memory buffers to avoid disk seeks. Depending on how the files are scattered across the disk, this might give you a big speedup, but of course it's a nontrivial effort to program this.
The problem, apart from the obvious I/O load of processing a few GB of data, is that starting one or several processes 81 million times takes a long time. Even creating a command line or expanding a file glob to, say, 300MB (for f in *...) may take substantial time or exceed system and program limits.
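To get a feel for the per-process overhead on a given machine, a small comparison like the following may help. It is a sketch on throwaway data: bench_dir, per_file and batched are hypothetical names, and the 500 tiny files only mimic the layout described in the question.

```shell
# Create 500 small files shaped like the ones in the question (header + one data line).
bench_dir=$(mktemp -d)
i=1
while [ "$i" -le 500 ]; do
  printf 'h h h h h\na b c 2 d\n' > "$bench_dir/f_$i"
  i=$((i + 1))
done

# One awk process per file: 500 process start-ups.
per_file() {
  for f in "$bench_dir"/f_*; do
    awk 'FNR > 1 && $4 == 2' "$f"
  done
}

# One awk process for all files: a single start-up.
batched() {
  awk 'FNR > 1 && $4 == 2' "$bench_dir"/f_*
}
```

Both functions print the same 500 lines; running each under the shell's time (for example time per_file > /dev/null in bash) shows how much of the runtime is process creation rather than file reading.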
One solution is to write a C program to open the files and process them, or pipe their contents to other programs. But that may take a couple of days to program and debug, and maybe your intern is on vacation. But there are already programs in the Unix toolbox which do part of what you need, except that the file names are lost. We assume that all files are in a directory called bla.
Use tar to create a stream with the contents of the files, like this:
tar cf - bla | tar -xOf -
This writes the concatenated contents of the files to standard out, by default the console. Each tar (and the grep added in the next step) is started only once. The first tar finds all the files in the directory and creates an archive (which is some sort of structured concatenation) which it writes to stdout; the second tar grabs that archive, extracts the files and writes them to stdout instead of creating files in the file system, thanks to -O.
After that, start processing:
tar cf - bla | tar -xOf - | grep '^whatever is before the 2 \<2\>' > out.txt
If the presence of the filenames is a hard requirement you may be able to repeat the processing chain, but let the second tar emit the file names (-t option), and pipe it to a shell script which reads a line from the out.txt and from the tar output, combining the two and writing the combined lines to a new file.
If the printf/tail attempt is considered the fastest at this point (2 weeks? based solely on OPs comments), I'd want to eliminate the 81 million printf/tail command pairs with a smaller number of awk/substr(FILENAME) calls that work on a wildcard set that breaks processing into, say, ~10K files at a time, eg:
for bigID in {1..6}
do
# first 99 files (r=1..99); every larger r, including 9000000, is matched by the prefix loop below
awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_{1..99}.paintedHaploDiversity >> ../allOutputs
# break rest of files into chunks based on first 3 digits of suffix
for r in {100..999} # prefixes 100..899 match ~11K files each, 900..999 match ~1.1K each
do
awk 'FNR==1{simID=substr(FILENAME, 1, length(FILENAME)-22)}FNR>1{if ($4==2) {printf simID"\t"; print}}' 3.5_${bigID}_${r}*.paintedHaploDiversity >> ../allOutputs
done
done
NOTE: I'm only picking 10K as an assumption that there's some sort of performance hit for awk grabbing a bigger set of file ids; some testing of this size may find a sweet spot on the number of files awk can (quickly) handle
Also, iostat is showing 3x disks. If these are 3x physically separate disks and they're attached as separate disks (ie, not part of a RAID config), then make sure the target file (allOutputs_3.5) resides on a different disk from the source files. This should cut down on the read->write->read->write thrashing (more so on HDDs, less so on SSDs).
NOTE: This (obviously) assumes there is room on the other disk(s) to hold the target file.
I'd probably want to test this idea (read from disk #1, write to disk #2) with a small subset of files (eg, 110K), using each of the previously mentioned coding attempts, to see if there's a (relatively) large diff in timings (thus pointing at the read/write thrashing as being one bottleneck).
Any solution with a bash loop that calls one or more processes millions of times will be very slow. Also, the attempt awk '{...}' * > output resulted, for me on Linux, in: bash: /usr/bin/awk: Argument list too long.
With find and xargs
find is the tool to use, but not with -exec as in the question, because that again starts one process per file. Pipe its output to xargs instead, so that many arguments are passed to each process. You can also do the job in batches with xargs -n. In general you may hit limits of your OS, bash arguments, etc.; I have not tested with such a huge number of files.
I executed the solution below on a very old box, slower than the desktop in question, and a sample of 800K files (1% of the total in question) took 3 minutes.
find . -type f -printf "%f\n" |\
xargs awk '$4==2{ print(substr(FILENAME, 1, length(FILENAME)-22), $0) }' >> output.txt
First, you have to avoid swap usage during execution, or it will slow down dramatically; second, you will probably hit some limits, as said above. So it may need to be done in batches, e.g. run find once and save the results to a file, split the file into batches (e.g. 1M filenames each) and xargs each chunk to awk.
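The find command above relies on -printf, which is a GNU extension and may not be available with the find shipped on Mac OS X. A possible portable variant, sketched here as an alternative rather than as what was benchmarked above, uses the POSIX form -exec ... {} +, which passes many filenames to each awk invocation much like xargs does. merge_dir is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: print "simID<TAB>line" for every data line whose 4th column is 2.
#   $1 = directory holding the *.paintedHaploDiversity files
merge_dir() {
  ( cd "$1" &&
    find . -type f -name '*.paintedHaploDiversity' -exec awk '
      FNR == 1 {                                     # first line of each file
        simID = FILENAME
        sub(/^\.\//, "", simID)                      # drop the leading "./" that find adds
        sub(/\.paintedHaploDiversity$/, "", simID)   # drop the extension
        next                                         # skip the header line
      }
      $4 == 2 { print simID "\t" $0 }
    ' {} + )
}
```

A call such as merge_dir dir_with_all_the_files > allOutputs_3.5 would then write the merged file; the output order follows whatever order find returns, so sort afterwards if that matters.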
Without find, creating filenames with a loop: use xargs again
I see that you can create the filenames in a bash loop as they follow a standard pattern, which could be faster than find, but I believe this is not the bottleneck anyway. Again, you should not execute one command per argument, but feed the list of filenames to awk through xargs.
For example, create the filenames with your loop and save them to a file.
for (( i=1;i<=9;i++ )); do
for (( j=1;j<=9000000;j++ )); do
printf "file_%s_%s\n" "$i" "$j"
done
done > filenames.txt   # open the output file once instead of once per name
and feed them once to awk:
cat filenames.txt | xargs awk '{...}'
or in batches, e.g. of 1M
split -l 1000000 -d filenames.txt chunk
for f in chunk*; do cat "$f" | xargs awk '{...}' ; done
When I use bash to write a temporary test file in a Linux terminal:
echo text_content>file1
If text_content is 4096 characters long (random characters from [a-Z]),
the resulting file1 ends up using two 4K blocks and one inode.
test#instance-7:~/notes/rust$ du -csh file1
8.0K file1
8.0K total
But why did it use two 4K blocks? Isn't one 4K block enough?
If I set the length of text_content to 4095 characters, it uses only one 4K block.
Why is it using more blocks than it needs? Or am I missing something?
Here is some disk info for my Linux machine.
test#instance-7:~/notes/rust$ sudo fdisk -l /dev/sda
Disk /dev/sda: 30 GiB, 32212254720 bytes, 62914560 sectors
Disk model: PersistentDisk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 35BD657D-931E-497E-A86C-D3D7C4F6BD2A
Try this experiment:
write cat > file1 and hit Enter,
then write this and then hit Ctrl-D twice without hitting Enter first;
write cat > file2 and hit Enter,
then write this, and then hit Enter and then Ctrl-D.
finally run diff file1 file2 and ls -l file[12]
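What the experiment points at is the trailing newline: echo appends one, so 4096 characters become 4097 bytes, which no longer fit in a single 4096-byte block. A minimal sketch that shows the byte counts directly (tmp, payload and the two file names are made up for the illustration):

```shell
tmp=$(mktemp -d)

# Build a payload of exactly 4096 characters with no newline.
payload=$(printf '%4096s' '' | tr ' ' 'a')

echo "$payload"        > "$tmp/with_newline"      # echo adds "\n": 4096 + 1 bytes
printf '%s' "$payload" > "$tmp/without_newline"   # printf '%s' adds nothing

# Compare the sizes in bytes.
wc -c "$tmp/with_newline" "$tmp/without_newline"
```

Running du on the two files should then show the difference in allocated blocks on a filesystem with 4 KiB blocks, though the exact figures depend on the filesystem.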
I have a string from which I would like to extract all the ip addresses. I'm very new to bash. Please help
#!/usr/bin/env bash
IP=$(ping -c 25 x.x.x.255)
"$IP" will have the following text. Please suggest a way to get only the ip addresses like x.x.x.252, x.x.x.141 ......
PING x.x.x.255 (x.x.x.255): 56 data bytes
64 bytes from x.x.x.252: icmp_seq=0 ttl=64 time=0.111 ms
64 bytes from x.x.x.141: icmp_seq=0 ttl=255 time=2.200 ms
64 bytes from x.x.x.197: icmp_seq=0 ttl=64 time=70.087 ms
64 bytes from x.x.x.108: icmp_seq=0 ttl=64 time=70.136 ms
--- x.x.x.255 ping statistics ---
25 packets transmitted, 25 packets received, +12 duplicates, 0.0% packet loss
You could use grep's -E and -o options:
$ ips=$(ping -c 4 example.com | grep -Eo '\s([0-9]+\.){3}[0-9]+\s')
$ echo $ips
93.184.216.34 93.184.216.34 93.184.216.34 93.184.216.34
The regex matches a space \s, followed by a capturing group consisting of one or more digits [0-9]+ and a dot \. repeated three times {3}, followed by [0-9]+\s. Information about grep's options can be found in man grep.
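In the output shown in the question each address is followed by a colon rather than a space, so a pattern anchored on the "bytes from" prefix may be a better fit there. The following is a sketch under that assumption (BSD/macOS-style ping lines of the form "64 bytes from ADDRESS: ..."); reply_ips is a hypothetical helper name, and sort -u is added only to collapse the duplicate replies.

```shell
# Hypothetical helper: read ping output on stdin, print each replying address once.
reply_ips() {
  sed -n 's/^[0-9][0-9]* bytes from \([0-9.][0-9.]*\):.*/\1/p' | LC_ALL=C sort -u
}

# Example with made-up addresses in the same format as the question:
printf '%s\n' \
  'PING 10.0.0.255 (10.0.0.255): 56 data bytes' \
  '64 bytes from 10.0.0.252: icmp_seq=0 ttl=64 time=0.111 ms' \
  '64 bytes from 10.0.0.141: icmp_seq=0 ttl=255 time=2.200 ms' |
  reply_ips
```

With a live command this would be used as ips=$(ping -c 25 x.x.x.255 | reply_ips). Linux ping prints "from HOST (ADDRESS):" for resolved names, which this pattern does not handle.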
I need a script to send stdout to journald and to the console in parallel. For the journald lines I need to sanitize the messages before persisting them.
I have a dummy example to show my issue:
ping google.com | tee >( sed 's/seq/SEQ/' | systemd-cat -t 'my-ping')
When I use both sed and systemd-cat, the messages to journald are delayed and arrive only after the ping process is stopped.
Example:
$ ping google.com | tee >( sed 's/seq/SEQ/' | systemd-cat -t 'my-ping')
PING google.com (216.58.197.238) 56(84) bytes of data.
64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_seq=1 ttl=40 time=240 ms
64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_seq=2 ttl=40 time=240 ms
64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_seq=3 ttl=40 time=240 ms
64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_seq=4 ttl=40 time=240 ms
64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_seq=5 ttl=40 time=240 ms
^C
The messages are going in all at once (see timestamp):
journalctl -f | grep my-ping
Aug 17 06:03:40 hostname my-ping[30555]: PING google.com (216.58.197.238) 56(84) bytes of data.
Aug 17 06:03:40 hostname my-ping[30555]: 64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_SEQ=1 ttl=40 time=240 ms
Aug 17 06:03:40 hostname my-ping[30555]: 64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_SEQ=2 ttl=40 time=240 ms
Aug 17 06:03:40 hostname my-ping[30555]: 64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_SEQ=3 ttl=40 time=240 ms
Aug 17 06:03:40 hostname my-ping[30555]: 64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_SEQ=4 ttl=40 time=240 ms
Aug 17 06:03:40 hostname my-ping[30555]: 64 bytes from nrt13s49-in-f14.1e100.net (216.58.197.238): icmp_SEQ=5 ttl=40 time=240 ms
This behavior seems to appear only when I use both; if either one of them is left out, everything works as expected.
Do you have any pointers as to what the issue might be and how to get around it?
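One likely explanation, offered here as an assumption rather than a confirmed diagnosis, is output buffering: when sed writes to a pipe instead of a terminal it usually buffers its output in blocks, so systemd-cat sees nothing until the buffer fills or sed exits. GNU sed has a -u (--unbuffered) option that flushes after each line. A minimal sketch, assuming GNU sed; sanitize is a hypothetical wrapper name:

```shell
# Hypothetical wrapper: same substitution as in the question, flushed line by line.
sanitize() {
  sed -u 's/seq/SEQ/'
}

# Intended use in bash (requires systemd-cat, not run here):
#   ping google.com | tee >(sanitize | systemd-cat -t 'my-ping')
printf 'icmp_seq=1 ttl=40\n' | sanitize
```

This would also fit the observation that removing either command hides the problem: without systemd-cat, sed writes to the terminal and is line-buffered; without sed, there is nothing buffering in between.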
I am capturing traffic using:
tcpdump -i <interface> -nn -s0 -w ike2.pcap
Then I am reading the captured file using:
tcpdump -vvv -l -r ike2.pcap
While reading, the first packet is getting displayed as:
07:22:33.320142 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], length: 296) 10.0.0.1.isakmp > 10.0.0.2.isakmp: [udp sum ok] isakmp 2.0 msgid cookie ->: phase 1 I #34[]:
(#33)
(#34)
(#40)
(#41)
(#41)
I need to get the details of this packet. In this ISAKMP IKEv2 packet, I am interested to extract the values of 'Encryption Algorithm' and 'Integrity Algorithm' (i.e 'ENCR_3DES' and 'AUTH_HMAC_MD5_96')
I can view the values if I inspect the packet in wireshark. But, I have to do this from a shell script, so I cannot use wireshark. I need to get these values from tcpdump read command itself.
I am assuming there might be some way to print the encryption and integrity algorithms used from tcpdump read command. Could you please help to figure this out.
I couldn't find a way to do what you want with tcpdump alone, but as @user862787 said, you can use tshark like this:
# tshark -V -r somecapfile.pcap
Frame 1: 196 bytes on wire (1568 bits), 196 bytes captured (1568 bits)
Encapsulation type: Linux cooked-mode capture (25)
Arrival Time: May 10, 2017 02:00:34.811347000 CDT
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1494399634.811347000 seconds
[Time delta from previous captured frame: 0.000000000 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
[Time since reference or first frame: 0.000000000 seconds]
Frame Number: 1
Frame Length: 196 bytes (1568 bits)
Capture Length: 196 bytes (1568 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: sll:ethertype:ip:sctp:m3ua:sccp:tcap:gsm_map]
The -V does what you want!
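Since the goal is to pull the two algorithm names out from a shell script, the verbose output can be filtered further. The sketch below simply looks for tokens starting with ENCR_ or AUTH_, the prefixes of the two values named in the question; it makes no assumption about the exact layout of tshark's dissection lines, and ike_algorithms is a hypothetical helper name.

```shell
# Hypothetical helper: read verbose dissection text on stdin,
# print each distinct ENCR_* / AUTH_* token once.
ike_algorithms() {
  grep -Eo '(ENCR|AUTH)_[A-Z0-9_]+' | LC_ALL=C sort -u
}

# Intended use (requires tshark and the capture from the question, not run here):
#   tshark -V -r ike2.pcap | ike_algorithms
```

If other fields in the capture happen to contain the same prefixes, the pattern would need to be tightened to the specific lines of interest.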
I can view the values if I inspect the packet in wireshark. But, I have to do this from a shell script, so I cannot use wireshark.
But you could perhaps use TShark.
Alternatively, try tcpdump -r file with some other flags (-X, for example).