I have an awk script which splits a big file into several files by some condition. Then I run another script over each file in parallel.
awk -f script.awk -v DEST_FOLDER=tmp input.file
find tmp/ -name "*.part" | xargs -P $ALLOWED_CPUS --replace --verbose /bin/bash -c "./process.sh {}"
The question is: is there a way to run ./process.sh:
before the first script is done, because process.sh works on a file line by line (a single line is too long to be passed to xargs directly);
so that each new file's header (added in script.awk) is processed before the rest of the file;
while limiting the number of parallel processes;
GNU Parallel and inotifywait are not an option;
assume the destination folder is empty and the file names are unknown.
The purpose of the optimization is to avoid waiting until awk is done while some files are already ready to be processed.
Once you have created a file, you can pass the filename to a process' or script's input:
awk '{print name_of_created_file | "./process.sh &"}'
The & lets process.sh run in the background, in parallel with awk. Note that all filenames printed to the same command string go to a single process.sh instance, which reads them from its standard input. Piping print output to a command is actually POSIX awk; how many pipes may be open at once is implementation-specific, so check your awk's manual.
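If you would rather start one process.sh per finished part file, here is a minimal sketch of the idea. It is not the asker's script.awk (choosing the output file by $1 is a placeholder for "some condition"), it assumes the input is grouped so that each part file is written contiguously, and it puts no cap on parallelism:
awk -v DEST_FOLDER=tmp '
{
    file = DEST_FOLDER "/" $1 ".part"   # placeholder: choose output file by condition
    if (file != prev) {
        if (prev != "") finish(prev)    # previous part file is complete
        prev = file
    }
    print > file
}
END { if (prev != "") finish(prev) }
function finish(f) {
    close(f)                            # flush the finished file to disk
    system("./process.sh " f " &")      # hand it off in the background
}
' input.file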
You basically give the answer yourself: GNU Parallel + inotifywait will work.
Since you are not allowed to use inotifywait, you can make your substitute for inotifywait. If you are allowed to write your own script, you are also allowed to run GNU Parallel (as that is just a script).
So something like this:
awk -f script.awk -v DEST_FOLDER=tmp input.file &
sleep 1
declare -A size                             # last recorded size per file (bash 4+)
while [ -n "$(ls -A tmp/ 2>/dev/null)" ]; do
    for f in tmp/*.part; do
        new=$(stat -c%s "$f" 2>/dev/null) || continue   # GNU stat
        if [ "${size[$f]}" = "$new" ]; then
            echo "$f"                       # size unchanged: file is ready
            size[$f]=done                   # never print the same file twice
        else
            size[$f]=$new
        fi
    done
    sleep 1
done | parallel './process.sh {}; rm {}'
It is assumed that awk will produce some output within one second. If it takes longer, adjust the sleeps accordingly. As a bonus, parallel caps the number of simultaneous jobs (one per CPU core by default, or -j N), which covers the requirement to limit parallel processes.
I keep text files with definitions in a folder. I like to convert them to spoken word so I can listen to them. I already do this manually by running a few commands to insert some pre-processing codes into the text files and then convert the text to spoken word like so:
sed 's/\..*$/[[slnc 2000]]/' input.txt replaces everything from the first period onward with a control code
sed 's/$/[[slnc 2000]]/' input.txt inserts a control code at the end of each line
cat input.txt | say -v Alex -o input.aiff
Instead of having to retype these each time, I would like to create a Bash script that pipes the output of these commands to the final product. I want to call the script with the script name, followed by an input file argument for the text file. I want to preserve the original text file so that if I open it again, none of the control codes are actually inserted, as the only purpose of the control codes is to insert pauses in the audio file.
I've tried writing
#!/bin/bash
FILE=$1
sed 's/$/ [[slnc 2000]]/' FILE -o FILE
But I get hung up immediately as it says sed: -o: No such file or directory. Can anyone help out?
If you just want to use foo.txt to generate foo.aiff with control characters, you can do:
#!/bin/sh
for file; do
test "${file%.txt}" = "${file}" && continue
sed -e 's/\..*$/[[slnc 2000]]/' "$file" |
sed -e 's/$/[[slnc 2000]]/' |
say -v Alex -o "${file%.txt}".aiff
done
Call the script with your .txt files as arguments (eg, ./myscript *.txt) and it will generate the .aiff files. Be warned, if say overwrites files, then this will as well. You don't really need two sed invocations, and the sed that you're calling can be cleaned up, but I don't want to distract from the core issue here, so I'm leaving that as you have it.
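For what it's worth, the two passes can be folded into a single sed invocation with the same behavior (sed applies the expressions in order to each line), saving one process per file:
sed -e 's/\..*$/[[slnc 2000]]/' -e 's/$/[[slnc 2000]]/' "$file" |
say -v Alex -o "${file%.txt}".aiff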
This will:-
a} Make a list of your text files to process in the current directory, with find.
b} Apply your sed commands to each text file in the list, but only for the current use, allowing you to preserve them intact.
c} Call "say" with the edited files.
I don't have say, so I can't test that or the control codes; but as long as you have ed, the loop works. I've used it many times. I learned it as a result of exposure to FORTH, which is a language that still permits unterminated loops. I used to have problems with remembering to invoke next at the end of the script in order to start it, but I got over that by defining my words (functions) first, in FORTH style, and then always placing my single-use commands at the end.
#!/bin/sh
next() {
    [ -s stack ] && main    # stack file still has lines: process the next one
    end
}
main() {
    line=$(ed -s stack < edprint+.txt)    # print the first line of the stack
    infile=$(sed 's/\..*$/[[slnc 2000]]/' "${line}" | sed 's/$/[[slnc 2000]]/')
    say "${infile}" -v Alex -o input.aiff
    ed -s stack < edpop+.txt              # delete the first line of the stack
    next
}
end() {
rm -v ./stack
rm -v ./edprint+.txt
rm -v ./edpop+.txt
exit 0
}
find *.txt -type f > stack
cat >> edprint+.txt << EOF
1
q
EOF
cat >> edpop+.txt << EOF
1d
wq
EOF
next
I am looking for a way to run a command like smartctl on a file containing device names like /dev/sda (one per line). The Ansible playbook should be able to read each line and pass it as an argument to the command.
Are you looking for something like this?
<file_with_smartctl_args xargs -n1 smartctl
Replace file_with_smartctl_args with the file (complete path!) that contains the names of the files (arguments) you want to pass to smartctl. This will run "smartctl" one time for EACH of the lines (arguments) in the file.
Example:
If the file /usr/me/smartctl_args contains the following text:
file1
file2
file3
The command:
</usr/me/smartctl_args xargs -n1 smartctl
Will run smartctl 3 times (since the file has 3 lines in it), like this:
smartctl file1
smartctl file2
smartctl file3
The initial < tells the Unix shell that your "standard input" is going to come from the filename that follows (/usr/me/smartctl_args). Then, xargs will convert the "standard input" to command arguments, the -n1 option causes xargs to execute the command (smartctl) once for each argument.
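In the asker's case the file holds device names rather than plain file names, and smartctl typically wants an option before the device argument. A sketch (/usr/me/devices.txt is a placeholder, and -H, the SMART health check, stands in for whatever options you need):
</usr/me/devices.txt xargs -n1 smartctl -H
This runs smartctl -H /dev/sda, smartctl -H /dev/sdb, and so on, one device per invocation.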
I have been trying to run this command (that I have run before in a different directory), and everything I've read on the message boards has not solved my unknown issue.
Of note: 1) the files exist in this directory; 2) I have proper permissions to move these files around; 3) I have run this exact line of code before and it has worked; 4) I tried listing files with and without '' to capture all the files (see below); 5) I also tried to list each file as 'Sample1', but that did not work.
xargs -a [filename.txt] mv -t [new-directory]
I have file beginnings (~5 files for each beginning), and I want to move all the files associated with that beginning.
Example: Sample1.bam, Sample1.sorted.bam, etc.
The lines in the file are listed as such:
Sample1*
Sample2*
Sample3* ...etc.
What am I doing incorrectly and how can I fix it?
TIA!
When you execute a command using xargs, arguments are passed directly to the called program (mv in your case). Wildcard patterns in the input are not expanded: 'Sample1*' is passed as-is to mv, which issues an error message about not having a file named 'Sample1*'.
To get file name expansion, you want to use the shell. One way to handle this situation is
xargs -a FILENAME.TXT -I__ sh -c "mv -t NEW-FOLDER -- __"
Security note: the code provides some protection against command-line injection (e.g., a file name starting with '-'), but other attacks are still possible. A safer version is
cat FILENAME.txt | grep '^[A-Za-z0-9][A-Za-z0-9._*-]*$' | xargs -I__ sh -c "mv -t NEW-FOLDER -- __"
which limits the input to names made of alphanumerics plus . _ * and - (the * must be allowed here so the question's 'Sample1*' lines survive the filter). The grep pattern can be extended as needed.
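To see why the sh -c layer matters, compare the two directly (dest is a placeholder directory):
printf 'Sample1*\n' | xargs mv -t dest --                    # mv sees the literal string Sample1*
printf 'Sample1*\n' | xargs -I__ sh -c 'mv -t dest -- __'    # the inner shell expands the glob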
With GNU Parallel you would do something like:
cat FILENAME.txt | parallel mv {} NEW-FOLDER
One of the benefits of GNU Parallel is that it deals correctly with file names like:
My brother's 12" records cost > $1000.txt
I used this awk command below to create a new UUID column in a table in my existing .dat files.
$ awk '("uuidgen" | getline uuid) > 0 {print uuid "|" $0} {close("uuidgen")}' $filename > ${filename}.pk
The problem is that my .dat files are pretty big (like 50-60 GB) and this awk command takes hours even on small data files (like 15MB).
Is there any way to increase the speed of this awk command?
I wonder if you might save time by not having awk open and close uuidgen every line.
$ function regen() { while true; do uuidgen; done; }
$ coproc regen
$ awk -v f="$filename" '!(getline line < f){exit} {print $0,line}' OFS="|" < /dev/fd/${COPROC[0]} > "$filename".pk
This has awk reading your "real" filename from a variable, and the uuid from stdin, because the call to uuidgen is handled by a bash "coprocess". The funky bit around the getline is to tell awk to quit once it runs out of input from $filename. Also, note that awk is taking input from input redirection instead of reading the file directly. This is important; the file descriptor at /dev/fd/## is a bash thing, and awk can't open it.
This should theoretically save you time doing unnecessary system calls to open, run and close the uuidgen binary. On the other hand, the coprocess is doing almost the same thing anyway by running uuidgen in a loop. Perhaps you'll see some improvement in an SMP environment. I don't have a 50GB text file handy for benchmarking. I'd love to hear your results.
Note that coproc is a feature that was introduced with bash version 4. And use of /dev/fd/* requires that bash is compiled with file descriptor support. On my system, it also means I have to make sure fdescfs(5) is mounted.
I just noticed the following on my system (FreeBSD 11):
$ /bin/uuidgen -
usage: uuidgen [-1] [-n count] [-o filename]
If your uuidgen also has a -n option, then adding it to your regen() function with ANY value might be a useful optimization, to reduce the number of times the command needs to be reopened. For example:
$ function regen() { while true; do uuidgen -n 100; done; }
This would result in uuidgen being called only once every 100 lines of input, rather than for every line.
And if you're running Linux, depending on how you're set up, you may have an alternate source for UUIDs. Note:
$ awk -v f=/proc/sys/kernel/random/uuid '{getline u<f; close(f); print u,$0}' OFS="|" "$filename" > "$filename".pk
This doesn't require the bash coproc, it just has awk read a random uuid directly from a Linux kernel function that provides them. You're still closing the file handle for every line of input, but at least you don't have to exec the uuidgen binary.
YMMV. I don't know what OS you're running, so I don't know what's likely to work for you.
Your script is calling shell to call awk to call shell to call uuidgen. Awk is a tool for manipulating text; it's not a shell (an environment to call other tools from), so don't do that. Just call uuidgen from shell:
$ cat file
foo .*
bar stuff
here
$ xargs -d $'\n' -n 1 printf '%s|%s\n' "$(uuidgen)" < file
5662f3bd-7818-4da8-9e3a-f5636b174e94|foo .*
5662f3bd-7818-4da8-9e3a-f5636b174e94|bar stuff
5662f3bd-7818-4da8-9e3a-f5636b174e94|here
Note, though, that the command substitution runs uuidgen exactly once, before xargs starts, so every line gets the same UUID (visible in the output above). If each line needs its own UUID, you are back to one subprocess per line.
I'm just guessing that the real problem here is that you're running a sub-process for each line. You could read your file explicitly line by line and read output from a batch-uuidgen line by line, and thus only have a single subprocess to handle at once. Unfortunately, uuidgen doesn't work that way.
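Where a batch-capable uuidgen is available (for example the FreeBSD flavor with -n, shown earlier), a single subprocess plus paste would do it. A sketch under that assumption:
n=$(wc -l < "$filename")                          # one UUID per input line
uuidgen -n "$n" | paste -d'|' - "$filename" > "$filename".pk
paste reads the UUIDs as the first column from stdin and the original lines as the second, joined by "|".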
Maybe another solution?
perl -MData::UUID -ple 'BEGIN{ $ug = Data::UUID->new } $_ = lc($ug->to_string($ug->create)) . " | " . $_' $filename > ${filename}.pk
Might this be faster?
I am trying to write a bash script that goes through a file line by line (ignoring the header), extracts a file name from the beginning of each line, and then finds a file by this name in one directory and moves it to another directory. I will be processing hundreds of these files in a loop and moving over a million individual files. A sample of the file is:
ImageFileName Left_Edge_Longitude Right_Edge_Longitude Top_Edge_Latitude Bottom_Edge_Latitude
21088_82092.jpg: -122.08007812500000 -122.07733154296875 41.33763821961143 41.33557596965434
21088_82093.jpg: -122.08007812500000 -122.07733154296875 41.33970040427444 41.33763821961143
21088_82094.jpg: -122.08007812500000 -122.07733154296875 41.34176252364274 41.33970040427444
I would like to ignore the first line and then grab 21088_82092.jpg as a variable. File names may not always be the same length, but they will always have the format digits_digits.jpg
Any help for an efficient approach is much appreciated.
This should get you started:
$ tail -n +2 input | cut -f 1 -d: | while IFS= read -r file; do test -f "$dir/$file" && mv -v "$dir/$file" "$destination"; done
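The same one-liner spelled out, with the directory placeholders made explicit (src and dest are assumed names, not from the question):
src=/path/to/images        # where the .jpg files currently live (assumed)
dest=/path/to/sorted       # where they should be moved (assumed)
tail -n +2 input | cut -f 1 -d: | while IFS= read -r file; do
    [ -f "$src/$file" ] && mv -v "$src/$file" "$dest/"
done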
You can construct a script that will do something like this, then simply run the script. The following command will give you a script which will copy the files from one place to another, but you can make the script generation more complex simply by changing the awk output:
pax:~$ cat qq.in
ImageFileName Left_Edge_Longitude Right_Edge_Longitude
21088_82092.jpg: -122.08007812500000 -122.07733154296875
21088_82093.jpg: -122.08007812500000 -122.07733154296875
21088_82094.jpg: -122.08007812500000 -122.07733154296875
pax:~$ awk -F: '/^[0-9]+_[0-9]+\.jpg:/ {
    printf "cp /srcdir/%s /dstdir\n",$1
}' qq.in
cp /srcdir/21088_82092.jpg /dstdir
cp /srcdir/21088_82093.jpg /dstdir
cp /srcdir/21088_82094.jpg /dstdir
You capture the output of that awk command (the last three lines above) to another file, and then that file is your script for doing the actual copies.
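In other words, something like this (doit.sh is just an assumed name for the generated file):
awk -F: '/^[0-9]+_[0-9]+\.jpg:/ { printf "cp /srcdir/%s /dstdir\n", $1 }' qq.in > doit.sh
sh doit.sh    # inspect doit.sh first if you like, then run the actual copies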