How to awk through multiple files? - bash

I have hundreds of .dat files with data inside, and I want to awk them with the command
awk 'NR % 513 == 99' data.0003.127.dat > data.0003.127.Ma.dat
I tried to write a script file like
for i in {1 ... 9}; do
i=i*3
datafile = sprintf("data.%04d.127.dat",i)
outfile = sprintf("data.%04d.127.Ma.dat",i)
awk 'NR % 513 == 99' datafile > outfile
done
I only need the 0003 0006 0009 ... files, but the above script doesn't work. The error says
bash: ./Ma_awk.sh: line 3: syntax error near unexpected token `('
bash: ./Ma_awk.sh: line 3: `datafile = sprintf("data.%04d.127.dat",i)'
What should I do next? I'm using Ubuntu 14.04.

In bash (since v4) you can write a sequence expression with an increment:
$ echo {3..27..3}
3 6 9 12 15 18 21 24 27
You can also include leading zeros, which will be preserved:
$ echo {0003..0027..3}
0003 0006 0009 0012 0015 0018 0021 0024 0027
So you could use the following:
for i in {0003..0027..3}; do
awk 'NR % 513 == 99' "data.$i.127.dat" > "data.$i.127.Ma.dat"
done

There are multiple issues with your code, all of them bash syntax errors:
Don't use spaces around variable assignments
Brace expansion looks like {1..9}
Capturing stdout into a variable is done with the var=$(...) notation
Reference variables with the dollar sign $
Especially for filenames, use double quotes " to ensure each argument is treated as a single word
I have not checked the actual validity of your awk program, but fixing the bash syntax errors would look something like the following:
#!/usr/bin/env bash
for i in {1..9}; do
datafile=$(printf "data.%04d.127.dat" $i)
outfile=$(printf "data.%04d.127.Ma.dat" $i)
awk 'NR % 513 == 99' "$datafile" > "$outfile"
done
This does not take care of the correct iteration bounds with an increment of three, but since you have not specified an upper bound, I will leave that as an exercise for you.
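For illustration, a sketch that adds the increment-of-three bounds, assuming the same upper bound of 27 used in the sequence-expression answer above:
#!/usr/bin/env bash
for ((i = 3; i <= 27; i += 3)); do               # assumed upper bound; adjust to your file count
  datafile=$(printf "data.%04d.127.dat" "$i")    # e.g. data.0003.127.dat
  outfile=$(printf "data.%04d.127.Ma.dat" "$i")
  awk 'NR % 513 == 99' "$datafile" > "$outfile"
done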

Related

read file line by line and sum each line individually

I'm trying to make a script that creates a file, say file01.txt, and writes a number on each line:
001
002
...
998
999
then I want to read the file line by line, sum the digits of each line, and say whether the sum is even or odd.
sum each line like 0+0+1 = 1 which is odd
9+9+8 = 26 so even
001 odd
002 even
..
998 even
999 odd
I tried
while IFS=read -r line; do sum+=line >> file02.txt; done <file01.txt
but that sums the whole file, not each line.
You can do this fairly easily in bash itself, using built-in parameter expansions to trim the leading zeros from each line before summing the digits for the odd/even test.
When reading from a file (either a named file, or stdin by default), you can use parameter expansion with a default value, so the first argument (positional parameter) is used as the filename if given, and stdin is read otherwise, e.g.
#!/bin/bash
infile="${1:-/dev/stdin}" ## read from file provided as $1, or stdin
You then read from infile in your while loop, e.g.
while read -r line; do ## loop reading each line
...
done < "$infile"
To trim the leading zeros, first obtain the substring of leading zeros by trimming everything from the first non-zero digit rightward, e.g.
leading="${line%%[1-9]*}" ## get leading 0's
Now, using the same type of parameter expansion with # instead of %%, trim the leading-zeros substring from the front of line, saving the resulting number in value, e.g.
value="${line#$leading}" ## trim from front
Now zero your sum and loop over the digits in value to obtain the sum of digits:
for ((i=0;i<${#value};i++)); do ## loop summing digits
sum=$((sum + ${value:$i:1}))
done
All that remains is your even / odd test. Putting it all together in a short example script that intentionally outputs the sum of digits in addition to your wanted "odd" / "even" output, you could do:
#!/bin/bash
infile="${1:-/dev/stdin}" ## read from file provided as $1, or stdin
while read -r line; do ## read each line
[ "$line" -eq "$line" 2>/dev/null ] || continue ## validate integer
leading="${line%%[1-9]*}" ## get leading 0's
value="${line#$leading}" ## trim from front
sum=0 ## zero sum
for ((i=0;i<${#value};i++)); do ## loop summing digits
sum=$((sum + ${value:$i:1}))
done
printf "%s (sum=%d) - " "$line" "$sum" ## output line w/sum
## (temporary output)
if ((sum % 2 == 0)); then ## check odd / even
echo "even"
else
echo "odd"
fi
done < "$infile"
(note: you can actually loop over the digits in line and skip removing the leading-zeros substring. The removal ensures that if the whole value is ever used in arithmetic it isn't interpreted as an octal value -- up to you)
Example Use/Output
Using a quick process substitution to provide input of 001 - 020 on stdin you could do:
$ ./sumdigitsoddeven.sh < <(printf "%03d\n" {1..20})
001 (sum=1) - odd
002 (sum=2) - even
003 (sum=3) - odd
004 (sum=4) - even
005 (sum=5) - odd
006 (sum=6) - even
007 (sum=7) - odd
008 (sum=8) - even
009 (sum=9) - odd
010 (sum=1) - odd
011 (sum=2) - even
012 (sum=3) - odd
013 (sum=4) - even
014 (sum=5) - odd
015 (sum=6) - even
016 (sum=7) - odd
017 (sum=8) - even
018 (sum=9) - odd
019 (sum=10) - even
020 (sum=2) - even
You can simply remove the output of "(sum=X)" when you have confirmed it operates as you expect and redirect the output to your new file. Let me know if I understood your question properly and if you have further questions.
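For example, once the temporary sum output is removed, writing the results to the new file named in the question is just a redirection:
./sumdigitsoddeven.sh file01.txt > file02.txt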
Would you please try the bash version:
parity=("even" "odd")
while IFS= read -r line; do
mapfile -t ary < <(fold -w1 <<< "$line")
sum=0
for i in "${ary[#]}"; do
(( sum += i ))
done
echo "$line" "${parity[sum % 2]}"
done < file01.txt > file02.txt
fold -w1 <<< "$line" breaks the string $line into lines of one character each
(one digit per line).
mapfile assigns the lines fed by the fold command to the elements of the array ary.
Please note that the bash script is not time-efficient and is not suitable
for large inputs.
With GNU awk:
awk -vFS='' '{sum=0; for(i=1;i<=NF;i++) sum+=$i;
print $0, sum%2 ? "odd" : "even"}' file01.txt
The FS awk variable defines the field separator. If it is set to the empty string (this is what the -vFS='' option does) then each character is a separate field.
The rest is trivial: the block between curly braces is executed for each line of the input. It computes the sum of the fields with a for loop (NF is another awk variable; its value is the number of fields of the current record), and then prints the original line ($0) followed by the string even if the sum is even, else odd.
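Note that splitting into one-character fields via an empty FS is a GNU awk behavior (POSIX leaves an empty FS unspecified); if you need this on other awks, a sketch using substr instead:
awk '{ sum = 0
  for (i = 1; i <= length($0); i++)   # walk the line one character at a time
    sum += substr($0, i, 1)           # each digit coerces to a number
  print $0, (sum % 2 ? "odd" : "even")
}' file01.txt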
pure awk:
BEGIN {
  for (i=1; i<=999; i++) {
    printf ("%03d\n", i) > ARGV[1]   # generate the 001..999 list into the file given as argument
  }
  close(ARGV[1])                     # flush the file so the main loop can read it back
  ARGC = 2                           # treat only ARGV[1] as an input file
  FS = ""                            # one character (digit) per field
  result[0] = "even"
  result[1] = "odd"
}
{
  printf("%s: %s\n", $0, result[($1+$2+$3) % 2])   # sum the three digits; the parity indexes result[]
}
Processing a file line by line, and doing math, is a perfect task for awk.
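Assuming the program is saved as solution.awk, it is invoked with the path for the generated number list as its single argument, as in the timing comparison below:
awk -f solution.awk numlist-awk > result-awk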
pure bash:
set -e
printf '%03d\n' {1..999} > "${1:?no path provided}"
result=(even odd)
mapfile -t num_list < "$1"
for i in "${num_list[#]}"; do
echo $i: ${result[(${i:0:1} + ${i:1:1} + ${i:2:1}) % 2]}
done
A similar method can be applied in bash, but it's slower.
comparison:
bash is about 10x slower.
$ cd ./tmp.Kb5ug7tQTi
$ bash -c 'time awk -f ../solution.awk numlist-awk > result-awk'
real 0m0.108s
user 0m0.102s
sys 0m0.000s
$ bash -c 'time bash ../solution.bash numlist-bash > result-bash'
real 0m0.931s
user 0m0.929s
sys 0m0.000s
$ diff --report-identical result*
Files result-awk and result-bash are identical
$ diff --report-identical numlist*
Files numlist-awk and numlist-bash are identical
$ head -n 5 *
==> numlist-awk <==
001
002
003
004
005
==> numlist-bash <==
001
002
003
004
005
==> result-awk <==
001: odd
002: even
003: odd
004: even
005: odd
==> result-bash <==
001: odd
002: even
003: odd
004: even
005: odd
read is a bottleneck in a while IFS= read -r line loop. More info in this answer.
mapfile (combined with for loop) can be slightly faster, but still slow (it also copies all the data to an array first).
Both solutions create a number list in a new file (which was in the question), and print the odd/even results to stdout. The path for the file is given as a single argument.
In awk, you can set the field separator to empty (FS="") to process individual characters.
In bash it can be done with substring expansion (${var:index:length}).
Modulo 2 (number % 2) to get odd or even.

Add variable content to a new column in file

I don't have much experience using UNIX and I wonder how to do this:
I have a bash variable with this content:
82 195 9 53
Current file looks like:
A
B
C
D
I want to add a new column to the file with those numbers, like this:
A 82
B 195
C 9
D 53
Hope you can help me. Thanks in advance.
You can simply use paste with a space as the delimiter, e.g. with your example file content in file:
var="82 195 9 53"
paste -d ' ' file <(printf "%s\n" $var)
(note: $var is used unquoted in the process substitution so that word splitting gives printf one number per argument)
Result
A 82
B 195
C 9
D 53
Note for a general POSIX shell solution, you would simply pipe the output of printf to paste instead of using the bash-only process substitution, e.g.
printf "%s\n" $var | paste -d ' ' file -
With bash and an array:
numbers="82 195 9 53"
array=($numbers)
declare -i c=0 # declare with integer flag
while read -r line; do
echo "$line ${array[$c]}"
c=c+1
done < file
Output:
A 82
B 195
C 9
D 53
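Note that the integer attribute set by declare -i is what lets the plain assignment c=c+1 be evaluated arithmetically; without it, c would become the literal string "c+1", and you would write ((c++)) or c=$((c+1)) instead.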
One idea using awk:
x='82 195 9 53'
awk -v x="${x}" 'BEGIN { split(x,arr) } { print $0,arr[FNR] }' file.txt
Where:
-v x="${x}" - pass OS variable "${x}" is as awk variable x
split(x,arr) - split awk variable x into array arr[] (default delimiter is space); this will give us arr[1]=82, arr[2]=195, arr[3]=9 and arr[4]=53
This generates:
A 82
B 195
C 9
D 53
The question has been tagged with windows-subsystem-for-linux.
If the input file has Windows/DOS line endings (\r\n) the proposed awk solution may generate incorrect results, e.g.:
82 # each line
195 # appears
9 # to begin
53 # with a space
In this scenario the OP has a couple of options:
before calling awk, run dos2unix file.txt to convert to unix line endings (\n) or ...
change the awk/BEGIN block to BEGIN { RS="\r\n"; split(x,arr) }
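With the second option, the complete call would then look like:
awk -v x="${x}" 'BEGIN { RS="\r\n"; split(x,arr) } { print $0,arr[FNR] }' file.txt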
With mapfile, aka readarray, which is a bash 4+ feature.
#!/usr/bin/env bash
##: The variable with numbers
variable='82 195 9 53'
##: Save the variable into an array named numbers
mapfile -t numbers <<< "${variable// /$'\n'}"
##: Save the content of the file into an array named file_content
mapfile -t file_content < file.txt
##: Loop through the indices of both the arrays and print them side-by-side
for i in "${!file_content[#]}"; do
printf '%s %d\n' "${file_content[i]}" "${numbers[i]}"
done

Bash compute the letter suffix from the split command (i.e. integer into base 26 with letters)

The split command produces by default a file suffix of the form "aa" "ab" ... "by" "bz"...
However in a script, I need to recover this suffix, starting from the file number as an integer (without globbing).
I wrote the following code, but maybe bash wizards here have a more concise solution?
alph="abcdefghijklmnopqrstuvwxyz"
for j in {0..100}; do
# Convert j to the split suffix (aa ab ac ...)
first=$(( j / 26 ))
sec=$(( j % 26 ))
echo "${alph:$first:1}${alph:$sec:1}"
done
Alternatively, I could use bc with the obase variable, but it only outputs one digit group when j<26.
bc <<< 'obase=26; 5'
# 05
bc <<< 'obase=26; 31'
# 01 05
Use this Perl one-liner and specify the file numbers (0-indexed) as arguments, for example:
perl -le 'print for ("aa".."zz")[@ARGV]' 0 25 26
Output:
aa
az
ba
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
@ARGV : array of the command-line arguments.
Off the top of my head, relying on 97 being ASCII a:
printf "\x$(printf %x $((97+j/26)))\x$(printf %x $((97+j%26)))\n"
printf "\\$(printf %o $((97+j/26)))\\$(printf %o $((97+j%26)))\n"
awk "BEGIN{ printf \"%c%c\\n\", $((97+j/26)), $((97+j%26))}" <&-
printf %x $((97+j/26)) $((97+j%26)) | xxd -r -p
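All four variants compute the two character codes 97+j/26 and 97+j%26 and convert them to letters: via printf hex escapes, printf octal escapes, awk's %c format, and xxd -r -p reversing a plain hex dump, respectively.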
You could also just write without temporary variables:
echo "${alph:j/26:1}${alph:j%26:1}"
In my use case, I do want to generate the full list.
awk should be fast:
awk 'BEGIN{ for (i=0;i<=100;++i) printf "%c%c\n", 97+i/26, 97+i%26}' <&-

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy and I'm wondering if the script can be trimmed down a bit.
The input file consists of pairs of lines -- names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
another awk
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
for the empty line record delimiter
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
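Setting RS to the empty string puts awk in paragraph mode: each blank-line-separated block becomes one record whose fields are the name ($1) and the number ($2), so stray blank lines no longer matter; sub(/-.*/,"",$1) then trims the name down to its prefix.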
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
[[ $line ]] || continue # skip blank lines
[[ -z $name ]] && { name=$line; continue; } # first non-blank line becomes name
number=$line # second one becomes number
if (( number >= 80 && number < 200 )); then
name=${name%%-*} # prune everything after first "-"
printf '%s %s\n' "$name" "$number" # emit our output
fi
name=; number= # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
perl -nle's/-.*//; $n=<>; print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
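Here -n wraps the code in a per-line read loop, s/-.*// trims the current name line down to its prefix, $n=<> pulls in the paired number line, and the pair is printed only when the number is in range.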
With GNU sed
sed -E '
N                                # append the next line to the pattern space
/\n[8-9][0-9]$/bA                # second line is 80-99: branch to :A and keep the pair
/\n1[0-9]{2}$/!d                 # otherwise delete the pair unless it is 100-199
:A
s/([^-]*).*\n([0-9]+$)/\1 \2/    # join the name prefix and the number on one line
' infile

Using awk with Operations on Variables

I'm trying to write a Bash script that reads files with several columns of data and multiplies each value in the second column by the corresponding value in the third column, adding the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84), etc.
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
set -- $(awk '/genome/ {print $2,$3}' $file.hist)
var1=$1
var2=$2
var3=$((var1*var2))
total=$((total+var3))
echo var1 \= $var1
echo var2 \= $var2
echo var3 \= $var3
echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!
That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special pattern (a GNU awk extension) that means "run this next block at the end of each file, not at each line."
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.
I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
total=0
while read -r _ x y _; do
((total += x * y))
done < "$f"
echo "$total"
done
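For the three genome lines in the question's example, either version computes 1*30 + 2*27 + 3*83 and prints 333.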
