Bash First Element in List Recognition - macos

I'm very new to Bash so I'm sorry if this question is actually very simple. I am dealing with a text file that contains many vertical lists of numbers 2-32 counting up by 2, and each number has a line of other text following it. The problem is that some of the lists are missing numbers. Any pointers for a code that could go through and check to see if each number is there, and if not add a line and put the number in.
One list might look like:
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
12 dlsagflakdjshgflksdhflksdahfl
All the way down to 32. How would I in this case make it so the 10 is recognized as missing and then added in above the 12? Thanks!

Here's one awk-based solution (formatted for readability, not necessarily how you would type it):
awk ' { value[0 + $1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i]
}' input.txt
It basically just records the existing lines in a key/value pair (associative array), then at the end, prints all the records you care about, along with the (possibly empty) value saved earlier.
Note: if the first column needs to be seen as a string instead of an integer, this variant should work:
awk ' { value[$1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i ""]
}' input.txt

You can use awk to figure out the missing line and add it back:
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
Testing:
cat file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
20 daflgkdsakfjhasdlkjhfasdjkhf
24 dlsagflakdjshgflksdhflksdahfl
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8
10
12
14
16
18
20 daflgkdsakfjhasdlkjhfasdjkhf
22
24 dlsagflakdjshgflksdhflksdahfl
26
28
30
32

Related

Awk Matrix multiplication

I'm trying to write an AWK command that allows me to perform matrix multiplication between two tab separated files.
example:
cat m1
1 2 3 4
5 6 7 8
cat m2
1 2
3 4
5 6
7 8
desired output:
50 60
114 140
without any validation of the input files for the sizes.
it will be easier to break into two scripts, one for transposing the second matrix and one to create a dot product of vectors. Also to simply awk code, you can resort to join.
$ awk '{m=NF/2; for(i=1;i<=m;i++) sum[NR] += $i*$(i+m)}
END {for(i=1;i<=NR;i++)
printf "%s", sum[i] (i==sqrt(NR)?ORS:OFS);
print ""}' <(join -j99 m1 <(transpose m2))
where transpose function is defined as
$ function transpose() { awk '{for(j=1;j<=NF;j++) a[NR,j]=$j}
END {for(i=1;i<=NF;i++)
for(j=1;j<=NR;j++)
printf "%s",a[j,i] (j==NR?ORS:OFS)}' "$1"; }
I would suggest going with GNU Octave:
octave --eval 'load("m1"); load("m2"); m1*m2'
Output:
ans =
50 60
114 140
However, assuming well-formatted files you can do it like this with GNU awk:
matrix-mult.awk
ARGIND == 1 {
for(i=1; i<=NF; i++)
m1[FNR][i] = $i
m1_width = NF
m1_height = FNR
}
ARGIND == 2 {
for(i=1; i<=NF; i++)
m2[FNR][i] = $i
m2_width = NF
m2_height = FNR
}
END {
if(m1_width != m2_height) {
print "Matrices are incompatible, unable to multiply!"
exit 1
}
for(i=1; i<=m1_height; i++) {
for(j=1; j<=m2_width; j++) {
for(k=1; k<=m1_width; k++)
sum += m1[i][k] * m2[k][j]
printf sum OFS; sum=0
}
printf ORS
}
}
Run it like this:
awk -f matrix-mult.awk m1 m2
Output:
50 60
114 140
If you process the second matrix before the first matrix, then you don't have to transpose the second matrix or to store both matrices in an array:
awk 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;w=NF;next}{for(i=1;i<=w;i++){s=0;for(j=1;j<=NF;j++)s+=$j*a[j,i];printf"%s"(i==w?RS:FS),s}}' m2 m1
When I replaced multidimensional arrays with arrays of arrays by replacing a[NR,i] with a[NR][i] and a[j,i] with a[j][i], it made the code about twice as fast in gawk. But arrays of arrays are not supported by nawk, which is /usr/bin/awk on macOS.
Or another option is to use R:
Rscript -e 'as.matrix(read.table("m1"))%*%as.matrix(read.table("m2"))'
Or this gets the names of the input files as command line arguments and prints the result without column names or row names:
Rscript -e 'write.table(Reduce(`%*%`,lapply(commandArgs(T),function(x)as.matrix(read.table(x)))),col.names=F,row.names=F)' m1 m2

Average of first ten numbers of text file using bash

I have a file of two columns. The first column is dates and the second contains a corresponding number. The two commas are separated by a column. I want to take the average of the first three numbers and print it to a new file. Then do the same for the 2nd-4th number. Then 3rd-5th and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay#db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, '{
x = $2;
i = NR % n;
ma += (x - q[i]) / n;
q[i] = x;
if(NR>=n)print ma;
}' file.txt
2
2
4
OR below one useful for plotting and keeping reference axis (in your case date) at center of average point
Script
akshay#db-3325:/tmp$ cat avg.awk
BEGIN {
m=int((n+1)/2)
}
{L[NR]=$2; sum+=$2}
NR>=m {d[++i]=$1}
NR>n {sum-=L[NR-n]}
NR>=n{
a[++k]=sum/n
}
END {
for (j=1; j<=k; j++)
print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
Add a little bit math tricks here, set $2 to a[NR%3] for each record. So the value in each element would be updated cyclically. And the sum of a[0], a[1], a[2] would be the sum of past 3 numbers.
updated based on the changes made due to the helpful feedback from Ed Morton
here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility in it but you can easily figure out how to extend it.
To run save it into a file and execute it as an awk script either with a shebang line or by calling awk -f
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
Line 1: Match all lines, "/" is the match operator and in this case we have an empty match which means "do this thing on every line". Line 3: Use the Record Number (NR) as the key and store the value from column 2 Line 4: If we have 3 or more values read from the file Line 5: Do the maths and print as an integer BEGIN block: Change the Field Separator to a comma ",".

Conditional replacement of field value in csv file

I am converting a lot of CSV files with bash scripts. They all have the same structure and the same header names. The values in the columns are variable of course. Col4 is always an integer.
Source file:
Col1;Col2;Col3;Col4
Name1;Street1;City1;2
Name2;Street2;City2;12
Name3;Street3;City3;15
Name4;Street4;City4;10
Name5;Street5;City5;3
Now when Col4 contains a certain value, for example "10", the value has to be changed in "10 pcs" and the complete line has to be duplicated.
For every 5 pcs one line.
So you could say that the number of duplicates is the value of Col4 divided by 5 and then rounded up.
So if Col4 = 10 I need 2 duplicates and if Col4 = 12, I need 3 duplicates.
Result file:
Col1;Col2;Col3;Col4
Name1;Street1;City1;2
Name2;Street2;City2;... of 12
Name2;Street2;City2;... of 12
Name2;Street2;City2;... of 12
Name3;Street3;City3;... of 15
Name3;Street3;City3;... of 15
Name3;Street3;City3;... of 15
Name4;Street4;City4;... of 10
Name4;Street4;City4;... of 10
Name5;Street5;City5;3
Can anyone help me to put this in a script. Something with bash, sed, awk. These are the languages I'm familiar with. Although I'm interested in other solutions too.
Here is the awk code assuming that the input is in a file called /tmp/input
awk -F\; '$4 < 5 {print}; $4 > 5 {for (i = 0; i < ($4/5); i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}' /tmp/input
Explanation:
There are two rules.
First rule prints any rows where the $4 is less than 5. This will also print the header
$4 < 5 {print}
The second rule print if $4 is greater than 5. The loop runs $4/5 times:
$4 > 5 {for (i=0; i< ($4/5); i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}
Output:
Col1;Col2;Col3;Col4
Name1;Street1;City1;2
Name2;Street2;City2;...of 12
Name2;Street2;City2;...of 12
Name2;Street2;City2;...of 12
Name3;Street3;City3;...of 15
Name3;Street3;City3;...of 15
Name3;Street3;City3;...of 15
Name4;Street4;City4;...of 10
Name4;Street4;City4;...of 10
Name5;Street5;City5;3
The code does not handle the use case where $4 == 5. You can handle that by adding a third rule. I did not added that. But I think you got the idea.
Thanks Jay! This was just what I needed.
Here is the final awk code I'm using now:
awk -F\; '$4 == "Col4" {print}; $4 < 5 {print}; $4 == 5 {print}; $4 > 5 {for (i = 0; i < ($4/5); i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}' /tmp/input
I added the rule below to print the header, because it wasn't printed
$4 == "Col4" {print}
I added the rule below to print the lines where the value is equal to 5
$4 == 5 {print}

Compare values of each records in field 1 to find min and max values AWK

I am new to text preprocessing and AWK language.
I am trying to loop through each record in a given field(field1) and find the max and min of values and store it in a variable.
Algorithm :
1) Set Min = 0 and Max = 0
2) Loop through $1(field 1)
3) Compare FNR of the field 1 and set Max and Min
4) Finally print Max and Min
this is what I tried :
BEGIN{max = 0; min = 0; NF = 58}
{
for(i = 0; i < NF-57; i++)
{
for(j =0; j < NR; j++)
{
min = (min < $j) ? min : $j
max = (max > $j) ? max : $j
}
}
}
END{print max, min}
#Dataset
f1 f2 f3 f4 .... f58
0.3 3.3 0.5 3.6
0.9 4.7 2.5 1.6
0.2 2.7 6.3 9.3
0.5 3.6 0.9 2.7
0.7 1.6 8.9 4.7
Here, f1,f2,..,f58 are the fields or columns in Dataset.
I need to loop through column one(f1) and find Min-Max.
Output Required:
Min = 0.2
Max = 0.9
What I get as a result:
Min = ''(I dont get any result)
Max = 9.3(I get max of all the fields instead of field1)
This is for learning purpose so I asked for one column So that I can try on my own for multiple columns
These is what I have:
This for loop would only loop 4 times as there r only four fields. Will the code inside the for loop execute for each record that is, for 5 times?
for(i = 0; i < NF; i++)
{
if (min[i]=="") min[i]=$i
if (max[i]=="") max[i]=$i
if ($i<min[i]) min[i]=$i
if ($i>max[i]) max[i]=$i
}
END
{
OFS="\t";
print "min","max";
#If I am not wrong, I saved the data in an array and I guess this would be the right way to print all min and max?
for(i=0; i < NF; i++;)
{
print min[i], max[i]
}
}
Here is a working solution which is really much easier than what you are doing:
/^-?[0-9]*(\.[0-9]*)?$/ checks that $1 is indeed a valid number, otherwise it is discarded.
sort -n | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {a[c++]=$1} END {OFS="\t"; print "min","max";print a[0],a[c-1]}'
If you don't use this, then min and max need to be initialized, for example with the first value:
awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {if (min=="") min=$1; if (max=="") max=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {OFS="\t"; print "min","max";print min, max}'
Readable versions:
sort -n | awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
a[c++]=$1
}
END {
OFS="\t"
print "min","max"
print a[0],a[c-1]
}'
and
awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
if (min=="") min=$1
if (max=="") max=$1
if ($1<min) min=$1
if ($1>max) max=$1
}
END {
OFS="\t"
print "min","max"
print min, max
}'
On your input, is outputs:
min max
0.2 0.9
EDIT (replying to the comment requiring more information on how awk works):
Awk loops through lines (named records) and for each line you have columns (named fields) available. Each awk iteration reads a line and provides among others the NR and NF variables. In your case, you are only interested in the first column, so you will only use $1 which is the first column field. For each record where $1 is matching /^-?[0-9]*(\.[0-9]*)?$/ which is a regex matching positive and negative integers or floats, we are either storing the value in an array a (in the first version) or setting the min/max variables if needed (in the second version).
Here is the explanation for the condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/:
$1 ~ means we are checking if the first field $1 matches the regex between slashes
^ means we start matching from the beginning of the $1 field
-? means an optional minus sign
[0-9]* is any number of digits (including zero, so .1 or -.1 can be matched)
()? means an optional block which can be present or not
\.[0-9]* if that optional block is present, it should start with a dot and contain zero or more digits (so -. or . can be matched! adapt the regex if you have uncertain input)
$ means we are matching until the last character from the $1 field
If you wanted to loop through fields, you would have to use a for loop from 1 to NF (included) like this:
echo "1 2 3 4" | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$(i); if (max=="") max=$(i); if ($(i)<min) min=$(i); if ($(i)>max) max=$(i)}} END {OFS="\t"; print "min","max";print min, max}'
(please note that I have not checked the input here for simplicity purposes)
Which outputs:
min max
1 4
If you had more lines as an input, awk would also process them after reading the first record, example with this input:
1 2 3 4
5 6 7 8
Outputs:
min max
1 8
To prevent this and only work on the first line, you can add a condition like NR == 1 to process only the first line or add an exit statement after the for loop to stop processing the input after the first line.
If you're looking to only column 1, you may try this:
awk '/^[[:digit:]].*/{if($1<min||!min){min=$1};if($1>max){max=$1}}END{print min,max}' dataset
The script looks for line starting with digit and set the min or max if it didn't find it before.

how to sum up matrices in multiple files using bash or awk

If I have an arbitrary number of files, say n files, and each file contains a matrix, how can I use bash or awk to sum up all the matrices in each file and get an output?
For example, if n=3, and I have these 3 files with the following contents
$ cat mat1.txt
1 2 3
4 5 6
7 8 9
$cat mat2.txt
1 1 1
1 1 1
1 1 1
$ cat mat3.txt
2 2 2
2 2 2
2 2 2
I want to get this output:
$ cat output.txt
4 5 6
7 8 9
10 11 12
Is there a simple one liner to do this?
Thanks!
$ awk '{for (i=1;i<=NF;i++) total[FNR","i]+=$i;} END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print "";}}' mat1.txt mat2.txt mat3.txt
4 5 6
7 8 9
10 11 12
This will automatically adjust to different size matrices. I don't believe that I have used any GNU features so this should be portable to OSX and elsewhere.
How it works:
This command reads from each line from each matrix, one matrix at a time.
For each line read, the following command is executed:
for (i=1;i<=NF;i++) total[FNR","i]+=$i
This loops over every column on the line and adds it to the array total.
GNU awk has multidimensional arrays but, for portability, they are not used here. awk's arrays are associative and this creates an index from the file's line number, FNR, and the column number i, by combining them together with a comma. The result should be portable.
After all the matrices have been read, the results in total are printed:
END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print ""}}
Here, j loops over each line up to the total number of lines, FNR. Then i loops over each column up to the total number of columns, NF. For each row and column, the total is printed via printf "%3i ",total[j","i]. This prints the total as a 3-character-wide integer. If you numbers are float or are bigger, adjust the format accordingly.
At the end of each row, the print "" statement causes a newline character to be printed.
You can use awk with paste:
awk -v n=3 '{for (i=1; i<=n; i++) printf "%s%s", ($i + $(i+n) + $(i+n*2)),
(i==n)?ORS:OFS}' <(paste mat{1,2,3}.txt)
4 5 6
7 8 9
10 11 12
GNU awk has multi-dimensional arrays.
gawk '
{
for (i=1; i<=NF; i++)
m[i][FNR] += $i
}
END {
for (y=1; y<=FNR; y++) {
for (x=1; x<=NF; x++)
printf "%d ", m[x][y]
print ""
}
}
' mat{1,2,3}.txt

Resources