I am converting a lot of CSV files with bash scripts. They all have the same structure and the same header names; the values in the columns vary, of course. Col4 is always an integer.
Source file:
Col1;Col2;Col3;Col4
Name1;Street1;City1;2
Name2;Street2;City2;12
Name3;Street3;City3;15
Name4;Street4;City4;10
Name5;Street5;City5;3
Now when Col4 contains a certain value, for example "10", the value has to be changed to "10 pcs" and the complete line has to be duplicated: one line for every 5 pcs.
So you could say that the number of duplicates is the value of Col4 divided by 5 and then rounded up.
So if Col4 = 10 I need 2 duplicates, and if Col4 = 12, I need 3 duplicates.
Result file:
Col1;Col2;Col3;Col4
Name1;Street1;City1;2
Name2;Street2;City2;... of 12
Name2;Street2;City2;... of 12
Name2;Street2;City2;... of 12
Name3;Street3;City3;... of 15
Name3;Street3;City3;... of 15
Name3;Street3;City3;... of 15
Name4;Street4;City4;... of 10
Name4;Street4;City4;... of 10
Name5;Street5;City5;3
Can anyone help me put this in a script? Something with bash, sed, or awk; these are the languages I'm familiar with, although I'm interested in other solutions too.
Here is the awk code assuming that the input is in a file called /tmp/input
awk -F\; '$4 < 5 {print}; $4 > 5 {for (i = 0; i < ($4/5); i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}' /tmp/input
Explanation:
There are two rules.
First rule prints any rows where $4 is less than 5. (Note that this does not catch the header line: since "Col4" is not numeric, awk compares it as a string, and "Col4" > "5", so the header is silently dropped.)
$4 < 5 {print}
The second rule prints when $4 is greater than 5. The loop runs ceil($4/5) times, because the integer counter i is tested against the bound $4/5, which may be fractional:
$4 > 5 {for (i=0; i< ($4/5); i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}
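To see the rounding-up effect in isolation, here is a minimal check (my own illustration, not part of the original answer):
awk 'BEGIN { v = 12; for (i = 0; i < v/5; i++) n++; print n }'
3
With v = 12 the bound is 2.4, so the loop body runs for i = 0, 1, 2, which is ceil(12/5) = 3.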
Output:
Name1;Street1;City1;2
Name2;Street2;City2;...of 12
Name2;Street2;City2;...of 12
Name2;Street2;City2;...of 12
Name3;Street3;City3;...of 15
Name3;Street3;City3;...of 15
Name3;Street3;City3;...of 15
Name4;Street4;City4;...of 10
Name4;Street4;City4;...of 10
Name5;Street5;City5;3
The code does not handle the case where $4 == 5. You can handle that by adding a third rule. I did not add that, but I think you got the idea.
Thanks Jay! This was just what I needed.
Here is the final awk code I'm using now:
awk -F\; '$4 == "Col4" {print}; $4 < 5 {print}; $4 == 5 {print}; $4 > 5 {for (i = 0; i < ($4/5); i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}' /tmp/input
I added the rule below to print the header, because it wasn't printed
$4 == "Col4" {print}
I added the rule below to print the lines where the value is equal to 5
$4 == 5 {print}
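For what it's worth, the three print rules can also be folded into a single guard. A sketch of an equivalent version (untested; it assumes the header is always the first line, so NR == 1 replaces the $4 == "Col4" test):
awk -F\; 'NR == 1 || $4 <= 5 {print; next} {for (i = 0; i < $4/5; i++) printf "%s;%s;%s;...of %s\n",$1,$2,$3,$4}' /tmp/input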
Related
Is it possible with awk to find the average of a certain row?
For example the txt file (average.txt) contains:
2 5 10
1 5 5
1 5 10
So I want to find only the first row's average: 5.666667.
I tried to do it this way:
awk 'NR==1 {sum+=NF} END {print(sum/NF)}' average.txt
but the output is wrong: 1
I want to explain what your code is actually doing. The NF built-in variable holds the number of fields in the current line, so for the file average.txt
2 5 10
1 5 5
1 5 10
code
awk 'NR==1 {sum+=NF} END {print(sum/NF)}' average.txt
increases sum by the number of fields for the first line (3 for the provided file) and then, after processing all lines, prints that value divided by the number of fields on the last line. In other words, your code computes the ratio of the number of fields on the 1st line to the number of fields on the last line. If you want to know more about NF, then read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
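To see this concretely, print NR and NF for each line of average.txt: every line has 3 fields, so sum becomes 3 after the first line and the END block prints 3/3 = 1, exactly the wrong output you saw.
awk '{print NR, NF}' average.txt
1 3
2 3
3 3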
Like this:
$ awk 'NR==1{for (i=1; i<=NF; i++) sum+=$i}END{print sum/NF}' file
5.66667
You can loop through all fields and sum up all the values.
If you only want to process the first record, you can print the value directly and then exit awk.
To prevent a division-by-zero error, you can also check that the number of fields is > 0:
awk 'NR==1 && NF>0 {for (i=1; i<=NF; i++) sum+=$i; print sum/NF; exit}' average.txt
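Run against the sample average.txt above, this prints:
5.66667
(awk's default OFMT rounds the printed value to six significant digits; use printf "%.6f\n", sum/NF instead if you want 5.666667.)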
Yet another solution. It seems wasteful to me to loop over every line when the first one is the only one you are interested in.
Instead, just head -1 the file containing your data, like so (here, the file is called test):
head -1 test | awk '{for (i=1; i<=NF; i++) sum+=$i; print sum/NF}'
Here, the awk command is basically copied from the other answers, but without all the NR==1 stuff.
I have one file (an Excel file) which has some columns (not fixed; they change dynamically), and I need to get the values for a couple of particular columns. I'm able to get the column numbers using one awk command and then print the rows using these column numbers in another awk command. Is there any way I can combine them into one?
awk -F',' ' {for(i=1;i < 9;i++) {if($i ~ /CLIENT_ID/) {print i}}} {for(s=1;s < 2;s++) {if($s ~ /SEC_DESC/) {print s}}} ' <file.csv> | awk -F "," '!($5~/...[0-9]L/ && $21~/FUT /) {print $0}' <file.csv>
This gives me the output 5 and 9 for the columns (client_id and sec_desc), which are their column numbers (these change with different files).
Now using this column number, I get the desired output as follows:
awk -F "," '!($5~/...[0-9]L/ && $21~/FUT /) {print $0}' <file.csv>
How can I combine these into one command? Pass a variable from the first to the second?
Input (a CSV file with dynamically varying columns; I am interested in the following two):
CLIENT_ID SEC_DESC
USZ256 FUT DEC 16 U.S.
USZ256L FUT DEC 16 U.S. BONDS
WNZ256 FUT DEC 16 CBX
WNZ256L FUT DEC 16 CBX BONDS
Output: give me rows 2 and 4, which matched my regex pattern in the second awk command (using column numbers 5 & 21). These column numbers change from file to file, so I first have to get them with the first awk and then feed them into the second awk.
I think I got it.
awk -F',' '
NR == 1 {
for (i=1; i<=NF; ++i) {
if ($i == "CLIENT_ID") cl_col = i
if ($i == "SEC_DESC") sec_col = i
}
}
NR > 1 && !($cl_col ~ /...[0-9]L/ && $sec_col ~ /FUT /) {print $0}
' RED_FUT_TST.csv
To solve your problem you can test when you're processing the first row, and put the logic to discover the column numbers there. Then when you are processing the data rows, use the column numbers from the first step.
(NR is an awk built-in variable containing the record number being processed. NF is the number of columns.)
E.g.:
$ cat red.awk
NR == 1 {
for (i=1; i<=NF; ++i) {
if ($i == "CLIENT_ID") cl_col = i;
if ($i == "SEC_DESC") sec_col = i;
}
}
NR > 1 && $cl_col ~ /...[0-9]L/ && $sec_col ~ /FUT /
$ awk -F'\t' -f red.awk RED_FUT_TST.csv
USZ256L FUT DEC 16 U.S. BONDS
WNZ256L FUT DEC 16 CBX BONDS
{gsub(/[ \t]+$/, "", $4); length($4) < 9 || length($4) > 12 } {print $4$1} {print length($4)} { fails4++ }
so I have this portion above that is supposed to validate the 4th field ($4): if the length is less than 9 or greater than 11 characters, the validation is supposed to fail... even after I print the length I get 11 characters, and I set the validation to greater than 12, but it's still failing.
What I am trying to do is correctly account for the length of the field: if there are any white spaces in the $4 field, it is supposed to trim them, get the length, and fail if it's less than 9 or greater than 11 characters.
length($4) < 9 || length($4) > 11 {print $4$1} {print length($4)} { fails4++ }
It sounds like you want:
{gsub(/^[[:space:]]+|[[:space:]]+$/, "", $4); lgth=length($4)} lgth < 9 || lgth > 11{print $4 $1, lgth; fails4++}
If not, post some sample input and expected output.
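For example, with a hypothetical one-record input (using -F',' here purely for illustration, so that $4 can carry trailing blanks): the untrimmed 4th field is 11 characters, but after trimming it is 8, so it fails the < 9 test:
echo 'x,y,z,ABCDEFGH   ' | awk -F',' '{gsub(/^[[:space:]]+|[[:space:]]+$/, "", $4); lgth=length($4)} lgth < 9 || lgth > 11{print $4 $1, lgth; fails4++}'
ABCDEFGHx 8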
If I have an arbitrary number of files, say n files, and each file contains a matrix, how can I use bash or awk to sum up all the matrices in each file and get an output?
For example, if n=3, and I have these 3 files with the following contents
$ cat mat1.txt
1 2 3
4 5 6
7 8 9
$ cat mat2.txt
1 1 1
1 1 1
1 1 1
$ cat mat3.txt
2 2 2
2 2 2
2 2 2
I want to get this output:
$ cat output.txt
4 5 6
7 8 9
10 11 12
Is there a simple one liner to do this?
Thanks!
$ awk '{for (i=1;i<=NF;i++) total[FNR","i]+=$i;} END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print "";}}' mat1.txt mat2.txt mat3.txt
4 5 6
7 8 9
10 11 12
This will automatically adjust to different size matrices. I don't believe that I have used any GNU features so this should be portable to OSX and elsewhere.
How it works:
This command reads each line of each matrix, one matrix at a time.
For each line read, the following command is executed:
for (i=1;i<=NF;i++) total[FNR","i]+=$i
This loops over every column on the line and adds it to the array total.
GNU awk has multidimensional arrays but, for portability, they are not used here. awk's arrays are associative and this creates an index from the file's line number, FNR, and the column number i, by combining them together with a comma. The result should be portable.
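As an aside (my own note, not from the original answer): awk has a built-in form of this composite key. Writing total[FNR, i] joins the subscripts with the SUBSEP control character, which avoids collisions if a subscript could itself contain a comma. A minimal sketch:
awk 'BEGIN { a[1,2] = "x"; if ((1,2) in a) print a[1 SUBSEP 2] }'
x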
After all the matrices have been read, the results in total are printed:
END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print ""}}
Here, j loops over each line up to the total number of lines, FNR. Then i loops over each column up to the total number of columns, NF. For each row and column, the total is printed via printf "%3i ",total[j","i]. This prints the total as a 3-character-wide integer. If your numbers are floats or are bigger, adjust the format accordingly.
At the end of each row, the print "" statement causes a newline character to be printed.
You can use awk with paste:
awk -v n=3 '{for (i=1; i<=n; i++) printf "%s%s", ($i + $(i+n) + $(i+n*2)),
(i==n)?ORS:OFS}' <(paste mat{1,2,3}.txt)
4 5 6
7 8 9
10 11 12
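This works because paste joins corresponding lines of the three files (tab-separated, which awk's default field splitting also handles), so each combined row carries all three matrices side by side:
$ paste mat{1,2,3}.txt
1 2 3    1 1 1    2 2 2
4 5 6    1 1 1    2 2 2
7 8 9    1 1 1    2 2 2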
GNU awk has multi-dimensional arrays.
gawk '
{
for (i=1; i<=NF; i++)
m[i][FNR] += $i
}
END {
for (y=1; y<=FNR; y++) {
for (x=1; x<=NF; x++)
printf "%d ", m[x][y]
print ""
}
}
' mat{1,2,3}.txt
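Note that the m[i][FNR] arrays-of-arrays syntax is a gawk extension (gawk 4.0 or later); a POSIX awk would need the SUBSEP-style composite keys shown above. On the sample files this prints the same result:
4 5 6
7 8 9
10 11 12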
I'm very new to Bash, so I'm sorry if this question is actually very simple. I am dealing with a text file that contains many vertical lists of the numbers 2-32, counting up by 2, and each number has a line of other text following it. The problem is that some of the lists are missing numbers. Any pointers for a script that could go through and check whether each number is there, and if not, add a line with the number?
One list might look like:
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
12 dlsagflakdjshgflksdhflksdahfl
All the way down to 32. How would I, in this case, make it so that the 10 is recognized as missing and then added in above the 12? Thanks!
Here's one awk-based solution (formatted for readability, not necessarily how you would type it):
awk ' { value[0 + $1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i]
}' input.txt
It basically just records the existing lines in a key/value pair (associative array), then at the end, prints all the records you care about, along with the (possibly empty) value saved earlier.
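Run against the sample list above (say, in input.txt), the missing entries come out with an empty value:
$ awk '{ value[0 + $1] = $2 } END { for (i = 2; i < 34; i+=2) print i, value[i] }' input.txt
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
10
12 dlsagflakdjshgflksdhflksdahfl
14
(and so on for every even number up to 32)
One caveat: value[...] = $2 keeps only the first whitespace-separated word after the number, which is fine for the sample lists (the trailing text contains no spaces) but would truncate text that does.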
Note: if the first column needs to be seen as a string instead of an integer, this variant should work:
awk ' { value[$1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i ""]
}' input.txt
You can use awk to figure out the missing line and add it back:
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
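Briefly, how this works: i always holds the next expected even number. The first rule keeps i in step while the incoming numbers still match the line count ($1 == NR*2); once $1 jumps past i, the while loop prints the missing even numbers up to $1. The END block tops the sequence up to 32, and the trailing 1 is an always-true pattern, so every original line is printed unchanged.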
Testing:
cat file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
20 daflgkdsakfjhasdlkjhfasdjkhf
24 dlsagflakdjshgflksdhflksdahfl
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8
10
12
14
16
18
20 daflgkdsakfjhasdlkjhfasdjkhf
22
24 dlsagflakdjshgflksdhflksdahfl
26
28
30
32