KSH: remove digits from number precision

I am trying to compare 2 log files in a ksh script that look like
log1:
10100 951 5 20150318 20150430
10101 11950 0 20150323 20150630
10102 285933 1 20150128 20150430
10041 57007 3.53 20150128 20150430
log2:
10100 951 5.0000 20150318 20150430
10101 11950 0.0000 20150323 20150630
10102 285933 1.0000 20150128 20150430
10041 57007 3.5300 20150128 20150430
In log1, column 3 has at most 2 digits after the decimal point (e.g. 3.53).
In log2, column 3 always has 4 digits after the decimal point (e.g. 0.0000 or 3.5300).
How could I add digits after the decimal point in the first log, or remove digits from log2, so that I can compare them line for line?
My script is written in ksh.

You should format the value with printf:
cat log1 | while read col1 col2 col3 col4 col5; do
    printf "%d %d %.4f %d %d\n" "$col1" "$col2" "$col3" "$col4" "$col5"
done > log1.converted
The code above reads easily, but it makes an unnecessary call to cat (a "useless use of cat", UUOC). The better way to write it is:
while read col1 col2 col3 col4 col5; do
    printf "%d %d %.4f %d %d\n" "$col1" "$col2" "$col3" "$col4" "$col5"
done < log1 > log1.converted
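Once both files use the same precision, the line-for-line comparison is a plain diff. As an alternative sketch (my addition, assuming single-space-separated fields), the same normalization can be done with awk, rewriting only column 3:
awk '{ $3 = sprintf("%.4f", $3); print }' log1 > log1.converted
diff log1.converted log2 && echo "logs match"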

Related

String to Integer conversions in shell script and back to String

I would like to convert a string to an integer, operate on that integer, and convert it back to a string, in shell.
I have
input_sub=000
while [ -d $input_dir ]
do
echo $input_sub
# HERE I would like to first convert 000 to 0
# then add 1 to it 0-> 1
# then convert that 1 to 001
done
Don't worry too much about the while condition.
I would like to do what is described in the comments.
How can I do this?
You can do what you need in POSIX shell, but you must protect against numbers with leading zeros being interpreted as octal. To do that, you need a way to remove the leading zeros before converting to a number. While bash provides a simple built-in parameter expansion that will work, in plain POSIX shell you are stuck using the old expr syntax or calling a utility like sed or grep.
To trim the leading zeros using expr, you must first know how many there are. Two expr expressions will do the job (note that index and substr are common extensions, e.g. in GNU expr, rather than strictly POSIX). The first, index, returns the (1-based) position of the first character of $input_sub that appears in the second argument, so searching for the digits 1-9 gives the position of the first non-zero digit. The form you can use is:
## get index of first non-zero digit (0 if there is none)
nonzero=$(expr index "$input_sub" "123456789")
With the index of the first non-zero digit in $nonzero, you can use the substr expression to obtain the number without its leading zeros (you know the number has at most 3 digits, so take the substring starting at that index with length 3), e.g.
num=$(expr substr "$input_sub" "$nonzero" 3) ## remove leading 0's
You also need to handle 000 as $input_sub (expr index returns 0 when no non-zero digit is found), so go ahead and add an if ... then ... else ... fi to handle that case, e.g.
if [ "$nonzero" -eq 0 ]; then
num=0
else
num=$(expr substr "$input_sub" "$nonzero" 3) ## remove leading 0's
fi
Now you can simply add 1 to get your new number:
newnum=$((num + 1))
To convert the number back to a 3-character string representing the number with its leading zeros restored, just use printf with the "%03d" conversion specifier, e.g.
# then convert that 1 to 001
input_sub=$(printf "%03d" "$newnum")
Putting together a short example showing the progression, I have replaced your while loop with one that iterates 20 times, counting from 000 up to 020, and I have added printf statements to show the numbers and the conversion back to a string. Simply restore your original while condition and remove the extra printf statements for your own use:
#!/bin/sh
input_sub=000
# while [ -d $input_dir ]
while [ "$input_sub" != "020" ] ## temporary loop 000 to 009
do
printf "input_sub: %s " "$input_sub"
# HERE I would like to first convert 000 to 0
# then add 1 to it 0-> 1
## get index of first non-zero digit (0 if there is none)
nonzero=$(expr index "$input_sub" "123456789")
if [ "$nonzero" -eq 0 ]; then
num=0
else
num=$(expr substr "$input_sub" "$nonzero" 3) ## remove leading 0's
fi
newnum=$((num + 1))
# then convert that 1 to 001
input_sub=$(printf "%03d" "$newnum")
printf "%2d + 1 = %2d => input_sub: %s\n" "$num" "$newnum" "$input_sub"
done
Example Use/Output
Showing the conversions with the modified while loop, you would get:
$ sh str2int2str.sh
input_sub: 000 0 + 1 = 1 => input_sub: 001
input_sub: 001 1 + 1 = 2 => input_sub: 002
input_sub: 002 2 + 1 = 3 => input_sub: 003
input_sub: 003 3 + 1 = 4 => input_sub: 004
input_sub: 004 4 + 1 = 5 => input_sub: 005
input_sub: 005 5 + 1 = 6 => input_sub: 006
input_sub: 006 6 + 1 = 7 => input_sub: 007
input_sub: 007 7 + 1 = 8 => input_sub: 008
input_sub: 008 8 + 1 = 9 => input_sub: 009
input_sub: 009 9 + 1 = 10 => input_sub: 010
input_sub: 010 10 + 1 = 11 => input_sub: 011
input_sub: 011 11 + 1 = 12 => input_sub: 012
input_sub: 012 12 + 1 = 13 => input_sub: 013
input_sub: 013 13 + 1 = 14 => input_sub: 014
input_sub: 014 14 + 1 = 15 => input_sub: 015
input_sub: 015 15 + 1 = 16 => input_sub: 016
input_sub: 016 16 + 1 = 17 => input_sub: 017
input_sub: 017 17 + 1 = 18 => input_sub: 018
input_sub: 018 18 + 1 = 19 => input_sub: 019
input_sub: 019 19 + 1 = 20 => input_sub: 020
This has been done in plain shell given your tag [shell] (modulo the expr index/substr caveat above). If you have bash available, you can shorten the script and make it a bit more efficient by using bash built-ins instead of expr. That said, for at most 1000 directories you won't notice much difference. Let me know if you have further questions.
Bash Solution Per-Request in Comment
If you do have bash available, then the [[ ... ]] expression provides the =~ operator, which allows an extended regex match on the right-hand side (e.g. [[ $var =~ REGEX ]]). The regex can contain capture groups (parts of the regex enclosed in (...)) that are used to fill the BASH_REMATCH array, where ${BASH_REMATCH[0]} contains the entire matched expression and ${BASH_REMATCH[1]} ... contain each captured part of the regex.
So using [[ ... =~ ... ]] with a capture on the number beginning with [123456789] will leave the wanted number in ${BASH_REMATCH[1]}, allowing you to compute the new number with built-in arithmetic, e.g.
#!/bin/bash
input_sub=000
# while [ -d $input_dir ]
while [ "$input_sub" != "020" ] ## temporary loop 000 to 020
do
printf "input_sub: %s " "$input_sub"
# HERE I would like to first convert 000 to 0
# then add 1 to it 0-> 1
## [[ .. =~ REGEX ]], captures between (...) in array BASH_REMATCH
if [[ $input_sub =~ ^0*([123456789]+[0123456789]*)$ ]]
then
num=${BASH_REMATCH[1]} ## use number if not all zeros
else
num=0 ## handle 000 case
fi
newnum=$((num + 1))
# then convert that 1 to 001
input_sub=$(printf "%03d" "$newnum")
printf "%2d + 1 = %2d => input_sub: %s\n" "$num" "$newnum" "$input_sub"
done
(same output)
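As a further aside (not something the scripts above use): in bash and ksh93, arithmetic expansion accepts an explicit radix, so the leading-zero/octal problem can be sidestepped without expr or a regex. A minimal sketch:
#!/bin/bash
input_sub=019
num=$((10#$input_sub))                # force base 10: "019" -> 19, "000" -> 0
newnum=$((num + 1))
input_sub=$(printf "%03d" "$newnum")
echo "$input_sub"                     # prints 020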
Let me know if you have further questions.

Combining multiple awk output statements into one line

I have some ASCII files I’m processing, each with 35 columns and a variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ASCII file as column number 36. Then, I need to take another column, divide it (row by row) by column 36, and place that result into the same duplicate ASCII file as column 37.
I’ve done similar processing in the past, but by writing a temp file for each awk command and reading each successive temp file back in, to eventually create the final ASCII file, deleting the temp files afterwards. I’m hoping there is an easier/faster method than creating a bunch of temp files.
Below is an initial working processing step that the awk commands above would need to follow and fit into. This step gets the data from foo.txt, removes the header, and keeps only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There’s another processing step, for different data files, that also needs the 2 new columns discussed earlier. It simply appends a unique file name (from whatever is being catted) as the last column of every row in a new ASCII file. This command is actually in a loop with varying input files, but I’ve simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below is the desired example foo_new.txt, with the requested 2 columns of awk output as the last 2 columns. In this example, column 5 is the difference between columns 3 and 2, plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example foo_new.txt, the last column is an example of fname. These are computed in the shell script and passed to awk. I don't care whether the results in column 7 (fname) end up at the end or between columns 4 and 5, so long as it gets along with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
My best luck so far is below, but unfortunately it produces a file with the original output first and the added output below it. I'd like the added output appended as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2, plus 1), and a column 6 that is the value of column 1 divided by column 5, printed with 1 decimal place. You also want a file name based on a variable fname passed to the script, with a unique value for each line. Finally, you only want lines where column 4 matches F followed by 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
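For instance, a sketch of that variation with a hypothetical two-digit, zero-padded counter starting at 1:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0, c5, sprintf("%.1f",$1/c5), fname sprintf("%02d", ++counter) }' data
which labels the lines C01, C02, C03 instead of C2, C3, C4.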
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some header/special first line; if so, preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
Taking the first 3 columns from the example you presented, and setting N to 2 and another_column to 1, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
This script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
gets closer, I think, to the output you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by "(N+1)" you meant "the difference between two columns, with 1 added".

How to generate N columns with printf

I'm currently using:
printf "%14s %14s %14s %14s %14s %14s\n" $(cat NFE.txt)>prueba.txt
This reads a list in NFE.txt and generates 6 columns. I need to generate N columns where N is a variable.
Is there a simple way of saying something like:
printf "N*(%14s)\n" $(cat NFE.txt)>prueba.txt
Which generates the desired output?
# T1 is a string of N blanks
T1=$(printf "%${N}s")
# Replace every blank in T1 with the string "%14s " and assign to T2
T2="${T1// /%14s }"
# Note that T2 contains a trailing blank;
# ${T2% } is T2 without that trailing blank
printf "${T2% }\n" $(cat NFE.txt)>prueba.txt
You can do this, although I don't know how robust it will be:
$(printf 'printf '; printf '%%14s%0.s' {1..6}; printf '\\n') $(<file)
The 6 in {1..6} is your variable number of strings. This prints out a printf command with the correct number of format specifiers, which is then executed.
Input
10 20 30 40 50 1 0
1 3 45 6 78 9 4 3
123 4
5 4 8 4 2 4
Output
10 20 30 40 50 1
0 1 3 45 6 78
9 4 3 123 4 5
4 8 4 2 4
You could write this in pure bash, but you could equally lean on an existing language. For example:
printf "$(python -c 'print("%14s "*6)')\n" $(<NFE.txt)
In pure bash, you could write, for example:
repeat() { (($1)) && printf "%s%s" "$2" "$(repeat $(($1-1)) "$2")"; }
and then use that in the printf:
printf "$(repeat 6 "%14s ")\n" $(<NFE.txt)

Vertically divide an array so we get minimum splits

I am thinking on the following problem.
I can have an array of strings like
Col1 Col2 Col3 Col4
aa aa aa aa
aaa aaa aaaaa aaa
aaaa aaaaaaa aa a
...........................
Actually it is a CSV file, and I should find a way to divide it vertically into one or more files. The condition for splitting is that no file may contain a row exceeding some number of bytes. For simplicity, we can rewrite the array with the lengths:
Col1 Col2 Col3 Col4
2 2 2 2
3 3 5 3
4 7 2 1
...........................
And let's say the limit is 10, i.e. if a row's total would exceed 9 we must split. So if we split into 2 files [Col1, Col2, Col3] and [Col4], that does not satisfy the condition, because the first file would contain 3 + 3 + 5 > 9 in the second row and 4 + 7 + 2 > 9 in the third row. If we split into [Col1, Col2] and [Col3, Col4], that does not satisfy the condition either, because the first file would contain 4 + 7 > 9 in the third row. So we split into 3 files: [Col1], [Col2, Col3], and [Col4]. Now every file is valid and looks like:
File1 | File2 | File3
------------------------------
Col1 | Col2 Col3 | Col4
2 | 2 2 | 2
3 | 3 5 | 3
4 | 7 2 | 1
...............................
So it should split from left to right, giving as many columns as possible to the leftmost file. The problem is that this file can be huge and I don't want to read it into memory; we read the initial file line by line, and somehow I should determine the set of indexes at which to split. Is that possible at all? I hope I described the problem well enough for you to understand it.
Generally, awk is quite good at handling large CSV files.
You could try something like this to retrieve the max length of each column, and then decide how to split.
Let's say the file.txt contains
Col1;Col2;Col3;Col4
aa;aa;aa;aa
aaa;aaa;aaaaa;aaa
aaaa;aaaaaaa;aa;a
(Assuming Windows-style quoting.) Running the following:
> awk -F";" "NR>1{for (i=1; i<=NF; i++) max[i]=(length($i)>max[i]?length($i):max[i])} END {for (i=1; i<=NF; i++) printf \"%d%s\", max[i], (i==NF?RS:FS)}" file.txt
will output:
4;7;5;3
Could you try this on your real data set?
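As a follow-up sketch (my own, not part of the answer above): feed those max lengths into a greedy pass that assigns columns left to right, starting a new file whenever a group's total would exceed the limit. Note this uses per-column maxima, which is conservative: the sum of column maxima can exceed any real row's total, so it may split more than strictly necessary (here it yields [Col1], [Col2], [Col3, Col4] rather than the example's [Col1], [Col2, Col3], [Col4]):
echo "4;7;5;3" | awk -F";" -v limit=9 '{
    file = 1; sum = 0
    for (i = 1; i <= NF; i++) {
        # start a new file when adding column i would exceed the limit
        if (sum + $i > limit && sum > 0) { file++; sum = 0 }
        sum += $i
        printf "Col%d -> File%d\n", i, file
    }
}'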

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide each by a value, subtract one from the other, and then output a new third file with the resulting values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to do this with awk by myself. Does anyone know how, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc.
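If you also want the #outputheader line from your desired output, a sketch building on the same script prints it from a BEGIN block (same example values for c and k):
awk -v c=10 -v k=20 '
BEGIN {print "#outputheader"}                ;# emit the output header once
/^#/ {next}                                  ;# skip input headers
FNR==NR {val[$1]=$2; next}                   ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)}     ;# perform the calc and print
' file1 file2 > file3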
