Update column in file based on associative array value in bash

So I have a file named testingFruits.csv with the following columns:
name,value_id,size
apple,1,small
mango,2,small
banana,3,medium
watermelon,4,large
I also have an associative array that stores the following data:
fruitSizes[apple] = xsmall
fruitSizes[mango] = small
fruitSizes[banana] = medium
fruitSizes[watermelon] = xlarge
Is there any way I can update the 'size' column within the file based on the data in the associative array for each value in the 'name' column?
I've tried using awk but I had no luck. Here's a sample of what I tried to do:
awk -v t="${fruitSizes[*]}" 'BEGIN{n=split(t,arrayval,""); ($1 in arrayval) {$3=arrayval[$1]}' "testingFruits.csv"
I understand this command should take the bash-defined array fruitSizes, split it on all the values, then check whether the first column (name) is in the fruitSizes array. If it is, it would update the third column (size) with the value found in fruitSizes for that specific name.
Unfortunately this gives me the following error:
Argument list too long
This is the expected output I'd like in the same testingFruits.csv file:
name,value_id,size
apple,1,xsmall
mango,2,small
banana,3,medium
watermelon,4,xlarge
One edge case I'd like to handle is the presence of duplicate values in the name column with different values for the value_id and size columns.

If you want to stick to an awk script, pass the array via stdin to avoid running into ARG_MAX issues.
Since your array is associative, listing only the values ${fruitSizes[@]} is not sufficient. You also need the keys ${!fruitSizes[@]}. pr -2 can then pair each key with its value on one line.
This assumes that ${fruitSizes[@]} and ${!fruitSizes[@]} expand in the same order, and that your keys and values are free of the field separator (, in this case).
printf %s\\n "${!fruitSizes[@]}" "${fruitSizes[@]}" | pr -t -2 -s, |
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' - testingFruits.csv
However, I'm wondering where the array fruitSizes comes from. If you read it from a file or something like that, it would be easier to leave out the array altogether and do everything in awk.
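For completeness, here is a minimal sketch of the whole round trip with the array approach, writing the result back into testingFruits.csv (assuming bash 4+ for associative arrays; the temporary file and its name tmp.csv are my own choice, since awk cannot edit a file in place portably):
#!/usr/bin/env bash
declare -A fruitSizes=([apple]=xsmall [mango]=small [banana]=medium [watermelon]=xlarge)

# keys and values are paired on stdin, read via '-' before the CSV file
printf '%s\n' "${!fruitSizes[@]}" "${fruitSizes[@]}" | pr -t -2 -s, |
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' - testingFruits.csv \
  > tmp.csv && mv tmp.csv testingFruits.csv
Because the lookup is keyed on the name column only, duplicate names with different value_id entries all receive the same new size, which is exactly the edge case described in the question; the header row is untouched since "name" is not a key in the array.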

Related

Is there a way to iterate over values of a column then check if it's present elsewhere?

I have generated 2 .csv files, one containing the original md5sums of some files in a directory and one containing the md5sums calculated at a specific moment.
md5_original.csv
----------
$1 $2 $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
912ec803b2ce49e4a541068d495ab570,,s2.txt
040b7cf4a55014e185813e0644502ea9,,s64.txt
8a0b67188083b924d48ea72cb187b168,,b43.txt
etc.
md5_$current_date.csv
----------
$1 $2 $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
4d4046cae9e9bf9218fa653e51cadb08,,s2.txt
3ff22b3585a0d3759f9195b310635c29,,b43.txt
etc.
* some files could be deleted when calculating current md5sums
I am looking to iterate over the values of column $3 in md5_$current_date.csv and, for each value of that column, check whether it exists in md5_original.csv and, if so, compare its $1 value.
Output should be:
s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08.
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29.
I have written the script that builds these two .csv files, but I am struggling with the awk part where I have to do what I asked above. I don't know if there is a better way to do this; I am a newbie.
I would use GNU AWK for this task in the following way. Let md5_original.csv content be
7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
912ec803b2ce49e4a541068d495ab570 {BLANK_COLUMN} s2.txt
040b7cf4a55014e185813e0644502ea9 {BLANK_COLUMN} s64.txt
8a0b67188083b924d48ea72cb187b168 {BLANK_COLUMN} b43.txt
and md5_current.csv content be
7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
4d4046cae9e9bf9218fa653e51cadb08 {BLANK_COLUMN} s2.txt
3ff22b3585a0d3759f9195b310635c29 {BLANK_COLUMN} b43.txt
then
awk 'FNR==NR{arr[$3]=$1;next}($3 in arr)&&($1 != arr[$3]){print $3 " hash changed from " arr[$3] " to " $1}' md5_original.csv md5_current.csv
output
s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29
Explanation: FNR is the row number within the current file and NR is the row number globally; they are equal only while processing the first file. While processing the first file I build the array arr so that the keys are filenames and the values are the corresponding hashes; next makes GNU AWK go to the next line, i.e. no other action is taken, so the rest of the program applies only to all but the first file. ($3 in arr) is the condition: is the current $3 one of the keys of arr? If it holds true, and the stored hash differs from the current one, I print the concatenation of the current $3 (the filename), the string " hash changed from ", the value stored in arr for key $3 (the old hash), the string " to ", and $1 (the current hash). If a given filename is not present in arr, nothing is printed.
Edit: added an exclusion for hashes that did not change, as suggested in a comment.
(tested in gawk 4.2.1)
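Since the question notes that some files may have been deleted before the current md5sums were calculated, a hedged variation of the same idea can also report those: delete each matched key and print whatever is left over in an END block (the wording of the extra message is mine):
awk 'FNR==NR {arr[$3]=$1; next}
     ($3 in arr) {
         if ($1 != arr[$3]) print $3 " hash changed from " arr[$3] " to " $1
         delete arr[$3]
     }
     END {for (f in arr) print f " has no current md5sum (file deleted?)"}' md5_original.csv md5_current.csv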

Replace specific column with values passed as parameter in a file within loop in shell script

Suppose we have a file test_file.csv:
"261718"|"2017-08-21"|"ramesh_1"|"111"
"261719"|"2017-08-23"|"suresh_1"|"112"
required modified test_file.csv should be :
"261718"|"2017-08-21"|"ramesh"|"111"
"261719"|"2017-08-23"|"suresh"|"112"
How would I find and replace the third column with the required values passed as parameters? It should be within an iteration.
You can save your arguments as comma-separated values and store them in a variable args.
Pass this variable to awk with the -v option, then overwrite the third column $3 with the nth array element, where n is the current row number.
args='"ramesh","suresh"'
awk -F '|' -v args="$args" '
BEGIN {
    OFS = FS                  # keep "|" as the output separator
    split(args, arr, ",")     # arr[1] is "ramesh" (with quotes), arr[2] is "suresh"
}
{
    $3 = arr[NR]; print       # replace the third column row by row
}' test_file.csv
Output:
"261718"|"2017-08-21"|"ramesh"|"111"
"261719"|"2017-08-23"|"suresh"|"112"

Bash Find Null values of all variables after equal sign in a file

I have a configuration(conf.file) with list of variables and its values generated from shell script
cat conf.file
export ORA_HOME=/u01/app/12.1.0
export ORA_SID=test1
export ORA_LOC=
export TW_WALL=
export TE_STAT=YES
I want to find any variable that has a null value after the equals (=) symbol and, if so, report the message: Configuration file has following list of null variables
You can use awk for this:
awk -F"[= ]" '$3=="" && NF==3 {print $2}' conf.file
That will split each record on a space or an equals sign, then test the third field of each row. If it is empty, it prints the second field (the variable name).
UPDATE: Added a test for the Number of Fields (NF) being equal to 3 to avoid blank rows.
try:
awk -F"=" '$2' Input_file
Since the field after = should not be empty, this makes = the field separator and checks whether the 2nd field is non-empty; no action is defined in my code, so the default print action happens for any line that satisfies the condition. Let me know if this helps.
EDIT: The above prints only the variables that do have a value after =. Thanks to JNevill for pointing out that the requirement is exactly the opposite; the following should help:
awk -F"=" '!$2{gsub(/.* |=/,"",$1);print $1}' Input_file
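To produce the exact report the question asks for, either command can be wrapped in a small test (a sketch; the message text follows the question's wording):
nulls=$(awk -F= '/=/ && $2=="" {sub(/^export[[:space:]]+/, "", $1); print $1}' conf.file)
if [ -n "$nulls" ]; then
    echo "Configuration file has following list of null variables:"
    echo "$nulls"
fi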

How to replace a string like "[1.0 - 4.0]" with a numeric value using awk or sed?

I have a CSV file that I am piping through a set of awk/sed commands.
Some lines in the CSV file look like this:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
where the 8th and 9th columns are a string representing a numeric range.
How can I use awk or sed to replace those fields with a numeric value? Either the beginning of the range, or the end of the range?
So this line would end up as
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
or
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,3.0,0.768
I got as far as removing the brackets but past that I'm stuck. I considered splitting on the " - ", but many lines in my file have a regular numeric value, not a range, in those last two columns, and that makes things messy (I don't want to end up with some lines having a different number of columns).
Here is a sed command that will take each range and break it up into two fields. It looks for strings like "[A - B]" and converts them to A,B. It can easily be modified to just use one of the values if needed by changing the \1,\2 portion. The regular expression assumes that all numbers have at least one digit on either side of a required decimal place. So, 1, .5, and 3. would not be valid. If you need that, the regex can be made to be more accommodating.
$ cat file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
$ sed -Ee 's|"\[([0-9]+\.[0-9]+) - ([0-9]+\.[0-9]+)\]"|\1,\2|g' file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,3.0,0.384,0.768
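For instance, to keep only one bound of each range, as the question asks, replace the \1,\2 portion with a single capture group:
# keep the start of each range
sed -E 's|"\[([0-9]+\.[0-9]+) - ([0-9]+\.[0-9]+)\]"|\1|g' file
# keep the end of each range
sed -E 's|"\[([0-9]+\.[0-9]+) - ([0-9]+\.[0-9]+)\]"|\2|g' file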
Since your data is field-based, awk is the logical choice.
Note that while awk generally isn't aware of double-quoted fields, that is not a problem here, because the double-quoted fields do not have embedded , instances.
#!/usr/bin/env bash
useStart1=1 # set to `0` to use the *end* of the *penultimate* field's range instead.
useStart2=1 # set to `0` to use the *end* of the *last* field's range instead.
awk -v useStart1=$useStart1 -v useStart2=$useStart2 '
BEGIN { FS=OFS="," }
{
split($(NF-1), tokens1, /[][" -]+/)
split($NF, tokens2, /[][" -]+/)
$(NF-1) = useStart1 ? tokens1[2] : tokens1[3]
$NF = useStart2 ? tokens2[2] : tokens2[3]
print
}
' <<'EOF'
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
EOF
The code above yields:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
Modifying the values of $useStart1 and $useStart2 yields the appropriate variations.

How to use bash to filter csv by column value and remove duplicates based on multiple columns

I'm trying to filter my CSV by the value of one column, and then remove duplicate rows based on the values of 2 columns. For the sake of simplicity, here's an example. I would like to remove duplicate rows based on columns ID1, ID2 and Year. I would also like to filter my results by only pulling back rows with "3" in the VALUE column.
ID1,ID2,YEAR,LAT,LON,VALUE
A,B,2016,123,456,3
A,B,2016,133,466,3
A,B,2016,122,446,3
C,D,2015,223,456,3
C,D,2015,241,455,3
A,B,2016,123,456,2
A,B,2016,133,466,2
A,B,2016,122,446,2
C,D,2015,223,456,2
C,D,2015,241,455,2
RESULT:
ID1,ID2,YEAR,LAT,LON,VALUE
A,B,2016,123,456,3
C,D,2015,223,456,3
You can use awk with an associative array whose key is the composite value comprising $1, $2 and $3; keeping the first line preserves the header:
awk -F, 'NR==1 || ($NF==3 && !seen[$1,$2,$3]++)' file.csv
ID1,ID2,YEAR,LAT,LON,VALUE
A,B,2016,123,456,3
C,D,2015,223,456,3
This solution makes the same assumption as mentioned above, but in a more expanded version.
Neither awk solution will work if there is a , inside the values; in that case you could use csvtool to split the fields instead.
awk -F, '$NF==3 {unq[$1,$2,$3]=$0} END{for (i in unq) print unq[i]}' file1
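If you also want the header row in the output, as shown in the desired RESULT, and the first match kept in input order, a hedged variant of this expanded form prints each first occurrence immediately instead of collecting everything for an END loop:
awk -F, '
    NR==1 {print; next}                  # keep the header row
    $NF==3 && !(($1,$2,$3) in unq) {
        unq[$1,$2,$3]                    # remember this composite key
        print                            # first occurrence wins
    }
' file1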
