How to use bash to filter csv by column value and remove duplicates based on multiple columns

I'm trying to filter my CSV by the value of one column, and then remove duplicate rows based on the values of 2 columns. For the sake of simplicity, here's an example. I would like to remove duplicate rows based on columns ID1, ID2 and Year. I would also like to filter my results by only pulling back rows with "3" in the VALUE column.
ID1,ID2,YEAR,LAT,LON,VALUE
A,B,2016,123,456,3
A,B,2016,133,466,3
A,B,2016,122,446,3
C,D,2015,223,456,3
C,D,2015,241,455,3
A,B,2016,123,456,2
A,B,2016,133,466,2
A,B,2016,122,446,2
C,D,2015,223,456,2
C,D,2015,241,455,2
RESULT:
ID1,ID2,YEAR,LAT,LON,VALUE
A,B,2016,123,456,3
C,D,2015,223,456,3

You can use awk with an associative array whose key is a composite value comprising $1, $2 and $3 (the NR==1 term keeps the header line):
awk -F, 'NR==1 || ($NF==3 && !seen[$1,$2,$3]++)' file.csv
ID1,ID2,YEAR,LAT,LON,VALUE
A,B,2016,123,456,3
C,D,2015,223,456,3
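The !seen[$1,$2,$3]++ pattern prints a line only the first time its composite key appears: the array entry starts out at 0, so the negation is true, and the post-increment then marks the key as seen for any later duplicates. A minimal illustration of the idiom on a single column (just a sketch, not part of the original answer):
printf '%s\n' a b a c b | awk '!seen[$1]++'
a
b
c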

This solution makes the same assumptions as the one mentioned above, but in a more expanded version (note that it keeps the last row seen for each key and preserves neither the input order nor the header line):
awk -F, '$NF==3 {unq[$1,$2,$3]=$0} END{for (i in unq) print unq[i]}' file1
Neither awk solution will work if there is a comma inside the values themselves (quoted CSV fields); in that case you could use csvtool to separate the values instead.
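For example (a rough sketch, assuming csvtool is installed and that its cat subcommand and -u output-separator option behave as described in its man page; it also assumes the values contain no tabs): let csvtool re-emit the fields tab-separated, then run the same awk filter on that:
csvtool -u TAB cat file.csv |
awk -F'\t' 'NR==1 || ($NF==3 && !seen[$1,$2,$3]++)'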

Related

Update column in file based on associative array value in bash

So I have a file named testingFruits.csv with the following columns:
name,value_id,size
apple,1,small
mango,2,small
banana,3,medium
watermelon,4,large
I also have an associative array that stores the following data:
fruitSizes[apple] = xsmall
fruitSizes[mango] = small
fruitSizes[banana] = medium
fruitSizes[watermelon] = xlarge
Is there any way I can update the 'size' column within the file based on the data within the associative array for each value in the 'name' column?
I've tried using awk but I had no luck. Here's a sample of what I tried to do:
awk -v t="${fruitSizes[*]}" 'BEGIN{n=split(t,arrayval,""); ($1 in arrayval) {$3=arrayval[$1]}' "testingFruits.csv"
I understand this command would get the bash defined array fruitSizes, do a split on all the values, then check if the first column (name) is within the fruitSizes array. If it is, then it would update the third column (size) with the value found in fruitSizes for that specific name.
Unfortunately this gives me the following error:
Argument list too long
This is the expected output I'd like in the same testingFruits.csv file:
name,value_id,size
apple,1,xsmall
mango,2,small
banana,3,medium
watermelon,4,xlarge
One edge case I'd like to handle is the presence of duplicate values in the name column with different values for the value_id and size columns.
If you want to stick to an awk script, pass the array via stdin to avoid running into ARG_MAX issues.
Since your array is associative, listing only the values ${fruitSizes[@]} is not sufficient. You also need the keys ${!fruitSizes[@]}. pr -2 can pair the keys and values in one line.
This assumes that ${fruitSizes[@]} and ${!fruitSizes[@]} expand in the same order, and your keys and values are free of the field separator (, in this case).
printf '%s\n' "${!fruitSizes[@]}" "${fruitSizes[@]}" | pr -t -2 -s,
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' - testingFruits.csv
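If the ordering assumption makes you uneasy, a small variation (a sketch, not part of the original answer) is to emit each key,value pair explicitly with a loop, so keys and values can never get out of step:
for name in "${!fruitSizes[@]}"; do
    printf '%s,%s\n' "$name" "${fruitSizes[$name]}"
done |
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' - testingFruits.csv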
However, I'm wondering where the array fruitSizes comes from. If you read it from a file or something like that, it would be easier to leave out the array altogether and do everything in awk.

Parsing data from function in POSIX

I'm using POSIX. I have a function called get_data which returns:
4;Fix README;feature4;develop;URL5
2;Fix file3;feature2;develop;URL2
5;Fix README;feature2;develop;URL3
1;Fix file2;feature1;develop;URL1
I want to get to the URL (last part) of latest feature2 (based on the first index). In the above example, it will return URL3 because it has feature2 in the third field and 5 > 2 in the first field.
The first thing I tried is:
url=$(get_data | grep feature2)
But I don't like this solution because other lines can also contain feature2 in other fields. If it were Bash I would use BASH_REMATCH with a regex, but here I'm not sure what the most elegant way to get that URL is.
Is it possible to get some suggestion on how to do it?
Use awk:
url=$(get_data | awk -F";" '$3 == "feature2" && $1 > idx {idx=$1; url=$5} END {print url}')
After splitting each line into ;-delimited fields, save the fifth field from a line whose third field is the desired feature, if the index is greater than the one you last saved. Once you have checked each line, output the final value of url.
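A self-contained way to try this out (get_data is stubbed here with the sample lines from the question, purely for illustration):
get_data() {
    printf '%s\n' \
        '4;Fix README;feature4;develop;URL5' \
        '2;Fix file3;feature2;develop;URL2' \
        '5;Fix README;feature2;develop;URL3' \
        '1;Fix file2;feature1;develop;URL1'
}
url=$(get_data | awk -F";" '$3 == "feature2" && $1 > idx {idx=$1; url=$5} END {print url}')
echo "$url"    # prints URL3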
Using sort and awk, we can do
url=$(get_data | sort -t ";" -k1,1nr | awk -F";" '$3 == "feature2" {print $5; exit}')

Add column from one file to another based on multiple matches while retaining unmatched

So I am really new to this kind of stuff (seriously, sorry in advance) but I figured I would post this question since it is taking me some time to solve it and I'm sure it's a lot more difficult than I am imagining.
I have the file small.csv:
id,name,x,y,id2
1,john,2,6,13
2,bob,3,4,15
3,jane,5,6,17
4,cindy,1,4,18
and another file big.csv:
id3,id4,name,x,y
100,{},john,2,6
101,{},bob,3,4
102,{},jane,5,6
103,{},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
The problem is that I am attempting to put id2 from small.csv into the id4 column of big.csv, but only where the name AND x AND y match. I have tried using different awk and join commands in Git Bash but am coming up short. Again I am sorry for the newbie perspective on all of this but any help would be awesome. Thank you in advance.
EDIT: Sorry, this is what the final desired output should look like:
id3,id4,name,x,y
100,{13},john,2,6
101,{15},bob,3,4
102,{17},jane,5,6
103,{18},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
And one of the latest trials I did was the following:
$ join -j 1 -o 1.5,2.1,2.2,2.3,2.4,2.5 <(sort -k2 small.csv) <(sort -k2 big.csv)
But I received this error:
join: /dev/fd/63: No such file or directory
Probably not trivial to solve with join but fairly easy with awk:
awk -F, -v OFS=, '          # set input and output field separators to comma
    # create lookup table from lines of small.csv
    NR==FNR {
        # ignore header
        # map columns 2/3/4 to column 5
        if (NR>1) lut[$2,$3,$4] = $5
        next
    }
    # process lines of big.csv
    # if lookup table has mapping for columns 3/4/5, update column 2
    v = lut[$3,$4,$5] {
        $2 = "{" v "}"
    }
    # print (possibly-modified) lines of big.csv
    1
' small.csv big.csv >bignew.csv
Code assumes small.csv contains only one line for each distinct column 2/3/4.
NR==FNR { ...; next } is a way to process contents of the first file argument. (FNR is less than NR when processing lines from second and subsequent file arguments. next skips execution of the remaining awk commands.)
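For reference, the same program condensed to a one-liner (the commented script above with the comments stripped):
awk -F, -v OFS=, 'NR==FNR {if (NR>1) lut[$2,$3,$4]=$5; next} (v = lut[$3,$4,$5]) {$2 = "{" v "}"} 1' small.csv big.csv > bignew.csv
Run against the sample files, bignew.csv should match the expected output shown in the question.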

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub call parses AF=<number> out of the last field of the input and captures the number in capture group 1, which is then compared with 0.1.
PS: the +0 converts the parsed value to a number.
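If gensub is not available (it is GNU-specific), a rough POSIX-awk sketch of the same idea using match() and substr(), assuming AF= never starts the field (as in the sample data, where it is always preceded by a ;):
awk 'match($NF, /;AF=[^;]+/) {
    af = substr($NF, RSTART + 4, RLENGTH - 4)   # the text after ";AF="
    if (af + 0 > 0.1) print
}' file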
You could use awk with multiple delimiters (splitting on both ; and =, so the AF value lands in field 8 for this data) to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0.[1-9][0-9];' your_file.csv
You could add \+ after the second character group (or use grep -E and a plain +) to support additional digits (i.e. 0.NNNNN), but if the values could fall outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
I would use awk. Since awk supports alphanumerical comparisons you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields by ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields).

awk to sort lines in a file

I have a file which needs to be sorted on the basis of a column, and the column is a fixed-width column, i.e. from character 5 to 10.
example file:
0120456789bcdc hsdsjjlofk
01204567-9 __abc __hsdsjjjiejks
01224-6777 abcddd hsdsjjjpsdpf
012645670- abccccd hsdsjjjopp
I tried awk -v FIELDWIDTHS="4 10" '{print|"$2 sort -n"}' file but it does not give proper output.
You can use sort for this; -k1.5,1.10 restricts the sort key to characters 5 through 10 of the first field:
$ sort -k1.5,1.10 file
01224-6777 abcddd hsdsjjjpsdpf
01204567-9 __abc __hsdsjjjiejks
012645670- abccccd hsdsjjjopp
0120456789bcdc hsdsjjlofk
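If you do want to involve awk as in your attempt, one rough sketch (not part of the answer above) is to print the character-5-to-10 key in front of each line, sort on that key, and then strip it off again:
awk '{print substr($0, 5, 6), $0}' file | sort -k1,1 | cut -d' ' -f2-
The key is compared lexically, like the sort command above, so this should give the same order.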
