merge rows with identical column values

merge rows with identical column values - bash

I'm a beginner in bash. I have a .csv file. It contains 2 columns (Name and Num). Here is the content of columns:
Name,Num
ex1,10.20.30.40
ex2,20.30.40.30
ex3,10.45.60.20
ex4,10.20.30.40
ex5,20.30.40.30
I want to merge the rows whose 2nd column is identical. For example, here I have 2 rows whose 2nd column is "10.20.30.40". I want the output to look like this:
Name,Num
ex1 ex4,10.20.30.40
ex2 ex5,20.30.40.30
ex3,10.45.60.20
so the Name column of the first row contains both ex1 and ex4. I searched a lot, and all I found was how to sort lines based on their 2nd column:
echo $(awk -F ',' '{print $2}' name.csv | sort) >> sub2.csv
but it only sorts and prints the second column to "sub2.csv".
I also tried this script:
echo $(awk -F',' '{k=$2;if(a[k])a[k]=a[k] OFS $1;else{a[k]=$0;b[++i]=k}}
END{for(x=1;x<=i;x++)print a[b[x]]}' name.csv) >> sub2.csv
but the output is confusing (for example rows are not separated).
Would you please guide me about how to do this?

awk 'BEGIN{FS=","} NR==1{print;next} {a[$2]=$2 in a ? a[$2] " " $1 : $1} END{for(i in a) print a[i] "," i}' file
Output:
Name,Num
ex1 ex4,10.20.30.40
ex2 ex5,20.30.40.30
ex3,10.45.60.20
Derived from: https://stackoverflow.com/a/31283693/3776858
See: 4 Awk If Statement Examples ( if, if else, if else if, :? )
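One caveat: for (i in a) visits array keys in an unspecified order, so the merged rows are not guaranteed to come out in input order. If first-appearance order matters, one option is to record each key the first time it is seen. A minimal sketch, assuming the sample data from the question (the temporary file and the array names names and order are illustrative, not part of the original answer):

```shell
# Sample input from the question, written to a throwaway file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Name,Num
ex1,10.20.30.40
ex2,20.30.40.30
ex3,10.45.60.20
ex4,10.20.30.40
ex5,20.30.40.30
EOF

merged=$(awk -F, '
NR == 1 { print; next }                     # pass the header through unchanged
{
    if ($2 in names) {
        names[$2] = names[$2] " " $1        # append to an existing group
    } else {
        names[$2] = $1                      # start a new group
        order[++n] = $2                     # remember when this key first appeared
    }
}
END { for (i = 1; i <= n; i++) print names[order[i]] "," order[i] }
' "$tmp")

printf '%s\n' "$merged"
rm -f "$tmp"
```

The extra order array costs one entry per distinct Num value, which is usually negligible next to the names array itself.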

Related

How to merge set of columns based on a common field

How can I do the following in bash.
a) Take out columns 1,2,6
b) Row is identified by field 'packetId'; There can be one or 2 rows with same 'packetId'; if there are 2 rows with same packetId, then append the first row with the last field of the second row
c) If there is only one row for a 'packetId', then ignore that row and do not print it
Input
SequenceId,TimeStamp,packetId,size,secondaryid,eventType,randomfield,Source,Destination,SystemTime
1,3:41:24,1,100,xyz,event1,abc,S1,D1,1586989874
2,3:41:25,1,100,xyz,event2,abc,S1,D1,1586989877
3,3:41:26,2,100,xyz,event1,abc,S1,D1,1586989879
4,3:41:26,3,100,xyz,event1,abc,S1,D1,1586989871
5,3:41:26,3,100,xyz,event2,abc,S1,D1,1586989879
output
packetId,size,secondaryid,randomfield,Source,Destination,SystemTime,OtherSystemTime
1,100,xyz,abc,S1,D1,1586989874,1586989877
3,100,xyz,abc,S1,D1,1586989871,1586989879

You can do it all in awk, but it's simpler to use cut to first remove the fields you don't care about:
$ cut -d, -f3-5,7- input.csv |
awk -F, 'NR == 1 { print $0 ",OtherSystemTime"; next }
{ if ($1 in seen) print seen[$1] "," $NF; else seen[$1] = $0 }'
packetId,size,secondaryid,randomfield,Source,Destination,SystemTime,OtherSystemTime
1,100,xyz,abc,S1,D1,1586989874,1586989877
3,100,xyz,abc,S1,D1,1586989871,1586989879
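For comparison, here is roughly what the "all in awk" route could look like. This is only a sketch under the same assumptions as above (at most two rows per packetId, header on the first line); the helper function keep and the temporary file are made up for illustration:

```shell
# Sample input from the question, written to a throwaway file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
SequenceId,TimeStamp,packetId,size,secondaryid,eventType,randomfield,Source,Destination,SystemTime
1,3:41:24,1,100,xyz,event1,abc,S1,D1,1586989874
2,3:41:25,1,100,xyz,event2,abc,S1,D1,1586989877
3,3:41:26,2,100,xyz,event1,abc,S1,D1,1586989879
4,3:41:26,3,100,xyz,event1,abc,S1,D1,1586989871
5,3:41:26,3,100,xyz,event2,abc,S1,D1,1586989879
EOF

paired=$(awk -F, '
# Rebuild the current record without fields 1, 2 and 6.
function keep(   i, out) {
    out = $3
    for (i = 4; i <= NF; i++) if (i != 6) out = out "," $i
    return out
}
NR == 1      { print keep() ",OtherSystemTime"; next }   # header row
($3 in seen) { print seen[$3] "," $NF; next }            # second row for this packetId
             { seen[$3] = keep() }                       # first row: remember it
' "$tmp")

printf '%s\n' "$paired"
rm -f "$tmp"
```

The cut version is shorter; the awk-only version mainly saves a process and keeps the column selection and the pairing logic in one place.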

bash: output contents of diff function into 2 columns

I have a file that looks like so:
file1.txt
rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12335
rs4687576 5353566
file2.txt
rs13339951 45007956
rs2838331 45026728
rs5647 12335
rs4687576:ATCFHF 5353566
More descriptions:
Some of the values in column1 are identical between the 2 files but not all
The values in column2 are all identical between the 2 files
I want to identify the rows for which the values in column1 differ between the 2 files, i.e. rows 1 and 4 in my example. I can do this with diff file1.txt file2.txt.
However, I would like to obtain an end file like the one below. I aim to use sed to replace the names of one file in the other so that both files match completely.
rs13339951:45007956:T:C rs13339951
rs4687576 rs4687576:ATCFHF

awk is perfect for this
awk 'FNR==NR {a[$2]=$1; next} a[$2]!=$1 {print a[$2] " " $1}' file1 file2
outputs
rs13339951:45007956:T:C rs13339951
rs4687576 rs4687576:ATCFHF
We are passing two files to awk. It will pass over them consecutively.
FNR==NR {.... next} { ... }
With this "trick" the first action is executed for the first file and the second action is executed for the second file.
a[$2]=$1
A key-value lookup table: the second column is the key and the first column is the value. We build this lookup table while reading the first file.
a[$2]!=$1 {print a[$2] " " $1}
While iterating over the second file, compare the current first column with the value in the lookup table. If they do not match print the desired output.
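One thing to be aware of: if file2 contains a key that never appeared in file1, a[$2] is empty and the line is still printed with a blank first column. A slightly more defensive sketch, run here on the sample data from the question (the temporary directory and the array name name are illustrative), only reports keys present in both files:

```shell
# Sample files from the question, written to a throwaway directory.
dir=$(mktemp -d)
cat > "$dir/file1.txt" <<'EOF'
rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12335
rs4687576 5353566
EOF
cat > "$dir/file2.txt" <<'EOF'
rs13339951 45007956
rs2838331 45026728
rs5647 12335
rs4687576:ATCFHF 5353566
EOF

diffs=$(awk '
FNR == NR { name[$2] = $1; next }        # first file: map column 2 -> column 1
($2 in name) && name[$2] != $1 {         # second file: known key, different name
    print name[$2], $1
}
' "$dir/file1.txt" "$dir/file2.txt")

printf '%s\n' "$diffs"
rm -rf "$dir"
```

Like the original, this assumes the values in column 2 are unique within file1; with duplicates, later rows would overwrite earlier ones in the lookup table.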

Insert column delimiters before pattern in a sorted file on a mac

I have a resulting file which contains values from different XML files.
The file has 5 columns separated by ";" when all patterns matched.
First column = neutral Index
Second column = specific Index1
Third column = file does contain Index1
Fourth column = specific Index2
Fifth column = file does contain Index2
Lines that match only the Index2 pattern should also have 5 columns, with their values in the last two columns, as in the first two lines.
The sorted file looks like:
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;AAA.2F1;file_C
BBB;BBB.2G1;file_D
CCC;CCC.1B1;file_H
YYY;YYY.2M1;file_N
The desired result would be:
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;;;AAA.2F1;file_C
BBB;;;BBB.2G1;file_D
CCC;CCC.1B1;file_H;;
YYY;;;YYY.2M1;file_N
If you have any idea/hint, your help is appreciated! Thanks in advance!

Updated Answer
In the light of the updated requirement, I think you want something like this:
awk -F';' 'NF==3 && $2~/\.1/{$0=$0 ";;"}
NF==3 && $2~/\.2/{$0=$1 ";;;" $2 ";" $3} 1' file
which can be written as a one-liner:
awk -F';' 'NF==3 && $2~/\.1/{$0=$0 ";;"} NF==3 && $2~/\.2/{$0=$1 ";;;" $2 ";" $3} 1' YourFile
Original Answer
I would do that with awk:
awk -F';' 'NF==3{$0=$1 ";;;" $2 ";" $3}1' YourFile
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;;;AAA.2F1;file_C
BBB;;;BBB.2G1;file_D
YYY;;;YYY.2M1;file_N
That says: "run awk on YourFile using ';' as the field separator. If there are only 3 fields on a line, rebuild the line from the existing first field, three semicolons, and then the other two fields. The 1 at the end means print the current line".
If you don't use awk much, NF refers to the number of fields, $0 refers to the entire current line, $1 refers to the first field on the line, $2 refers to the second field etc.
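To see the updated logic in action, here is a small sketch that runs it on the sample lines from the question. It is written as a single NF == 3 rule with an if/else instead of two separate rules, which should behave the same on this data; the temporary file is illustrative:

```shell
# Sample input from the question, written to a throwaway file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;AAA.2F1;file_C
BBB;BBB.2G1;file_D
CCC;CCC.1B1;file_H
YYY;YYY.2M1;file_N
EOF

padded=$(awk -F';' '
NF == 3 {
    if ($2 ~ /\.1/)      $0 = $0 ";;"                    # Index1 only: pad on the right
    else if ($2 ~ /\.2/) $0 = $1 ";;;" $2 ";" $3         # Index2 only: shift to columns 4-5
}
{ print }                                                # print every line, changed or not
' "$tmp")

printf '%s\n' "$padded"
rm -f "$tmp"
```

Lines that already have 5 fields skip the NF == 3 rule entirely and are printed as they are.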

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.

Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
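Since the question has no sample data, here is a quick sketch on made-up input (the column names id, label and score and the rows are invented purely for illustration). This variant keeps only the first column of the header as well, so the output stays a single column:

```shell
# Hypothetical tab-separated input with a header line.
tmp=$(mktemp)
printf 'id\tlabel\tscore\n' >  "$tmp"
printf 'a\tx\t8\n'          >> "$tmp"
printf 'b\ty\t3\n'          >> "$tmp"
printf 'c\tz\t8\n'          >> "$tmp"

# Print column 1 for the header line and for every row whose 3rd column is 8.
firsts=$(awk -F'\t' 'NR == 1 || $3 == 8 { print $1 }' "$tmp")

printf '%s\n' "$firsts"
rm -f "$tmp"
```

Because the field looks like a number, awk compares $3 == 8 numerically, so values such as 8.0 would match too.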

awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1

You can do it with bash only too:
cat x | while read -r y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. The IFS (internal field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third element of the array is then compared to '8'. If they are equal, the first element of the array is printed. Remember that array indices start at zero, and that an element must be written as ${split[0]}; a bare $split[0] would print the first element followed by a literal [0].
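A variation on the same idea that avoids the array and works in plain POSIX sh is to let read do the splitting directly. This is only a sketch on invented sample rows; the variable names first, second and third are arbitrary:

```shell
# Hypothetical tab-separated input (no header).
tmp=$(mktemp)
printf 'a\tx\t8\nb\ty\t3\nc\tz\t8\n' > "$tmp"

tab=$(printf '\t')                       # a literal tab character, portably
matches=$(while IFS=$tab read -r first second third; do
    # third holds the 3rd column (plus any further columns, if present)
    if [ "$third" = 8 ]; then
        printf '%s\n' "$first"
    fi
done < "$tmp")

printf '%s\n' "$matches"
rm -f "$tmp"
```

Note that tab counts as IFS whitespace, so consecutive tabs (empty columns) are collapsed; for files with empty fields, awk -F'\t' is the safer choice.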

Merge two files in linux with different column

I have two files in linux, the first file has 4 columns and the second has 2 columns. I want to merge these files into a new file that has the first 3 columns from file 1 and the first column from file 2. I tried awk, but my data from file 2 was placed under file 1.

paste file1 file2 | awk '{print $1,$2,$3,$5}'

Not sure which columns you want from each file, but something like this should work:
paste <file1> <file2> | awk '{print $1,$2,$3,$5}'
The first three columns would be picked from file1, and the fourth skipped, then pick the first column from the second file.
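A small sketch of what that does, using made-up whitespace-separated files (the contents below are invented for illustration). It relies on file1 having exactly 4 columns, so that the first column of file2 ends up as field 5 after paste:

```shell
# Hypothetical inputs: file1 has 4 columns, file2 has 2.
dir=$(mktemp -d)
printf 'a1 a2 a3 a4\nb1 b2 b3 b4\n' > "$dir/file1"
printf 'x1 x2\ny1 y2\n'             > "$dir/file2"

# paste joins matching lines with a tab; awk then splits on any whitespace.
merged=$(paste "$dir/file1" "$dir/file2" | awk '{ print $1, $2, $3, $5 }')

printf '%s\n' "$merged"
rm -rf "$dir"
```

If some lines of file1 could have fewer or more than 4 columns, the $5 index would point at the wrong field, so this is best suited to strictly regular input.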

If the files have the same number of rows, you can do something like:
awk '{ getline v < "file2"; split( v, a ); print a[2], $1, $3 }' file1
to print column 2 from file2, followed by columns 1 and 3 from file1.
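Adapted to what the question asks for (first 3 columns of file1 plus the first column of file2), the getline approach might look like the sketch below. The sample files are invented, and other and line are just illustrative names; the return value of getline is checked so the loop stops cleanly if file2 is shorter than file1:

```shell
# Hypothetical inputs: file1 has 4 columns, file2 has 2.
dir=$(mktemp -d)
printf 'a1 a2 a3 a4\nb1 b2 b3 b4\n' > "$dir/file1"
printf 'x1 x2\ny1 y2\n'             > "$dir/file2"

merged=$(awk -v other="$dir/file2" '
{
    # Read the matching line of the second file; stop if it runs out.
    if ((getline line < other) <= 0) exit
    split(line, f)                       # f[1] is the first column of file2
    print $1, $2, $3, f[1]
}' "$dir/file1")

printf '%s\n' "$merged"
rm -rf "$dir"
```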

You can also do this without the paste command, by reading file2 first and then appending its first column to each line of file1:
awk 'NR==FNR {c[FNR]=$1; next} {print $1, $2, $3, c[FNR]}' file2 file1 > mergedfile
