Merge 2 files with common attributes - bash

I have two files and I would like to merge them. The files are like below:
File 1:
8870:0/28,13
8870:2/22,1
8870:2/25,3
887:3/29,1
886:1/40,4
886:1/41,2
886:1/43,4
File 2:
8870:0,16
8870:2,9
887:3,5
886:1,31
Output:
8870:0,16,13
8870:2,9,4
887:3,5,1
886:1,31,10
In other words, I want the output to be file2 with an added column: for each key x:y, the sum of the w values from file1's lines of the form x:y/z,w.

"I wonder how can I manage this task." Do not wonder, be curious and read as you can about sed and awk. Take what it follow as an hint.
First convert the format of file1 to the format of file2 discarding what is not needed.
sed 's/\/.*\,/\,/' file1.txt # here you erase what is between the `/` and the `,`
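For the sample file1 above, that sed pass prints:
8870:0,13
8870:2,1
8870:2,3
887:3,1
886:1,4
886:1,2
886:1,4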
Then process it with awk and associative arrays:
sed 's/\/.*\,/\,/' file1.txt \
| awk -F ',' '{A[$1]=A[$1]+$2}END{for (b in A) print b","A[b]}' > file1b.txt
| pipes the output of one command into the next
\ allows the command to continue on the next line (no other characters may follow it)
> file1b.txt redirects it all to a new file, if you want one
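With the sample data, file1b.txt holds the per-key sums (the for (b in A) loop visits keys in an unspecified order):
8870:0,13
8870:2,4
887:3,1
886:1,10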
Now you can again use awk's associative arrays on the two files file2.txt and file1b.txt (you want to append to file2.txt's values, so list that file first):
awk -F ',' '{if (A[$1]=="" ) {A[$1]=A[$1]$2} \
else {A[$1]=A[$1]","$2}}END{for (b in A) print b","A[b]}' \
file2.txt file1b.txt | sort -nr
The final | sort -nr sorts the output in reverse (-r) numeric (-n) order.
Note that you do not need to create file1b.txt at all; everything can run as one pipeline:
#!/bin/bash
( \
cat file2.txt ; \
sed 's/\/.*\,/\,/' file1.txt | \
awk -F ',' '{A[$1]=A[$1]+$2}END{for (b in A) print b","A[b]}' ;
) | \
awk -F ',' '{if (A[$1]=="" ) {A[$1]=A[$1]$2} else {A[$1]=A[$1]","$2}} \
END{for (b in A) print b","A[b]}'| sort -nr
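Since every field in both files is delimited by either / or ,, the same task can also be done in a single awk pass by splitting on both characters. A minimal sketch, not part of the original answer (it keeps file2's line order instead of sorting):
awk -F '[/,]' 'NR==FNR { sum[$1] += $3; next } { print $0 "," sum[$1] }' file1.txt file2.txt
While reading file1.txt (NR==FNR), $1 is the x:y key and $3 is w, so the sums accumulate; while reading file2.txt, each line is echoed with its key's sum appended.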


Extracting unique columns from a file into a comma separated list with a particular order

I have a .csv file with these values
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
and I want to make a file containing the first-column values as one comma-separated row in sorted order, except that "negative" always goes at the end.
So I want
["brand","positive","product","negative"]
I was not able to automate this process so what I did was
awk -F ',' '{print $1}' file.csv | sort | uniq > file2.txt
awk '{if(NR>1) printf ", ";printf("\"%s\"",$0)} END {print ""}' file2.txt > file3.txt
I get "brand","negative","positive","product"
Then I manually move "negative" to the end and also append [ and ] to front and back to get
["brand","positive","product","negative"]
Is there a way to make it more efficient and automate the process?
Another solution, with easy-to-understand steps:
$ awk -F, '{print ($1=="negative"?1:0) "\t\"" $1 "\""}' file | # mark negatives
sort | cut -f2 | uniq | # sort, cut, uniq
paste -sd, | sed 's/^/[/;s/$/]/' # serialize, add brackets
["brand","positive","product","negative"]
Here is a single GNU awk (gawk) command to make it work:
awk -F, '{
    a[$1] = ($1 == "negative" ? "~" : "") $1
}
END {
    n = asort(a)
    printf "["
    for (i = 1; i <= n; i++) {
        sub(/^~/, "", a[i])
        printf "\"%s\"%s", a[i], (i < n ? ", " : "]\n")
    }
}' file.csv
["brand", "positive", "product", "negative"]
There are lots of ways to approach this. Do you really want the result as what looks like a JSON array, with square brackets and quotation marks around the column names? If so, then jq is probably a good tool to use to generate it. Something like this will do it all as a single jq program:
jq -csR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
unique_by(if .=="negative" then "zzzz" else . end)' file.csv
Which outputs this:
["brand","positive","product","negative"]
If you just want the headings separated by commas in a line without the other punctuation, suitable for heading up a CSV file, you can use more traditional text-manipulation commands:
cut -d, -f1 file.csv |
sed 's/negative/zzz&/' |
sort -u |
sed 's/zzz//' |
paste -d, -s -
Or you can slightly modify the jq command by adding the -r flag and another pipe at the end:
jq -csrR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
unique_by(if .=="negative" then "zzzz" else . end)|
join(",")' file.csv
Either of which outputs this:
brand,positive,product,negative
Using a Perl one-liner:
$ cat unique.txt
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
$ perl -F, -lane ' { $x=$F[0];$x=~s/^(negative)/z\1/g;$rating{$x}++ } END {$q="\x22";$y=join("$q,$q",sort keys %rating) ; $y=~s/${q}z/$q/g; print "[$q$y$q]" }' unique.txt
["brand","positive","product","negative"]
$
This worked for me:
cut -d, -f1 file.csv | sort -u | sed "/^negative/d" | tr '\n' ',' | sed -e 's/^/["/' -e 's/,/","/g' -e 's/$/negative"]/'
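For completeness, a variant that avoids the sentinel trick altogether by filtering "negative" out and appending it afterwards; a sketch assuming "negative" actually occurs in the input (it is appended unconditionally):
{ cut -d, -f1 file.csv | grep -vx negative | sort -u; echo negative; } |
sed 's/.*/"&"/' | paste -sd, | sed 's/^/[/;s/$/]/'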

How to merge two files with the same column value in bash

I have 2 csv files, these are their contents.
file1(23 fields)
data11,data12,ID1,data14...
data21,data22,ID2,data24...
data31,data32,ID3,data34...
file2 (22 fields)
ID1,value12,value13,...
ID1,value22,value23,...
ID1,value32,value33,...
ID2,value42,value43,...
ID3,value52,value53,...
The output should be...
OUTPUT:
data11,data12,ID1,data14,...,value12,value13
data11,data12,ID1,data14,...,value22,value23
data11,data12,ID1,data14,...,value32,value33
data21,data22,ID2,data24,...,value42,value43
data31,data32,ID3,data34,...,value52,value53
Can anyone help me to get this output using awk or any bash built-ins?
Thanks!
You can use join. Specify the column order required for the output after -o, e.g. 1.1 refers to the 1st column of the 1st file (file1). It is also required to pre-sort the input files:
join -t "," -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.2,2.3
<( sort -t "," -k3 /tmp/file1 ) <( sort -t "," -k1 /tmp/file2 )
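With 23 and 22 columns, spelling out every field after -o gets tedious. If you have GNU join, -o auto emits the join field followed by the remaining fields of both files (note this moves the ID to column 1); a hedged sketch:
join -t, -1 3 -2 1 -o auto <( sort -t, -k3,3 /tmp/file1 ) <( sort -t, -k1,1 /tmp/file2 )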
Sorry, my fault for misunderstanding your problem. Try the following command; it should be what you want:
for line1 in `cat file1`;do id=`echo $line1|awk -F ',' '{print $3}'`;\
awk -v id=$id -v line1=$line1 -F ',' '($1==id){print line1","$0}' file2;done
The output of this command is:
data11,data12,ID1,data14...,ID1,value12,value13,...
data11,data12,ID1,data14...,ID1,value22,value23,...
data11,data12,ID1,data14...,ID1,value32,value33,...
data21,data22,ID2,data24...,ID2,value42,value43,...
data31,data32,ID3,data34...,ID3,value52,value53,...
And if you don't want the repeated ID* column, you can do it like this:
for line1 in `cat file1`;do id=`echo $line1|awk -F ',' '{print $3}'`;\
awk -v id=$id -v line1=$line1 -F ',' '($1==id){printf "%s",line1;\
for(i=2;i<NF;i++) printf ",%s",$i;print ","$NF}' file2;done
This won't print the ID* column from file2:
data11,data12,ID1,data14...,value12,value13,...
data11,data12,ID1,data14...,value22,value23,...
data11,data12,ID1,data14...,value32,value33,...
data21,data22,ID2,data24...,value42,value43,...
data31,data32,ID3,data34...,value52,value53,...
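The shell loop above word-splits each line and re-scans file2 once per line of file1. A single-pass awk sketch of the same join (assuming each ID appears only once in file1's 3rd column, as in the sample):
awk -F, 'NR==FNR { a[$3] = $0; next }
$1 in a { id = $1; sub(/^[^,]*,/, ""); print a[id] "," $0 }' file1 file2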
---------- wrong answer, prior to the update ----------
You can use the paste command to join related lines of different files; see man paste for detailed usage:
https://www.computerhope.com/unix/upaste.htm

how to find matching records from 3 different files in unix

I have 3 different files.
Test1.txt , Test2.txt & Test3.txt
Test1.txt contains
JJTP#yahoo.com
BBMU#ssc.com
HK#glb.com
Test2.txt contains
SFTY#gmail.com
JJTP#yahoo.com
Test3.txt contains
JJTP#yahoo.com
HK#glb.com
I would like to see only matching records in these 3 files.
so the matching records in above example will be JJTP#yahoo.com
The output should be
JJTP#yahoo.com
If you don't have duplicate lines in each file then:
$ awk '++a[$1]==3' test[1-3]
JJTP#yahoo.com
Here is an awk that mixes jaypal's and sudo_o's solutions.
It will not give false positives, since it tests each line for uniqueness within its file:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==3' test*
JJTP#yahoo.com
If you have an unknown number of files, this could be an option:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==ARGC-1' test*
ARGC stores the number of files read by awk, plus 1 (for the program name).
comm lists common lines for two files. Just find the common lines in the first two files, then pipe the output to comm again and find the common lines with the third file.
comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
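To generalize the comm chain to any number of files, you can fold the intersection in a loop; a sketch, not from the original answer:
result=$(sort Test1.txt)
for f in Test2.txt Test3.txt; do
    result=$(comm -12 <(printf '%s\n' "$result") <(sort "$f"))
done
printf '%s\n' "$result"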
Here is how you'd do it with awk:
awk '
FILENAME == ARGV[1] { a[$0]++ }
FILENAME == ARGV[2] && ($0 in a) { b[$0]++ }
FILENAME == ARGV[3] && ($0 in b)' file1 file2 file3
Output:
JJTP#yahoo.com
To find the common lines in two files, you can use:
sort Test1.txt Test2.txt | uniq -d
Or, if you wish to preserve the order found in Test1.txt, you may use:
while read x; do grep -w "$x" Test2.txt; done < Test1.txt
For three files, repeat this:
sort Test1.txt Test2.txt | uniq -d | sort - Test3.txt | uniq -d
Or:
cat Test1.txt |\
while read x; do grep -w "$x" Test2.txt; done |\
while read x; do grep -w "$x" Test3.txt; done
The sort method assumes that the files themselves don't have duplicate lines; if they do, you may need to create temporary files.
If you wish to use sed rather than grep, try sed -n "/^$x$/p".
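Another portable idiom, not among the answers above: grep with -F (fixed strings), -x (whole-line match) and -f (read patterns from a file), chaining the intersection through a pipe:
grep -Fxf Test1.txt Test2.txt | grep -Fxf - Test3.txt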

AWK: Compare two CSV files

I have two CSV files and I want to compare them using AWK and generate a new file.
file1.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc123","C:/pro/xyz"
"abc124","C:/pro/in"
file2.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc125","C:/pro/xyz"
"abc126","C:/pro/in"
output.csv:
"file1","file2","Diff"
"abc121","abc121","Match"
"abc122","abc122","Match"
"abc123","","Unmatch"
"abc124","","Unmatch"
"","abc125","Unmatch"
"","abc126","Unmatch"
One way with awk:
script.awk:
BEGIN {
    FS = ","
}
# first file (file1.csv): remember each id, skipping the header line
NR > 1 && NR == FNR {
    a[$1] = $2
    next
}
# second file (file2.csv): ids present in both files are a Match; consume them
FNR > 1 {
    print ($1 in a) ? $1 FS $1 FS "Match" : "\"\"" FS $1 FS "Unmatch"
    delete a[$1]
}
# anything left in a[] appeared only in file1.csv
END {
    for (x in a) {
        print x FS "\"\"" FS "Unmatch"
    }
}
Output:
$ awk -f script.awk file1.csv file2.csv
"abc121","abc121",Match
"abc122","abc122",Match
"","abc125",Unmatch
"","abc126",Unmatch
"abc124","",Unmatch
"abc123","",Unmatch
I didn't use awk alone, but if I understood the gist of what you're asking correctly, I think this long one-liner should do it...
join -t, -a 1 -a 2 -o 1.1,2.1,1.2,2.2 file1.csv file2.csv | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e '1d' -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
Description:
The join portion takes the two CSV files, joins them on the first column (the default behavior of join) and outputs all four fields (-o 1.1,2.1,1.2,2.2), making sure to include unmatched rows from both files (-a 1 -a 2).
The awk portion takes that output and maps the combination of the 3rd and 4th columns to either "Match" or "Unmatch", based on whether they do in fact match. I had to make an assumption about this behavior based on your example.
The sed portion deletes the "no","loc" header from the output (-e '1d') and replaces empty fields with open-close quote marks (-e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'). This last part might not be necessary for you.
EDIT:
As tripleee points out, the above fails if the two initial files are unsorted. Here's an updated command to fix that. It drops the header line and sorts each file before passing them to join...
join -t, -a 1 -a 2 -o 1.1,2.1,1.2,2.2 <( sed 1d file1.csv | sort ) <( sed 1d file2.csv | sort ) | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'

Add a column to any position in a file in unix [using awk or sed]

I'm looking for other alternatives / a more intelligent one-liner for the following command, which should add a value at a requested column number.
The following sed command works properly for adding the value 4 as the 4th column.
[Need: I have a file containing 1000 records, and many times I need to add a column in between, at any position.]
My approach is suitable for small scale only.
cat 1.txt
1|2|3|5
1|2|3|5
1|2|3|5
1|2|3|5
sed -i 's/1|2|3|/1|2|3|4|/g' 1.txt
cat 1.txt
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
Thanks in advance.
Field Separators
http://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html
String Concatenation
http://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
Default pattern and action
http://www.gnu.org/software/gawk/manual/html_node/Very-Simple.html
awk -v FS='|' -v OFS='|' '{$3=$3"|"4} 1' 1.txt
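The trailing 1 is a pattern that is always true, so the default action (print) runs, and reassigning $3 makes awk rebuild the record with OFS. The same idea with the column and value as variables (names mine); note it appends the value after the given column:
awk -v col=3 -v val=4 'BEGIN { FS = OFS = "|" } { $col = $col OFS val } 1' 1.txt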
One way using awk. Pass two variables to the script: the column number and the value to insert. The script increments the number of fields (NF) and shifts fields from the last one down to the indicated position, then inserts the new value there.
Run this command:
awk -v column=4 -v value="four" '
BEGIN {
    FS = OFS = "|";
}
{
    for ( i = NF + 1; i > column; i-- ) {
        $i = $(i-1);
    }
    $i = value;
    print $0;
}
' 1.txt
With following output:
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
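The position is arbitrary: running the same script with -v column=1 -v value=zero shifts every field right and prepends, producing zero|1|2|3|5 on each line.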
One way using coreutils and process substitution:
f=1.txt
paste -d'|' \
<(cut -d'|' -f1-3 $f ) \
<(yes 4 | head -n`wc -l < $f`) \
<(cut -d'|' -f4- $f )
Or, hard-coding the position as in the question, simply with sed:
sed 's/3|/3|4|/' 1.txt
