How can I merge two files smartly with a unique key? - shell

I have two files, like these:
File1:
A,Content1
B,Content2
C,Content3
File2:
D,Content4
E,Content5
B,Content6
Some keys appear in both file1 and file2. Can I merge the two files smartly so that the result file looks like this:
A,Content1
B,Content2
C,Content3
D,Content4
E,Content5

You should be able to accomplish this with a single sort:
sort -t',' -k1,1 -u file1 file2
This sets the field separator to a comma, then sorts and deduplicates on the first field only.
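For the sample files in the question, this should produce:
A,Content1
B,Content2
C,Content3
D,Content4
E,Content5
POSIX does not specify which of the duplicate-key lines survives; in practice GNU sort keeps the first one it reads, so listing file1 first makes B,Content2 win over B,Content6.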

If your files aren't too big (which is what I'll assume from your use of shell script):
#!/bin/bash
keys=$(cut -d',' -f1 "$@" | sort -u)
for key in $keys
do
    grep -h "^$key," "$@" | head -1
done
Basically:
extract the keys (the text before the first comma)
find the first occurrence of each key in the files (that's the head -1)
A sample invocation is shown below.
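A hypothetical invocation, assuming the script above is saved as merge.sh (the name is just a placeholder):
chmod +x merge.sh
./merge.sh file1 file2 > merged.csv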

Related

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So it's like the first column has colors; I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green, etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
gets unique values in field 1, replacing 1 by 2 will give you unique values in field 2.
Avoiding a UUOC (useless use of cat) :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique occurrences you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
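If you instead want a count of each value (as in the question's update), uniq -c can replace the final wc -l:
cut -f 1 input_file | sort | uniq -c
The awk one-liner below computes the same per-value counts in a single pass.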
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count how many unique values there are, you can pipe the unique list into wc -l.
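For example, counting the distinct values in the first column of test.txt:
awk < test.txt '{print $1}' | sort | uniq | wc -l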
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
# number of columns = number of tabs in the header line + 1
cols=$(( $(sed -n 1p "$FILE" | tr -cd '\t' | wc -c) + 1 ))
for ((i=1; i <= cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
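A hypothetical run, assuming the script is saved as colsummary.sh and data.tsv is your tab-separated file (both names are placeholders):
chmod +x colsummary.sh
./colsummary.sh data.tsv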
This script outputs, for each column of a given file, every unique value together with its count. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter.
Code
#!/bin/bash
# Note: uses awk arrays of arrays, which requires GNU awk (gawk) 4.0 or later.
awk '
(NR==1){
    # header line: remember each column name
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){
    # data lines: count every value per column
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    # one output line per column: name, then value_count pairs
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

Compare csv files based on column value

I have two large csv files:
File1.csv
id,name,code
1,dummy,0
2,micheal,3
5,abc,4
File2.csv
id,name,code
2,micheal,4
5,abc,4
1,cd,0
I want to compare two files based on id and if any of the columns are mismatched, I want to output those rows.
For example, for id 1 the name is different and for id 2 the code is different, so the output should be:
output
1,cd,0
2,micheal,4
And yes, both files will have the same ids, though possibly in a different order.
I want to write a script that can give me the above output.
If you need the rows of File2 that have no identical counterpart in File1, you can use Miller and this simple command:
mlr --csv join --np --ur -j id,name,code -f File1.csv File2.csv >./out.csv
In output you will have
+----+---------+------+
| id | name    | code |
+----+---------+------+
| 2  | micheal | 4    |
| 1  | cd      | 0    |
+----+---------+------+
awk -F, 'NR==FNR && FNR!=1 { map[$0]=1;next } FNR!=1 { if ( !map[$0] ) { print } }' File1.csv File2.csv
Set the field separator to comma. For the first file (NR==FNR), build an array map keyed by the whole line, skipping the header (FNR!=1). Then for the second file, print any non-header line that has no entry in map.
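For the sample files, this should print the File2 rows that differ:
2,micheal,4
1,cd,0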
The tool of choice for finding differences between files is, of course, diff. Here, it doesn't really matter if these files are comma-separated or in some other format because you're really only interested in lines that differ.
Knowing that both files contain the same IDs makes this quite easy, although the fact that they will not necessarily be in the same order requires sorting them both first.
In your example, you want as output the lines from File2 so running the diff output through a grep for ^> will give you that.
Finally, let's get rid of the two additional characters at the beginning of the output lines that will have been inserted by diff, using cut:
diff <(sort File1.csv) <(sort File2.csv) | grep '^>' | cut -c3-
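With the sample files, this should output:
1,cd,0
2,micheal,4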

Filter records from one file based on values present in another file using Unix

I have an input csv file:
Input feed
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
There is an error csv file, generated from this input file, which contains the primary key:
Error File
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
I want to extract all the records from the input file for which there is a primary key entry in the error file, and save them into a new file.
Basically my new file should look like this:
New Input feed
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
I am a beginner in Unix and I have tried the awk command.
The approach I have tried is to get all the primary key values into a file:
awk -F"," '{print $1}' error.csv >> error_pk.csv
Now I need to filter the records from input.csv for all the primary key values present in error_pk.csv.
Using awk. As there is a leading space in the error file, it needs to be trimmed off first; I'm using sub for that. Then, since the titles of the first column are not identical (PK vs Pk), that needs to be handled separately with FNR==1:
$ awk -F, ' # set separator
NR==FNR { # process the first file
sub(/^ */,"") # trim leading space
a[$1] # hash the first column
next
}
FNR==1 || ($1 in a)' error input # output the header record and the records whose key was hashed
Output:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
You can use join.
First remove everything after the comma from the second file.
Join on the first field from both files
cat <<EOF >file1
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat <<EOF >file2
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF
join -t, -11 -21 <(sort -k1 file1) <(cut -d, -f1 file2 | sort -k1)
If you need the file to be sorted according to file1, you can number the lines in first file, join the files, re-sort using the line numbers and then remove the numbers from the output:
join -t, -12 -21 <(nl -w1 -s, file1 | sort -t, -k2) <(cut -d, -f1 file2 | sort -k1) |
sort -t, -k2 | cut -d, -f1,3-
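With the heredoc sample files, this order-preserving variant should reproduce the desired output, header first:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1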
You can use grep -f with a file of search patterns, cutting each key off at the first comma:
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1
When you want a header in the output as well, print file1's header line separately before the filtered rows.
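A minimal sketch, assuming the same file1/file2 heredoc names as above (tail -n +2 skips the error file's own header so it does not become a search pattern):
head -n 1 file1
grep -Ef <(tail -n +2 file2 | sed -r 's/([^,]*).*/^\1,/') file1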

Comparing output from two greps

I have two C source files with lots of defines and I want to compare them to each other and filter out lines that do not match.
The grep (grep NO_BCM_ include/soc/mcm/allenum.h | grep -v 56440) output of the first file may look like:
...
...
# if !defined(NO_BCM_5675_A0)
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
...
...
where the grep output (grep "define NO_BCM" include/sdk_custom_config.h) of the second file looks like:
...
...
#define NO_BCM_56260_B0
#define NO_BCM_5675_A0
#define NO_BCM_56160_A0
...
...
So now I want to find any type number in the parentheses above that is missing from the #defines below. How do I best go about this?
Thank you
You could use awk with two process substitutions feeding it the grep outputs:
awk 'FNR==NR{seen[$2]; next}!($2 in seen)' FS=" " <(grep "define NO_BCM" include/sdk_custom_config.h) FS="[()]" <(grep NO_BCM_ include/soc/mcm/allenum.h | grep -v 56440)
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
The idea is that the commands within <() execute and produce the output as needed. Setting FS before each input ensures the common token is parsed with the proper delimiter.
FS="[()]" captures $2 as the unique token in the second group, and FS=" " keeps the default whitespace delimiting for the first group.
The core awk logic identifies non-repeating elements: FNR==NR parses the first group, storing the unique $2 entries in a hash map. Once all of those lines are parsed, !($2 in seen) runs on the second group, printing only the lines whose $2 is not present in the hash.
Use comm this way:
comm -23 <(grep NO_BCM_ include/soc/mcm/allenum.h | cut -f2 -d'(' | cut -f1 -d')' | sort) <(grep "define NO_BCM" include/sdk_custom_config.h | cut -f2 -d' ' | sort)
This would give tokens unique to include/soc/mcm/allenum.h.
Output:
NO_BCM_2801PM_A0
NO_BCM_88660_A0
If you want the full lines from that file, then you can use fgrep:
fgrep -f <(comm -23 <(grep NO_BCM_ include/soc/mcm/allenum.h | cut -f2 -d'(' | cut -f1 -d')' | sort) <(grep "define NO_BCM" include/sdk_custom_config.h | cut -f2 -d' ' | sort)) include/soc/mcm/allenum.h
Output:
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
About comm:
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
It's hard to say without the surrounding context from your sample input files and without expected output, but it sounds like this is all you need:
awk '!/define.*NO_BCM_/{next} NR==FNR{defined[$2];next} !($2 in defined)' include/sdk_custom_config.h FS='[()]' include/soc/mcm/allenum.h
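For the grep samples shown above, the output should match the earlier answers:
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)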

Match and merge lines based on the first column

I have 2 files:
File1
123:dataset1:dataset932
534940023023:dataset:dataset039302
49930:dataset9203:dataset2003
File2
49930:399402:3949304:293000232:30203993
123:49030:1204:9300:293920
534940023023:49993029:3949203:49293904:29399
and I would like to create
Desired result:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302
etc
where the result contains one line for each pair of input lines that have identical first column (with : as the column separator).
The join command is your friend here. You'll likely need to sort the inputs (either pre-sort the files, or use a process substitution if available - e.g. with bash).
Something like:
join -t ':' <(sort file2) <(sort file1) >file3
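With the sample files, file3 should then contain (sorted by key):
123:49030:1204:9300:293920:dataset1:dataset932
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302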
When you do not want to sort files, play with grep:
while IFS=: read key others; do
echo "${key}:${others}:$(grep "^${key}:" file1 | cut -d: -f2-)"
done < file2
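This keeps file2's line order; for the sample input it should print:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
123:49030:1204:9300:293920:dataset1:dataset932
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302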
