Complex CSV question: how to generate a final CSV after comparing multiple CSVs (in the following manner) using shell scripting?

assume
file1.csv
Schemaname.tablename.columns
exam1
exam2
filetomatch.csv
exam1
exam2
exam4
exam5
exam6
I used the following to match the results (each run produces one CSV):
awk 'NR==FNR{a[$1];next} ($1) in a' file1.csv filetomatch.csv >> result.csv
Result:
exam1
exam2
I have n files to compare to filetomatch.csv.
I need the output to be as follows:
file    matched columns
file1   exam1
        exam2
file2   exam4
.
.
.
filen   exam2
        exam3
and so on...
How can I concatenate the result.csv files each time, with the file name as the first field?
Also, is there a way to show the null columns as well?
How can I add null values using this?
Example
File1 Column1
File1 Column1
File2 null
File3 column3
and so on

>> result.csv should be doing the concatenation for you.
For example, create some test files:
$ for i in {1..4}; do echo $i > file$i.txt; done
$ head file?.txt
==> file1.txt <==
1
==> file2.txt <==
2
==> file3.txt <==
3
==> file4.txt <==
4
Run an awk script on all the files, print the filename as part of the output, and concatenate the results:
$ for f in file{1..4}.txt; do awk '{print FILENAME, $0}' "$f" >> results.csv; done
$ cat results.csv
file1.txt 1
file2.txt 2
file3.txt 3
file4.txt 4

I found these two useful:
awk 'NR==FNR{a[$1];next}($1) in a{ print FILENAME, ($1) }' file1.csv filetomatch.csv
Merge the common values in a column:
awk -F, '{ if (f == $1) { for (c=0; c <length($1) ; c++) printf " "; print FS $2 FS $3 } else { print $0 } } { f = $1 }' file.csv
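Putting those pieces together, here is a minimal sketch of the whole loop (my own sketch, assuming the n files are named file1.csv .. filen.csv in the current directory; a "null" line is printed when a file has no matches at all):
for f in file[0-9]*.csv; do
    awk -v name="$f" 'NR==FNR { a[$1]; next }
        ($1 in a)    { print name, $1; found = 1 }
        END          { if (!found) print name, "null" }' "$f" filetomatch.csv
done >> result.csv
This keeps the match direction of the original command (load each file's columns into a, then test every line of filetomatch.csv against it), passes the file name in with -v so it can be printed as the first field, and appends everything into a single result.csv.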

Related

Pattern matching in unix shell script

I have two files like below:
File 1:
id1
hftujdbbd
bdurijtbr
grhjend
Ghent
id2
fu Rubens
hejdnnd
bdudndn
id3
gjbfbd
vhrjend
rjndnd
.
.
.
File 2:
id1
id2
I need to find the ids in file1 that are matching with ids in file2 and print all the lines related to that matched id. Please let me know how to implement this.
So the expected output is as below:
hftujdbbd
bdurijtbr
grhjend
Ghent
fu Rubens
hejdnnd
bdudndn
Using awk:
awk -F "/" 'NR==FNR { indx[$0]=1;next } indx[$2]==1 && ($1 == 2 || $1 == 3) { print }' file2 file1
Explanation:
awk -F "/" 'NR==FNR { # Set the field separator to "/" and then process the first file (file2) (NR==FNR)
indx[$0]=1; # Set the index of an array (indx) to the line
next
}
indx[$2]==1 && ($1 == 2 || $1 == 3) { # When processing the second file (file1), if there in an entry for the second "/" delimited field in index and the first field is 2 or 3, print the line
print
}' file2 file1
Taking file1 and file2 as above
while read id; do
    if grep -w "$id" file1.txt > /dev/null 2>&1; then
        awk "/$id/{flag=1; next} /id[0-9]?/{flag=0} flag" file1.txt
    fi
done < file2.txt
output:
abhishekphukan$ bash script.sh
hftujdbbd
bdurijtbr
grhjend
Ghent
fu Rubens
hejdnnd
bdudndn
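If you want to avoid running grep once per id, a single-pass awk sketch (assuming the id lines always look like "id" followed by digits and should not themselves be printed):
awk 'NR==FNR { want[$0]; next }
     /^id[0-9]+$/ { keep = ($0 in want); next }
     keep' file2.txt file1.txt
It loads the ids from file2.txt, turns keep on or off whenever an id line is seen in file1.txt, and prints the data lines while keep is set.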

Bash: compare 2 files and show the unique content of one file with 'hierarchy'

So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the unique lines in file 1, I use the 'comm -23' command after running 'sort -u' on these 2 files. Additionally, I would like to make '11 a' and '123 a' in file 2 become subsets of '1 a' in file 1; similarly, '445 d' is a subset of '44 d'. These subsets are considered the same as their superset. So the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... so here is my code:
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) > output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[@]}"; do
    awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: is there any way I could show the row numbers from the original file in the output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {
    vals[$2][$1]
    next
}
$2 in vals {
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) {
            next
        }
    }
}
{ print FNR, $0 }
$ awk -f tst.awk file2 file1
2 2 b
3 3 c

Count number of nonempty entries in each column of, e.g., comm output

The Unix command comm file1 file2 has a 3 column output with lines unique to file1 in the first column, lines unique to file2 in the second, and lines shared by both in the 3rd (assuming file1 and file2 are sorted). It ends up looking something like this:
$ echo -e "alpha\nbravo\ncharlie" > file1
$ echo -e "alpha\nbravo\ndelta" > file2
$ comm file1 file2
		alpha
		bravo
charlie
	delta
If I want the number of nonempty lines in each column, is there a general way to parse the output of comm and count those?
I know that for comm in particular I could just run
for i in {12,23,31}; do comm -$i file1 file2 | wc -l; done
but I'm curious about solutions that take the comm output file as a starting point, for the sake of getting better at Unix command line. I added the awk tag because I have a hunch there's a good awk solution.
The other answer covers your question of using awk to do the job quite well, but it is also worth mentioning that the GNU version of comm has a --total option which will print the number of lines in each column in a similar manner.
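For example (a rough sketch from memory of GNU coreutils 8.26+; the summary line is delimiter-separated counts for columns 1, 2 and 3 followed by the word "total"):
$ comm --total file1 file2 | tail -n 1
1	1	2	total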
You may use this awk:
comm file1 file2 |
awk -F '\t' -v OFS='\n' '{ if ($1=="") if ($2=="") c3++; else c2++; else c1++ }
END { print c3, c2, c1 }'
2
1
1
Note that the output of comm is tab-delimited, with these cases:
common lines have the 1st and 2nd columns empty
lines unique to file2 have the 1st column empty
lines unique to file1 have a non-empty 1st column
The question is interesting, but not as easy as one would imagine, especially if you do not have the --total option.
A couple of things about comm:
comm works on sorted files
if a line appears n times in file1 and m times in file2 with n < m, comm will output m-n entries in column 2 and n entries in column 3.
$ comm <(echo -e "1\n2\n3") <(echo -e "2\n2\n3\n4")
1
		2
	2
		3
	4
comm uses the <tab> character as its default separator, so processing its output becomes problematic if your input contains this character.
$ comm <(echo -e "1\t2\n3") <(echo -e "2\n3\n4")
1	2   << this is the weird line
	2
		3
	4
Luckily it has an option to define the delimiter (--output-delimiter=STR)
comm only adds a delimiter if a non-empty field follows it
$ comm --output-delimiter=SEP <(echo -e "1\n2\n3") <(echo -e "2\n3\n4")
1 << NO SEP (1 field)
SEPSEP2 << TWO SEP (3 fields)
SEPSEP3 << TWO SEP (3 fields)
SEP4 << ONE SEP (2 fields)
How can we solve it now:
We should clearly not use a printable character as a delimiter, as that is asking for problems when processing ASCII files, so what you can do is use a non-printable character as the delimiter. You could, for example, use the <start-of-heading> character with octal value \001 (comm does not accept the <null> character). This generally solves the issues you might have due to point (3).
$ comm --output-delimiter=$'\001' <(echo -e "1\t2\n3") <(echo -e "2\n3\n4")
this output can now be piped into an extremely simple awk
$ awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
the above works because of point (4).
So you can just do:
$ comm --output-delimiter=$'\001' file1 file2 \
| awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
But I don't have that --output-delimiter option: this calls for the pure awk solution. We keep track of 3 arrays: a for file1, b for file2, and c for the combination (c keeps track of all the entries). We make sure to take point (2) into account.
$ awk '(NR==FNR) { a[$0]++; c[$0]++ }
       (NR!=FNR) { b[$0]++; c[$0]-- }
       END { for (i in c) {
                 if      (c[i] <  0) { countb += -c[i]; countc += a[i] }
                 else if (c[i] == 0) { countc += a[i] }
                 else                { counta +=  c[i]; countc += b[i] }
             }
             print counta, countb, countc
           }' file1 file2
We could essentially get rid of the array b as it can be derived from a and c, but I wanted to make it a bit more clear how it works; the other version would be:
$ awk '(NR==FNR) { a[$0]++; c[$0]++; next } { c[$0]-- }
       END { for (i in c) {
                 counta += (c[i]>0 ? c[i] : 0)
                 countb -= (c[i]<0 ? c[i] : 0)
                 countc += a[i] - (c[i]>0 ? c[i] : 0)
             }
             print counta, countb, countc
           }' file1 file2
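As a quick sanity check (my own run-through of the logic, not part of the original answer): on the file1/file2 example at the top of this question, both variants print
1 1 2
i.e. one line unique to file1 (charlie), one unique to file2 (delta), and two common lines (alpha, bravo).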
Using Perl
$ comm file1 file2 | perl -lne ' /^\t\t/ and $kv{2}++; /^\t\S+/ and $kv{1}++; /^\S+/ and $kv{3}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
or
$ comm file1 file2 | perl -lne ' /(^\t\t)|(^\t\S+)|(^.)/ and $x=$+[0]>2?3:$+[0]; $kv{$x}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
where
col-1 -> first file
col-3 -> second file
col-2 -> both files
Obviously you can do it all in awk, without comm or requiring sorted inputs.
$ awk 'NR==FNR {a[$1]; next}
{if($1 in a) {c3++; delete a[$1]}
else c2++}
END {print length(a),c2,c3}' file1 file2
1 1 2
Those are the counts for file1 only, file2 only, and common.
Note, this requires that the records are unique in each file.

Adding column values from multiple different files

I have ~100 files and I would like to do an arithmetic operation (e.g. sum them up) on the second column of the files, such that I add the value of the first row of one file to the first-row value of the next file, and so on for all rows of column 2 in each file.
In my actual files I have ~30 000 rows so any kind of manual manipulation with the rows is not possible.
fileA
1 1
2 100
3 1000
4 15000
fileB
1 7
2 500
3 6000
4 20000
fileC
1 4
2 300
3 8000
4 70000
output:
1 12
2 900
3 15000
4 105000
I used the script below and ran it as script.sh listofnames.txt (all the files have the same name but live in different directories, so I refer to them via $line from the file that lists the directory names). This gives me a syntax error and I am looking for another way to define the "sum".
while IFS='' read -r line || [[ -n "$line" ]]; do
awk '{"'$sum'"+=$3; print $1,$2,"'$sum'"}' ../$line/file.txt >> output.txt
echo $sum
done < "$1"
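The syntax error comes from the "'$sum'" pieces: with $sum unset, the shell expands the program to {""+=$3; print $1,$2,""}, and awk cannot assign to the string constant "". If you really need a shell variable inside awk, pass it with -v instead; a hypothetical sketch (a per-file running total, which is not the row-wise sum you actually want; the answers below cover that):
start=0
awk -v s="$start" '{ s += $2; print $1, $2, s }' "../$line/file.txt"
For the row-wise sum across files, you do not need the shell loop at all: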
$ paste fileA fileB fileC | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
1 12
2 900
3 15000
4 105000
or if you wanted to do it all in awk:
$ awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' fileA fileB fileC
1 12
2 900
3 15000
4 105000
If you have a list of directories in a file named "foo" and every file you're interested in in every directory is named "bar" then you can do:
IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
cmd "${files[#]}"
where cmd is awk or paste or anything else you want to run on those files. Look:
$ cat foo
abc
def
ghi klm
$ IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
$ awk 'BEGIN{ for (i=1;i<ARGC;i++) print "<" ARGV[i] ">"; exit}' "${files[#]}"
<abc/bar>
<def/bar>
<ghi klm/bar>
So if your files are all named file.txt and your directory names are stored in listofnames.txt then your script would be:
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' listofnames.txt) )
followed by whichever of these you prefer:
paste "${files[#]}" | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' "${files[#]}"

Using awk, how to merge 2 files, say A & B, do a left outer join and include all columns from both files

I have multiple files with different numbers of columns. I need to merge the first and second files with a left outer join in awk, relative to the first file, and print all columns from both files, matching on the first column of both files.
I have tried the code below to get close to my output, but I can't print the empty fields (',,') where no matching number is found in the second file. join needs sorting and takes more time than awk, and my files are big, around 30 million records.
awk -F ',' '{
    if (NR==FNR) { r[$1]=$0 }
    else { if ($1 in r)
               r[$1]=r[$1] gensub($1,"",1) }
} END { for (i in r) { print r[i] } }' file1 file2
file1
number,column1,column2,..columnN
File2
number,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    if ( FNR == 1 ) {
        empty = gensub(/[^,]/,"","g",tail)
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(); with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
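For reference, a POSIX-awk sketch of that one extra step (untested; sub()/gsub() edit a copy of the record instead of calling gensub()):
BEGIN { FS=OFS="," }
NR==FNR {
    tail = $0
    sub(/[^,]*,/,"",tail)            # drop the key field, keep the rest
    if ( FNR == 1 ) {
        empty = tail
        gsub(/[^,]/,"",empty)        # keep only the commas -> the all-empty tail
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }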
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    idx[$1] = NR
    file2[NR] = tail
    if ( FNR == 1 ) {
        file2[""] = gensub(/[^,]/,"","g",tail)
    }
    next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
You can try:
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
You get:
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
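One caveat worth repeating, since the question mentioned it: join needs its inputs sorted on the join field. If the files are not already sorted, a sketch of the usual workaround is process substitution (assuming a plain lexical sort on field 1 is acceptable here):
join -t ',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 \
     <(sort -t ',' -k1,1 file1.txt) <(sort -t ',' -k1,1 file2.txt)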
