Shell: put data together

I am writing a shell script to put data together.
I have 2 files with different columns.
One of the columns is the same in both files.
Like:
File 1:
a 5
c 7
d 9
b 5
File 2:
c 1
d 8
a 6
b 3
For the moment my script puts the data into the same file with
paste -d ' ' 'file1' 'file2' > "file3"
I would like to know if it's possible to match the 2 files on that common column, in order, like:
a 5 6
b 5 3
c 7 1
d 9 8
Thanks

sort file1 > file1.tmp
sort file2 > file2.tmp
join -t " " -j 1 file1.tmp file2.tmp
This assumes that the letters and numbers are separated by a space.

Using process substitution you can sort the files and join them in a single command.
join -t " " -j 1 <(sort file1) <(sort file2)

Use sort to sort both files, and then join to join on the first field.
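As a quick check (assuming the sample data above is saved as file1 and file2), the process-substitution one-liner reproduces the desired output:
$ join -t " " -j 1 <(sort file1) <(sort file2)
a 5 6
b 5 3
c 7 1
d 9 8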

Related

is it possible to get the content of file1 minus file2 by using bash cmd?

I have two files:
log.txt
log.bak2022.06.20.10.00.txt
log.bak2022.06.20.10.00.txt is the backup of log.txt taken at 2022.06.20 10:00,
but log.txt is a content-increasing (append-only) file.
Now I have a requirement: I want to get the content of log.txt minus log.bak2022.06.20.10.00.txt and write it into a new file.
Is it possible to implement this?
Assumption:
the small file contains N lines, and these N lines are an exact match for the first N lines of the big file
Sample inputs:
$ cat small
4
2
1
3
$ cat big
4
2
1
3
8
10
9
4
One comm idea:
$ comm --nocheck-order -13 small big
8
10
9
4
One awk idea:
$ awk '
FNR==NR { max=FNR; next }
FNR>max
' small big
8
10
9
4
One wc/sed idea:
$ max=$(wc -l < small)
$ ((max++))
$ sed -n "$max,$ p" big
8
10
9
4
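The same skip-the-first-N-lines idea can also be written with tail (a variant not in the original answers, relying on the same assumption that the small file is an exact prefix of the big one):
$ tail -n +"$(( $(wc -l < small) + 1 ))" big
8
10
9
4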
An awk-based solution with no need for Unix pipe chains, regexes, function calls, or array splitting. The assignment _ += NR==FNR counts the lines of small.txt, lines of big.txt are printed only once FNR exceeds that count, and FS='^$' merely disables field splitting:
{m,n,g}awk '(_+= NR==FNR ) < FNR' FS='^$' small.txt big.txt
8
10
9
4

Delete matching lines in two tab delimited files

I have 2 tab-delimited files:
A 2
A 5
B 4
B 5
C 10
and
A 2
A 5
B 5
I want to delete the lines in file1 that are in file2 so that the output is:
B 4
C 10
I have tried:
awk 'NR==FNR{c[$1$2]++;next};!c[$1$2] > 0' file2 file1 > file3
but it deletes more lines than expected. The line counts (from wc -l) are:
1026997259 file1
1787919 file2
1023608359 file3
How can I modify this code for the following case? I have 2 tab-delimited files
A 2 3
A 5 4
B 4 5
B 5 5
C 10 12
and
A 2 5
A 5 4
B 5 3
F 6 7
Based only on the 1st and 2nd columns, I want to grab the lines in file1 that are also in file2, so that the output is:
B 5 5
C 10 12
Why not use the grep command?
grep -vf file2 file1
Think about it: if you concatenate the fields of "ab c" and of "a bc", they both become "abc", so what do you think your code is doing with $1$2? Use SUBSEP as intended ($1,$2) and change !c[$1$2] > 0 to !(($1,$2) in c). Also consider whether !c[$1$2] > 0 means !(c[$1$2] > 0) or (!c[$1$2]) > 0; I'd never write the former, so I don't know for sure, and I'd always write it with parens so it is parsed as I intend. So do:
awk 'NR==FNR{c[$1,$2];next} !(($1,$2) in c)' file2 file1
Or just use $0 instead of $1,$2:
awk 'NR==FNR{c[$0];next} !($0 in c)' file2 file1
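For the original two-column sample above, where matching lines are identical, either version prints the expected result:
$ awk 'NR==FNR{c[$0];next} !($0 in c)' file2 file1
B 4
C 10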
If the matching lines in the two files are identical, and the two files are sorted in the same order, then comm(1) can do the trick:
comm -23 file1 file2
It prints lines that are only in the first file (unless -1 is given), lines that are only in the second file (unless -2), and lines that are in both files (unless -3). If you leave more than one of those categories enabled, they will be printed in multiple (tab-separated) columns.
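If file1 and file2 are not already sorted, you can sort them on the fly with process substitution, as used elsewhere on this page (note the output then comes out in sorted order):
comm -23 <(sort file1) <(sort file2)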

unix utility to compare lists and perform a set operation

I believe what I'm asking for is a sort of set operation. I need help trying to create a list of the following:
List1 contains:
1
2
3
A
B
C
List2 contains:
1
2
3
4
5
A
B
C
D
E
The final list I need would be these 4 items:
4
5
D
E
So obviously List2 contains more elements than List1.
The final list I need consists of the elements in List2 that are NOT in List1.
Which Linux utility can I use to accomplish this? I have looked at sort and comm, but I'm unsure how to do this correctly. Thanks for the help.
Using awk with straightforward logic:
awk 'FNR==NR{a[$0]; next}!($0 in a)' file1 file2
4
5
D
E
Using the GNU comm utility. According to the comm man page,
comm -3 file1 file2
prints lines in file1 not in file2, and vice versa.
Using it for your example:
comm -3 file2 file1
4
5
D
E
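If you only want the lines unique to List2, comm -13 does that directly; this is a small variant of the above, not from the original answer, and both inputs must be sorted (these samples already are). With file1 holding List1 and file2 holding List2:
$ comm -13 file1 file2
4
5
D
E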
You can do it with a simple grep command inverting the match with -v and reading the search terms from list1 with -f, e.g. grep -v -f list1 list2. Example use:
$ grep -v -f list1 list2
4
5
D
E
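One caveat worth adding (not part of the original answer): the -f patterns are regular expressions matched anywhere in a line, so for literal, whole-line set subtraction it is safer to also pass -F and -x:
$ grep -Fxv -f list1 list2
4
5
D
E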
Linux provides a number of different ways to skin this cat.
You can try this:
$ diff list1.txt list2.txt | egrep '>|<' | awk '{ print $2 }' | sort -u
4
5
D
E
I hope this helps you.

count patterns in a csv file from another csv file in bash

I have two csv files
File A
ID
1
2
3
File B
ID
1
1
1
1
3
2
3
What I want to do is count how many times each ID in File A shows up in File B, and save the result in a new file C (also in csv format). For example, 1 in File A shows up 4 times in File B. So in the new file C, I should have something like
File C
ID,Count
1,4
2,1
3,2
Originally I was thinking of using "grep -f", but it seems like it only works with the .txt format. Unfortunately, File A and File B are both in csv format. So now I am thinking maybe I could use a for loop to get each ID from File A individually and use grep -c to count it. Any idea will be helpful.
Thanks in advance!
You can use this awk command (FNR==1{next} skips the header line of each file; the IDs from fileA are stored in a, and the matches found in fileB are counted in freq):
awk -v OFS=, 'FNR==1{next} FNR==NR{a[$1]; next} $1 in a{freq[$1]++}
END{print "ID", "Count"; for (i in freq) print i, freq[i]}' fileA fileB
ID,Count
1,4
2,1
3,2
You could use join, sort, uniq and process substitution <(command) creatively:
$ join -2 2 <(sort A) <(sort B | uniq -c) | sort -n > C
$ cat C
ID 1
1 4
2 1
3 2
And if you really really want the header to be ID Count, you could replace that 1 with Count using sed before writing to file C, by adding:
... | sed 's/\(ID \)1/\1Count/' > C
to get
ID Count
1 4
2 1
3 2
and if you really really want commas as separators instead of spaces, also add tr to replace the spaces with commas:
... | tr \ , > C
to get
ID,Count
1,4
2,1
3,2
You could of course ditch the tr and use sed like this instead:
... | sed 's/\(ID \)1/\1Count/;s/ /,/' > C
And the output would be like above.
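Putting the pieces together, the whole chain can be run as a single pipeline (a sketch using the same A, B, and C file names as above):
$ join -2 2 <(sort A) <(sort B | uniq -c) | sort -n | sed 's/\(ID \)1/\1Count/;s/ /,/' > C
$ cat C
ID,Count
1,4
2,1
3,2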

Parallelise grep - use file rows as input for grep

I have File1 and File2 as below. I found similar questions, but not quite the same.
I want to use the rows of File1 as input for grep and extract the 1st column of File2. In the toy example below, if column 2 in File2 equals a or b, then write column 1 to File_ab.
So far I am using a double loop, and the estimated time is 4 days. I was hoping to get something like: cat File1 | xargs -P 12 -exec grep "$1\|$2" File2 > File_$1$2.txt
but I failed to get the syntax right. I am trying to run 12 greps in parallel with an OR condition.
File1
a b
c d
File2
1 a
2 b
3 c
1 d
4 a
5 e
6 d
Desired output is 2 files, File_ab and File_cd:
File_ab
1
2
4
File_cd
1
3
6
Note: my File1 is 25K rows, and File2 is 10 million rows.
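For reference, here is a minimal sketch of the xargs approach the question was aiming for. It assumes two space-separated keys per File1 row, keys appearing only in column 2 of File2, and an xargs that supports -P (e.g. GNU xargs); note that every grep rescans File2, which the Perl answer below avoids:
< File1 xargs -n 2 -P 12 sh -c 'grep -E " ($1|$2)\$" File2 | cut -d" " -f1 > "File_$1$2"' sh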
Use perl:
#!/usr/bin/perl
use FileCache;
# read every line of File1; each line holds the keys for one output file
@a = `cat File1`;
chomp(@a);
for $a (@a) {
    @parts = split/ +/,$a;
    push @re, @parts;
    for $p (@parts) {
        # map each key (e.g. a) to its output file name (e.g. File_ab)
        $file{$p} = "File_".join "",@parts;
    }
}
# one alternation matching any key from File1
$re = join("|",@re);
while(<>) {
    if(/(\d+).*($re)/o and $file{$2}) {
        # FileCache's cacheout keeps a bounded number of output handles open
        $fh = cacheout $file{$2};
        print $fh $1,"\n";
    }
}
Then:
chmod 755 myscript
./myscript File2
