How to merge text files with a common pair of strings in their lines - bash

I have two text files with the following line format:
Value - Value - Number
I need to merge these files into a new one that contains only the lines with a common Value - Value pair, followed by the two Number values.
For example, if I have these files:
File1.txt
Jack - Mark - 12
Alex - Ryan - 15
Jack - Ryan - 22
File2.txt
Paul - Bill - 11
Jack - Mark - 18
Jack - Ryan - 20
The merged file will contain:
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
How can I do this?

awk to the rescue!
awk -F' - ' 'BEGIN{OFS=FS}              # read and write " - "-separated fields
NR==FNR{a[$1,$2]=$3;next}               # first file: remember Number keyed by the Value pair
($1,$2) in a{print $1,$2,a[$1,$2],$3}   # second file: on a matching pair, print both Numbers
' file1 file2
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
alternatively, with decorate/join/undecorate: the first sed replaces only the first " - ", fusing each Value pair into a single join field, and the final sed undoes it:
$ join <(sort file1 | sed 's/ - /-/') <(sort file2 | sed 's/ - /-/') |
sed 's/-/ - /'
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20

Related

Bash find files and filter by date and size

I have a directory with a lot of files in it.
Each day, new files are added automatically.
The filenames are formatted like this:
[GROUP_ID]_[RANDOM_NUMBER].txt
Example : 012_1234.txt
For every day, for every GROUP_ID (032, 024, 044, etc.), I want to keep only the biggest file of the day.
So, for example, for the two days March 27 and 28 I have:
March 27 - 012_1234.txt - 12 KB
March 27 - 012_0243.txt - 3000 KB
March 27 - 016_5647.txt - 25 KB
March 27 - 024_4354.txt - 20 KB
March 27 - 032_8745.txt - 40 KB
March 28 - 032_1254.txt - 16 KB
March 28 - 036_0456.txt - 30 KB
March 28 - 042_7645.txt - 500 KB
March 28 - 042_2310.txt - 25 KB
March 28 - 042_2125.txt - 34 KB
March 28 - 044_4510.txt - 35 KB
And I want to have:
March 27 - 012_0243.txt - 3000 KB
March 27 - 016_5647.txt - 25 KB
March 27 - 024_4354.txt - 20 KB
March 27 - 032_8745.txt - 40 KB
March 28 - 032_1254.txt - 16 KB
March 28 - 036_0456.txt - 30 KB
March 28 - 042_7645.txt - 500 KB
March 28 - 044_4510.txt - 35 KB
I can't find the right bash ls/find command to do that. Does somebody have an idea?
With this command, I can display the biggest file for each day.
ls -l *.txt --time-style=+%s |          # show mtime as epoch seconds (column 6)
awk '{$6 = int($6/86400); print}' |     # convert seconds to a day number
sort -nk6,6 -nrk5,5 | sort -sunk6,6     # per day, keep the first (biggest) line
But I want the biggest file of each GROUP_ID for each day.
So, if there is only one 10 KB file for group_id "012", I want to display it, even if there are bigger files for other group IDs...
I found the solution myself:
ls -l | tail -n+2 |
awk '{ split($0,var,"_"); group_id=var[5]; print $0" "group_id }' |
sort -k9,9 -k5,5nr |
awk '$10 != x { print } { x = $10 }'
This gives me the biggest file for each group_id; now I just need to add handling for the day part (see the sketch after the notes below).
For information:
tail -n+2: skips the "total" line of the ls output
First awk: extracts the group_id part (012, 036, ...) and appends it after the original line ($0)
sort: orders by filename (column 9), then by size (column 5, descending)
Second awk: keeps only the first, i.e. biggest, line for each group_id (column 10, added by the first awk)
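A minimal sketch of one way to key on the day and the group ID at once, assuming GNU stat is available and filenames follow the [GROUP_ID]_[RANDOM_NUMBER].txt pattern (the field positions below rest on those assumptions):
# For each (day, group_id) pair, keep only the largest file.
# Assumes GNU stat and filenames like 012_1234.txt in the current directory.
stat --format '%y %s %n' -- *.txt |                 # mtime, size, name
awk '{ split($5, p, "_"); key = $1 SUBSEP p[1] }    # key = YYYY-MM-DD + group_id
     $4 > max[key] { max[key] = $4; line[key] = $0 }
     END { for (k in line) print line[k] }' |
sort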

reading a file into an array in bash

Here is my code
#!bin/bash
IFS=$'\r\n'
GLOBIGNORE='*'
command eval
'array=($(<'$1'))'
sorted=($(sort <<<"${array[*]}"))
for ((i = -1; i <= ${array[-25]}; i--)); do
echo "${array[i]}" | awk -F "/| " '{print $2}'
done
I keep getting an error that says "line 5: array=($(<)): command not found"
This is my problem.
As a whole, my code should read in a file given as a command line argument, sort the elements, then print out column 2 of the last 25 lines. I haven't been able to test past that point, so if there's a problem there too, any help would be appreciated.
This is some of what the file contains:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
16227 nicole
15308 daniel
15163 babygirl
14726 monkey
14331 lovely
14103 jessica
13984 654321
13981 michael
13488 ashley
13456 qwerty
13272 111111
13134 iloveu
13028 000000
12714 michelle
11761 tigger
11489 sunshine
11289 chocolate
11112 password1
10836 soccer
10755 anthony
10731 friends
10560 butterfly
10547 purple
10508 angel
10167 jordan
9764 liverpool
9708 justin
9704 loveme
9610 fuckyou
9516 123123
9462 football
9310 secret
9153 andrea
9053 carlos
8976 jennifer
8960 joshua
8756 bubbles
8676 1234567890
8667 superman
8631 hannah
8537 amanda
8499 loveyou
8462 pretty
8404 basketball
8360 andrew
8310 angels
8285 tweety
8269 flower
8025 playboy
7901 hello
7866 elizabeth
7792 hottie
7766 tinkerbell
7735 charlie
7717 samantha
7654 barbie
7645 chelsea
7564 lovers
7536 teamo
7518 jasmine
7500 brandon
7419 666666
7333 shadow
7301 melissa
7241 eminem
7222 matthew
In Linux you can simply do:
sort -nbr file_to_sort | head -n 25 | awk '{print $2}'
read in a file as a command line argument, sort the elements, then
print out column 2 of the last 25 lines.
From that description of the problem, I suggest:
#! /bin/sh
sort -bn "$1" | tail -25 | awk '{print $2}'
As a rule, use the shell to operate on filenames, and never use the
shell to operate on data. Utilities like sort and awk are far
faster and more powerful than the shell when it comes to processing a
file.
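If you do want the lines in a bash array first, as the question attempts, here is a minimal sketch using mapfile (bash 4+); treat it as one possible shape for the task rather than the canonical fix:
#!/bin/bash
# Read the file named by $1 into an array, one line per element.
mapfile -t array < "$1"
# Sort numerically, descending, and keep the 25 biggest counts
# (the question's "last 25 lines" of an ascending sort).
mapfile -t top < <(printf '%s\n' "${array[@]}" | sort -rn | head -n 25)
# Print the second column of each of those lines.
for line in "${top[@]}"; do
    awk '{print $2}' <<< "$line"
done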

Print names alphabetically and how many appearances for each name

I have a file that includes names, one on each line. I want to print the names alphabetically, and (here is where it gets confusing, at least for me) next to each name I must print the number of appearances of that name, with exactly one space between the name and the count.
For example if the file includes these names:
Barry
Don
John
Sam
Harry
Don
Don
Sam
it must print
Barry 1
Don 3
Harry 1
John 1
Sam 2
Any ideas?
sort | uniq -c will get you very close, just with the columns reversed.
$ sort file | uniq -c
      1 Barry
      3 Don
      1 Harry
      1 John
      2 Sam
If you really need them in the prescribed order you could swap them with awk.
$ sort file | uniq -c | awk '{print $2, $1}'
Barry 1
Don 3
Harry 1
John 1
Sam 2
With awk:
% awk '{
a[$1]++
}
END{
for (i in a) {
print i, a[i]
}
}' file
Output:
Barry 1
Harry 1
Don 3
John 1
Sam 2
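Note that for (i in a) visits the keys in an unspecified order, which is why Don appears out of place above. With GNU awk specifically you can request sorted traversal via PROCINFO (a gawk-only sketch; with plain awk, pipe to sort as the next answer does):
gawk '{ a[$1]++ }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: iterate keys in ascending string order
    for (i in a) print i, a[i]
}' file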
Given:
$ cat file
Barry
Don
John
Sam
Harry
Don
Don
Sam
You can do:
$ awk '{a[$1]++} END { for (e in a) print e, a[e] }' file | sort
Barry 1
Don 3
Harry 1
John 1
Sam 2

How to merge three text files into three columns on screen

How can I merge three text files into three columns on screen?
1  A  1
2  B  2
3  C  3
   D
   E
I tried...
paste file1.txt file2.txt file3.txt | column -s $'\t' -t
...but I always get
1  A  1
2  B  2
3  C  3
D
E
Thanks in advance for your help!
line 1-2 of file1.txt
USB Device Class ID:
CdRom&Ven_ZALMAN&Prod__Virtual_CD-Rom&Rev_
line 1-2 of file2.txt
USB Instance ID:
______XX00000001&1
line 1-2 of file3.txt
Last updated (Subkey):
2015-01-12 15:08:45 UTC+0000
I don't know your input files, but paste works as intended.
$ paste <(seq 1 4) <(seq 10 17) <(seq 5 9)
1   10  5
2   11  6
3   12  7
4   13  8
    14  9
    15
    16
    17
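The blank cells are what break the asker's attempt: column typically merges adjacent delimiters, so an empty field collapses and the later columns shift left. One workaround is to fill empty fields with a placeholder before formatting; a sketch, with "-" as an arbitrary placeholder:
paste file1.txt file2.txt file3.txt |
awk -F'\t' -v OFS='\t' '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "-"; print }' |
column -s $'\t' -t
The blank cells then show up as "-", which also makes them visible.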
Another way is to combine the files as a table with paste and column:
:|paste -d ' ' file1 - file2 - file3 | column -ts "| "
Here ":" produces empty standard input, each "-" reads a (therefore empty) spacer column from it, and column -t formats the result, treating the characters in "| " as separators.
The output will look like this:
1  A  1
2  B  2
3  C  3
   D
   E
If you only have three files, or just a few, to deal with, you can do this:
$ paste foo[12].txt | expand -t 45 | paste - foo3.txt | expand -t 12
USB Device Class ID:                         USB Instance ID:           Last updated (Subkey):
CdRom&Ven_ZALMAN&Prod__Virtual_CD-Rom&Rev_   ______XX00000001&1         2015-01-12 15:08:45 UTC+0000
                                             ______XY0000000182
$
You need to choose the tab expansions 45 and 12 depending upon the maximum line widths in foo1.txt and foo2.txt.
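If you would rather not hard-code those widths, they can be derived from the data; a sketch, assuming the same foo1.txt/foo2.txt/foo3.txt names as above:
# Tab stop = longest line of the preceding column(s), plus some padding.
w1=$(awk '{ if (length > m) m = length } END { print m + 2 }' foo1.txt)
w2=$(paste foo1.txt foo2.txt | expand -t "$w1" |
     awk '{ if (length > m) m = length } END { print m + 2 }')
paste foo1.txt foo2.txt | expand -t "$w1" | paste - foo3.txt | expand -t "$w2"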

How to grep two columns from a single file

cat Error00
4  0 375
4 2001 21
4 2002 20
cat Error01
4 0 465
4 2001 12
4 2002 40
4 2016 1
I want the output as below:
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
I am using the query below. The problem is that I am not able to grep on two fields, because of the space between them. Please suggest how I can get rid of this.
keylist=$(awk '{print $1,$2}' Error0[0-1] | sort | uniq)
for key in ${keylist} ; do
echo ${key}
val_a=$(grep "^${key}" Error00 | awk '{print $3}') ;val_a=${val_a:---}
val_b=$(grep "^${key}" Error01 | awk '{print $1,$2}') ; val_b=${val_b:--- --}
echo $key ${val_a} >>testreport
done
I am getting the output as below:
4 375 465
0
4 21 12
2001
4 20 20
2002
4 - 1
2016
A single awk one-liner can handle this easily:
awk 'FNR==NR{a[$1,$2]=$3;next}{print $1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3}' err0 err1
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
For formatted output you can use printf instead of print, as Jonathan Leffler suggests:
printf "%s %-6s %-6s %s\n",$1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3
4 0      375    465
4 2001   21     12
4 2002   20     40
4 2016   -      1
However a general solution is to use column -t for a nice table output:
awk '{....}' err0 err1 | column -t
4  0     375  465
4  2001  21   12
4  2002  20   40
4  2016  -    1
grep is not really the right tool for this job. You can either play with awk or Perl (or Python, or …), or you can use join. However, join only joins on a single column at a time, and you appear to need to join on two columns. So, we're going to have to massage the data so that it will work with join. I'm about to assume you're using bash and so have process substitution available. You can do the job without, but it is fiddlier and involves temporary files (and traps to clean them up, etc).
The key to the join will be to replace the blank between the first two columns with a colon (or any other convenient character — control-A would work fine too), then join the files on column 1 with a replacement character. The inputs must be sorted; the output must have the colon replaced with a blank.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/  */:/' Error00 | sort) \
> <(sed 's/  */:/' Error01 | sort) |
> sed 's/:/ /'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$
The 's/  */:/' operation replaces the first sequence of one or more blanks with a colon; the input data has two blanks between the 4 and the 0 in the first line of Error00. The input to join must be in sorted order of the joining field, here the first field. The output is the join field, the second column of Error00 and the second column of Error01 (remembering that means the second column after the first two have been fused by the colon). If there's an unmatched line in the first file, generate an output line (-a 1); ditto for the second file; and for the missing fields, insert a dash (-e '-'). The final sed removes the colon that was added.
If you want the data formatted, pipe it through awk.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/  */:/' Error00 | sort) \
> <(sed 's/  */:/' Error01 | sort) |
> sed 's/:/ /' |
> awk '{printf("%s %-6s %-6s %s\n", $1, $2, $3, $4)}'
4 0      375    465
4 2001   21     12
4 2002   20     40
4 2016   -      1
$
