Merging two files with unequal lengths based on two keys in linux [duplicate] - shell

This question already has answers here:
Joining multiple fields in text files on Unix
(11 answers)
Closed 2 years ago.
I have two txt files with different lengths.
File 1:
Albania 20200305 0
Albania 20200306 0
Albania 20200307 0
Albania 20200308 0
Albania 20200309 3
Albania 20200310 7
Albania 20200311 4
Albania 20200312 2
File 2:
Europe Albania 20200309 2
Europe Albania 20200310 6
Europe Albania 20200311 10
Europe Albania 20200312 11
Europe Albania 20200313 23
Europe Albania 20200314 33
I would like to create a File3 that appends the 3rd column of File1 to the end of each File2 line whenever the 1st and 2nd columns of File1 match the 2nd and 3rd columns of File2. It should look like this:
File3:
Europe Albania 20200309 2 3
Europe Albania 20200310 6 7
Europe Albania 20200311 10 4
Europe Albania 20200312 11 2
I have tried
awk 'NR==FNR{A[$1,$2]=$3;next} (($2,$3) in A) {print $0, A[$1,$2]}' file1.txt file2.txt > file3.txt
but it just prints File2; it does not append the third column of File1.
Can you please help me with this problem?
Thanks in advance!

Your approach is correct, but when printing you need to use A[$2,$3]; you are using A[$1,$2], which does not exist in array A (the 1st and 2nd columns of file1 have to be compared to the 2nd and 3rd columns of file2), hence it prints only the current line of file2 into your file3.
awk 'NR==FNR{a[$1,$2]=$3;next} (($2,$3) in a) {print $0, a[$2,$3]}' file1 file2
Also see this link (thanks to James for providing it here): Why we shouldn't use variables in capital letters
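Putting it together, the full command that writes File3 (a minimal sketch, assuming the file names file1.txt and file2.txt from your attempt):
awk 'NR==FNR{a[$1,$2]=$3;next} (($2,$3) in a) {print $0, a[$2,$3]}' file1.txt file2.txt > file3.txt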

Related

How to split a file into chunks with 1000 lines in each chunk in Bash? [duplicate]

This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 7 years ago.
I have a file that is 6200 lines long that looks like:
chrom chromStart chromEnd score a a.1
1 chr1 834359 867552 4 0.020979021 0.0000000000
2 chr1 1880283 1940830 9 0.075757576 0.0000000000
3 chr1 1960387 2064958 13 0.115093240 0.0006596306
4 chr1 2206040 2249092 5 0.019230769 0.0000000000
5 chr1 2325759 2408930 11 0.021296885 0.0080355001
I need to break the file into files that are 1000 lines long. How can this be done?
This sounds like a case for the POSIX split command:
split -l 1000 file-to-be-split prefix.
This will split the 'file to be split' into files with 1000 lines each (except the last, of course), and the names will start with prefix. and will end with aa, ab, ac, ...
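For example, on the 6200-line file from the question (a sketch, assuming the file is named data.txt):
$ split -l 1000 data.txt chunk.
$ wc -l chunk.*
  1000 chunk.aa
  1000 chunk.ab
  1000 chunk.ac
  1000 chunk.ad
  1000 chunk.ae
  1000 chunk.af
   200 chunk.ag
  6200 total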

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a constant, then subtract one result from the other and output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk well enough to do this myself. Does anyone know how to do it, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
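If you need to run this with different constants, one option is to wrap it in a small shell function (a sketch; the name divsub and the argument order are my own choices, not from the question):
divsub() {  # usage: divsub c k file1 file2
    awk -v c="$1" -v k="$2" '
        /^#/    {next}                # skip headers
        FNR==NR {val[$1]=$2; next}    # store values from the first file
        $1 in val {print $1, (val[$1]/c - $2/k)}
    ' "$3" "$4"
}
divsub 10 20 file1 file2 > file3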

Mapping ids for 10 million records [duplicate]

This question already has answers here:
Efficient way to map ids
(2 answers)
Closed 9 years ago.
I have two text files,
File 1 with data like
User game count
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
...
File 2
1 Basketball
2 Football
3 Rugby
...
90 TT
91 Volleyball
...
Now what I want to do is add another column to File 2 such that I have the corresponding index of the game from File 2 as an extra column in File 1.
I have 2 million entries in File 1, so I want to add another column specifying the index (basically the line number, or order) of the game from File 2. How can I do this efficiently?
Right now I am doing this line by line: reading a line from File 1, grepping File 2 for the corresponding game's line number, and writing that to a file.
This will take me ages. How can I speed this up if I have 10 million rows in file 2 and 3000 rows in file 1?
With awk, read field 1 from File2 into an array indexed by field 2, then look up the array using field 2 from File1 as you iterate through it:
awk 'NR == FNR{a[$2]=$1; next}; {print $0, a[$2]}' File2 File1
A Rugby 2 3
A Football 2 2
B Volleyball 1 91
C TT 2 90
You can construct an associative array from the second file, with game names as keys and the game indexes as values. Then, for each line in file 1, look up the wanted id in the array and write it out.
Associative arrays provide O(1) average lookup time.
Use the join command, joining on the game name (field 2 in both files):
$ cat file1
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
$ cat file2
1 Basketball
2 Football
3 Rugby
90 TT
91 Volleyball
$ join -1 2 -2 2 -o 1.1,1.2,1.3,2.1 \
<(sort -k 2,2 file1) <(sort -k 2,2 file2)
A Football 2 2
A Rugby 2 3
C TT 2 90
B Volleyball 1 91
Here's another approach: read only the small file into memory, then read the bigger file line by line; once every ID has been found, bail out:
awk '
NR == FNR {
f1[$2] = $0
n++
next
}
($2 in f1) {
print f1[$2], $1
delete f1[$2]
if (--n == 0) exit
}
' file1 file2
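On the sample files this prints:
A Football 2 2
A Rugby 2 3
C TT 2 90
B Volleyball 1 91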
Rereading your question, I don't know if I've answered the question: do you want an extra column appended to file1 or file2?

Combine text from two files, output to another [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 1 year ago.
I'm having a bit of a problem and I've been searching all day. This is my first Unix class, so don't be too harsh.
This may sound fairly simple, but I can't get it.
I have two text files
file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
I am trying to write a script using a looping structure that will combine both files and produce the output below as a separate file
output:
Name On-Call Phone Start Time
Sally Monday 248.344.5576 8am
Roberto Tuesday 313.123.4567 2am
Alice Wednesday 616.556.4458 11pm
David Thursday 734.838.9801 10am
Frank Friday 634.296.1259 3pm
Mary Saturday 313.449.1390 2pm
Ted Sunday 248.496.2207 4pm
This is what I tried (I know it's horrible):
echo " Name On-Call Phone Start Time"
file="/home/xubuntu/date.txt"
file1="/home/xubuntu/name.txt"
while read name2 phone
do
while read name day time
do
echo "$name $day $phone $time"
done<"$file"
done<"$file1"
Any help would be appreciated.
First, sort the files using sort and then use this command:
paste file1 file2 | awk '{print $1,$4,$2,$5}'
This will bring you pretty close. After that you have to figure out how to convert the time from 24-hour format to 12-hour format.
If you want to avoid using sort separately, you can bring in a little more complexity like this:
paste <(sort file1) <(sort file2) | awk '{print $1,$4,$2,$5}'
Finally, if you have not yet figured out how to print the time in 12-hour format, here is the full command (note that the |& coprocess operator requires GNU awk):
paste <(sort file1) <(sort file2) | awk '{"date --date=\"" $5 ":00:00\" +%I%P" |& getline $5; print $1 " " $4 " " $2 " " $5 }'
You can use tabs (\t) in place of spaces as connectors to get a nicely formatted output.
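If GNU awk is not available, a coprocess-free variant does the 12-hour conversion with plain awk arithmetic (a sketch; it reproduces the am/pm values in the expected output above):
paste <(sort file1) <(sort file2) | awk '{
    h = $5 % 12; if (h == 0) h = 12    # map 0 -> 12am and 13-23 -> 1-11pm
    print $1, $4, $2, h ($5 < 12 ? "am" : "pm")
}'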
In this case the join command will also work:
join -1 1 -2 1 <(sort file1) <(sort file2)
Description
-1 -> file1
1 -> first field of file1 (common field)
-2 -> file2
1 -> first field of file2 (common field)
$ cat file1
David 734.838.9801
Roberto 313.123.4567
Sally 248.344.5576
Mary 313.449.1390
Ted 248.496.2207
Alice 616.556.4458
Frank 634.296.1259
$ cat file2
Roberto Tuesday 2
Sally Monday 8
Ted Sunday 16
Alice Wednesday 23
David Thursday 10
Mary Saturday 14
Frank Friday 15
output
Alice 616.556.4458 Wednesday 23
David 734.838.9801 Thursday 10
Frank 634.296.1259 Friday 15
Mary 313.449.1390 Saturday 14
Roberto 313.123.4567 Tuesday 2
Sally 248.344.5576 Monday 8
Ted 248.496.2207 Sunday 16
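If you want the columns neatly aligned, pipe the result through column -t (assuming the util-linux column utility is available):
join -1 1 -2 1 <(sort file1) <(sort file2) | column -t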

Write the number of elements per line of a file and its repetitions with awk

I have a file of distinct integers in which each line may have a different length, like this:
1 2 3 4 5
16 7 8
9 10 101 102 13 14
15 6 17
24 28 31 30 18
I would like to print the number of elements each line contains, and how many lines share that same element count; for this example the output should be:
3 2
5 2
6 1
The first column shows the number of elements per line, the second the number of lines that have that many elements. For example, the first line in the file has 5 elements and so does the 5th, and so on.
Print the count for the number of fields:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file
5 2
6 1
3 2
Pipe to sort for ordered output:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file | sort
3 2
5 2
6 1
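If you have GNU awk, you can skip the external sort and order the traversal inside awk itself (gawk-specific: PROCINFO["sorted_in"] controls the for-in iteration order):
$ gawk '{a[NF]++} END{PROCINFO["sorted_in"]="@ind_num_asc"; for(k in a) print k, a[k]}' file
3 2
5 2
6 1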
