Mapping ids for 10 million records [duplicate] - bash

I have two text files,
File 1 with data like
User game count
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
...
File 2
1 Basketball
2 Football
3 Rugby
...
90 TT
91 Volleyball
...
Now what I want to do is add another column such that I have the corresponding index of the game from File 2 as an extra column in File 1.
I have 2 million entries in File 1, so I want to add another column specifying the index (basically the line number, or order) of the game from File 2. How can I do this efficiently?
Right now I am doing this line by line: reading a line from File 1, grepping File 2 for the corresponding game to get its line number, and saving/writing that to a file.
This will take me ages. How can I speed this up if I have 10 million rows in file 2 and 3000 rows in file 1?

With awk, read field 1 from File2 into an array indexed by field 2, then look that array up using field 2 from File1 as you iterate through it:
awk 'NR == FNR{a[$2]=$1; next}; {print $0, a[$2]}' File2 File1
A Rugby 2 3
A Football 2 2
B Volleyball 1 91
C TT 2 90

You can construct an associative array from the second file, with game names as keys and the game indices as values. Then, for each line in File 1, look up the wanted id in the array and write it out.
Associative array lookups take O(1) time on average.
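As a rough pure-bash illustration of the same idea (a sketch, not the fastest option: it assumes bash 4+ for declare -A, single-word game names, and the File1/File2 layout from the question; the awk one-liner above will be much faster on millions of rows):
#!/usr/bin/env bash
declare -A id                     # game name -> id (first field of File2)

# Build the lookup table from File2.
while read -r num game; do
    id[$game]=$num
done < File2

# Append the looked-up id to each line of File1.
while read -r user game count; do
    printf '%s %s %s %s\n' "$user" "$game" "$count" "${id[$game]}"
done < File1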

Use the join command:
$ cat file1
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
$ cat file2
1 Basketball
2 Football
3 Rugby
90 TT
91 Volleyball
$ join -1 2 -2 2 -o 1.1,1.2,1.3,2.1 \
<(sort -k2,2 file1) <(sort -k2,2 file2)
A Football 2 2
A Rugby 2 3
C TT 2 90
B Volleyball 1 91

Here's another approach: only read the small file into memory, and then read the bigger file line-by-line. Once each ID has been found, bail out:
awk '
NR == FNR {              # first file (file1): remember each line, keyed by game name
    f1[$2] = $0
    n++
    next
}
($2 in f1) {             # second file (file2): this game appears in file1
    print f1[$2], $1     # print the stored file1 line with the id appended
    delete f1[$2]
    if (--n == 0) exit   # all games found, stop reading the big file
}
' file1 file2
Rereading your question, I don't know if I've answered the question: do you want an extra column appended to file1 or file2?

Related

Combining multiple awk output statements into one line

I have some ascii files I’m processing, each with 35 columns and a variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
I’ve done similar processing in the past, but by outputting temp files for each awk command, reading each successive temp file in to eventually create a final ascii file. Then, I would delete the temp files after. I’m hoping there is an easier/faster method than having to create a bunch of temp files.
Below is an initial working processing step that the above awk commands would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There’s another processing step, for different data files, where I would also need the 2 new columns discussed earlier. It simply appends a unique file name, taken from whatever is being catted, to the last column of every row in a new ascii file. This command is actually in a loop with varying input files, but I’ve simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below is the desired example foo_new.txt, with the 2 requested columns of awk output as the last 2 columns. In this example, column 5 is the difference between columns 3 and 2, plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example foo_new.txt, the last column is an example of fname. These are computed in the shell script and passed to awk. I don't care whether the fname results (column 7) end up at the end or between columns 4 and 5, so long as it works with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
My best attempt so far is below, but unfortunately it produces a file with the original output first and the added output below it. I'd like the added output appended as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with a header line, like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2, plus 1), a column 6 that is column 1 divided by column 5 (printed with 1 decimal place), and a file name based on a variable fname passed to the script but with a unique value for each line. You only want lines where column 4 matches F followed by 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of every input file, rather than just the first line of the first file, change the NR == 1 condition to FNR == 1 instead.
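For example, a sketch combining those variations (same data file as above), numbering from 1 with zero-padded counters and skipping the header line of every input file:
awk -v fname=C '
FNR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname sprintf("%03d", ++counter)
}' data
This should print C001, C002 and C003 in the last column.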
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some "header"/special first line, so if it is the first line, preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
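Spelled out with the same placeholder values as above (N=9, another_column=10), that might look like:
awk -vN=9 -vanother_column=10 '
NR==1 { print $0, "36_header", "37_header"; next }
{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }
' input_file.tsv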
Taking the first 3 columns from the example you presented, and substituting 2 for N and 1 for another_column, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
This script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
gets closer to the output you want, I think:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by (N+1) you meant "the difference between two columns, with 1 added".

Bash: read one column from one file and divide by one column in the current file

I want to use bash to read one column from one file, divide a column in the current file by it, and replace that column.
For example, I have one file called input.txt:
1 2 3
1 4 3
1 8 3
And I want to read the second column of that file and divide the 3rd column of the current file, aim_file.txt, by it:
1 1 4
3 4 8
8 8 16
So I get result.txt:
1 1 2
3 4 2
8 8 2
Using awk you can do:
awk 'NR==FNR{a[FNR]=$2; next} a[FNR]{$3 /= a[FNR]} 1' input.txt aim_file.txt
Output:
1 1 2
3 4 2
8 8 2
We first iterate over input.txt and store the 2nd column in an array indexed by line number (FNR).
Next, while iterating over aim_file.txt, we divide the 3rd column by the value stored in the array for that line number.

Using awk to get the maximum value of a column, for each unique value of another column

So I have a file such as:
10 1 abc
10 2 def
10 3 ghi
20 4 elm
20 5 nop
20 6 qrs
30 3 tuv
I would like to get the maximum value of the second column for each value of the first column, i.e.:
10 3 ghi
20 6 qrs
30 3 tuv
How can I do this using awk or similar Unix commands?
You can use awk:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}' file
Output:
10 3 ghi
20 6 qrs
30 3 tuv
Explanation:
The awk command uses an associative array max with $1 as key and $2 as value. Every time we encounter a $2 greater than the value stored for that key, we update the entry and store the whole row in another associative array row under the same key. Finally, in the END section we simply iterate over the associative array row and print it.
A shorter alternative with sort:
$ sort -k1,1 -k2,2nr file | sort -u -k1,1
10 3 ghi
20 6 qrs
30 3 tuv
Sort by field one, then by field two (numeric, reverse) so that the max for each key ends up at the top of its group; the second sort then picks the first line for each key.

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a value, then subtract one from the other, and output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do this, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc.

Command line to sum frequency in concatenated file

I need to summarize the frequency of one column of several large tab-separated files.
An example of the content in the file is :
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
After concatenating the files, the trouble is the following:
Column 2 is essentially a frequency count of the number of times the combination of Column 0 and Column 1 was seen together.
I need to add the frequency of all of the identical combinations in Column 2 of the concatenated file.
For instance: If in File A the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
and in File B the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
the contents in the concatenated File C are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
I want to sum the frequencies of all identical combos in Column 0 and Column 1 in a File D to get the following results:
Blue table 6
Blue chair 4
Big cat 2
Small cat 4
I tried to sort and count the info with the following command:
sort <input_file> | uniq -c <output_file>
but the result is the following:
2 Big cat 1
2 Blue chair 2
2 Blue table 3
2 Small cat 2
Does anyone have a suggestion of a terminal command that can produce my desired results?
Thank you in advance for any help.
You're close; you have all the numbers you need. The total for each row is the count of rows that you got from uniq (column 1) times the frequency count (column 4). You can calculate that with awk:
sort input.txt | uniq -c | awk ' { print $2 "\t" $3 "\t" $1*$4 } '
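If the same Column 0/Column 1 combination could ever carry different counts in the source files, a small awk sketch that sums the count column directly (awk numbers fields from 1, so the count is $3; fileA and fileB are placeholder names) does not depend on the duplicate lines being identical:
awk -v OFS='\t' '{ sum[$1 OFS $2] += $3 } END { for (key in sum) print key, sum[key] }' fileA fileB
The order of the output lines is unspecified; pipe through sort if you need them ordered.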
