I need to summarize the frequency of one column of several large tab-separated files.
An example of the content in the file is :
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
After concatenating the files, the trouble is the following:
Column 2 essentially is a frequency count of the amount of times the combination of Column 0 and Column 1 were seen together.
I need to add the frequency of all of the identical combinations in Column 2 of the concatenated file.
For instance: If in File A the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
and in File B the contents are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
the contents in the concatenated File C are as follows:
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
Blue table 3
Blue chair 2
Big cat 1
Small cat 2
I want to sum the frequencies of all identical combos in Column 0 and Column 1 in a File D to get the following results:
Blue table 6
Blue chair 4
Big cat 2
Small cat 4
I tried to sort and count the info with the following command:
sort <input_file> | uniq -c <output_file>
but the result is the following:
2 Big cat 1
2 Blue chair 2
2 Blue table 3
2 Small cat 2
Does anyone have a suggestion of a terminal command that can produce my desired results?
Thank you in advance for any help.
You're close; you have all the numbers you need. The total for each row is the count of rows that you got from uniq (column 1) times the frequency count (column 4). You can calculate that with awk:
sort input.txt | uniq -c | awk ' { print $2 "\t" $3 "\t" $1*$4 } '
Related
Shell script to read two columns in CSV and count how many unique values in column two per each unique value in column 1
I have a sheet that looks like
1 a
2 a
3 b
3 c
2 a
2 f
2 a
1 d
The output I need is this:
1 2
2 3
3 2
cut -f 1,2 | sort | uniq -c | sort
I tried the above but I am doing something wrong. New to shell scripting here.
I have some ascii files I’m processing, with 35 columns each, and variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
I’ve done similar processing in the past, but by outputting temp files for each awk command, reading each successive temp file in to eventually create a final ascii file. Then, I would delete the temp files after. I’m hoping there is an easier/faster method than having to create a bunch of temp files.
Below is an initial working processing step, that the above awk commands would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There’s another processing step for different data files, that I would also need the 2 new columns discussed earlier. This is simply appending a unique file name from what’s being catted to the last column of every row in a new ascii file. This command is actually in a loop with varying input files, but I’ve simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below would be the example foo_new.txt desired. The requested 2 columns of output from awk (last 2 columns). In this example, column 5 is the difference between column 3 and 2 plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example foo_new.txt. The last column is an example of fname. These are computed in the shell script, and passed to awk. I don't care if the results in column 7 (fname) are at the end or placed between columns 4 and 5, so long as it gets along with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
The best luck so far, but unfortunately this is producing a file with the original output first, and the added output below it. I'd like to have the added output appended on as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2 plus 1), and a column 6 that is the value of column 1 divided by column 5 (with 1 decimal place in the output), and a file name that is based on a variable fname passed to the script but that has a unique value for each line. And you only want lines where column 4 matches F and 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some "header"/special "first line", so if it's the first line, then preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }`
Taking first 3 columns from the example script you presented, and substituting N for 2 and another_column for 1, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vIFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
Such script:
awk -vOFS=$'\t' -vIFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
I think get's closer output to what you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by that (N+1) you meant "the difference between two columns with 1 added".
Let's say I have this input file 49142202.txt:
A 5
B 6
C 3
A 4
B 2
C 1
Is it possible to sort the groups in column 1 by the value in column 2? The desired output is as follows:
B 6 <-- B group at the top, because 6 is larger than 5 and 3
B 2 <-- 2 less than 6
A 5 <-- A group in the middle, because 5 is smaller than 6 and larger than 3
A 4 <-- 4 less than 5
C 3 <-- C group at the bottom, because 3 is smaller than 6 and 5
C 1 <-- 1 less than 3
Here is my solution:
join -t$'\t' -1 2 -2 1 \
<(cat 49142202.txt | sort -k2nr,2 | sort --stable -k1,1 -u | sort -k2nr,2 \
| cut -f1 | nl | tr -d " " | sort -k2,2) \
<(cat 49142202.txt | sort -k1,1 -k2nr,2) \
| sort --stable -k2n,2 | cut -f1,3
The first input to join sorted by column 2 is this:
2 A
1 B
3 C
The second input to join sorted by column 1 is this:
A 5
A 4
B 6
B 2
C 3
C 1
The output of join is:
A 2 5
A 2 4
B 1 6
B 1 2
C 3 3
C 3 1
Which is then sorted by the nl line number in column 2 and then the original input columns 1 and 3 are kept with cut.
I know it can be done a lot easier with for example groupby of pandas of Python, but is there a more elegant way of doing it, while sticking to the use of GNU Coreutils such as sort, join, cut, tr and nl? Preferably I want to avoid a memory inefficient awk solution, but please share those as well. Thanks!
As explained in the comment my solution tries to reduce the number of pipes, unnecessary cat commands and more especially the number of pipeline sort operations since sorting is a complex/time consuming operation:
I reached the following solution where f_grp_sort is the input file:
for elem in $(sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}')
do
grep $elem <(sort -k2nr f_grp_sort)
done
OUTPUT:
B 6
B 2
A 5
A 4
C 3
C 1
Explanations:
sort -k2nr f_grp_sort will generate the following output:
B 6
A 5
A 4
C 3
B 2
C 1
and sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}' will generate the output:
B
A
C
the awk will just generate in the same order 1 unique element of the first column of the temporary output.
Then the for elem in $(...)do grep $elem <(sort -k2nr f_grp_sort); done
will grep for lines containing B then A, then C what will provide the required output.
Now as enhancement, you can use a temporary file to avoid doing sort -k2nr f_grp_sort operation twice:
$ sort -k2nr f_grp_sort > tmp_sorted_file && for elem in $(awk '!seen[$1]++{print $1}' tmp_sorted_file); do grep $elem tmp_sorted_file; done && rm tmp_sorted_file
So, this won't work for all cases, but if the values in your first column can be turned into bash variables, we can use dynamically named arrays to do this instead of a bunch of joins. It should be pretty fast.
The first while block reads in the contents of the file, getting the first two space separated strings and putting them into col1 and col2. We then create a series of arrays named like ARR_A and ARR_B where A and B are the values from column 1 (but only if $col1 only contains characters that can be used in bash variable names). The array contains the column 2 values associated with these column 1 values.
I use your fancy sort chain to get the order we want column 1 values to print out in, we just loop through them, then for each column 1 array we sort the values and echo out column 1 and column 2.
The dynamc variable bits can be hard to follow, but for the right values in column 1 it will work. Again, if there's any characters that can't be part of a bash variable name in column 1, this solution will not work.
file=./49142202.txt
while read col1 col2 extra
do
if [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]]
then
eval 'ARR_'${col1}'+=("'${col2}'")'
else
echo "Bad character detected in Column 1: '$col1'"
exit 1
fi
done < "$file"
sort -k2nr,2 "$file" | sort --stable -k1,1 -u | sort -k2nr,2 | while read col1 extra
do
for col2 in $(eval 'printf "%s\n" "${ARR_'${col1}'[#]}"' | sort -r)
do
echo $col1 $col2
done
done
This was my test, a little more complex than your provided example:
$ cat 49142202.txt
A 4
B 6
C 3
A 5
B 2
C 1
C 0
$ ./run
B 6
B 2
A 5
A 4
C 3
C 1
C 0
Thanks a lot #JeffBreadner and #Allan! I came up with yet another solution, which is very similar to my first one, but gives a bit more control, because it allows for easier nesting with for loops:
for x in $(sort -k2nr,2 $file | sort --stable -k1,1 -u | sort -k2nr,2 | cut -f1); do
awk -v x=$x '$1==x' $file | sort -k2nr,2
done
Do you mind, if I don't accept either of your answers, until I have time to evaluate the time and memory performance of your solutions? Otherwise I would probably just go for the awk solution by #Allan.
I am thinking on the following problem.
I can have an array of strings like
Col1 Col2 Col3 Col4
aa aa aa aa
aaa aaa aaaaa aaa
aaaa aaaaaaa aa a
...........................
Actually it is CSV file. And I should find a way to divide this vertically into one or more files. Condition for splitting is that no one file contain no row that exceeds some bytes. For simplicity we can rewrite that array with lengths:
Col1 Col2 Col3 Col4
2 2 2 2
3 3 5 3
4 7 2 1
...........................
And let's say the limit is 10, i.e. if > 9 we should split. So if we split into 2 files [Col1, Col2, Col3] and [Col4] this will not satisfy the condition because the first file will contain 3 + 3 + 5 > 9 in the second row and 4 + 7 + 2 > 9 in the third row. If we split into [Col1, Col2] and [Col3, Col4] this will not satisfy the condition because the first file will contain 4 + 7 > 9 in the third row. So we are splitting this into 3 files like [Col1], [Col2, Col3] and [Col4]. Now every file is correct and looks like:
File1 | File2 | File3
------------------------------
Col1 | Col2 Col3 | Col4
2 | 2 2 | 2
3 | 3 5 | 3
4 | 7 2 | 1
...............................
So it should split from left to right giving maximum columns as possible to the left file. The problem is that this file can be huge and I don't want to read it into memory and so we read the initial file line by line and somehow I should determine a set of indexes to split. If that is possible at all? I hope I described the problem well, so you can understand it.
Generally awk is quite good at handling large csv files.
You could try something like this to retrieve the max length for each column and then decide how to split.
Let's say the file.txt contains
Col1;Col2;Col3;Col4
aa;aa;aa;aa
aaa;aaa;aaaaa;aaa
aaaa;aaaaaaa;aa;a
(Assuming windows style quotes) Running the following :
> awk -F";" "NR>1{for (i=1; i<=NF; i++) max[i]=(length($i)>max[i]?length($i):max[i])} END {for (i=1; i<=NF; i++) printf \"%d%s\", max[i], (i==NF?RS:FS)}" file.txt
Will output :
4;7;5;3
Could you try this on your real data set ?
This question already has answers here:
Efficient way to map ids
(2 answers)
Closed 9 years ago.
I have two text files,
File 1 with data like
User game count
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
...
File 2
1 Basketball
2 Football
3 Rugby
...
90 TT
91 Volleyball
...
Now what I want to do is add another column to File 2 such that I have the corresponding index of the game from File 2 as an extra column in File 1.
I have 2 million entries in File 1. So I want to add another column specifying the index(basically the line number or order) of the game from file 2. How can I do this efficiently.
Right now I am doing this line by line. Reading a line from file 1, grep the corresponding game from file 2 for its line number and saving/writing that to a file.
This will take me ages. How can I speed this up if I have 10 million rows in file 2 and 3000 rows in file 1?
With awk, read field 1 from File2 into an array indexed by field 2, look up the array using field 2 from File1 as you iterate through it
awk 'NR == FNR{a[$2]=$1; next}; {print $0, a[$2]}' File2 File1
A Rugby 2 3
A Football 2 2
B Volleyball 1 91
C TT 2 90
You can construct an associative array from the second file, with game names as keys and the game index as values. then for each line in file 1 search the array for the wanted id, and write it back
Associative arrays provide O(1) time complexity.
Use the join command:
$ cat file1
A Rugby 2
A Football 2
B Volleyball 1
C TT 2
$ cat file2
1 Basketball
2 Football
3 Rugby
90 TT
91 Volleyball
$ join -1 3 -2 1 -o 1.1,1.2,1.3,2.2 \
<(sort -k 3 file1) <(sort -k 1 file2)
B Volleyball 1 Basketball
A Football 2 Football
A Rugby 2 Football
C TT 2 Football
Here's another approach: only read the small file into memory, and then read the bigger file line-by-line. Once each ID has been found, bail out:
awk '
NR == FNR {
f1[$2] = $0
n++
next
}
($2 in f1) {
print f1[$2], $1
delete f1[$2]
if (--n == 0) exit
}
' file1 file2
Rereading your question, I don't know if I've answered the question: do you want an extra column appended to file1 or file2?