Frequency count of particular field appended to line without deleting duplicates - sorting

Trying to work out how to get a frequency appended or prepended to each line in a file WITHOUT deleting duplicate occurrences (which uniq can do for me).
So, if input file is:
mango
mango
banana
apple
watermelon
banana
I need output:
mango 2
mango 2
banana 2
apple 1
watermelon 1
banana 2
All the solutions I have seen delete the duplicates. In other words, what I DON'T want is:
mango 2
banana 2
apple 1
watermelon 1

Basically, you cannot do it in one pass without keeping everything in memory. If that is acceptable, then use Python/Perl/awk/whatever; the algorithm is quite simple.
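For example, a simple two-pass awk sketch (it reads the file twice, so only one counter per distinct line has to be kept in memory):
$ awk 'NR==FNR { cnt[$0]++; next } { print $0, cnt[$0] }' input input
The first pass (where NR==FNR) counts each line; the second pass prints every line again with its count appended.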
Alternatively, let's do it with standard Unix tools. This is a bit cumbersome and can be improved, but it should do the job:
$ sort input | uniq -c > input.count
$ nl input | sort -k 2 > input.line
$ join -1 2 -2 2 input.line input.count | sort -k 2 -n | awk '{print $1 " " $3}'
The first step counts the number of occurrences of each word.
As you said, you cannot both keep the repeats and the original line ordering with the counting step alone, so we have to fix that: the second step prepends the line number that we will use later to fix the ordering issue.
In the last step, we join the two temporary files on the original word. The second column of the result contains the original line number, so we sort numerically on this key and strip it from the final output.
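For the sample input above, the two temporary files and the joined stream would look roughly like this (exact whitespace may differ):
$ cat input.count
      1 apple
      2 banana
      2 mango
      1 watermelon
$ cat input.line
     4  apple
     3  banana
     6  banana
     1  mango
     2  mango
     5  watermelon
$ join -1 2 -2 2 input.line input.count
apple 4 1
banana 3 2
banana 6 2
mango 1 2
mango 2 2
watermelon 5 1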

Related

Linux sort by column and in reverse order

I'm trying to sort a file by the second column, but in reverse order.
I tried:
sort -k2n -r file.txt
The output is not in reverse order, so it seems -r is being ignored.
I'm on CentOS.
Try adding a space after the -k and before the column position, e.g. something like below:
sort -k 2n -r file.txt
I just needed to remove the "n" next to the column number:
sort -k 2 -r file.txt
Say we have this.txt
one 1
two 2
three 3
four 4
five 5
Now simply do
$ sort -k2,2nr this.txt
five 5
four 4
three 3
two 2
one 1

How can I remove duplicates only once an X number of occurrences is reached with awk?

I know how to use awk to remove duplicate lines in a file:
awk '!x[$0]++' myfile.txt
But how can I remove the duplicates only if there are more than two occurrences of the same line?
For example:
apple
apple
banana
apple
pear
banana
cherry
would become:
banana
pear
banana
cherry
Thanks in advance!
I would harness GNU AWK for this task in the following way. Let file.txt content be:
apple
apple
banana
apple
pear
banana
cherry
then
awk 'FNR==NR{cnt[$0]+=1;next}cnt[$0]<=2' file.txt file.txt
gives output
banana
pear
banana
cherry
Explanation: This is a 2-pass approach. FNR==NR (the record number within the current file equals the overall record number) holds true only for the 1st file; there I simply count the number of occurrences in file.txt by increasing (+=) the value in array cnt, under the key being the whole line ($0), by 1, and then I instruct GNU AWK to go to the next line, as I do not want to do anything else. After that, only lines whose number of occurrences is less than or equal to two are output. Note: file.txt file.txt is intentional.
(tested in gawk 4.2.1)
If you don't care about output order, this would do what you want without reading the whole file into memory in awk:
$ sort file | awk '
$0!=prev { if (cnt<3) printf "%s", vals; prev=$0; cnt=vals="" }
{ vals=vals $0 ORS; cnt++ }
END { if (cnt<3) printf "%s", vals }
'
banana
banana
cherry
pear
The output of sort has all the values grouped together, so you only need to look at the count when the value changes to know how many of the previous value there were. sort still has to consider the whole input, but it's designed to handle massive files by using demand paging, etc., and so it is far more likely to be able to handle huge files than reading them all into memory in awk.
If you do care about output order, you could use a DSU (Decorate/Sort/Undecorate) approach, sketched below; see How to sort data based on the value of a column for part (multiple lines) of a file?
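A rough sketch of that DSU idea for this particular question, assuming file.txt from above and that the input contains no tab characters (a tab is used as the decoration separator):
# 1) decorate each line with its original line number
# 2) sort so identical lines end up grouped together
# 3) drop every group that occurs 3 or more times
# 4) restore the original order and strip the line numbers again
awk '{print NR "\t" $0}' file.txt |
sort -t$'\t' -k2 |
awk -F'\t' '
  $2!=prev { if (cnt<3) printf "%s", buf; prev=$2; cnt=buf="" }
  { buf=buf $0 ORS; cnt++ }
  END { if (cnt<3) printf "%s", buf }
' |
sort -n |
cut -f2-
For the sample file.txt this should print banana, pear, banana, cherry in that order.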

Sort file by one key only

I have a large log file comprising input from many sources, with each line prefixed with the hostname. The log is the output of operations happening in parallel across many hosts, so the logs are somewhat jumbled together.
What I'd like to do is sort the log by hostname and nothing else, so that the events for each server still show up in their natural order. The sort docs below seem to imply that -k1,1 should accomplish this, but it still results in the lines being fully sorted.
-k, --key=POS1[,POS2]
start a key at POS1 (origin 1), end it at POS2 (default end of line)
I've made a simple test file:
1 grape
1 banana
2 orange
3 lemon
1 apple
and the expected output would be:
1 grape
1 banana
1 apple
2 orange
3 lemon
But the observed output is:
$ sort -k1,1 sort_test.txt
1 apple
1 banana
1 grape
2 orange
3 lemon
sort -s -k 1,1 sort_test.txt
The -s disables 'last-resort' sorting, which sorts on everything that wasn't part of a specified key.
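With the test file from the question, the stable sort keeps lines with equal keys in their original relative order, so it produces the expected output:
$ sort -s -k 1,1 sort_test.txt
1 grape
1 banana
1 apple
2 orange
3 lemon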

Combining lines with same string in Bash

I have a file with a bunch of lines that look like this:
3 world
3 moon
3 night
2 world
2 video
2 pluto
1 world
1 pluto
1 moon
1 mars
I want to take each line that contains the same word, and combine them while adding the preceding number, so that it looks like this:
6 world
4 moon
3 pluto
3 night
2 video
1 mars
I've been trying combinations with sed, but I can't seem to get it right. My next idea was to sort them, and then check if the following line was the same word, then add them, but I couldn't figure out how to get it to sort by word rather than the number.
Sum and sort:
awk -F" " '{c[$2]+=$1} END {for (i in c){print c[i], i}}' | sort -n -r

How to find which line from first file appears most frequently in second file?

I have two lists. I need to determine which word from the first list appears most frequently in the second list. The first, list1.txt contains a list of words, sorted alphabetically, with no duplicates. I have used some scripts which ensures that each word appears on a unique line, e.g.:
canyon
fish
forest
mountain
river
The second file, list2.txt, is in UTF-8 and also contains many items. I have also used some scripts to ensure that each word appears on a unique line, but some items are not words, and some might appear many times, e.g.:
fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean
The script should output the most frequently matching item. For example, if run with the 2 files above, the output would be “fish”, because that word from list1.txt occurs most often in list2.txt.
Here is what I have so far. First, it searches for each word and creates a CSV file with the matches:
#!/bin/bash
while read -r line
do
  count=$(grep -c "^$line" list2.txt)
  echo "$line,$count" >> found.csv
done < ./list1.txt
After that, found.csv is sorted descending by the second column. The output is the word appearing on the first line.
I do not think, though, that this is a good script, because it is not very efficient, and it is possible that there might not be a single most frequent matching item. For example:
If there is a tie between 2 or more words, e.g. “fish”, “canyon”, and “forest” each appear 5 times while no other word appears as often, the output would be these 3 words in alphabetical order, separated by commas, e.g.: “canyon,fish,forest”.
If none of the words from list1.txt appears in list2.txt, then the output is simply the first word from the file list1.txt, e.g. “canyon”.
How can I create a more efficient script which finds which word from the first list appears most often in the second?
You can use the following pipeline:
grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1
-F tells grep to search for fixed strings, -f tells it to use list1.txt as the list of strings to search for. The rest sorts the matches, counts duplicates, and sorts them according to the number of occurrences. The last part selects the last line, i.e. the most common one (plus the number of occurrences).
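With the two sample files above, the end of the pipeline prints the winning word together with its match count, something like:
$ grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1
      3 fish
If only the word itself is wanted, appending something like | awk '{print $2}' strips the count.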
awk 'FNR==NR{a[$1]=0;next} ($1 in a){a[$1]++} END{for(i in a) print a[i], i}' list1.txt list2.txt | sort -rn | head -1
Assuming list1.txt is sorted, I would use Unix join:
sort list2.txt | join -1 1 -2 1 list1.txt - | sort |
uniq -c | sort -n | tail -n1

Resources