How to sort, uniq and display lines that appear more than X times - bash

I have a file like this:
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.1
80.13.178.3
80.13.178.3
80.13.178.3
80.13.178.4
80.13.178.4
80.13.178.7
I need to display unique entries for repeated lines (similar to uniq -d), but only entries that occur more than a given number of times (more than twice in this example, so I need the flexibility to define that lower limit).
Output for this example should be like this when looking for entries with three or more occurrences:
80.13.178.2
80.13.178.3

Feed the output from uniq -cd to awk
sort test.file | uniq -cd | awk -v limit=2 '$1 > limit{print $2}'
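Using the sample data from the question (saved as test.file, as in the command above), this should produce exactly the expected output; change limit to raise or lower the threshold:
$ sort test.file | uniq -cd | awk -v limit=2 '$1 > limit{print $2}'
80.13.178.2
80.13.178.3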

With pure awk:
awk '{a[$0]++}END{for(i in a){if(a[i] > 2){print i}}}' a.txt
It iterates over the file and counts the occurrences of every IP. At the end of the file it outputs every IP that occurs more than 2 times.
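A sketch of the same idea with the threshold passed in as a variable, mirroring the limit parameter used above (a.txt is the filename from the command above; for-in order is unspecified, so pipe through sort if you need ordered output):
awk -v limit=2 '{a[$0]++} END{for (i in a) if (a[i] > limit) print i}' a.txt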

Related

How many times has the letter "N" or a repeated run of it (e.g. "NNNNN") been found in a text file?

I am given a file.txt (text file) with a string of data. Example contents:
abcabccabbabNababbababaaaNNcacbba
abacabababaaNNNbacabaaccabbacacab
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
aaababababababacacacacccbababNbNa
abababbacababaaacccc
I need to find the number of separate runs of "N" (each run being one or more consecutive Ns) in the file, using Unix commands.
I am unsure which commands to use, even after trying a range of different ones, for example:
$ grep -E -c "(N)+" file.txt
The output must be 6.
One way:
$ sed 's/[^N]\{1,\}/\n/g' file.txt | grep -c N
6
How it works:
Replace all sequences of one or more non-N characters in the input with a newline.
This turns strings like abcabccabbabNababbababaaaNNcacbba into
N
NN
Count the number of lines with at least one N (ignoring the empty lines).
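As a hypothetical spot check, feeding just the first sample line through the same pipeline counts its two runs (N and NN); note that \n in the replacement relies on GNU sed:
$ printf 'abcabccabbabNababbababaaaNNcacbba\n' | sed 's/[^N]\{1,\}/\n/g' | grep -c N
2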
Regular-expression free alternative:
$ tr -sc N ' ' < file.txt | wc -w
6
Uses tr to replace all runs of non-N characters with a single space, and counts the remaining words (which are the N sequences). It might not even need the -s option.
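Indeed, dropping -s should still give the same count, since wc -w already treats any run of blanks as a single separator (untested sketch):
$ tr -c N ' ' < file.txt | wc -w
6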
Using GNU awk (well, just tested with gawk, mawk, busybox awk and awk version 20121220 and it seemed to work with all of them):
$ gawk -v RS="^$" -F"N+" '{print NF-1}' file
6
It reads in the whole file as a single record, uses regex N+ as field separator and outputs the field count minus one. For other awks:
$ awk -v RS="" -F"N+" '{c+=NF-1}END{print c}' file
It reads empty-line-separated blocks as records, and for each block adds the field count minus one.
Here is an awk that should work on most systems.
awk -F'N+' '{a+=NF-1} END {print a}' file
6
It splits each line on one or more Ns and then sums the number of fields minus one per line.
If you have a text file and you want to count the number of times a sequence of Ns appears, you can do:
awk '{a+=gsub(/N+/,"")}END{print a}' file
This, however, will count sequences that are split over multiple lines as separate sequences. Example:
abcNNN
NNefg
If you want this to be counted as a single sequence, you should do:
awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}' file
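A quick comparison on the two-line snippet above illustrates the difference (hypothetical run):
$ printf 'abcNNN\nNNefg\n' | awk '{a+=gsub(/N+/,"")}END{print a}'
2
$ printf 'abcNNN\nNNefg\n' | awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}'
1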

Bash: Remove unique and keep duplicate

I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content in column 15 only appears once. So as far as I understand it's the reverse of
sort -u file.txt
After the lines that are unique in column 15 are removed, I would like to shuffle all lines again, so nothing is sorted. For this I would use
shuf file.txt
The resulting file should include only lines that have at least one duplicate (in column 15) but are in a random order.
I have tried to work around sort -u, but it only sorts out the unique lines and discards the actual duplicates I need. However, not only do I need the unique lines removed, I also want to keep every line of a duplicate, not just one representative of a duplicate.
Thank you.
Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt
awk '{print $15}' file.txt | sort | uniq -d returns a list of all the duplicate values in column 15.
The NR==FNR line in the first awk script turns this into an associative array.
The second line processes file.txt and prints any lines where column 15 is in the array.
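To also get the random order the question asks for, the filtered result can be piped through shuf (mentioned in the question), keeping the same tab-delimited assumption as above:
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt | shuf > newfile.txt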

Compare column1 in File with column1 in File2, output {Column1 File1} that does not exist in file 2

Below is my file 1 content:
123|yid|def|
456|kks|jkl|
789|mno|vsasd|
and this is my file 2 content
123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|
The only thing I want to compare between File 1 and File 2 is column 1. Based on the files above, the output should be only:
134|rst|uvw|
Line-to-line comparisons are not the answer, since columns 2 and 3 contain different things; only column 1 contains the exact same thing in both files.
How can I achieve this?
Currently I'm using this in my code:
#sort FILEs first before comparing
sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted
for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"
#for every oid in FILE 1, compare it with oid FILE 2 and output the difference
grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \ -f 2 > $FILE_1_tmp
You can do this in Awk very easily!
awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
Awk works by processing input lines one at a time. There are special clauses that Awk provides, BEGIN{} and END{}, which enclose actions to be run before and after the processing of the file.
So the part BEGIN{FS=OFS="|"} runs before the file processing happens. FS and OFS are special variables in Awk that stand for the input and output field separators. Since the file you provided is delimited by |, you need to parse it by setting FS="|", and to print it back with |, you set OFS="|".
The main part of the command comes after the BEGIN clause. The condition FNR==NR selects only the first file argument, because NR counts records across all files combined while FNR restarts for each file, so the two are equal only while the first file is being read. For each line of the first file, $1 is stored as a key of the array unique. When the second file is processed, the pattern !($1 in unique) prints only those lines whose $1 value is not in that array, i.e. the lines of file2 whose first column does not appear in file1.
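With the two sample files above saved as file1 and file2, a hypothetical run produces the expected line:
$ awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
134|rst|uvw|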
Here is another one-liner that uses join, sort and grep:
join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
grep -E -v '.*\|.*\|.*\|.*\|'
join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.
Since join requires input files to be sorted, we sort them.
Finally, grep removes from the output all lines that contain more than three | separators, i.e. the lines that were successfully joined and therefore carry fields from both files.
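To see why that works, this is roughly what the join step alone emits for the sample files (the paired lines contain extra | separators; exact output may vary slightly between join implementations):
$ join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2)
123|yid|def||abc|def|
134|rst|uvw|
456|kks|jkl||ghi|jkl|
789|mno|vsasd||mno|pqr|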

Bash to find count of multiple strings in a large file

I'm trying to get the count of various strings in a large txt file using bash commands.
I.e., find the count of the strings 'pig', 'horse', and 'cat' using bash, and get an output like 'pig: 7, horse: 3, cat: 5'. I would like a way to search through the txt file only once, because it is very large (so I do not want to search for 'pig' through the whole txt file, then go back and search for 'horse', etc.).
Any help with commands would be appreciated. Thanks!
grep -Eo 'pig|horse|cat' txt.file | sort | uniq -c | awk '{print $2": "$1}'
Breaking that into pieces:
grep -Eo 'pig|horse|cat'   Print all the occurrences (-o) of the extended (-E) regex
sort                       Sort the resulting words
uniq -c                    Output unique values (of sorted input) with the count (-c) of each value
awk '{print $2": "$1}'     For each line, print the second field (the word), then a colon and a space, and then the first field (the count).
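A quick sanity check on a tiny made-up input (the lines and counts here are hypothetical; the real file and counts come from the question):
$ printf 'pig cat\nhorse pig cat\npig\n' | grep -Eo 'pig|horse|cat' | sort | uniq -c | awk '{print $2": "$1}'
cat: 2
horse: 1
pig: 3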

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why):
ram,doc,doc,
shayam,eng,eng,
I am using the cut command:
cut -d',' -f1,4,4 a.csv
But the output is:
ram,doc
shyam,eng
That means cut can print a field only once. I need to print the same field twice or n times.
Why do I need this? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
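A hypothetical run on a.csv; note that this prints no trailing comma, so if you need one you can append an empty final field (print $1, $4, $4, ""):
$ awk -F , -v OFS=, '{print $1, $4, $4}' a.csv
ram,doc,doc
shaym,eng,eng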
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of listing all the columns in awk, I just used this (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
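For instance, on a small made-up two-column CSV this duplicates the second field in place:
$ printf 'ram,33\nshaym,23\n' | awk -F , -v OFS=, '$2=$2","$2'
ram,33,33
shaym,23,23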
