unique file based on 2 line match - bash

I have a file that has lines like this. I'd like to uniq it where every unique item consists of 2 lines. So since
bob
100
is here twice, I would only print it one time. Help please. Thanks.
bob
100
bill
130
joe
123
bob
100
joe
120

Try this:
printf "%s %s\n" $(< file) | sort -u | tr " " "\n"
Output:
bill
130
bob
100
joe
120
joe
123
With bash builtins:
declare -A a # declare associative array
while read name; do read value; a[$name $value]=; done < file
printf "%s\n" ${!a[#]} # print array keys
Output:
joe
120
joe
123
bob
100
bill
130

Try sed:
sed 'N;s/\n/ /' file | sort -u | tr ' ' '\n'
N: read the next line and append it to the pattern space
;: command separator
s/\n/ /: replace the embedded newline with a space
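On the sample input above, the intermediate output of the sed step (before sort -u and tr) looks like this:
bob 100
bill 130
joe 123
bob 100
joe 120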

I would use awk:
awk 'NR%2{l=$0}!(NR%2){seen[l"\n"$0]}END{for(i in seen)print i}' input
Let me explain the command in a multi-line version:
# On odd line numbers store the current line in l.
# Note that line numbers start at 1 in awk
NR%2 {l=$0}
# On even line numbers create an index in an associative array.
# The index is the last line plus the current line.
# Duplicates would simply overwrite themselves.
!(NR%2) {seen[l"\n"$0]}
# After the last line of input has been processed iterate
# through the array and print the indexes
END {for(i in seen)print i}
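If you prefer the commented multi-line form, you can save it to a file (say dedup_pairs.awk, a name used here just for illustration) and run it with awk's -f option:
awk -f dedup_pairs.awk input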

Related

How to read two lines in a shell script (first iteration line1, line2 - next iteration it should take line2, line3, and so on)

In my shell script, one of the variables contains a set of lines. I need to get two lines' worth of info per iteration, which my awk command requires.
var contains:
12 abc
32 cdg
9 dfk
98 nhf
43 uyr
5 ytp
Here, in a loop I need the content of line1, line2 [i.e. 12 abc \n 32 cdg], and the next iteration needs line2, line3 [32 cdg \n 9 dfk], and so on.
I tried to achieve this with:
while IFS= read -r line
do
count=`echo ${line} | awk -F" " '{print $1}'`
id=`echo ${line} | awk -F" " '{print $2}'`
read line
next_id=`echo ${line} | awk -F" " '{print $2}'`
echo ${count}
echo ${id}
echo ${next_id}
## Here, I have some other logic of awk..
done <<< "$var"
It reads line1, line2 in the first iteration. In the second iteration it reads line3, line4. But I need it to read line2, line3 in the second iteration. Can anyone please help sort out my requirement?
Thanks in advance.
Don't use a shell script spawning 3 separate subshells for awk per iteration when a single call to awk will do. It will be orders of magnitude faster for large input files.
You can group the lines as desired just by saving the first line in a variable, skipping to the next record, and then printing the line in the variable and the current record through the end of the file. For example, with your lines in the file named lines, you could do:
awk 'FNR==1 {line=$0; next} {print line"\n"$0"\n"; line=$0}' lines
Example Use/Output
$ awk 'FNR==1 {line=$0; next} {print line"\n"$0"\n"; line=$0}' lines
12 abc
32 cdg
32 cdg
9 dfk
9 dfk
98 nhf
98 nhf
43 uyr
43 uyr
5 ytp
(the additional line break was included simply to show separation; the output format can be changed as desired)
You can add a counter if desired and output the count via the END rule.
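For example, a pair counter could be added like this (just a sketch of the idea, with the total printed from the END rule):
awk 'FNR==1 {line=$0; next} {print line"\n"$0"\n"; line=$0; n++} END {print n " pairs"}' lines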
The solution depends on what you want to do with the two lines.
My first thought was something like
sed '2,$ s/.*/&\n&/' <<< "${yourvar}"
But this won't help much when you must process two lines (I think | xargs -L2 won't help).
When you want them in a loop, try
while IFS= read -r line; do
if [ -n "${lastline}" ]; then
echo "Processing lines starting with ${lastline:0:2} and ${line:0:2}"
fi
lastline="${line}"
done <<< "${yourvar}"

Add line numbers for duplicate lines in a file

My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as a delimiter here, but it doesn't matter what the delimiter is; I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can just paste the line numbers against the original text file.
I am just not sure how I can go through numbering the lines based on repeated entries.
All items in list are duplicated at least once. There are no single occurrences of a line in the file.
$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter, otherwise set it back to 1.
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
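To get the question's first format (counter, comma, line), the same idea can print the line as well; a sketch:
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter "," $0}' file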
Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l handles newlines
++$c{$_} increments the value assigned to the contents of the current line $_ in the hash table %c.
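A sketch of the same one-liner emitting the counter, a comma, and the original line:
perl -lne 'print ++$c{$_} . ",$_"' file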
Software tools method, given textfile as input:
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
Shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2

Counting the number of lines that have the same entry in the first column in bash

I have a data file that looks like the following:
123456, 1623326
123456, 2346525
123457, 2435466
123458, 2564252
123456, 2435145
The first column is the "ID" -- a string variable. The second column does not matter to me. I want to end up with
123456, 3
123457, 1
123458, 1
where the second column now counts how many entries there are in the original file that correspond with the unique "ID" in the first column.
Any solution in bash or perl would be fantastic. Even Stata would be good, but I figure this is harder to do in Stata. Please let me know if anything is unclear.
In Stata this is just
contract ID
cut -d',' -f1 in.txt | sort | uniq -c | awk '{print $2 ", " $1}'
gives:
123456, 3
123457, 1
123458, 1
This counts the number of lines with the same first six characters:
$ sort file | uniq -c -w6
3 123456, 1623326
1 123457, 2435466
1 123458, 2564252
Documentation
From man uniq:
-w, --check-chars=N
compare no more than N characters in lines
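If the ID, count format from the question is wanted, the uniq -c output can be reshaped, for example with awk (a sketch; it relies on the IDs keeping their trailing comma from the input):
sort file | uniq -c -w6 | awk '{sub(/,$/, "", $2); print $2", "$1}'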
Split off the number in the first field and use it as a hash key, increasing its count each time
use warnings;
use strict;
my $file = 'data_cnt.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my %cnt;
while (<$fh>) {
$cnt{(/^(\d+)/)[0]}++;
}
print "$_, $cnt{$_}\n" for keys %cnt;
The regex captures consecutive digits at the beginning of a line. As the match is returned as a list, we index into it to get the number, (/.../)[0], which is used as a hash key. When a number is seen for the first time it is added to the hash as a key and its value is set to 1 by ++. When a number that already exists as a key is seen, its value is incremented by ++. This is a typical frequency counter.
With your numbers in file data_cnt.txt this outputs
123457, 1
123456, 3
123458, 1
The output can be sorted by hash values, if you need that
say "$_, $cnt{$_}" for sort { $cnt{$b} <=> $cnt{$a} } (keys %cnt);
Prints
123456, 3
123457, 1
123458, 1
This can fit into a one-liner, if preferred for some reason
perl -nE '
$cnt{(/^(\d+)/)[0]}++;
}{ say "$_, $cnt{$_}" for sort { $cnt{$b} <=> $cnt{$a} } keys %cnt
' data_cnt.txt
It should be entered as one line at a terminal. The }{ is short for the END { } block. The code is the same as in the short script above. The -E is the same as -e, but it also enables the say feature.
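Written out without the }{ trick, the one-liner corresponds roughly to this explicit loop (a sketch, not literal deparse output):
perl -E '
    while (<>) {
        $cnt{ (/^(\d+)/)[0] }++;
    }
    say "$_, $cnt{$_}" for sort { $cnt{$b} <=> $cnt{$a} } keys %cnt;
' data_cnt.txt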
You can use awk:
awk 'BEGIN{FS=OFS=", "} counts[$1]++{} END{for (i in counts) print i, counts[i]}' file
123456, 3
123457, 1
123458, 1
FS=OFS=", " Sets input & output field separator as ", "
counts[$1]++{} increments a counter keyed by the first column in the counts array for every line; the {} is an empty, do-nothing action block
In the END block we iterate through the counts array and print each unique id and its count
cut, sort, uniq, sed version
cut -d',' -f1 | sort | uniq -c | sed 's/^ *\([^ ]*\) \(.*\)/\2, \1/'
or a simple Perl version, sorted by the first column
perl -F',' -anE'$s{$F[0]}++}{say"$_, $s{$_}"for sort keys%s'
or sorted by count descending and then by the first column
perl -F',' -anE'$s{$F[0]}++}{say"$_, $s{$_}"for sort{$s{$b}<=>$s{$a}or$a cmp$b}keys%s'
or in the order in which each key first appears
perl -F',' -anE'push@a,$F[0]if!$s{$F[0]}++}{say"$_, $s{$_}"for@a'
or just in pseudorandom order
perl -F',' -anE'$s{$F[0]}++}{say"$_, $s{$_}"for keys%s'
and so on.
Perl one-liner:
perl -naE '$h{$F[0]}++}{for(sort keys %h){say "$_ $h{$_}"}' file.txt
123456, 3
123457, 1
123458, 1
-n loops over each line in the file
-a splits each line on whitespace, and populates @F array with each entry
}{ denotes an END block, which allows us to iterate over the hash after all lines in the file have been processed
In Perl
$ perl -MData::Dump -ne "++$n{/(\d+)/}; END {dd \%n}" data.txt
{ 123456 => 3, 123457 => 1, 123458 => 1 }
With datamash:
datamash -W -s -g1 count 1 < data
Output:
123456, 3
123457, 1
123458, 1

Reorder lines of file by given sequence

I have a document A which contains n lines. I also have a sequence of n integers all of which are unique and <n. My goal is to create a document B which has the same contents as A, but with reordered lines, based on the given sequence.
Example:
A:
Foo
Bar
Bat
sequence: 2,0,1 (meaning: First line 2, then line 0, then line 1)
Output (B):
Bat
Foo
Bar
Thanks in advance for the help
Another solution:
You can create a sequence file by doing (assuming sequence is comma delimited):
echo $sequence | sed s/,/\\n/g > seq.txt
Then, just do:
paste seq.txt A.txt | sort -n | sed "s/^[0-9]*\s//"
Here's a bash function. The order can be delimited by anything.
Usage: schwartzianTransform "A.txt" 2 0 1
function schwartzianTransform {
local file="$1"
shift
local sequence="$@"
echo -n "$sequence" | sed 's/[^[:digit:]][^[:digit:]]*/\
/g' | paste -d ' ' - "$file" | sort -n | sed 's/^[[:digit:]]* //'
}
Read the file into an array and then use the power of indexing:
echo "Enter the input file name"
read ip
index=0
while read line ; do
NAME[$index]="$line"
index=$(($index+1))
done < $ip
echo "Enter the file having order"
read od
while read line ; do
echo "${NAME[$line]}";
done < $od
[aman@aman sh]$ cat test
Foo
Bar
Bat
[aman@aman sh]$ cat od
2
0
1
[aman@aman sh]$ ./order.sh
Enter the input file name
test
Enter the file having order
od
Bat
Foo
Bar
An awk one-liner could do the job:
awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
$s is your sequence.
Take a look at this example:
kent$ seq 10 >file #get a 10-line file
kent$ s=$(seq 0 9 |shuf|tr '\n' ','|sed 's/,$//') # get a random sequence by shuf
kent$ echo $s #check the sequence in var $s
7,9,1,0,5,4,3,8,6,2
kent$ awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
8
10
2
1
6
5
4
9
7
3
One way (not an efficient one for big files, though):
$ seq="2 0 1"
$ for i in $seq
> do
> awk -v l="$i" 'NR==l+1' file
> done
Bat
Foo
Bar
If your file is a big one, you can use this one:
$ seq='2,0,1'
$ x=$(echo $seq | awk '{printf "%dp;", $0+1;print $0+1> "tn.txt"}' RS=,)
$ sed -n "$x" file | awk 'NR==FNR{a[++i]=$0;next}{print a[$0]}' - tn.txt
The 2nd line prepares a sed command print instruction, which is then used in the 3rd line with the sed command. This prints only the line numbers present in the sequence, but not in the order of the sequence. The awk command is used to order the sed result depending on the sequence.

Sort a text file by line length including spaces

I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length including spaces. The following command doesn't include spaces; is there a way to modify it so it will work for me?
cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Answer
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key option.)
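For instance, length ties could be broken alphabetically with an explicit key specification (a sketch):
awk '{ print length, $0 }' testfile | sort -k1,1n -k2 | cut -d" " -f2-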
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
echo "hello awk world" | awk '{print}'
echo "hello awk world" | awk '{$1="hello"; print}'
They yield respectively
hello awk world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
"This forces awk to rebuild the record."
Test input including some lines of equal length:
aa A line with MORE spaces
bb The very longest line in the file
ccb
9 dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some spaces
1 dd equal len. Orig pos = 3
ff
5 dd equal len. Orig pos = 4
g
The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there, but if what you want is to get the job done quickly and don't care what language you do it in, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.
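That longest-first variant looks like this (a sketch):
perl -e 'print sort { length($b) <=> length($a) } <>'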
Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
10 sequential runs on a fast machine, averaged
Perl 5.24
awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
Caleb's perl solution took 11.2 seconds
my perl solution took 11.6 seconds
neillb's awk solution #1 took 20 seconds
neillb's awk solution #2 took 23 seconds
anubhava's awk solution took 24 seconds
Jonathan's awk solution took 25 seconds
Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Another perl solution
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Pure Bash:
declare -a sorted
while read line; do
if [ -z "${sorted[${#line}]}" ] ; then # does line length already exist?
sorted[${#line}]="$line" # element for new length
else
sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
fi
done < data.csv
for key in ${!sorted[*]}; do # iterate over existing indices
echo -e "${sorted[$key]}" # echo lines with equal length
done
Python Solution
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
Benchmark:
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real 0m5.308s
user 0m3.733s
sys 0m1.490s
perl -e 'print sort { length($a) <=> length($b) } <>'
real 0m8.840s
user 0m7.117s
sys 0m2.279s
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
With POSIX Awk:
{
c = length
m[c] = m[c] ? m[c] RS $0 : $0
} END {
for (c in m) print m[c]
}
Example
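A usage sketch, assuming the program above is saved as by_len.awk (a filename chosen here for illustration):
awk -f by_len.awk testfile
Note that the order in which for (c in m) visits the length buckets is unspecified in POSIX awk, so a final numeric sort may be needed with some implementations.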
1) A pure awk solution. Let's suppose that line length cannot be more than 1024; then:
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) A one-liner bash solution assuming all lines have just 1 word, but it can be reworked for any case where all lines have the same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1
Using Raku (formerly known as Perl 6):
~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:
~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
real 0m1.308s
user 0m1.213s
sys 0m0.173s
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > /dev/null
real 0m1.189s
user 0m1.170s
sys 0m0.050s
HTH.
https://raku.org
Here is a multibyte-compatible method of sorting lines by length. It requires:
wc -m is available to you (macOS has it).
Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:
l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and escapes every ' so the line can safely be echoed as a shell command (\047 is a single-quote in octal notation).
cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
sub(/ */, "", c); ← trims white space from the character count value returned by wc.
{ print c, $0 } ← prints the line's character count value, a space, and the original line.
| sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
| cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
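A minimal sketch of that gawk-only route, assuming a multibyte-aware gawk and a UTF-8 locale (the locale name below is just an example):
LC_ALL=en_US.UTF-8 gawk '{ print length($0), $0 }' testfile | sort -ns | cut -d" " -f2-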
Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE):
cat test.csv | while read LINE; do LEN=$(echo ${LINE} | wc -c); echo ${LINE} ${LEN}; done | sort -k 2n | cut -d ' ' -f 1
