Awk & sed text manipulation (extract most negative value from specific group) - shell

I have a text manipulation problem that I need to solve with awk, sed and shell.
My text looks like this:
>Sample_1
100 101
aaattattacaaaaataattacaaattattacaaaaagaattattacaaaaagaattacaaaa
-1.60 .(((((((.....)))))))........................................... []
>Sample_2
1 35
aattattacaaaaagaattattacaaaaagaatta
0.00 ................................... _
>Sample_3
1 123
gctcacacctgtaatcccagcactttgggaggctgagg
-27.80 ((((.....))))......((((((.(((...))))))).)[][][[][]]
-26.40 (((((.((...(((((..((((((....))......... [[][]][]
-25.80 ((((.....)))).....((((((............... [][][][[][]]
123 145
ctgaggcaggcagatcacgaggtcacgagatcaa
-26.20 (((.....)))))) [][][[][]]
-25.90 ....((((..((....)) [][[][]]
-25.70 ..(((..((....))..(()) [[][]][[][]]
145 256
gtaatcccagcactttgggaggctgaggcaggcaga
0.00 ........................................... _
256 342
-25.00 ..((....((((.....((((((...)))....))... [[][]]
-24.00 ..((.((((.((((())... [[][][]]
-23.70 .((((((...(((((..((.. [[][]][]
I want to:
Extract Sample name (>Sample_1);
Extract numeric value that goes after the sample name (it's either 0 or negative value);
From the negative value group (e.g. -27.80;-26.40;-25.80) extract number that goes first (it's the most negative value).
Perfect output would look like this:
>Sample_1
-1.60
>Sample_2
0.00
>Sample_3
-27.80
-26.20
0.00
-25.00
I tried to do this in awk by printing $1 and grepping for '>', 0 and negative values, but wasn't able to split the column into groups and extract the most negative value from each group.
awk '{print $1}' file | egrep -i '>|0.00|-'

You tagged your question with sed and awk, but if you're O.K. with Perl instead, you could write:
#!/usr/bin/perl -w
use warnings;
use strict;
my $min = undef;
while(<>)
{
if(m/^(-?\d+\.\d+)/)
{
if(! defined($min) || $1 < $min)
{ $min = $1; }
}
else
{
if(defined $min)
{
print "$min\n";
$min = undef;
}
if(m/^>/)
{ print; }
}
}
if(defined $min)
{ print "$min\n"; }

awk '/^[0-]/ && new_group {print $1} {new_group = (/^[ \t]/)} /^>/' file
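The one-liner above detects a new group by looking for lines that start with a space or tab, so it presumably expects indented input. If the file has no leading whitespace, as the sample appears here, a sketch along the following lines might be closer. The function name first_energy is made up for illustration, and the code assumes every energy line starts with a decimal number such as -27.80 or 0.00 followed by whitespace.

```shell
# Hypothetical helper: print each ">Sample" header and the first energy
# value of every consecutive run of energy lines.
first_energy() {
  awk '
    /^>/ { print; in_group = 0; next }        # sample header
    /^-?[0-9]+\.[0-9]+[ \t]/ {                # energy line, e.g. "-27.80 ((..))"
      if (!in_group) print $1                 # only the first line of a run
      in_group = 1
      next
    }
    { in_group = 0 }                          # any other line ends the run
  ' "$@"
}

first_energy <<'EOF'
>Sample_3
1 123
gctcacacctgtaatcc
-27.80 ((((.....)))) [][]
-26.40 (((((.((...)) [[][]]
123 145
ctgaggcaggcagatc
0.00 ................ _
EOF
```

Since the values within a group are already sorted with the most negative first, taking the first line of each run is enough and no numeric comparison is needed.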

Related

To split and arrange number in single inverted

I have around 65000 product codes in a text file. I want to split those numbers into groups of 999 each, and then have each group of 999 numbers wrapped in single quotes and separated by commas.
Could you please suggest how I can achieve this with a Unix script?
87453454
65778445
.
.
.
.
Till 65000 productscodes
Need to arrange in below pattern:
'87453454','65778445',
With awk:
awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 999 { print out; c = 0 }
END { if (c) print out }
' file
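To check the grouping logic on a small input, the same idea can be tried with a configurable group size. This is only a sketch: the function name quote_groups and the awk variable n are my own additions, not part of the answer above.

```shell
# Hypothetical helper: quote each input line and join them N per output line.
quote_groups() {   # usage: quote_groups N < codes.txt
  awk -v n="$1" '
    { out = out (c++ ? "," : "") "\047" $0 "\047" }   # append the quoted code
    c == n { print out; out = ""; c = 0 }             # group is full: flush it
    END { if (c) print out }                          # leftover partial group
  '
}

printf '%s\n' 11 22 33 44 55 66 77 | quote_groups 3
```

With a group size of 3 the seven sample codes come out as two full lines and one short line; calling it with 999 should behave like the script above.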
Or, with GNU sed:
sed "
:a
\$bb
N
0~999{
:b
s/\n/','/g
s/^/'/
s/$/'/
b
}
ba" file
With Perl:
perl -ne '
sub pq { chomp; print "\x27$_\x27" } pq;
for (1 .. 998) {
if (defined($_ = <>)) {
print ",";
pq
}
}
print "\n"
' < file
Credit for Mauke perl#libera.chat
65000 isn't that many lines for awk - just do it all in one shot :
mawk 'BEGIN { FS = RS; RS = "^$"; OFS = (_="\47")(",")_
} gsub(/^|[^0-9]*$/,_, $!(NF = NF))'
'66771756','69562431','22026341','58085790','22563930',
'63801696','24044132','94255986','56451624','46154427'
That's for grouping them all in one line. To make 999 ones, try
jot -r 50 10000000 99999999 |
# change "5" to "999" here
rs -C= 0 5 |
mawk 'sub(".*", "\47&\47", $!(NF -= _==$NF ))' FS== OFS='\47,\47'
'36452530','29776340','31198057','36015730','30143632'
'49664844','83535994','86871984','44613227','12309645'
'58002568','31342035','72695499','54546650','21800933'
'38059391','36935562','98323086','91089765','65672096'
'17634208','14009291','39114390','35338398','43676356'
'14973124','19782405','96782582','27689803','27438921'
'79540212','49141859','25714405','42248622','25589123'
'11466085','87022819','65726165','86718075','56989625'
'12900115','82979216','65469187','63769703','86494457'
'26544666','89342693','64603075','26102683','70528492'
_==$NF checks whether the rightmost column is empty,
i.e. whether there's a trailing separator that needs to be trimmed.
If your input file only contains short codes as shown in your example, you could use the following hack:
xargs -L 999 bash -c "printf \'%s\', \"\$@\"; echo" . <inputFile >outputFile
Alternatively, you can use this sed command:
sed -Ene"s/(.*)/'\1',/;H" -e{'0~999','$'}'{z;x;s/\n//g;p}' <inputFile >outputFile
s/(.*)/'\1',/ wraps each line in '...',
but does not print it (-n)
instead, H appends the modified line to the so called hold space; basically a helper variable storing a single string.
(This also adds a line break as a separator, but we remove that later).
Every 999 lines (0~999) and at the end of the input file ($) ...
... the hold space is then printed and cleared (z;x;...;p)
while deleting all delimiter-linebreaks (s/\n//g) mentioned earlier.

Replacing numbers with their respective strings in awk

I am a newbie in bash/awk programming and I have a file that looks like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . 1
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . 1,4
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . 8,64,512
With awk, I want to change the numbers in the last column ($10) with their descriptions. I assigned the numbers and their definitions in two different arrays. The way I was thinking was to change these numbers by iterating the two array together. Here, 0 is "unknown", 1 is "germline", 4 is "somatic" and goes on.
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next}
{
x=split(a, e, /,/)
y=split(b, f, /,/)
delete c
m=split($10, c, /,/)
for (i=1; i<=m; i++) {
for (j=1; j<=x; j++) {
if (c[i]==e[j]) {
c[i]=f[j]
}
}
$10+=sprintf("%s, ",c[i])
}
print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf
The output should look like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,paternal
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
I would be so glad if you could help me!
All the best
Assuming you don't really need to define the lists of numbers and names as 2 shell arrays for some other reason:
$ cat tst.awk
BEGIN {
split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
for (i in nrsArr) {
nr2name[nrsArr[i]] = namesArr[i]
}
}
!/#/ {
n = split($NF,nrs,/,/)
sub(/[^[:space:]]+$/,"")
printf "%s", $0
for (i=1; i<=n; i++) {
printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
}
}
$ awk -f tst.awk input.vcf
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
The above preserves whatever white space you had in your input file in case that matters.
You may use this awk:
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
awk -v z="${z[*]}" -v t="${t[*]}" '
BEGIN {
split(z, zarr)
split(t, tarr)
for (i=1; i in zarr; ++i)
map[zarr[i]] = tarr[i]
}
{
split($NF, arr, /,/)
s = ""
for (i=1; i in arr; ++i)
s = s (i == 1 ? "" : ",") map[arr[i]]
$NF = s;
}
1
' file
btw number 4 is mapped to inherited not paternal as you have in your expected output.
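To double-check which name a given code maps to, the lookup can be tried on a single value outside the file. This is a sketch only: decode_origin is a hypothetical name, the two lists are copied from the question, and passing unknown codes through unchanged is my own assumption.

```shell
# Hypothetical helper: translate a comma-separated list of codes into names.
decode_origin() {   # usage: decode_origin "1,4"
  awk -v codes="$1" 'BEGIN {
    split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824", nr, " ")
    n = split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other", nm, " ")
    for (i = 1; i <= n; i++) map[nr[i]] = nm[i]       # code -> name lookup table
    m = split(codes, c, ",")
    for (i = 1; i <= m; i++)                          # unknown codes are kept as-is
      out = out (i > 1 ? "," : "") ((c[i] in map) ? map[c[i]] : c[i])
    print out
  }'
}

decode_origin "1,4"        # germline,inherited
decode_origin "8,64,512"   # paternal,biparental,tested-inconclusive
```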
Use this short Perl in-line script:
perl -F'\t' -lane '
BEGIN {
@keys = qw( 0 1 2 4 8 16 32 64 128 256 512 1024 1073741824 );
@vals = qw( unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other );
%val = map { $keys[$_] => $vals[$_] } 0..$#keys;
}
print join "\t", @F[0..8], ( join ",", map { $val{$_} } split /,/, $F[9] );
' in_file > out_file
The Perl script uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\t/' : Split into @F on TAB, rather than on whitespace.
%val = map { $keys[$_] => $vals[$_] } 0..$#keys; : Create %val - a hash lookup table with keys = numeric codes and values = mutation/variant types.
Note that in Perl, arrays are 0-indexed.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
Assumptions:
OP has confirmed beforehand that the z and t arrays are valid (eg, same number of elements in both arrays)
OP may want to (dynamically) change the contents of the z and t arrays so we'll leave the array assignments at the bash level (ie, won't hardcode inside of awk)
the substitution strings could contain white space so we'll keep OP's current method of building comma-delimited strings (from the z and t arrays); also assumes replacement strings do not contain commas; this should simplify parsing of the replacement strings within awk
while OP has explicitly coded for (awk) field #10, we'll assume this number could change; we'll focus on processing the last field in a row
Small change to initialization code:
# original arrays
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
# renamed variables (format: x,y,z,...)
nums=$(IFS=,; echo "${z[*]}")
alphas=$(IFS=,; echo "${t[*]}")
One awk idea:
awk -v nums="${nums}" -v alphas="${alphas}" ' # pass comma-delimited variables to awk
BEGIN { OFS="\t" # copied from original code
n=split(nums,num,/,/) # split comma-delimited variables
a=split(alphas,alpha,/,/) # into arrays
}
/#/ { next } # copied from original code
{ l=split($NF,lastf,/,/) # split the last (comma-delimited) field
$NF="" # clear the last field
pfx="" # initialize our prefix string
for (i=1; i<=l; i++) # loop through entries in the last field
for (j=1; j<=n; j++) # loop through array of numbers
if ( lastf[i] == num[j] ) # if array entries match ...
{ $NF= $NF pfx alpha[j] # append the associated alpha to the last field
pfx="," # set the prefix to "," for the next item
break # break out one level to process next entry in the last field
}
}
{ print } # print the current line (with modified last field)
' input.vcf
The above generates:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive

Removing duplicate rows in .tsv while keeping some of the data (bash, perl)

I have a number of large .tsv files such as the following:
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04 15.9 S1
3 chr1_12570 0.9 45.3 S2
2 chr1_12500 0.03 13.8 S3
I would like to remove duplicate rows based on the pos column, while still keeping the values of both rows for columns 3 and 5 so that the output could look something like this:
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
My idea was to first sort the .tsv files using the shell sort function:
sort -k 2,2 *.tsv
And then write a script that would compare each line to the following line.
If the string in the pos column is the same for both lines, then it would concatenate the values of column 3 and 5 in row n+1 to the ones in row n.
However I have no idea how to do this.
I am familiar with awk/sed/grep/bash but also have some (limited) perl basics.
Thanks for your help !
Here is an example of how you could approach it in Perl:
use feature qw(say);
use strict;
use warnings;
my $fn = 'file1.tsv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my @pos;
my %info;
while( my $line = <$fh> ) {
chomp $line;
my ($nbr, $pos, $pvalue, $percentage, $samplename) = split /\t/, $line;
if ( !exists $info{$pos} ) {
$info{$pos} = {
nbr => $nbr,
pvalue => [$pvalue],
percentage => $percentage,
samplename => [$samplename],
};
push @pos, $pos;
}
else {
push @{$info{$pos}{pvalue}}, $pvalue;
push @{$info{$pos}{samplename}}, $samplename;
}
}
close $fh;
print $header;
for my $pos (@pos) {
my $data = $info{$pos};
say join "\t", $data->{nbr}, $pos,
(join ",", @{$data->{pvalue}}), $data->{percentage},
(join ",", @{$data->{samplename}});
}
Output:
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
file "myscript":
#! /usr/bin/env bash
file="$1"
result="$(tr -s '\t' < "${file}" | tail -n +2 |
awk -F'\t' -v OFS='\t' '
$0 == "" {
next
}
# MAIN
{
if (col3[$2] == "") {
col1[$2] = $1
col3[$2] = $3
col4[$2] = $4
col5[$2] = $5
} else {
col3[$2] = col3[$2]","$3
col5[$2] = col5[$2]","$5
}
}
END {
for (pos in col1) {
print col1[pos], pos, col3[pos], col4[pos], col5[pos]
}
}
' | sort -k 2,2 )"
first_line="$(head -n 1 "${file}")"
echo "${first_line}"
echo "${result}"
Run it as:
bash myscript <your tsv file>
It will write result to stdout.
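The script above walks the result with for (pos in col1), whose order awk leaves unspecified, which is why it pipes through sort afterwards. If the first-seen order of the pos values should be kept instead, a sketch like the following might do. The function name merge_by_pos and the array names are hypothetical, and it assumes a tab-separated file with a single header line.

```shell
# Hypothetical helper: merge rows sharing the same pos (column 2),
# collecting columns 3 and 5 as comma-separated lists.
merge_by_pos() {   # usage: merge_by_pos < input.tsv
  awk -F'\t' -v OFS='\t' '
    NR == 1  { print; next }             # header passes through unchanged
    $0 == "" { next }                    # skip empty lines
    !($2 in p) {                         # first time this pos is seen
      order[++n] = $2                    # remember the input order
      row[$2] = $1; pct[$2] = $4
      p[$2] = $3;   s[$2] = $5
      next
    }
    { p[$2] = p[$2] "," $3; s[$2] = s[$2] "," $5 }   # duplicate pos: append
    END {
      for (i = 1; i <= n; i++) {
        k = order[i]
        print row[k], k, p[k], pct[k], s[k]
      }
    }
  '
}

printf '%s\t%s\t%s\t%s\t%s\n' \
  rownbr pos pvalue percentage samplename \
  1 chr1_12000 0.05 5.6 S1 \
  1 chr1_12500 0.04 15.9 S1 \
  3 chr1_12570 0.9 45.3 S2 \
  2 chr1_12500 0.03 13.8 S3 | merge_by_pos
```

As in the other answers, rownbr and percentage are taken from the first row seen for each pos.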
Using a combination of GNU datamash and awk to get just the desired columns:
$ datamash --header-in -sf -g2 collapse 3,5 < input.tsv | \
awk 'BEGIN { FS=OFS="\t"; print "rownbr\tpos\tpvalue\tpercentage\tsamplename" }
{ print $1, $2, $6, $4, $7 }'
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
Ignore the header line in the file (--header-in), group records on the second column (-g2), sort based on that column (-s), output the full line (-f) in addition to the given operations, and for the 3rd and 5th columns, collapse all rows of the group into a single CSV entry. Then use awk to put the desired columns in the right order.
Perl is a perfect tool for this task.
Save the header of the data for later output.
Extract the pos field to be used as the hash key.
Save the line into a hash if we have not seen this pos before; otherwise
merge the value and name into the stored line.
Once all lines are processed, output the result (in this case I use 'format' and write).
use strict;
use warnings;
use feature 'say';
my(@pos,%seen,%lines);
my $header = <DATA>; # obtain header
chomp $header;
while(<DATA>) {
next if /^\s*$/; # skip empty lines
chomp;
my $key = (split '\s+')[1]; # extract 'pos' to use as $key
if( $seen{$key} ) {
my($value,$name) = (split '\s+')[2,4]; # extract value and name
$lines{$key} =~ s/(\d\s+\S+\s+\S+)/$1,$value/; # merge value
$lines{$key} =~ s/$/,$name/; # merge name
} else {
push @pos, $key; # preserve order
$lines{$key} = $_; # store lines in a hash
$seen{$key} = 1;
}
}
say $header; # output header
my @data;
for (@pos) { # use stored hash 'indexes'
@data = split '\s+',$lines{$_}; # split into fields
write; # output
}
# format STDOUT_HEADER =
# rownbr pos pvalue percentage samplename
# .
format STDOUT =
@<<<<< @<<<<<<<<< @<<<<<<<< @<<<<< @<<<<<<<<<<<<
$data[0],$data[1],$data[2],$data[3],$data[4]
.
__DATA__
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04 15.9 S1
3 chr1_12570 0.9 45.3 S2
2 chr1_12500 0.03 13.8 S3
Output
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2

i have a protein sequence file i want to count trimers in it using sed or grep

I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1
In this code I am storing each record in a separate file, which is not good. The program is also very slow: in 12 hours only 12000 of the 0.5 million records were processed.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
colNr = NR
rowNr = 0
name[colNr] = $1
lgth[colNr] = length($2)
delete name2nr
for (i=1;i<=(length($2)-2);i++) {
trimer = substr($2,i,3)
if ( !(trimer in name2nr) ) {
name2nr[trimer] = ++rowNr
nr2name[colNr,rowNr] = trimer
}
cnt[colNr,name2nr[trimer]]++
}
numCols = colNr
numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
}
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
}
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
}
}
}
If instead you want output like in @rogerovo's perl answer that'd be much simpler than the above and more efficient and use far less memory:
$ cat tst2.awk
{
delete cnt
for (i=1;i<=(length($2)-2);i++) {
cnt[substr($2,i,3)]++
}
printf "%s;%s", $1, length($2)
for (trimer in cnt) {
printf ";%s=%s", trimer, cnt[trimer]
}
print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
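The order of for (trimer in cnt) is unspecified, so the trimers above come out unsorted. If the layout from the question is wanted (ID, length, then one sorted trimer per line), one possible sketch is to hand each record's counts to sort. The function name trimers_sorted is hypothetical, and starting one sort process per record is an assumption that may be too slow for half a million records.

```shell
# Hypothetical helper: per record, print ID, sequence length,
# then "TRIMER COUNT" lines in sorted order.
trimers_sorted() {   # usage: trimers_sorted file   (or read stdin)
  awk '{
    n = length($2)
    split("", cnt)                        # portable way to empty the array
    for (i = 1; i <= n - 2; i++)
      cnt[substr($2, i, 3)]++             # count every overlapping 3-mer
    print $1
    print n
    fflush()                              # flush before sort writes to the same stdout
    for (t in cnt) print t, cnt[t] | "sort"
    close("sort")                         # wait for sort so records stay in order
  }' "$@"
}

printf 'Q5768D AKCCACAKCCAC\n' | trimers_sorted
```

For the sample record this should list ACA, AKC, CAC, CAK, CCA and KCC with the counts shown in the question.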
This perl script processes cca 550'000 "trimmers"/sec. (random valid test sequences 0-8000 chars long, 100k records (~400MB) produce an 2GB output csv)
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
$c++; chomp;
# is it a valid line? has the format and a sequence to process
if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
print join ";",($1,length($2));
my %trimdb;
my $seq=$2;
#split the sequence into chars
my @a=split //,$seq;
my @trimmer;
# while there are unprocessed chars in the sequence...
while (scalar @a) {
# fill up the buffer with a char from the top of the sequence
push @trimmer, shift @a;
# if the buffer is full (has 3 chars), increase the trimer frequency
if (scalar @trimmer == 3 ) {
$trimdb{(join "",@trimmer)}++;
# drop the first letter from buffer, for next loop
shift @trimmer;
}
}
# we're done with the sequence - print the sorted list of trimers
foreach (sort keys %trimdb) {
#print in a csv (;) line
print ";$_=$trimdb{$_}";
}
print"\n";
}
else {
#the input line was not valid.
print STDERR "input error: $_\n";
}
# just a progress counter
printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
if you have perl installed (most linuxes do, check the path /usr/bin/perl or replace with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

Find smallest missing integer in an array

I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in ${array[@]}; do
seen[$i]=1
done
for((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
You can iterate from minimal to maximal number and take first non existing element,
use List::Util qw( first );
my @arr = sort {$a <=> $b} qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
@seen{@arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;
my @data = qw/ 1 2 4 5 6 /;
say first_missing(@data);
@data = ( 4 .. 99, 101 .. 122 );
say first_missing(@data);
sub first_missing {
my $start = $_[0];
for my $i ( 1 .. $#_ ) {
my $expected = $start + $i;
return $expected unless $_[$i] == $expected;
}
return;
}
output
3
100
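The binary search mentioned above could look roughly like the sketch below, written in awk to match the other answers. It assumes the input is sorted and contains no duplicates, so a[i] - a[1] == i - 1 holds exactly as long as the prefix up to i has no gap. The function name first_missing is reused only for illustration. Note that reading the numbers into awk is still linear, so this shows the search step rather than a real speed-up.

```shell
# Sketch: binary search for the first gap in a sorted, duplicate-free list.
first_missing() {   # usage: first_missing 1 2 4 5 6
  printf '%s\n' "$@" | awk '
    { a[NR] = $1 }
    END {
      # empty or fully contiguous input: nothing is missing
      if (NR == 0 || a[NR] - a[1] == NR - 1) exit 1
      lo = 1; hi = NR                     # prefix up to lo has no gap, hi lies past one
      while (hi - lo > 1) {
        mid = int((lo + hi) / 2)
        if (a[mid] - a[1] == mid - 1) {
          lo = mid                        # no gap up to mid
        } else {
          hi = mid                        # a gap lies at or before mid
        }
      }
      print a[lo] + 1
    }'
}

first_missing 1 2 4 5 6   # 3
```

Unlike the Perl version, which returns undef for a contiguous list, this sketch prints nothing and returns a non-zero exit status in that case.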
Here is a Perl one liner:
$ echo '1 2 4 5 6' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
$ cat tst.awk
BEGIN {
split("1 2 4 5 6",a)
for (i=1;a[i+1]==a[i]+1;i++) ;
print a[i]+1
}
$ awk -f tst.awk
3
Having fun with @Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020; # why not?
use strict;
use warnings;
sub increasing_stream {
my $start = int($_[0]);
return sub {
$start += 1 + (rand(1) > 0.9);
};
}
my $stream = increasing_stream(rand(1000));
my $first = $stream->();
say $first;
while (1) {
my $next = $stream->();
say $next;
last unless $next == ++$first;
$first = $next;
}
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }.
