Reduced permutations - bash

Consider the following string:
abcd
I can return 2-character permutations (the Cartesian product) like this:
$ echo {a,b,c,d}{a,b,c,d}
aa ab ac ad ba bb bc bd ca cb cc cd da db dc dd
However, I would like to remove redundant entries such as
ba ca cb da db dc
and invalid entries
aa bb cc dd
so that I am left with:
ab ac ad bc bd cd

Here's a pure bash one:
#!/bin/bash
pool=( {a..d} )
for((i=0;i<${#pool[@]}-1;++i)); do
    for((j=i+1;j<${#pool[@]};++j)); do
        printf '%s\n' "${pool[i]}${pool[j]}"
    done
done
and another one:
#!/bin/bash
pool=( {a..d} )
while ((${#pool[@]}>1)); do
    h=${pool[0]}
    pool=("${pool[@]:1}")
    printf '%s\n' "${pool[@]/#/$h}"
done
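The key step is the expansion "${pool[@]/#/$h}": the # anchors the (empty) pattern at the start of each element, so $h is prepended to every remaining element. A quick illustration (plain bash, nothing assumed beyond the snippet above):
$ pool=( b c d )
$ h=a
$ printf '%s\n' "${pool[@]/#/$h}"
ab
ac
ad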
They can be written as functions (or scripts):
get_perms_ordered() {
    local i j
    for((i=1;i<"$#";++i)); do
        for((j=i+1;j<="$#";++j)); do
            printf '%s\n' "${!i}${!j}"
        done
    done
}
or
get_perms_ordered() {
    local h
    while (("$#">1)); do
        h=$1; shift
        printf '%s\n' "${@/#/$h}"
    done
}
Use as:
$ get_perms_ordered {a..d}
ab
ac
ad
bc
bd
cd
This last one can easily be transformed into a recursive function to obtain ordered permutations of a given length (without replacement—I'm using the silly ball-urn probability vocabulary), e.g.,
get_withdraws_without_replacement() {
    # $1 = number of balls to withdraw
    # $2,... are the ball "colors"
    # return is in array gwwr_ret
    local n=$1 h r=()
    shift
    ((n>0)) || return
    ((n==1)) && { gwwr_ret=( "$@" ); return; }
    while (("$#">=n)); do
        h=$1; shift
        get_withdraws_without_replacement "$((n-1))" "$@"
        r+=( "${gwwr_ret[@]/#/$h}" )
    done
    gwwr_ret=( "${r[@]}" )
}
Then:
$ get_withdraws_without_replacement 3 {a..d}
$ echo "${gwwr_ret[@]}"
abc abd acd bcd

You can use awk to filter away the entries you don't want:
echo {a,b,c,d}{a,b,c,d} | awk -v FS="" -v RS=" " '$1 == $2 { next } ; $1 > $2 { SEEN[ $2$1 ] = 1 ; next } ; { SEEN[ $1$2 ] =1 } ; END { for ( I in SEEN ) { print I } }'
In detail:
echo {a,b,c,d}{a,b,c,d} \
| awk -v FS="" -v RS=" " '
# Ignore identical values
$1 == $2 { next }
# Reorder and record inverted entries
$1 > $2 { SEEN[ $2$1 ] = 1 ; next }
# Record everything else
{ SEEN[ $1$2 ] = 1 }
# Print the final list
END { for ( I in SEEN ) { print I } }
'
FS="" tells awk that each character is a separate field.
RS=" " uses spaces to separate records.

I'm sure someone's going to do this in one line of awk, but here is something in bash:
#!/bin/bash
seen=":"
result=""
for i in "$#"
do
for j in "$#"
do
if [ "$i" != "$j" ]
then
if [[ $seen != *":$j$i:"* ]]
then
result="$result $i$j"
seen="$seen$i$j:"
fi
fi
done
done
echo $result
Output:
$ ./prod.sh a b c d
ab ac ad bc bd cd
$ ./prod.sh I have no life
Ihave Ino Ilife haveno havelife nolife

Here is pseudocode to achieve that, based on your restrictions, and
using an array of your characters (a runnable bash rendering follows the snippet):
for (i = 0; i < array.length; i++)
{
    for (j = i + 1; j < array.length; j++)
    {
        print array[i] + array[j]; // concatenation
    }
}
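A direct translation of that pseudocode into bash (a sketch, using the example characters from the question):
#!/bin/bash
array=( a b c d )
for ((i = 0; i < ${#array[@]}; i++)); do
    for ((j = i + 1; j < ${#array[@]}; j++)); do
        # concatenate the two characters, one pair per line
        printf '%s\n' "${array[i]}${array[j]}"
    done
done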

I realized that I am not looking for permutations, but the power set. Here
is an implementation in awk: each counter value c from 0 to 2^NF - 1 acts as a
bitmask, and field d+1 is printed whenever bit d of c is set:
{
    for (c = 0; c < 2 ^ NF; c++) {
        for (d = 0; d < NF; d++)
            if (int(c / 2 ^ d) % 2) {
                printf "%s", $(d + 1)
            }
        print ""
    }
}
Input:
a b c d
Output:
a
b
ab
c
ac
bc
abc
d
ad
bd
abd
cd
acd
bcd
abcd

How to quickly sum values in a directed graph in shell?

I have a directed graph with about 2000 nodes stored in a file. Each line represents an edge from the node in the first column to the node in the second column; it is even easy to visualize the data, for example in dot(1). Columns are separated by tabs, rows by newlines, and nodes are named with any of the characters a-zA-Z0-9_. The graph can have multiple roots and may have cycles; I don't care about cycles, they are redundant, but they can happen in the input. Below is an example of the graph, using tr to substitute tabs for spaces and a here-document, to make the input file easy to reproduce:
tr ' ' '\t' <<EOF >connections.txt
str1 str2
str2 str3
str3 str4
str100 str2
str100 str101
str101 str102
EOF
I also have a list of some nodes in the graph, called heads. These will be the starting nodes:
tr ' ' '\t' <<EOF >heads.txt
str1
str100
EOF
I also have a list of a "cost" associated with each node. Example with some random data:
tr ' ' '\t' <<EOF >cost.txt
str1 1
str2 5
str3 10
str4 548
str100 57
str101 39
str102 23
EOF
I want to sum the "cost" of each node while traversing the graph from the nodes stored in heads.txt, and print the sum with some traversal information for each leaf.
I want to:
for each node in heads.txt
sum the cost from cost.txt of the node into some variable
find that node in connections.txt
find what does this node connect to
and repeat the algorithm for each of the nodes the node connects to
when the node is connected with nothing, print the sum of costs
Ideally the script would look like:
$ script.sh heads.txt connections.txt cost.txt
str1->str2->str3->str4 1+5+10+548 564
str100->str2->str3->str4 57+5+10+548 620
str100->str101->str102 57+39+23 119
I have even written this, and it works:
#!/bin/bash
set -euo pipefail
headsf=$1
connectionsf=$2
costf=$3
get_cost() {
    grep "^$1"$'\t' "$costf" | cut -f2 || echo 0
}
get_conn() {
    grep "^$1"$'\t' "$connectionsf" | cut -f2
}
check_conns() {
    grep -q "^$1"$'\t' "$connectionsf"
}
f_output() {
    printf "%s\t%s\n" "$1" "$2"
}
f() {
    local func cost
    func="$1"
    cost=$(get_cost "$func")
    if ! check_conns "$func"; then
        f_output "${2:+$2->}$func" "${3:+$3+}$cost"
        return
    fi
    get_conn "$func" |
    while IFS=$'\t' read -r calls; do
        if [ "$func" = "$calls" ]; then
            echo "$func is recursive" >&2
            continue
        fi
        if <<<"$2" grep -q -w "$calls"; then
            printf '%s calls recursive %s\n' "$2" "$calls" >&2
            continue
        fi
        f "$calls" "${2:+$2->}$func" "${3:+$3+}$cost"
    done
}
while IFS= read -r head; do
    f "$head" "" ""
done < "$headsf" |
while IFS=$'\t' read -r func calc; do
    tmp=$(<<<"$calc" bc)
    printf "%s\t%s\t%s\n" "$func" "$calc" "$tmp"
done |
column -t -s $'\t'
However, it is impossibly slow on bigger inputs. Even with the sample files here (only 6 lines), the script takes about 200 ms on my machine. How can I speed it up? Can the inputs be sorted or joined somehow to speed it up (grep doesn't care whether the input is sorted)? Can this be done faster in awk or other unix tools?
I would like to limit myself to the bash shell and standard *unix tools: coreutils, moreutils, datamash and such. I tried doing it in awk, but failed; I have no idea how to find something recursively in the input in awk. This feels to me like something "doable" really fast in a shell script.
Since no one has posted an answer yet, here is an awk solution as a starting point:
#!/usr/bin/awk -f
BEGIN {
    FS=OFS="\t"
}
FILENAME=="connections.txt" {
    edges[$1,++count[$1]]=$2
    next
}
FILENAME=="cost.txt" {
    costs[$1]=$2
    next
}
FILENAME=="heads.txt" {
    f($1)
}
function f(node,
           path,cost,sum,prev,sep1,sep2,i) {
    if (node in prev)
        # cycle detected
        return
    path = path sep1 node
    cost = cost sep2 costs[node]
    sum += costs[node]
    if (!count[node]) {
        print path, cost, sum
    }
    else {
        prev[node]
        for (i=1; i<=count[node]; ++i)
            f(edges[node,i], path, cost, sum, prev, "->", "+")
        delete prev[node]
    }
}
Make it read connections.txt and cost.txt before heads.txt.
Its output (padded):
$ awk -f tst.awk connections.txt cost.txt heads.txt
str1->str2->str3->str4 1+5+10+548 564
str100->str2->str3->str4 57+5+10+548 620
str100->str101->str102 57+39+23 119
You say you want only standard tools, but you also mention using dot on your data, so I'm assuming you have the other graphviz utilities available... in particular, gvpr, which is like awk for graphs:
#!/usr/bin/env bash
graph=$(mktemp)
join -t$'\t' -j1 -o 0,1.2,2.2 -a2 \
<(sort -k1,1 connections.txt) \
<(sort -k1,1 cost.txt) |
awk -F$'\t' 'BEGIN { print "digraph g {" }
{ printf "%s [cost = %d ]\n", $1, $3
if ($2 != "") printf "%s -> %s\n", $1, $2 }
END { print "}" }' > "$graph"
while read root; do
gvpr -a "$root" '
    BEGIN {
        int depth;
        int seen[string];
        string path[int];
        int costs[int];
    }
    BEG_G {
        $tvtype = TV_prepostfwd;
        $tvroot = node($, ARGV[0]);
    }
    N {
        if ($.name in seen) {
            depth--;
        } else {
            seen[$.name] = 1;
            path[depth] = $.name;
            costs[depth] = $.cost;
            depth++;
            if (!fstout($) && path[0] == ARGV[0]) {
                int i, c = 0;
                for (i = 0; i < depth - 1; i++) {
                    printf("%s->", path[i]);
                }
                printf("%s\t", $.name);
                for (i = 0; i < depth - 1; i++) {
                    c += costs[i];
                    printf("%d+", costs[i]);
                }
                c += $.cost;
                printf("%d\t%d\n", $.cost, c);
            }
        }
    }' "$graph"
done < heads.txt
rm -f "$graph"
Running this after creating your data files:
$ ./paths.sh
str1->str2->str3->str4 1+5+10+548 564
str100->str2->str3->str4 57+5+10+548 620
str100->str101->str102 57+39+23 119
Or, since it's so ubiquitous it might as well be standard, a sqlite-based solution. This one doesn't even require bash/zsh/ksh93, unlike the above.
$ sqlite3 -batch -noheader -list <<EOF
.separator "\t"
CREATE TABLE heads(node TEXT);
.import heads.txt heads
CREATE TABLE costs(node TEXT PRIMARY KEY, cost INTEGER) WITHOUT ROWID;
.import cost.txt costs
CREATE TABLE connections(from_node TEXT, to_node TEXT
, PRIMARY KEY(from_node, to_node)) WITHOUT ROWID;
.import connections.txt connections
WITH RECURSIVE paths(tail, path, costs, cost) AS
(SELECT h.node, h.node, c.cost, c.cost
FROM heads AS h
JOIN costs AS c ON h.node = c.node
UNION ALL
SELECT conn.to_node, p.path || '->' || conn.to_node
, p.costs || '+' || c.cost, p.cost + c.cost
FROM paths AS p
JOIN connections AS conn ON conn.from_node = p.tail
JOIN costs AS c ON c.node = conn.to_node
)
SELECT path, costs, cost FROM paths AS p
WHERE tail NOT IN (SELECT from_node FROM connections)
ORDER BY path;
EOF
str1->str2->str3->str4 1+5+10+548 564
str100->str101->str102 57+39+23 119
str100->str2->str3->str4 57+5+10+548 620

Aggregating a csv file in a bash script

I have a csv file with multiple lines. Each line has the same number of columns. What I need to do is group those lines by a few specified columns and aggregate the data from the other columns. Example of input file:
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
For the above example I need to group lines by the first two columns. From the 3rd column I need to choose the min value, for the 4th column the max value, and the 5th column should hold the sum. So, for such an input file I need this output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
I need to process it in bash (I can use awk or sed as well).
With bash and sort:
#!/bin/bash
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de # date start and date end
declare -A -i sum # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" "$y" "${month2num[$m]}" "$d"; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
    # if the associative array is still empty for this entry,
    # fill it with the current strings/value
    if [[ -z ${p[$p1,$p2]} ]]; then
        p[$p1,$p2]="$p1,$p2"
        ds[$p1,$p2]="$d1"
        de[$p1,$p2]="$d2"
        sum[$p1,$p2]="$s"
        continue
    fi
    # compare the dates as YYYYMMDD strings, update min/max and the sum
    if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
        [[ $(date2num "$d1") < $(date2num "${ds[$p1,$p2]}") ]] && ds[$p1,$p2]="$d1"
        [[ $(date2num "$d2") > $(date2num "${de[$p1,$p2]}") ]] && de[$p1,$p2]="$d2"
        sum[$p1,$p2]=sum[$p1,$p2]+s
    fi
done < file
# print the content of all associative arrays, keyed like array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
See: help declare, help read and of course man bash
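A quick interactive check of the date2num helper (assuming month2num and the function above are loaded in the current shell):
$ date2num 5-Jun-2011
20110605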
With awk + sort (FS=',|-' splits each line into nine fields: the two key columns, three fields for each date, and the count in $9):
awk -F',|-' '
BEGIN{
    A["Jan"]="01"
    A["Feb"]="02"
    A["Mar"]="03"
    A["Apr"]="04"
    A["May"]="05"
    A["Jun"]="06"
    A["Jul"]="07"
    A["Aug"]="08"
    A["Sep"]="09"
    A["Oct"]="10"
    A["Nov"]="11"
    A["Dec"]="12"
}
{
    B[$1","$2]=B[$1","$2]+$9
    z=sprintf("%.2d",$3)
    y=sprintf("%s",$5 A[$4] z)
    if(!start[$1$2])
    {
        end[$1$2]=0
        start[$1$2]=99999999
    }
    if (y < start[$1$2])
    {
        start[$1$2]=y
        C[$1","$2]=$3"-"$4"-"$5
    }
    x=sprintf("%.2d",$6)
    w=sprintf("%s",$8 A[$7] x)
    if(w > end[$1$2])
    {
        end[$1$2]=w
        D[$1","$2]=$6"-"$7"-"$8
    }
}
END{
    for (i in B) print i "," C[i] "," D[i] "," B[i]
}
' infile | sort
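With the example input, this should yield the expected rows from the question (my assumption, matching the desired output above):
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2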
Extended GNU awk solution:
awk -F, 'function parse_date(d_str){
    split(d_str, d, "-");
    t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
    return t
}
BEGIN{
    m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
    m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
    k=$1 SUBSEP $2;
    if (k in a){
        if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
        if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
    } else {
        a[k]["min"]=$3; a[k]["max"]=$4
    }
    a[k]["sum"]+= $5
}
END{
    for (i in a) {
        split(i, j, SUBSEP);
        print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
    }
}' OFS=',' file
The output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

I have a protein sequence file; I want to count trimers in it using sed or grep

I have a protein sequence file in the following format:
uniprotID\space\sequence
The sequence is a string of any length, but with only 20 allowed letters, i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record:
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format:
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot ID along with sequence
do
    echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
    declare -a SSA=(`cat TEST_FILE`)
    SQL=$(echo ${#SSA[@]})
    for (( X=0; X <= "$SQL"; X++ ))
    do
        Y=$(expr $X + 1)
        Z=$(expr $X + 2)
        echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
    done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
    rm TEST_FILE # removing temporary sequence file
    sort TEMPTRIMER | uniq -c > $ID.$SQL
done < $1
In this code I am storing each individual record in a separate file, which is not good. Also, the program is very slow: in 12 hours only 12,000 records were processed out of 0.5 million.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    # one input record becomes one output column
    colNr = NR
    rowNr = 0
    name[colNr] = $1
    lgth[colNr] = length($2)
    delete name2nr
    # slide a 3-character window across the sequence
    for (i=1; i<=(length($2)-2); i++) {
        trimer = substr($2,i,3)
        if ( !(trimer in name2nr) ) {
            # first occurrence of this trimer in this sequence:
            # assign it the next row slot for this column
            name2nr[trimer] = ++rowNr
            nr2name[colNr,rowNr] = trimer
        }
        cnt[colNr,name2nr[trimer]]++
    }
    numCols = colNr
    numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
    # print the ID row, the length row, then one row per trimer slot
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
    }
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
    }
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
        }
    }
}
If instead you want output like in @rogerovo's perl answer, that'd be much simpler than the above, more efficient, and would use far less memory:
$ cat tst2.awk
{
    delete cnt
    for (i=1; i<=(length($2)-2); i++) {
        cnt[substr($2,i,3)]++
    }
    printf "%s;%s", $1, length($2)
    for (trimer in cnt) {
        printf ";%s=%s", trimer, cnt[trimer]
    }
    print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
This perl script processes circa 550,000 "trimers"/sec (random valid test sequences 0-8000 chars long; 100k records (~400MB) produce a 2GB output csv).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
    $c++; chomp;
    # is it a valid line? has the format and a sequence to process
    if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
        print join ";",($1,length($2));
        my %trimdb;
        my $seq=$2;
        # split the sequence into chars
        my @a=split //,$seq;
        my @trimmer;
        # while there are unprocessed chars in the sequence...
        while (scalar @a) {
            # fill up the buffer with a char from the top of the sequence
            push @trimmer, shift @a;
            # if the buffer is full (has 3 chars), increase the trimer frequency
            if (scalar @trimmer == 3 ) {
                $trimdb{(join "",@trimmer)}++;
                # drop the first letter from the buffer, for the next loop
                shift @trimmer;
            }
        }
        # we're done with the sequence - print the sorted list of trimers
        foreach (sort keys %trimdb) {
            # print in a csv (;) line
            print ";$_=$trimdb{$_}";
        }
        print "\n";
    }
    else {
        # the input line was not valid
        print STDERR "input error: $_\n";
    }
    # just a progress counter
    printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have perl installed (most linuxes do; check the path /usr/bin/perl or replace it with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

Find smallest missing integer in an array

I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in "${array[@]}"; do
    seen[$i]=1
done
for((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
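Saved as, say, missing.sh (a hypothetical name), a quick check:
$ bash missing.sh
3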
You can iterate from the minimal to the maximal number and take the first non-existing element:
use List::Util qw( first );
my @arr = sort {$a <=> $b} qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
@seen{@arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;
my @data = qw/ 1 2 4 5 6 /;
say first_missing(@data);
@data = ( 4 .. 99, 101 .. 122 );
say first_missing(@data);
sub first_missing {
    my $start = $_[0];
    for my $i ( 1 .. $#_ ) {
        my $expected = $start + $i;
        return $expected unless $_[$i] == $expected;
    }
    return;
}
Output:
3
100
Here is a Perl one liner:
$ echo '1 2 4 5 6' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
$ cat tst.awk
BEGIN {
split("1 2 4 5 6",a)
for (i=1;a[i+1]==a[i]+1;i++) ;
print a[i]+1
}
$ awk -f tst.awk
3
Having fun with @Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020; # why not?
use strict;
use warnings;
sub increasing_stream {
    my $start = int($_[0]);
    return sub {
        $start += 1 + (rand(1) > 0.9);
    };
}
my $stream = increasing_stream(rand(1000));
my $first = $stream->();
say $first;
while (1) {
    my $next = $stream->();
    say $next;
    last unless $next == ++$first;
    $first = $next;
}
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
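To see the intermediate stream with the example array (numbers.txt holding the five example values, one per line):
$ printf '%s\n' 1 2 4 5 6 > numbers.txt
$ sort -n numbers.txt | grep -n .
1:1
2:2
3:4
4:5
5:6
The first line where the sequence number and the value disagree is 3:4, so the full pipeline prints 3.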
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }, as demonstrated below.
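For example, applying the auto-start variation to pre-sorted input that does not begin at 1:
$ printf '%s\n' 4 5 7 | awk 'NR == 1 { M = $1 } M > $1 { next }
M == $1 { M++; next } M < $1 { exit } END { print M }'
6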

How to use awk or anything else to count the number of shared x values of 2 different y values in a csv file consisting of columns a and b?

Let me be specific. We have a csv file consisting of 2 columns, x and y, like this:
x,y
1h,a2
2e,a2
4f,a2
7v,a2
1h,b6
4f,b6
4f,c9
7v,c9
...
And we want to count how many shared x values two y values have, which means we want to get this:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
And b6,a2,2 should not show up. Does anyone know how to do this with awk? Or anything else?
Thanks in advance!
Try this executable awk script:
#!/usr/bin/awk -f
BEGIN {FS=OFS=","}
NR==1 { print "y1" OFS "y2" OFS "share" }
NR>1 {last=a[$1]; a[$1]=(last!=""?last",":"")$2}
END {
    for(i in a) {
        cnt = split(a[i], arr, FS)
        if( cnt>1 ) {
            # compare each y value only with the ones after it, so every
            # pair is emitted once and the outer loop variable is not reused
            for(k=1;k<cnt;k++) {
                for(m=k+1;m<=cnt;m++) {
                    if( arr[k] != arr[m] ) {
                        key=arr[k] OFS arr[m]
                        if(out[key]=="") {order[++ocnt]=key}
                        out[key]++
                    }
                }
            }
        }
    }
    for(i=1;i<=ocnt;i++) {
        print order[i] OFS out[order[i]]
    }
}
When put into a file called awko and made executable, running it like awko data yields:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
I'm assuming the file is sorted by y values in the second column as in the question (after the header). If it works for you, I'll add some explanations tomorrow.
Additionally, for anyone who wants more test data, here's a silly executable awk script for generating data similar to what's in the question. It makes about 10K lines when run as gen.awk:
#!/usr/bin/awk -f
function randInt(max) {
    return( int(rand()*max)+1 )
}
BEGIN {
    a[1]="a"; a[2]="b"; a[3]="c"; a[4]="d"; a[5]="e"; a[6]="f"
    a[7]="g"; a[8]="h"; a[9]="i"; a[10]="j"; a[11]="k"; a[12]="l"
    a[13]="m"; a[14]="n"; a[15]="o"; a[16]="p"; a[17]="q"; a[18]="r"
    a[19]="s"; a[20]="t"; a[21]="u"; a[22]="v"; a[23]="w"; a[24]="x"
    a[25]="y"; a[26]="z"
    print "x,y"
    for(i=1;i<=26;i++) {
        amultiplier = randInt(1000) # vary this to change the output size
        r = randInt(amultiplier)
        anum = 1
        for(j=1;j<=amultiplier;j++) {
            if( j == r ) { anum++; r = randInt(amultiplier) }
            print a[randInt(26)] randInt(5) "," a[i] anum
        }
    }
}
I think if you can get the input into a form like this, it's easy:
1h a2 b6
2e a2
4f a2 b6 c9
7v a2 c9
In fact, you don't even need the x value. You can convert this:
a2 b6
a2
a2 b6 c9
a2 c9
Into this:
a2,b6
a2,b6
a2,c9
a2,c9
That output can be sorted and piped to uniq -c to get approximately the output you want (a sketch of that variant appears after step three), so we only need to think about how to get from your input to the first and second states. Once we have those, the final step is easy.
Step one:
sort /tmp/values.csv \
| awk '
    BEGIN { FS="," }
    {
        if (x != $1) {
            if (x) print values
            x = $1
            values = $2
        } else {
            values = values " " $2
        }
    }
    END { print values }
'
Step two:
| awk '
    {
        for (i = 1; i < NF; ++i) {
            for (j = i+1; j <= NF; ++j) {
                print $i "," $j
            }
        }
    }
'
Step three:
| sort | awk '
    BEGIN {
        print "y1,y2,share"
    }
    {
        if (combination == $0) {
            count = count + 1
        } else {
            if (count) print combination "," count
            count = 1
            combination = $0
        }
    }
    END { print combination "," count }
'
'
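As mentioned above, the counting in step three can instead be done with uniq -c plus a small reformatting pass (a hedged alternative to the awk above):
| sort | uniq -c | awk 'BEGIN { print "y1,y2,share" } { print $2 "," $1 }'
uniq -c prefixes each distinct pair with its count, and the final awk swaps that into the y1,y2,share layout.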
This awk script does the job:
BEGIN { FS=OFS="," }
NR==1 { print "y1","y2","share" }
NR>1 { ++seen[$1,$2]; ++x[$1]; ++y[$2] }
END {
    for (y1 in y) {
        for (y2 in y) {
            if (y1 != y2 && !(y2 SUBSEP y1 in c)) {
                for (i in x) {
                    if (seen[i,y1] && seen[i,y2]) {
                        ++c[y1,y2]
                    }
                }
            }
        }
    }
    for (key in c) {
        split(key, a, SUBSEP)
        print a[1],a[2],c[key]
    }
}
Loop through the input, recording both the original elements and the combinations. Once the file has been processed, look at each pair of y values. The if statement does two things: it prevents equal y values from being compared and it saves looping through the x values twice for every pair. Shared values are stored in c.
Once the shared values have been aggregated, the final output is printed.
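Running it against the question's input (a usage sketch with the script saved as, say, shared.awk; note that awk's for (key in c) iterates in an unspecified order, so the pairs and rows may appear in a different order than shown):
$ awk -f shared.awk file
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1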
This bash script (leaning on sed) does the trick:
#!/bin/bash
echo y1,y2,share
x=$(wc -l < file)
b=$(echo "$x - 2" | bc)
index=0
for i in $(eval echo "{2..$b}")
do
    var_x_1=$(sed -n ''"$i"p'' file | sed 's/,.*//')
    var_y_1=$(sed -n ''"$i"p'' file | sed 's/.*,//')
    a=$(echo "$i + 1" | bc)
    for j in $(eval echo "{$a..$x}")
    do
        var_x_2=$(sed -n ''"$j"p'' file | sed 's/,.*//')
        var_y_2=$(sed -n ''"$j"p'' file | sed 's/.*,//')
        if [ "$var_x_1" = "$var_x_2" ] ; then
            array[$index]=$var_y_1,$var_y_2
            index=$(echo "$index + 1" | bc)
        fi
    done
done
counter=1
for (( k=1; k<$index; k++ ))
do
    if [ ${array[k]} = ${array[k-1]} ] ; then
        counter=$(echo "$counter + 1" | bc)
    else
        echo ${array[k-1]},$counter
        counter=1
    fi
    if [ "$k" = $(echo "$index-1"|bc) ] && [ $counter = 1 ]; then
        echo ${array[k]},$counter
    fi
done
