I have a list similar to this...
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
I want to create a table where some of the data is taken from the nearest preceding entry. This prevents me from simply replacing "\n", "Name:", etc. to build my table.
This is what I want to end up with...
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
I hope that makes sense. The last 2 columns are taken from the nearest previous 1ID and 2ID.
There could be any number of entries after the "ID" values.
Assumptions:
data is always formatted as presented and there is always a complete 3-tuple of name/age/species
first field of each line is spelled/capitalized exactly as in the example (the solution is based on an exact match)
Sample data file:
$ cat species.dat
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
One awk solution:
awk -F":" '
$1 == "1ID" { id1=$2 ; next }
$1 == "2ID" { id2=$2 ; next }
$1 == "Name" { name=$2 ; next }
$1 == "Age" { age=$2 ; next }
$1 == "Species" { print name,age,$2,id1,id2 }
' species.dat
NOTE: The next clauses are optional, since each line matches on a specific value in field 1 ($1).
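For illustration, the same program without the next statements produces identical output, because no input line can match more than one of the five patterns:

awk -F":" '
$1 == "1ID"     { id1=$2 }
$1 == "2ID"     { id2=$2 }
$1 == "Name"    { name=$2 }
$1 == "Age"     { age=$2 }
$1 == "Species" { print name,age,$2,id1,id2 }
' species.dat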
Running the above generates:
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
Please see if the following code fits your requirements:
use strict;
use warnings;
use feature 'say';
my($id1,$id2,$name,$age,$species);
my $ready = 0;
$~ = 'STDOUT_HEADER';
write;
$~ = 'STDOUT';
while (<DATA>) {
    $id1     = $1 if /^1ID:\s*(\d+)/;
    $id2     = $1 if /^2ID:\s*(\d+)/;
    $name    = $1 if /^Name:\s*(\w+)/;
    $age     = $1 if /^Age:\s*(\d+)/;
    $species = $1 if /^Species:\s*(\w+)/;
    $ready   = 1  if /^Species:/;       # trigger flag for output
    if( $ready ) {
        $ready = 0;
        write;
    }
}
format STDOUT_HEADER =
Name Age Species Id1 Id2
---------------------------------
.
format STDOUT =
@<<<<<<< @>> @<<<<<< @>> @>>>>>>
$name, $age, $species, $id1, $id2
.
__DATA__
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
Output
Name Age Species Id1 Id2
---------------------------------
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
Would you try the following:
awk -F: '{a[$1]=$2} /^Species:/ {print a["Name"],a["Age"],a["Species"],a["1ID"],a["2ID"]}' file.txt
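The same one-liner spread over multiple lines with comments, for readability:

awk -F: '
{ a[$1] = $2 }    # remember the latest value seen for every label
/^Species:/ {     # a Species line completes one record
    print a["Name"], a["Age"], a["Species"], a["1ID"], a["2ID"]
}
' file.txt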
Here is an example in Perl:
use feature qw(say);
use strict;
use warnings;
my $fn = 'file.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my ($id1, $id2);
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^1ID:(\d+)/ ) {
        $id1 = $1;
    }
    elsif ( $line =~ /^2ID:(\d+)/ ) {
        $id2 = $1;
    }
    else {
        my ( $name, $age, $species ) = get_block( $fh, $line );
        say "$name $age $species $id1 $id2";
    }
}
close $fh;

sub get_value {
    my ( $line, $key ) = @_;
    my ( $key2, $value ) = $line =~ /^(\S+):(.*)/;
    if ( $key2 ne $key ) {
        die "Bad format";
    }
    return $value;
}

sub get_block {
    my ( $fh, $line ) = @_;
    my $name = get_value( $line, 'Name' );
    $line = <$fh>;
    my $age = get_value( $line, 'Age' );
    $line = <$fh>;
    my $species = get_value( $line, 'Species' );
    return ( $name, $age, $species );
}
Output:
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
This might work for you (GNU sed):
sed -En '/^1ID./{N;h};/^Name/{N;N;G;s/\S+://g;s/\n/ /gp}' file
Stow the IDs in the hold space. Gather up the record in the pattern space, append the IDs, remove the labels, and replace the newlines with spaces.
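For readability, the same program can be spread over several lines with comments (GNU sed accepts comment lines in scripts):

sed -En '
# on a 1ID line, pull in the following 2ID line and stow the pair in the hold space
/^1ID./ { N; h }
# on a Name line, gather the Age and Species lines, append the stowed IDs,
# strip the labels and print the record as one space-separated line
/^Name/ { N; N; G; s/\S+://g; s/\n/ /gp }
' file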
Related
I'm trying to compare 4 text files for counts in each line:
file1.txt:
32
44
75
22
88
file2.txt
32
44
75
22
88
file3.txt
11
44
75
22
77
file4.txt
32
44
75
22
88
each line represents a title
line1 = customerID count
line2 = employeeID count
line3 = active_users
line4 = inactive_users
line5 = deleted_users
I'm trying to compare file2.txt, file3.txt and file4.txt with file1.txt; file1.txt will always have the correct counts.
Example: since file2.txt matches file1.txt exactly line by line in the example above, I'm trying to output "file2.txt is good". But since lines 1 and 5 of file3.txt do not match file1.txt, I'm trying to output "customerID for file3.txt does not match by 21 records" (i.e. 32 - 11 = 21) and "deleted_users in file3.txt does not match by 11 records" (88 - 77 = 11).
If shell is easier, that is fine too.
One way is to process the files line by line in parallel:
use warnings;
use strict;
use feature 'say';

my @files = @ARGV;
#my @files = map { $_ . '.txt' } qw(f1 f2 f3 f4);  # my test files' names

# Open all files, filehandles in @fhs
my @fhs = map { open my $fh, '<', $_ or die "Can't open $_: $!"; $fh } @files;

# For reporting, enumerate file names
my %files = map { $_ => $files[$_] } 0..$#files;

# Process (compare) the same line from all files
my $line_cnt;

LINE: while ( my @line = map { my $line = <$_>; $line } @fhs )
{
    defined || last LINE for @line;
    ++$line_cnt;

    s/(?:^\s+|\s+$)//g for @line;

    for my $i (1..$#line) {
        if ($line[0] != $line[$i]) {
            say "File $files[$i] differs at line $line_cnt";
        }
    }
}
This compares whole lines with == (after leading and trailing spaces are stripped), since it is a given that each line carries a single number to compare.
It prints, with my test files named f1.txt, f2.txt, ...
File f3.txt differs at line 1
File f3.txt differs at line 5
Store the line names in an array and the correct values in another array. Then loop over the files, and for each of them read its lines and compare them to the stored correct values. You can use the special variable $., which contains the current line number of the last accessed file handle, as an index into the arrays. Lines are 1-based and arrays are 0-based, so we need to subtract 1 to get the correct index.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @line_names = ('customerID count',
                  'employeeID count',
                  'active_users',
                  'inactive_users',
                  'deleted_users');

my @correct;
open my $in, '<', shift or die $!;
while (<$in>) {
    chomp;
    push @correct, $_;
}

while (my $file = shift) {
    open my $in, '<', $file or die $!;
    while (<$in>) {
        chomp;
        if ($_ != $correct[$. - 1]) {
            say "$line_names[$. - 1] in $file does not match by ",
                $correct[$. - 1] - $_, ' records';
        }
    }
}
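A sample run over the files above (compare.pl is a hypothetical name for the script):

$ perl compare.pl file1.txt file2.txt file3.txt file4.txt
customerID count in file3.txt does not match by 21 records
deleted_users in file3.txt does not match by 11 records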
Read the first file into an array, then loop over the other files, using the same function to read each into an array. Within this loop consider every line, calculate the difference, and print a message with the text from @names if the difference is not zero.
#!/usr/bin/perl
use strict;
use warnings;

my @names = qw(customerID_count employeeID_count active_users inactive_users deleted_users);
my @files = qw(file1.txt file2.txt file3.txt file4.txt);

my @first = readfile($files[0]);
for (my $i = 1; $i <= $#files; $i++) {
    print "\n$files[0] <=> $files[$i]:\n";
    my @second = readfile($files[$i]);
    for (my $j = 0; $j <= $#names; $j++) {
        my $diff = $first[$j] - $second[$j];
        $diff = -$diff if $diff < 0;
        if ($diff > 0) {
            print "$names[$j] does not match by $diff records\n";
        }
    }
}

sub readfile {
    my ($file) = @_;
    open my $handle, '<', $file or die "Can't open $file: $!";
    chomp(my @lines = <$handle>);
    close $handle;
    return grep(s/\s*//g, @lines);
}
Output is:
file1.txt <=> file2.txt:
file1.txt <=> file3.txt:
customerID_count does not match by 21 records
deleted_users does not match by 11 records
file1.txt <=> file4.txt:
A mash-up of bash, and mostly the GNU versions of standard utils like diff, sdiff, sed, et al, plus the ifne util, and even an eval:
f=("" "customerID count" "employeeID count" \
"active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do
diff -qws file1.txt $n ||
$(sdiff file1.txt $n | ifne -n exit | nl |
sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' |
xargs printf 'eval echo "%s for '"$n"' does not match by %s records.";\n') ;
done
Output:
Files file1.txt and file2.txt are identical
Files file1.txt and file3.txt differ
customerID count for file3.txt does not match by 21 records.
deleted_users for file3.txt does not match by 11 records.
Files file1.txt and file4.txt are identical
The same code, tweaked for prettier output:
f=("" "customerID count" "employeeID count" \
"active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do
diff -qws file1.txt $n ||
$(sdiff file1.txt $n | ifne -n exit | nl |
sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' |
xargs printf 'eval echo "%s does not match by %s records.";\n') ;
done |
sed '/^Files/!s/^/\t/;/^Files/{s/.* and //;s/ are .*/ is good/;s/ differ$/:/}'
Output:
file2.txt is good
file3.txt:
customerID count does not match by 21 records.
deleted_users does not match by 11 records.
file4.txt is good
Here is an example in Perl:
use feature qw(say);
use strict;
use warnings;
{
my $ref = read_file('file1.txt');
my $N = 3;
my #value_info;
for my $i (1..$N) {
my $fn = 'file'.($i+1).'.txt';
my $values = read_file( $fn );
push #value_info, [ $fn, $values];
}
my #labels = qw(customerID employeeID active_users inactive_users deleted_users);
for my $info (#value_info) {
my ( $fn, $values ) = #$info;
my $all_ok = 1;
my $j = 0;
for my $value (#$values) {
if ( $value != $ref->[$j] ) {
printf "%s: %s does not match by %d records\n",
$fn, $labels[$j], abs( $value - $ref->[$j] );
$all_ok = 0;
}
$j++;
}
say "$fn: is good" if $all_ok;
}
}
sub read_file {
my ( $fn ) = #_;
my #values;
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
while( my $line = <$fh>) {
if ( $line =~ /(\d+)/) {
push #values, $1;
}
}
close $fh;
return \#values;
}
Output:
file2.txt: is good
file3.txt: customerID does not match by 21 records
file3.txt: deleted_users does not match by 11 records
file4.txt: is good
I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in ${array[@]}; do
    seen[$i]=1
done
for((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
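Saved as missing.sh (a hypothetical name) and run, it prints:

$ bash missing.sh
3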
You can iterate from the minimal to the maximal number and take the first non-existing element:
use List::Util qw( first );

my @arr = sort { $a <=> $b } qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
@seen{@arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;   # $first is now 3
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;
my @data = qw/ 1 2 4 5 6 /;
say first_missing(@data);

@data = ( 4 .. 99, 101 .. 122 );
say first_missing(@data);
sub first_missing {
    my $start = $_[0];
    for my $i ( 1 .. $#_ ) {
        my $expected = $start + $i;
        return $expected unless $_[$i] == $expected;
    }
    return;
}
output
3
100
Here is a Perl one liner:
$ echo '1 2 4 5 6' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
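For example, feeding the same numbers in shuffled order:

$ echo '4 6 1 5 2' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
3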
$ cat tst.awk
BEGIN {
split("1 2 4 5 6",a)
for (i=1;a[i+1]==a[i]+1;i++) ;
print a[i]+1
}
$ awk -f tst.awk
3
Having fun with @Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020;   # why not?
use strict;
use warnings;

sub increasing_stream {
    my $start = int($_[0]);
    return sub {
        $start += 1 + (rand(1) > 0.9);
    };
}

my $stream = increasing_stream(rand(1000));

my $first = $stream->();
say $first;
while (1) {
    my $next = $stream->();
    say $next;
    last unless $next == ++$first;
    $first = $next;
}
say "Skipped: $first";
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
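A quick check, assuming numbers.txt holds the values in arbitrary order:

$ printf '%s\n' 4 6 1 5 2 > numbers.txt
$ sort -n numbers.txt | grep -n . |
  grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
3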
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }, as in the sketch below.
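For instance, with the auto-start change applied (a sketch; the here-document input now starts at 5):

sort -nu <<END-OF-LIST |
5
7
6
9
END-OF-LIST
awk 'NR == 1 { M = $1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'

This prints 8, the first number missing above the lowest value.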
Consider the following string
abcd
I can return 2-character permutations (Cartesian product) like this:
$ echo {a,b,c,d}{a,b,c,d}
aa ab ac ad ba bb bc bd ca cb cc cd da db dc dd
However I would like to remove redundant entries such as
ba ca cb da db dc
and invalid entries
aa bb cc dd
so I am left with
ab ac ad bc bd cd
Here's a pure bash one:
#!/bin/bash
pool=( {a..d} )
for ((i=0; i<${#pool[@]}-1; ++i)); do
    for ((j=i+1; j<${#pool[@]}; ++j)); do
        printf '%s\n' "${pool[i]}${pool[j]}"
    done
done
and another one:
#!/bin/bash
pool=( {a..d} )
while ((${#pool[@]}>1)); do
    h=${pool[0]}
    pool=("${pool[@]:1}")
    printf '%s\n' "${pool[@]/#/$h}"
done
They can be written as functions (or scripts):
get_perms_ordered() {
    local i j
    for ((i=1; i<"$#"; ++i)); do
        for ((j=i+1; j<="$#"; ++j)); do
            printf '%s\n' "${!i}${!j}"
        done
    done
}
or
get_perms_ordered() {
    local h
    while (("$#">1)); do
        h=$1; shift
        printf '%s\n' "${@/#/$h}"
    done
}
}
Use as:
$ get_perms_ordered {a..d}
ab
ac
ad
bc
bd
cd
This last one can easily be transformed into a recursive function to obtain ordered permutations of a given length (without replacement, to use the silly ball-urn probability vocabulary), e.g.,
get_withdraws_without_replacement() {
    # $1=number of balls to withdraw
    # $2,... are the ball "colors"
    # return is in array gwwr_ret
    local n=$1 h r=()
    shift
    ((n>0)) || return
    ((n==1)) && { gwwr_ret=( "$@" ); return; }
    while (("$#">=n)); do
        h=$1; shift
        get_withdraws_without_replacement "$((n-1))" "$@"
        r+=( "${gwwr_ret[@]/#/$h}" )
    done
    gwwr_ret=( "${r[@]}" )
}
Then:
$ get_withdraws_without_replacement 3 {a..d}
$ echo "${gwwr_ret[@]}"
abc abd acd bcd
You can use awk to filter away the entries you don't want:
echo {a,b,c,d}{a,b,c,d} | awk -v FS="" -v RS=" " '$1 == $2 { next } ; $1 > $2 { SEEN[ $2$1 ] = 1 ; next } ; { SEEN[ $1$2 ] =1 } ; END { for ( I in SEEN ) { print I } }'
In details:
echo {a,b,c,d}{a,b,c,d} \
| awk -v FS="" -v RS=" " '
# Ignore identical values
$1 == $2 { next }
# Reorder and record inverted entries
$1 > $2 { SEEN[ $2$1 ] = 1 ; next }
# Record everything else
{ SEEN[ $1$2 ] = 1 }
# Print the final list
END { for ( I in SEEN ) { print I } }
'
FS="" tells awk that each character is a separate field.
RS=" " uses spaces to separate records.
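To see the effect of these settings (note that per-character splitting with FS="" is a common awk extension, not POSIX):

$ printf 'ab cd' | awk -v FS="" -v RS=" " '{ print $1 "-" $2 }'
a-b
c-d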
I'm sure someone's going to do this in one line of awk, but here is something in bash:
#!/bin/bash
seen=":"
result=""
for i in "$@"
do
    for j in "$@"
    do
        if [ "$i" != "$j" ]
        then
            if [[ $seen != *":$j$i:"* ]]
            then
                result="$result $i$j"
                seen="$seen$i$j:"
            fi
        fi
    done
done
echo $result
Output:
$ ./prod.sh a b c d
ab ac ad bc bd cd
$ ./prod.sh I have no life
Ihave Ino Ilife haveno havelife nolife
Here is the basic idea, based on your restrictions, expressed as a runnable bash loop over an array of your characters:
array=(a b c d)
for ((i = 0; i < ${#array[@]}; i++)); do
    for ((j = i + 1; j < ${#array[@]}; j++)); do
        echo "${array[i]}${array[j]}"    # concatenation
    done
done
I realized that I am not looking for permutations, but the power set. Here
is an implementation in Awk:
{
    # treat c as a bit mask: each value in 0 .. 2^NF-1 selects one subset of the fields
    for (c = 0; c < 2 ^ NF; c++) {
        for (d = 0; d < NF; d++)
            # print field d+1 when bit d of c is set
            if (int(c / 2 ^ d) % 2) {
                printf "%s", $(d + 1)
            }
        print ""
    }
}
Input:
a b c d
Output:
a
b
ab
c
ac
bc
abc
d
ad
bd
abd
cd
acd
bcd
abcd
I was wondering if you could give me a hand finding a solution (not necessarily ready-made code) to my problem.
I would like to create a "matching matrix" in perl or Bash.
Basically, my first file is an extracted list of IDs, not unique (file1):
ID1
ID4
ID20
ID1
To make my life easier, my second file is just a single long line with multiple IDs (file2):
ID1 ID2 ID3 ID4 .... IDn
I would like to achieve this output:
ID1 ID2 ID3 ID4 ID5 ID6 ID7 .... ID20 IDn
ID1 X
ID4 X
ID20 X
ID1 X
The tricky part for me is to add the "X" when a match is found.
Any help, hint is more than appreciated.
Here's my answer for this issue:
use warnings;
use strict;

open(FILE1, "file1") || die "cannot find file1";
open(FILE2, "file2") || die "cannot find file2";

# read file2 to create hash
my $file2 = <FILE2>;
$file2 =~ s/\n//;
my @ids = split(/\s+/, $file2);
my %hash;
my $i = 1;
map { $hash{$_} = $i++; } @ids;

# print the first line
printf "%6s", '';
for my $id (@ids) {
    printf "%6s", $id;
}
print "\n";

# print the other lines
while (<FILE1>) {
    chomp;
    my $id = $_;
    printf "%6s", $id;
    my $index = $hash{$id};
    if ($index) {
        $index = 6 * $index;
        printf "%${index}s\n", 'x';
    } else {
        print "\n";
    }
}
I have a text manipulation problem that I need to solve in awk, sed & shell.
My text looks like this:
>Sample_1
100 101
aaattattacaaaaataattacaaattattacaaaaagaattattacaaaaagaattacaaaa
-1.60 .(((((((.....)))))))........................................... []
>Sample_2
1 35
aattattacaaaaagaattattacaaaaagaatta
0.00 ................................... _
>Sample_3
1 123
gctcacacctgtaatcccagcactttgggaggctgagg
-27.80 ((((.....))))......((((((.(((...))))))).)[][][[][]]
-26.40 (((((.((...(((((..((((((....))......... [[][]][]
-25.80 ((((.....)))).....((((((............... [][][][[][]]
123 145
ctgaggcaggcagatcacgaggtcacgagatcaa
-26.20 (((.....)))))) [][][[][]]
-25.90 ....((((..((....)) [][[][]]
-25.70 ..(((..((....))..(()) [[][]][[][]]
145 256
gtaatcccagcactttgggaggctgaggcaggcaga
0.00 ........................................... _
256 342
-25.00 ..((....((((.....((((((...)))....))... [[][]]
-24.00 ..((.((((.((((())... [[][][]]
-23.70 .((((((...(((((..((.. [[][]][]
I want to:
Extract the sample name (>Sample_1);
Extract the numeric value that comes right after the sample name (either 0 or a negative value);
From each group of negative values (e.g. -27.80; -26.40; -25.80), extract the number that comes first (the most negative value).
Perfect output would look like this:
>Sample_1
-1.60
>Sample_2
0.00
>Sample_3
-27.80
-26.20
0.00
-25.00
I tried to do this in awk by printing $1 and grepping for '>', 0, and negative values, but I wasn't able to split the column into groups and extract the most negative value.
awk '{print $1}' file | egrep -i '>|0.00|-'
You tagged your question with sed and awk, but if you're O.K. with Perl instead, you could write:
#!/usr/bin/perl -w
use warnings;
use strict;

my $min = undef;
while (<>)
{
    if (m/^(-?\d+\.\d+)/)
    {
        if (!defined($min) || $1 < $min)
        { $min = $1; }
    }
    else
    {
        if (defined $min)
        {
            print "$min\n";
            $min = undef;
        }
        if (m/^>/)
        { print; }
    }
}
if (defined $min)
{ print "$min\n"; }
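Running it over the sample input reproduces the desired output (minval.pl is a hypothetical name for the script):

$ perl minval.pl input.txt
>Sample_1
-1.60
>Sample_2
0.00
>Sample_3
-27.80
-26.20
0.00
-25.00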
An awk alternative:

awk '
    /^[0-]/ && new_group { print $1 }    # first score line of a group: print its (most negative) value
    { new_group = (/^[ \t]/) }           # a position line (assumed to start with whitespace) opens a new group
    /^>/                                 # sample-name lines print as-is
' file