Matching matrix in Perl or Bash

I was wondering if you could give me a hand finding a solution (not necessarily giving me code) to my problem.
I would like to create a "matching matrix" in Perl or Bash.
Basically, my first file is an extracted list of IDs, not unique (file1):
ID1
ID4
ID20
ID1
To make my life easier, my second file is just one long line with multiple IDs (file2):
ID1 ID2 ID3 ID4 .... IDn
I would like to achieve this output:
     ID1 ID2 ID3 ID4 ID5 ID6 ID7 .... ID20 IDn
ID1   X
ID4               X
ID20                                   X
ID1   X
The tricky part for me is adding the "X" when a match is found.
Any help or hint is more than appreciated.

Here's my answer for this issue:
use warnings;
use strict;

open(FILE1, "file1") || die "cannot open file1: $!";
open(FILE2, "file2") || die "cannot open file2: $!";

# read file2 to create a hash mapping each ID to its column number
my $file2 = <FILE2>;
$file2 =~ s/\n//;
my @ids = split(/\s+/, $file2);
my %hash;
my $i = 1;
map { $hash{$_} = $i++; } @ids;

# print the first line
printf "%6s", '';
for my $id (@ids) {
    printf "%6s", $id;
}
print "\n";

# print the other lines
while (<FILE1>) {
    chomp;
    my $id = $_;
    printf "%6s", $id;
    my $index = $hash{$id};
    if ($index) {
        # place the X under the matching header column (each column is 6 chars wide)
        $index = 6 * $index;
        printf "%${index}s\n", 'X';
    } else {
        print "\n";
    }
}
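The script above hard-codes the two input file names and writes the matrix to standard output, so assuming it is saved as matrix.pl (that name and the output file below are just placeholders), a run is simply:
perl matrix.pl > matrix_table.txt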

Related

UNIX Pattern Sequence

The following scenario is a pattern search using the UNIX shell: the text between two patterns needs to be matched, and then a new column with a sequence number needs to be added.
Input Data
1|AB|1|2
2|BC|1|2
ID CLOSED
3|AB|1|2
4|BC|1|2
ID CLOSED
Query
As per the data above, we need to add a SEQ column after UN; it should use seq 1 for the first block, seq 2 for the second block, and so on until the end.
Expected Output
1|AB|1|2|1
2|BC|1|2|1
3|AB|1|2|2
4|BC|1|2|2
I tried the following as a first step, but it isn't giving the correct output:
sed -n '/^ID/,/^ID CLOSED/{p;/^pattern2/q}'
Any particular reason you want to use sed for this? It seems like a better fit for awk (note that -v{,O}FS='|' below relies on shell brace expansion to set both FS and OFS to '|'):
awk -v{,O}FS='|' '
BEGIN { seq = 1 }
/CLOSED/ { seq++ }
!/^ID/ { $5=seq; print }'
Output:
1|AB|1|2|1
2|BC|1|2|1
3|AB|1|2|2
4|BC|1|2|2
Maybe something like this:
(
    seq=1
    echo "ID NAME ID1 ID2 ID3 UN SEQ"
    while read id name id1 id2 id3 un; do
        [ "$id $name" = "ID NAME" ] && continue
        [ "$id $name" = "ID CLOSED" ] && { let "seq+=1"; continue; }
        echo "$id $name $id1 $id2 $id3 $un $seq"
    done < /path/to/the/datafile
    echo "ID CLOSED"
) | column -t -s' '
Doing this with just a sed instruction is not impossible, I think, but much harder ;)
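If Perl is also an option here, the same increment-on-CLOSED idea fits in a one-liner as well (a sketch, not from the original thread; data.txt stands in for the input file):
perl -lne 'BEGIN { $seq = 1 } if (/CLOSED/) { $seq++; next } print "$_|$seq" unless /^ID/' data.txt
Like the awk answer, lines containing CLOSED only bump the counter and are dropped, and every remaining data line gets the current counter appended as a new |-separated field.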

Building table from list using nearest value?

I have a list similar to this...
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
I want to create a table where some of the data is taken from the nearest result. This prevents me from simply replacing "\n", "Name:", etc to make my table.
This is what I want to end up with...
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
I hope that makes sense. The last 2 columns are taken from the nearest previous 1ID and 2ID.
There could be any number of entries after the "ID" values.
Assumptions:
data is always formatted as presented and there is always a complete 3-tuple of name/age/species
first field of each line is spelled/capitalized exactly as in the example (the solution is based on an exact match)
Sample data file:
$ cat species.dat
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
One awk solution:
awk -F":" '
$1 == "1ID" { id1=$2 ; next }
$1 == "2ID" { id2=$2 ; next }
$1 == "Name" { name=$2 ; next }
$1 == "Age" { age=$2 ; next }
$1 == "Species" { print name,age,$2,id1,id2 }
' species.dat
NOTE: The next clauses are optional, since each line matches on a specific value in field 1 ($1).
Running the above generates:
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
Please see if the following code fits your requirements:
use strict;
use warnings;
use feature 'say';
my($id1,$id2,$name,$age,$species);
my $ready = 0;
$~ = 'STDOUT_HEADER';
write;
$~ = 'STDOUT';
while (<DATA>) {
    $id1     = $1 if /^1ID:\s*(\d+)/;
    $id2     = $1 if /^2ID:\s*(\d+)/;
    $name    = $1 if /^Name:\s*(\w+)/;
    $age     = $1 if /^Age:\s*(\d+)/;
    $species = $1 if /^Species:\s*(\w+)/;
    $ready   = 1  if /^Species:/;    # trigger flag for output
    if ($ready) {
        $ready = 0;
        write;
    }
}
format STDOUT_HEADER =
Name Age Species Id1 Id2
---------------------------------
.
format STDOUT =
@<<<<<<< @>> @<<<<<< @>> @>>>>>>
$name, $age, $species, $id1, $id2
.
__DATA__
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
Output
Name Age Species Id1 Id2
---------------------------------
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
Would you try the following:
awk -F: '{a[$1]=$2} /^Species:/ {print a["Name"],a["Age"],a["Species"],a["1ID"],a["2ID"]}' file.txt
Here is an example in Perl:
use feature qw(say);
use strict;
use warnings;
my $fn = 'file.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my ($id1, $id2);
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^1ID:(\d+)/ ) {
        $id1 = $1;
    }
    elsif ( $line =~ /^2ID:(\d+)/ ) {
        $id2 = $1;
    }
    else {
        my ( $name, $age, $species ) = get_block( $fh, $line );
        say "$name $age $species $id1 $id2";
    }
}
close $fh;

sub get_value {
    my ( $line, $key ) = @_;
    my ( $key2, $value ) = $line =~ /^(\S+):(.*)/;
    if ( $key2 ne $key ) {
        die "Bad format";
    }
    return $value;
}

sub get_block {
    my ( $fh, $line ) = @_;
    my $name = get_value( $line, 'Name' );
    $line = <$fh>;
    my $age = get_value( $line, 'Age' );
    $line = <$fh>;
    my $species = get_value( $line, 'Species' );
    return ( $name, $age, $species );
}
Output:
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
This might work for you (GNU sed):
sed -En '/^1ID./{N;h};/^Name/{N;N;G;s/\S+://g;s/\n/ /gp}' file
Stow the IDs in the hold space. Gather up the record in the pattern space, append the IDs, remove the labels, and replace the newlines with spaces.

Bash while read input issue

A possible way to use while read is:
while read server application date; do ..
So now I could print only the applications; I understand that. So here comes my question:
with my example I know exactly how many "arrays" there are, but how would I do it if I don't know how many "arrays" exist per line?
Example file:
Server : ID1 ; ID2 ; ID3
Server : ID1
Server : ID1 ; ID2
Server : ID1 ; ID2 ; ID3 ; ID4
It doesn't have to be with read, but how could I read them so that I could, for example, do:
echo "$Server $ID3"
P.S. Sorry for the bad English.
What I am doing so far is this:
#!/bin/bash
file=$1
csv=$2
echo Server : Applikation : AO : BV : SO > endlist.txt
while read server aid; do
    grep $aid $csv | while IFS=";" read id aid2 name alia status typ beschreibung gesch gesch2 finanzierung internet service servicemodell AO BV SO it betrieb hersteller; do
        if [[ $aid == $aid2 ]]
        then
            echo $server : $name : $AO : $BV : $SO >> endlist.txt
        fi
    done
done < $file
The problem is that the first while read currently reads only SERVER and AID, but I want to edit this file so that more than one AID is possible.
It doesn't have to be with read, but how could I read them so that I could, for example, do:
echo "$Server $ID3"
First split the input on :, then read the array on ;. Use bash arrays and read -a to save input to an array.
# split the input on `:` and spaces
while IFS=' :' read -r server temp_ids; do
    # split the ids on `;` and spaces into an array
    IFS=' ;' read -r -a id <<<"$temp_ids"
    # check if there are at least 3 elements
    if ((${#id[@]} >= 3)); then
        # array numbering starts from 0
        echo "$server ${id[2]}"
    else
        echo There is no 3rd element...
    fi
done <<EOF
done <<EOF
Server : ID1 ; ID2 ; ID3
Server : ID1
Server : ID1 ; ID2
Server : ID1 ; ID2 ; ID3 ; ID4
EOF
will output:
Server ID3
There is no 3rd element...
There is no 3rd element...
Server ID3
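For comparison, the same extraction can be done in a Perl one-liner that splits on both separators at once (just a sketch; servers.txt is a made-up file name):
perl -lne '@f = split /\s*[:;]\s*/; if (@f > 3) { print "$f[0] $f[3]" } else { print "There is no 3rd element..." }' servers.txt
The single split leaves the server name in $f[0] and the IDs in the rest of @f, so $f[3] is the third ID whenever it exists.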

Compare 4 files line by line to see if they match or don't match

I'm trying to compare 4 text files for counts in each line:
file1.txt:
32
44
75
22
88
file2.txt
32
44
75
22
88
file3.txt
11
44
75
22
77
file4.txt
32
44
75
22
88
each line represents a title
line1 = customerID count
line2 = employeeID count
line3 = active_users
line4 = inactive_users
line5 = deleted_users
I'm trying to compare file2.txt, file3.txt and file4.txt with file1.txt; file1.txt will always have the correct counts.
Example: Since file2.txt matches file1.txt exactly line by line in the example above, I'm trying to output "file2.txt is good". But since line1 and line5 of file3.txt do not match file1.txt, I'm trying to output "customerID for file3.txt does not match by 21 records" (i.e. 32 - 11 = 21) and "deleted_users in file3.txt does not match by 11 records" (88 - 77 = 11).
If shell is easier, then that is fine too.
One way to process the files line by line in parallel:
use warnings;
use strict;
use feature 'say';

my @files = @ARGV;
#my @files = map { $_ . '.txt' } qw(f1 f2 f3 f4);  # my test files' names

# Open all files, filehandles in @fhs
my @fhs = map { open my $fh, '<', $_ or die "Can't open $_: $!"; $fh } @files;

# For reporting, enumerate file names
my %files = map { $_ => $files[$_] } 0..$#files;

# Process (compare) the same line from all files
my $line_cnt;

LINE: while ( my @line = map { my $line = <$_>; $line } @fhs )
{
    defined || last LINE for @line;
    ++$line_cnt;
    s/(?:^\s+|\s+$)//g for @line;

    for my $i (1..$#line) {
        if ($line[0] != $line[$i]) {
            say "File $files[$i] differs at line $line_cnt";
        }
    }
}
This compares the whole line with == (after leading and trailing spaces are stripped), since it is a given that each line carries a single number to be compared.
With my test files named f1.txt, f2.txt, ..., it prints:
File f3.txt differs at line 1
File f3.txt differs at line 5
Store the line names in an array, and store the correct values in another array. Then loop over the files, and for each of them, read their lines and compare them to the stored correct values. You can use the special variable $., which contains the line number of the last accessed file handle, to serve as an index into the arrays. Lines are 1-based, arrays are 0-based, so we need to subtract 1 to get the correct index.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my @line_names = ('customerID count',
                  'employeeID count',
                  'active_users',
                  'inactive_users',
                  'deleted_users');
my @correct;
open my $in, '<', shift or die $!;
while (<$in>) {
    chomp;
    push @correct, $_;
}

while (my $file = shift) {
    open my $in, '<', $file or die $!;
    while (<$in>) {
        chomp;
        if ($_ != $correct[$. - 1]) {
            say "$line_names[$. - 1] in $file does not match by ",
                $correct[$. - 1] - $_, ' records';
        }
    }
}
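Since the script shifts the reference file off @ARGV first and then loops over the remaining arguments, an invocation would look like this (the script name is just a placeholder):
perl compare_counts.pl file1.txt file2.txt file3.txt file4.txt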
Read the first file into an array, then loop over the other files, using the same function to read each into an array. Within this loop, consider every line, calculate the difference, and print a message with the text from @names if the difference is not zero.
#!/usr/bin/perl
use strict;
use warnings;

my @names = qw(customerID_count employeeID_count active_users inactive_users deleted_users);
my @files = qw(file1.txt file2.txt file3.txt file4.txt);

my @first = readfile($files[0]);
for (my $i = 1; $i <= $#files; $i++) {
    print "\n$files[0] <=> $files[$i]:\n";
    my @second = readfile($files[$i]);
    for (my $j = 0; $j <= $#names; $j++) {
        my $diff = $first[$j] - $second[$j];
        $diff = -$diff if $diff < 0;
        if ($diff > 0) {
            print "$names[$j] does not match by $diff records\n";
        }
    }
}

sub readfile {
    my ($file) = @_;
    open my $handle, '<', $file or die "Cannot open $file: $!";
    chomp(my @lines = <$handle>);
    close $handle;
    return grep(s/\s*//g, @lines);
}
Output is:
file1.txt <=> file2.txt:
file1.txt <=> file3.txt:
customerID_count does not match by 21 records
deleted_users does not match by 11 records
file1.txt <=> file4.txt:
A mash-up of bash, and mostly the GNU versions of standard utils like diff, sdiff, sed, et al, plus the ifne util, and even an eval:
f=("" "customerID count" "employeeID count" \
"active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do
diff -qws file1.txt $n ||
$(sdiff file1.txt $n | ifne -n exit | nl |
sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' |
xargs printf 'eval echo "%s for '"$n"' does not match by %s records.";\n') ;
done
Output:
Files file1.txt and file2.txt are identical
Files file1.txt and file3.txt differ
customerID count for file3.txt does not match by 21 records.
deleted_users for file3.txt does not match by 11 records.
Files file1.txt and file4.txt are identical
The same code, tweaked for prettier output:
f=("" "customerID count" "employeeID count" \
"active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do
diff -qws file1.txt $n ||
$(sdiff file1.txt $n | ifne -n exit | nl |
sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' |
xargs printf 'eval echo "%s does not match by %s records.";\n') ;
done |
sed '/^Files/!s/^/\t/;/^Files/{s/.* and //;s/ are .*/ is good/;s/ differ$/:/}'
Output:
file2.txt is good
file3.txt:
customerID count does not match by 21 records.
deleted_users does not match by 11 records.
file4.txt is good
Here is an example in Perl:
use feature qw(say);
use strict;
use warnings;
{
    my $ref = read_file('file1.txt');
    my $N = 3;
    my @value_info;
    for my $i (1..$N) {
        my $fn = 'file' . ($i+1) . '.txt';
        my $values = read_file( $fn );
        push @value_info, [ $fn, $values ];
    }
    my @labels = qw(customerID employeeID active_users inactive_users deleted_users);
    for my $info (@value_info) {
        my ( $fn, $values ) = @$info;
        my $all_ok = 1;
        my $j = 0;
        for my $value (@$values) {
            if ( $value != $ref->[$j] ) {
                printf "%s: %s does not match by %d records\n",
                    $fn, $labels[$j], abs( $value - $ref->[$j] );
                $all_ok = 0;
            }
            $j++;
        }
        say "$fn: is good" if $all_ok;
    }
}

sub read_file {
    my ( $fn ) = @_;
    my @values;
    open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    while ( my $line = <$fh> ) {
        if ( $line =~ /(\d+)/ ) {
            push @values, $1;
        }
    }
    close $fh;
    return \@values;
}
Output:
file2.txt: is good
file3.txt: customerID does not match by 21 records
file3.txt: deleted_users does not match by 11 records
file4.txt: is good

Use Perl to extract fields from a large text file using column numbers from another file

I have two files
A text file with a list of numbers, e.g.
1
2
3
5
7
6
each number is on a different line, and the row number corresponds to a column number in the second file.
A large text file with ~150,000 columns. I need to extract only the columns listed in the first file.
I would prefer to do this command-line style, so I came up with this:
for i in columns.txt do
perl -lane 'print $F[i]' final.txt > output.txt
I want to do something like this, but without combining Bash and a Perl script:
for ($i = 0; $i < $num_col; $i++) { # for each line in the column file 1
my $column = $columns[$i]; # extract the number from the line
perl -lane 'print $F[$column]' final.txt > output.txt; # and cut that column from file #2 into output file
}
How to do this without combining a Perl script and Bash?
I am new to scripting so explanations would be great as well as code help.
final.txt looks like this, but with dimensions 106,713 x 119,962:
6665 AA AG TG CC GG GT TA TT
6667 AT TC AT CG GA GA TC AA
6668 AC TC TT CA GT GA TC CG
6669 AG AC AA CT GG GA TC CA
6670 AA AT AG AC GG GA TC AA
ID 2 2 1 1 1 6 6 1 #this single row is the columns.txt file
ID rs3755048 rs2276637 rs1043502 rs879089 rs647812 rs2076310 c6_pos32913147 rs1051741
ID 0 0 0 0 0 0 0 0
ID 219793146 219797929 20850335 20841103 20866804 33274012 32913147 224098852
If you want to replace the one-liner with a script equivalent, that would be:
# local $/ = "\n";
# ($column is assumed to hold the zero-based index of the wanted column)
open my $fh,  "<", "final.txt"  or die $!;
open my $out, ">", "output.txt" or die $!;
while (<$fh>) {
    chomp;
    my @F = split ' ', $_;
    print $out "$F[$column]\n";
}
Something like this should do what you want, but it may be slow. If it is too slow to be useful then come back and we can talk about optimising it.
use strict;
use warnings;
use autodie;

my @columns = do {
    open my $fh, '<', 'columns.txt';
    local $/;
    <$fh> =~ /\d+/g;
};

my @indices = map { $_ - 1 } sort { $a <=> $b } @columns;

open my $fh, '<', 'final.txt';
while (<$fh>) {
    my @fields = split ' ', $_, $indices[-1] + 2;
    print join(' ', @fields[@indices]), "\n";
}
close $fh;
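If this snippet is saved as, say, extract_cols.pl (an invented name), the selected columns end up in output.txt with a plain redirect, as in the question:
perl extract_cols.pl > output.txt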
Use a single Perl one-liner with a condition on the $ARGV change, something like:
perl -lane 'if ($fh ne $ARGV) {$fh = $ARGV; $i++} push @cols, $_ if $i < 2; if ($i > 1) {foreach (@cols) {print $F[$_]}}' ONE TWO
(Here $i is bumped each time a new input file starts, so the lines of the first file are collected in @cols and then used as column indices while the second file is read.)
