I have a file which contains some data, like this:
2011-01-02 100100 1
2011-01-02 100200 0
2011-01-02 100199 3
2011-01-02 100235 4
and I have a "dictionary" in a separate file:
100100 Event1
100200 Event2
100199 Event3
100235 Event4
and I know that:
0 - warning
1 - error
2 - critical
etc...
I need a script using sed/awk/grep or something else which helps me produce data like this:
100100 Event1 Error
100200 Event2 Warning
100199 Event3 Critical
etc
I will be grateful for ideas on how to do this in the best way, or for a working example.
Update:
Sometimes I have data like this:
2011-01-02 100100 1
2011-01-02 sometext 100200 0
2011-01-02 100199 3
2011-01-02 sometext 100235 4
where sometext is any 6 characters (maybe this is helpful info).
In this case I need the whole line:
2011-01-02 sometext EventNameFromDictionary Error
or without "sometext"
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
print $2, evt[$2], lvl[$3]
}' dictionary infile
Adding a new answer for the new requirement and because of the limited formatting options inside a comment:
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
if (NF > 3) {
idx = 3; $1 = $1 OFS $2
}
else idx = 2
print $1, $idx in evt ? \
evt[$idx] : $idx, $++idx in lvl ? \
lvl[$idx] : $idx
}' dictionary infile
You won't need to escape the newlines inside the ternary operator if you're using GNU awk.
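For example, with GNU awk the print statement above can be split across lines without the trailing backslashes:
print $1, $idx in evt ?
evt[$idx] : $idx, $++idx in lvl ?
lvl[$idx] : $idx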
Some awk implementations may have problems with this part:
$++idx in lvl ? lvl[$idx] : $idx
If you're using one of those,
change it to:
$(idx + 1) in lvl ? lvl[$(idx + 1)] : $(idx + 1)
OK, comments added:
awk 'BEGIN {
lvl[0] = "warning" # map the error levels
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR { # while reading the first
# non-empty input file
evt[$1] = $2 # build the associative array evt,
# keyed by the value of the first column;
# the second column supplies the values
next # skip the rest of the program
}
{ # now reading the rest of the input
if (NF > 3) { # if the number of columns is greater than 3
idx = 3 # set idx to 3 (the key in evt)
$1 = $1 OFS $2 # and merge $1 and $2
}
else idx = 2 # else set idx to 2
# print the value of the first column, then: if the value of the second
# (or the third, depending on the value of idx) column is an existing key
# in the evt array, print its value, otherwise print the actual column
# value; do the same for the next column, but first increment idx
# because we're searching the lvl array now
print $1, \
($idx in evt ? evt[$idx] : $idx), \
($++idx in lvl ? lvl[$idx] : $idx)
}' dictionary infile
I hope Perl is OK too:
#!/usr/bin/perl
use strict;
use warnings;
open(DICT, 'dict.txt') or die;
my %dict = %{{ map { my ($id, $name) = split; $id => $name } (<DICT>) }};
close(DICT);
my %level = ( 0 => "warning",
1 => "error",
2 => "critical" );
open(EVTS, 'events.txt') or die;
while (<EVTS>)
{
my ($d, $i, $l) = split;
$i = $dict{$i} || $i; # lookup
$l = $level{$l} || $l; # lookup
print "$d\t$i\t$l\n";
}
Output:
$ ./script.pl
2011-01-02 Event1 error
2011-01-02 Event2 warning
2011-01-02 Event3 3
2011-01-02 Event4 4
I have a number of large .tsv files such as the following:
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04 15.9 S1
3 chr1_12570 0.9 45.3 S2
2 chr1_12500 0.03 13.8 S3
I would like to remove duplicate rows based on the pos column, while still keeping the values of both rows for columns 3 and 5 so that the output could look something like this:
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
My idea was to first sort the .tsv files using the shell sort command:
sort -k 2,2 *.tsv
And then write a script that would compare each line to the following line.
If the string in the pos column is the same for both lines, then it would concatenate the values of column 3 and 5 in row n+1 to the ones in row n.
However I have no idea how to do this.
I am familiar with awk/sed/grep/bash but also have some (limited) perl basics.
Thanks for your help !
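As a side note, the sort-then-compare-adjacent-lines idea described in the question could look roughly like this in awk. This is only a rough sketch: it assumes tab-separated input with the header row removed and already sorted on the pos column, and (like the desired output) it keeps only the first row's values for columns 1 and 4. The names merge-adjacent.awk and file.tsv are just placeholders.
# merge-adjacent.awk: merge rows whose pos column ($2) matches the previous row
NR == 1 { p1 = $1; p2 = $2; p3 = $3; p4 = $4; p5 = $5; next }  # remember the first row
$2 == p2 { p3 = p3 "," $3; p5 = p5 "," $5; next }              # same pos: append columns 3 and 5
{
print p1, p2, p3, p4, p5                                       # new pos: emit the held row
p1 = $1; p2 = $2; p3 = $3; p4 = $4; p5 = $5
}
END { if (NR) print p1, p2, p3, p4, p5 }                       # emit the last held row
Run it along the lines of tail -n +2 file.tsv | sort -t$'\t' -k2,2 | awk -F'\t' -v OFS='\t' -f merge-adjacent.awk, printing the header separately (e.g. with head -n 1 file.tsv).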
Here is an example of how you could approach it in Perl:
use feature qw(say);
use strict;
use warnings;
my $fn = 'file1.tsv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my @pos;
my %info;
while( my $line = <$fh> ) {
chomp $line;
my ($nbr, $pos, $pvalue, $percentage, $samplename) = split /\t/, $line;
if ( !exists $info{$pos} ) {
$info{$pos} = {
nbr => $nbr,
pvalue => [$pvalue],
percentage => $percentage,
samplename => [$samplename],
};
push @pos, $pos;
}
else {
push @{$info{$pos}{pvalue}}, $pvalue;
push @{$info{$pos}{samplename}}, $samplename;
}
}
close $fh;
print $header;
for my $pos (@pos) {
my $data = $info{$pos};
say join "\t", $data->{nbr}, $pos,
(join ",", #{$data->{pvalue}}), $data->{percentage},
(join ",", #{$data->{samplename}});
}
Output:
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
file "myscript":
#! /usr/bin/env bash
file="$1"
result="$(tr -s '\t' < "${file}" | tail -n +2 |
awk -F'\t' -v OFS='\t' '
$0 == "" {
next
}
# MAIN
{
if (col3[$2] == "") {
col1[$2] = $1
col3[$2] = $3
col4[$2] = $4
col5[$2] = $5
} else {
col3[$2] = col3[$2]","$3
col5[$2] = col5[$2]","$5
}
}
END {
for (pos in col1) {
print col1[pos], pos, col3[pos], col4[pos], col5[pos]
}
}
' | sort -k 2,2 )"
first_line="$(head -n 1 "${file}")"
echo "${first_line}"
echo "${result}"
Run it as:
bash myscript <your tsv file>
It will write the result to stdout.
Using a combination of GNU datamash and awk to get just the desired columns:
$ datamash --header-in -sf -g2 collapse 3,5 < input.tsv | \
awk 'BEGIN { FS=OFS="\t"; print "rownbr\tpos\tpvalue\tpercentage\tsamplename" }
{ print $1, $2, $6, $4, $7 }'
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
Ignore the header line in the file (--header-in), group records on the second column (-g2), sort based on that column (-s), output the full line (-f) in addition to the given operations, and for the 3rd and 5th columns, collapse all rows of the group into a single CSV entry. Then use awk to put the desired columns in the right order.
Perl is a perfect tool for this task.
Save the header of the data for future output.
Extract the pos field to be used as a hash key.
Save a line into a hash if we have not seen this pos before; otherwise
merge the value and name into the stored line.
Once all lines are processed, output the result (in this case I use 'format' and write).
use strict;
use warnings;
use feature 'say';
my(@pos,%seen,%lines);
my $header = <DATA>; # obtain header
chomp $header;
while(<DATA>) {
next if /^\s*$/; # skip empty lines
chomp;
my $key = (split '\s+')[1]; # extract 'pos' to use as $key
if( $seen{$key} ) {
my($value,$name) = (split '\s+')[2,4]; # extract value and name
$lines{$key} =~ s/(\d\s+\S+\s+\S+)/$1,$value/; # merge value
$lines{$key} =~ s/$/,$name/; # merge name
} else {
push @pos, $key; # preserve order
$lines{$key} = $_; # store lines in a hash
$seen{$key} = 1;
}
}
say $header; # output header
my @data;
for (@pos) { # use stored hash 'indexes'
@data = split '\s+',$lines{$_}; # split into fields
write; # output
}
# format STDOUT_HEADER =
# rownbr pos pvalue percentage samplename
# .
format STDOUT =
@<<<<< @<<<<<<<<< @<<<<<<<< @<<<<< @<<<<<<<<<<<<
$data[0],$data[1],$data[2],$data[3],$data[4]
.
__DATA__
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04 15.9 S1
3 chr1_12570 0.9 45.3 S2
2 chr1_12500 0.03 13.8 S3
Output
rownbr pos pvalue percentage samplename
1 chr1_12000 0.05 5.6 S1
1 chr1_12500 0.04,0.03 15.9 S1,S3
3 chr1_12570 0.9 45.3 S2
I am trying to split a Bash array into multiple columns in order to display as a table in a Markdown file.
I have searched around for a quick one-liner to do this using Bash, AWK and other languages. I know about the column command, but I can't save the output to a variable or file (stdout). I know you can loop the array, extracting values into separate chunks, but there must be a quicker, more efficient way.
keywords.md
awk
accessibility
bash
behat
c++
cache
d3.js
dates
engineering
elasticsearch
...
columns.sh
local data="$(sort "keywords.md")" # read contents of file
local data=($data) # split contents into an array
local table="||||||\n" # create markdown table header
table="${table}|---|---|---|---|---|"
local numColumns=5
# split data into five columns and append to $table variable
I am trying to get this result.
||||||
|---|---|---|---|---|
|awk|bash|c++|d3.js|engineering
|accessibility|behat|cache|dates|elasticsearch
result from column command
Here's the general approach:
$ cat tst.awk
BEGIN {
numCols = (numCols ? numCols : 5)
OFS = "|"
}
{
colNr = (NR - 1) % numCols + 1
if ( colNr == 1 ) {
numRows++
}
vals[numRows,colNr] = $0
}
END {
hdr2 = OFS
for (colNr=1; colNr<=numCols; colNr++) {
hdr2 = hdr2 "---" OFS
}
hdr1 = hdr2
gsub(/-/,"",hdr1)
print hdr1 ORS hdr2
for (rowNr=1; rowNr<=numRows; rowNr++) {
printf "|"
for (colNr=1; colNr<=numCols; colNr++) {
val = vals[rowNr,colNr]
printf "%s%s", val, (colNr<numCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
||||||
|---|---|---|---|---|
|awk|accessibility|bash|behat|c++
|cache|d3.js|dates|engineering|elasticsearch
but it obviously doesn't output the columns in the order you asked for in your question as I don't understand how you arrive at that order.
Here's a perl version that prints out the values going down by column like in your sample desired output:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my $ncolumns = 5;
# Read the list of values.
my @data;
while (<>) {
chomp;
push @data, $_;
}
# Partition the data up into rows, added down by column
my @columns;
my $nrows = @data / $ncolumns;
#@data = sort { $a cmp $b } @data;
while (@data) {
my @c = splice @data, 0, $nrows;
for my $n (0 .. $#c) {
push @{$columns[$n]}, $c[$n];
}
}
# And print them out
say '|' x $ncolumns;
say '|', join('|', ('---') x $ncolumns), '|';
for my $r (0 .. $nrows - 1) {
my @row;
for my $c (0 .. $ncolumns - 1) {
my $item = $columns[$r]->[$c];
push @row, $item if defined $item;
}
push @row, ('')x$ncolumns;
say '|', join('|', @row[0 .. $ncolumns - 1]);
}
Usage:
$ ./table.pl keywords.md
|||||
|---|---|---|---|---|
|awk|bash|c++|d3.js|engineering
|accessibility|behat|cache|dates|elasticsearch
I'm trying to do a case/if-else statement on a CSV file (e.g., myfile.csv) that analyzes a column, then creates a new column in a new csv (e.g., myfile_new.csv).
The source data (myfile.csv) looks like this:
unique_id,variable1,variable2
1,,C
2,1,
3,,A
4,,B
5,1,
I'm trying to do two transformations:
For the second field, if the input file has any data in the field, have it be 1, otherwise 0.
The third field is flattened into three fields. If the input file has an A in the third field, the third output field has 1, and 0 otherwise; the same for B and C and the fourth/fifth field in the output file.
I want the result (myfile_new.csv) to look like this:
unique_id,variable1,variable2_A,variable2_B,variable2_C
1,0,0,0,1
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,1,0,0,0
I'm trying to do the equivalent of this in SQL:
select unique_id,
case when len(variable1)>0 then 1 else 0 end as variable1,
case when variable2 = 'A' then 1 else 0 end as variable2_A,
case when variable2 = 'B' then 1 else 0 end as variable2_B,
case when variable2 = 'C' then 1 else 0 end as variable2_C, ...
I'm open to whatever, but CSV files will be 500GB - 1TB in size so it needs to work with that size file.
Here is an awk solution that would do it:
awk 'BEGIN {
FS = ","
OFS = ","
}
NR == 1 {
$3 = "variable2_A"
$4 = "variable2_B"
$5 = "variable2_C"
print
next
}
{
$2 = ($2 == "") ? 0 : 1
$3 = ($3 == "A" ? 1 : 0) "," ($3 == "B" ? 1 : 0) "," ($3 == "C" ? 1 : 0)
print
}' myfile.csv > myfile_new.csv
In the BEGIN block, we set input and output file separator to a comma.
The NR == 1 block creates the header for the output file and skips the third block.
The third block checks if the second field is empty and stores 0 or 1 in it; the $3 statement concatenates the result of using the ternary operator ?: three times, comma separated.
The output is
unique_id,variable1,variable2_A,variable2_B,variable2_C
1,0,0,0,1
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,1,0,0,0
Quick and dirty solution using a while loop.
#!/bin/bash
#Variables:
line=""
result=""
linearray[0]=0
while read line; do
unset linearray #Clean the variables from the previous loop
unset result
IFS=',' read -r -a linearray <<< "$line" #Splits the line into an array, using the comma as the field separator
result="${linearray[0]}""," #column 1, at index 0, is the same in both files.
if [ -z "${linearray[1]}" ]; then #If column 2, at index 1, is empty, then...
result="$result""0""," #Pad empty strings with zero
else #Otherwise...
result="$result""${linearray[1]}""," #Copy the non-zero column 2 from the old line
fi
#The following lines read index 2, for column 3, and append the appropriate text. Only one can ever be true.
if [ "${linearray[2]}" == "A" ]; then result="$result""1,0,0"; fi
if [ "${linearray[2]}" == "B" ]; then result="$result""0,1,0"; fi
if [ "${linearray[2]}" == "C" ]; then result="$result""0,0,1"; fi
if [ "${linearray[2]}" == "" ]; then result="$result""0,0,0"; fi
echo $result >> myfile_new.csv #append the resulting line to the new file
done <myfile.csv
I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1
In this code I am storing each individual record in a separate file, which is not good. Also, the program is very slow: in 12 hours only 12,000 records out of 0.5 million have been processed.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
colNr = NR
rowNr = 0
name[colNr] = $1
lgth[colNr] = length($2)
delete name2nr
for (i=1;i<=(length($2)-2);i++) {
trimer = substr($2,i,3)
if ( !(trimer in name2nr) ) {
name2nr[trimer] = ++rowNr
nr2name[colNr,rowNr] = trimer
}
cnt[colNr,name2nr[trimer]]++
}
numCols = colNr
numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
}
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
}
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
}
}
}
If instead you want output like in @rogerovo's perl answer, that'd be much simpler than the above, more efficient, and would use far less memory:
$ cat tst2.awk
{
delete cnt
for (i=1;i<=(length($2)-2);i++) {
cnt[substr($2,i,3)]++
}
printf "%s;%s", $1, length($2)
for (trimer in cnt) {
printf ";%s=%s", trimer, cnt[trimer]
}
print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
This Perl script processes about 550,000 "trimers"/sec (random valid test sequences 0-8000 chars long; 100k records (~400MB) produce a 2GB output CSV).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
$c++; chomp;
# is it a valid line? has the format and a sequence to process
if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
print join ";",($1,length($2));
my %trimdb;
my $seq=$2;
#split the sequence into chars
my @a=split //,$seq;
my @trimmer;
# while there are unprocessed chars in the sequence...
while (scalar @a) {
# fill up the buffer with a char from the top of the sequence
push @trimmer, shift @a;
# if the buffer is full (has 3 chars), increase the trimer frequency
if (scalar @trimmer == 3 ) {
$trimdb{(join "",@trimmer)}++;
# drop the first letter from buffer, for next loop
shift #trimmer;
}
}
# we're done with the sequence - print the sorted list of trimers
foreach (sort keys %trimdb) {
#print in a csv (;) line
print ";$_=$trimdb{$_}";
}
print"\n";
}
else {
#the input line was not valid.
print STDERR "input error: $_\n";
}
# just a progress counter
printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have Perl installed (most Linux systems do; check the path /usr/bin/perl or replace it with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv
I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves having to iterate over both arrays in nested loops.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
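To illustrate both points with a tiny, self-contained example (not part of the original answer):
awk 'BEGIN {
a["WDR78"] = 1
if ("WDR78" in a) print "index WDR78 exists"  # true: in tests indices, not values
if (1 in a) print "index 1 exists"            # false: 1 only appears as a value
n = split("WDR78,WDR78,WDR78,", t, ",")
print n                                       # prints 4: the trailing comma adds an empty field
}'
With those changes applied, the corrected script is: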
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", #F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
Line 4 in the awk script is used to preserve the original order of the values in the second field while filtering out the duplicates.
Sorry, I know you asked about awk... but Perl makes this much simpler:
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts 2. column
declare -A check # used to remember items in part2
while read line ; do
part=( $line ) # split line using whitespaces
IFS=',' # separator is comma
part2=( ${part[1]} ) # split 2. column using comma
if [ ${#part2[@]} -gt 1 ] ; then # more than 1 field in 2. column?
check=() # empty check array
new2='' # empty new 2. column
for item in ${part2[@]} ; do
(( check[$item]++ )) # remember items in 2. column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"