Transpose from row to column - bash

The objective is to transpose from rows to columns, taking into consideration the first column, which is the date.
Input file:
72918,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009
72918,2356,2357,2358,2359,2360,2361,2362,2363,2364
72918,0,0,0,0,0,0,0,0,0
72918,0,0,0,0,0,0,1,0,0
72918,1496,1502,1752,1752,1752,1752,1751,974,972
73018,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009,111000009
73018,2349,2350,2351,2352,2353,2354,2355,2356,2357
73018,0,0,0,0,0,0,0,0,0
73018,0,0,0,0,0,0,0,0,0
73018,1524,1526,1752,1752,1752,1752,1752,256,250
Desired output:
72918,111000009,2356,0,0,1496
72918,111000009,2357,0,0,1502
72918,111000009,2358,0,0,1752
72918,111000009,2359,0,0,1752
72918,111000009,2360,0,0,1752
72918,111000009,2361,0,0,1752
72918,111000009,2362,0,1,1751
72918,111000009,2363,0,0,974
72918,111000009,2364,0,0,972
73018,111000009,2349,0,0,1524
73018,111000009,2350,0,0,1526
73018,111000009,2351,0,0,1752
73018,111000009,2352,0,0,1752
73018,111000009,2353,0,0,1752
73018,111000009,2354,0,0,1752
73018,111000009,2355,0,0,1752
73018,111000009,2356,0,0,256
73018,111000009,2357,0,0,250
Please advise, thanks in advance.

This code seems to do exactly what you need:
awk -F, '
function init_block() { ts = $1; delete a; cnt = 0; nf0 = NF }
function dump_block() {
    for (f = 2; f <= nf0; f++) {
        printf("%s", ts)
        for (r = 1; r <= cnt; r++) printf(",%s", a[r, f])
        print ""
    }
}
BEGIN    { ts = -1 }                          # no block seen yet
ts < 0   { init_block() }                     # first record starts a block
ts != $1 { dump_block(); init_block() }       # date changed: flush the block
         { cnt++; for (f = 1; f <= NF; f++) a[cnt, f] = $f }
END      { dump_block() }' <input.txt >output.txt
It collects rows until the timestamp changes, then prints the transpose of the block, keeping the same timestamp. The number of fields must be the same within each block for this code to behave correctly.
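As a quick check against the sample input: the first block buffers five rows (cnt = 5, nf0 = 10), and dumping field f = 2 of that block prints
72918,111000009,2356,0,0,1496
which is the first line of the desired output.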

Related

How to get the mean value of different column values when one column matches a value

I have a big data file with many columns. I would like to get the mean value of some of the columns if another column has a specific value.
For example, if $19 = 9.1, then get the mean of $24, $25, $27, $28, $32 and $35 and write these values in a file like
9.1 (mean$24) (mean$25) ..... (mean$32) (mean$35)
and add two more lines for two other values of the $19 column, for example 11.9 and 13.9, resulting in:
9.1 (mean$24) (mean$25) ..... (mean$32) (mean$35)
11.9 (mean$24) (mean$25) ..... (mean$32) (mean$35)
13.9 (mean$24) (mean$25) ..... (mean$32) (mean$35)
I have seen a post, "awk average part of a column if lines (specific field) match", which computes the mean of only one column when the first has some value, but I do not know how to extend that solution to my problem.
This should work, if you fill in the blanks:
$ awk 'BEGIN {n=split("9.1 11.9 13.9",a)}
{k=$19; c[k]++; m24[k]+=$24; m25[k]+=$25; ...}
END {for(i=1;i<=n;i++) print k=a[i], m24[k]/c[k], m25[k]/c[k], ...}' file
Perhaps handle the c[k]==0 condition as well, with something like this:
function mean(sum,count) {return (count==0?"NaN":sum/count)}
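Filled in for the exact columns listed in the question ($24, $25, $27, $28, $32 and $35), the whole thing could look like the sketch below. It is untested and assumes whitespace-separated input with the key values appearing literally as 9.1, 11.9 and 13.9 in $19:
awk '
function mean(sum, count) { return (count == 0 ? "NaN" : sum / count) }   # avoid division by zero
BEGIN { n = split("9.1 11.9 13.9", a) }   # the $19 values of interest
{
    k = $19; c[k]++                                  # count rows per key
    m24[k] += $24; m25[k] += $25; m27[k] += $27      # accumulate sums per key
    m28[k] += $28; m32[k] += $32; m35[k] += $35
}
END {
    for (i = 1; i <= n; i++) {
        k = a[i]
        print k, mean(m24[k], c[k]), mean(m25[k], c[k]), mean(m27[k], c[k]), mean(m28[k], c[k]), mean(m32[k], c[k]), mean(m35[k], c[k])
    }
}' file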

extracting text blocks from input file with awk or sed and save each block in a separate output file

I am trying to use awk to extract text blocks (first field/column only, but multiple lines; the number of lines varies between blocks) based on the separators # and --. These columns represent sequence IDs.
Using awk I am able to separate the blocks and print the first column, but I cannot redirect these text blocks to separate output files.
Code:
awk '/#/,/--/{print $1}' OTU_test.txt
Ideally, I would like to save each text block (excluding the separators) to a file named after some text found in the first line of each block (e.g. MEMB.nem.6; MEMB.nem. is constant, but the number changes)
Example of input file
#OTU_MEMB.nem.6
EF494252.1.2070 6750.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Nucletmycea;D_3__Fungi;D_7__Dothideomycetes;D_8__Capnodiales;D_9__uncultured fungus 1.000
FJ235519.1.1436 5957.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Nucletmycea;D_3__Fungi;D_7__Dothideomycetes;D_8__Capnodiales;D_9__uncultured fungus 1.000
New.ReferenceOTU9219 5418.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Nucletmycea;D_3__Fungi 1.000
GQ120120.1.1635 471.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Nucletmycea;D_3__Fungi;D_7__Dothideomycetes;D_8__Capnodiales;D_9__uncultured fungus 0.990
--
#OTU_MEMB.nem.163
New.CleanUp.ReferenceOTU59580 12355.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_7__Chromadorea;D_8__Monhysterida 0.700
New.ReferenceOTU11809 1312.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_7__Chromadorea;D_8__Monhysterida 0.770
--
#OTU_MEMB.nem.35
New.CleanUp.ReferenceOTU120578 12116.0 D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_7__Chromadorea;D_8__Desmoscolecida;D_9__Desmoscolex sp. DeCoSp2 0.780
Expected output files (first column only, no separators).
MEMB.nem.6.txt
EF494252.1.2070
FJ235519.1.1436
New.ReferenceOTU9219
GQ120120.1.1635
MEMB.nem.163.txt
New.CleanUp.ReferenceOTU59580
New.ReferenceOTU11809
MEMB.nem.35.txt
New.CleanUp.ReferenceOTU120578
I have searched a lot, but so far I have been unsuccessful. I would be very happy if someone could advise me.
Thanks,
Tiago
awk '
# header line: strip the #OTU_ prefix, close the previous file, set the new file name
sub(/^#OTU_/,"") {
    close(out)
    out = $0 ".txt"
    next
}
# every other line except the -- separators: first column goes to the current file
!/^--/ {
    print $1 > out
}
' file
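Run against the sample input, this creates MEMB.nem.6.txt, MEMB.nem.163.txt and MEMB.nem.35.txt, each containing only the first column of its block. A quick way to inspect the results:
head MEMB.nem.*.txt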

AWK filter data by characters

A data set like below:
07021201180934221NJL B2018 12X 15253 030C 000000.299
07021201180934231NSL B2018 12X 15253 030C 00000.014
07021201180941061NNL B2018 030C 000000.288
Questions are:
The characters "120118" in the first string mean the date in ddmmyy format; how can I filter rows according to these date characters using awk?
The characters "NJL", "NSL" or "NNL" in the first string mean the data type; what awk command filters lines according to those three characters?
The third column can be a description like "12X 15253" or empty; how can I filter out rows where this column is empty?
Thanks in advance!
This one-liner combines all the conditions:
$ awk 'substr($1,5,6)==120118 && substr($1,length($1)-2)=="NNL" && $3$4==""' file
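If it helps to see the three filters separately, here is a sketch of each condition on its own. It assumes the default whitespace field splitting shown by the sample, where a row with a description has six fields and a row without one has only four, so the empty-description case is detected by field count rather than by an empty $3:
# rows whose date part (characters 5-10 of the first field) is ddmmyy 120118
awk 'substr($1,5,6)=="120118"' file
# rows whose type (last three characters of the first field) is NJL, NSL or NNL
awk '$1 ~ /(NJL|NSL|NNL)$/' file
# rows with an empty description (a full row has six whitespace-separated fields)
awk 'NF < 6' file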

Edit fields in csv files using bash

I have a bunch of csv files that need "cleaning".
Specifically, there is a column that contains timestamp values; however, some lines have a value of '1' instead.
What I wish to do is replace those 1's with the last valid (timestamp) value, i.e. replace the value on line i with that of line i-1.
I provide a sample of the file:
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042, 1,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
So in this example, the 1 must be replaced with 20/07/2015 09:40:00. I tried it using awk but couldn't nail it.
Assuming no commas in the other fields, an awk program like this should work:
BEGIN { FS = OFS = "," }
$3!=1 { prev = $3 }
$3==1 { $3 = prev }
{ print }
Warning: this is untested code.
The first line sets the field separator to a comma, for both input and output. The second line saves the timestamp of every row that has a timestamp in the third field. The third line writes the most recently saved timestamp to every row that doesn't have a timestamp in the third field. And the fourth line writes every input line, whether modified or not, to the output.
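For reference, the same program as a one-liner (input.csv and cleaned.csv are placeholder file names):
awk 'BEGIN { FS = OFS = "," } $3 != 1 { prev = $3 } $3 == 1 { $3 = prev } { print }' input.csv > cleaned.csv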
Let me know how you get on.

Change date and data cells in .csv file progressively

I have a file that I'm trying to get ready for my boss in time for his manager's meeting tomorrow morning at 8:00 AM GMT-8. I want to retroactively change the dates in non-consecutive rows in this .csv file (truncated):
,,,,,
,,,,,sideshow
,,,
date_bob,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bob_available,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383
bob_used,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312
,,,
date_mel,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
mel_available,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537
mel_used,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159
,,,
date_sideshow-ws2,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
sideshow-ws2_available,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239
sideshow-ws2_used,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441
,,,
,,,,,simpsons
,,,
date_bart,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bart_available,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559
bart_used,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117
,,,
date_homer,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
homer_available,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799
homer_used,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877
,,,
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
lisa_available,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899
lisa_used,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777
In other words, a row that now reads:
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
would ideally read:
date_lisa,09-04-14,09-05-14,09-06-14,09-07-14,09-08-14,09-09-14,09-10-14,09-11-14,09-12-14,09-13-14,09-14-14,09-15-14,09-16-14,09-17-14
I'd like to make the daily available numbers smaller at the beginning and then progressively bigger day by day. This will mean that the used rows will have to be proportionately smaller at the beginning and then get progressively bigger in lockstep with the available rows.
Not by a large amount; don't make it look obvious, just a few GB here and there. I plan to make pivot tables and graphs out of this, so it has to vary a little. BTW, the numbers are all in MB, as I generated them using df -m.
Thanks in advance if anyone can help me.
The following awk does what you need:
awk -F, -v OFS=, '
/^date/ {             # date rows: recalculate the days backwards from the date in $2
    split($2, date, /-/)
    for (i = 2; i <= NF; i++) {
        $i = date[1] "-" sprintf("%02d", date[2] - NF + i) "-" date[3]
    }
}
/available|used/ {    # data rows: scale values so they grow toward the original last column
    for (i = 2; i <= NF; i++) {
        $i = int(($i * i) / NF)
    }
}
1' csv
Set the input and output field separators to ,.
For all lines that start with date, we split the second column to find the date parts.
We iterate from the second column to the end of the line and set each column to a newly calculated date, working backwards from the current date using the total number of fields.
All other lines remain as is and get printed along with the modified lines.
This has the caveat of not rolling over month boundaries correctly.
For the data fields we iterate from the second column to the end of the line and scale each value so the numbers grow progressively, ending at the original value in the last field.
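As a quick sanity check of both formulas against the sample rows (the label plus 14 data columns, so NF is 15):
date rows: the new day is 17 - 15 + i, so column i=2 becomes 09-04-14 and column i=15 stays 09-17-14, matching the desired date_lisa row.
data rows: the new value is int(value * i / 15), so column i=2 drops to roughly 13% of the original and column i=15 keeps the original value.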
