Bash: How to extract table-like structures from a text file

I have a log file which contains some data and important table-like parts, like the following:
//Some data
--------------------------------------------------------------------------------
----- Output Table -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
fooooooooo 0 0 3 0 0
boooooooooooooooooooooo 0 0 30 0 0
abv 0 0 16 0 0
bhbhbhbh 0 0 3 0 0
foooo 0 0 198 0 0
WARNING: Some message...
WARNING: Some message...
aaaaaaaaa 0 0 60 0 7
bbbbbbbb 0 0 48 0 7
ccccccc 0 0 45 0 7
rrrrrrr 0 0 50 0 7
abcabca 0 0 42 0 6
// Some data...
--------------------------------------------------------------------------------
----- Another Output Table -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
$$foo12 0 0 3 0 0
$$foo12_720_720_14_2 0 0 30 0 0
I want to extract all tables of that kind from the given file and save each one in a separate file.
Notes:
A table starts at a line containing the words {NAME, Attr1, ..., Attr5}.
WARNING messages may appear within a table and should be ignored.
A table ends at an empty line whose following line is not a "WARNING" line.
So I expect the following two files as output (the second file begins at the second NAME header):
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
fooooooooo 0 0 3 0 0
boooooooooooooooooooooo 0 0 30 0 0
abv 0 0 16 0 0
bhbhbhbh 0 0 3 0 0
foooo 0 0 198 0 0
aaaaaaaaa 0 0 60 0 7
bbbbbbbb 0 0 48 0 7
ccccccc 0 0 45 0 7
rrrrrrr 0 0 50 0 7
abcabca 0 0 42 0 6
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
$$foo12 0 0 3 0 0
$$foo12_720_720_14_2 0 0 30 0 0

Following your rules, I would write the awk script below.
#!/usr/bin/awk -f

# start a table at a NAME header line (the header may or may not be indented)
/^[[:space:]]*NAME/ {
    titles = $0
    print
    next
}

# don't print if we're not in a table
!titles {
    next
}

# a blank line may mean end-of-table
/^$/ {
    EOT = 1
    next
}

# a WARNING line is not end-of-table (and is not printed)
/^WARNING/ {
    EOT = 0
    next
}

# end of table means we're not in a table anymore, Toto
EOT {
    titles = 0
    EOT = 0
    next
}

# print what's in the table
{ print }
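The script above prints every table to standard output. Since the question asks for each table in a separate file, here is a minimal sketch of the same state machine that redirects each table to its own numbered file; the table_1.txt, table_2.txt naming scheme is just an assumption, pick whatever suits you:
#!/usr/bin/awk -f
# split_tables.awk -- write each detected table to its own file
/^[[:space:]]*NAME/ {                  # a NAME header starts a new table
    if (out) close(out)                # close the previous table's file
    out = "table_" (++count) ".txt"    # assumed naming scheme
    intable = 1; eot = 0
    print > out
    next
}
!intable   { next }                    # outside a table: ignore the line
/^$/       { eot = 1; next }           # blank line: possible end of table
/^WARNING/ { eot = 0; next }           # a WARNING keeps the table open (and is not copied)
eot        { intable = 0; eot = 0; next }  # anything else after a blank line ends the table
           { print > out }             # table body line
Run it as awk -f split_tables.awk logfile; each table then lands in its own table_N.txt.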

Try this -
awk -F'[[:space:]]+' 'NF>6 || ($0 ~ /-/ && $0 !~ "Output") {print $0}' f
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
fooooooooo 0 0 3 0 0
boooooooooooooooooooooo 0 0 30 0 0
abv 0 0 16 0 0
bhbhbhbh 0 0 3 0 0
foooo 0 0 198 0 0
aaaaaaaaa 0 0 60 0 7
bbbbbbbb 0 0 48 0 7
ccccccc 0 0 45 0 7
rrrrrrr 0 0 50 0 7
abcabca 0 0 42 0 6
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
$$foo12 0 0 3 0 0
$$foo12_720_720_14_2 0 0 30 0 0

Related

How to merge rows if values in one column contain consecutive numbers and all other columns match

I have a very large file (~700M rows) and I would like to reduce its size by grouping mostly-matching rows. Specifically, the file is sorted by fields 1 and 2, and I would like to group rows where field 2 contains consecutive numbers and all other fields match. If there is a gap in field 2, or if any other field does not match the previous row, I would like to start a new interval. Ideally, the output should give the interval range for each group of rows, and I would prefer a solution that works in bash with awk and/or sed. I'm open to other solutions as well, as long as they don't require re-sorting or other operations that might crash with such a long file.
The input file looks something like this.
NW_005179401.1 100 1 0 0 0 0 0 0 0 0
NW_005179401.1 101 1 0 0 0 0 0 0 0 0
NW_005179401.1 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 1 0 0 0 0 0 1 0 0
NW_005179401.1 104 1 0 0 0 0 0 1 0 0
NW_005179401.1 105 1 0 0 0 0 0 1 0 0
NW_005179401.1 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 1 0 0 0 0 0 1 0 0
NW_005179401.1 109 1 0 0 0 0 0 1 0 0
NW_005179401.1 110 1 0 0 0 0 0 1 0 0
NW_005179401.1 111 1 0 0 0 0 0 1 0 0
NW_005179401.1 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 0 0 1 1 0 0 0 0 2
NW_005179401.1 993 0 0 1 1 0 0 0 0 2
NW_005179401.1 994 0 0 1 1 0 0 0 0 2
NW_005179401.1 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 0 0 1 1 0 0 0 0 0
NW_005179401.1 997 0 0 1 1 0 0 0 0 0
NW_005179401.1 998 0 0 1 1 0 0 0 0 0
NW_005179401.1 999 0 0 1 1 0 0 0 0 0
In reality the file has more fields, but they all contain integers like fields 3 and beyond in the example. The ideal output would look like this, with the first and last values of each consecutive field-2 interval printed in output fields 2 and 3.
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
I found solutions that group consecutive rows with matches in specific fields, but none that also look for consecutive integers in one field, and none that can return the range. One thought was using uniq with the -c flag while skipping the first 2 fields, then adding the counts to the value in field 2, but given the additional condition of requiring consecutive numbers in field 2, I'm not sure where to start. Thanks in advance.
EDIT: I apologize for not originally including my attempted code, but my pipeline used the bioinformatics program bedtools and it kept getting killed for lack of memory, which wasn't something I expected to have to troubleshoot, given the lack of built-in functionality for this. I am an awk novice and didn't know where to start with an alternative pipeline for reformatting this type of file.
I doubt there is a standard tool like uniq -c for this. But you can use this custom awk script:
awk '{$1=$1} $0!=n {s=$2; printf "%s", g}
{$2=$2+1; n=$0; $2=s" "$2-1; g=$0 ORS}
END {printf "%s", g}' yourFile
n is the next anticipated record,
e.g. if the current line is abc 100 x y z then n=abc 101 x y z.
g is the group of records to be printed in case the next anticipated line n does not occur and the group ends.
s is the start number of group g, i.e. the lower bound of the interval.
{$1=$1} is only there to ensure that the field separators in the current line $0 and the generated line n are consistent, so that we can check equality using ==, or rather != in this case.
For your example, this prints
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
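If the one-liner is hard to follow, the same logic can be laid out as a standalone awk file. This is only a reformatted sketch of the command above, not a different algorithm:
#!/usr/bin/awk -f
# merge_consecutive.awk -- collapse runs of consecutive field-2 values into "start end"
{ $1 = $1 }                # rebuild $0 so field separators are normalized for the comparison
$0 != n {                  # this line is not the anticipated continuation of the current run
    s = $2                 # remember the start of the new interval
    printf "%s", g         # flush the previous group's merged line (empty on the first line)
}
{
    $2 = $2 + 1            # build the record we expect to see on the next line
    n = $0
    $2 = s " " ($2 - 1)    # rewrite field 2 of the pending output line as "start end"
    g = $0 ORS
}
END { printf "%s", g }     # flush the final group
It is invoked the same way: awk -f merge_consecutive.awk yourFile.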
$ cat tst.awk
{
    prevVals = currVals
    origRec  = $0
    $2 = ""
    currVals = $0
    $0 = origRec
}
($2 != endKey+1) || (currVals != prevVals) {
    if ( NR > 1 ) {
        prt()
    }
    begKey = $2
}
{ endKey = $2; prevRec = $0 }
END { prt() }
function prt(    saveRec) {
    saveRec = $0
    $0 = prevRec              # print the last record of the group that just ended
    $2 = begKey OFS endKey
    print
    $0 = saveRec
}
$ awk -f tst.awk file
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0

Renaming file based on a value in a tsv file

My input is a tsv file with 5 columns. It has the column names 'Position', 'A', 'B', and so on, which repeat every now and then in the tsv. How can I split this tsv file so that each output file has one set of column headers and the data underneath it, but not the next set of column headers?
Input:
Position A B C D Seg2
1 9 0 0 0 0
2 0 0 16 0 0
3 0 19 0 0 0
4 0 0 18 0 0
Position A B C D Seg1
1 9 0 0 0 1
2 0 0 22 0 0
3 0 19 0 0 0
4 0 0 19 0 0
5 39 0 0 0 0
6 43 0 0 0 0
The ideal output would be the above split into two tsv files, one named Seg1.tsv and the other Seg2.tsv.
What I have:
awk '/Position/{x="F"++i;}{print > x;}' file.tsv
How can I modify the above so that the files are named this way?
You should just derive the filename from the last column:
awk '/Position/{x=$6".tsv"}{print > x;}' file.tsv
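One caveat worth noting: if the input contains many header blocks, non-GNU awk can run out of simultaneously open files. Closing the previous output file before switching avoids that; this sketch assumes each Seg name occurs only once in the input (a reopened file would otherwise be truncated):
awk '/Position/{if (x) close(x); x=$6".tsv"} {print > x}' file.tsv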

How to record properties of other variables in Stata

I have to generate variables entry_1, entry_2, and entry_3, which take the value 1 in a given month if id i had entry=1 in that month.
Example.
id month entry entry_1 entry_2 entry_3
1 1 1 1 0 0
1 2 0 0 0 0
1 3 0 0 1 1
1 4 0 0 0 0
2 1 0 1 0 0
2 2 0 0 0 0
2 3 1 0 1 1
2 4 0 0 0 0
3 1 0 1 0 0
3 2 0 0 0 0
3 3 1 0 1 1
3 4 0 0 0 0
Would anyone be so kind to propose an idea of how to implement a loop in order to do this?
I am thinking of something like this:
forvalues i=1(1)3 {
    gen entry`i'=0
    replace entry`i'=1 if on that particular month id=`i' had entry=1
}
You could do something like this (although your data don't quite look right for the question you're asking):
forvalues i = 1/3 {
    gen entry_`i' = id == `i' & entry == 1
}
This generates a dummy variable entry_i for each i in the forvalues loop where entry_i = 1 if id is i and entry is 1, and 0 otherwise.
The code can be simplified down to at most one loop.
clear
input id month entry entry_1 entry_2 entry_3
1 1 1 1 0 0
1 2 0 0 0 0
1 3 0 0 1 1
1 4 0 0 0 0
2 1 0 1 0 0
2 2 0 0 0 0
2 3 1 0 1 1
2 4 0 0 0 0
3 1 0 1 0 0
3 2 0 0 0 0
3 3 1 0 1 1
3 4 0 0 0 0
end
forval j = 1/4 {
    egen entry`j' = total(entry & id == `j'), by(month)
}
list id month entry entry? , sepby(id)
     +--------------------------------------------------------+
     | id   month   entry   entry1   entry2   entry3   entry4 |
     |--------------------------------------------------------|
  1. |  1       1       1        1        0        0        0 |
  2. |  1       2       0        0        0        0        0 |
  3. |  1       3       0        0        1        1        0 |
  4. |  1       4       0        0        0        0        0 |
     |--------------------------------------------------------|
  5. |  2       1       0        1        0        0        0 |
  6. |  2       2       0        0        0        0        0 |
  7. |  2       3       1        0        1        1        0 |
  8. |  2       4       0        0        0        0        0 |
     |--------------------------------------------------------|
  9. |  3       1       0        1        0        0        0 |
 10. |  3       2       0        0        0        0        0 |
 11. |  3       3       1        0        1        1        0 |
 12. |  3       4       0        0        0        0        0 |
     +--------------------------------------------------------+

How to join a general string to the first column of every sub-row

I want to join each leading general string (in the case below, "ADMIN" and "DB") to the data rows it represents, so that it appears each time in the first column.
Example:
ADMIN
ADMIN_DB Running 1 0 1 0 0 0 80
ADMIN_CATALOG Running 0 0 1 0 0 0 452
ADMIN_CAT Running 0 0 1 0 0 0 58
DB
SLAVE_DB Running 2 0 3 0 0 0 94
DB_BAK Running 1 0 1 0 0 0 54
HISTORY_DB Running 0 0 1 0 0 0 40
HISTORY_DB_BAK Running 0 0 1 0 0 0 59
Expectation:
ADMIN ADMIN_DB Running 1 0 1 0 0 0 80
ADMIN ADMIN_CATALOG Running 0 0 1 0 0 0 452
ADMIN ADMIN_CAT Running 0 0 1 0 0 0 58
DB SLAVE_DB Running 2 0 3 0 0 0 94
DB DB_BAK Running 1 0 1 0 0 0 54
DB HISTORY_DB Running 0 0 1 0 0 0 40
DB HISTORY_DB_BAK Running 0 0 1 0 0 0 59
As a starting point I have an old example which might do this kind of thing, but I'm not very familiar with that kind of scripting: perl -ne 'chomp; if($. % 2){print "$_,";next;}
How about
awk 'NF==1{ val=$0; next} {print val" "$0}' input
You can format the output using the column utility as follows:
$ awk 'NF==1{ val=$0; next} { print val" "$0}' input | column -t
ADMIN ADMIN_DB Running 1 0 1 0 0 0 80
ADMIN ADMIN_CATALOG Running 0 0 1 0 0 0 452
ADMIN ADMIN_CAT Running 0 0 1 0 0 0 58
DB SLAVE_DB Running 2 0 3 0 0 0 94
DB DB_BAK Running 1 0 1 0 0 0 54
DB HISTORY_DB Running 0 0 1 0 0 0 40
DB HISTORY_DB_BAK Running 0 0 1 0 0 0 59
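If you would rather have tab-separated output than columns aligned with column -t, a small variant (just a sketch) rebuilds each line with a tab as the output field separator:
awk -v OFS='\t' 'NF==1{ val=$0; next } { $1 = val OFS $1 } 1' input
Assigning to $1 forces awk to rejoin the whole record with OFS, so every field in the output is tab-separated.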

In Pure Data, how to handle keyup, keydown, and while-keydown?

I'm trying to set up a little MIDI keyboard (using my computer's keyboard) in Pure Data. It works this way:
press a key > send a note_on on the MIDI channel
release a key > send a note_off on the MIDI channel
The problem is that when you keep a key pressed, the [key] object generates a series of inputs instead of a single (long) one. This stops the (desired) note from playing (since the original input stops after ~500 ms) and re-starts the note many times in a row.
I've already tried [change], [timer]+[moses] and other non-solutions. I'm looking for a better implementation of [key] that can handle long key-presses: if I hold a key with [key] for more than a second, I get something like
key....(1 sec passes)...keyup.key.keyup.key.keyup. and it goes on and on...
The problem is that your operating system(!) generates repeated key events if you keep a key pressed.
Solution
The simple solution is to tell your OS to suppress repeated key events.
Workaround
The more complicated workaround is to keep track of the current state of the given key and suppress duplicate keydowns. This is most easily done if you only track a single key (rather than all at once).
For example, here is an abstraction [keypress 97] that detects presses of a (ASCII 97):
[key]        [keyup]
 |            |
[select $1]  [select $1]
 |            |
[t b b]       |
 |  |         |
 | [stop(     |
 |  |         |
 |  +------+  |
 |          \ |
 |          [del 50]
 |            |
[1(          [0(
 |            |
 | +----------+
 |/
[change]
 |
[outlet]
What about [keyname]:
http://en.flossmanuals.net/pure-data/sensors/game-controllers/
Here is an example patch that writes to an array when multiple keys are pressed. It should be possible to use this as a polyphonic input. Using [tabread] and iterating over the array indices should then indicate whether each key is pressed or not (the index should match the ASCII/key number):
#N canvas 800 301 544 205 10;
#X obj 23 23 keyname;
#X symbolatom 89 40 10 0 0 0 - - -;
#X floatatom 23 46 5 0 0 0 - - -;
#X obj 181 18 key;
#X floatatom 181 46 3 0 0 0 - - -;
#X floatatom 220 44 3 0 0 0 - - -;
#X obj 220 18 keyup;
#X obj 44 87 pack float symbol float float;
#X obj 67 117 print;
#X obj 46 151 tabwrite array1;
#N canvas 0 0 450 300 (subpatch) 0;
#X array array1 256 float 1;
#A 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
#X coords 0 1.2 255 0 256 100 1 0 0;
#X restore 277 33 graph;
#X connect 0 0 2 0;
#X connect 0 1 1 0;
#X connect 1 0 7 1;
#X connect 2 0 7 0;
#X connect 2 0 9 0;
#X connect 3 0 4 0;
#X connect 4 0 7 2;
#X connect 4 0 9 1;
#X connect 5 0 7 3;
#X connect 5 0 9 1;
#X connect 6 0 5 0;
#X connect 7 0 8 0;
(Screenshots omitted: the array contents with a + g pressed at the same time, after pressing s, while holding a, and after pressing a.)
I was able to find something here as well: http://puredata.hurleur.com/sujet-3718-pdkb-basic-virtual-midi-keyboard
zipfile: http://puredata.hurleur.com/attachment.php?item=1635
Looks neat, not sure if it functions.
