My input is a tsv file with six columns. It has the column names 'Position', 'A', 'B' and so on, and these header rows repeat every now and then in the tsv. How can I split this tsv file so that each output file has one set of the column headers and the data underneath, but not the next set of column headers?
Input:
Position A B C D Seg2
1 9 0 0 0 0
2 0 0 16 0 0
3 0 19 0 0 0
4 0 0 18 0 0
Position A B C D Seg1
1 9 0 0 0 1
2 0 0 22 0 0
3 0 19 0 0 0
4 0 0 19 0 0
5 39 0 0 0 0
6 43 0 0 0 0
The ideal output would be the above split into two tsv files, one named Seg1.tsv and the other Seg2.tsv.
What I have:
awk '/Position/{x="F"++i;}{print > x;}' file.tsv
How can I modify the above to rename the files?
You should just derive the filename from the last column:
awk '/Position/{x=$6".tsv"}{print > x;}' file.tsv
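If the number of columns might ever change, you could also key on the last field instead of hard-coding $6; closing the previous file is a cheap safeguard against the open-file limits of some non-GNU awks (a minor variation, assuming the segment name is always the last header field):
awk '/Position/ { close(x); x = $NF ".tsv" } { print > x }' file.tsv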
I have a very large file (~700M rows) and I would like to reduce its size by grouping mostly-matching rows. Specifically, the file is sorted by fields 1 and 2, and I would like to group rows where field 2 contains consecutive numbers and all other fields match. If there is a gap in field 2, or if any other field does not match the previous row, I would like to start a new interval. Ideally, the output would give the interval range for each group of rows. I would prefer a solution that works in bash with awk and/or sed, but I'm open to other solutions as long as they don't require re-sorting or other operations that might crash on such a long file.
The input file looks something like this.
NW_005179401.1 100 1 0 0 0 0 0 0 0 0
NW_005179401.1 101 1 0 0 0 0 0 0 0 0
NW_005179401.1 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 1 0 0 0 0 0 1 0 0
NW_005179401.1 104 1 0 0 0 0 0 1 0 0
NW_005179401.1 105 1 0 0 0 0 0 1 0 0
NW_005179401.1 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 1 0 0 0 0 0 1 0 0
NW_005179401.1 109 1 0 0 0 0 0 1 0 0
NW_005179401.1 110 1 0 0 0 0 0 1 0 0
NW_005179401.1 111 1 0 0 0 0 0 1 0 0
NW_005179401.1 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 0 0 1 1 0 0 0 0 2
NW_005179401.1 993 0 0 1 1 0 0 0 0 2
NW_005179401.1 994 0 0 1 1 0 0 0 0 2
NW_005179401.1 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 0 0 1 1 0 0 0 0 0
NW_005179401.1 997 0 0 1 1 0 0 0 0 0
NW_005179401.1 998 0 0 1 1 0 0 0 0 0
NW_005179401.1 999 0 0 1 1 0 0 0 0 0
In reality the file has more fields, but all of them contain integers like fields 3 and beyond in the example. The ideal output will look like this, with the first and last values of each consecutive field-2 interval printed in output fields 2 and 3.
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
I found solutions that group consecutive rows with matches in specific fields, but none that also looks for consecutive integers in one field, and none that can return the range. One thought was using uniq with the -c flag while skipping the first 2 fields, then adding the counts to the value in field 2, but given the additional condition of requiring consecutive numbers in field 2, I'm not sure where to start with this one. Thanks in advance.
EDIT: I apologize for not originally including my attempted code, but my pipeline used the bioinformatics program bedtools, and it kept getting killed for lack of memory, which wasn't something I expected to have to troubleshoot. I am an awk novice and didn't know where to start on an alternative pipeline for reformatting this type of file.
I doubt there is a standard tool like uniq -c for this. But you can use this custom awk script:
awk '{$1=$1} $0!=n {s=$2; printf "%s", g}
{$2=$2+1; n=$0; $2=s" "$2-1; g=$0 ORS}
END {printf "%s", g}' yourFile
n is the next anticipated record,
e.g. if the current line is abc 100 x y z then n=abc 101 x y z.
g is the group line to be printed once the next anticipated line n does not occur and the group ends.
s is the start number of group g, i.e. the lower bound of the interval.
{$1=$1} is only there to normalize the field separators in the current line $0 and the generated line n, so that we can check equality using ==, or rather != in this case.
Since only the current group is kept in memory, the script streams the file and should cope with your 700M rows.
For your example, this prints
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
Here is another awk approach; it remembers the previous record so that each closed group is printed with its own field values rather than those of the line that starts the next group:
$ cat tst.awk
{
    prevVals = currVals
    origRec = $0
    $2 = ""
    currVals = $0
    $0 = origRec
}
($2 != endKey+1) || (currVals != prevVals) {
    if ( NR > 1 ) {
        prt()
    }
    begKey = $2
}
{
    endKey = $2
    prevRec = $0
}
END { prt() }

function prt(   origRec) {
    origRec = $0
    $0 = prevRec                 # last record of the group being closed
    $2 = begKey OFS endKey
    print
    $0 = origRec
}
$ awk -f tst.awk file
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
I have a log file which contains some data and some important table-like parts, as follows:
//Some data
--------------------------------------------------------------------------------
----- Output Table -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
fooooooooo 0 0 3 0 0
boooooooooooooooooooooo 0 0 30 0 0
abv 0 0 16 0 0
bhbhbhbh 0 0 3 0 0
foooo 0 0 198 0 0
WARNING: Some message...
WARNING: Some message...
aaaaaaaaa 0 0 60 0 7
bbbbbbbb 0 0 48 0 7
ccccccc 0 0 45 0 7
rrrrrrr 0 0 50 0 7
abcabca 0 0 42 0 6
// Some data...
--------------------------------------------------------------------------------
----- Another Output Table -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
$$foo12 0 0 3 0 0
$$foo12_720_720_14_2 0 0 30 0 0
I want to extract all such tables from the given file and save each one in a separate file.
Notes:
The start of a table is indicated by a line containing the words {NAME, Attr1, ..., Attr5}.
WARNING messages may exist in the scope of a table and should be ignored.
A table ends when an empty line occurs and the line after that blank line is not a "WARNING" line.
So I expect the following 2 files as output:
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
fooooooooo 0 0 3 0 0
boooooooooooooooooooooo 0 0 30 0 0
abv 0 0 16 0 0
bhbhbhbh 0 0 3 0 0
foooo 0 0 198 0 0
aaaaaaaaa 0 0 60 0 7
bbbbbbbb 0 0 48 0 7
ccccccc 0 0 45 0 7
rrrrrrr 0 0 50 0 7
abcabca 0 0 42 0 6
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
$$foo12 0 0 3 0 0
$$foo12_720_720_14_2 0 0 30 0 0
Following your directions, I would write the awk script below.
#! /usr/bin/awk -f

# start a table with a NAME line
/^ +NAME/ {
    titles = $0
    print
    next
}

# don't print if not in a table
! titles {
    next
}

# blank line may mean end-of-table
/^$/ {
    EOT = 1
    next
}

# warning is not EOT
/^WARNING/ {
    EOT = 0
    next
}

# end of table means we're not in a table anymore, Toto
EOT {
    titles = 0
    EOT = 0
    next
}

# print what's in the table
{ print }
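The script above prints every table to standard output. Since you asked for the tables in separate files, a variant along these lines (the table_N.txt naming is just an example, not anything the question prescribes) would redirect each table to its own numbered file:
#! /usr/bin/awk -f
# same logic as above, but each table goes to its own numbered file
/^ +NAME/  { titles = $0; out = "table_" ++n ".txt"; print > out; next }
! titles   { next }
/^$/       { EOT = 1; next }
/^WARNING/ { EOT = 0; next }
EOT        { titles = 0; EOT = 0; close(out); next }
           { print > out }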
Try this:
awk -F'[[:space:]]+' 'NF>6 || ($0 ~ /-/ && $0 !~ "Output") {print $0}' f
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
fooooooooo 0 0 3 0 0
boooooooooooooooooooooo 0 0 30 0 0
abv 0 0 16 0 0
bhbhbhbh 0 0 3 0 0
foooo 0 0 198 0 0
aaaaaaaaa 0 0 60 0 7
bbbbbbbb 0 0 48 0 7
ccccccc 0 0 45 0 7
rrrrrrr 0 0 50 0 7
abcabca 0 0 42 0 6
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
$$foo12 0 0 3 0 0
$$foo12_720_720_14_2 0 0 30 0 0
I don't know if this question is considered on-topic for Stack Overflow (I'm sorry if it's not, but I have searched and did not find an answer anywhere).
I have coded a full adder.
Output:
Truth Table :
a1 a2 b1 b2 S1 S2 C
______________________________
0 0 0 0 0 0 0
0 0 0 1 0 1 0
0 0 1 0 1 0 0
0 0 1 1 1 1 0
0 1 0 0 0 1 0
0 1 0 1 0 0 1
0 1 1 0 1 1 0
0 1 1 1 1 0 1
1 0 0 0 1 0 0
1 0 0 1 1 1 0
1 0 1 0 0 1 0
1 0 1 1 0 0 1
1 1 0 0 1 1 0
1 1 0 1 1 0 1
1 1 1 0 0 0 1
1 1 1 1 0 1 1
If somebody has ever calculated this, can they tell me if my output is correct?
a1 a2 b1 b2 S1 S2 C a b s c
______________________________
0 0 0 0 0 0 0 0 0 0 0 nothing plus nothing is nothing
0 0 0 1 0 1 0 0 2 2 0 nothing plus two is two
0 0 1 0 1 0 0 0 1 1 0 nothing plus one is one
0 0 1 1 1 1 0 0 3 3 0 nothing plus three is three
0 1 0 0 0 1 0 2 0 2 0 two plus nothing is two
0 1 0 1 0 0 1 2 2 0 1 two plus two is four (four not in 0-3)
0 1 1 0 1 1 0 2 1 3 0 two plus 1 is three
0 1 1 1 1 0 1 2 3 1 1 two plus three is five (one and four)
1 0 0 0 1 0 0 1 0 1 0 one plus nothing is one
1 0 0 1 1 1 0 1 2 3 0 one plus two is three
1 0 1 0 0 1 0 1 1 2 0 one plus one is two
1 0 1 1 0 0 1 1 3 0 1 one plus three is four
1 1 0 0 1 1 0 3 0 3 0 three plus nothing is three
1 1 0 1 1 0 1 3 2 1 1 three plus two is five (one and four)
1 1 1 0 0 0 1 3 1 0 1 three plus one is four
1 1 1 1 0 1 1 3 3 2 1 three plus three is 6 (two and four)
Looks right. Ordering your 16 rows a little differently would make them flow in a more logical order.
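If you'd rather not eyeball all 16 rows, the table is small enough to regenerate mechanically. A throwaway awk script (assuming, as in your annotated table, that a1, b1 and S1 are the low-order bits) prints the expected rows in numeric order of a and b; sort both its output and your table and diff them, and they should match:
#! /usr/bin/awk -f
BEGIN {
    for (a = 0; a < 4; a++)
        for (b = 0; b < 4; b++) {
            s = (a + b) % 4            # two-bit sum
            c = (a + b >= 4) ? 1 : 0   # carry out
            printf "%d %d %d %d %d %d %d\n",
                a % 2, int(a / 2), b % 2, int(b / 2), s % 2, int(s / 2), c
        }
}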
It's an adder! Just check if it's adding. Let's take this row:
a2 a1 b2 b1 C S2 S1
1 0 1 1 1 0 1
Here I have reordered the columns in an easier-to-read manner: higher-order bits first.
The a input is 10 = 2 (base 10). The b input is 11 = 3 (base 10). The output is 101, which is 5 (base 10). So this one is right: 2 + 3 == 5.
I'll let you check the other rows.
I have a file with lots of pieces of information that I want to split on the first column.
Example (example.gen):
1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3054375 799966 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3094375 999566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs3078315 799866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs4054315 759986 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Desired output:
Chr1.gen
1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Chr2.gen
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3054375 799966 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3094375 999566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Chr3.gen
3 rs3078315 799866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs4054315 759986 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Chr4.gen
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
I've tried to do this with the following shell scripts, but they don't work; I can't work out how to get awk to recognise a variable defined outside the awk script itself.
First script attempt (no awk loop):
for i in {1..23}
do
awk '{$1 = $i}' example.gen > Chr$i.gen
done
Second script attempt (with awk loop):
for i in {1..23}
do
awk '{for (i = 1; i <= 23; i++) $1 = $i}' example.gen > Chr$i.gen
done
I'm sure it's probably quite basic, but I just can't work it out...
Thank you!
With awk:
awk '{print > "Chr"$1".gen"}' file
It just prints each line and redirects it to a file. And how is that file named? As "Chr" + first_column + ".gen".
With your sample input it creates 4 files. For example the 4th is:
$ cat Chr4.gen
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
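One caveat worth knowing: with many distinct keys in column 1, awks other than gawk can run out of open file descriptors. Since your sample is already grouped by the first column, closing each file when the key changes sidesteps that:
awk '$1 != prev { close(out); out = "Chr" $1 ".gen"; prev = $1 } { print > out }' example.gen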
First, use #fedorqui's answer, as that is best. But to understand the mistake you made with your first attempt (which was close), read on.
Your first attempt failed because you put the test inside the action (in the braces), not preceding it. The minimal fix:
awk "\$1 == $i" example.gen > Chr$i.gen
This uses double quotes to allow the value of i to be seen by the awk script, but that requires you to then escape the dollar sign for $1 so that you don't substitute the value of the shell's first positional argument. Cleaner but longer:
awk -v i=$i '$1 == i' example.gen > Chr$i.gen
This creates a variable i inside the awk script with the same value as the shell's i variable.
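For completeness, the fixed version of your original loop would then look as follows; note that it re-reads the file once per chromosome, so the single-pass answer above is still the better choice:
for i in {1..23}
do
    awk -v i="$i" '$1 == i' example.gen > "Chr$i.gen"
done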
I have a file containing 100000 lines like this
1 0110100010010101
2 1000010010111001
3 1000011001111000
10 1011110000111110
123 0001000000100001
I would like to know how I can efficiently display just the second field, adding whitespace between the characters.
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
1 0 1 1 1 1 0 0 0 0 1 1 1 1 1 0
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
One solution would be to get the second column with awk and then add the whitespace using sed. But as the file is so long, I would like to avoid using pipes, so I'm wondering if I can do this with awk alone.
Thanks in advance
Is this OK?
awk '{gsub(/./,"& ",$2);print $2}' yourFile
Example:
kent$ echo "1 0110100010010101
2 1000010010111001
3 1000011001111000"|awk '{gsub(/./,"& ",$2);print $2}'
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
Update
It was suggested that this won't work when the 1st column has more than 2 digits. I can't reproduce that:
kent$ echo "133 0110100010010101
233 1000010010111001
333 1000011001111000"|awk '{gsub(/./,"& ",$2);print $2}'
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
gsub(/./,"& ", $2)
1 /./ match any single character
2 "& " & here means the matched string, in this case, each character
3 $2 column 2
so it means, replace each character in 2nd column into the character itself + " ".
One way using only awk:
awk '{ gsub( /./, "& ", $2 ); print $2; }' infile
That yields:
0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1
1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1
1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0
EDIT: Kent and I gave the same implementation, so, for this answer to be a bit more useful, I will add the sed one:
sed -e 's/^[^ ]* *//; s/./& /g' infile
Just adding a sed alternative:
sed -e 's/^.* //;s/./& /g;s/ $//' file
Three commands:
Remove everything up to and including the last space at the start of the line
Replace every character with itself followed by a space
(Optional) Remove the trailing space at the end of the line
A sed solution:
sed 's/.* //;s/\(.\)/\1 /g'
It adds an extra space at the end of each line. Add ;s/ $// to the expression to remove it.
This might work for you (GNU sed):
sed 's/^\S*\s*//;s/\B/ /g' file
Here \S*\s* strips the first field and its separator, and \B matches the empty position between two word characters, so a space is inserted between every pair of digits without leaving a trailing space.