I need to split a file with ~5 million rows based on its columns, i.e., I need to keep the same leading columns in each of the different chunks. I am aware of the split command for row-wise splitting, but I don't know of a similar tool for splitting column-wise the way I'd like. My file has 196 ANN columns:
SNPID CHR POS Z F N LNBF ANN1 ANN2 ANN3
rs367896724 1 10177 0 0 0 -3.36827717630604 0 0 0
rs555500075 1 10352 0 0 0 -2.30999509213213 0 1 0
rs575272151 1 11008 0 0 0 -1.14611711529388 0 0 1
rs544419019 1 11012 0 0 0 -1.14611711529388 1 1 1
The desired output will be
#chunk1
SNPID CHR POS Z F N LNBF ANN1
rs367896724 1 10177 0 0 0 -3.36827717630604 0
rs555500075 1 10352 0 0 0 -2.30999509213213 0
rs575272151 1 11008 0 0 0 -1.14611711529388 0
rs544419019 1 11012 0 0 0 -1.14611711529388 1
#chunk2
SNPID CHR POS Z F N LNBF ANN2
rs367896724 1 10177 0 0 0 -3.36827717630604 0
rs555500075 1 10352 0 0 0 -2.30999509213213 1
rs575272151 1 11008 0 0 0 -1.14611711529388 0
rs544419019 1 11012 0 0 0 -1.14611711529388 1
#chunk3
SNPID CHR POS Z F N LNBF ANN3
rs367896724 1 10177 0 0 0 -3.36827717630604 0
rs555500075 1 10352 0 0 0 -2.30999509213213 0
rs575272151 1 11008 0 0 0 -1.14611711529388 1
rs544419019 1 11012 0 0 0 -1.14611711529388 1
The names of my ANN columns are not really ANN1, ANN2, and so on; they are quite different from one another. I have just used ANN for simplicity.
Speed is a concern, since the file is quite large.
UPDATE: if possible, I would like to split the file every 10 or 20 ANN columns (the total number of ANN columns is 196).
Something like this might work:
% cat script.awk
{
    for (i = 8; i <= NF; i++) {
        # first 7 columns plus one ANN column per output file
        print $1, $2, $3, $4, $5, $6, $7, $i >> ("chunk" (i-7) ".txt")
    }
}
This writes 8 columns into chunk1.txt, chunk2.txt, ..., chunkN.txt, one file per ANN column (the first 7 columns, then that ANN column). Run it with:
awk -f script.awk input_file
Note that awk's >> does not open, append, and close the file on every print: the file is opened once, in append mode, and the handle stays open until close() is called or the program exits (within a single run, > behaves the same apart from truncating first). The thing to watch is the number of simultaneously open files; GNU awk juggles descriptors internally and copes with all 196, but some other awk implementations may refuse.
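To address the update (grouping every 10 or 20 ANN columns rather than one per file), a minimal sketch along the same lines; the chunk size n = 20 and the chunk*.txt naming are assumptions, and GNU awk is assumed so that the ten or so output files can stay open at once:
% cat chunk_groups.awk
BEGIN { n = 20 }                # ANN columns per chunk; set to 10 if preferred
{
    for (i = 8; i <= NF; i += n) {
        out = "chunk" (int((i - 8) / n) + 1) ".txt"
        line = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6 OFS $7
        for (j = i; j < i + n && j <= NF; j++)
            line = line OFS $j  # append this chunk's ANN columns
        print line >> out
    }
}
Run it the same way: awk -f chunk_groups.awk input_file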
A solution with perl:
The initial file, with a few extra columns
$ cat file
SNPID CHR POS Z F N LNBF ANN1 ANN2 ANN3 ANN4 ANN5 ANN6 ANN7 ANN8
rs367896724 1 10177 0 0 0 -3.36827717630604 0 0 0 a b c d e
rs555500075 1 10352 0 0 0 -2.30999509213213 0 1 0 f g h i j
rs575272151 1 11008 0 0 0 -1.14611711529388 0 0 1 k l m n o
rs544419019 1 11012 0 0 0 -1.14611711529388 1 1 1 p q r s t
A perl script to split it up
$ perl -alne '
$n=4; # how many data columns to put into the "split" files
for ( ($i,$j)=(7,1); $i < @F; $i+=$n,$j++ ) {
    open($fh{$j}, ">", "file.$j") unless $fh{$j};
    @data = (@F[0..6], @F[$i .. $i+$n-1]);
    print {$fh{$j}} "@data";
}
' file
The results
$ cat file.1
SNPID CHR POS Z F N LNBF ANN1 ANN2 ANN3 ANN4
rs367896724 1 10177 0 0 0 -3.36827717630604 0 0 0 a
rs555500075 1 10352 0 0 0 -2.30999509213213 0 1 0 f
rs575272151 1 11008 0 0 0 -1.14611711529388 0 0 1 k
rs544419019 1 11012 0 0 0 -1.14611711529388 1 1 1 p
$ cat file.2
SNPID CHR POS Z F N LNBF ANN5 ANN6 ANN7 ANN8
rs367896724 1 10177 0 0 0 -3.36827717630604 b c d e
rs555500075 1 10352 0 0 0 -2.30999509213213 g h i j
rs575272151 1 11008 0 0 0 -1.14611711529388 l m n o
rs544419019 1 11012 0 0 0 -1.14611711529388 q r s t
I have a very large file (~700M rows) and I would like to reduce the size by grouping mostly matching rows. Specifically, the file is sorted by fields 1 and 2 and I would like to group rows where field 2 contains consecutive numbers but all other fields match. If there is a gap in field 2 or if any other fields do not match the previous row then I would like to start a new interval. Ideally, I would like the output to return the interval range for the grouped rows and would prefer a solution that works in bash with awk and/or sed. I'm open to other solutions as well as long as they don't require re-sorting or other operations that might crash with such a long file.
The input file looks something like this.
NW_005179401.1 100 1 0 0 0 0 0 0 0 0
NW_005179401.1 101 1 0 0 0 0 0 0 0 0
NW_005179401.1 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 1 0 0 0 0 0 1 0 0
NW_005179401.1 104 1 0 0 0 0 0 1 0 0
NW_005179401.1 105 1 0 0 0 0 0 1 0 0
NW_005179401.1 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 1 0 0 0 0 0 1 0 0
NW_005179401.1 109 1 0 0 0 0 0 1 0 0
NW_005179401.1 110 1 0 0 0 0 0 1 0 0
NW_005179401.1 111 1 0 0 0 0 0 1 0 0
NW_005179401.1 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 0 0 1 1 0 0 0 0 2
NW_005179401.1 993 0 0 1 1 0 0 0 0 2
NW_005179401.1 994 0 0 1 1 0 0 0 0 2
NW_005179401.1 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 0 0 1 1 0 0 0 0 0
NW_005179401.1 997 0 0 1 1 0 0 0 0 0
NW_005179401.1 998 0 0 1 1 0 0 0 0 0
NW_005179401.1 999 0 0 1 1 0 0 0 0 0
In reality the file has more fields, but all contain integers like fields 3 and beyond in the example. The ideal output will look like this, with the first and last values of each consecutive field-2 interval printed in output fields 2 and 3.
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
I found solutions that group consecutive rows with matches in specific fields, but none that also looks for consecutive integers in one field, and none that can return the range. One thought was to use uniq with the -c flag while skipping the first 2 fields, then add the counts to the value in field 2, but given the additional requirement of consecutive numbers in field 2 I'm not sure where to start. Thanks in advance.
EDIT: I apologize for not originally adding my attempted code. My pipeline used the bioinformatics program bedtools, which kept getting killed for lack of memory; I didn't expect to have to troubleshoot that, since there is no pre-programmed functionality for this task. I am an awk novice and didn't know where to start on an alternative pipeline for reformatting this type of file.
I doubt there is a standard tool like uniq -c for this. But you can use this custom awk script:
awk '{$1=$1} $0!=n {s=$2; printf "%s", g}
{$2=$2+1; n=$0; $2=s" "$2-1; g=$0 ORS}
END {printf "%s", g}' yourFile
n is the next anticipated record,
e.g. if the current line is abc 100 x y z then n=abc 101 x y z.
g is the group of records to be printed in case the next anticipated line n does not occur and the group ends.
s is the start number of group g, i.e. the lower bound of the interval.
{$1=$1} is only there to ensure that the field separators in the current line $0 and the generated line n are consistent, so that we can check equality using ==, or rather != in this case.
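For readability, here is the same program unpacked, with comments (the logic is unchanged):
{ $1 = $1 }                   # renormalize separators so $0 and n compare reliably
$0 != n {                     # this record does not continue the current group
    s = $2                    # remember the start of the new group
    printf "%s", g            # flush the finished group (g is empty on the first line)
}
{
    $2 = $2 + 1; n = $0       # n: the next anticipated record
    $2 = s " " ($2 - 1)       # rewrite field 2 as "start end"
    g = $0 ORS                # buffer the group's would-be output line
}
END { printf "%s", g }        # flush the last group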
For your example, this prints
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
$ cat tst.awk
{
    prevVals = currVals
    origRec = $0
    $2 = ""                         # blank the position so the other fields can be compared
    currVals = $0
    $0 = origRec
}
($2 != endKey+1) || (currVals != prevVals) {    # gap in field 2, or other fields changed
    if ( NR>1 ) {
        prt()
    }
    begKey = $2
}
{ endKey = $2; prevRec = $0 }       # remember the latest record of the current group
END { prt() }
function prt(   origRec) {
    origRec = $0
    $0 = prevRec                    # print the closing group's own fields, not the new record's
    $2 = begKey OFS endKey
    print
    $0 = origRec
}
$ awk -f tst.awk file
NW_005179401.1 100 102 1 0 0 0 0 0 0 0 0
NW_005179401.1 103 106 1 0 0 0 0 0 1 0 0
NW_005179401.1 108 112 1 0 0 0 0 0 1 0 0
NW_005179401.1 992 995 0 0 1 1 0 0 0 0 2
NW_005179401.1 996 999 0 0 1 1 0 0 0 0 0
I have a program that prints out the following:
bash-3.2$ ./drawgrid
0
1 1 0
1 1 0
0 0 0
1
0 1 1
0 1 1
0 0 0
2
0 0 0
1 1 0
1 1 0
3
0 0 0
0 1 1
0 1 1
Is it possible to pipe the output of this command so that I get all the 3x3 matrices (together with their numbers) displayed in a table, for example a 2x2 arrangement like this?
0 1
1 1 0 0 1 1
1 1 0 0 1 1
0 0 0 0 0 0
2 3
0 0 0 0 0 0
1 1 0 0 1 1
1 1 0 0 1 1
I tried searching and came across the column command, but I could not figure it out.
Thank you
You can use pr -2T to get the following output, which is close to what you expected:
0 2
1 1 0 0 0 0
1 1 0 1 1 0
0 0 0 1 1 0
1 3
0 1 1 0 0 0
0 1 1 0 1 1
0 0 0 0 1 1
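The pairing is off because pr fills the first column with the top half of the page before starting the second. Since each block here is exactly 4 lines (the index plus 3 matrix rows), forcing a 4-line page should pair the blocks row-wise as desired; a sketch, assuming GNU pr:
$ ./drawgrid | pr -2 -t -l4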
You could use an awk script:
NF == 1 {                       # an index line (single field)
    if ($NF % 2 == 0) {         # even index: start buffering the left-hand block
        delete line
        line[1] = $1
        f = 1
    } else {                    # odd index: print it beside the buffered index
        print line[1] "\t" $1
        f = 0
    }
    n = 1
}
NF > 1 {                        # a matrix row
    n++
    if (f)
        line[n] = $0            # buffer a row of the left-hand block
    else
        print line[n] "\t" $0   # print buffered row beside the current row
}
And pipe to it like so:
$ ./drawgrid | awk -f 2x2.awk
0 1
1 1 0 0 1 1
1 1 0 0 1 1
0 0 0 0 0 0
2 3
0 0 0 0 0 0
1 1 0 0 1 1
1 1 0 0 1 1
You can get exactly what you expect with a short bash script and a little array index thought:
#!/bin/bash
declare -a idx      # block index lines (single characters)
declare -a acont    # matrix rows, one element per line
declare -i offset=0
while IFS=$'\n' read -r line; do
    [ ${#line} -eq 1 ] && idx+=( "$line" )
    [ ${#line} -gt 1 ] && acont+=( "$line" )
done
for ((i = 0; i < ${#idx[@]}; i += 2)); do
    printf "%4s%8s\n" "${idx[i]}" "${idx[i+1]}"
    for ((j = offset; j < offset + 3; j++)); do
        printf " %8s%8s\n" "${acont[j]}" "${acont[j+3]}"
    done
    offset=$((j + 3))
done
exit 0
Output
$ bash array_cols.sh <dat/cols.txt
0 1
1 1 0 0 1 1
1 1 0 0 1 1
0 0 0 0 0 0
2 3
0 0 0 0 0 0
1 1 0 0 1 1
1 1 0 0 1 1
OK, let's say I have a txt file like this...
X 1 : D i s t a n c e [ m m ]
Y 1 : I n t e n s i t y
X 2 : D i s t a n c e [ m m ]
Y 2 : I n t e n s i t y
I m a g e ( 2 3 7 . 2 3 u )
X 1 Y 1
0 . 0 0 0 0 0 0 4 0 . 0 0 0 0 0 0
0 . 0 0 2 0 0 0 5 7 . 0 0 0 0 0 0
...etc
And several others similar to this...
X 1 : D i s t a n c e [ m m ]
Y 1 : I n t e n s i t y
X 2 : D i s t a n c e [ m m ]
Y 2 : I n t e n s i t y
I m a g e ( 2 6 5 . 2 7 u )
X 1 Y 1
0 . 0 0 0 0 0 0 3 6 . 0 0 0 0 0 0
0 . 0 0 2 0 0 0 3 4 . 0 0 0 0 0 0
0 . 0 0 4 0 0 0 4 0 . 0 0 0 0 0 0
When I use paste to merge the content of these files horizontally...
#! /bin/bash
zeta=$(ls)
paste $zeta >> file_1.txt
I get this (an example with two files):
X 1 : D i s t a n c e [ m m ]
X 1 : D i s t a n c e [ m m ]
Y 1 : I n t e n s i t y
Y 1 : I n t e n s i t y
X 2 : D i s t a n c e [ m m ]
X 2 : D i s t a n c e [ m m ]
Y 2 : I n t e n s i t y
Y 2 : I n t e n s i t y
I m a g e ( 2 3 7 . 2 3 u )
I m a g e ( 2 6 5 . 2 7 u )
X 1 Y 1
X 1 Y 1
0 . 0 0 0 0 0 0 4 0 . 0 0 0 0 0 0
0 . 0 0 0 0 0 0 3 6 . 0 0 0 0 0 0
0 . 0 0 2 0 0 0 5 7 . 0 0 0 0 0 0
0 . 0 0 2 0 0 0 3 4 . 0 0 0 0 0 0
0 . 0 0 4 0 0 0 4 1 . 0 0 0 0 0 0
0 . 0 0 4 0 0 0 4 0 . 0 0 0 0 0 0
Why do I get this intermingling of lines?
How can I put the content of one txt file exactly beside the content of the other? In this case, columns 1 and 2 from my first file and columns 3 and 4 from my second file. And then the same for many files at once?
Thanks for any hint,
Maybe you can append a '\t' to each line before the '\n'. Note that tr cannot replace one character with several, and redirecting back to the input file would truncate it before it is read, so write to a new file, e.g. with GNU sed (which understands \t):
sed 's/$/\t/' text1.txt > text1_tabbed.txt
Afterwards, you can use your old method to paste them together. :)
I have a file with lots of pieces of information that I want to split on the first column.
Example (example.gen):
1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3054375 799966 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3094375 999566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs3078315 799866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs4054315 759986 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Desired output:
Chr1.gen
1 rs3094315 752566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
1 rs2094315 752999 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Chr2.gen
2 rs3044315 759996 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3054375 799966 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
2 rs3094375 999566 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Chr3.gen
3 rs3078315 799866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
3 rs4054315 759986 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
Chr4.gen
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
I've tried to do this with the following shell scripts, but they don't work; I can't work out how to get awk to recognise a variable defined outside the awk script itself.
First script attempt (no awk loop):
for i in {1..23}
do
awk '{$1 = $i}' example.gen > Chr$i.gen
done
Second script attempt (with awk loop):
for i in {1..23}
do
awk '{for (i = 1; i <= 23; i++) $1 = $i}' example.gen > Chr$i.gen
done
I'm sure it's probably quite basic, but I just can't work it out...
Thank you!
With awk:
awk '{print > "Chr"$1".gen"}' file
It just prints each line and redirects it to a file. And how is this file named? As "Chr" + first_column + ".gen".
With your sample input it creates 4 files. For example the 4th is:
$ cat Chr4.gen
4 rs4900215 752998 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs5094315 759886 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
4 rs6094315 798866 A G 0 1 0 1 0 0 1 0 0 0 1 0 0 1
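One caveat: with many distinct values in column 1, some awk implementations hit a limit on simultaneously open files (GNU awk generally copes). Since the input here is already grouped by column 1, a sketch that closes each output file as soon as the key changes:
awk '$1 != prev { if (out) close(out); out = "Chr" $1 ".gen"; prev = $1 } { print > out }' file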
First, use @fedorqui's answer, as that is best. But to understand the mistake you made with your first attempt (which was close), read on.
Your first attempt failed because you put the test inside the action (in the braces), not preceding it. The minimal fix:
awk "\$1 == $i" example.gen > Chr$i.gen
This uses double quotes to allow the value of i to be seen by the awk script, but that requires you to then escape the dollar sign for $1 so that you don't substitute the value of the shell's first positional argument. Cleaner but longer:
awk -v i=$i '$1 == i' example.gen > Chr$i.gen
This creates a variable i inside the awk script with the same value as the shell's i variable.
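For completeness, your original loop with that fix applied (note that it still reads the whole file once per chromosome, so the single-pass answer above remains the faster choice):
for i in {1..23}
do
    awk -v i="$i" '$1 == i' example.gen > "Chr$i.gen"
done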
I am looking for a C++ library for Discrete Wavelet Transform (DWT) which can also return
the NxN DWT matrix of the transform.
There was a similar question opened here
Looking for a good C/C++ wavelet library for signal processing
but I am looking for something more specific as you can see.
It would be more helpful if the library is under some non-GNU license that lets me use it in proprietary software (LGPL, MPL, BSD etc.)
Thanks in advance
The reason why this matrix is never computed is that it is very inefficient to compute the DWT using it. The FWT approach is much faster.
For a signal of length 16 and a 3-level Haar transform, I found that this matrix in MATLAB
>> h=[1 1];
>> g=[1 -1];
>> m1=[[ones(1,8) zeros(1,8); ...
zeros(1,8) ones(1,8); ...
1 1 1 1 -1 -1 -1 -1 zeros(1,8); ...
zeros(1,8) 1 1 1 1 -1 -1 -1 -1]/sqrt(8); ...
[1 1 -1 -1 zeros(1,12); ...
zeros(1,4) 1 1 -1 -1 zeros(1,8); ...
zeros(1,8) 1 1 -1 -1 zeros(1,4); ...
zeros(1,12) 1 1 -1 -1]/sqrt(4); ...
[g zeros(1,14); ...
zeros(1,2) g zeros(1,12); ...
zeros(1,4) g zeros(1,10); ...
zeros(1,6) g zeros(1,8); ...
zeros(1,8) g zeros(1,6); ...
zeros(1,10) g zeros(1,4); ...
zeros(1,12) g zeros(1,2); ...
zeros(1,14) g]/sqrt(2)]
m1 =
A A A A A A A A 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 A A A A A A A A
A A A A -A -A -A -A 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 A A A A -A -A -A -A
B B -B -B 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 B B -B -B 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 B B -B -B 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 B B -B -B
C -C 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 C -C 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 C -C 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 C -C 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 C -C 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 C -C 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 C -C 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 C -C
where A=1/sqrt(8), B=1/sqrt(4) and C=1/sqrt(2).
corresponds to the FWT. That shows you how to build the matrix from the filters. You start with the bottom half, a matrix of zeroes, placing the filter g 2 steps further along in each row. Then make the filter twice as wide and repeat, only now shifting 4 steps at a time. Repeat this until you are at the highest level of decomposition, then finally put the approximation filter in at the same width (here, 8).
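Written compactly (a sketch of the pattern above, with 0-indexed rows r and columns c and the same normalization), the level-k detail rows and the level-K approximation rows of the N x N matrix are

\[
W^{(k)}_{r,c} = \frac{1}{\sqrt{2^k}}
\begin{cases}
+1 & 2^k r \le c < 2^k r + 2^{k-1} \\
-1 & 2^k r + 2^{k-1} \le c < 2^k (r+1) \\
0 & \text{otherwise}
\end{cases}
\qquad
A^{(K)}_{r,c} = \frac{1}{\sqrt{2^K}}
\begin{cases}
1 & 2^K r \le c < 2^K (r+1) \\
0 & \text{otherwise}
\end{cases}
\]

Stacking A^(K), W^(K), W^(K-1), ..., W^(1) from top to bottom reproduces m1: the A rows are A^(3) and W^(3), the B rows are W^(2), and the C rows are W^(1).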
Just as a check:
>> signal=1:16; % ramp
>> [h g]=daubcqf(2); % Haar coefficients from the Rice wavelet toolbox
>> fwt(h,signal,3) % fwt code by Jeffrey Kantor
>> m1*signal' % should produce the same vector
Hope that helps you write it in C++. It is not difficult (a bit of bookkeeping) but, as said, no one uses it because efficient algorithms do not need it.