Removing certain columns from a text file [duplicate] - macos

This question already has answers here:
Deleting columns from a file with awk or from command line on linux
(4 answers)
Closed 8 years ago.
I have a text file that looks like this:
A B C A B C A B C A B
G T C A G T C A G T C
A B C A B C A B C A B
A B C A B C A B C A B
A D E A B D E A B D E
A B C A B C A B C A B
C B D G C B D G C B D
Is there a way to remove only certain columns and leave the other columns intact?
For example, removing only columns 2 and 5:
A C A C A B C A B
G C A T C A G T C
A C A C A B C A B
A C A C A B C A B
A E A D E A B D E
A C A C A B C A B
C D G B D G C B D
Thanks in advance.
UPDATE:
Found this answer using awk, but it removes a whole contiguous block of columns, and I only want to remove specific ones.
awk for removing columns 3 to 5:
awk 'BEGIN{FS="\t"} {for (i=1; i<NF; i++) if (i<3 || i>5) printf "%s%s", $i, FS; print $NF}' input.txt

In your case you could do
cut -d ' ' --complement -s -f 2,5 your_file
where ' ' is the delimiter (in your case, the space). Note that --complement is a GNU coreutils extension; the BSD cut that ships with macOS does not support it.
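If you're on macOS (or anywhere without GNU cut), plain awk can do the same job. A minimal sketch, assuming space-separated columns, that drops fields 2 and 5:
awk '{
    out = ""
    for (i = 1; i <= NF; i++)       # walk every field
        if (i != 2 && i != 5)       # skip the columns to remove
            out = out (out == "" ? "" : " ") $i
    print out
}' your_file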

destructure sequence into lexical variables

I have a sequence with a known number of elements (from a pcre match) and would like to map it into lexical variables.
I could probably loop over the sequence, push every element onto the stack, and then :> ( a b c d ), but is there an idiomatic way to do this?
Oh, and my sequence has more than 4 elements, so first4 doesn't cut it, although I could obviously use first4 and then first3 on a subset of the sequence.
If you are sure that's what you really want to do, you could use firstn from quotations.generalizations:
SYMBOLS: a b c d e f g h ;
[let
    { 1 2 3 4 5 6 7 8 }
    8 firstn :> ( a b c d e f g h )
    a b c d e f g h . . . . . . . .
]
But it sounds like a bad idea. It's tricky, because the lexical variables are not "real" variables, the compiler converts them into stack shuffling. That's why they don't play nice with macros and :> can't be called like a regular word.
If you use dynamic variables it's easier:
SYMBOLS: a b c d e f g h ;
{ 1 2 3 4 5 6 7 8 }
{ a b c d e f g h } [ set ] 2each
{ a b c d e f g h } [ get . ] each

Processing data swapped over files BASH

First, I would like to apologize for my extremely basic knowledge of coding; I hope I will be able to express my issue correctly. Do not hesitate to ask for further clarification.
I'm having trouble postprocessing some data.
My goal is to recombine data that were swapped across files.
EDIT: here is a .rar archive containing my test example, which works, and the one that I am trying to get working (do not be put off by the time it takes to process the data):
https://drive.google.com/file/d/1AEPUc8haT5_Z3LR3jnZZlpyfxhdDwwo6/view?usp=sharing
EDIT 2: Here is what I expect on paper (it's the TestReorder3OK folder in my rar archive).
EDIT 3: MINIMAL COMPLETE EXAMPLE
Script:
#!/bin/bash
# Define the number of replicas
NP=3
NP1=$((NP-1))
rm -f torder*
for repl in $(seq 0 $NP1)
do
    echo $repl
    # paste column 2 of the .lammps file into a file rep_0, then on the next pass column 3 into rep_1, etc.
    awk -v rep=$repl '{r2=rep+2; print $r2}' < log.lammps > rep_$repl
    i=0
    j=0
    # nested loop over the values just extracted for this replica
    for a in $(cat rep_$repl)
    do
        i=$((i+1))
        j=$((j+3))
        head -$i screen.$repl.temp | tail -1 >> torder.$a
        head -$j ccccd2_H_${repl}_col.bak2 | tail -3 >> ccccd2_H_${a}_temp_col.bak2
    done
done
The log.lammps file:
1 0 1 2
2 1 0 2
3 1 2 0
Starting at column 2, this file contains the numbers associated with the inputs below. Here is an expanded explanation:
Column 2 has three values, 0, 1 and 1: the 0 is associated with the first three lines of the file ccccd2_H_0_col.bak2, the next three lines with the first 1, and the last three lines with the second 1.
Column 3 also has three values, 1, 0 and 2: the 1 is associated with the first three lines of the file ccccd2_H_1_col.bak2, the next three lines with the 0, and the last three lines with the 2.
Same story for column 4.
Now what I want is for every set of three lines associated with the value 0 to go into a single file, every set associated with the value 1 to go into another file, and the sets associated with the value 2 to go into a last file. For this example, output file 0 should therefore end up with record 1 of input file 0, record 2 of input file 1, and record 3 of input file 2.
Inputs:
ccccd2_H_0_col.bak2
blank line
N a b c
C d e f
N g h i
C j k l
N m n o
C p q r
ccccd2_H_1_col.bak2
blank line
N s t u
C v w x
N y z a
C b c d
N e f g
C h i j
ccccd2_H_2_col.bak2
blank line
N k l m
C n o p
N q r s
C t u v
N w x y
C z a b
Outputs: these are the desired outputs, and also the ones that I actually get for these simple test files.
ccccd2_H_0_temp_col
blank line
N a b c
C d e f
N y z a
C b c d
N w x y
C z a b
ccccd2_H_1_temp_col
blank line
N g h i
C j k l
N m n o
C p q r
N s t u
C v w x
ccccd2_H_2_temp_col
blank line
N e f g
C h i j
N k l m
C n o p
N q r s
C t u v
This works fine on small test files (as shown here), but not on my real system. There, the log.lammps file contains 14 columns and 10,001 lines, and my input files contain 121,121 lines (blocks of 121 lines). The script creates files about 10 times larger, with more data than they should have.
Can you enlighten me about my issue? I think it is linked to the difference in line count between the file containing a single column and the files containing Cartesian coordinates, but I really don't understand the link, nor how to solve it...
Thank you in advance...
I think I understand what you're trying to do now, and this GNU awk script (it relies on ARGIND, ENDFILE, and gawk's built-in management of open files) will do it:
$ cat ../tst.awk
ARGIND == 1 {                # first file on the command line: log.lammps
    for (inFileNr=2; inFileNr<=NF; inFileNr++) {
        # column inFileNr of row NR gives the output number for record NR of data file inFileNr
        outFileNrs[inFileNr,NR] = $inFileNr
    }
    next
}
# once a file has been read, switch to paragraph mode so each
# blank-line-separated block of the data files becomes one record
ENDFILE { RS = "" }
{ print ORS $0 > ("ccccd2_H_" outFileNrs[ARGIND,FNR] "_temp_col") }
Look:
INPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
$ cat log.lammps
1 0 1 2
2 1 0 2
3 1 2 0
$ paste ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 | sed 's/\t/\t\t/g'
N a b c N s t u N k l m
C d e f C v w x C n o p
N g h i N y z a N q r s
C j k l C b c d C t u v
N m n o N e f g N w x y
C p q r C h i j C z a b
SCRIPT EXECUTION:
$ awk -f ../tst.awk log.lammps ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2
OUTPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col
$ paste ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col | sed 's/\t/\t\t/g'
N a b c N g h i N e f g
C d e f C j k l C h i j
N y z a N m n o N k l m
C b c d C p q r C n o p
N w x y N s t u N q r s
C z a b C v w x C t u v

Rewrite matrix into rules

I have a lot of rectangular matrices where each cell represents some outcome. As matrices are difficult to maintain, my goal is to rewrite all of them as rules.
Example Matrix 1 (image omitted) has three rectangular regions: A, B and C. This is easy to turn into rules (pseudocode):
if (i <= 5 and j <=3) then A
else if (i <= 5 and j >=4) then B
else C
How do I rewrite the following matrix?
Plain text:
ij 1 2 3 4 5 6 7 8 9
1 A A A A C C C C B
2 A A A C C C C B B
3 A A C C C C B B B
4 A C C C C B B B B
5 C C C C B B B B B
6 C C C B B B B B B
7 C C B B B B B B B
8 C B B B B B B B B
9 B B B B B B B B B
The second matrix can be represented as:
if (i+j <= 5)
    return A;
else if (i+j <= 9)
    return C;
else
    return B;
In general, you can check which side of a diagonal line a point is on by testing i+j for a / line, or i-j for a \ line.
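As a quick sanity check, here is a small awk sketch (assuming the 1-based indices i and j and the 9x9 size from the question) that regenerates the second matrix from the i+j rule:
awk 'BEGIN {
    for (i = 1; i <= 9; i++) {
        row = ""
        for (j = 1; j <= 9; j++) {
            s = i + j
            # A in the upper-left triangle, C in the middle band, B in the lower-right
            row = row (s <= 5 ? "A" : s <= 9 ? "C" : "B") " "
        }
        print row
    }
}'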

Concatenate last columns from multiple files of one type

I am trying to cat the last 2 columns of multiple text files side by side. The files sit in a directory among various other types of files. All files have more than 2 columns, but there is no guarantee that they all have the same number of columns.
For example, if I have:
file1.txt
1 a b J H
2 b c E E
3 c d L L
4 d e L L
5 e f O O
file2.txt
1 a b M B
2 b c O E
3 c d O E
I want:
J H M B
E E O E
L L O E
L L
O O
The closest I've got is:
awk '{print $(NF-1), "\t", $NF}' *.txt
Which is almost what I want.
For the concatenation, I was thinking of something along the lines of:
pr -m -t one.txt two.txt
awk 'NR==FNR{a[NR]=$(NF-1)" "$NF;next}{print $(NF-1),$NF,a[FNR]}' file2.txt file1.txt
Tested:
> cat temp2
1 a b M B
2 b c O E
3 c d O E
> cat temp1
1 a b J H
2 b c E E
3 c d L L
4 d e L L
5 e f O O
> awk 'NR==FNR{a[NR]=$(NF-1)" "$NF;next}{print $(NF-1),$NF,a[FNR]}' temp2 temp1
J H M B
E E O E
L L O E
L L
O O
>
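For readability, here is the same one-liner expanded with comments; NR==FNR only holds while awk is reading the first file named on the command line:
awk '
NR == FNR {                    # reading the first file (file2.txt)
    a[NR] = $(NF-1) " " $NF    # stash its last two columns by line number
    next
}
{ print $(NF-1), $NF, a[FNR] } # second file: its last two columns, then the stored ones
' file2.txt file1.txt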
join -a1 -a2 one.txt two.txt | cut -d' ' -f4,5,8,9
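The join works here because both files are keyed by the sequence numbers in column 1. If you need to combine more than two files, here is a plain-awk sketch, assuming whitespace-separated columns and that every *.txt file in the directory should be included:
awk '
FNR == 1 { nf++ }                    # entering a new input file: bump the file counter
{
    cols[nf, FNR] = $(NF-1) " " $NF  # remember its last two columns by line number
    if (FNR > maxl) maxl = FNR       # track the length of the longest file
}
END {
    for (r = 1; r <= maxl; r++) {    # print line r of every file side by side
        line = cols[1, r]
        for (f = 2; f <= nf; f++)
            line = line "\t" cols[f, r]
        print line
    }
}' *.txt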

Which function/algorithm for this merging and filling operation?

I have written R code that merges two data frames on their first column and, where data are missing, fills in the value from above. Here is what it does:
Two input data frames:
1 a
2 b
3 c
5 d
And
1 e
4 f
6 g
My code gives this output:
1 a e
2 b e
3 c e
4 c f
5 d f
6 d g
My code is, however, inefficient, as it is not properly vectorized. Are there R functions I could use instead? Basically, I am looking for a function that fills in missing/NA values by taking the value of the previous element and putting it in place of the NA.
I looked through an R reference book, but could not find anything.
Here is a solution making use of zoo::na.locf:
library(zoo)
a <- data.frame(id=c(1,2,3,5), v=c("a","b","c","d"))
b <- data.frame(id=c(1,4,6), v=c("e","f","g"))
n <- max(c(a$id, b$id))
# expand each frame to the full id range; missing ids become NA
an <- merge(data.frame(id=1:n), a, all.x=TRUE)
bn <- merge(data.frame(id=1:n), b, all.x=TRUE)
# carry the last observation forward over the NAs
an$v <- na.locf(an$v)
bn$v <- na.locf(bn$v)
data.frame(an$id, an$v, bn$v)
an.id an.v bn.v
1 1 a e
2 2 b e
3 3 c e
4 4 c f
5 5 d f
6 6 d g
