Below you can see the reproduced sample of my data.
DATA <- structure(list(ID = c("101", "101", "101", "101", "101", "101","101", "101", "101", "101"), IDA = c("1", "1", "2", "3", "4","5", "5", "1859", "1860", "1861"), DATE = structure(c(1300928400,1277946000, 1277946000, 1278550800, 1278550800, 1453770000, 1329958800,1506474000, 1485133200, 1485133200), tzone = "UTC", class = c("POSIXct","POSIXt")), NR = c("CH-0001", "CH-0001","CH-0002", "CH-0003", "CH-0004", "CH-0005","CH-0005", "CH-1859", "CH-1860", "CH-1861"), PAT = c("101-1", "101-1", "101-2", "101-3", "101-4", "101-5","101-5", "101-1859", "101-1860", "101-1861"), INT1 = c(245005,280040, 280040, 280040, 280040, 240040, 240040, NA, NA, NA),INT2 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), INT3 = c(NA_real_,NA_real_, 280010, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, 245035, NA_real_), INT4 = c(NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_), INTX1 = c(NA_real_, 275040, NA_real_,NA_real_, NA_real_, NA_real_, 240080, NA_real_, NA_real_,NA_real_), INTX2 = c(276790, NA_real_, 7612645, NA_real_,NA_real_, NA_real_, 5078219, NA_real_, NA_real_, NA_real_), INTX173 = c(NA_real_, NA_real_, NA_real_, 3456878,NA_real_, NA_real_, 3289778, NA_real_, NA_real_, NA_real_), INTX4 = c(NA_real_, NA_real_, 11198767, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 7025676), KAT = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))
As you see, I have eight columns called: INT1:INT4 and INTX1:INTX4. For each row there are only a maximum of four values for these variables and the rest are NAs. I need to create four new variables called ING1:ING4 and tell R to check the 8 columns one by one per row and assign the first value it finds in that row to ING1, the second value to ING2, the third value to ING3, and the fourth value to ING4.At the end, it is possible that, for a row, all or some of the ING1:ING4 columns are filled with values.
I would expect for row 1 I get the following ING columns:
ING1 == 245005, ING2 == 276790, ING3 == NA, ING4 ==NA
I think I need to write a loop for that but as I am a beginner I am lost how to do it. Could you kindly help me with it?
Try this:
fun <- function(select, prefix = "ING", ncol = -1, data = cur_data()) {
select <- substitute(select)
out <- asplit(t(
apply(subset(data, select = eval(select)), 1, sort, na.last = TRUE)
), 2)
names(out) <- paste0(prefix, seq_along(out))
if (ncol > 0) out <- out[seq_len(ncol)]
do.call(data.frame, out)
}
And its use:
dplyr
library(dplyr)
DATA %>%
mutate(fun(INT1:INTX4, ncol=4))
# # A tibble: 10 × 18
# ID IDA DATE NR PAT INT1 INT2 INT3 INT4 INTX1 INTX2 INTX173 INTX4 KAT ING1 ING2 ING3 ING4
# <chr> <chr> <dttm> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 1 2011-03-24 01:00:00 CH-0001 101-1 245005 NA NA NA NA 276790 NA NA 0 245005 276790 NA NA
# 2 101 1 2010-07-01 01:00:00 CH-0001 101-1 280040 NA NA NA 275040 NA NA NA 0 275040 280040 NA NA
# 3 101 2 2010-07-01 01:00:00 CH-0002 101-2 280040 NA 280010 NA NA 7612645 NA 11198767 0 280010 280040 7612645 11198767
# 4 101 3 2010-07-08 01:00:00 CH-0003 101-3 280040 NA NA NA NA NA 3456878 NA 0 280040 3456878 NA NA
# 5 101 4 2010-07-08 01:00:00 CH-0004 101-4 280040 NA NA NA NA NA NA NA 0 280040 NA NA NA
# 6 101 5 2016-01-26 01:00:00 CH-0005 101-5 240040 NA NA NA NA NA NA NA 0 240040 NA NA NA
# 7 101 5 2012-02-23 01:00:00 CH-0005 101-5 240040 NA NA NA 240080 5078219 3289778 NA 0 240040 240080 3289778 5078219
# 8 101 1859 2017-09-27 01:00:00 CH-1859 101-1859 NA NA NA NA NA NA NA NA 1 NA NA NA NA
# 9 101 1860 2017-01-23 01:00:00 CH-1860 101-1860 NA NA 245035 NA NA NA NA NA 1 245035 NA NA NA
# 10 101 1861 2017-01-23 01:00:00 CH-1861 101-1861 NA NA NA NA NA NA NA 7025676 1 7025676 NA NA NA
base R
cbind(DATA, fun(data = DATA, INT1:INTX4, ncol=4))
# ID IDA DATE NR PAT INT1 INT2 INT3 INT4 INTX1 INTX2 INTX173 INTX4 KAT ING1 ING2 ING3 ING4
# 1 101 1 2011-03-24 01:00:00 CH-0001 101-1 245005 NA NA NA NA 276790 NA NA 0 245005 276790 NA NA
# 2 101 1 2010-07-01 01:00:00 CH-0001 101-1 280040 NA NA NA 275040 NA NA NA 0 275040 280040 NA NA
# 3 101 2 2010-07-01 01:00:00 CH-0002 101-2 280040 NA 280010 NA NA 7612645 NA 11198767 0 280010 280040 7612645 11198767
# 4 101 3 2010-07-08 01:00:00 CH-0003 101-3 280040 NA NA NA NA NA 3456878 NA 0 280040 3456878 NA NA
# 5 101 4 2010-07-08 01:00:00 CH-0004 101-4 280040 NA NA NA NA NA NA NA 0 280040 NA NA NA
# 6 101 5 2016-01-26 01:00:00 CH-0005 101-5 240040 NA NA NA NA NA NA NA 0 240040 NA NA NA
# 7 101 5 2012-02-23 01:00:00 CH-0005 101-5 240040 NA NA NA 240080 5078219 3289778 NA 0 240040 240080 3289778 5078219
# 8 101 1859 2017-09-27 01:00:00 CH-1859 101-1859 NA NA NA NA NA NA NA NA 1 NA NA NA NA
# 9 101 1860 2017-01-23 01:00:00 CH-1860 101-1860 NA NA 245035 NA NA NA NA NA 1 245035 NA NA NA
# 10 101 1861 2017-01-23 01:00:00 CH-1861 101-1861 NA NA NA NA NA NA NA 7025676 1 7025676 NA NA NA
This question already has answers here:
How can I format the output of a bash command in neat columns
(7 answers)
Closed 4 years ago.
I have a file that comes from R. It is basically the output of write.table command using as delimiter " ". An example of this file would look like this:
file1.txt
5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G
8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
What I need to get is a new file in a very specific format, basically I need to align all the columns using spaces, I can't use tabs.
The desired output will be
5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G
8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
Thus, between 5285 and II-3, first row, there would be 3 white spaces and between 8F412 and III-3, third row, only two white spaces. The lengths of first tree fields can be different, however the length for the rest of columns is always fixed (two characters) but the last one that can be 12 characters
I can do this in a text editor but I have a very huge file, and I would like to do it using bash, awk or R
Use column:
$ column -t file
5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G
8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
Use awk so that you have tight control on how you want to format each field:
awk '{ printf("%-5s %-5s %-5s %s %s %s %s %s %s %s %s %s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12) }' file
Produces:
5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G
8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
here is another approach
$ tr ' ' '\t' <file | expand -t2
5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G
8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
I have 4 column data files which have approximately 100 lines. I'd like to substract every nth from (n+3)th line and print the values in a new column ($5). The column data has not a regular pattern for each column.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can i do this in awk?
Thank you
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th column values in an array indexed by the line number. The next block is executed for the second pass and creates a 5th column with the desired calculation (checking first to make sure that we wouldn't go passed the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
I have a list such as below, where the 1 column is position and the other columns aren't important for this question.
1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
I want to fill in the gaps such that the list is continuous and it reads
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
I am familiar with awk and shell scripts, but whatever way it can be done is fine with me.
Thanks for any help..
this one-liner may work for you:
awk '$1>++p{for(;p<$1;p++)print p" 0 0 0 0 0"}1' file
with your example:
kent$ echo '1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5'|awk '$1>++p{for(;p<$1;p++)print p" 0 0 0 0 0"}1'
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
You can use the following awk one-liner:
awk '{b=a;a=$1;while(a>(b++)+1){print(b+1)," 0 0 0 0 0"}}1' input.file
Tested with here-doc input:
awk '{b=a;a=$1;while(a>(b++)+1){print(b+1)," 0 0 0 0 0"}}1' <<EOF
1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
EOF
the output is as follows:
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
Explanation:
On every input line b is set to a where a is the value of the first column. Because of the order in which b and a are initialized, b can be used in a while loop that runs as long as b < a-1 and inserts the missing lines, filled up with zeros. The 1 at the end of the script will finally print the input line.
This is only for fun:
join -a2 FILE <(seq -f "%g 0 0 0 0 0" $(tail -1 FILE | cut -d' ' -f1)) | cut -d' ' -f -6
produces:
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
Here is another way:
awk '{x=$1-b;while(x-->1){print ++b," 0 0 0 0 0"};b=$1}1' file
Test:
$ cat file
1 1 2 3 4 5
2 1 2 3 4 5
5 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5
$ awk '{x=$1-b;while(x-->1){print ++b," 0 0 0 0 0"};b=$1}1' file
1 1 2 3 4 5
2 1 2 3 4 5
3 0 0 0 0 0
4 0 0 0 0 0
5 1 2 3 4 5
6 0 0 0 0 0
7 0 0 0 0 0
8 1 2 3 4 5
9 1 2 3 4 5
10 1 2 3 4 5
11 1 2 3 4 5