Arrange() does not work within for loop in R

Nice to meet you all. This is my first question in this wonderful place. Thank you in advance for your time and problem-solving efforts.
I want to see the top 10 donor countries' names each year.
My data looks like this.
head(Afghanistan)
# A tibble: 6 × 62
Donor Y_1960 Y_1961 Y_1962 Y_1963 Y_1964 Y_1965 Y_1966 Y_1967 Y_1968 Y_1969 Y_1970 Y_1971 Y_1972 Y_1973 Y_1974 Y_1975 Y_1976 Y_1977 Y_1978
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austral… NA NA NA NA NA 0.03 0.04 0.04 0.06 0.74 0.1 0.12 0.15 0.58 0.15 0.25 0.26 0.65 1.03
2 Austria NA NA NA NA 0.02 0.02 0.03 0.05 0.02 0.01 NA 0.05 NA NA NA NA 0.03 0.09 0.08
3 Belgium NA NA NA NA NA NA NA NA NA NA NA NA NA 0.01 0.01 0.01 0.04 0.03 0.06
4 Canada NA NA NA NA NA 0.01 0.03 0.03 0.01 0.01 0.7 2.14 1.14 2.22 0.54 1.47 0.38 0.34 4.19
5 Czech R… NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 Denmark NA NA NA NA NA NA NA 0.01 0.02 0.02 0.02 0.02 0.05 0.12 0.05 0.93 2.57 0.06 0.41
# … with 42 more variables
head(DAC)
iso2c country year unemp unemp_ILO gdpgrowth gdppc iso3c
6 AT Austria 1965 NA NA 3.480175 2.810013 AUT
7 AT Austria 1966 NA NA 5.642861 4.904479 AUT
8 AT Austria 1967 NA NA 3.008048 2.241010 AUT
9 AT Austria 1968 NA NA 4.472313 3.931242 AUT
10 AT Austria 1969 2.8 NA 6.275867 5.909496 AUT
11 AT Austria 1970 2.4 NA 6.321143 5.950497 AUT
Before making a for loop, I tried this to test the code.
new <- Afghanistan %>%
  select(Donor, Y_1990) %>%
  arrange(desc(Y_1990)) %>%
  drop_na() %>%
  slice_max(Y_1990, n=10)
new
# A tibble: 10 × 2
Donor Y_1990
<chr> <dbl>
1 United States 56
2 Sweden 16.0
3 Germany 8.07
4 Norway 3.35
5 Netherlands 3.13
6 Canada 2.73
7 United Kingdom 2.39
8 Switzerland 2.01
9 France 1.87
10 Denmark 1.58
When I run the above code, arrange() works well. However, when I use it inside a for loop, the rows come out in alphabetical order by country name instead.
year <- seq(1960, 2020, 1)
for(i in year){
  Year <- paste0("Y", "_", i)
  new <- Afghanistan %>%
    select(Donor, Year) %>%
    arrange(desc(Year)) %>%
    drop_na()
  print(new[1:10, 1])
}
# A tibble: 10 × 2
Donor Y_1990
<chr> <dbl>
1 Australia 1.23
2 Austria 1.49
3 Belgium 0.01
4 Canada 2.73
5 Denmark 1.58
6 Finland 0.15
7 France 1.87
8 Germany 8.07
9 Ireland 0.05
10 Japan 0.03
# A tibble: 10 × 2
Donor Y_1991
<chr> <dbl>
1 Australia 4.75
2 Austria 0.62
3 Canada 5.31
4 Denmark 2.71
5 Finland 2.28
6 France 1.5
7 Germany 5.52
8 Italy 0.4
9 Japan 0.02
10 Korea 0.11
Like the above, the results are not arranged by the value of Year. How should I fix this? I tried to use rank() but it also only gives alphabetical results.

You have fallen victim to the intricacies of non-standard evaluation. You need to unquote Year like this:
arrange(desc(!!sym(Year)))
Inside arrange(), Year is first looked up as a column name. No column is called Year, so arrange() falls back to the variable Year in your environment, which holds a single string such as "Y_1990". Sorting by a constant leaves the rows in their original (alphabetical) order. !!sym(Year) turns that string into a column reference, so arrange() sorts by the column it names.
Now you might be asking: "why doesn't select() have this problem?" The answer is that it does behave this way, but select() also tries to work out what you meant when you hand it a character vector. In fact, you should see a warning the first time you do this in your session. The recommended form there is:
select(Donor, all_of(Year))
Some reading on NSE:
https://www.r-bloggers.com/2019/07/bang-bang-how-to-program-with-dplyr/
https://www.brodrigues.co/blog/2019-06-20-tidy_eval_saga/
http://adv-r.had.co.nz/Computing-on-the-language.html
And some other SO posts on this matter:
How do {{}} double curly brackets work in dplyr?
Turn a {{}} (dplyr double curly braces) interpolation into a string

Related

Analysis on the basis of comparison of 1st column of 1 files with 1st column of N number of files and print all files based of column 1

I have tab-separated files and need to compare File 1 with N (10) other files. If an ID in column 1 of File 1 also appears in column 1 of another file, print the File 1 line followed by the values from that file; if the ID is not present, print NA for each of that file's columns. Example input and the expected output are given below.
File 1
A 1.1 0.2 0.3 1.1
B 1.3 2.1 0.2 0.1
C 1.8 0.5 2.6 3.8
D 1.2 5.1 1.7 0.1
E 1.9 4.3 2.8 1.6
F 1.6 5.1 2.9 7.1
G 1.8 2.8 0.3 3.7
H 1.9 3.6 3.7 0.1
I 1.0 2.4 4.9 2.5
J 1.1 2.0 0.1 0.4
File 2
A d1 Q2 Q.3 E.1
B a.3 S.1 A.2 R.1
J a.1 2.0 031 4a4
File 3
E 1d9 4a3 2A8 1D6
F 1a.6 5a1 2W9 7Q1
J QA8 1.8 0W3 3E7
File 4
F 1aa 5a 2Q 7WQ
G ac UW 0QW 3aQ
A QQ aws AW qw
I tried the following code with two files first, but I am not getting the expected output:
awk '
FILENAME == "File_2" {
    id = $0
    val[id] = $2","$3","$5
}
FILENAME == "File_1" {
    id = $1
    string
    if (val[id] == "") {
        print id " " "NA"
    } else {
        print id " " val[id]
    }
}
' File_2 File_1
The above code prints File_2 with NA at the end of each line.
My expected output looks like this:
Final Expected Output
A 1.1 0.2 0.3 1.1 d1 Q2 Q.3 E.1 NA NA NA NA QQ aws AW qw
B 1.3 2.1 0.2 0.1 a.3 S.1 A.2 R.1 NA NA NA NA NA NA NA NA
C 1.8 0.5 2.6 3.8 NA NA NA NA NA NA NA NA NA NA NA NA
D 1.2 5.1 1.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
E 1.9 4.3 2.8 1.6 NA NA NA NA 1d9 4a3 2A8 1D6 NA NA NA NA
F 1.6 5.1 2.9 7.1 NA NA NA NA 1a.6 5a1 2W9 7Q1 1aa 5a 2Q 7WQ
G 1.8 2.8 0.3 3.7 NA NA NA NA NA NA NA NA ac UW 0QW 3aQ
H 1.9 3.6 3.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
I 1.0 2.4 4.9 2.5 NA NA NA NA NA NA NA NA NA NA NA NA
J 1.1 2.0 0.1 0.4 a.1 2.0 031 4a4 QA8 1.8 0W3 3E7 NA NA NA NA
Using GNU awk for arrays of arrays, ARGIND, and gensub():
$ cat tst.awk
BEGIN { FS=OFS="\t" }
ARGIND < (ARGC-1) {
    key = $1
    sub("[^"FS"]+"FS"?","")
    fileNrsKeys2vals[ARGIND][key] = $0
    fileNrs2numFlds[ARGIND] = NF
    next
}
{
    printf "%s", $0
    for ( fileNr=1; fileNr<ARGIND; fileNr++ ) {
        if ( fileNr in fileNrs2numFlds ) {
            numFlds = fileNrs2numFlds[fileNr]
            printf "%s", ( $1 in fileNrsKeys2vals[fileNr] ?
                    OFS fileNrsKeys2vals[fileNr][$1] :
                    gensub(/ /,OFS"NA","g",sprintf("%*s",numFlds,"")) )
        }
    }
    print ""
}
$ awk -f tst.awk file2 file3 file4 file1
A 1.1 0.2 0.3 1.1 d1 Q2 Q.3 E.1 NA NA NA NA QQ aws AW qw
B 1.3 2.1 0.2 0.1 a.3 S.1 A.2 R.1 NA NA NA NA NA NA NA NA
C 1.8 0.5 2.6 3.8 NA NA NA NA NA NA NA NA NA NA NA NA
D 1.2 5.1 1.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
E 1.9 4.3 2.8 1.6 NA NA NA NA 1d9 4a3 2A8 1D6 NA NA NA NA
F 1.6 5.1 2.9 7.1 NA NA NA NA 1a.6 5a1 2W9 7Q1 1aa 5a 2Q 7WQ
G 1.8 2.8 0.3 3.7 NA NA NA NA NA NA NA NA ac UW 0QW 3aQ
H 1.9 3.6 3.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
I 1.0 2.4 4.9 2.5 NA NA NA NA NA NA NA NA NA NA NA NA
J 1.1 2.0 0.1 0.4 a.1 2.0 031 4a4 QA8 1.8 0W3 3E7 NA NA NA NA
This solution requires a " | sort" since awk arrays are not guaranteed to be in order. It also is sensitive to the number of spaces immediately following the index letter ("A", "B", "C", etc.):
Mac_3.2.57$cat mergeLinesV0.awk
BEGIN {
    i1=1
    i2=1
    i3=1
    i4=1
} NR == FNR {
    ar1[i1]=$0
    i1=i1+1
    f1size=FNR
    next
}{
    f1done=1
} NR-f1size == FNR && f1done {
    ar2[i2]=$0
    i2=i2+1
    f2size=FNR
    next
}{
    f2done=1
} NR-f1size-f2size == FNR && f2done {
    ar3[i3]=$0
    i3=i3+1
    f3size=FNR
    next
}{
    f3done=1
} NR-f1size-f2size-f3size == FNR && f3done {
    ar4[i4]=$0
    i4=i4+1
    f4size=FNR
    next
} END {
    for(i1 in ar1){
        printf("%s ", ar1[i1])
        found2=0
        for(i2 in ar2){
            if(substr(ar1[i1],1,1)==substr(ar2[i2],1,1)){
                printf("%s ", substr(ar2[i2],5))
                found2=1
            }
        }
        if(!found2){
            printf("NA NA NA NA ")
        }
        found3=0
        for(i3 in ar3){
            if(substr(ar1[i1],1,1)==substr(ar3[i3],1,1)){
                printf("%s ", substr(ar3[i3],5))
                found3=1
            }
        }
        if(!found3){
            printf("NA NA NA NA ")
        }
        found4=0
        for(i4 in ar4){
            if(substr(ar1[i1],1,1)==substr(ar4[i4],1,1)){
                printf("%s\n", substr(ar4[i4],5))
                found4=1
            }
        }
        if(!found4){
            printf("NA NA NA NA\n")
        }
    }
}
Mac_3.2.57$awk -f mergeLinesV0.awk File1 File2 File3 File4 | sort
A 1.1 0.2 0.3 1.1 d1 Q2 Q.3 E.1 NA NA NA NA QQ aws AW qw
B 1.3 2.1 0.2 0.1 a.3 S.1 A.2 R.1 NA NA NA NA NA NA NA NA
C 1.8 0.5 2.6 3.8 NA NA NA NA NA NA NA NA NA NA NA NA
D 1.2 5.1 1.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
E 1.9 4.3 2.8 1.6 NA NA NA NA 1d9 4a3 2A8 1D6 NA NA NA NA
F 1.6 5.1 2.9 7.1 NA NA NA NA 1a.6 5a1 2W9 7Q1 1aa 5a 2Q 7WQ
G 1.8 2.8 0.3 3.7 NA NA NA NA NA NA NA NA ac UW 0QW 3aQ
H 1.9 3.6 3.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
I 1.0 2.4 4.9 2.5 NA NA NA NA NA NA NA NA NA NA NA NA
J 1.1 2.0 0.1 0.4 a.1 2.0 031 4a4 QA8 1.8 0W3 3E7 NA NA NA NA
Mac_3.2.57$cat File1
A 1.1 0.2 0.3 1.1
B 1.3 2.1 0.2 0.1
C 1.8 0.5 2.6 3.8
D 1.2 5.1 1.7 0.1
E 1.9 4.3 2.8 1.6
F 1.6 5.1 2.9 7.1
G 1.8 2.8 0.3 3.7
H 1.9 3.6 3.7 0.1
I 1.0 2.4 4.9 2.5
J 1.1 2.0 0.1 0.4
Mac_3.2.57$cat File2
A d1 Q2 Q.3 E.1
B a.3 S.1 A.2 R.1
J a.1 2.0 031 4a4
Mac_3.2.57$cat File3
E 1d9 4a3 2A8 1D6
F 1a.6 5a1 2W9 7Q1
J QA8 1.8 0W3 3E7
Mac_3.2.57$cat File4
F 1aa 5a 2Q 7WQ
G ac UW 0QW 3aQ
A QQ aws AW qw
Mac_3.2.57$
Given your 4 example files (as file1.txt .. file4.txt), here is a ruby that does that:
ruby -rset -lne '
BEGIN{
  files={}
  seen=Set.new()
  data=Hash.new { |h, k| h[k] = Hash.new { |hh, kk| hh[kk] = [] } }
}
fields=$_.split(/\t/)
if $<.file.lineno==1; files[$<.file.path]=fields.length-1; end
seen<<fields[0]
data[fields[0]][files.keys.last]=fields[1..]
END{
  seen.each{|k| row=[k]
    files.each{|file, width|
      if data[k][file].empty?
        row.push(*["NA"]*width)
      else
        row.push(*data[k][file])
      end
    }
    puts row.join("\t")
  }
}' file?.txt
Prints:
A 1.1 0.2 0.3 1.1 d1 Q2 Q.3 E.1 NA NA NA NA QQ aws AW qw
B 1.3 2.1 0.2 0.1 a.3 S.1 A.2 R.1 NA NA NA NA NA NA NA NA
C 1.8 0.5 2.6 3.8 NA NA NA NA NA NA NA NA NA NA NA NA
D 1.2 5.1 1.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
E 1.9 4.3 2.8 1.6 NA NA NA NA 1d9 4a3 2A8 1D6 NA NA NA NA
F 1.6 5.1 2.9 7.1 NA NA NA NA 1a.6 5a1 2W9 7Q1 1aa 5a 2Q 7WQ
G 1.8 2.8 0.3 3.7 NA NA NA NA NA NA NA NA ac UW 0QW 3aQ
H 1.9 3.6 3.7 0.1 NA NA NA NA NA NA NA NA NA NA NA NA
I 1.0 2.4 4.9 2.5 NA NA NA NA NA NA NA NA NA NA NA NA
J 1.1 2.0 0.1 0.4 a.1 2.0 031 4a4 QA8 1.8 0W3 3E7 NA NA NA NA
This produces exact expected output:
gawk '{
    a[$1] = 1
    for (i = 2; i <= 5; ++i)
        b[$1, (ARGIND - 1) * 4 + (i - 2)] = $i
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in a) {
        t = i
        for (j = 0; j < ARGIND * 4; ++j)
            t = t OFS (b[i, j] ? b[i, j] : "NA")
        print t
    }
}' File_{1..4} | column -t
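For comparison, the same left join can be sketched in Python. This is only an illustration, not one of the answers above: the names parse and left_join are made up here, the input is passed as lists of lines instead of being read from files, and each file's NA padding width is assumed to equal the number of value columns in its first row.

```python
def parse(lines, sep=None):
    """Map each line's first field (the ID) to the list of its remaining fields."""
    table = {}
    for line in lines:
        fields = line.split(sep)
        if fields:
            table[fields[0]] = fields[1:]
    return table


def left_join(base_lines, other_files, missing="NA", sep=None, out_sep="\t"):
    """Append to every base line the values of each other file, or NA padding."""
    others = [parse(lines, sep) for lines in other_files]
    # Assumption: the padding width of a file is the width of its first row.
    widths = [len(next(iter(t.values()), [])) for t in others]
    merged = []
    for line in base_lines:
        fields = line.split(sep)
        if not fields:
            continue  # skip blank lines
        row = list(fields)
        for table, width in zip(others, widths):
            row.extend(table.get(fields[0], [missing] * width))
        merged.append(out_sep.join(row))
    return merged


# Sample data copied from the question (two rows of File 1 are enough here).
file1 = ["A 1.1 0.2 0.3 1.1", "C 1.8 0.5 2.6 3.8"]
file2 = ["A d1 Q2 Q.3 E.1", "B a.3 S.1 A.2 R.1", "J a.1 2.0 031 4a4"]
file3 = ["E 1d9 4a3 2A8 1D6", "F 1a.6 5a1 2W9 7Q1", "J QA8 1.8 0W3 3E7"]
file4 = ["F 1aa 5a 2Q 7WQ", "G ac UW 0QW 3aQ", "A QQ aws AW qw"]

merged = left_join(file1, [file2, file3, file4], out_sep=" ")
for line in merged:
    print(line)
```

Because the loop walks File 1 in its original order, no sort step is needed afterwards.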

making four new columns based on 8 existing columns

Below you can see the reproduced sample of my data.
DATA <- structure(list(ID = c("101", "101", "101", "101", "101", "101","101", "101", "101", "101"), IDA = c("1", "1", "2", "3", "4","5", "5", "1859", "1860", "1861"), DATE = structure(c(1300928400,1277946000, 1277946000, 1278550800, 1278550800, 1453770000, 1329958800,1506474000, 1485133200, 1485133200), tzone = "UTC", class = c("POSIXct","POSIXt")), NR = c("CH-0001", "CH-0001","CH-0002", "CH-0003", "CH-0004", "CH-0005","CH-0005", "CH-1859", "CH-1860", "CH-1861"), PAT = c("101-1", "101-1", "101-2", "101-3", "101-4", "101-5","101-5", "101-1859", "101-1860", "101-1861"), INT1 = c(245005,280040, 280040, 280040, 280040, 240040, 240040, NA, NA, NA),INT2 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), INT3 = c(NA_real_,NA_real_, 280010, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, 245035, NA_real_), INT4 = c(NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_), INTX1 = c(NA_real_, 275040, NA_real_,NA_real_, NA_real_, NA_real_, 240080, NA_real_, NA_real_,NA_real_), INTX2 = c(276790, NA_real_, 7612645, NA_real_,NA_real_, NA_real_, 5078219, NA_real_, NA_real_, NA_real_), INTX173 = c(NA_real_, NA_real_, NA_real_, 3456878,NA_real_, NA_real_, 3289778, NA_real_, NA_real_, NA_real_), INTX4 = c(NA_real_, NA_real_, 11198767, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 7025676), KAT = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(NA,-10L), class = c("tbl_df", "tbl", "data.frame"))
As you can see, I have eight columns: INT1:INT4 and INTX1:INTX4. Each row has at most four values across these columns; the rest are NA. I need to create four new variables, ING1:ING4, by checking the eight columns one by one in each row and assigning the first value found to ING1, the second to ING2, the third to ING3, and the fourth to ING4. In the end, a row may have all or only some of the ING1:ING4 columns filled with values.
I would expect for row 1 I get the following ING columns:
ING1 == 245005, ING2 == 276790, ING3 == NA, ING4 ==NA
I think I need to write a loop for that, but as a beginner I am lost as to how to do it. Could you kindly help me with it?
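Try this. Note that sort() orders each row's values from smallest to largest (NAs last), so ING1:ING4 hold the non-NA values in ascending order, which is not necessarily the order of the columns: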
fun <- function(select, prefix = "ING", ncol = -1, data = cur_data()) {
  select <- substitute(select)
  out <- asplit(t(
    apply(subset(data, select = eval(select)), 1, sort, na.last = TRUE)
  ), 2)
  names(out) <- paste0(prefix, seq_along(out))
  if (ncol > 0) out <- out[seq_len(ncol)]
  do.call(data.frame, out)
}
And its use:
dplyr
library(dplyr)
DATA %>%
mutate(fun(INT1:INTX4, ncol=4))
# # A tibble: 10 × 18
# ID IDA DATE NR PAT INT1 INT2 INT3 INT4 INTX1 INTX2 INTX173 INTX4 KAT ING1 ING2 ING3 ING4
# <chr> <chr> <dttm> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 1 2011-03-24 01:00:00 CH-0001 101-1 245005 NA NA NA NA 276790 NA NA 0 245005 276790 NA NA
# 2 101 1 2010-07-01 01:00:00 CH-0001 101-1 280040 NA NA NA 275040 NA NA NA 0 275040 280040 NA NA
# 3 101 2 2010-07-01 01:00:00 CH-0002 101-2 280040 NA 280010 NA NA 7612645 NA 11198767 0 280010 280040 7612645 11198767
# 4 101 3 2010-07-08 01:00:00 CH-0003 101-3 280040 NA NA NA NA NA 3456878 NA 0 280040 3456878 NA NA
# 5 101 4 2010-07-08 01:00:00 CH-0004 101-4 280040 NA NA NA NA NA NA NA 0 280040 NA NA NA
# 6 101 5 2016-01-26 01:00:00 CH-0005 101-5 240040 NA NA NA NA NA NA NA 0 240040 NA NA NA
# 7 101 5 2012-02-23 01:00:00 CH-0005 101-5 240040 NA NA NA 240080 5078219 3289778 NA 0 240040 240080 3289778 5078219
# 8 101 1859 2017-09-27 01:00:00 CH-1859 101-1859 NA NA NA NA NA NA NA NA 1 NA NA NA NA
# 9 101 1860 2017-01-23 01:00:00 CH-1860 101-1860 NA NA 245035 NA NA NA NA NA 1 245035 NA NA NA
# 10 101 1861 2017-01-23 01:00:00 CH-1861 101-1861 NA NA NA NA NA NA NA 7025676 1 7025676 NA NA NA
base R
cbind(DATA, fun(data = DATA, INT1:INTX4, ncol=4))
# ID IDA DATE NR PAT INT1 INT2 INT3 INT4 INTX1 INTX2 INTX173 INTX4 KAT ING1 ING2 ING3 ING4
# 1 101 1 2011-03-24 01:00:00 CH-0001 101-1 245005 NA NA NA NA 276790 NA NA 0 245005 276790 NA NA
# 2 101 1 2010-07-01 01:00:00 CH-0001 101-1 280040 NA NA NA 275040 NA NA NA 0 275040 280040 NA NA
# 3 101 2 2010-07-01 01:00:00 CH-0002 101-2 280040 NA 280010 NA NA 7612645 NA 11198767 0 280010 280040 7612645 11198767
# 4 101 3 2010-07-08 01:00:00 CH-0003 101-3 280040 NA NA NA NA NA 3456878 NA 0 280040 3456878 NA NA
# 5 101 4 2010-07-08 01:00:00 CH-0004 101-4 280040 NA NA NA NA NA NA NA 0 280040 NA NA NA
# 6 101 5 2016-01-26 01:00:00 CH-0005 101-5 240040 NA NA NA NA NA NA NA 0 240040 NA NA NA
# 7 101 5 2012-02-23 01:00:00 CH-0005 101-5 240040 NA NA NA 240080 5078219 3289778 NA 0 240040 240080 3289778 5078219
# 8 101 1859 2017-09-27 01:00:00 CH-1859 101-1859 NA NA NA NA NA NA NA NA 1 NA NA NA NA
# 9 101 1860 2017-01-23 01:00:00 CH-1860 101-1860 NA NA 245035 NA NA NA NA NA 1 245035 NA NA NA
# 10 101 1861 2017-01-23 01:00:00 CH-1861 101-1861 NA NA NA NA NA NA NA 7025676 1 7025676 NA NA NA
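If the values must stay in column order rather than sorted order, the core step is "drop the NAs, keep the order, pad back to four". A minimal Python sketch of just that step follows; None stands in for NA, and the helper name first_values is made up for this illustration.

```python
def first_values(row, k=4):
    """Return the first k non-missing values of a row, in column order.

    Missing entries are represented by None; the result is padded with
    None so that it always has exactly k elements.
    """
    present = [value for value in row if value is not None]
    return (present + [None] * k)[:k]


# Rows 1 and 2 of the sample data, columns INT1..INT4 then INTX1..INTX4.
row1 = [245005, None, None, None, None, 276790, None, None]
row2 = [280040, None, None, None, 275040, None, None, None]

print(first_values(row1))
print(first_values(row2))
```

For row 2 this keeps 280040 ahead of 275040 because INT1 comes before INTX1, whereas the sorted version above puts the smaller value first.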

Pretty print a space delimited file [duplicate]

This question already has answers here:
How can I format the output of a bash command in neat columns
(7 answers)
Closed 4 years ago.
I have a file that comes from R. It is basically the output of write.table command using as delimiter " ". An example of this file would look like this:
file1.txt
5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G
8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
What I need is a new file in a very specific format: all the columns must be aligned using spaces. I can't use tabs.
The desired output will be
5285   II-3   II-2   2  NA  NA  NA  NA  40  NA  NA  c.211A>G
8988   III-3  III-4  1  NA  NA  NA  NA  NA  NA  NA  c.211A>G
8F412  III-3  III-4  2  NA  NA  28  NA  NA  NA  NA  c.211A>G
4H644  III-3  III-4  2  NA  NA  NA  NA  NA  NA  NA  NA
Thus, in the first row there are three spaces between 5285 and II-3, while in the third row there are only two spaces between 8F412 and III-3. The lengths of the first three fields can vary; the remaining columns always have a fixed length (two characters), except the last one, which can be 12 characters.
I can do this in a text editor, but the file is very large, so I would like to do it using bash, awk, or R.
Use column:
$ column -t file
5285   II-3   II-2   2  NA  NA  NA  NA  40  NA  NA  c.211A>G
8988   III-3  III-4  1  NA  NA  NA  NA  NA  NA  NA  c.211A>G
8F412  III-3  III-4  2  NA  NA  28  NA  NA  NA  NA  c.211A>G
4H644  III-3  III-4  2  NA  NA  NA  NA  NA  NA  NA  NA
Use awk so that you have tight control on how you want to format each field:
awk '{ printf("%-5s %-5s %-5s %s %s %s %s %s %s %s %s %s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12) }' file
Produces:
5285  II-3  II-2  2 NA NA NA NA 40 NA NA c.211A>G
8988  III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G
8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G
4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA
Here is another approach:
$ tr ' ' '\t' <file | expand -t2
5285  II-3  II-2  2 NA  NA  NA  NA  40  NA  NA  c.211A>G
8988  III-3 III-4 1 NA  NA  NA  NA  NA  NA  NA  c.211A>G
8F412 III-3 III-4 2 NA  NA  28  NA  NA  NA  NA  c.211A>G
4H644 III-3 III-4 2 NA  NA  NA  NA  NA  NA  NA  NA
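For reference, the alignment that column -t performs can be sketched in a few lines of Python: measure the widest entry in each column, then left-justify every cell to that width. The helper name align_columns and the fixed two-space gap are assumptions made for this illustration.

```python
def align_columns(lines, gap=2):
    """Left-align whitespace-separated columns using spaces only."""
    rows = [line.split() for line in lines]
    ncols = max((len(row) for row in rows), default=0)
    # Width of a column is the length of its longest cell.
    widths = [
        max((len(row[i]) for row in rows if i < len(row)), default=0)
        for i in range(ncols)
    ]
    aligned = []
    for row in rows:
        cells = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        # rstrip() drops the padding that ljust adds after the last cell.
        aligned.append((" " * gap).join(cells).rstrip())
    return aligned


sample = [
    "5285 II-3 II-2 2 NA NA NA NA 40 NA NA c.211A>G",
    "8988 III-3 III-4 1 NA NA NA NA NA NA NA c.211A>G",
    "8F412 III-3 III-4 2 NA NA 28 NA NA NA NA c.211A>G",
    "4H644 III-3 III-4 2 NA NA NA NA NA NA NA NA",
]

aligned = align_columns(sample)
for line in aligned:
    print(line)
```

This reads the whole input into memory to find the column widths, so for a very large file the column or awk commands above are probably the better choice.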

understanding how to construct a higher order markov chain

Suppose I want to predict if a person is of class1=healthy or of class2= fever. I have a data set with the following domain: {normal,cold,dizzy}
The transition matrix would contain the probability of transition generated from our training dataset while the initial vector would contain the probability that a person starts(day1) with a state x from the domain {normal,cold,dizzy}, again this is also generated from our training set.
If I want to build a first order markov chain, I would generate a 3x3 transition matrix and a 1x3 initial vector per class like so:
> TransitionMatrix
normal cold dizzy
normal NA NA NA
cold NA NA NA
dizzy NA NA NA
>Initial Vector
normal cold dizzy
[1,] NA NA NA
The NA will be filled with the corresponding probabilities.
1-My question is about transition matrices in higher order chain. For example in second order MC would we have a transition matrix of size domain²xdomain² like so:
normal->normal normal->cold normal->dizzy cold->normal cold->cold cold->dizzy dizzy->normal dizzy->cold dizzy->dizzy
normal->normal NA NA NA NA NA NA NA NA NA
normal->cold NA NA NA NA NA NA NA NA NA
normal->dizzy NA NA NA NA NA NA NA NA NA
cold->normal NA NA NA NA NA NA NA NA NA
cold->cold NA NA NA NA NA NA NA NA NA
cold->dizzy NA NA NA NA NA NA NA NA NA
dizzy->normal NA NA NA NA NA NA NA NA NA
dizzy->cold NA NA NA NA NA NA NA NA NA
dizzy->dizzy NA NA NA NA NA NA NA NA NA
here the cell (1,1) represents the following sequence: normal->normal->normal->normal
or would it instead be just domain²xdomain like so:
normal cold dizzy
normal->normal NA NA NA
normal->cold NA NA NA
normal->dizzy NA NA NA
cold->normal NA NA NA
cold->cold NA NA NA
cold->dizzy NA NA NA
dizzy->normal NA NA NA
dizzy->cold NA NA NA
dizzy->dizzy NA NA NA
here the cell (1,1) represents normal->normal->normal which is different from the previous representation
2-What about the initial vector for a MC of degree 2. Would we need two initial vectors of size 1xdomain like so:
normal cold dizzy
[1,] NA NA NA
leading to two initial vectors per class: the first gives the probability of each of {normal,cold,dizzy} occurring on the first day for the healthy/fever class, while the second gives the probability of occurrence on the second day. This would give 4 initial vectors in total.
Or would we need just one initial vector of size 1xdomain², like so:
normal->normal normal->cold normal->dizzy cold->normal cold->cold cold->dizzy dizzy->normal dizzy->cold dizzy->dizzy
[1,] NA NA NA NA NA NA NA NA NA
I can see how the second way of representing the initial vector would be problematic in case we want to classify an observation with only one state.
Say the set of states is S. Typically, for an nth-order chain:
The transition matrix has dimensions |S|^n × |S| (9 × 3 in your example, i.e. your second layout). This is because, given the history of the last n states, we need the probability of the single next state. It is true that this next state induces another compound state of history n, but the transition itself is to the single next state. See, e.g., the example on Wikipedia.
The initial distribution is a distribution over |S|^n elements (your second option).
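To make those shapes concrete, here is a minimal Python sketch that estimates a second-order chain by counting. It is an illustration only: the function name fit_second_order and the three training sequences are invented, no smoothing is applied, and pairs never seen in training simply get an all-zero row.

```python
from collections import Counter, defaultdict
from itertools import product

STATES = ["normal", "cold", "dizzy"]


def fit_second_order(sequences, states=STATES):
    """Estimate a second-order Markov chain from observed state sequences.

    Returns (initial, transition) where
      initial[(s1, s2)]        = P(first two days are s1, s2)
      transition[(s1, s2)][s3] = P(next state is s3 | last two were s1, s2)
    """
    pair_counts = Counter()
    next_counts = defaultdict(Counter)
    for seq in sequences:
        if len(seq) < 2:
            continue  # a single observation carries no pair information
        pair_counts[(seq[0], seq[1])] += 1
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            next_counts[(a, b)][c] += 1

    pairs = list(product(states, repeat=2))  # |S|^2 compound states
    n_start = sum(pair_counts.values())
    initial = {p: (pair_counts[p] / n_start if n_start else 0.0) for p in pairs}
    transition = {}
    for p in pairs:
        total = sum(next_counts[p].values())
        transition[p] = {
            s: (next_counts[p][s] / total if total else 0.0) for s in states
        }
    return initial, transition


# Hypothetical training sequences for one class (not from the question).
train = [
    ["normal", "normal", "cold", "dizzy"],
    ["normal", "cold", "dizzy", "dizzy"],
    ["normal", "normal", "normal", "cold"],
]
initial, transition = fit_second_order(train)
print(len(transition), "rows x", len(STATES), "columns")
```

You would fit one such pair (initial, transition) per class. The concern about classifying an observation with only one state still applies: this sketch skips such sequences, so handling them would need a separate first-day distribution.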

Extract individual column from a HIVE table

Below is a select query from a HIVE table:
select * from test_aviation limit 5;
OK
2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10
2015 1 1 2 5 2015-01-02 AA 19805 AA N795AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0850 -10.00 0.00 0.00 -1 0900-0959 15.00 0905 1202 9.00 1230 1211 -19.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 381.00 357.00 1.00 2475.00 10
2015 1 1 3 6 2015-01-03 AA 19805 AA N788AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 15.00 0908 1138 13.00 1230 1151 -39.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 358.00 330.00 1.00 2475.00 10
2015 1 1 4 7 2015-01-04 AA 19805 AA N791AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 14.00 0907 1159 19.00 1230 1218 -12.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 385.00 352.00 1.00 2475.00 10
2015 1 1 5 1 2015-01-05 AA 19805 AA N783AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 27.00 0920 1158 24.00 1230 1222 -8.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 389.00 338.00 1.00 2475.00 10
Time taken: 0.067 seconds, Fetched: 5 row(s)
Structure of HIVE table
hive> describe test_aviation;
OK
col_value string
Time taken: 0.221 seconds, Fetched: 1 row(s)
I want to split the entire table into separate columns. I have written the query below to extract the 12th column:
SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 12) from test_aviation;
Output:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1437067221195_0008, Tracking URL = http://localhost:8088/proxy/application_1437067221195_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1437067221195_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-07-17 02:46:56,215 Stage-1 map = 0%, reduce = 0%
2015-07-17 02:47:27,650 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1437067221195_0008 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://localhost:8088/proxy/application_1437067221195_0008/
Examining task ID: task_1437067221195_0008_m_000000 (and more) from job job_1437067221195_0008
Task with the most failures(4):
-----
Task ID:
task_1437067221195_0008_m_000000
URL:
http://localhost:8088/taskdetails.jsp?jobid=job_1437067221195_0008&tipid=task_1437067221195_0008_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract#4def4616 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10 :java.lang.String, ^(?:([^,]*),?){1}:java.lang.String, 12:java.lang.Integer} of size 3
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1243)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:166)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:79)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1219)
... 18 more
Caused by: java.lang.IndexOutOfBoundsException: No group 12
at java.util.regex.Matcher.group(Matcher.java:487)
at org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(UDFRegExpExtract.java:56)
... 23 more
Please help me to extract different columns from a HIVE table.
Try this:
select split(col_value,'\\t')[11] as column_12 from test_aviation;
The rows in your error log are tab-separated, hence '\\t'. For other delimiters, change the second argument:
' ' for a space
'\\|' for a pipe
':' for a colon
and so on
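The "No group 12" error and the split-based fix can both be reproduced outside Hive. The sketch below uses Python's re module as a stand-in for the Java regex engine that Hive calls; the two agree on how capture groups are counted, but treat it as an analogy rather than a statement about Hive internals. The row is a shortened copy of the one in the error log, and nth_field is a made-up helper name.

```python
import re

# First tab-separated fields of the row shown in the error log above.
ROW = "2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York"

# The pattern from the question contains exactly one capture group, so
# asking for group 12 cannot succeed whatever the row contains.
PATTERN = re.compile(r"^(?:([^,]*),?){1}")


def nth_field(row, n, sep="\t"):
    """Return the n-th field (1-based) of a delimited row."""
    return row.split(sep)[n - 1]


print(PATTERN.groups)       # number of capture groups in the pattern
print(nth_field(ROW, 12))   # same idea as split(col_value, '\\t')[11]
```

The third argument of regexp_extract is a capture-group index, not a column number, which is why splitting on the delimiter is the simpler route here.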

Resources