R crash on write.csv() for a data.table on Windows [closed]

Referring to the question Crashing R when calling `write.table` on particular data set, I can almost "reliably" crash 64-bit R --vanilla on 64-bit Windows by saving a large data.table in one session. I say "almost" because once (while demonstrating the crash to a colleague in IT!) I instead got the message
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
'getCharCE' must be called on a CHARSXP
referenced in the above question.
To crash R I just need to
save(DT, file="datatablefile.RData")
and then in another R session (which could be --vanilla) I just say...
load("datatablefile.RData")
write.csv(DT, file='datatablefile.csv')
which will then crash after a minute or two. Note in particular that it will NOT crash if I say
load("datatablefile.RData")
library(data.table)
write.csv(DT, file='datatablefile.csv')
When I say something like
library(data.table)
N <- 1000
DT <- data.table(id=1:N, name=sample(letters, N, replace=TRUE))
save(DT, file='dttest.RData')
and then in another session
load('dttest.RData')
write.csv(DT, 'dttest.csv')
I don't get a crash...
There was a suggestion it might be linked to rbindlist(), so I tried
library(data.table)
N <- 10000000
DT1 <- data.table(id=1:N, name=sample(letters, N, replace=TRUE))
DT2 <- data.table(id=1:N, name=sample(letters, N, replace=TRUE))
DT <- rbindlist(list(DT1, DT2))
save(DT, file='dttest.RData')
Note that I have tried this with N <- 10000000 on this 32 GB machine and it still works fine...
It has also been suggested that the crash might be due to factors, so I tried:
library(data.table)
N <- 10000000
DT1 <- data.table(id=1:N, name=sample(letters, N, replace=TRUE),
                  code=as.factor(sample(letters[1:5], N, replace=TRUE)))
DT2 <- data.table(id=1:N, name=sample(letters, N, replace=TRUE),
                  code=as.factor(sample(letters[1:5], N, replace=TRUE)))
DT <- rbindlist(list(DT1, DT2))
save(DT, file='dttest.RData')
str(DT)
Classes ‘data.table’ and 'data.frame': 20000000 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ name: chr "v" "u" "t" "z" ...
$ code: Factor w/ 5 levels "a","b","c","d",..: 2 5 4 2 2 1 2 3 2 4 ...
- attr(*, ".internal.selfref")=<externalptr>
Then in the other session
> load('dttest.RData')
> tables()
Error: could not find function "tables"
> str(DT)
Classes ‘data.table’ and 'data.frame': 20000000 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ name: chr "v" "u" "t" "z" ...
$ code: Factor w/ 5 levels "a","b","c","d",..: 2 5 4 2 2 1 2 3 2 4 ...
- attr(*, ".internal.selfref")=<externalptr>
> write.csv(DT, 'dttest.csv')
which then works fine...
It seems fine when I write a large data.table containing chr, num, or Date columns, but it seems to fail when the table contains factors...
Any suggestions as to how I might create a reliable way to replicate this? The contents of the tables themselves are highly confidential.
Update
I've just tried doing
setkey(DT,id)
but it didn't cause a crash.

Related

What could be the algorithm for ordered 9 digit problem? [closed]

Given any three-digit number N as input, we may only use the digits 1 to 9, in order, with no digit repeated.
For example, if N = 150:
123 + 4 + 5 - 6 + 7 + 8 + 9 = 150
We can concatenate digits and insert '-' and '+' operations to reach the desired value N.
Line up the numbers: 1 2 3 4 5 6 7 8 9.
There are 8 spaces between these numbers. Each space could be a '+' or a '-' or a blank (joining the digits together).
Thus, there are 3^8 = 6561 different possible combinations of operators you could use.
That's small enough to just try all of them in a loop and check which one works.
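For instance, a direct brute force over those 6561 operator choices might look like this (a minimal sketch; eval is safe here only because the expression contains nothing but digits, '+' and '-'):
from itertools import product

N = 150
digits = "123456789"

# Fill each of the 8 gaps with '+', '-', or '' (concatenation).
for ops in product(["+", "-", ""], repeat=len(digits) - 1):
    expr = digits[0] + "".join(op + d for op, d in zip(ops, digits[1:]))
    if eval(expr) == N:
        print(N, "=", expr)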
You can also do it recursively, like this (in Python):
N = 150
digit_from, digit_to = 1, 9  # use the digits 1 to 9

def find(N, pos, equation, num, coff):
    # num is the number currently being built, coff is its sign (+1 or -1)
    if pos > digit_to:
        if N - coff * num == 0:  # the final number balances the equation
            print(equation)
    else:
        find(N - coff * num, pos + 1, equation + '+' + str(pos), pos, 1)   # plus
        find(N - coff * num, pos + 1, equation + '-' + str(pos), pos, -1)  # minus
        find(N, pos + 1, equation + str(pos), num * 10 + pos, coff)        # blank
    return

find(N, digit_from + 1, str(N) + ' = ' + str(digit_from), digit_from, 1)
Result (for N = 150 and digits 1 to 9):
150 = 1+23+45-6+78+9
150 = 1+234+5+6-7-89
150 = 1-2+3-4+56+7+89
150 = 12+3+45-6+7+89
150 = 123+4+5-6+7+8+9
150 = 123+45+6-7-8-9
150 = 123-4-56+78+9

Need help on a problemset in a programming contest [closed]

I attended a local programming contest in my country. The name of the contest is "ACM-ICPC Indonesia National Contest 2013".
The contest ended on 2013-10-13 15:00:00 (GMT+7) and I am still curious about one of the problems.
You can find the original version of the problem here.
Brief Problem Explanation:
There is a set of "jobs" (tasks) that should be performed on several "servers" (computers).
Each job must be executed strictly from its start time S(i) to its end time E(i).
Each server can only perform one job at a time.
(Here is the complicated part.) It takes some time for a server to switch from one job to another.
If a server finishes job J(x), then to start job J(y) it needs an intermission time T(x,y) after J(x) completes. This is the time the server requires to clean up job J(x) and load job J(y).
In other words, job J(y) can run after job J(x) on the same server if and only if E(x) + T(x,y) ≤ S(y).
The problem is to compute the minimum number of servers needed to perform all the jobs.
Example:
For example, let there be 3 jobs
S(1) = 3 and E(1) = 6
S(2) = 10 and E(2) = 15
S(3) = 16 and E(3) = 20
T(1,2) = 2, T(1,3) = 5
T(2,1) = 0, T(2,3) = 3
T(3,1) = 0, T(3,2) = 0
In this example, we need 2 servers: J(1) can be followed by J(2) on the same server because E(1) + T(1,2) = 6 + 2 = 8 ≤ S(2) = 10, but J(2) cannot be followed by J(3) because E(2) + T(2,3) = 15 + 3 = 18 > S(3) = 16. One valid assignment:
Server 1: J(1), J(2)
Server 2: J(3)
Sample Input:
Short explanation: the first 3 is the number of test cases, followed by the number of jobs (the second 3 means there are 3 jobs in case 1), then S(i) and E(i) for each job, then the T matrix (whose size equals the number of jobs).
3
3
3 6
10 15
16 20
0 2 5
0 0 3
0 0 0
4
8 10
4 7
12 15
1 4
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
4
8 10
4 7
12 15
1 4
0 50 50 50
50 0 50 50
50 50 0 50
50 50 50 0
Sample Output:
Case #1: 2
Case #2: 1
Case #3: 4
Personal Comments:
The switch times can be represented as a graph adjacency matrix, so I suppose this is a directed acyclic graph problem.
The methods I have tried so far are brute force and greedy, but both got Wrong Answer. (Unfortunately I no longer have my code.)
It could probably be solved by dynamic programming too, but I'm not sure.
I really have no clear idea how to solve this problem, so even a simple hint or insight would be very helpful.
You can solve this by computing the maximum matching in a bipartite graph.
The idea is you are trying to match job end times with job start times.
A matched end time of job x with the start time of job y means that the same server will run job x and then job y.
The number of servers you need will correspond to the number of unmatched start times (because each of these jobs will require a new server).
Example Python code using NetworkX:
import networkx as nx

G = nx.DiGraph()
S = [3, 10, 16]  # start times
E = [6, 15, 20]  # end times
T = [[0, 2, 5],
     [0, 0, 3],
     [0, 0, 0]]  # T[x][y] = switch time from job x to job y
N = len(S)
for jobx in range(N):
    G.add_edge('start', 'end' + str(jobx), capacity=1)
    G.add_edge('start' + str(jobx), 'end', capacity=1)
    for joby in range(N):
        if E[jobx] + T[jobx][joby] <= S[joby]:
            G.add_edge('end' + str(jobx), 'start' + str(joby), capacity=1)

# servers needed = jobs - matched pairs; in NetworkX versions before 2.0
# this function was called nx.max_flow
print(N - nx.maximum_flow_value(G, 'start', 'end'))

Transposing a column with a line in a matrix [closed]

We have this 4x4 matrix:
a b c d
e f g h
1 2 3 4
5 6 7 8
By transposing the matrix we get:
a e 1 5
b f 2 6
c g 3 7
d h 4 8
My question is:
What matrix do we get by "transposing column 2 with row 4"?
I need to understand the operation itself: what does it imply/mean? I had never thought of "transposing a column with a line".
AFAIK, it means you are to swap column 2 and row 4, instead of swapping column 1 with row 1, column 2 with row 2, and so on.
The code is basically the same as for a full transposition, except you only handle one column/row pair.
Matrix transposition is a mathematical operation in which a matrix's rows become its columns. From a mathematical perspective there is no real benefit to transposing only one row of an M x N matrix, but the code to do it is not much different from transposing the entire matrix.
Swapping column 2 with row 4 element by element means swapping A(i,2) with A(4,i) for each i. Note that the crossing cell A(4,2) takes part in two of those swaps, so the result at that corner depends on the order in which the swaps are performed; doing them in order i = 1..4 gives:
a 5 c d
e 6 g h
1 7 3 4
b 8 2 f
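A minimal sketch of that element-by-element swap in Python (the function name and the 0-based indices are my own illustration, not from the original question):
def swap_col_with_row(A, col, row):
    """Swap column `col` with row `row` in place, one element at a time.

    The crossing cell A[row][col] takes part in two swaps, so the result
    at that corner depends on the order of the swaps.
    """
    for i in range(len(A)):
        A[i][col], A[row][i] = A[row][i], A[i][col]

A = [list("abcd"), list("efgh"), list("1234"), list("5678")]
swap_col_with_row(A, col=1, row=3)  # column 2, row 4 in 1-based terms
for r in A:
    print(" ".join(r))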

Linear time complexity ranking algorithm when the orders are precomputed

I am trying to write an efficient ranking algorithm in C++ but I will present my case in R as it is far easier to understand this way.
> samples_x <- c(4, 10, 9, 2, NA, 3, 7, 1, NA, 8)
> samples_y <- c(5, 7, 9, NA, 1, 4, NA, 8, 2, 10)
> orders_x <- order(samples_x)
> orders_y <- order(samples_y)
> cbind(samples_x, orders_x, samples_y, orders_y)
samples_x orders_x samples_y orders_y
[1,] 4 8 5 5
[2,] 10 4 7 9
[3,] 9 6 9 6
[4,] 2 1 NA 1
[5,] NA 7 1 2
[6,] 3 10 4 8
[7,] 7 3 NA 3
[8,] 1 2 8 10
[9,] NA 5 2 4
[10,] 8 9 10 7
Suppose the above is already precomputed. Performing a simple ranking of each sample set then takes linear time (the result is much like that of the rank function):
> ranks_x <- rep(0, length(samples_x))
> for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
For a work project I am working on, it would be useful for me to emulate the following behaviour in linear time complexity:
> cc <- complete.cases(samples_x, samples_y)
> ranks_x <- rank(samples_x[cc])
> ranks_y <- rank(samples_y[cc])
The complete.cases function, when given n sets of the same length, returns a logical vector marking the positions at which none of the sets contain NAs. The order function returns the permutation of indices corresponding to the sorted sample set. The rank function returns the ranks of the sample set.
How to do this? Let me know if I have provided sufficient information as to the problem in question.
More specifically, I am trying to build a correlation matrix based on Spearman's rank correlation coefficient in a way that handles NAs properly. The presence of NAs requires that the rankings be recalculated for every pairwise combination of sample sets (O(s^2 n log n) for s sample sets of length n); I am trying to avoid that by calculating the orders once for every sample set (O(s n log n)) and using a linear-time pass for every pairwise comparison. Is this even doable?
Thanks in advance.
It looks like, when you work out the rank correlation of two arrays, you want to delete from both arrays elements in positions where either has NA.
You have
for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
Could you change this to something like the following? (Note that `x == NA` yields NA in R, so the test must use is.na().)
wp <- 0
for (i in 1:length(samples_x)) {
  if (is.na(samples_x[orders_x[i]]) || is.na(samples_y[orders_x[i]])) {
    ranks_x[orders_x[i]] <- NA
  } else {
    wp <- wp + 1
    ranks_x[orders_x[i]] <- wp
  }
}
Then you could either go along later and compress out the NAs, or hope the correlation subroutine simply ignores them.
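The same single-pass idea, sketched in Python for clarity (the function name and the NaN handling are my own illustration, not from the original post):
import math

def complete_case_ranks(x, y, order_x):
    """Rank x over the complete cases only, given a precomputed sort order.

    order_x lists the indices that sort x ascending (NaNs last); one linear
    pass assigns ranks 1, 2, ... to positions where both x and y are
    observed, and NaN elsewhere.
    """
    ranks = [math.nan] * len(x)
    next_rank = 0
    for idx in order_x:
        if math.isnan(x[idx]) or math.isnan(y[idx]):
            continue  # incomplete case: leave its rank as NaN
        next_rank += 1
        ranks[idx] = next_rank
    return ranks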

matlab for loop: fastest and most efficient method to reproduce large matrix

My data is a 2096x252 matrix of double values. I need a for loop or an equivalent which performs the following:
Each time the matrix is appended to the output, its first row is deleted, so the second row becomes the first. When the loop runs again, the remaining matrix is appended, its first row is deleted, the next becomes the first, and so on.
I've tried using repmat, but it is too slow and tedious when dealing with large matrices (2096x252).
Example input:
1 2 3 4
3 4 5 6
3 5 7 5
9 6 3 2
Desired output:
1 2 3 4
3 4 5 6
3 5 7 5
9 6 3 2
3 4 5 6
3 5 7 5
9 6 3 2
3 5 7 5
9 6 3 2
9 6 3 2
Generally with Matlab it is much faster to pre-allocate a large array than to build it incrementally. When you know in advance the final size of the large array there's no reason not to follow this general advice.
Something like the following should do what you want. Suppose you have an array in(nrows, ncols); then
indices = [0 nrows:-1:1];              % block ix spans nrows-ix+1 rows
out = zeros(sum(indices), ncols);      % pre-allocate the full result
for ix = 1:nrows
    out(1+sum(indices(1:ix)):sum(indices(1:ix+1)), :) = in(ix:end, :);
end
This worked on your small test input. I expect you can figure out what is going on.
Whether it is the fastest of all possible approaches I don't know, but I expect it to be much faster than building a large matrix incrementally.
Disclaimer:
You'll probably have memory issues with large matrices, but that is not the question.
Now, to the business:
For a given matrix A, the straightforward approach with the for loop would be:
[N, M] = size(A);
B = zeros(sum(1:N), M);                % pre-allocate: N + (N-1) + ... + 1 rows
offset = 1;
for i = 1:N
    B(offset:offset + N - i, :) = A(i:end, :);
    offset = offset + (N - i + 1);     % advance past the block just written
end
B is the desired output matrix.
However, this solution is expected to be slow as well, because of the for loop.
Edit: preallocated B instead of dynamically changing size (this optimization should achieve a slight speedup).
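For comparison, the same stacking can be built in one vectorized step with NumPy fancy indexing (a sketch outside MATLAB, assuming a Python environment is an option; the index construction is my own illustration):
import numpy as np

A = np.array([[1, 2, 3, 4],
              [3, 4, 5, 6],
              [3, 5, 7, 5],
              [9, 6, 3, 2]])
n = A.shape[0]

# Row indices of the stacked result: 0..n-1, then 1..n-1, then 2..n-1, ...
idx = np.concatenate([np.arange(i, n) for i in range(n)])
B = A[idx]  # fancy indexing copies all selected rows in one shot
print(B)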
