dataset = dataset.batch(50)
dataset = dataset.prefetch(buffer_size=1)
Does it prefetch 1 batch or 1 element?
According to the TensorFlow API documentation, buffer_size is the maximum number of elements to prefetch. But it seems to mean the number of batches once the dataset has been batched.
Since you are calling dataset.prefetch(buffer_size=1) after dataset.batch(), each element of the dataset is one batch, so it will prefetch 1 batch.
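As a minimal sketch (assuming TF 2.x and the tf.data API; the range dataset is only illustrative), you can see that after batch() each element of the dataset is itself a batch:
import tensorflow as tf

dataset = tf.data.Dataset.range(200)
dataset = dataset.batch(50)    # each element is now a tensor of up to 50 values
dataset = dataset.prefetch(1)  # buffers 1 element, i.e. 1 batch of 50
print(dataset.element_spec)    # TensorSpec(shape=(None,), dtype=tf.int64, name=None)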
I am working with a data frame of 100m rows that I would like to partition into 100 Parquet files of 1m rows each. I do not want to partition on any particular column value: I just want 100 chunks of 1m rows.
I know that this is possible by adding a "dummy" column, and passing that to partition_cols:
import numpy as np

data_size = len(data)
partition_size = 1_000_000
n_partitions, remainder = divmod(data_size, partition_size)
data["partition_id"] = np.concatenate([
    np.repeat(list(range(n_partitions)), partition_size),
    np.repeat(n_partitions, remainder),  # leftover rows go into one final partition
])
data.to_parquet("out", partition_cols=["partition_id"])
But it feels wasteful to write an extra 100m 64-bit integers!
Parquet files are also typically compressed, very often using the Snappy algorithm (occasionally GZip or Brotli). And these are long runs of identical integers, so in principle they should compress extremely well.
However, I don't know how the Parquet file format and underlying Arrow array format interact with various compression algorithms. Assuming that I'm using Snappy, will my millions of extra integers be compressed to a handful of bytes? Or will this partition_id column actually inflate the size of my dataset by some appreciable amount?
To answer your actual question: yes, as there are only 100 distinct values they compress very well; Parquet dictionary-encodes such a column and run-length-encodes the repeated values before any Snappy compression is even applied. As you can see below, there is no significant influence on the overall size. Partitioning introduces some overhead, of course, but that happens with and without the ids.
(This is R, as I am more familiar with it, but it uses the same C++ Arrow backend, so the results are the same.)
library(arrow)
library(fs)
# create some dummy data
n <- 100000000
data <- data.frame(a = rnorm(n), b = rnorm(n))
partition_id <- rep(1:100, each = 1000000)
data_ids <- cbind(data, partition_id)
# Save data
file <- file_temp("data", ext = ".parquet")
file_with_id <- file_temp("data", ext = ".parquet")
path <- path_temp("data_partitioned")
path_nrow <- path_temp("data_partitioned_nrow")
write_parquet(data, file, compression = "snappy")
write_parquet(data_ids, file_with_id, compression = "snappy")
write_dataset(data_ids, path, format = "parquet", partitioning = "partition_id")
write_dataset(data, path_nrow, format = "parquet", max_rows_per_file = 1000000)
file_size(file)
# 1.49G
file_size(file_with_id)
# 1.49G
dir_ls(path, recurse = TRUE) |> file_size() |> sum()
# 1.84G
dir_ls(path_nrow, recurse = TRUE) |> file_size() |> sum()
# 1.84G
pyarrow's write_dataset has the following parameter which should solve your issue without adding a partition_id column:
max_rows_per_file int, default 0
Maximum number of rows per file. If greater than 0 then this will limit how many rows are placed in any single file. Otherwise there will be no limit and one file will be created in each output directory unless files need to be closed to respect max_open_files.
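For example, a minimal sketch (assuming pyarrow >= 6.0 and that data is the pandas DataFrame from the question; max_rows_per_group is set explicitly here because it must not exceed max_rows_per_file):
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(data)
ds.write_dataset(
    table,
    "out",                          # output directory
    format="parquet",
    max_rows_per_file=1_000_000,    # at most 1M rows per file
    max_rows_per_group=1_000_000,   # must be <= max_rows_per_file
)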
Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?
Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:
file = File.open "something.bin", "rb"
The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):
data = Bytes.new(0)
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk
end
The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:
data = [] of UInt8
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk.to_a
end
However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.
So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?
One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
The best solution would probably be to use a buffer that is large enough to fit all data that could possibly be read - or at least be very likely to (resize if necessary).
If the maximum size is just 3 * 255 bytes this is a no-brainer. You can size down at the end if the buffer is too large.
data = Bytes.new 3 * UInt8::MAX
bytes_read = 0
3.times do
  bytes_to_read = file.read_byte.not_nil!
  file.read_fully(data[bytes_read, bytes_to_read])  # read exactly bytes_to_read bytes into the buffer at the current offset
  bytes_read += bytes_to_read
end
# resize to actual size at the end:
data = data[0, bytes_read]
Note: As the data format tells you how many bytes to read, you should use read_fully instead of read, which would silently return fewer bytes if less data happens to be available.
EDIT: Since the number of chunks and thus the maximum size is not known in advance (per comment), you should use a dynamically resizing buffer. This can be easily implemented using IO::Memory, which will take care of resizing the buffer accordingly if necessary.
io = IO::Memory.new
loop do
  bytes_to_read = file.read_byte
  break if bytes_to_read.nil?
  IO.copy(file, io, bytes_to_read)
end
data = io.to_slice
I have the following code:
import random
lst = []
for i in range(100):
    lst.append(random.randint(1, 10))
print(lst)
buffer = []
# This is the piece of code which I am interested in converting into TensorFlow.
for a in lst:
    buffer.append(a)
    if len(buffer) > 5:
        buffer.pop(0)
    if len(buffer) == 5:
        print(buffer)
So, from the code, I need to create a buffer (which could be a variable in TensorFlow). This buffer should hold the features extracted from the last conv layer. The variable will be an input to an RNN in my case.
The advantage of this approach is that when we have large images and need to feed an RNN with (batch of images) * (sequence length) * (size of 1 image), a very big batch of images would have to be loaded into main memory. With the code above, on the other hand, we feed 1 image at a time, using TensorFlow Datasets, an input queue, or any other alternative. As a result, we only store in memory features of size batch_size * sequence_length * feature_space. In addition, we can say:
if len(buffer) == n:
    # empty out the buffer after using its elements
    buffer = []  # or any other alternative way
I am aware that I can feed my network batches of images, but I need to implement the code above, based on some literature.
Any help is much appreciated!!
I tried to recreate your output using tf.FIFOQueue (https://www.tensorflow.org/api_docs/python/tf/FIFOQueue). I have given my code below, with comments where necessary.
import random
import numpy as np
import tensorflow as tf

BATCH_SIZE = 20
lst = []
for i in range(BATCH_SIZE):
    lst.append(random.randint(1, 10))
print(lst)

curr_data = np.reshape(lst, (BATCH_SIZE, 1))  # reshape so the data has shape [BATCH_SIZE, 1]
# queue starts here
queue_input_data = tf.placeholder(tf.int32, shape=[1])  # placeholder used to feed the data
queue = tf.FIFOQueue(capacity=50, dtypes=[tf.int32], shapes=[1])  # the queue is defined here
enqueue_op = queue.enqueue([queue_input_data])  # enqueue operation
len_op = queue.size()  # check the queue size

# check the length of the queue and dequeue one element if it is greater than 5
dequeue_one = tf.cond(tf.greater(len_op, 5), lambda: queue.dequeue(), lambda: 0)
# check the length of the queue and dequeue five elements if it equals 5
dequeue_many = tf.cond(tf.equal(len_op, 5), lambda: queue.dequeue_many(5), lambda: 0)
with tf.Session() as session:
    for i in range(BATCH_SIZE):
        _ = session.run(enqueue_op, feed_dict={queue_input_data: curr_data[i]})  # enqueue one element each iteration
        len = session.run(len_op)  # check the length of the queue
        print(len)
        element = session.run(dequeue_one)  # dequeue the first element
        print(element)
However, the following two problems are associated with the above code:
Only the dequeue-one and dequeue-many operations are available, and you cannot see the elements inside the queue (I don't think you will need this, since you are looking for something like a pipeline).
I think tf.cond is the only way to implement a conditional operation here (I couldn't find any other suitable function). However, since it is similar to an if-then-else statement, it is mandatory to also define an operation for when the condition is false (not just an if without an else). Since TensorFlow is all about building a graph, I think it is necessary to include both branches (for when the condition is true and when it is false).
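For illustration, a minimal sketch (TF 1.x graph mode; the placeholder and constants are made up for this example) of tf.cond with both branches supplied:
import tensorflow as tf

x = tf.placeholder(tf.int32, shape=[])
# both branches are required and must return tensors of the same dtype
result = tf.cond(tf.greater(x, 5),
                 lambda: tf.constant(1),   # taken when x > 5
                 lambda: tf.constant(0))   # taken otherwise

with tf.Session() as session:
    print(session.run(result, feed_dict={x: 7}))  # 1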
Moreover, a good explanation for Tensorflow input pipelines can be found here (http://ischlag.github.io/2016/11/07/tensorflow-input-pipeline-for-large-datasets/).
Hope this helps.
I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the factor of 4 in the 'position_data' variable is because I am working with Float32, which occupies 4 bytes per value. Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, let's say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now, is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task. However, I am not entirely sure how to proceed. I know I need to start by opening up a stream to the data; however I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to only modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works.
The reference for open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open, and it explains the various modes. Preferably, open shouldn't be used with the original mode-string parameter at all, as it is somewhat confusing and definitely slower than the boolean-argument form.
You can use SharedArrays for the need you describe:
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
# do something with data
data[:, 1] = data[:, 1] .+ 1
exit()
# restart julia
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
@show data[1, 1]
# prints 1
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).
I would like to compute a running maximum in Stata.
I think I am quite close:
gen ctrhigh`iv' = max(ctr, L1.ctr, L2.ctr, L3.ctr, ..., L`iv'.ctr)
As you can see, my data are time series, and `iv' represents the window (e.g. 5, 10 or 200 days).
The only problem is that you cannot pass a varlist or string containing numbers to max. E.g. the following is not possible:
local ivs 5 10 50 100 200
foreach iv in `ivs' {
    local vals
    local i = 1
    while (`i' <= `iv') {
        local vals "`vals' `i'"
        local ++i
    }
    gen ctrhigh`iv' = max(varlist vals) // not possible
}
How would I achieve this instead?
Example of quickly computing a running standard deviation
* standard deviation of ctr, see http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods *
gen ctr_sq = ctr^2
by tid: gen ctr_cum = sum(ctr) if !missing(ctr)
by tid: gen ctr_sq_cum = sum(ctr_sq) if !missing(ctr_sq)
foreach iv in $ivs {
    if `iv' == 1 continue
    by tid: gen ctr_sum = ctr_cum - L`iv'.ctr_cum if !missing(ctr_cum) & !missing(L`iv'.ctr_cum)
    by tid: gen ctr_sq_sum = ctr_sq_cum - L`iv'.ctr_sq_cum if !missing(ctr_sq_cum) & !missing(L`iv'.ctr_sq_cum)
    by tid: gen ctrsd`iv' = sqrt((`iv' * ctr_sq_sum - ctr_sum^2) / (`iv'*(`iv'-1))) if !missing(ctr_sq_sum) & !missing(ctr_sum)
    label variable ctrsd`iv' "Rolling std dev of close ticker rank by `iv' days."
    drop ctr_sum ctr_sq_sum
}
drop ctr_sq ctr_cum ctr_sq_cum
Note: this is not an exact sd, it's an approximation. I realize that this is very different from a maximum, but this may serve as an illustration on how to deal with large data computations.
Your example is time series data and implies that you have tsset the data. You don't say whether you also have panel or longitudinal structure. I will assume the worst and assume the latter as it doesn't make the code much worse. So, suppose tsset id date. In fact, that's irrelevant to the code here except to make explicit my assumption that id is an identifier and date a time variable.
An unattractive way to do this is to loop over observations. Suppose window is set to 42.
local window = 42
gen max = .
tsset id date
quietly forval i = 1/`=_N' {
    su ctr if inrange(date, date[`i'] - `window', date[`i']) & id == id[`i'], meanonly
    replace max = r(max) in `i'
}
So, in words as well: summarize values of ctr if date within window and it's in the same panel (same id), and put the maximum in the current observation.
The meanonly option is not well named. It calculates some other quantities besides the mean, and the maximum is one. But you do want the meanonly option to make summarize go as fast as possible.
See my 2007 paper on events in intervals, freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0033
I say unattractive, but this approach does have the advantage that it is easy to work with once you understand it.
I am not setting up an expression with lots of arguments to max(). You said 200 as an example, and nothing says you might not ask for more; so far as I can see there may be no upper limit on window length, but there will be a limit on how complicated that expression can be.
If I think of a better way to do it, I'll post it. Or someone else will....
It seems like I can pass a string of arguments to max, like so:
* OPTION 1: compute running max by days *
foreach iv in $ivs {
    * does not make sense for less than two days *
    if `iv' < 2 continue
    di "computing running max for ctr interval `iv'"
    * set high for this amount of days *
    local vars "ctr"
    forval i = 1 / `iv' {
        local vars "`vars', L`i'.ctr"
    }
    by tid: gen ctrh`iv' = max(`vars')
}
* OPTION 2: compute running max by days, ensuring that entire range is nonmissing *
foreach iv in $ivs {
    * does not make sense for less than two days *
    if `iv' < 2 continue
    di "computing running max for ctr interval `iv'"
    * set high for this amount of days *
    local vars "ctr"
    local condition "!missing(ctr)"
    forval i = 1 / `iv' {
        local vars "`vars', L`i'.ctr"
        local condition "`condition' & !missing(L`i'.ctr)"
    }
    by tid: gen ctrh`iv' = max(`vars') if `condition'
}
This computes very quickly and does exactly what I need.
However, if you need an arbitrarily large window I think you should resort to Nick's answer.