Snappy compression and runs of repeated values in a Parquet column

I am working with a data frame of 100m rows, that I would like to partition into 100 Parquet files of 1m rows each. I do not want to partition on any particular column value: I just want 100 chunks of 1m rows.
I know that this is possible by adding a "dummy" column, and passing that to partition_cols:
import numpy as np  # data is assumed to be an existing pandas DataFrame

data_size = len(data)
partition_size = 1_000_000
n_partitions, remainder = divmod(data_size, partition_size)
data["partition_id"] = np.concatenate([
    np.repeat(np.arange(n_partitions), partition_size),
    np.repeat(n_partitions, remainder),  # any leftover rows become one final, smaller partition
])
data.to_parquet("out", partition_cols=["partition_id"])
But it feels wasteful to write an extra 100m 64-bit integers!
Parquet files are also typically compressed, very often using the Snappy algorithm (occasionally GZip or Brotli). And these are long runs of identical integers, so in principle they should compress extremely well.
However, I don't know how the Parquet file format and underlying Arrow array format interact with various compression algorithms. Assuming that I'm using Snappy, will my millions of extra integers be compressed to a handful of bytes? Or will this partition_id column actually inflate the size of my dataset by some appreciable amount?

To answer your actual question: yes. With only 100 distinct values the column compresses very well, and as you can see below it has no significant influence on the overall size. (Parquet typically dictionary-encodes such a column and run-length-encodes the result before Snappy is even applied, so long runs of identical values shrink to almost nothing.) Partitioning itself introduces some overhead, of course, but that happens with and without the id column.
(The example below is in R because I am more familiar with it, but it uses the same C++ Arrow backend, so the results are the same.)
library(arrow)
library(fs)
# create some dummy data
n <- 100000000
data <- data.frame(a = rnorm(n), b = rnorm(n))
partition_id <- rep(1:100, each = 1000000)
data_ids <- cbind(data, partition_id)
# Save data
file <- file_temp("data", ext = ".parquet")
file_with_id <- file_temp("data", ext = ".parquet")
just_ids <- file_temp("ids", ext = ".parquet")
path <- path_temp("data_partitioned")
path_nrow <- path_temp("data_partitioned_nrow")
write_parquet(data, file, compression = "snappy")
write_parquet(data.frame(rep(1, 1000000)), just_ids, compression = "snappy")
write_parquet(data_ids, file_with_id, compression = "snappy")
write_dataset(data_ids, path, format = "parquet", partitioning = "partition_id")
write_dataset(data, path_nrow, format = "parquet", max_rows_per_file = 1000000)
file_size(file)
# 1.49G
file_size(file_with_id)
# 1.49G
dir_ls(path, recurse = TRUE) |> file_size() |> sum()
# 1.84G
dir_ls(path_nrow, recurse = TRUE) |> file_size() |> sum()
# 1.84G
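The same comparison can be sketched with pandas and pyarrow (a rough sketch, assuming pyarrow is installed; a smaller n keeps the run quick, so the absolute sizes differ from the R run above, but the relative effect is the same):

import os
import numpy as np
import pandas as pd

n = 10_000_000
df = pd.DataFrame({"a": np.random.randn(n), "b": np.random.randn(n)})
df.to_parquet("data.parquet", compression="snappy")

df["partition_id"] = np.repeat(np.arange(10), n // 10)  # 10 chunks of 1m rows
df.to_parquet("data_with_id.parquet", compression="snappy")

print(os.path.getsize("data.parquet"))          # baseline size
print(os.path.getsize("data_with_id.parquet"))  # nearly identical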

pyarrow's write_dataset has the following parameter, which should solve your issue without adding a partition_id column:
max_rows_per_file (int, default 0)
Maximum number of rows per file. If greater than 0 then this will limit how many rows are placed in any single file. Otherwise there will be no limit and one file will be created in each output directory unless files need to be closed to respect max_open_files.
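For example, a minimal sketch using pyarrow.dataset (assuming data is the pandas DataFrame from the question):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(data)
ds.write_dataset(
    table,
    "out",
    format="parquet",
    max_rows_per_file=1_000_000,
    max_rows_per_group=1_000_000,  # row groups may not exceed the per-file limit
)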

Related

Quantstrat applyStrategy incorrect dimensions trying to work with manual mktdata OHLCV data vs getSymbols

I apologize for not having a working example at the moment.
All I really need is a sample format for how to load multiple symbols from a CSV.
The documentation for applyStrategy
https://www.rdocumentation.org/packages/quantstrat/versions/0.16.7/topics/applyStrategy
describes the mktdata argument as:
"an xts object containing market data. depending on indicators, may need to be in OHLCV or BBO formats, default NULL"
The reason I don't wish to use getSymbols is that I do some preprocessing and load the data from CSVs, because my internet is shoddy. I do download data, but only about once a week. My preprocessing produces different symbols from a subset of 400 symbols based on the time periods I scan. I'm trying to front-load all of my download processing, and no matter what I try, I can't get applyStrategy to load from either a data frame or an xts object. Right now I'm converting from CSV to data frame to xts and attempting to load that.
I have noticed that my xts objects differ from the ones getSymbols creates (hence the error about incorrect dimensions). Specifically, if I call colnames, mine reports none, whereas the getSymbols sub-elements list 6 columns.
What I would like is a minimal example of loading custom OHLCV data from a CSV into an xts object that can be supplied to mktdata = in the applyStrategy call, so that I can format my code to match.
I have the code to load and create the xts object from a data frame.
#loads from a dataframe which includes Symbol, Date, Open, High, Low, Close, Volume, Adjusted
tempData <- symbol_data_set[symbol_data_set$Symbol %in% symbolstring & symbol_data_set$Date >= startDate & symbol_data_set$Date<=endDate,]
#creates a list of xts
vectorXTS <- mclapply(symbolstring, function(x) {
  df <- symbol_data_set[symbol_data_set$Symbol == x &
                          symbol_data_set$Date >= startDate &
                          symbol_data_set$Date <= endDate, ]
  temp <- cbind(as.data.frame(df[, 2]), as.data.frame(df[, -1:-2]))
  rownames(df) <- df$Date
  z <- read.zoo(temp, index = 1, col.names = TRUE, header = TRUE)
  #sets names to Symbol.Open ...
  colnames(z) <- c(paste0(symbolstring[x], ".Open"), paste0(symbolstring[x], ".High"),
                   paste0(symbolstring[x], ".Low"), paste0(symbolstring[x], ".Close"),
                   paste0(symbolstring[x], ".Volume"), paste0(symbolstring[x], ".Adjusted"))
  return(as.xts(z, match.to = AAPL))
})
names(symbolstring) <- symbolstring
names(vectorXTS) <- symbolstring
for(i in symbolstring) assign(symbolstring[i],vectorXTS[i])
colnames(tempData) <- c(paste0(x,".Symbol"),paste0(x,".Date"),paste0(x,".Open"),paste0(x,".High"),paste0(x,".Low"),paste0(x,".Close"),paste0(x,".Volume"),paste0(x,".Adjusted"))
head(tempData)
rownames(tempData) <- tempData$Date
#attempts to use this xts object I created
results <- applyStrategy(strategy= strategyName, portfolios = portfolioName,symbols=symbolstring,mktdata)
This fails with the following error:
Error in mktdata[, keep] : incorrect number of dimensions
This is how you store an xts object from getSymbols in a file and reload it for use with quantstrat's applyStrategy. Two methods are shown; the write.zoo/read.zoo round trip is the ideal one, as you can see exactly how the CSVs are stored:
getSymbols("AAPL",from=startDate,to=endDate,adjust=TRUE,src='yahoo',auto.assign = TRUE)
saveRDS(AAPL, file= 'stuff.Rdata')
AAPL <- readRDS(file= 'stuff.Rdata')
write.zoo(AAPL,file="zoo.csv", index.name = "Date", row.names=FALSE)
rm(AAPL)
AAPL <- as.xts(read.zoo(file="zoo.csv",header = TRUE))
If you want to work with multiple symbols, the following worked for me.
Note that initially I had a reference to the 1st element, i.e. vectorXTS[[1]], and that worked too.
At the very least, setting it up like this got it to run:
vectorXTS <- mclapply(symbolstring, function(x) {
  df <- symbol_data_set[symbol_data_set$Symbol == x &
                          symbol_data_set$Date >= startDate &
                          symbol_data_set$Date <= endDate, ]
  temp <- cbind(as.data.frame(df[, 2]), as.data.frame(df[, -1:-2]))
  rownames(df) <- df$Date
  z <- read.zoo(temp, index = 1, col.names = TRUE, header = TRUE)
  colnames(z) <- c(paste0(x, ".Open"), paste0(x, ".High"), paste0(x, ".Low"),
                   paste0(x, ".Close"), paste0(x, ".Volume"), paste0(x, ".Adjusted"))
  write.zoo(z, file = paste0(x, "zoo.csv"), index.name = "Date", row.names = FALSE)
  return(as.xts(read.zoo(file = paste0(x, "zoo.csv"), header = TRUE)))
})
names(vectorXTS) <- symbolstring
#this assigns each symbol's xts object to its own name in memory, if one wishes to avoid using mktdata = vectorXTS[[]]
for(i in symbolstring) assign(i,vectorXTS[[i]])
results <- applyStrategy(strategy= strategyName, portfolios = portfolioName,symbols=symbolstring, mktdata = vectorXTS[[]])
#alternatively
#results <- applyStrategy(strategy= strategyName, portfolios = portfolioName,symbols=symbolstring)

How to continuously read a binary file in Crystal and get Bytes out of it?

Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?
Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:
file = File.open "something.bin", "rb"
The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):
data = Bytes.new(0)
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk
end
The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:
data = [] of UInt8
3.times do
  bytes_to_read = file.read_byte.not_nil!
  chunk = Bytes.new(bytes_to_read)
  file.read(chunk)
  data += chunk.to_a
end
However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.
So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?
One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
The best solution would probably be to use a buffer that is large enough to fit all data that could possibly be read - or at least be very likely to (resize if necessary).
If the maximum size is just 3 * 255 bytes this is a no-brainer. You can size down at the end if the buffer is too large.
data = Bytes.new(3 * UInt8::MAX)
bytes_read = 0
3.times do
  bytes_to_read = file.read_byte.not_nil!
  file.read_fully(data[bytes_read, bytes_to_read])  # fill exactly this chunk's portion of the buffer
  bytes_read += bytes_to_read
end
# resize to actual size at the end:
data = data[0, bytes_read]
Note: as the data format tells you how many bytes to read, you should use read_fully instead of read, which would silently return fewer bytes if fewer happen to be available.
EDIT: Since the number of chunks and thus the maximum size is not known in advance (per comment), you should use a dynamically resizing buffer. This can be easily implemented using IO::Memory, which will take care of resizing the buffer accordingly if necessary.
io = IO::Memory.new
loop do
  bytes_to_read = file.read_byte
  break if bytes_to_read.nil?
  IO.copy(file, io, bytes_to_read)
end
data = io.to_slice
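For comparison, the same growing-buffer pattern in Python uses io.BytesIO (just a sketch, assuming the same imaginary length-prefixed format in something.bin):

import io

buf = io.BytesIO()
with open("something.bin", "rb") as f:
    while True:
        header = f.read(1)                # one length byte per chunk
        if not header:                    # empty result means end of file
            break
        buf.write(f.read(header[0]))      # append that many bytes to the buffer

data = buf.getvalue()                     # all chunks concatenated as bytes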

Julia: How to modify a column of a matrix that has been saved as a binary file?

I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the coefficient '4' in the 'position_data' variable is because I am working with Float32. Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, lets say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now, is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task. However, I am not entirely sure how to proceed. I know I need to start by opening up a stream to the data; however I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to only modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works. Opening with "w" truncates the file, which is why all of the original data was lost; the 14000 bytes you observed are just the seek gap plus the newly written column. "r+" opens the existing file for both reading and writing without truncating it.
The reference for open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open and it explains the various modes. Preferably, open shouldn't be used with the somewhat confusing (and slower) mode-string parameter at all; the boolean-flag form is clearer.
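For readers who think in Python, the same seek-and-overwrite pattern looks like this (just a sketch, assuming the column-major file of 4-byte floats written above):

import numpy as np

Nrow, icol = 500, 7
new_col = np.zeros(Nrow, dtype=np.float32)   # replacement for the 7th column

with open("data", "r+b") as f:               # "r+b": read/write, no truncation
    f.seek(4 * Nrow * (icol - 1))            # byte offset of the start of column icol
    f.write(new_col.tobytes())               # overwrite exactly Nrow * 4 bytes in place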
You can use SharedArrays for the need you describe:
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
# do something with data
data[:, 1] = data[:, 1] .+ 1
exit()
# restart julia
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
@show data[1, 1]
# prints 1
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).
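The equivalent file-backed approach in Python is a NumPy memmap (a sketch; order="F" matches Julia's column-major layout of the file):

import numpy as np

Nrow, Ncol = 500, 325
# Map the existing binary file; reads and writes go straight to disk.
data = np.memmap("data", dtype=np.float32, mode="r+", shape=(Nrow, Ncol), order="F")
data[:, 6] = 0.0   # zero out the 7th column (0-based index 6)
data.flush()       # make sure the change is written back to the file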

File with random data but specific size

I am trying to generate a file in Ruby that has a specific size. The content doesn't matter.
Here is what I got so far (and it works!):
File.open("done/#{NAME}.txt", 'w') do |f|
contents = "x" * (1024*1024)
SIZE.to_i.times { f.write(contents) }
end
The problem is: once I zip or rar this file, the created archive is only a few kB small. I guess that's because the data in the file is so repetitive that it compresses extremely well.
How do I create data that is more random, as in a normal file (for example a movie file)? To be specific: how do I create a file with random data that keeps its size when archived?
You cannot guarantee an exact file size when compressing. However, as you suggest in the question, completely random data does not compress.
You can generate a random String using most random number generators. Even simple ones are capable of making hard-to-compress data, but you would have to write your own string-creation code. Luckily for you, Ruby comes with a built-in library that already has a convenient byte-generating method, and you can use it in a variation of your code:
require 'securerandom'
one_megabyte = 2 ** 20 # or 1024 * 1024, if you prefer
# Note use 'wb' mode to prevent problems with character encoding
File.open("done/#{NAME}.txt", 'wb') do |f|
SIZE.to_i.times { f.write( SecureRandom.random_bytes( one_megabyte ) ) }
end
This file is not going to compress much, if at all. Many compressors will detect that and just store the file as-is (making a .zip or .rar file slightly larger than the original).
For a given string size N and compression method c (e.g., from the rubyzip, libarchive or seven_zip_ruby gems), you want to find a string str such that:
str.size == c(str).size == N
I'm doubtful that you can be assured of finding such a string, but here's a way that should come close:
Step 0: Select a number m such that m > N.
Step 1: Generate a random string s with m characters.
Step 2: Compute str = c(s). If str.size < N, increase m and repeat Step 1; else go to Step 3.
Step 3: Return str[0,N].
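Here is a rough Python sketch of those steps, using zlib's compressor in place of c (the function name is just for illustration):

import os
import zlib

def incompressible_bytes(n, step=1024):
    # Grow a random buffer until its compressed form reaches n bytes, then truncate.
    m = n
    while True:
        compressed = zlib.compress(os.urandom(m))   # Steps 1-2: random string, compressed
        if len(compressed) >= n:
            return compressed[:n]                   # Step 3: keep the first n bytes
        m += step                                   # not big enough yet, increase m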

add columns to data frame using foreach and %dopar%

In Revolution R 2.12.2 on Windows 7 and Ubuntu 64-bit 11.04 I have a data frame with over 100K rows and over 100 columns, and I derive ~5 columns (sqrt, log, log10, etc) for each of the original columns and add them to the same data frame. Without parallelism using foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, it will not access the global environment (to prevent race conditions or something like that), so I cannot modify the data frame because the data frame object is 'not found.'
My question is how can I make this faster? In other words, how to parallelize either the columns or the transformations?
Simplified example:
require(foreach)
require(doSMP)
w <- startWorkers()
registerDoSMP(w)
transform_features <- function() {
  cols <- c(1, 2, 3, 4) # in my real code I select certain columns (not all)
  foreach(thiscol = cols, mydata) %dopar% {
    name <- names(mydata)[thiscol]
    print(paste('transforming variable ', name))
    mydata[, paste(name, 'sqrt', sep = '_')] <<- sqrt(mydata[, thiscol])
    mydata[, paste(name, 'log', sep = '_')] <<- log(mydata[, thiscol])
  }
}

n <- 10 # I often have 100K-1M rows
mydata <- data.frame(
  a = runif(n, 1, 100),
  b = runif(n, 1, 100),
  c = runif(n, 1, 100),
  d = runif(n, 1, 100)
)
ncol(mydata) # 4 columns
transform_features()
ncol(mydata) # if it works, there should be 8
Note that if you change %dopar% to %do% it works fine.
Try the := operator in data.table to add the columns by reference. You'll need with=FALSE so you can put the call to paste on the LHS of :=.
See When should I use the := operator in data.table?
Might it be easier if you did something like
n <- 10
mydata <- data.frame(
  a = runif(n, 1, 100),
  b = runif(n, 1, 100),
  c = runif(n, 1, 100),
  d = runif(n, 1, 100)
)
mydata_sqrt <- sqrt(mydata)
colnames(mydata_sqrt) <- paste(colnames(mydata), 'sqrt', sep='_')
mydata <- cbind(mydata, mydata_sqrt)
producing something like
> mydata
a b c d a_sqrt b_sqrt c_sqrt d_sqrt
1 29.344088 47.232144 57.218271 58.11698 5.417018 6.872565 7.564276 7.623449
2 5.037735 12.282458 3.767464 40.50163 2.244490 3.504634 1.940996 6.364089
3 80.452595 76.756839 62.128892 43.84214 8.969537 8.761098 7.882188 6.621340
4 39.250277 11.488680 38.625132 23.52483 6.265004 3.389496 6.214912 4.850240
5 11.459075 8.126104 29.048527 76.17067 3.385126 2.850632 5.389669 8.727581
6 26.729365 50.140679 49.705432 57.69455 5.170045 7.081008 7.050208 7.595693
7 42.533937 7.481240 59.977556 11.80717 6.521805 2.735186 7.744518 3.436157
8 41.673752 89.043099 68.839051 96.15577 6.455521 9.436265 8.296930 9.805905
9 59.122106 74.308573 69.883037 61.85404 7.689090 8.620242 8.359607 7.864734
10 24.191878 94.059012 46.804937 89.07993 4.918524 9.698403 6.841413 9.438217
There are two ways you can handle this:
Loop over each column (or, better yet, a subset of the columns), apply the transformations to create a temporary data frame, return that, and then cbind the list of data frames, as @Henry suggested.
Loop over the transformations, apply each to the data frame, return the transformed data frames, cbind, and proceed.
Personally, the way I tend to do things like this is to create a big.matrix object (either in memory or on disk, using the bigmemory package), so that all of the columns are accessible in shared memory. Just pre-allocate the columns you will fill in, and you won't need to do a post hoc cbind. I tend to do it on disk. Just be sure to run flush() to make sure everything is written to disk.
