I have been trying to understand how to batch things in Benthos, but I'm a bit confused about how to do it.
Take this example:
input:
  generate:
    interval: ""
    count: 40
    mapping: |
      root = count("test")

pipeline:
  processors:
    - log:
        level: INFO
        message: 'Test! ${! (this) } ${! (this % 2 == 0) } ${! batch_size() }'
    - group_by_value:
        value: ${! (this % 2 == 0) }
    - archive:
        format: tar
    - compress:
        algorithm: gzip

output:
  file:
    path: test/${! (this % 2 == 0) }.tar.gz
    codec: all-bytes
My expectation was that this would produce 2 files in test/: one called "true.tar.gz" and another called "false.tar.gz", each with 20 elements (the even and odd numbers). What I get instead is a single file containing only the last message. I can tell from the logs that it is not actually batching the messages based on that condition.
I thought group_by_value would, in a way, create "two streams/batches" of messages that would be handled separately in the output/archive, but it looks like it doesn't behave like that.
Could you please help me understand how it works?
Additionally, I was also planning to limit the size of each of these streams to a certain number, so that each TAR would contain at most that many entries.
Thanks!!
EDIT 1
This is something that works more like I expected, but this way I have to "know" how many items I want to batch before actually being able to filter them. I wonder if I can't just "accumulate" messages based on this group_by_value condition and batch them by count later?
input:
  broker:
    inputs:
      - generate:
          interval: ""
          count: 40
          mapping: |
            root = count("test")
    batching:
      count: 40

pipeline:
  processors:
    - group_by_value:
        value: ${! (this % 2 == 0) }
    - log:
        level: INFO
        message: 'Test! ${! (this) } ${! (this % 2 == 0) } ${! batch_size() }'
    - bloblang: |
        meta name = (this) % 2 == 0
    - archive:
        format: tar
        path: ${! (this) }

output:
  file:
    path: test/${! (meta("name")) }.tar
    codec: all-bytes
As you already noticed, group_by_value operates on message batches, which is why your first example produces a single file as output. In fact, it produces a file for each message, but since the file name is identical, each new file ends up overwriting the previous one.
From your edit, I'm not sure I get what you're trying to achieve. The batch policy documentation explains that byte_size, count, period and check are the available conditions for composing batches. When any of them is met, a batch is flushed, so you don't necessarily have to rely on a specific count. For convenience, the batching policy also has a processors field, which lets you define an optional list of processors to apply to each batch before it is flushed.
The windowed processing documentation might also be of interest, since it explains how the system_window buffer can be used to chop a stream of messages into tumbling or sliding windows of fixed temporal size. It has a section on grouping here.
Update 22.02.2022: Here's an example of how to perform output batching based on some key, as requested in the comments:
input:
  generate:
    interval: "500ms"
    count: 9
    mapping: |
      root.key = if count("key_counter") % 3 == 0 {
        "foo"
      } else {
        "bar"
      }
      root.content = uuid_v4()

pipeline:
  processors:
    - bloblang: |
        root = this
        # 3 is the number of messages you'd like to have in the "foo" batch.
        root.foo_key_end = this.key == "foo" && count("foo_key_counter") % 3 == 0

output:
  broker:
    outputs:
      - stdout: {}
        processors:
          - group_by_value:
              value: ${! json("key") }
          - bloblang: |
              root = this
              root.foo_key_end = deleted()
              root.batch_size = batch_size()
              root.batch_index = batch_index()
    batching:
      # Something big so, unless something bad happens, you should see enough
      # messages with key = "foo" before reaching this number.
      count: 1000
      check: this.foo_key_end
Sample output:
> benthos --log.level error -c config_group_by.yaml
{"batch_index":0,"batch_size":3,"content":"84e51d8b-a4e0-42c8-8cbb-13a8b7b37823","key":"foo"}
{"batch_index":1,"batch_size":3,"content":"1b35ff8b-7121-426e-8447-11e834610b90","key":"foo"}
{"batch_index":2,"batch_size":3,"content":"a9d9c661-1068-447f-9324-c418b0d7de9d","key":"foo"}
{"batch_index":0,"batch_size":6,"content":"5c9d26aa-f1dc-46ae-9845-3b035c1e569e","key":"bar"}
{"batch_index":1,"batch_size":6,"content":"17bbc7c1-94ec-4c9e-b0c5-b9c11f18498f","key":"bar"}
{"batch_index":2,"batch_size":6,"content":"7d7b9621-e174-4ca2-8a2e-1679e8177335","key":"bar"}
{"batch_index":3,"batch_size":6,"content":"db24273f-7064-498e-9914-9dd4c671dcd7","key":"bar"}
{"batch_index":4,"batch_size":6,"content":"4cfbea0e-dcc4-4d84-a87f-6930dd797737","key":"bar"}
{"batch_index":5,"batch_size":6,"content":"d6cb4726-4796-444d-91df-a5c278860106","key":"bar"}
Related
I have this Ruby method for compressing a string:
require 'zlib'
require 'stringio'

def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  compressed_data = output.string
  compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the code below:
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
puts 'Same'
else
puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero disables the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing; it seems like a bug. So you have to set it to something other than 0 (but see the comments below - it will be fixed in future releases).
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
Let's say I have the following log file that continuously logs a server's down/up time:
status.log
UP - "18:00:00"
..
..
DOWN - "19:00:03"
..
..
DOWN - "22:00:47"
..
..
UP - "23:59:48"
UP - "23:59:49"
UP - "23:59:50"
DOWN - "23:59:51"
DOWN - "23:59:52"
UP - "23:59:53"
UP - "23:59:54"
UP - "23:59:56"
UP - "23:59:57"
UP - "23:59:59"
Each day is logged in a separate folder under the same filename.
This is not my actual code, but it is a much simpler and more transparent approach:
#!/bin/ruby
downtime_log = File.readlines("path/to/log/file").select { |line| line =~ /DOWN/ }
puts "#{downtime_log.count} Downtimes for today"
Logic-wise, how can I get the total downtime per file/day in minutes and seconds rather than as a total count?
I assume that your file contains exactly one line per second. Then the number of seconds your service was down can be evaluated like you already did in your approach:
number_of_seconds_downtime = File.readlines('path/to/log/file')
                                 .select { |line| line =~ /DOWN/ }
                                 .count
Or simplified:
number_of_seconds_downtime = File.readlines('path/to/log/file')
                                 .count { |line| line =~ /DOWN/ }
To translate this into minutes and seconds, just use divmod:
minutes, seconds = number_of_seconds_downtime.divmod(60)
and output the result like this:
puts "#{minutes}:#{seconds} downtime"
I know that this question isn't new, but I haven't found anything useful. In my case I have a 20 GB file and I need to read random lines from it. I have a simple file index which contains line numbers and the corresponding seek offsets. I also disabled buffering when reading, so that only the needed line is read.
And this is my code:
def create_random_file_gen(file_path, batch_size=0, dtype=np.float32, delimiter=','):
    index = load_file_index(file_path)
    if (batch_size > len(index)) or (batch_size == 0):
        batch_size = len(index)
    lines_indices = np.random.random_integers(0, len(index), batch_size)

    with io.open(file_path, 'rb', buffering=0) as f:
        for line_index in lines_indices:
            f.seek(index[line_index])
            line = f.readline(2048)
            yield __get_features_from_line(line, delimiter, dtype)
The problem is that it's extremely slow: reading 5000 lines takes 89 seconds on my Mac (here I am pointing at the SSD drive). Here is the code I used for testing:
features_gen = tedlium_random_speech_gen(5000)  # just a wrapper for the function given above

i = 0
for feature, cls in features_gen:
    if i % 1000 == 0:
        print("Got %d features" % i)
    i += 1

print("Total %d features" % i)
I've read a bit about memory-mapped files, but I don't really understand how the mapping works in essence, or whether it will speed up the process or not.
So the main question is: what are the possible ways to speed up the process? The only way I see right now is to read blocks of lines at random rather than individual lines.
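Since memory mapping came up, here is a minimal sketch (my illustration, not from the original post) of how the same offset index could be combined with Python's mmap module. The function name mmap_random_line_gen is hypothetical, the index argument is assumed to be the line-number-to-offset structure built by load_file_index above, and any speedup depends on how much of the file ends up in the OS page cache:
import mmap
import numpy as np

def mmap_random_line_gen(file_path, index, batch_size):
    # `index` is assumed to map line numbers to byte offsets, as in the question.
    line_numbers = np.random.randint(0, len(index), batch_size)
    with open(file_path, 'rb') as f:
        # Map the whole file read-only; pages are loaded lazily, so random
        # seeks only touch the pages that are actually read.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for n in line_numbers:
                start = index[n]
                end = mm.find(b'\n', start)
                if end == -1:
                    end = len(mm)
                yield mm[start:end]
        finally:
            mm.close()
Reading blocks of consecutive lines per seek, as you suggest, would also help, since sequential access amortises the page faults.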
I have two large files. One of them is an info file (about 270 MB and 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is in standard FASTQ format (about 27 GB and 280,000,000 lines) and looks like this:
#ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
#ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. For each sequence, this part of Line 1 is unique:
1101:1416:1801 and 1101:10003:75641
I want to grab Line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:
import gzip
import re
count = 0
with open('info_path') as info, open('grab_path', 'w') as grab:
    for i in info:
        sample = i.strip()
        with gzip.open('fq_path') as fq:
            for j in fq:
                count += 1
                if count % 4 == 1:
                    line = j.strip()
                    m = re.search(sample, j)
                    if m != None:
                        grab.writelines(line + '\n' + fq.next() + fq.next() + fq.next())
                        count = 0
                        break
It works, but because both of these files have millions of lines, it's inefficient (running it for a whole day only got through 20,000 lines).
UPDATE, July 6th:
I found that the info file can be read into memory (thanks to @tobias_k for reminding me), so I created a dictionary whose keys are the info lines and whose values are all 0. After that, I read the FASTQ file four lines at a time, use the identifier part as the key, and if the value is 0 I output those 4 lines. Here is my code:
import gzip

dic = {}
with open('info_path') as info:
    for i in info:
        sample = i.strip()
        dic[sample] = 0

with gzip.open('fq_path') as fq, open('grap_path', "w") as grab:
    for j in fq:
        if j[:10] == '#ST-E00126':
            line = j.split(':')
            match = line[4] + ':' + line[5] + ':' + line[6][:-2]
            if dic.get(match) == 0:
                grab.writelines(j + fq.next() + fq.next() + fq.next())
This way is much faster; it takes 20 minutes to get all the matched lines (about 64,000,000 lines). I have also thought about sorting the FASTQ file first with an external sort. Splitting it into chunks that fit into memory is OK; my trouble is how to keep the next three lines attached to the identifier line while sorting. Google's answer is to linearize these four lines first, but that alone would take 40 minutes.
Anyway, thanks for your help.
You can sort both files by the identifier part (the 1101:1416:1801). Even if the files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching as you go. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()

while (none of the files ended)
    if (entry1.id == entry2.id)
        record match
        entry1 = readFromFile1()
        entry2 = readFromFile2()
    else if (entry1.id < entry2.id)
        entry1 = readFromFile1()
    else
        entry2 = readFromFile2()
This way entry1.id and entry2.id are always close to each other and you will not miss any matches. At the same time, this approach requires iterating over each file once.
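For illustration, here is a rough Python sketch of that merge (my own, not tested against the real data). It assumes the FASTQ file is first "linearized" so that each 4-line record becomes a single tab-joined line prefixed with its identifier, and that both files are then sorted with the same lexicographic ordering; linearize_fastq and merge_matches are hypothetical helper names, and the identifier extraction follows the same split(':') idea as the question's code:
def linearize_fastq(fastq_path, out_path):
    # Turn each 4-line FASTQ record into one line:
    # "<identifier>\t<line1>\t<line2>\t<line3>\t<line4>",
    # so that an external sort keeps the record together.
    with open(fastq_path) as fq, open(out_path, 'w') as out:
        while True:
            record = [fq.readline() for _ in range(4)]
            if not record[0]:
                break
            fields = record[0].split(':')
            identifier = fields[4] + ':' + fields[5] + ':' + fields[6].split()[0]
            out.write(identifier + '\t' + '\t'.join(l.rstrip('\n') for l in record) + '\n')

def merge_matches(sorted_info_path, sorted_fastq_path, out_path):
    # Merge-join two files that are already sorted by identifier.
    with open(sorted_info_path) as info, open(sorted_fastq_path) as fastq, \
            open(out_path, 'w') as out:
        info_id = info.readline().strip()
        record = fastq.readline()
        while info_id and record:
            rec_id = record.split('\t', 1)[0]
            if info_id == rec_id:
                # Restore the original 4-line layout of the matched record.
                out.write('\n'.join(record.rstrip('\n').split('\t')[1:]) + '\n')
                info_id = info.readline().strip()
                record = fastq.readline()
            elif info_id < rec_id:
                info_id = info.readline().strip()
            else:
                record = fastq.readline()
Because both inputs only move forward, the join is a single pass over each sorted file, which is what makes the approach viable at this scale.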
I have a three-asset portfolio, and I need to set the target return for my second asset.
Whenever I try, I get this error:
asset.ts <- as.timeSeries(asset.ret)
spec <- portfolioSpec()
setSolver(spec) <- "solveRshortExact"
constraints <- c("Short")
setTargetReturn(Spec) = mean(colMeans(asset.ts[,2]))
efficientPortfolio(asset.ts, spec, constraints)
Error: is.numeric(targetReturn) is not TRUE
Title:
MV Efficient Portfolio
Estimator: covEstimator
Solver: solveRquadprog
Optimize: minRisk
Constraints: Short
Portfolio Weights:
MSFT AAPL NORD
0 0 0
Covariance Risk Budgets:
MSFT AAPL NORD
Target Return and Risks:
mean mu Cov Sigma CVaR VaR
0 0 0 0 0 0
Description:
Sat Apr 19 15:03:24 2014 by user: Usuario
I have tried and I have searched the web, but I have no idea how to set the target return to a specific expected return from the data set. I could copy the mean of my second asset, but I think rounding the decimals could affect the answer.
I ran into this error when using 2 assets. It appears to be a bug in the PortOpt methods: when there are 2 assets, it runs .mvSolveTwoAssets, which looks for the targetReturn in the portfolio spec. But, as you know, a targetReturn isn't always needed.
Also, in your code you have 2 separate variables for the spec: 'spec' and 'Spec'. Assuming 'Spec' is a typo, this line needs to be changed so that it sets the target return on 'spec':
setTargetReturn(Spec) = mean(colMeans(asset.ts[,2]))