Can I use Tarantool instead of Redis?

I'd like to use Tarantool to store data. How can I store data with a TTL and simple logic (without using spaces)?
Like this:
box:setx(key, value, ttl);
box:get(key)

Yes, you can expire data in Tarantool, and in a much more flexible way than in Redis. You can't do it without spaces, though, because a space is the container for data in Tarantool (like a database or a table in other database systems).
In order to expire data, you have to install the expirationd Tarantool rock with the tarantoolctl rocks install expirationd command. Full documentation on the expirationd daemon can be found here.
Feel free to use the sample code below:
#!/usr/bin/env tarantool
package.path = './.rocks/share/tarantool/?.lua;' .. package.path

local fiber = require('fiber')
local expirationd = require('expirationd')

-- setup the database
box.cfg{}
box.once('init', function()
    box.schema.create_space('test')
    box.space.test:create_index('primary', {parts = {1, 'unsigned'}})
end)

-- print all fields of all tuples in a given space
local print_all = function(space_id)
    for _, v in ipairs(box.space[space_id]:select()) do
        local s = ''
        for i = 1, #v do s = s .. tostring(v[i]) .. '\t' end
        print(s)
    end
end

-- return true if tuple is more than 10 seconds old
local is_expired = function(args, tuple)
    return (fiber.time() - tuple[3]) > 10
end

-- simply delete a tuple from a space
local delete_tuple = function(space_id, args, tuple)
    box.space[space_id]:delete{tuple[1]}
end

local space = box.space.test
print('Inserting tuples...')
space:upsert({1, '0 seconds', fiber.time()}, {})
fiber.sleep(5)
space:upsert({2, '5 seconds', fiber.time()}, {})
fiber.sleep(5)
space:upsert({3, '10 seconds', fiber.time()}, {})
print('Tuples are ready:\n')
print_all('test')

print('\nStarting expiration daemon...\n')
-- start expiration daemon
-- in production full_scan_time should be bigger than 1 sec
expirationd.start('expire_old_tuples', space.id, is_expired, {
    process_expired_tuple = delete_tuple, args = nil,
    tuples_per_iteration = 50, full_scan_time = 1
})

fiber.sleep(5)
print('\n\n5 seconds passed...')
print_all('test')
fiber.sleep(5)
print('\n\n10 seconds passed...')
print_all('test')
fiber.sleep(5)
print('\n\n15 seconds passed...')
print_all('test')
os.exit()

Related

Improve code result speed by multiprocessing

I'm self-studying Python and this is my first piece of code.
I'm working on analyzing logs from our servers. Usually I need to analyze a full day of logs. I created a script (this is an example with simplified logic) just to check the speed. With plain sequential code, analyzing 20 million rows takes about 12-13 minutes. I need to process 200 million rows in 5 minutes.
What I tried:
Multiprocessing (I ran into an issue with shared memory and think I fixed it). But the result is 300K rows = 20 seconds, no matter how many processes I use. (PS: I also need to control the number of processes in advance.)
Threading (I found it doesn't give any speedup: 300K rows = 2 seconds, but plain code is the same, 300K = 2 seconds).
Asyncio (I think the script is slow because it needs to read many files). The result is the same as threading: 300K = 2 seconds.
In the end I think all three of my scripts are incorrect and don't work as intended.
PS: I try to avoid specialized Python modules (like pandas) because that would make it harder to run on different servers; it's better to stick to common libraries.
Please help me check the first one, multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}

def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]

if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed by an independent process via def argument, and finally all results to be saved in the dict vod_live_cuts. For each process I added an independent list in the dict; I thought it would help with cross-process use of this parameter. But maybe it's the wrong way :(
Using IPC is costly, so only use "shared objects" for saving the final result, not for intermediate results while parsing the file.
Limiting the number of processes is done with a multiprocessing.Pool; the following code uses it to reach the maximum hard-disk speed, and you only need to post-process the results (a variant sketch follows the code).
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so if you need to improve performance further you should use an SSD or RAID0 for higher disk speed; you cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool

file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}

def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0  # make it local to process
    a_temp_h_n = 0  # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)

if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()
    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))
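As noted above, the parent process only needs to post-process the per-file counts. Below is a variant sketch (not the answer's code) that skips the Manager entirely: each worker returns its file's counts and the parent sums them. The count_file/files names and the pool size are illustrative assumptions.

import csv
import os
from multiprocessing import Pool

files = ["hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"]

def count_file(path):
    # purely local counters; the result travels back as the return value
    misses, hits = 0, 0
    with open(os.path.join(os.getcwd(), path), newline='') as fh:
        for row in csv.reader(fh, delimiter=' '):
            if 'MISS' in row[3]:
                misses += 1
            elif 'HIT' in row[3]:
                hits += 1
    return misses, hits

if __name__ == '__main__':
    with Pool(4) as pool:  # four worker processes, adjust to taste
        per_file = pool.map(count_file, files)  # one (miss, hit) pair per file
    total_miss = sum(m for m, _ in per_file)
    total_hit = sum(h for _, h in per_file)
    print('MISS:', total_miss, 'HIT:', total_hit)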

Can GA4-API fetch the data from requests made with a combination of minute and region and sessions?

Problem
With UA, I was able to get the number of sessions per region per minute (a combination of minute, region, and sessions), but is this not possible with GA4?
If not, is there any plan to support this in the future?
Detail
I ran GA4 Query Explorer with date, hour, minute, region in Dimensions and sessions in Metrics.
But I got an incompatibility error.
What I tried
I have checked with the GA4 Dimensions & Metrics Explorer and confirmed that the combination of minute and region is not possible.
(updated 2022/05/16 15:35) Checked by code execution
I ran it with Ruby.
require "google/analytics/data/v1beta/analytics_data"
require 'pp'
require 'json'
ENV['GOOGLE_APPLICATION_CREDENTIALS'] = '' # service acount file path
client = ::Google::Analytics::Data::V1beta::AnalyticsData::Client.new
LIMIT_SIZE = 1000
offset = 0
loop do
request = Google::Analytics::Data::V1beta::RunReportRequest.new(
property: "properties/xxxxxxxxx",
date_ranges: [
{ start_date: '2022-04-01', end_date: '2022-04-30'}
],
dimensions: %w(date hour minute region).map { |d| { name: d } },
metrics: %w(sessions).map { |m| { name: m } },
keep_empty_rows: false,
offset: offset,
limit: LIMIT_SIZE
)
ret = client.run_report(request)
dimension_headers = ret.dimension_headers.map(&:name)
metric_headers = ret.metric_headers.map(&:name)
puts (dimension_headers + metric_headers).join(',')
ret.rows.each do |row|
puts (row.dimension_values.map(&:value) + row.metric_values.map(&:value)).join(',')
end
offset += LIMIT_SIZE
break if ret.row_count <= offset
end
The result was an error.
3:The dimensions and metrics are incompatible.. debug_error_string:{"created":"#1652681913.393028000","description":"Error received from peer ipv4:172.217.175.234:443","file":"src/core/lib/surface/call.cc","file_line":953,"grpc_message":"The dimensions and metrics are incompatible.","grpc_status":3}
There is an error in your code: make sure you use the actual API dimension name and not the UI name. The correct name of that dimension is dateHourMinute, not "Date hour and minute".
dimensions: %w(dateHourMinute).map { |d| { name: d } },
The Query Explorer returns this request just fine.
Limited use of the region dimension
As for region: as the error message states, the dimensions and metrics are incompatible. The issue is that dateHourMinute cannot be used together with region. Switch to date or dateHour.
At the time of writing this is a beta API. I have sent a message off to Google to find out whether this is working as intended or whether it may be changed.
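For reference, here is what the corrected request might look like with the Python GA4 Data API client instead of Ruby. This is only a sketch: it assumes the google-analytics-data package, credentials provided via GOOGLE_APPLICATION_CREDENTIALS, and a placeholder property ID.

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key file
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/XXXXXXXXX",  # placeholder property ID
    date_ranges=[DateRange(start_date="2022-04-01", end_date="2022-04-30")],
    # dateHourMinute is the API name; it cannot be combined with region,
    # so region is left out here (use date or dateHour if you need region)
    dimensions=[Dimension(name="dateHourMinute")],
    metrics=[Metric(name="sessions")],
    keep_empty_rows=False,
    limit=1000,
)
response = client.run_report(request)
for row in response.rows:
    print([v.value for v in row.dimension_values] + [v.value for v in row.metric_values])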

How to speed up the addition of a new column in pandas, based on comparisons on an existing one

I am working on a large-ish dataframe collection with some machine data in several tables. The goal is to add a column to every table which expresses the row's "class", considering its vicinity to a certain time stamp.
seconds = 1800

for i in range(len(tables)):  # looping over 20 equally structured tables containing machine data
    table = tables[i]
    table['Class'] = 'no event'
    for event in events[i].values:  # looping over 20 equally structured tables containing events
        event_time = event[1]  # get integer time stamp
        start_time = event_time - seconds
        table.loc[(table.Time <= event_time) & (table.Time >= start_time), 'Class'] = 'event soon'
The event_times and the entries in table.Time are integers. The point is to assign the class "event soon" to all rows in a specific time frame before an event (within the given number of seconds).
The code takes quite long to run, and I am not sure what is to blame and what can be fixed. The number of seconds does not have much impact on the runtime, so the part where the table is actually changed is probably working fine, and it may have to do with the nested loops instead. However, I don't see how to get rid of them. Hopefully, there is a faster, more pandas-like way to go about adding this class column.
I am working with Python 3.6 and Pandas 0.19.2
You can use numpy broadcasting to do this vectorised instead of looping.
Dummy data generation
import numpy as np
import pandas as pd

num_tables = 5
seconds = 1800

def gen_table(count):
    for i in range(count):
        times = [(100 + j)**2 for j in range(i, 50 + i)]
        df = pd.DataFrame(data={'Time': times})
        yield df

def gen_events(count, num_tables):
    for i in range(num_tables):
        times = [1E4 + 100 * (i + j)**2 for j in range(count)]
        yield pd.DataFrame(data={'events': times})

tables = list(gen_table(num_tables))  # a list of 5 DataFrames of length 50
events = list(gen_events(5, num_tables))  # a list of 5 DataFrames of length 5
Comparison
For debugging, I added a dict of verification DataFrames. They are not needed; I just used them for debugging.
verification = {}

for i, (table, event_df) in enumerate(zip(tables, events)):
    event_list = event_df['events']
    time_diff = event_list.values - table['Time'].values[:, np.newaxis]  # This is where the magic happens
    events_close = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    table['Class'] = np.where(events_close, 'event soon', 'no event')
    # The stuff after this line can be deleted since it's only used for the verification
    df = pd.DataFrame(data=time_diff, index=table['Time'], columns=event_list)
    df['event'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    verification[i] = df
newaxis
A good explanation of broadcasting can be found in jakevdp's book.
table['Time'].values[:,np.newaxis]
gives a (50,1) 2-d array
array([[10000],
[10201],
[10404],
....
[21609],
[21904],
[22201]], dtype=int64)
Verification
For the first step the verification df looks like this:
events 10000.0 10100.0 10400.0 10900.0 11600.0 event
Time
10000 0.0 100.0 400.0 900.0 1600.0 True
10201 -201.0 -101.0 199.0 699.0 1399.0 True
10404 -404.0 -304.0 -4.0 496.0 1196.0 True
10609 -609.0 -509.0 -209.0 291.0 991.0 True
10816 -816.0 -716.0 -416.0 84.0 784.0 True
11025 -1025.0 -925.0 -625.0 -125.0 575.0 True
11236 -1236.0 -1136.0 -836.0 -336.0 364.0 True
11449 -1449.0 -1349.0 -1049.0 -549.0 151.0 True
11664 -1664.0 -1564.0 -1264.0 -764.0 -64.0 False
11881 -1881.0 -1781.0 -1481.0 -981.0 -281.0 False
12100 -2100.0 -2000.0 -1700.0 -1200.0 -500.0 False
12321 -2321.0 -2221.0 -1921.0 -1421.0 -721.0 False
12544 -2544.0 -2444.0 -2144.0 -1644.0 -944.0 False
....
20449 -10449.0 -10349.0 -10049.0 -9549.0 -8849.0 False
20736 -10736.0 -10636.0 -10336.0 -9836.0 -9136.0 False
21025 -11025.0 -10925.0 -10625.0 -10125.0 -9425.0 False
21316 -11316.0 -11216.0 -10916.0 -10416.0 -9716.0 False
21609 -11609.0 -11509.0 -11209.0 -10709.0 -10009.0 False
21904 -11904.0 -11804.0 -11504.0 -11004.0 -10304.0 False
22201 -12201.0 -12101.0 -11801.0 -11301.0 -10601.0 False
Small optimizations of the original approach
You can shave a few lines and some assignments off the original algorithm:
for table, event_df in zip(tables, events):
    table['Class'] = 'no event'
    for event_time in event_df['events']:  # looping over 20 equally structured tables containing events
        start_time = event_time - seconds
        table.loc[table['Time'].between(start_time, event_time), 'Class'] = 'event soon'
You might shave off some more if, instead of the text values 'no event' and 'event soon', you just used booleans; a sketch of that variant follows.
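For illustration, here is a minimal sketch of that boolean variant combined with the broadcasting approach from above. It reuses the tables, events and seconds names defined in the dummy-data section, and True plays the role of 'event soon'.

import numpy as np

for table, event_df in zip(tables, events):
    # (50, 1) column of row times minus (5,) row of event times -> (50, 5) differences
    time_diff = event_df['events'].values - table['Time'].values[:, np.newaxis]
    # True where some event happens within `seconds` after the row's time
    table['Class'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)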

Redis Sorted Set: Bulk ZSCORE

How do I get a list of members based on their IDs from a sorted set, instead of just one member?
I would like to build a subset with a set of IDs from the actual sorted set.
I am using a Ruby client for Redis and do not want to iterate one by one, because there could be more than 3000 members that I want to look up.
Here is the issue tracker for a new command, ZMSCORE, to do bulk ZSCORE.
There is no variadic form for ZSCORE, yet - see the discussion at: https://github.com/antirez/redis/issues/2344
That said, and for the time being, what you could do is use a Lua script for that. For example:
local scores = {}
while #ARGV > 0 do
    scores[#scores+1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
end
return scores
Running this from the command line would look like:
$ redis-cli ZADD foo 1 a 2 b 3 c 4 d
(integer) 4
$ redis-cli --eval mzscore.lua foo , b d
1) "2"
2) "4"
EDIT: In Ruby, it would probably be something like the following, although you'd be better off using SCRIPT LOAD and EVALSHA and loading the script from an external file (instead of hardcoding it in the app):
require 'redis'

script = <<LUA
local scores = {}
while #ARGV > 0 do
    scores[#scores+1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
end
return scores
LUA

redis = ::Redis.new()
reply = redis.eval(script, ["foo"], ["b", "d"])
Lua script to get scores with member IDs:
local scores = {}
while #ARGV > 0 do
    local member_id = table.remove(ARGV, 1)
    local member_score = {}
    member_score[1] = member_id
    member_score[2] = redis.call('ZSCORE', KEYS[1], member_id)
    scores[#scores + 1] = member_score
end
return scores
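The asker is on Ruby, but for comparison the same SCRIPT LOAD / EVALSHA pattern can be sketched with the Python redis-py client, whose register_script helper caches the script and calls EVALSHA for you. This is only an illustrative sketch, not part of the original answer.

import redis

r = redis.Redis()

# same Lua body as above: one ZSCORE call per requested member
ZMSCORE_LUA = """
local scores = {}
while #ARGV > 0 do
    scores[#scores+1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
end
return scores
"""

zmscore = r.register_script(ZMSCORE_LUA)  # loaded once, then reused via EVALSHA
scores = zmscore(keys=["foo"], args=["b", "d"])
print(scores)  # e.g. [b'2', b'4'] for the ZADD example above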

Making variables persistent after a restart on NodeMCU

I'm making a smart home system using NodeMCU, and I need to store and retrieve data from the module. I used the following function.
function save_settings(name, value)
    file.remove(name)
    file.open(name, "w+")
    file.writeline(value)
    file.close()
end
It works, but it's slow and the NodeMCU crashes if I trigger the above function rapidly, sometimes requiring a filesystem format before it can be used again.
So my question is: is there any other way to make variables persistent between restarts?
I'm using the latest firmware, 0.9.6-dev_20150704, the float version (https://github.com/nodemcu/nodemcu-firmware/releases)
This code took 62-63 ms to complete at first, and it seems to add a few fractions of a millisecond with each successive run; after a few hundred executions, it was up to almost 100 ms. It never crashed on me.
function save_setting(name, value)
    file.open(name, 'w')  -- you don't need to do file.remove if you use the 'w' method of writing
    file.writeline(value)
    file.close()
end

function read_setting(name)
    if (file.open(name) ~= nil) then
        result = string.sub(file.readline(), 1, -2)  -- to remove the newline character
        file.close()
        return true, result
    else
        return false, nil
    end
end

startTime = tmr.now()

test1 = 1200
test2 = 15.7
test3 = 75
test4 = 15000001

save_setting('test1', test1)
save_setting('test2', test2)
save_setting('test3', test3)
save_setting('test4', test4)

test1exists, test1 = read_setting('test1')
test2exists, test2 = read_setting('test2')
test3exists, test3 = read_setting('test3')
test4exists, test4 = read_setting('test4')

completeTime = (tmr.now() - startTime) / 1000
print('time to complete (ms):')
print(tostring(completeTime))
If you upgrade to the newer version (based on SDK 1.4.0) you can use the rtcmem memory slots:
local offset = 10
local val = rtcmem.read32(offset, 1)
rtcmem.write32(offset, val + 1)
That memory is documented to persist through a deep-sleep cycle; I've found it to persist across both hardware and software resets (but not power cycling).
