monetdbe: multiple connections reading vs writing - monetdb

I am finding that with monetdbe (embedded, Python), I can import data to two tables simultaneously from two processes, but I can't do two SELECT queries.
For example, if I run this in Bash:
(python stdinquery.py < sql_examples/wind.sql &); (python stdinquery.py < sql_examples/first_event_by_day.sql &)
then I get this error from one process, while the other finishes its query fine:
monetdbe.exceptions.OperationalError: Failed to open database: MALException:monetdbe.monetdbe_startup:GDKinit() failed (code -2)
I'm a little surprised that it can write two tables at once but not read two tables at once. Am I overlooking something?
My stdinquery.py is just:
import sys
import monetdbe
monet_conn = monetdbe.connect("dw.db")
cursor = monet_conn.cursor()
query = sys.stdin.read()
cursor.executescript(query)
print(cursor.fetchdf())

You are starting multiple concurrent Python processes. Each of them tries to create or open a database on disk at the dw.db location. That won't work, because the embedded database processes are not aware of each other.
With the core C library of monetdbe, it is possible to write multi-threaded applications where each connecting application thread uses its own connection object. See this example written in C here. Again this only works for concurrent threads within a single monetdbe process, not multiple concurrent monetdbe processes claiming the same database location.
Unfortunately it is not currently possible with the Python monetdbe module to set up something analogous to the C example above, but in the next release it will probably be possible to use e.g. concurrent.futures to write something similar in Python.
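In the meantime, a simple way to stay within the single-process constraint is to run both query files from one process over a single connection, so the database location is only opened once. A minimal sketch, using only the API already shown in the question:

import sys
import monetdbe

# One process owns dw.db; run each query file over the same connection.
conn = monetdbe.connect("dw.db")
cursor = conn.cursor()

for path in sys.argv[1:]:  # e.g. sql_examples/wind.sql sql_examples/first_event_by_day.sql
    with open(path) as f:
        cursor.executescript(f.read())
    print(cursor.fetchdf())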

Related

Limitations on file append when using in multi-processed environment

My process creates a log file and appends a new line at the end of the file by opening it in append mode ("a"), e.g.:
fopen("log.txt", "a");
The order of the writes is not critical, but I need to ensure that fopen always succeeds. My question is: can the call above be executed from multiple processes at the same time on Windows, Linux and macOS without any race condition?
If not, what is the most common and easy way to ensure I can write to the log file? There is file locking, but a separate lock file (e.g. log.txt.lock) is also possible. Could anyone share some insights or resources that go into more detail?
If you do not use any synchronization between processes, you will very likely hit moments when several processes try to write to the file at once, and the best you can get is an interleaved mess of input strings.
To synchronize work across several processes started with the multiprocessing module, use a Lock. It prevents several processes from doing the same piece of work simultaneously.
It will look something like this:
import multiprocessing

def do_some_work(lock, message):
    # Only one process appends to the file at a time.
    with lock:
        with open("log.txt", "a") as f:
            f.write(message + "\n")

if __name__ == "__main__":
    # Create the lock in the main process and pass it to the child processes.
    lock = multiprocessing.Lock()
    procs = [multiprocessing.Process(target=do_some_work, args=(lock, "line %d" % i)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
If you need a more detailed example, feel free to ask.
You can also check the example in the official docs.
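If the writers are unrelated processes that cannot share a multiprocessing.Lock, the lock-file idea mentioned in the question is one common alternative. A minimal sketch, assuming a hypothetical append_line helper and relying on atomic exclusive file creation, which works on Windows, Linux and macOS:

import os
import time

def append_line(path, line):
    lock_path = path + ".lock"
    # Atomically create the lock file; only one process can succeed at a time.
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL)
            break
        except FileExistsError:
            time.sleep(0.01)  # another process holds the lock; retry shortly
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
    finally:
        os.close(fd)
        os.remove(lock_path)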

Best ETL Packages In Python

I have 2 use cases:
Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).
I am currently using:
SQLAlchemy for extracts (works well generally).
PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out).
An internal tool for the scheduling component (which stores the transform in XML and then loads from the XML; it seems rather long and complicated).
I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).
You can try pyetl, an ETL framework written in Python 3:
from pyetl import Task, DatabaseReader, DatabaseWriter

reader = DatabaseReader("sqlite:///db.sqlite3", table_name="source")
writer = DatabaseWriter("sqlite:///db.sqlite3", table_name="target")
# column mapping between the source and target tables
columns = {"id": "uuid", "name": "full_name"}
# per-column transform functions applied during the load
functions = {"id": str, "name": lambda x: x.strip()}
Task(reader, writer, columns=columns, functions=functions).start()
How about Python and Pandas? This is what we use for our ETL processing.
I'm using Pandas to access my ETL files; try doing something like this:
Create a class with all your queries in it.
Create another class that processes the actual data warehouse, using Pandas and Matplotlib for the graphs (see the sketch below).
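A minimal sketch of that two-class structure, assuming hypothetical table and column names and SQLAlchemy connection URLs:

import pandas as pd
from sqlalchemy import create_engine

class Queries:
    # Keep all extract queries in one place (table/column names are illustrative).
    DAILY_ORDERS = "SELECT order_id, customer_id, amount, created_at FROM orders"

class Warehouse:
    # Extract with pandas, apply transforms, load into the target database.
    def __init__(self, source_url, target_url):
        self.source = create_engine(source_url)
        self.target = create_engine(target_url)

    def run_daily_orders(self):
        df = pd.read_sql(Queries.DAILY_ORDERS, self.source)  # extract
        df["amount"] = df["amount"].fillna(0)                 # transform
        df.to_sql("orders_clean", self.target, if_exists="append", index=False)  # load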
Consider having a look at the convtools library; it provides lots of data processing primitives, is pure Python and has zero dependencies.
Since it generates ad hoc Python code under the hood, it sometimes outperforms pandas/polars, so it can fill some gaps in your workflows, especially if those have a dynamic nature.
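A small sketch of the idea; the exact conversion API should be checked against the convtools docs, and the field names here are illustrative:

from convtools import conversion as c

# Build a converter that renames fields and applies per-field transforms,
# then generate the underlying ad hoc Python function once and reuse it.
converter = c.iter(
    {"uuid": c.item("id").as_type(str), "full_name": c.item("name").call_method("strip")}
).as_type(list).gen_converter()

print(converter([{"id": 1, "name": " Alice "}]))
# [{'uuid': '1', 'full_name': 'Alice'}]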

Choosing a better parallel architecture in Python

I am working on Data Wrangling problem using Python,
which processes a dirty Excel file into a clean Excel file
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options: 1) multithreading, 2) the multiprocessing module, 3) the ParallelPython module.
I have a basic idea of the three methods, I would like to know which method is best and why?
In brief: processing a single dirty Excel file currently takes 3 minutes.
Objective: introduce parallelism/concurrency to process multiple files at once.
I am looking for the best method of parallelism to achieve this objective.
Since your process is mostly CPU bound, multithreading won't be fast because of the GIL.
I would recommend multiprocessing or concurrent.futures, since they are a bit simpler than ParallelPython (only a bit :) ).
example:
import concurrent.futures

# data_wrangler is your existing function that cleans a single Excel file
with concurrent.futures.ProcessPoolExecutor() as executor:
    for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
        print('%s is now clean!' % file_path)
        # do something with clean_file if you want
Only if you need to distribute the load between servers would I recommend ParallelPython.

Using julia in a cluster

I've been using Julia in parallel on my computer successfully, but I want to increase the number of processors/workers I use, so I plan to use my departmental cluster (UCL Econ). When just using Julia on my computer, I have two separate files. FileA contains all the functions I use, including the main function funcy(x,y,z). FileB calls this function over several processors as follows:
addprocs(4)
require("FileA.jl")
solution = pmap(imw -> funcy(imw,y,z), 1:10)
When I try to run this on the cluster, the require statement does not seem to work (though I don't get an explicit error output, which is frustrating). Any advice?

redis: EVAL and the TIME

I like the Lua scripting for Redis, but I have a big problem with TIME.
I store events in a SortedSet.
The score is the time, so that in my application I can view all events in a given time window.
redis.call('zadd', myEventsSet, TIME, EventID);
OK, but this is not working: I cannot access TIME (the server time).
Is there any way to get the time from the server without passing it as an argument to my Lua script? Or is passing the time as an argument the best way to do it?
This is explicitly forbidden (as far as I remember). The reasoning behind this is that your Lua functions must be deterministic and depend only on their arguments. What if this Lua call gets replicated to a slave with a different system time?
Edit (by Linus G Thiel): This is correct. From the redis EVAL docs:
Scripts as pure functions
A very important part of scripting is writing scripts that are pure functions. Scripts executed in a Redis instance are replicated on slaves by sending the script -- not the resulting commands.
[...]
In order to enforce this behavior in scripts Redis does the following:
Lua does not export commands to access the system time or other external state.
Redis will block the script with an error if a script calls a Redis command able to alter the data set after a Redis random command like RANDOMKEY, SRANDMEMBER, TIME. This means that if a script is read-only and does not modify the data set it is free to call those commands. Note that a random command does not necessarily mean a command that uses random numbers: any non-deterministic command is considered a random command (the best example in this regard is the TIME command).
There is a wealth of information on why this is, how to deal with this in different scenarios, and what Lua libraries are available to scripts. I recommend you read the whole documentation!
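One common workaround, consistent with the above, is to compute the time on the client and pass it in as an argument. A minimal sketch using redis-py (the key and event names are illustrative):

import time
import redis

r = redis.Redis()

# Pass the client-side timestamp as ARGV so the script stays deterministic.
script = """
return redis.call('zadd', KEYS[1], ARGV[1], ARGV[2])
"""
add_event = r.register_script(script)
add_event(keys=["myEventsSet"], args=[time.time(), "EventID:42"])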

Resources