How to iterate over a list of values returned from ops to jobs in Dagster - ETL

I am new to the Dagster world and working on the ops and jobs concepts.
My requirement is to read a list of data from config_schema, pass it to an @op function, and return the same list to the job.
The code is shown below:
@op(config_schema={"table_name": list})
def read_tableNames(context):
    lst = context.op_config['table_name']
    return lst

@job
def write_db():
    tableNames_frozenList = read_tableNames()
    print(f'-------------->', type(tableNames_frozenList))
    print(f'-------------->{tableNames_frozenList}')
When the @op function receives the list, it shows up as a frozenlist type, but when I try to return it to the job it gets converted into the <class 'dagster._core.definitions.composition.InvokedNodeOutputHandle'> data type.
My requirement is to fetch the list of data, iterate over it, and perform some operations on the individual items of the list using @ops.
Please help me understand this.
Thanks in advance !!!

When using ops / graphs / jobs in Dagster, it's very important to understand that the code defined within a @graph or @job definition is only executed when your code is loaded by Dagster, NOT when the graph is actually executing. The code within a @graph or @job definition is essentially a compilation step that only serves to define the dependencies between ops - there shouldn't be any general-purpose Python code within those definitions. Whatever operations you want to perform on data flowing through your job should take place within the @op definitions. So if you wanted to print the values of the list that is input via a config schema, you might do something like:
@op(config_schema={"table_name": list})
def read_tableNames(context):
    lst = context.op_config['table_name']
    context.log.info(f'--------------> {type(lst)}')
    context.log.info(f'--------------> {lst}')
And here's an example using two ops to do this data flow:
@op(config_schema={"table_name": list})
def read_tableNames(context):
    lst = context.op_config['table_name']
    return lst

@op
def print_tableNames(context, table_names):
    context.log.info(f'--------------> {type(table_names)}')

@job
def simple_flow():
    print_tableNames(read_tableNames())
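If you then want to run an op once per element of the list, as the question asks, Dagster's dynamic outputs are the usual fan-out mechanism. Here is a minimal sketch, assuming a Dagster version that provides DynamicOut / DynamicOutput; process_table and fan_out_flow are hypothetical names standing in for your per-table logic:

from dagster import DynamicOut, DynamicOutput, job, op

@op(config_schema={"table_name": list}, out=DynamicOut())
def read_tableNames(context):
    # Emit one dynamic output per table name so downstream ops fan out.
    for idx, name in enumerate(context.op_config["table_name"]):
        yield DynamicOutput(name, mapping_key=str(idx))

@op
def process_table(context, table_name):
    # Placeholder for whatever per-table work you need to do.
    context.log.info(f"processing {table_name}")

@job
def fan_out_flow():
    # .map() wires process_table to run once for each dynamic output.
    read_tableNames().map(process_table)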
Have a look at some of the Dagster tutorials for more examples

Related

Set the name for each ParallelFor iteration in KFP v2 on Vertex AI

I am currently using kfp.dsl.ParallelFor to train 300 models. It looks something like this:
...
models_to_train_op = get_models()
with dsl.ParallelFor(models_to_train_op.outputs["data"], parallelism=100) as item:
    prepare_data_op = prepare_data(item)
    train_model_op = train_model(prepare_data_op["train_data"])
...
Currently, the iterations in Vertex AI are labeled in a dropdown as something like for-loop-worker-0, for-loop-worker-1, and so on. For tasks (like prepare_data_op), there's a method called set_display_name. Is there a similar method that allows you to set the iteration name? It would be helpful to relate them to the training data so that it's easier to look through the dropdown UI that Vertex AI provides.
I reached out to a contact I have at Google. They recommended passing the list that is given to ParallelFor to set_display_name for each 'iteration' of the loop. When the pipeline is compiled, it'll know to set the corresponding iteration.
# Create component that returns a range list
model_list_op = model_list(n_models)

# Parallelize jobs
with dsl.ParallelFor(model_list_op.outputs["model_list"], parallelism=100) as x:
    x.set_display_name(str(model_list_op.outputs["model_list"]))

dry-validation: Case insensitive `included_in?` validation with Dry::Validation.Schema

I'm trying to create a validation for a predetermined list of valid brands as part of an ETL pipeline. My validation requires case insensitivity, as some brands are compound words or abbreviations for which capitalization is not significant.
I created a custom predicate, but I cannot figure out how to generate the appropriate error message.
I read the error messages doc, but am having a hard time interpreting:
How to build the syntax for my custom predicate?
Can I apply the messages in my schema class directly, without referencing an external .yml file? I looked here and it seems like it's not as straightforward as I'd hoped.
Below I've given code that represents what I have tried using both built-in predicates, and a custom one, each with their own issues. If there is a better way to compose a rule that achieves the same goal, I'd love to learn it.
require 'dry/validation'

CaseSensitiveSchema = Dry::Validation.Schema do
  BRANDS = %w(several hundred valid brands)

  # :included_in? from https://dry-rb.org/gems/dry-validation/basics/built-in-predicates/
  required(:brand).value(included_in?: BRANDS)
end

CaseInsensitiveSchema = Dry::Validation.Schema do
  BRANDS = %w(several hundred valid brands)

  configure do
    def in_brand_list?(value)
      BRANDS.include? value.downcase
    end
  end

  required(:brand).value(:in_brand_list?)
end
# A valid string if case insensitive
valid_product = {brand: 'Valid'}
CaseSensitiveSchema.call(valid_product).errors
# => {:brand=>["must be one of: here, are, some, valid, brands"]} # This message will be ridiculous when the full brand list is applied
CaseInsensitiveSchema.call(valid_product).errors
# => {} # Good!
invalid_product = {brand: 'Junk'}
CaseSensitiveSchema.call(invalid_product).errors
# => {:brand=>["must be one of: several, hundred, valid, brands"]} # Good... (Except this error message will contain the entire brand list!!!)
CaseInsensitiveSchema.call(invalid_product).errors
# => Dry::Validation::MissingMessageError: message for in_brand_list? was not found
# => from .. /gems/2.5.0/gems/dry-validation-0.12.2/lib/dry/validation/message_compiler.rb:116:in `visit_predicate'
The correct way to reference my error message was by the predicate method name. No need to worry about arg, value, etc.
en:
  errors:
    in_brand_list?: "must be in the master brands list"
Additionally, I was able to load this error message without a separate .yml by doing this:
CaseInsensitiveSchema = Dry::Validation.Schema do
  BRANDS = %w(several hundred valid brands)

  configure do
    def in_brand_list?(value)
      BRANDS.include? value.downcase
    end

    def self.messages
      super.merge({en: {errors: {in_brand_list?: "must be in the master brand list"}}})
    end
  end

  required(:brand).value(:in_brand_list?)
end
I'd still love to see other implementations, specifically for a generic case-insensitive predicate. Many people say dry-rb is fantastically organized, but I find it hard to follow.

How do I create a compound multi-index in rethinkdb?

I am using RethinkDB 1.10.1 with the official Python driver. I have a table of tagged things which are associated with one user:
{
    "id": "PK",
    "user_id": "USER_PK",
    "tags": ["list", "of", "strings"],
    // Other fields...
}
I want to query by user_id and tag (say, to find all the things by user "tawmas" with tag "tag"). Starting with Rethinkdb 1.10 I can create a multi-index like this:
r.table('things').index_create('tags', multi=True).run(conn)
My query would then be:
res = (r.table('things')
       .get_all('TAG', index='tags')
       .filter(r.row['user_id'] == 'USER_PK').run(conn))
However, this query still needs to scan all the documents with the given tag, so I would like to create a compound index based on the user_id and tags fields. Such an index would allow me to query with:
res = r.table('things').get_all(['USER_PK', 'TAG'], index='user_tags').run(conn)
There is nothing in the documentation about compound multi-indexes. However, I tried to use a custom index function combining the requirements for compound indexes and multi-indexes by returning a list of ["USER_PK", "tag"] pairs.
My first attempt was in python:
r.table('things').index_create(
    'user_tags',
    lambda each: [[each['user_id'], tag] for tag in each['tags']],
    multi=True).run(conn)
This makes the python driver choke with a MemoryError trying to parse the index function (I guess list comprehensions aren't really supported by the driver).
So, I turned to my (admittedly, rusty) javascript and came up with this:
r.table('things').index_create(
    'user_tags',
    r.js(
        """(function (each) {
            var result = [];
            var user_id = each["user_id"];
            var tags = each["tags"];
            for (var i = 0; i < tags.length; i++) {
                result.push([user_id, tags[i]]);
            }
            return result;
        })
        """),
    multi=True).run(conn)
This is rejected by the server with a curious exception: rethinkdb.errors.RqlRuntimeError: Could not prove function deterministic. Index functions must be deterministic.
So, what is the correct way to define a compound multi-index? Or is it something which is not supported at this time?
Short answer:
List comprehensions don't work in ReQL functions. You need to use map instead like so:
r.table('things').index_create(
    'user_tags',
    lambda each: each["tags"].map(lambda tag: [each['user_id'], tag]),
    multi=True).run(conn)
Long answer:
This is actually a somewhat subtle aspect of how RethinkDB drivers work. The reason this doesn't work is that your Python code never sees real copies of the documents bound to each. So in the expression:
lambda each: [[each['user_id'], tag] for tag in each['tags']]
each isn't ever bound to an actual document from your database; it's bound to a special Python variable which represents the document. I'd actually try running the following just to demonstrate it:
q = r.table('things').index_create(
    'user_tags',
    lambda each: print(each))  # only works in python 3
And it will print out something like:
<RqlQuery instance: var_1 >
The driver only knows that this is a variable from the function; in particular, it has no idea whether each["tags"] is an array or what (it's actually just another very similar abstract object). So Python doesn't know how to iterate over that field. Basically exactly the same problem exists in JavaScript.
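For completeness, here is a minimal end-to-end sketch of the map-based compound multi-index from the short answer, assuming an open connection conn and the things table from the question:

import rethinkdb as r

conn = r.connect('localhost', 28015)

# Build the index server-side: one [user_id, tag] entry per tag of each document.
r.table('things').index_create(
    'user_tags',
    lambda each: each['tags'].map(lambda tag: [each['user_id'], tag]),
    multi=True).run(conn)

# Once the index has finished building, the compound lookup from the
# question works as hoped:
res = list(r.table('things')
           .get_all(['USER_PK', 'TAG'], index='user_tags')
           .run(conn))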

How do you view the details of a submitted task with IPython Parallel?

I'm submitting tasks using a Load Balanced View.
I would like to be able to connect from a different client and view the remaining tasks by the function and parameters that were submitted.
For example:
def someFunc(parm1, parm2):
    return parm1 + parm2

lbv = client.load_balanced_view()

async_results = []
for parm1 in [0,1,2]:
    for parm2 in [0,1,2]:
        ar = lbv.apply_async(someFunc, parm1, parm2)
        async_results.append(ar)
From the client I submitted this from I can figure out which result went with which function call based on their order in the async_results array.
What I would like to know is: how can I figure out the function and parameters associated with a msg_id if I am retrieving the results from a different client, using the queue_status or history commands to get msg_ids and the client.get_result command to retrieve the results?
These things are pickled, and stored in the 'buffers' in the hub's database. If you want to look at them, you have to fetch those buffers from the database, and unpack them.
Assuming you have a list of msg_ids, here is a way that you can reconstruct the f, args, and kwargs for all of those requests:
# msg_ids is a list of msg_id, however you decide to get that
from IPython.zmq.serialize import unpack_apply_message

# load the buffers from the hub's database:
query = rc.db_query({'msg_id': {'$in': msg_ids}}, keys=['msg_id', 'buffers'])
# query is now a list of dicts with two keys - msg_id and buffers

# now we can generate a dict by msg_id of the original function, args, and kwargs:
requests = {}
for q in query:
    msg_id = q['msg_id']
    f, args, kwargs = unpack_apply_message(q['buffers'])
    requests[msg_id] = (f, args, kwargs)
From this, you should be able to associate tasks based on their function and args.
One Caveat: since f has been through pickling, often the comparison f is original_f will be False, so you have to do looser comparisons, such as f.__module__ + f.__name__ or similar.
For a bit more detail, here is an example that generates some requests, then reconstructs and associates them based on the function and arguments, given some prior knowledge of what the original requests may have looked like.
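As a rough sketch of that looser comparison, one way to group the reconstructed requests (this assumes the requests dict built above; func_key is just a hypothetical helper, not part of IPython):

def func_key(f):
    # Loose identity for an unpickled function: module plus name.
    return getattr(f, '__module__', '?') + '.' + getattr(f, '__name__', '?')

by_function = {}
for msg_id, (f, args, kwargs) in requests.items():
    by_function.setdefault(func_key(f), []).append((msg_id, args, kwargs))

# by_function now maps e.g. '__main__.someFunc' to its submitted (msg_id, args, kwargs) tuples.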

Django models are not ajax serializable

I have a simple view that I'm using to experiment with AJAX.
def get_shifts_for_day(request, year, month, day):
    data = dict()
    data['d'] = year
    data['e'] = month
    data['x'] = User.objects.all()[2]
    return HttpResponse(simplejson.dumps(data), mimetype='application/javascript')
This returns the following:
TypeError at /sched/shifts/2009/11/9/
<User: someguy> is not JSON serializable
If I take out the data['x'] line so that I'm not referencing any models it works and returns this:
{"e": "11", "d": "2009"}
Why can't simplejson serialize one of the default Django models? I get the same behavior with any model I use.
You just need to add, in your .dumps call, a default=encode_myway argument to let simplejson know what to do when you pass it data whose types it does not know -- the answer to your "why" question is of course that you haven't told poor simplejson what to DO with one of your models' instances.
And of course you need to write encode_myway to provide JSON-encodable data, e.g.:
def encode_myway(obj):
    if isinstance(obj, User):
        return [obj.username,
                obj.first_name,
                obj.last_name,
                obj.email]
        # and/or whatever else
    elif isinstance(obj, OtherModel):
        return []  # whatever
    elif ...
    else:
        raise TypeError(repr(obj) + " is not JSON serializable")
Basically, JSON knows about VERY elementary data types (strings, ints and floats, grouped into dicts and lists) -- it's YOUR responsibility as an application programmer to match everything else into/from such elementary data types, and in simplejson that's typically done through a function passed to default= at dump or dumps time.
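A minimal sketch of wiring this into the view from the question (encode_myway as defined above; the simplejson import and the mimetype keyword match the older Django API used in the question):

from django.contrib.auth.models import User
from django.http import HttpResponse
from django.utils import simplejson

def get_shifts_for_day(request, year, month, day):
    data = {
        'd': year,
        'e': month,
        'x': User.objects.all()[2],
    }
    # default= tells simplejson how to encode types it doesn't know about.
    return HttpResponse(simplejson.dumps(data, default=encode_myway),
                        mimetype='application/javascript')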
Alternatively, you can use the json serializer that's part of Django, see the docs.