Vertex AI Pipelines (Kubeflow) skip step with dependent outputs on later step - google-cloud-vertex-ai

I’m trying to run a Vertex AI Pipelines job where I skip a certain pipeline step if the value of a certain pipeline parameter (in this case do_task1) is False. But because there is another step that runs unconditionally and expects the output of the first potentially skipped step, I get the following error, independently of do_task1 being True or False:
AssertionError: component_input_artifact: pipelineparam--task1-output_path not found. All inputs: parameters {
key: "do_task1"
value {
type: STRING
}
}
parameters {
key: "task1_name"
value {
type: STRING
}
}
It seems like the compiler just cannot find the output output_path from task1. So I wonder if there is any way to have some sort of placeholders for the outputs of steps that are under a dsl.Condition, so that they get filled with default values unless the actual steps run and fill them with non-default values.
The code below represents the problem and is easily reproducible.
I'm using google-cloud-aiplatform==1.14.0 and kfp==1.8.11
from typing import NamedTuple

from kfp import dsl
from kfp.v2.dsl import Dataset, Input, OutputPath, component
from kfp.v2 import compiler
from google.cloud.aiplatform import pipeline_jobs


@component(
    base_image="python:3.9",
    packages_to_install=["pandas"]
)
def task1(
    # inputs
    task1_name: str,
    # outputs
    output_path: OutputPath("Dataset"),
) -> NamedTuple("Outputs", [("output_1", str), ("output_2", int)]):
    import pandas as pd
    output_1 = task1_name + "-processed"
    output_2 = 2
    df_output_1 = pd.DataFrame({"output_1": [output_1]})
    df_output_1.to_csv(output_path, index=False)
    return (output_1, output_2)


@component(
    base_image="python:3.9",
    packages_to_install=["pandas"]
)
def task2(
    # inputs
    task1_output: Input[Dataset],
) -> str:
    import pandas as pd
    task1_input = pd.read_csv(task1_output.path).values[0][0]
    return task1_input


@dsl.pipeline(
    pipeline_root='pipeline_root',
    name='pipelinename',
)
def pipeline(
    do_task1: bool,
    task1_name: str,
):
    with dsl.Condition(do_task1 == True):
        task1_op = (
            task1(
                task1_name=task1_name,
            )
        )
    task2_op = (
        task2(
            task1_output=task1_op.outputs["output_path"],
        )
    )


if __name__ == '__main__':
    do_task1 = True  # <------------ The variable to modify ---------------
    # compile pipeline
    compiler.Compiler().compile(
        pipeline_func=pipeline, package_path='pipeline.json')
    # create pipeline run
    pipeline_run = pipeline_jobs.PipelineJob(
        display_name='pipeline-display-name',
        pipeline_root='pipelineroot',
        job_id='pipeline-job-id',
        template_path='pipelinename.json',
        parameter_values={
            'do_task1': do_task1,  # pipeline compilation fails with either True or False values
            'task1_name': 'Task 1',
        },
        enable_caching=False
    )
    # execute pipeline run
    pipeline_run.run()
Any help is much appreciated!

The real issue here is that "with dsl.Condition():" creates a sub group, and task1_op is an inner task that is only "visible" from within that sub group. The latest SDK throws a more explicit error message saying that task2 cannot depend on any inner task.
So to resolve the issue, you just need to move task2 inside the condition. If the condition is not met, you don't have a valid input to feed into task2 anyway.
with dsl.Condition(do_task1 == True):
    task1_op = (
        task1(
            task1_name=task1_name,
        )
    )
    task2_op = (
        task2(
            task1_output=task1_op.outputs["output_path"],
        )
    )

Related

MyST-Parser: Auto linking / linkifying references to bug tracker issues

I use sphinx w/ MyST-Parser for markdown, and
I want GitHub or GitLab-style auto linking (linkfying) for references.
Is there a way to have MyST render the reference:
#346
In docutils-speak, this is a Text node (example)
And behave as if it was:
[#346](https://github.com/vcs-python/libvcs/pull/346)
So when rendered it'd be like:
#346
Not the custom role:
{issue}`1` <- Not this
Another example: Linkifying the reference @user to a GitHub, GitLab, StackOverflow user.
What I'm currently doing (and why it doesn't work)
Right now I'm using the canonical solution docutils offers: custom roles.
I use sphinx-issues (PyPI), which does just that. It uses a Sphinx setting variable, issues_github_path, to construct the URL:
e.g. in Sphinx configuration conf.py:
issues_github_path = 'vcs-python/libvcs'
reStructuredText:
:issue:`346`
MyST-Parser:
{issue}`346`
Why custom roles don't work
Sadly, those aren't bi-directional with GitHub/GitLab/tools. If you copy/paste MyST-Parser -> GitHub/GitLab or preview it directly, it looks very bad:
Example of CHANGES:
Example issue: https://github.com/vcs-python/libvcs/issues/363
What we want is to just be able to copy markdown including #347 to and from.
Does a solution already exist?
Are there any docutils or sphinx plugins out there that turn @username or #issues into links?
sphinx (at least) can demonstrably do so for custom roles, as seen in sphinx-issues' usage of issues_github_path, by using project configuration context.
MyST-Parser has a linkify extension which uses linkify-it-py
This can turn https://www.google.com into a clickable link without needing to write <https://www.google.com>.
Therefore, there may already be a tool out there.
Can it be done through the API?
The toolchain for myst, sphinx and docutils is robust. This is a special case.
This needs to be done at the Text node level. Custom role won't work - as stated above - since it'll create markdown that can't be copied between GitLab and GitHub issues trivially.
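To make the goal concrete, here is a minimal string-level sketch of the substitution being asked for. It is not a Sphinx or MyST transform; the function name linkify_issues and the issues URL layout are assumptions for illustration (GitHub generally redirects an issues URL to the pull request when the number belongs to one).

```python
import re

# Assumed URL layout for a GitHub project; not taken from any plugin.
ISSUE_URL = "https://github.com/{project}/issues/{number}"
# "#" followed by digits, skipped when preceded by a word character
# or by "[" (which would mean it is already a Markdown link label).
ISSUE_RE = re.compile(r"(?<![\w\[])#(\d+)")


def linkify_issues(text: str, project: str) -> str:
    """Replace bare ``#123`` references with Markdown links."""

    def to_link(match: "re.Match[str]") -> str:
        number = match.group(1)
        url = ISSUE_URL.format(project=project, number=number)
        return f"[#{number}]({url})"

    return ISSUE_RE.sub(to_link, text)


print(linkify_issues("Fixed in #346.", "vcs-python/libvcs"))
# Fixed in [#346](https://github.com/vcs-python/libvcs/issues/346).
```

A real solution has to apply the same idea to docutils Text nodes rather than to raw strings, so that literals and existing references are left alone.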
The stack:
MyST-Parser API (Markdown-it-py API) > Sphinx APIs (MySTParser + Sphinx) > Docutils API
At the time of writing, I'm using Sphinx 4.3.2, MyST-Parser 0.17.2, and docutils 0.17.1 on python 3.10.2.
Notes
For the sake of an example, I'm using an open source project of mine that is facing this issue.
This is only about autolinking issues or usernames - things that'd easily be mappable to URLs. autodoc code-linking is out of scope.
There is a (defunct) project that does this: sphinxcontrib-issuetracker.
I've rebooted it:
conf.py:
import sys
from pathlib import Path
cwd = Path(__file__).parent
project_root = cwd.parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(cwd / "_ext"))
extensions = [
    "link_issues",
]
# issuetracker
issuetracker = "github"
issuetracker_project = "cihai/unihan-etl" # e.g. for https://github.com/cihai/unihan-etl
_ext/link_issues.py:
"""Issue linking w/ plain-text autolinking, e.g. #42

Credit: https://github.com/ignatenkobrain/sphinxcontrib-issuetracker
License: BSD

Changes by Tony Narlock (2022-08-21):

- Type annotations
  mypy --strict, requires types-requests, types-docutils
  Python < 3.10 require typing-extensions
- TrackerConfig: Use dataclasses instead of typing.NamedTuple and hacking __new__
- app.warn (removed in 5.0) -> Use Sphinx Logging API
  https://www.sphinx-doc.org/en/master/extdev/logging.html#logging-api
- Add PendingIssueXRef
  Typing for tracker_config and precision
- Add IssueTrackerBuildEnvironment
  Subclassed / typed BuildEnvironment with .tracker_config
- Just GitHub (for demonstration)
"""
import dataclasses
import re
import sys
import time
import typing as t

import requests
from docutils import nodes
from sphinx.addnodes import pending_xref
from sphinx.application import Sphinx
from sphinx.config import Config
from sphinx.environment import BuildEnvironment
from sphinx.transforms import SphinxTransform
from sphinx.util import logging

if t.TYPE_CHECKING:
    if sys.version_info >= (3, 10):
        from typing import TypeGuard
    else:
        from typing_extensions import TypeGuard

logger = logging.getLogger(__name__)
GITHUB_API_URL = "https://api.github.com/repos/{0.project}/issues/{1}"


class IssueTrackerBuildEnvironment(BuildEnvironment):
    tracker_config: "TrackerConfig"
    issuetracker_cache: "IssueTrackerCache"
    github_rate_limit: t.Tuple[float, bool]


class Issue(t.NamedTuple):
    id: str
    title: str
    url: str
    closed: bool


IssueTrackerCache = t.Dict[str, Issue]


@dataclasses.dataclass
class TrackerConfig:
    project: str
    url: str

    """
    Issue tracker configuration.
    This class provides configuration for trackers, and is passed as
    ``tracker_config`` arguments to callbacks of
    :event:`issuetracker-lookup-issue`.
    """

    def __post_init__(self) -> None:
        if self.url is not None:
            self.url = self.url.rstrip("/")

    @classmethod
    def from_sphinx_config(cls, config: Config) -> "TrackerConfig":
        """
        Get tracker configuration from ``config``.
        """
        project = config.issuetracker_project or config.project
        url = config.issuetracker_url
        return cls(project=project, url=url)


class PendingIssueXRef(pending_xref):
    tracker_config: TrackerConfig
class IssueReferences(SphinxTransform):
    default_priority = 999

    def apply(self) -> None:
        config = self.document.settings.env.config
        tracker_config = TrackerConfig.from_sphinx_config(config)
        issue_pattern = config.issuetracker_issue_pattern
        title_template = None
        if isinstance(issue_pattern, str):
            issue_pattern = re.compile(issue_pattern)
        for node in self.document.traverse(nodes.Text):
            parent = node.parent
            if isinstance(parent, (nodes.literal, nodes.FixedTextElement)):
                # ignore inline and block literal text
                continue
            if isinstance(parent, nodes.reference):
                continue
            text = str(node)
            new_nodes = []
            last_issue_ref_end = 0
            for match in issue_pattern.finditer(text):
                # catch invalid pattern with too many groups
                if len(match.groups()) != 1:
                    raise ValueError(
                        "issuetracker_issue_pattern must have "
                        "exactly one group: {0!r}".format(match.groups())
                    )
                # extract the text between the last issue reference and the
                # current issue reference and put it into a new text node
                head = text[last_issue_ref_end : match.start()]
                if head:
                    new_nodes.append(nodes.Text(head))
                # adjust the position of the last issue reference in the
                # text
                last_issue_ref_end = match.end()
                # extract the issue text (including the leading dash)
                issuetext = match.group(0)
                # extract the issue number (excluding the leading dash)
                issue_id = match.group(1)
                # turn the issue reference into a reference node
                refnode = PendingIssueXRef()
                refnode["refdomain"] = None
                refnode["reftarget"] = issue_id
                refnode["reftype"] = "issue"
                refnode["trackerconfig"] = tracker_config
                reftitle = title_template or issuetext
                refnode.append(
                    nodes.inline(issuetext, reftitle, classes=["xref", "issue"])
                )
                new_nodes.append(refnode)
            if not new_nodes:
                # no issue references were found, move on to the next node
                continue
            # extract the remaining text after the last issue reference, and
            # put it into a text node
            tail = text[last_issue_ref_end:]
            if tail:
                new_nodes.append(nodes.Text(tail))
            # find and remove the original node, and insert all new nodes
            # instead
            parent.replace(node, new_nodes)
def is_issuetracker_env(
    env: t.Any,
) -> "TypeGuard['IssueTrackerBuildEnvironment']":
    return hasattr(env, "issuetracker_cache") and env.issuetracker_cache is not None


def lookup_issue(
    app: Sphinx, tracker_config: TrackerConfig, issue_id: str
) -> t.Optional[Issue]:
    """
    Lookup the given issue.
    The issue is first looked up in an internal cache. If it is not found, the
    event ``issuetracker-lookup-issue`` is emitted. The result of this
    invocation is then cached and returned.
    ``app`` is the sphinx application object. ``tracker_config`` is the
    :class:`TrackerConfig` object representing the issue tracker configuration.
    ``issue_id`` is a string containing the issue id.
    Return a :class:`Issue` object for the issue with the given ``issue_id``,
    or ``None`` if the issue wasn't found.
    """
    env = app.env
    if is_issuetracker_env(env):
        cache: IssueTrackerCache = env.issuetracker_cache
        if issue_id not in cache:
            issue = app.emit_firstresult(
                "issuetracker-lookup-issue", tracker_config, issue_id
            )
            cache[issue_id] = issue
        return cache[issue_id]
    return None


def lookup_issues(app: Sphinx, doctree: nodes.document) -> None:
    """
    Lookup issues found in the given ``doctree``.
    Each issue reference in the given ``doctree`` is looked up. Each lookup
    result is cached by mapping the referenced issue id to the looked up
    :class:`Issue` object (an existing issue) or ``None`` (a missing issue).
    The cache is available at ``app.env.issuetracker_cache`` and is pickled
    along with the environment.
    """
    for node in doctree.traverse(PendingIssueXRef):
        if node["reftype"] == "issue":
            lookup_issue(app, node["trackerconfig"], node["reftarget"])
def make_issue_reference(issue: Issue, content_node: nodes.inline) -> nodes.reference:
    """
    Create a reference node for the given issue.
    ``content_node`` is a docutils node which is supposed to be added as
    content of the created reference. ``issue`` is the :class:`Issue` which
    the reference shall point to.
    Return a :class:`docutils.nodes.reference` for the issue.
    """
    reference = nodes.reference()
    reference["refuri"] = issue.url
    if issue.title:
        reference["reftitle"] = issue.title
    if issue.closed:
        content_node["classes"].append("closed")
    reference.append(content_node)
    return reference


def resolve_issue_reference(
    app: Sphinx, env: BuildEnvironment, node: PendingIssueXRef, contnode: nodes.inline
) -> t.Optional[nodes.reference]:
    """
    Resolve an issue reference and turn it into a real reference to the
    corresponding issue.
    ``app`` and ``env`` are the Sphinx application and environment
    respectively. ``node`` is a ``pending_xref`` node representing the missing
    reference. It is expected to have the following attributes:
    - ``reftype``: The reference type
    - ``trackerconfig``: The :class:`TrackerConfig` to use for this node
    - ``reftarget``: The issue id
    - ``classes``: The node classes
    References with a ``reftype`` other than ``'issue'`` are skipped by
    returning ``None``. Otherwise the new node is returned.
    If the referenced issue was found, a real reference to this issue is
    returned. The text of this reference is formatted with the :class:`Issue`
    object available in the ``issue`` key. The reference title is set to the
    issue title. If the issue is closed, the class ``closed`` is added to the
    new content node.
    Otherwise, if the issue was not found, the content node is returned.
    """
    if node["reftype"] != "issue":
        return None
    issue = lookup_issue(app, node["trackerconfig"], node["reftarget"])
    if issue is None:
        return contnode
    else:
        classes = contnode["classes"]
        conttext = str(contnode[0])
        formatted_conttext = nodes.Text(conttext.format(issue=issue))
        formatted_contnode = nodes.inline(conttext, formatted_conttext, classes=classes)
        assert issue is not None
        return make_issue_reference(issue, formatted_contnode)
    return None
def init_cache(app: Sphinx) -> None:
    if not hasattr(app.env, "issuetracker_cache"):
        app.env.issuetracker_cache: "IssueTrackerCache" = {}  # type: ignore
    return None


def check_project_with_username(tracker_config: TrackerConfig) -> None:
    if "/" not in tracker_config.project:
        raise ValueError(
            "username missing in project name: {0.project}".format(tracker_config)
        )


HEADERS = {"User-Agent": "sphinxcontrib-issuetracker v{0}".format("1.0")}


def get(app: Sphinx, url: str) -> t.Optional[requests.Response]:
    """
    Get a response from the given ``url``.
    ``url`` is a string containing the URL to request via GET. ``app`` is the
    Sphinx application object.
    Return the :class:`~requests.Response` object on status code 200, or
    ``None`` otherwise. If the status code is not 200 or 404, a warning is
    emitted via ``app``.
    """
    response = requests.get(url, headers=HEADERS)
    if response.status_code == requests.codes.ok:
        return response
    elif response.status_code != requests.codes.not_found:
        msg = "GET {0.url} failed with code {0.status_code}"
        logger.warning(msg.format(response))
    return None


def lookup_github_issue(
    app: Sphinx, tracker_config: TrackerConfig, issue_id: str
) -> t.Optional[Issue]:
    check_project_with_username(tracker_config)
    env = app.env
    if is_issuetracker_env(env):
        # Get rate limit information from the environment
        timestamp, limit_hit = getattr(env, "github_rate_limit", (0, False))
        if limit_hit and time.time() - timestamp > 3600:
            # Github limits applications hourly
            limit_hit = False
        if not limit_hit:
            url = GITHUB_API_URL.format(tracker_config, issue_id)
            response = get(app, url)
            if response:
                rate_remaining = response.headers.get("X-RateLimit-Remaining")
                assert rate_remaining is not None
                if rate_remaining.isdigit() and int(rate_remaining) == 0:
                    logger.warning("Github rate limit hit")
                    env.github_rate_limit = (time.time(), True)
                issue = response.json()
                closed = issue["state"] == "closed"
                return Issue(
                    id=issue_id,
                    title=issue["title"],
                    closed=closed,
                    url=issue["html_url"],
                )
        else:
            logger.warning(
                "Github rate limit exceeded, not resolving issue {0}".format(issue_id)
            )
    return None
BUILTIN_ISSUE_TRACKERS: t.Dict[str, t.Any] = {
    "github": lookup_github_issue,
}


def init_transformer(app: Sphinx) -> None:
    if app.config.issuetracker_plaintext_issues:
        app.add_transform(IssueReferences)


def connect_builtin_tracker(app: Sphinx) -> None:
    if app.config.issuetracker:
        tracker = BUILTIN_ISSUE_TRACKERS[app.config.issuetracker.lower()]
        app.connect(str("issuetracker-lookup-issue"), tracker)


def setup(app: Sphinx) -> t.Dict[str, t.Any]:
    app.add_config_value("mybase", "https://github.com/cihai/unihan-etl", "env")
    app.add_event(str("issuetracker-lookup-issue"))
    app.connect(str("builder-inited"), connect_builtin_tracker)
    app.add_config_value("issuetracker", None, "env")
    app.add_config_value("issuetracker_project", None, "env")
    app.add_config_value("issuetracker_url", None, "env")
    # configuration specific to plaintext issue references
    app.add_config_value("issuetracker_plaintext_issues", True, "env")
    app.add_config_value(
        "issuetracker_issue_pattern",
        re.compile(
            r"#(\d+)",
        ),
        "env",
    )
    app.add_config_value("issuetracker_title_template", None, "env")
    app.connect(str("builder-inited"), init_cache)
    app.connect(str("builder-inited"), init_transformer)
    app.connect(str("doctree-read"), lookup_issues)
    app.connect(str("missing-reference"), resolve_issue_reference)
    return {
        "version": "1.0",
        "parallel_read_safe": True,
        "parallel_write_safe": True,
    }
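The core of IssueReferences.apply is the head / reference / tail splitting of each text node. The sketch below isolates that step with plain tuples instead of docutils nodes, so it can be tried without Sphinx; split_issue_refs is a hypothetical helper name, and the pattern mirrors the default registered in setup().

```python
import re
import typing as t

ISSUE_PATTERN = re.compile(r"#(\d+)")  # same shape as the default issuetracker_issue_pattern


def split_issue_refs(text: str) -> t.List[t.Tuple[str, str]]:
    """Split text into ("text", chunk) and ("issue", id) segments, in order."""
    segments: t.List[t.Tuple[str, str]] = []
    last_end = 0
    for match in ISSUE_PATTERN.finditer(text):
        # Plain text between the previous reference and this one
        head = text[last_end:match.start()]
        if head:
            segments.append(("text", head))
        # The issue id without the leading "#"
        segments.append(("issue", match.group(1)))
        last_end = match.end()
    # Whatever remains after the last reference
    tail = text[last_end:]
    if tail:
        segments.append(("text", tail))
    return segments


print(split_issue_refs("See #12 and #34 for details"))
# [('text', 'See '), ('issue', '12'), ('text', ' and '), ('issue', '34'), ('text', ' for details')]
```

In the extension, each "issue" segment becomes a PendingIssueXRef node and each "text" segment a nodes.Text, and the list replaces the original node in its parent.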
Mirrors
https://gist.github.com/tony/05a3043d97d37c158763fb2f6a2d5392
https://github.com/ignatenkobrain/sphinxcontrib-issuetracker/issues/25
Mypy users
mypy --strict docs/_ext/link_issues.py works as of mypy 0.971
If you use mypy: pip install types-docutils types-requests
Install:
https://pypi.org/project/types-docutils/
https://pypi.org/project/types-requests/
https://pypi.org/project/typing-extensions/ (Python <3.10)
Example
via unihan-etl#261 / v0.17.2 (source, view, but page may be outdated)

Streamlit Unhashable TypeError when i use st.cache

When I use the st.cache decorator to cache a Hugging Face transformer model, I get an
Unhashable TypeError.
This is the code:
from transformers import pipeline
import streamlit as st
from io import StringIO

@st.cache(hash_funcs={StringIO: StringIO.getvalue})
def model():
    return pipeline("sentiment-analysis", model='akhooli/xlm-r-large-arabic-sent')
After searching the issues section of the Streamlit repo,
I found that the hashing argument is not required; I just needed to pass this argument:
allow_output_mutation = True
This worked for me:
from transformers import pipeline
import tokenizers
import streamlit as st
import copy

@st.cache(hash_funcs={tokenizers.Tokenizer: lambda _: None, tokenizers.AddedToken: lambda _: None})
def get_model():
    return pipeline("sentiment-analysis", model='akhooli/xlm-r-large-arabic-sent')

input = st.text_input('Text')
bt = st.button("Get Sentiment Analysis")
if bt and input:
    model = copy.deepcopy(get_model())
    st.write(model(input))
Note 1:
Calling the pipeline with input, model(input), changes the model, and we shouldn't change a cached value, so we need to copy the model and run it on the copy.
Note 2:
The first run loads the model using the get_model function; subsequent runs use the cache.
Note 3:
You can read more about advanced caching in Streamlit in their documentation.
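The point of Note 1 can be illustrated without Streamlit. The sketch below uses functools.lru_cache purely as a stand-in for st.cache (it is not how Streamlit caches internally), and get_settings is a hypothetical loader replacing the model.

```python
import copy
import functools


@functools.lru_cache(maxsize=None)
def get_settings() -> dict:
    """Stand-in for an expensive loader; every caller receives the same object."""
    return {"labels": ["positive", "negative"]}


# Mutating the cached object directly leaks into every later call.
get_settings()["labels"].append("oops")
print(get_settings()["labels"])  # ['positive', 'negative', 'oops']

# Start again from a clean cache, then work on a deep copy instead.
get_settings.cache_clear()
local = copy.deepcopy(get_settings())
local["labels"].append("neutral")
print(get_settings()["labels"])  # ['positive', 'negative']
```

The deep copy costs time and memory for a large model, so this is a trade-off between safety of the cached value and per-request overhead.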

Validate Python TypedDict at runtime

I'm working in a Python 3.8+ Django/Rest-Framework environment enforcing types in new code but built on a lot of untyped legacy code and data. We are using TypedDicts extensively for ensuring that data we are generating passes to our TypeScript front-end with the proper data type.
MyPy/PyCharm/etc. do a great job of checking that our new code spits out data that conforms, but we want to test that the output of our many RestSerializers/ModelSerializers fits the TypedDict. If I have a serializer and typed dict like:
class PersonSerializer(ModelSerializer):
    class Meta:
        model = Person
        fields = ['first', 'last']

class PersonData(TypedDict):
    first: str
    last: str
    email: str
and then run code like:
person_dict: PersonData = PersonSerializer(Person.objects.first()).data
Static type checkers won't be able to figure out that person_dict is missing the required email key, because (by design of PEP-589) it is just a normal dict. But I can write something like:
annotations = PersonData.__annotations__
for k in annotations:
    assert k in person_dict  # or something more complex.
    assert isinstance(person_dict[k], annotations[k])
and it will find that email is missing from the data of the serializer. This is well and good in this case, where I don't have any changes introduced by from __future__ import annotations (not sure if this would break it), and all my type annotations are bare types. But if PersonData were defined like:
class PersonData(TypedDict):
    email: Optional[str]
    affiliations: Union[List[str], Dict[int, str]]
then isinstance is not good enough to check if the data passes (since "Subscripted generics cannot be used with class and instance checks").
What I'm wondering is if there already exists a callable function/method (in mypy or another checker) that would allow me to validate a TypedDict (or even a single variable, since I can iterate a dict myself) against an annotation and see if it validates?
I'm not concerned about speed, etc., since the point of this is to check all our data/methods/functions once and then remove the checks later once we're happy that our current data validates.
The simplest solution I found works using pydantic.
from typing import cast, TypedDict
import pydantic

class SomeDict(TypedDict):
    val: int
    name: str

# this could be a valid/invalid declaration
obj: SomeDict = {
    'val': 12,
    'name': 'John',
}

# validate with pydantic
try:
    obj = cast(SomeDict, pydantic.create_model_from_typeddict(SomeDict)(**obj).dict())
except pydantic.ValidationError as exc:
    print(f"ERROR: Invalid schema: {exc}")
EDIT: When type checking this, it currently returns an error, but works as expected. See here: https://github.com/samuelcolvin/pydantic/issues/3008
You may want to have a look at https://pypi.org/project/strongtyping/. This may help.
In the docs you can find this example:
from typing import List, TypedDict
from strongtyping.strong_typing import match_class_typing

@match_class_typing
class SalesSummary(TypedDict):
    sales: int
    country: str
    product_codes: List[str]

# works like expected
SalesSummary({"sales": 10, "country": "Foo", "product_codes": ["1", "2", "3"]})
# will raise a TypeMisMatch
SalesSummary({"sales": "Foo", "country": 10, "product_codes": [1, 2, 3]})
A little bit of a hack, but you can check two types using mypy command line -c options. Just wrap it in a python function:
import subprocess

def is_assignable(type_to, type_from) -> bool:
    """
    Returns true if `type_from` can be assigned to `type_to`,
    e. g. type_to := type_from

    Example:
    >>> is_assignable(bool, str)
    False
    >>> from typing import *
    >>> is_assignable(Union[List[str], Dict[int, str]], List[str])
    True
    """
    code = "\n".join((
        f"import typing",
        f"type_to: {type_to}",
        f"type_from: {type_from}",
        f"type_to = type_from",
    ))
    return subprocess.call(("mypy", "-c", code)) == 0
You could do something like this:
from typing import Any

def validate(typ: Any, instance: Any) -> bool:
    for property_name, property_type in typ.__annotations__.items():
        value = instance.get(property_name, None)
        if value is None:
            # Check for missing keys
            print(f"Missing key: {property_name}")
            return False
        elif property_type not in (int, float, bool, str):
            # check if property_type is object (e.g. not a primitive)
            result = validate(property_type, value)
            if result is False:
                return False
        elif not isinstance(value, property_type):
            # Check for type equality
            print(f"Wrong type: {property_name}. Expected {property_type}, got {type(value)}")
            return False
    return True
And then test some object, e.g. one that was passed to your REST endpoint:
class MySubModel(TypedDict):
    subfield: bool

class MyModel(TypedDict):
    first: str
    last: str
    email: str
    sub: MySubModel

m = {
    'email': 'JohnDoeAtDoeishDotCom',
    'first': 'John'
}
assert validate(MyModel, m) is False
This one prints the first error and returns bool, you could change that to exceptions, possibly with all the missing keys. You could also extend it to fail on additional keys than defined by the model.
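The hand-rolled approach can also be stretched to the subscripted annotations from the question by using typing.get_origin and typing.get_args (Python 3.8+). The following is only a sketch under stated assumptions: matches, is_typed_dict and validate_typed_dict are hypothetical names, every key is treated as required, and only Optional/Union, List, Dict and nested TypedDicts are handled (Tuple, Literal and similar constructs are not).

```python
from typing import Any, Dict, List, Optional, TypedDict, Union
from typing import get_args, get_origin, get_type_hints


def is_typed_dict(tp: Any) -> bool:
    """Heuristic check: TypedDict classes are dict subclasses carrying __total__."""
    return isinstance(tp, type) and issubclass(tp, dict) and hasattr(tp, "__total__")


def matches(value: Any, annotation: Any) -> bool:
    """Return True if value conforms to a (possibly subscripted) annotation."""
    if annotation is Any:
        return True
    origin, args = get_origin(annotation), get_args(annotation)
    if origin is Union:  # also covers Optional[X], i.e. Union[X, None]
        return any(matches(value, arg) for arg in args)
    if origin is list:
        return isinstance(value, list) and (
            not args or all(matches(item, args[0]) for item in value)
        )
    if origin is dict:
        return isinstance(value, dict) and (
            not args
            or all(matches(k, args[0]) and matches(v, args[1]) for k, v in value.items())
        )
    if is_typed_dict(annotation):
        return not validate_typed_dict(annotation, value)
    return isinstance(value, annotation)  # plain classes such as str, int, NoneType


def validate_typed_dict(typed_dict: Any, data: Any) -> List[str]:
    """Return a list of problems; an empty list means data conforms."""
    if not isinstance(data, dict):
        return [f"expected a dict, got {type(data).__name__}"]
    problems = []
    # get_type_hints also resolves string annotations, e.g. those produced
    # by "from __future__ import annotations"
    for key, annotation in get_type_hints(typed_dict).items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not matches(data[key], annotation):
            problems.append(f"wrong type for key: {key}")
    return problems


class PersonData(TypedDict):
    email: Optional[str]
    affiliations: Union[List[str], Dict[int, str]]


print(validate_typed_dict(PersonData, {"email": None, "affiliations": ["a", "b"]}))  # []
print(validate_typed_dict(PersonData, {"affiliations": {1: 2}}))
# ['missing key: email', 'wrong type for key: affiliations']
```

Returning a list of problems rather than a bool makes it easy to report everything in one pass or to raise once at the end.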
I like your solution! So that the user sees all errors at once instead of fixing them one iteration at a time, I added some code to your solution :D
from typing import Any

def validate_custom_typed_dict(instance: Any, custom_typed_dict: Any) -> bool:
    key_errors = []
    type_errors = []
    for property_name, type_ in custom_typed_dict.__annotations__.items():
        value = instance.get(property_name, None)
        if value is None:
            # Check for missing keys
            key_errors.append(f"\t- Missing property: '{property_name}' \n")
        elif type_ not in (int, float, bool, str):
            # check if type is object (e.g. not a primitive)
            result = validate_custom_typed_dict(value, type_)
            if result is False:
                type_errors.append(f"\t- '{property_name}' expected {type_}, got {type(value)}\n")
        elif not isinstance(value, type_):
            # Check for type equality
            type_errors.append(f"\t- '{property_name}' expected {type_}, got {type(value)}\n")
    if len(key_errors) > 0 or len(type_errors) > 0:
        error_message = f'\n{"".join(key_errors)}{"".join(type_errors)}'
        raise Exception(error_message)
    return True
some console output:
Exception:
- Missing property: 'Combined_cycle'
- Missing property: 'Solar_PV'
- Missing property: 'Hydro'
- 'timestamp' expected <class 'str'>, got <class 'int'>
- 'Diesel_engines' expected <class 'float'>, got <class 'int'>

Airflow Failed: ParseException line 2:0 cannot recognize input near

I'm trying to run a test task on Airflow but I keep getting the following error:
FAILED: ParseException 2:0 cannot recognize input near 'create_import_table_fct_latest_values' '.' 'hql'
Here is my Airflow Dag file:
import airflow
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.models import DAG

args = {
    'owner': 'raul',
    'start_date': datetime(2018, 11, 12),
    'provide_context': True,
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email': ['raul.gregglino@leroymerlin.ru'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG('opus_data',
          default_args=args,
          max_active_runs=6,
          schedule_interval="@daily"
          )

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='create_import_table_fct_latest_values.hql ',
    hiveconf_jinja_translate=True,
    dag=dag
)

deps = {}

# Explicitly define the dependencies in the DAG
for downstream, upstream_list in deps.iteritems():
    for upstream in upstream_list:
        dag.set_dependency(upstream, downstream)
Here is the content of my HQL file, in case this may be the issue and I can't figure:
*I'm testing the connection to understand if the table is created or not, then I'll try to LOAD DATA, hence the LOAD DATA is commented out.
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED ',';
#LOAD DATA LOCAL INPATH
#'/media/windows_share/schemas/opus/fct_latest_values_20181106.csv'
#OVERWRITE INTO TABLE opus_data.fct_latest_values_new_data;
In the HQL file it should be FIELDS TERMINATED BY ',':
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
And comments should start with -- in HQL file, not #
Also, this looks incorrect and may be what causes the exception (note the trailing space after .hql): hql='create_import_table_fct_latest_values.hql '
Have a look at this example:
#Create full path for the file
hql_file_path = os.path.join(os.path.dirname(__file__), source['hql'])
print hql_file_path
run_hive_query = HiveOperator(
    task_id='run_hive_query',
    dag = dag,
    hql = """
    {{ local_hive_settings }}
    """ + "\n " + open(hql_file_path, 'r').read()
)
See here for more details.
Or put all HQL into hql parameter:
hql='CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data ...'
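One plausible reason the trailing space in the original hql argument matters: as far as I understand, Airflow decides whether a templated field names a template file by checking whether the string ends with one of the operator's template extensions. The sketch below is a simplified model of that check, not Airflow code; is_template_file is a hypothetical helper, and the extension tuple is assumed to match what HiveOperator declares.

```python
# Extensions assumed to be treated as template files for Hive operators.
TEMPLATE_EXT = (".hql", ".sql")


def is_template_file(hql: str) -> bool:
    """Hypothetical helper: True if the value would be read as a template file name."""
    return hql.endswith(TEMPLATE_EXT)


# With the trailing space the suffix check fails, so the file name itself
# would be sent to Hive as the query text.
print(is_template_file("create_import_table_fct_latest_values.hql "))  # False
print(is_template_file("create_import_table_fct_latest_values.hql"))   # True
```

That would be consistent with the ParseException quoting the file name tokens rather than anything from the file's contents.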
I managed to find the answer to my issue.
It was related to the path my HiveOperator was calling the file from. As no Variable had been defined to tell Airflow where to look, I was getting the error I mentioned in my post.
Once I defined it using the webserver interface (see picture), my DAG started to work properly.
I also changed the file location in my DAG code, purely for organization, and this is how my HiveOperator looks now:
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='hql/create_import_table_fct_latest_values2.hql',
    hiveconf_jinja_translate=True,
    dag=dag
)
Thanks to @panov.st, who helped me in person to identify my issue.

Return status of python unittest

I'm trying to call a unittest from another python file, and evaluate the exit code. I was able to use unittest.TestLoader().loadTestsFromModule and unittest.TextTestRunner.run to call the unittest from another python file, but that's returning the entire results to the cmd. I would like to simply set a variable equal to the status code so I can evaluate it. I was able to find a method unittest.TestResult.wasSuccessful, but I'm having trouble implementing it. When I add it to the use case, I get the following AttributeError: AttributeError: 'ConnectionTest' object has no attribute 'failures'
I've included some code samples below and a mockup of the desired result as an illustration of what I'm trying to achieve. Thank you in advance.
""" Tests/ConnectionTest.py """
import unittest
from Connection import Connection

class ConnectionTest(unittest.TestCase):
    def test_connection(self):
        #my tests
        pass

    def test_pass(self):
        return unittest.TestResult.wasSuccessful(self)

if __name__ == '__main__':
    unittest.main()

""" StatusTest.py """
import unittest
import Tests.ConnectionTest as test
#import Tests.Test2 as test2
#import Tests.Test3 as test3
#import other unit tests ...

suite = unittest.TestLoader().loadTestsFromModule(test)
unittest.TextTestRunner(verbosity=2).run(suite)

""" Return True if unit test passed
"""
def test_passed(test):
    if test.test_pass() == 0:
        return True
    else:
        return False

""" Run unittest for each module before using it in code
"""
def main():
    tests = "test test2 test3".split()
    for test in tests:
        if test_passed(test):
            # do something
            pass
        else:
            # log failure
            pass
Update
To put the question more simply, I need to set the highlighted variable below to the highlighted value.
You mentioned you tried implementing result.wasSuccessful, but would something like the following work:
result = unittest.TextTestRunner(verbosity=2).run(suite)
test_exit_code = int(not result.wasSuccessful())
The value of test_exit_code would then be either 0 when the test suite ran successfully or 1 otherwise.
If you want to disable the output of the TextTestRunner you can specify your own stream, such as:
from io import StringIO
result = unittest.TextTestRunner(stream=StringIO(), verbosity=2).run(suite)
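Putting both snippets together, a self-contained sketch could look like the following. ExampleTest is a placeholder standing in for ConnectionTest, and run_tests is a hypothetical helper name.

```python
import unittest
from io import StringIO


class ExampleTest(unittest.TestCase):
    """Placeholder test case standing in for ConnectionTest."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run_tests(test_case) -> int:
    """Run one TestCase class quietly and return 0 on success, 1 otherwise."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case)
    # Send the runner's report to an in-memory buffer instead of the console
    result = unittest.TextTestRunner(stream=StringIO(), verbosity=2).run(suite)
    return int(not result.wasSuccessful())


test_exit_code = run_tests(ExampleTest)
print(test_exit_code)  # 0
```

The result object returned by run() also exposes failures and errors lists if more detail than a pass/fail code is needed.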

Resources