Airflow Failed: ParseException line 2:0 cannot recognize input near - hadoop

I'm trying to run a test task on Airflow but I keep getting the following error:
FAILED: ParseException 2:0 cannot recognize input near 'create_import_table_fct_latest_values' '.' 'hql'
Here is my Airflow DAG file:
import airflow
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.models import DAG

args = {
    'owner': 'raul',
    'start_date': datetime(2018, 11, 12),
    'provide_context': True,
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email': ['raul.gregglino@leroymerlin.ru'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG('opus_data',
          default_args=args,
          max_active_runs=6,
          schedule_interval="@daily"
          )

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='create_import_table_fct_latest_values.hql ',
    hiveconf_jinja_translate=True,
    dag=dag
)

deps = {}

# Explicitly define the dependencies in the DAG
for downstream, upstream_list in deps.iteritems():
    for upstream in upstream_list:
        dag.set_dependency(upstream, downstream)
Here is the content of my HQL file, in case this is the issue (I can't figure it out):
(I'm testing the connection to check whether the table gets created or not; then I'll try to LOAD DATA, which is why the LOAD DATA statement is commented out.)
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED ',';
#LOAD DATA LOCAL INPATH
#'/media/windows_share/schemas/opus/fct_latest_values_20181106.csv'
#OVERWRITE INTO TABLE opus_data.fct_latest_values_new_data;

In the HQL file it should be FIELDS TERMINATED BY ',':
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
id_product STRING,
id_model STRING,
id_attribute STRING,
attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Also, comments in an HQL file should start with --, not #.
This also seems incorrect and is causing the exception: hql='create_import_table_fct_latest_values.hql '
Have a look at this example:
import os

# Create full path for the file
hql_file_path = os.path.join(os.path.dirname(__file__), source['hql'])
print(hql_file_path)

run_hive_query = HiveOperator(
    task_id='run_hive_query',
    dag=dag,
    hql="""
    {{ local_hive_settings }}
    """ + "\n " + open(hql_file_path, 'r').read()
)
See here for more details.
Or put all the HQL directly into the hql parameter:
hql='CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data ...'

I managed to find the answer to my issue.
It was related to the path from which my HiveOperator was calling the file. Since no Variable had been defined to tell Airflow where to look, I was getting the error I mentioned in my post.
Once I defined it using the webserver interface (see picture), my DAG started to work properly.
I made a change to my DAG code regarding the file location (for organization only), and this is how my HiveOperator looks now:
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='hql/create_import_table_fct_latest_values2.hql',
    hiveconf_jinja_translate=True,
    dag=dag
)
Thanks to @panov.st, who helped me in person to identify my issue.
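For completeness, a code-level alternative to the UI-defined setting (a minimal sketch only, assuming the .hql files sit in an hql/ folder next to the DAG file, which is not necessarily the poster's layout) is to let the DAG itself tell Airflow where to search for templates via template_searchpath:
import os
from airflow.models import DAG

# Sketch only: template_searchpath points Airflow's template loader at the folder
# holding the .hql files, so hql='create_import_table_fct_latest_values.hql'
# can be resolved without an absolute path. The 'hql' folder name is an assumption.
dag = DAG('opus_data',
          default_args=args,  # same args dict as in the question
          max_active_runs=6,
          schedule_interval="@daily",
          template_searchpath=os.path.join(os.path.dirname(__file__), 'hql')
          )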

Related

Vertex AI Pipelines (Kubeflow) skip step with dependent outputs on later step

I'm trying to run a Vertex AI Pipelines job where I skip a certain pipeline step if the value of a certain pipeline parameter (in this case do_task1) is False. But because there is another step that runs unconditionally and expects the output of the first, potentially skipped, step, I get the following error regardless of do_task1 being True or False:
AssertionError: component_input_artifact: pipelineparam--task1-output_path not found. All inputs: parameters {
key: "do_task1"
value {
type: STRING
}
}
parameters {
key: "task1_name"
value {
type: STRING
}
}
It seems like the compiler just cannot find the output output_path from task1. So I wonder if there is any way to have some sort of placeholder for the outputs of the steps that sit under a dsl.Condition, so that they get filled with default values unless the actual steps run and fill them with non-default values.
The code below represents the problem and is easily reproducible.
I'm using google-cloud-aiplatform==1.14.0 and kfp==1.8.11
from typing import NamedTuple
from kfp import dsl
from kfp.v2.dsl import Dataset, Input, OutputPath, component
from kfp.v2 import compiler
from google.cloud.aiplatform import pipeline_jobs


@component(
    base_image="python:3.9",
    packages_to_install=["pandas"]
)
def task1(
    # inputs
    task1_name: str,
    # outputs
    output_path: OutputPath("Dataset"),
) -> NamedTuple("Outputs", [("output_1", str), ("output_2", int)]):
    import pandas as pd
    output_1 = task1_name + "-processed"
    output_2 = 2
    df_output_1 = pd.DataFrame({"output_1": [output_1]})
    df_output_1.to_csv(output_path, index=False)
    return (output_1, output_2)


@component(
    base_image="python:3.9",
    packages_to_install=["pandas"]
)
def task2(
    # inputs
    task1_output: Input[Dataset],
) -> str:
    import pandas as pd
    task1_input = pd.read_csv(task1_output.path).values[0][0]
    return task1_input


@dsl.pipeline(
    pipeline_root='pipeline_root',
    name='pipelinename',
)
def pipeline(
    do_task1: bool,
    task1_name: str,
):
    with dsl.Condition(do_task1 == True):
        task1_op = (
            task1(
                task1_name=task1_name,
            )
        )

    task2_op = (
        task2(
            task1_output=task1_op.outputs["output_path"],
        )
    )


if __name__ == '__main__':

    do_task1 = True  # <------------ The variable to modify ---------------

    # compile pipeline
    compiler.Compiler().compile(
        pipeline_func=pipeline, package_path='pipeline.json')

    # create pipeline run
    pipeline_run = pipeline_jobs.PipelineJob(
        display_name='pipeline-display-name',
        pipeline_root='pipelineroot',
        job_id='pipeline-job-id',
        template_path='pipelinename.json',
        parameter_values={
            'do_task1': do_task1,  # pipeline compilation fails with either True or False values
            'task1_name': 'Task 1',
        },
        enable_caching=False
    )

    # execute pipeline run
    pipeline_run.run()
Any help is much appreciated!
The real issue here is with dsl.Condition(): it creates a sub-group, where task1_op is an inner task only "visible" from within that sub-group. In the latest SDK, it will throw a more explicit error message saying that task2 cannot depend on any inner task.
So to resolve the issue, you just need to move task2 to be within the condition; if the condition is not met, you don't have a valid input to feed into task2 anyway.
with dsl.Condition(do_task1 == True):
    task1_op = (
        task1(
            task1_name=task1_name,
        )
    )
    task2_op = (
        task2(
            task1_output=task1_op.outputs["output_path"],
        )
    )
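With both ops inside the dsl.Condition group, re-compiling the pipeline exactly as in the question (compiler.Compiler().compile(...)) should no longer raise the AssertionError for either value of do_task1; when the condition is not met, both tasks are simply skipped.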

sqlite select request does not fetch moz_place

I am trying to fetch the history data of the day and to print it.
The error I am getting is:
sqlite3 : Operational error near "(" : syntax error
import sqlite3 as sqlite
import sys
import time
conn = sqlite.connect('places.sqlite.db')
c = conn.cursor()
today = str(time.time())
Here I am selecting the first 10 characters because I want to match a Unix epoch in seconds rather than milliseconds (so only the first 10 are interesting to me):
c.execute("SELECT * FROM moz_places WHERE LEFT(last_visit_date, 10)='"+today+"'")
user1 = c.fetchone()
print(user1)
As mentioned earlier, I get "sqlite3 : Operational error near "(" : syntax error".
Do you know what is wrong there?
Here is how to convert moz_places.last_visit_date into a string, 'YYYY-MM-DD HH:MM:SS':
UTC: datetime(last_visit_date/1000000, 'unixepoch')
Local time zone: datetime(last_visit_date/1000000, 'unixepoch','localtime')
Here is a link to SQLite doc on Date and Time Functions.
The today string created in Python should match this format exactly (because it will be doing a string comparison).
From the comments: the name of the places database in Firefox is places.sqlite (not places.sqlite.db). The database name should include the full or relative path if it is not in your current working directory.
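Putting the pieces together, here is a minimal sketch (an illustration of the suggestion above, not the answerer's exact code) that matches today's visits by comparing dates instead of using the unsupported LEFT() function; url and title are standard moz_places columns:
import sqlite3

# Assumes places.sqlite is in the current working directory; adjust the path otherwise.
conn = sqlite3.connect('places.sqlite')
c = conn.cursor()

# last_visit_date is stored in microseconds since the Unix epoch, so divide by
# 1,000,000 before converting, and compare calendar dates in the local time zone.
c.execute(
    "SELECT url, title, datetime(last_visit_date/1000000, 'unixepoch', 'localtime') "
    "FROM moz_places "
    "WHERE date(last_visit_date/1000000, 'unixepoch', 'localtime') = date('now', 'localtime')"
)
for row in c.fetchall():
    print(row)
conn.close()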

Unable to load data into parquet file format?

I am trying to parse log data into the Parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to the Parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried:
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into logspar select * from log_par ;
Aman kumar,
To resolve this issue, run the Hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1. On the Hive Server host, create a /usr/hdp/<version>/hive/auxlib directory.
2. Copy /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar to /usr/hdp/<version>/hive/auxlib.
3. Restart the HS2 server.
For further reference, please check:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.

How to merge orc files for external tables?

I am trying to merge multiple small ORC files. Came across ALTER TABLE CONCATENATE command but that only works for managed tables.
Hive gave me the following error when I try to run it :
FAILED: SemanticException
org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge
can only be performed on managed tables
Following are the table parameters :
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 535
numRows 27051810
orc.compress SNAPPY
rawDataSize 20192634094
totalSize 304928695
transient_lastDdlTime 1512126635
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I believe your table is an external table, so there are two ways:
Either change it to a managed table (ALTER TABLE <table> SET TBLPROPERTIES('EXTERNAL'='FALSE')), run ALTER TABLE ... CONCATENATE, and then convert it back to external by setting the property to 'TRUE'.
Or create a managed table using CTAS and insert the data, run the merge query, and import the data back into the external table.
From my previous answer to this question, here is a small script in Python using PyORC to concatenate the small ORC files together. It doesn't use Hive at all, so you can only use it if you have direct access to the files and are able to run a Python script on them, which might not always be the case in managed hosts.
import pyorc
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'))
    parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
    args = parser.parse_args()

    schema = str(pyorc.Reader(args.files[0]).schema)

    with pyorc.Writer(args.output, schema) as writer:
        for i, f in enumerate(args.files):
            reader = pyorc.Reader(f)
            if str(reader.schema) != schema:
                raise RuntimeError(
                    "Inconsistent ORC schemas.\n"
                    "\tFirst file schema: {}\n"
                    "\tFile #{} schema: {}"
                    .format(schema, i, str(reader.schema))
                )
            for line in reader:
                writer.write(line)


if __name__ == '__main__':
    main()
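For example, assuming the script above is saved as concat_orc.py, it could be invoked as python concat_orc.py -o merged.orc part-00000.orc part-00001.orc to write all the input files into a single merged.orc (the file names here are placeholders).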

ERROR MESSAGE when cassandra python Driver accept parameters from user GUI

I'm using Cassandra 2.1 and CQL 3.2.1. I want to let the user specify the keyspace name, replication strategy, and replication factor from the UI, and then pass these values into a query that executes an INSERT CQL statement, but it gives me a syntax error. I've tried a lot, but nothing goes right.
I create a keyspace -> connect
and a column family -> keyspaces
but the insertion causes an error.
Here is my code:
from cassandra.cluster import Cluster


class Connection():
    def __init__(self, ips, keyspace, replication_strategy, replication_factor):
        self.keyspace = keyspace
        self.ips = ips
        self.replication_strategy = replication_strategy
        self.replication_factor = replication_factor
        cluster = Cluster([ips])
        session = cluster.connect()
        session.execute("CREATE keyspace IF NOT EXISTS connect with replication={ 'class' : 'SimpleStrategy', 'replication_factor' :1}")
        session.execute("CREATE TABLE IF NOT EXISTS connect.keyspaces (id int primary key , keyspaces_name text, replication_strategy text, replication_factor int)")
        session.execute("INSERT INTO connect.keyspaces(id , keyspaces_name , replication_strategy ,replication_factor ) VALUES (1 " + ',' + self.keyspace + ',' + self.replication_strategy + ',' + self.replication_factor + ")")
And the error message is:
File "cassandra/cluster.py", line 3822, in cassandra.cluster.ResponseFuture.result (cassandra/cluster.c:74332)
raise self._final_exception
SyntaxException: <Error from server: code=2000 [Syntax error in CQL query] message="line 1:125 no viable alternative at input ',' (...) VALUES (1 ,noon,[SimpleStrategy],...)">
This means there is a syntax error in the INSERT statement you're building. It might be easier to troubleshoot if you print the string query you've built.
Alternatively I would suggest parameterizing your query to let the driver do formatting:
http://datastax.github.io/python-driver/getting_started.html#passing-parameters-to-cql-queries
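As a minimal sketch of that suggestion (using the connect.keyspaces table from the question, with placeholder values standing in for the GUI inputs), the values can be bound with %s placeholders so the driver handles quoting and type formatting:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # assumed contact point
session = cluster.connect()

# Placeholder values standing in for what the GUI would supply.
keyspace = 'noon'
replication_strategy = 'SimpleStrategy'
replication_factor = '3'  # arrives as a string from the UI

# Let the driver bind the values instead of concatenating them into the CQL string.
insert_cql = (
    "INSERT INTO connect.keyspaces "
    "(id, keyspaces_name, replication_strategy, replication_factor) "
    "VALUES (%s, %s, %s, %s)"
)
session.execute(insert_cql, (1, keyspace, replication_strategy, int(replication_factor)))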
