I have been working on a project that includes a Hive query.
INSERT INTO OVERWRITE .... TRANSFORM (....) USING 'python script.py' FROM .... LEFT OUTER JOIN . . . LEFT OUTER JOIN . . . LEFT OUTER JOIN
At the beginning everything worked fine, until we loaded a large amount of dummy data (we just wrote the same records with small variations in some fields). After that, running the query again fails with a Broken pipe error and very little information: there is no log about the error, just the IOException: Broken pipe.
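For reference, the general shape of such a query (the table and column names big_a, big_b, target_t below are made up for illustration; the real query is elided above) is roughly:

INSERT OVERWRITE TABLE target_t
SELECT TRANSFORM (a.id, a.col1, b.col2)   -- Hive streams the joined rows to the script's stdin
USING 'python script.py'                  -- the script writes transformed rows back on stdout
AS (id, out1, out2)
FROM big_a a
LEFT OUTER JOIN big_b b ON a.id = b.id;

(The Broken pipe IOException shows up when Hive writes rows to the script's stdin after the other end of the pipe has been closed.)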
To simplify the script and isolate errors, we reduced it to:
import sys
# echo every input row back to Hive unchanged
for line in sys.stdin.readlines():
    print line
so that no error can originate at that level. We still get the same error.
The problem seems to be solved by splitting the many joins into separate queries that write to intermediate tables, and then adding a final query with one last join that combines all the previous results. As I understand it, this means there is no error at the script level, but rather too much data for Hive to handle in a single query.
Another workaround is to remove the TRANSFORM from the join query, insert the joined data into another table, and then run the transformation over that table in a separate query. I'm not 100% sure why this helps, since the script is correct; I think the issue may be the very large amount of data being streamed because of the many joins.
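A rough sketch of both workarounds, reusing the made-up table names big_a, big_b, target_t from the sketch above plus a hypothetical intermediate table stage_joined:

-- 1) split the joins into separate queries that write intermediate tables,
--    then combine them with one final join
INSERT OVERWRITE TABLE stage_joined
SELECT a.id, a.col1, b.col2
FROM big_a a
LEFT OUTER JOIN big_b b ON a.id = b.id;

-- 2) run the TRANSFORM over the pre-joined table in its own, much simpler query
INSERT OVERWRITE TABLE target_t
SELECT TRANSFORM (id, col1, col2) USING 'python script.py' AS (id, out1, out2)
FROM stage_joined;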
Related
I have code that left-joins a main data frame with multiple lookup data frames, because the attributes built from each of these data frames need to be positioned at different places in the final JSON file I'm trying to write. The code also has to grow as new elements are added. With the current approach, the job takes almost 3-4 hours and finally aborts due to performance issues.
What is a better way to address this performance issue?
lkp_df1
lkp_df2, etc.

main_df = main_df.join(lkp_dfN, keys, 'left') \
                 .select(....)

is the pattern I have in the code.
Please paste your entire code. Try to use persist or checkpoint, and also check the number of partitions and the data distribution across the cluster.
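As a rough sketch of those ideas on the Spark SQL side (main_df, lkp_df1, the key column id and the other columns are placeholders, assuming the data frames are registered as temp views; the DataFrame equivalents are persist()/cache() and broadcast()):

-- cache the main data so repeated joins don't recompute it
CACHE TABLE main_df;

-- ask Spark to broadcast the small lookup table instead of shuffling the large one
SELECT /*+ BROADCAST(l1) */
       m.id, m.payload, l1.attr1
FROM main_df m
LEFT JOIN lkp_df1 l1 ON m.id = l1.id;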
My Hive query has been throwing an error:
syntax error near unexpected token `('
I am not sure where the error occurs in the query below.
Can you help me out?
select
    A.dataA, B.dataB, count(A.nid), count(B.nid)
from
    (select nid, sum(dataA_count) as dataA
     from table_view
     group by nid) A
LEFT JOIN
    (select nid, sum(dataB_count) as dataB
     from table_others
     group by nid) B
ON A.nid = B.nid
group by A.dataA, B.dataB;
I think you did not close the ) at the end.
Thanks
Sometimes people forget to start the metastore service and to enter the Hive shell first, and instead start passing commands the way they would for Sqoop. When I was a newbie I also faced these things,
so to overcome this issue:
go to the Hive directory and run: bin/hive --service metastore so it will start the Hive metastore server for you, and later
open another terminal or CLI and run: bin/hive so it will let you enter the Hive shell.
Sometimes when you forget these steps, you get silly issues like the one we are discussing in this thread.
Hope it will help others, thanks.
I had gone through many posts, but I didn't realize that my Beeline terminal was logged off and I was typing the query in a normal terminal.
I faced an issue exactly like this:
First, you have opened 6 parentheses and closed 6, so this is not your issue.
Last but not least, you are getting this error because your command is being interpreted word by word by the shell. A statement such as a SQL query is only understood by databases, and if you use the dialect of a specific database, then only that particular database is able to understand it.
Your "before '(' ..." error means you are using something before the ( that is not known to the terminal or the place you are running the command in.
All you have to do to fix it is:
1- Wrap the query in single or double quotes (see the example after this list).
2- Use a WHERE clause even if you don't need one (for example, Apache Sqoop requires it no matter what). Check the documentation for the exact way to do so; usually you can use something like where 1=1 when you don't need it (for Sqoop it was where $CONDITIONS).
3- Make sure your query runs in the designated database first, before asking any third-party app to run it.
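For example (query shortened), instead of typing the statement at the bash prompt, pass it to Hive as one quoted string:

hive -e "select A.dataA, count(A.nid) from (select nid, sum(dataA_count) as dataA from table_view group by nid) A group by A.dataA;"

or simply run bin/hive first and paste the query inside the Hive shell.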
If this answer was helpful, you can mark it as the chosen answer for better reach in the community.
Using PDI (Kettle) I am filling the entry stage of my database using a CSV Input and a Table Output step. This works great; however, I also want to make sure that the data that was just inserted fulfills certain criteria, e.g. fields not being NULL, etc.
Normally this would be a job for database constraints; however, we want to keep the data in the database even if it's faulty (for debugging purposes, since it is a pain trying to debug a .csv file...). As it is just a staging table anyway, it doesn't cause any trouble for integrity, etc.
So to do just that, I wrote some SELECT Count(*) as test123 ... statements that instantly show if something is wrong or not and are easy to handle (if the value of test123 is 0 all is good, else the job needs to be aborted).
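For illustration, one of these checks looks roughly like this (staging_table and some_field are made-up names):

SELECT Count(*) as test123
FROM staging_table
WHERE some_field IS NULL;   -- 0 means all is good, anything else should abort the job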
I am executing these statements using an Execute SQL Statements step within a PDI transformation. I expected the result to be automatically passed to my data stream, so I also used a Copy rows to result step to pass it up to the executing job.
This is the point where the problem is most likely located.
I think that the result of the SELECT statement was not automatically passed to my data stream, because when I do a Simple evaluation in the main job using the variable ${test123} (which I thought would be implicitly created by executing SELECT Count(*) as test123 ...), I never get the expected result.
I couldn't really find any clues to this problem in the PDI documentation so I hope that someone here has some experience with PDI and might be able to help. If something is still unclear, just hint at it and I will edit the post with more information.
best regards
Edit:
This is a simple model of my main job:
Start --> Load data (Transformation) --> Check data (Transformation) --> Simple Evaluation --> ...
You are mixing up a few concepts, if I read your post correctly.
You don't need an Execute SQL script step; this is a job for the Table input step.
Just type your query in the Table input and you can preview your data and see it coming from the step into the data stream by using the preview on a subsequent step. The Execute SQL script is not an input step, which means it will not add external data to your data stream.
The output fields are not Variables. A Variable is set using the Set Variables step, which takes a single input row and maps a specific field to a variable, which can be persisted at parent job or root job levels. Fields are just that: fields. They are passed from one step to the next through hops and eventually to the parent job if you have a Copy rows to result step, but they are NOT variables.
I am currently processing a large input table (10^7 rows) in Pig Latin: the table is filtered on some field, the filtered rows are processed, and the processed rows are returned to the original table. When the processed rows are returned, the fields the filters are based on are changed, so that subsequent filtering ignores the already processed rows.
Is it more efficient in Apache Pig to first split the processed and unprocessed tables on the filtering criteria, apply processing and union the two tables back together or to filter the first table, apply the process to the filtered table and perform a left join back into the original table using a primary key?
I can't say which one will actually run faster; I would simply run both versions and compare execution times :)
If you go for the first solution (split, then join) make sure to specify the smaller (if there is one) of the two tables first in the join operation (probably that's going to be the newly added data). The Pig documentation suggests that this will lead to a performance improvement because the last table is "not brought into memory but streamed through instead".
I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be combined to get the desired effect.
The merge statement looks like this:
MERGE INTO table c
USING (select b.field1, field2 aeg
       from table a, table b
       where a.field3 = 'Y'
         and a.field4 = b.field4
       group by b.field1) t
ON (c.field1 = t.field1)
WHEN MATCHED THEN
    UPDATE SET c.UUS_NAIT = t.field2;
I would just like to see an example of MERGE being used as an insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So, since you want to do an operation that does one of the two, I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend (not knowing the details of what you want to do, so this is subject to change):
1) Insert data from an outside source into a staging table.
2) Perform an INSERT ... SELECT from that table into the table you desire, using the criteria you are thinking about: either with a join, or in two statements with subqueries against the table you want to test against (see the sketch after this list).
3) Truncate the staging table.
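A minimal sketch of steps 2 and 3, with hypothetical table and column names (stage_t, final_t, id, val); step 1 could be a COPY or any other load into stage_t:

-- 2) move only the rows you want into the target table
INSERT INTO final_t (id, val)
SELECT s.id, s.val
FROM stage_t s
LEFT JOIN final_t f ON f.id = s.id
WHERE f.id IS NULL;          -- here: only rows not already present

-- 3) clear the staging table for the next load
TRUNCATE TABLE stage_t;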
It seems convoluted, I guess, but you really don't want to do UPDATEs. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
If you want an example of a MERGE statement, follow the link; it points to the Vertica documentation. Remember to follow the instructions exactly: you cannot write a MERGE with WHEN NOT MATCHED followed by WHEN MATCHED. The clauses have to follow the sequence given in the usage description in the documentation (which is the other way round), but you can choose to omit one of them completely.
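As a minimal sketch of that clause order (target_t, stage_t and the columns id, val are made up):

MERGE INTO target_t t
USING stage_t s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);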
I'm not sure if you are aware of the fact that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be manually removed by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that MERGE is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to keep a lot of history (deleted) data, it will slow down your updates. The update speed might go from bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your question: I don't think Vertica supports a subquery in MERGE. You would get the following error:
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact, MERGE also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in the ON clause, which should ideally be a join on the primary keys.
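A hedged sketch of that view-based workaround (merge_src_v, some_source, dest_t and the columns id, val are made up):

-- wrap the former subquery in a view...
CREATE OR REPLACE VIEW merge_src_v AS
SELECT id, SUM(val) AS val
FROM some_source
GROUP BY id;

-- ...and use the view as the MERGE source, joining on the primary key
MERGE INTO dest_t d
USING merge_src_v v
ON (d.id = v.id)
WHEN MATCHED THEN UPDATE SET val = v.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (v.id, v.val);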
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.