In Pig, can you do a replicated left join?

The documentation is somewhat misleading when it comes to replicated joins in Pig. The script won't compile if I add 'left' to the join while also using 'replicated'. The documentation mentions only supporting left outer joins with replicated, but the behavior I see is clearly an inner join. Does anyone know how to do a left replicated join?
c = join a by (x,y,z) left, b by (x,y,z) using 'replicated';
(That statement won't parse)

A replicated join works with both inner and left outer joins, and it works for me with three join fields. What error message are you getting? Are you sure the fields are of compatible data types? Which Pig version are you using?
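For reference, this is a sketch of the left replicated join syntax the Pig documentation describes (assuming a is the large relation and b the small one; the replicated relation must be listed last, and must fit in memory):

```
-- left outer replicated join; b is fragment-replicated to every mapper
c = JOIN a BY (x, y, z) LEFT OUTER, b BY (x, y, z) USING 'replicated';
```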

Related

Apache Spark 2.4: Why is there "No Broadcast"?

I have configured spark-submit with
"--conf", "spark.sql.autoBroadcastJoinThreshold=536870912"  (512 MB)
But the DAG still shows that the smaller side of the join is not being broadcast.
The code is a simple join, so I'm wondering what is wrong.
The input is Parquet files stored on S3.
If more information is needed for further analysis, please let me know.
According to this blog,
BHJ is not supported for a full outer join. For a right outer join, only the left-side table can be broadcast, and for the other left joins only the right-side table can be broadcast.
That is the reason the broadcast is not happening.
My guess would be that the configuration spark.sql.autoBroadcastJoinThreshold is overridden somewhere or is not correctly set. Check the Environment tab in the Spark UI to see whether it appears there with the correct value.
If you just need a quick fix, you can also force the broadcast with the broadcast hint on the Dataset that you already know is small.
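For example, the broadcast hint could look like this (a sketch in Scala; factDF and dimDF are hypothetical names for the large Dataset and the known-small one):

```scala
import org.apache.spark.sql.functions.broadcast

// Force dimDF to be broadcast regardless of autoBroadcastJoinThreshold
val joined = factDF.join(broadcast(dimDF), Seq("id"), "left_outer")
```

Note that, per the restriction quoted above, the hint cannot help for a full outer join, and for a left join only the right side can be broadcast.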

What are the differences between KTable vs GlobalKTable and leftJoin() vs outerJoin()?

In the Kafka Streams library, I want to know the difference between KTable and GlobalKTable.
Also, the KStream class has two methods, leftJoin() and outerJoin(). What is the difference between these two methods?
I read the KStream.leftJoin documentation, but did not manage to find the exact difference.
KTable VS GlobalKTable
A KTable shards the data between all running Kafka Streams instances, while a GlobalKTable has a full copy of all data on each instance. The disadvantage of a GlobalKTable is that it obviously needs more memory. The advantage is that you can do a KStream-GlobalKTable join with a non-key attribute from the stream. A KStream-KTable join on a non-key stream attribute is only possible by extracting the join attribute and setting it as the key before doing the join -- this results in a repartitioning step for the stream before the join can be computed.
Note, though, that there is also a semantic difference: for a stream-table join, Kafka Streams aligns record processing based on record timestamps. Thus, updates to the table are synchronized with the records of your stream. For a GlobalKTable, there is no time synchronization, and updates to the GlobalKTable are completely decoupled from the processing of the stream records (thus, you get weaker semantics).
For further details, see KIP-99: Add Global Tables to Kafka Streams.
leftJoin() VS outerJoin()
About left and outer joins: they are like a left-outer and a full-outer join in a database, respectively.
For a left outer join, you might "lose" data from your right input stream when there is no matching record on the left-hand side.
For a (full) outer join, no data is dropped, and each input record of both streams will be in the result stream.
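The difference in the result sets can be illustrated with a small sketch in plain Python (this only mirrors the join semantics over two keyed inputs; it is not Kafka Streams code and ignores windowing and timestamp alignment):

```python
left = {"a": 1, "b": 2}      # records seen on the left stream, keyed
right = {"b": 20, "c": 30}   # records seen on the right stream, keyed

def left_join(l, r):
    # every left key appears; right keys without a left match are dropped
    return {k: (v, r.get(k)) for k, v in l.items()}

def outer_join(l, r):
    # every key from either side appears in the result
    return {k: (l.get(k), r.get(k)) for k in set(l) | set(r)}

print(left_join(left, right))   # {'a': (1, None), 'b': (2, 20)} -- no 'c'
print(outer_join(left, right))  # also contains 'c': (None, 30)
```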

What kind of JOINs does CockroachDB support?

I saw that CockroachDB offers JOIN support in this blog post, but it doesn't mention what level of JOINs are supported. Are all of the major types of joins supported, or are there limitations?
CockroachDB supports the major JOIN types:
INNER
FULL OUTER
LEFT
RIGHT
If you need it, you can find the CockroachDB JOIN documentation here.
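In standard SQL, which CockroachDB follows for these, the four types look like this (t1 and t2 are hypothetical tables sharing an id column):

```sql
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 RIGHT JOIN t2 ON t1.id = t2.id;
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
```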

Hive Broken pipe error

I have been working on a project that include a hive query.
INSERT INTO OVERWRITE .... TRANSFORM (....) USING 'python script.py' FROM .... LEFT OUTER JOIN . . . LEFT OUTER JOIN . . . LEFT OUTER JOIN
At the beginning everything worked fine, until we loaded a large amount of dummy data. We just wrote the same records with small variations in some fields. After that we ran the query again, and we are getting a Broken pipe error without much information. There is no log about the error, just the IOException: Broken pipe error.
To simplify the script and isolate the error, we reduced it to
for line in sys.stdin.readlines():
    print line
to rule out any error at that level. We still get the same error.
The problem seems to be solved by splitting the many joins across separate queries and using intermediate tables, then adding a final query with a last join that combines all the previous results. As I understand it, this means there is no error at the script level, but too much data for Hive to handle in a single query.
Another workaround is to remove the TRANSFORM and insert the data into another table with a new query that only runs the transformation. I'm not 100% sure why this helps, since the script is correct. I think the issue may be the very large amount of data streamed through the script because of the many joins.
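As a side note, the simplified script above still reads the whole input into memory at once, because readlines() buffers everything before the loop starts. A line-by-line pass-through avoids that buffering; a minimal sketch in Python 3 syntax (the original snippet is Python 2):

```python
import sys

def passthrough(lines):
    # yield every input line unchanged, stripping only the trailing newline
    for line in lines:
        yield line.rstrip("\n")

if __name__ == "__main__":
    for out in passthrough(sys.stdin):
        print(out)
```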

Which is faster in Apache Pig: Split then Union or Filter and Left Join?

I am currently processing a large input table (10^7 rows) in Pig Latin. The table is filtered on some field, the matching rows are processed, and the processed rows are returned to the original table. When the processed rows are returned, the fields the filters are based on are changed, so that subsequent filtering ignores the already-processed rows.
Is it more efficient in Apache Pig to split the table into processed and unprocessed parts on the filtering criteria, apply the processing, and union the two tables back together, or to filter the table, apply the processing to the filtered rows, and perform a left join back into the original table on a primary key?
I can't say which one will actually run faster; I would simply run both versions and compare execution times :)
If you go for the join-based solution (filter, then join), make sure to specify the smaller (if there is one) of the two tables first in the join operation (that will probably be the newly processed data). The Pig documentation suggests that this leads to a performance improvement, because the last table is "not brought into memory but streamed through instead".
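Sketched in Pig Latin, the two options might look like this (relation, field, and UDF names are hypothetical):

```
-- Option 1: split, process, union
SPLIT input INTO todo IF processed == 0, done IF processed == 1;
updated = FOREACH todo GENERATE id, myudfs.Process(payload) AS payload, 1 AS processed;
result = UNION done, updated;

-- Option 2: filter, process, join back on the key
todo = FILTER input BY processed == 0;
updated = FOREACH todo GENERATE id, myudfs.Process(payload) AS payload;
result = JOIN updated BY id, input BY id;  -- smaller relation listed first
```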
