Insert rows into Hive as they are computed in Spark

Let's say I am building a Spark application that I want to be able to kill partway through. I still want to persist the data from the partitions that finish successfully, so I attempt to insert it into a Hive table. In (PySpark) pseudocode:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def myExpensiveProcess(x):
    ...

udfDoExpensiveThing = udf(myExpensiveProcess, StringType())

myDataFrame \
    .repartition(100) \
    .withColumn("HardEarnedContent", udfDoExpensiveThing("InputColumn")) \
    .write.insertInto("SomeExistingHiveTable")
I run this until 30 partitions are done, then I kill the job. When I check SomeExistingHiveTable, I see that it has no new rows.
How do I persist the data from the partitions that finish, irrespective of which ones do not?

This is expected and desired behavior, ensuring consistency of the output.
Write the data directly to the file system, bypassing Spark's data source API:
myDataFrame \
    .repartition(100) \
    .withColumn("HardEarnedContent", udfDoExpensiveThing("InputColumn")) \
    .rdd \
    .foreachPartition(write_to_storage)
where write_to_storage implements the required logic, for example using one of the HDFS interfaces.
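For illustration, here is a minimal sketch of what write_to_storage could look like, assuming the WebHDFS-based hdfs Python package and a hypothetical NameNode address and staging directory (neither is specified in the thread); each partition lands in its own file, so partitions that finish are kept even if the job is killed later:

import uuid
from hdfs import InsecureClient  # WebHDFS client, installed separately

def write_to_storage(rows):
    # Runs on the executor for each partition; 'rows' is an iterator of Rows.
    client = InsecureClient("http://namenode:50070", user="spark")  # hypothetical address
    # One uniquely named file per partition, under a hypothetical staging directory.
    path = "/tmp/hard_earned_content/part-{}".format(uuid.uuid4())
    lines = ("\t".join(str(c) for c in row) + "\n" for row in rows)
    client.write(path, data=lines, encoding="utf-8")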

Related

load 2T data from Hive to local server

I have 2TB of data in my Hadoop cluster, in a Hive database, and I would like to bring these data to my local server. I use Hive to perform this task with the beeline CLI, as below:
use db1;
for i in (T1 T2 T3 ...) do
export table $i to '/tmp/$i';
done
(Note: you may notice some errors in the query above; fixing it is not what I'm looking for. The syntax isn't exactly what I used, but it's close enough and it works for me, so don't worry about this query.)
This query is really slow at completing the task, so what I'm actually looking for is whether there is another solution, such as Sqoop, or hadoop fs -get /user/hive/warehouse/database.db, or even Hive, to do this as fast as possible.

Kafka Structured Streaming checkpoint

I am trying to do Structured Streaming from Kafka. I am planning to store checkpoints in HDFS. I read a Cloudera blog recommending not to store checkpoints in HDFS for Spark Streaming. Is it the same issue for Structured Streaming checkpoints?
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/.
In Structured Streaming, if my Spark program is down for a certain time, how do I get the latest offset from the checkpoint directory and load data after that offset?
I am storing checkpoints in a directory as shown below.
df.writeStream \
    .format("text") \
    .option("path", '/files') \
    .option("checkpointLocation", 'checkpoints/chkpt') \
    .start()
Update:
This is my Structured Streaming program; it reads a Kafka message, decompresses it, and writes to HDFS.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KafkaServer) \
    .option("subscribe", KafkaTopics) \
    .option("failOnDataLoss", "false") \
    .load()

Transaction_DF = df.selectExpr("CAST(value AS STRING)")
Transaction_DF.printSchema()

# zip_extract is a UDF to decompress the stream
decomp = Transaction_DF.select(zip_extract("value").alias("decompress"))
query = decomp.writeStream \
    .format("text") \
    .option("path", "/Data_directory_inHDFS") \
    .option("checkpointLocation", "/path_in_HDFS") \
    .start()
query.awaitTermination()
Storing checkpoints on long-term storage (HDFS, AWS S3, etc.) is preferred. I would like to add one point here: the property "failOnDataLoss" should not be set to false, as that is not best practice. Data loss is something no one can afford. Apart from that, you are on the right path.
In structured streaming, If my spark program is down for certain time,
how do I get latest offset from checkpoint directory and load data
after that offset.
Under your checkpoint directory you will find a folder named 'offsets'. The 'offsets' folder maintains the next offsets to be requested from Kafka. Open the latest file (latest batch file) under the 'offsets' folder; the next expected offsets will be in the format below:
{"kafkatopicname":{"2":16810618,"1":16810853,"0":91332989}}
To load data after that offset, set the property below on your Spark read stream:
.option("startingOffsets", "{\""+topic+"\":{\"0\":91332989,\"1\":16810853,\"2\":16810618}}")
0, 1 and 2 are the partitions in the topic.
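For illustration, a minimal sketch of pulling that value out of the checkpoint programmatically, assuming a single Kafka source and a checkpoint directory readable from the local filesystem (for a checkpoint on HDFS you would read the file through an HDFS client instead):

import os

def latest_starting_offsets(checkpoint_dir):
    offsets_dir = os.path.join(checkpoint_dir, "offsets")
    # Batch files are named 0, 1, 2, ...; pick the most recent batch.
    batches = sorted(int(name) for name in os.listdir(offsets_dir) if name.isdigit())
    latest_file = os.path.join(offsets_dir, str(batches[-1]))
    with open(latest_file) as fh:
        lines = [line.strip() for line in fh if line.strip()]
    # The last line holds the per-partition offsets, e.g.
    # {"kafkatopicname":{"2":16810618,"1":16810853,"0":91332989}}
    return lines[-1]

# Then pass it to the reader:
# .option("startingOffsets", latest_starting_offsets("checkpoints/chkpt"))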
As I understood the article, it recommends maintaining offset management in either HBase, Kafka, HDFS, or ZooKeeper.
"It is worth mentioning that you can also store offsets in a storage
system like HDFS. Storing offsets in HDFS is a less popular approach
compared to the above options as HDFS has a higher latency compared to
other systems like ZooKeeper and HBase."
You can find in the Spark documentation how to restart a query from an existing checkpoint: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
In your query, try applying a checkpoint while writing results to persistent storage like HDFS, in a format like Parquet. It worked well for me.
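For reference, a minimal sketch of that suggestion applied to the query above, writing the decompressed stream as Parquet with a checkpoint (the output and checkpoint paths here are placeholders):

query = decomp.writeStream \
    .format("parquet") \
    .option("path", "/data/decompressed_parquet") \
    .option("checkpointLocation", "/data/checkpoints/decompressed") \
    .start()
query.awaitTermination()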

HBase aggregation, Get And Put operation, Bulk Operation

I would like to know how I can map the value of a key.
I know that it can be done with Get and then Put operations. Is there any other way to do it efficiently? 'checkAndPut' is not very helpful.
Can it be done with something like:
(key, value) => value + g()
I have read the book HBase: The Definitive Guide, and it seems that a MapReduce job is interpreted as Put/Get operations on top of HBase. Does that mean it is not a 'bulk operation' (since it is an operation per key)?
Is Spark relevant here, and how?
HBase has scans (1) to retrieve multiple rows, and MapReduce jobs can and do use this command (2).
For HBase, 'bulk' mostly [or solely] means 'bulk load'/'bulk import', where one adds data by constructing HFiles and 'injecting' them into the HBase cluster (as opposed to Puts) (3).
Your task can be implemented as a MapReduce job or a Spark app (4 being one example, maybe not the best one), or a Pig script, or a Hive query if you use the HBase table from Hive (5); pick your poison.
If you set up a table with a counter, then you can use an Increment to add a certain amount to the existing value in an atomic operation.
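For illustration, a minimal sketch of such an atomic counter increment from Python, assuming an HBase Thrift server is running and using the happybase client (the host, table, column family, and row key are hypothetical; in the Java client the equivalent call is Table.increment):

import happybase

connection = happybase.Connection("hbase-thrift-host")  # hypothetical Thrift host
table = connection.table("my_table")

# Atomically add 5 to the counter stored in cf:hits for this row.
new_value = table.counter_inc(b"row-key-1", b"cf:hits", value=5)
print(new_value)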
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation, but it would probably work just fine if the number of rows that you modify in each batch is relatively small compared to the total number of rows in your table.
If (and only if) you expect to modify your entire table at each batch, then you should look at bulk loads. This will require you to write a job that reads your existing values from HBase and your new values from the incremental sources, adds them together, and writes them back to HBase (in a 'bulk load' fashion, not directly).
A bulk load writes HFiles directly to HDFS without going through the HBase 'write pipeline' (MemStore, minor compactions, major compactions, etc.), and then issues a command to swap the existing files with the new ones. The swap is FAST! Note, you could also generate the new HFiles outside the HBase cluster (so as not to overload it), then copy them over and issue the swap command.

Move data from oracle to HDFS, process and move to Teradata from HDFS

My requirement is to
Move data from Oracle to HDFS
Process the data on HDFS
Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB and the processed data also may be the same.
After searching a lot on the internet, I found that I could:
Use ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
Do large-scale processing with custom MapReduce, Hive, or Pig.
Use the Sqoop Teradata connector to move data from HDFS to Teradata (again, have a shell script with the code and then schedule it).
Is this the right option in the first place, and is it feasible for the required time period (please note that this is not a daily batch or similar)?
Other options that I found are the following:
Storm (for real-time data processing), but I am not able to find an Oracle spout or Teradata bolt out of the box.
Any open source ETL tools like Talend or Pentaho.
Please share your thoughts on these options as well and any other possibilities.
Looks like you have several questions so let's try to break it down.
Importing in HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information: here and here. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for doing your analysis.
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is NOT SQL-like; it is more a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be ready to be processed right after it is imported.
In Hive you could run your query and tell it to write the results in HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Exporting into TeraData
Cloudera released a Sqoop connector for Teradata last year, as described here, so you should take a look, as this seems to be exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want; in the end, what will matter is the size of your cluster. If you want it quick, then scale your cluster up as needed. The good thing about Hive and Sqoop is that the processing will be distributed across your cluster, so you have total control over the schedule.
If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.
I'm not sure if you need to reprocess the entire data set each time or can possibly just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a spout that reads from that queue. You could probably also build your own trigger-based solution, but it would be a bit of work.
As a disclosure, I work for Dell Software.

How to find optimal number of mappers when running Sqoop import and export?

I'm using Sqoop version 1.4.2 and Oracle database.
When running a Sqoop command, for example like this:
./sqoop import \
--fs <name node> \
--jt <job tracker> \
--connect <JDBC string> \
--username <user> --password <password> \
--table <table> --split-by <cool column> \
--target-dir <where> \
--verbose --m 2
We can specify --m, i.e. how many parallel tasks we want Sqoop to run (they might also be accessing the database at the same time).
The same option is available for ./sqoop export <...>.
Is there some heuristic (probably based on the size of the data) that will help guess the optimal number of tasks to use?
Thank you!
This is taken from Apache Sqoop Cookbook by O'Reilly Media, and seems to be the most logical answer.
The optimal number of mappers depends on many variables: you need to take into account your database type, the hardware that is used for your database server, and the impact to other requests that your database needs to serve. There is no optimal number of mappers that works for all scenarios. Instead, you're encouraged to experiment to find the optimal degree of parallelism for your environment and use case. It's a good idea to start with a small number of mappers, slowly ramping up, rather than to start with a large number of mappers, working your way down.
In "Hadoop: The Definitive Guide," they explain that when setting up your maximum map/reduce task on each Tasktracker consider the processor and its cores to define the number of tasks for your cluster, so I would apply the same logic to this and take a look at how many processes you can run on your processor(s) (Counting HyperTreading, Cores) and set your --m to this value - 1 (leave one open for other tasks that may pop up during the export) BUT this is only if you have a large dataset and want to get the export done in a timely manner.
If you don't have a large dataset, then remember that your output will consist of --m files, so if you are exporting a 100-row table, you may want to set --m to 1 to keep all the data localized in one file.
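As a toy illustration of that rule of thumb (all numbers here are hypothetical):

# Rule of thumb from above: --m = logical processors - 1, for large datasets.
cores = 4
threads_per_core = 2                            # hyper-threading
logical_processors = cores * threads_per_core   # 8
mappers = logical_processors - 1                # leave one slot free
print(mappers)                                  # pass this as: --m 7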
