In Spark I want to save RDD objects to a Hive table. I am trying to use createDataFrame, but it is throwing:
Exception in thread "main" java.lang.NullPointerException
val products=sc.parallelize(evaluatedProducts.toList);
// here products is an RDD[Product]
val productdf = hiveContext.createDataFrame(products, classOf[Product])
I am using Spark 1.5.
If your Product is a plain class (not a case class), I suggest you transform your RDD into an RDD of tuples before creating the DataFrame:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
val productDF = products
.map({p: Product => (p.getVal1, p.getVal2, ...)})
.toDF("col1", "col2", ...)
With this approach, you will have the Product attributes as columns in the DataFrame.
Then you can create a temp table with:
productDF.registerTempTable("table_name")
or a physical table with:
productDF.write.saveAsTable("table_name")
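Alternatively, if you are free to model Product as a case class, Spark can derive the schema directly and you can skip the tuple mapping. A minimal sketch, with hypothetical field names val1 and val2:
import org.apache.spark.sql.hive.HiveContext

case class ProductRow(val1: String, val2: Double)   // hypothetical fields

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val productDF = sc.parallelize(evaluatedProducts.toList)
  .map(p => ProductRow(p.getVal1, p.getVal2))       // adapt to your Product's accessors
  .toDF()

productDF.write.saveAsTable("table_name")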
I have an Oracle database that contains multiple users/schemas, and I would like to generate Slick schemas automatically for a specific user. This is what I've tried so far:
import scala.concurrent.ExecutionContext.Implicits.global
val profileInstance: JdbcProfile =
Class.forName("slick.jdbc.OracleProfile$")
.getField("MODULE$")
.get(null).asInstanceOf[JdbcProfile]
val db = profileInstance.api.Database
.forURL("jdbc:oracle:thin:#//myhost:myport/servicename","user","pass")
val modelAction = OracleProfile.createModel(Some(OracleProfile.defaultTables))
val model = Await.result(db.run(modelAction), Duration.Inf)
model.tables.foreach(println)
This doesn't print anything. I guess I have to provide the schema to use, but I don't know how to do this.
On the other hand, I am able to list all the schemas of the database, using the following code :
val resultSet = db.createSession().metaData.getSchemas.getStatement.getResultSet
while(resultSet.next()) {
println(resultSet.getString(1))
}
How can I specify which schema I want to use with Slick?
I've found out how to do it. Instead of using OracleProfile.defaultTables, I manually defined the tables and views I needed, like this:
val modelAction = OracleProfile.createModel(
Some(MTable.getTables(None, Some("MYSCHEMA"), None, Some(Seq("TABLE", "VIEW"))))
)
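For completeness, a sketch of how that filter plugs back into the rest of the code from the question (the connection details and MYSCHEMA are the placeholders used above):
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import slick.jdbc.OracleProfile
import slick.jdbc.OracleProfile.api._
import slick.jdbc.meta.MTable

val db = Database.forURL("jdbc:oracle:thin:@//myhost:myport/servicename", "user", "pass")

// Restrict model creation to the tables and views of MYSCHEMA
val modelAction = OracleProfile.createModel(
  Some(MTable.getTables(None, Some("MYSCHEMA"), None, Some(Seq("TABLE", "VIEW"))))
)

val model = Await.result(db.run(modelAction), Duration.Inf)
model.tables.foreach(println)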
Are there any ways to achieve SQL features like stored procedures or functions in Spark SQL?
I'm aware of HPL/SQL and coprocessors in HBase, but I want to know whether anything similar is available in Spark.
You may consider using User Defined Functions (UDFs) and built-in functions.
A quick example:
import org.apache.spark.sql.functions.udf
import spark.implicits._   // for toDF and the 'text column syntax (pre-imported in spark-shell)

val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")

val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)

// Apply the UDF to add an upper-cased column to the source dataset
dataset.withColumn("upper", upperUDF('text)).show()
Result:
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
We cannot create stored procedures/functions in Spark SQL. The best approach is to create a temp table, much like a CTE, and use those tables for further processing. Or you can create a UDF in Spark.
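A minimal sketch of that last suggestion, assuming Spark 2.x with a SparkSession named spark: register the UDF under a name so it can be called from SQL, and use a temp view as the CTE-style intermediate table.
import spark.implicits._

val words = Seq((0, "hello"), (1, "world")).toDF("id", "text")

// Register the UDF under a name so it is callable from SQL statements
spark.udf.register("upper_udf", (s: String) => s.toUpperCase)

// A temp view plays the role of the CTE-style intermediate table
words.createOrReplaceTempView("words")
spark.sql("SELECT id, text, upper_udf(text) AS upper FROM words").show()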
I have a CSV file in my HDFS with a collection of products like:
[56]
[85,66,73]
[57]
[8,16]
[25,96,22,17]
[83,61]
I'm trying to apply the Association Rules algorithm in my code. For that I need to run this:
scala> val data = sc.textFile("/user/cloudera/data")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/data MapPartitionsRDD[294] at textFile at <console>:38
scala> val distData = sc.parallelize(data)
But when I submit this I'm getting this error:
<console>:40: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: Seq[?]
Error occurred in an application involving default arguments.
val distData = sc.parallelize(data)
How can I transform an RDD[String] into a Seq collection?
Many thanks!
What you are facing is simple; the error tells you what is wrong.
To parallelize an object in Spark you should pass it a Seq, but you are passing an RDD[String].
The RDD is already parallelized: the textFile method distributes the file across your cluster, one element per line.
You can check the method description here:
https://spark.apache.org/docs/latest/programming-guide.html
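Since the goal is the Association Rules algorithm, here is a rough sketch of what comes next. This is an assumption on my part, using MLlib's FPGrowth; the minSupport and confidence values are placeholders to tune:
import org.apache.spark.mllib.fpm.FPGrowth

val data = sc.textFile("/user/cloudera/data")   // RDD[String], already distributed

// "[25,96,22,17]" -> Array("25", "96", "22", "17")
val transactions = data.map(_.stripPrefix("[").stripSuffix("]").split(",").map(_.trim))

val model = new FPGrowth().setMinSupport(0.1).run(transactions)
model.generateAssociationRules(0.8).collect().foreach(println)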
I have a Postgres table which has a text column (detail). I have declared detail as STRING in Hive. The import succeeds when I run it from Sqoop or Spark. However, I am missing a lot of the data that is in the detail column, and a lot of empty rows are being created in the Hive table.
Can anyone help me with this?
For example, the detail column has the data below:
line1 sdhfdsf dsfdsdfdsf dsfs
line2 jbdfv df ffdkjbfd
jbdsjbfds
dsfsdfb dsfds
dfds dsfdsfds dsfdsdskjnfds
sdjfbdsfdsdsfds
Only "line1 sdhfdsf dsfdsdfdsf dsfs " is getting imported into hive table.
I can see empty rows for remaining lines.
Hive cannot support multi-line values in text file formats. You must load this data into a binary format, Avro or Parquet, to retain the newline characters. If you don't need to retain them, you can strip them with Sqoop's --hive-drop-import-delims option.
Here is the solution:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

SparkConf sparkConf = new SparkConf().setAppName("HiveSparkSQL");
SparkContext sc = new SparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(sc);
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true");

String url = "jdbc:postgresql://host:5432/dbname?user=**&password=***";
Map<String, String> options = new HashMap<String, String>();
options.put("url", url);
options.put("dbtable", "(select * from abc.table limit 50) as act1");
options.put("driver", "org.postgresql.Driver");

// Read over JDBC and write to a Parquet-backed Hive table, which preserves the embedded newlines
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.write().format("parquet").mode(SaveMode.Append).saveAsTable("act_parquet");
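If keeping the newlines is not required, the --hive-drop-import-delims idea from the earlier answer can also be mimicked in Spark by stripping the delimiter characters before writing to a text-backed table. A rough Scala sketch (jdbcDF stands for the DataFrame loaded over JDBC above; act_text is a hypothetical table name):
import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip newline, carriage-return and Ctrl-A characters, mirroring what
// Sqoop's --hive-drop-import-delims does
val cleaned = jdbcDF.withColumn("detail",
  regexp_replace(col("detail"), "[\\n\\r\\u0001]", " "))

cleaned.write.mode("append").saveAsTable("act_text")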
My HBase table has 30 million records; each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging in size from a few KB to 50 MB. When I run the following Spark code, it only gets 40 thousand records, but I should get 30 million:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize","0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd:RDD[Map[String,Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the ID list first, then iterating over it and fetching the raw:sample column with the plain HBase Java client inside a Spark foreach.
Any ideas why I cannot get all of the raw:sample column through Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and one of my datanodes went down, but I fixed it soon since the replication factor is 3. Is this the reason? Do you think running hbck -repair would help? Thanks a lot!
Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try creating a Scan object (without using Spark), configured to retrieve the same column from HBase, and see if the error repeats:
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
// Instantiating HTable class
HTable table = new HTable(config, "emp");
// Instantiating the Scan class
Scan scan = new Scan();
// Scanning the required columns
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Getting the scan result
ResultScanner scanner = table.getScanner(scan);
// Reading values from scan result
for (Result result = scanner.next(); result != null; result = scanner.next())
System.out.println("Found row : " + result);
//closing the scanner
scanner.close();
In addition, by default, TableInputFormat is configured to request a very small chunk of data from the HBase server (which is bad and causes a large overhead). Set the following to increase the chunk size:
scan.setBlockCache(false);
scan.setCaching(2000);
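In the Spark job from the question, the Scan is built from TableInputFormat's configuration keys, so (as far as I know) the equivalent settings can be wired in roughly like this; with cells up to 50 MB you may want a much smaller caching value than 2000:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set(TableInputFormat.SCAN_CACHEDROWS, "100")    // scan caching (rows fetched per RPC)
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false") // keep large cells out of the block cache

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])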
For a high throughput like yours, Apache Kafka is the best solution for integrating the data flow and keeping the data pipeline alive. Please refer to http://kafka.apache.org/08/uses.html for some Kafka use cases.
One more reference:
http://sites.computer.org/debull/A12june/pipeline.pdf