We have a requirement to read data from an Oracle table into two Spark jobs (dim and fct, running on HDFS). Both jobs read different columns from the same table.
Total number of records in the table = ~13M
To optimize the above process, we came up with two options:
Option 1: Sqoop-import the table from Oracle and store it on HDFS; the PySpark jobs (dim and fct) then read their respective columns from HDFS.
Cons: resource intensive, an additional Sqoop job, and load on the edge node.
Option 2: Use the PySpark DataFrame API to connect to Oracle over JDBC, reading only the required columns for the dim job (4 columns) and only the required columns for the fct job (12 columns).
Cons: it opens a connection to Oracle twice, once per job.
Pros: easy maintenance, no dependency on Sqoop, no load on the edge node, resource-friendly.
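For illustration, a minimal sketch of option 2 using the Spark DataFrame JDBC reader (shown in Scala; the PySpark calls are analogous, and the URL, table, and column names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OracleRead").getOrCreate()

// Hypothetical connection details; requires the Oracle JDBC driver on the classpath.
val oracle = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "SCHEMA.SOURCE_TABLE")
  .option("user", "user")
  .option("password", "password")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

// Each job selects only the columns it needs; the column pruning is pushed down to the JDBC query.
// In practice each job runs as its own application, which is why the connection is opened twice.
val dimDf = oracle.select("col1", "col2", "col3", "col4")   // dim job (4 columns)
val fctDf = oracle.select("col1", "col2", "col5", "col6")   // fct job (subset shown; 12 in practice)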
Which is the most efficient option? Or is there a better alternative?
I want to save and access a table-like data structure in HDFS with MapReduce programming. Part of this data structure is shown in the picture below. It has tens of thousands of columns and hundreds of rows, and all nodes should have access to it.
My question is: how can I save this data structure in HDFS and access it with MapReduce? Should I use arrays, Hive tables, or HBase?
Thank you.
HDFS is a distributed file system that stores your large files across distributed servers.
You can copy files from the local file system to HDFS with the command
hadoop fs -copyFromLocal /source/local/path destination/hdfs/path
Once the copy completes, an external Hive table can be created on destination/hdfs/path.
This table can then be queried using the Hive shell.
Do consider Hive for this scenario. If you want to do table-style processing, as you would with a SAS dataset, an R data.frame/data.table, or a Python pandas DataFrame, an equivalent operation is almost always possible in SQL. Hive provides a powerful SQL abstraction over the MapReduce and Tez engines. If you want to graduate to Spark at some point, you can read Hive tables into DataFrames (a sketch follows below). As @sumit pointed out, you just need to transfer your data from local to HDFS (using the HDFS copyFromLocal or put command) and define an external Hive table on top of it.
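For example, a minimal Scala sketch of reading a Hive table into a Spark DataFrame with Spark 2.x (the database and table names are hypothetical):

import org.apache.spark.sql.SparkSession

// Hive support must be available on the cluster (hive-site.xml on the classpath).
val spark = SparkSession.builder()
  .appName("ReadHiveTable")
  .enableHiveSupport()
  .getOrCreate()

// "mydb.my_table" is a placeholder for the external table defined over your HDFS data.
val df = spark.sql("SELECT * FROM mydb.my_table")
df.show(10)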
If you want to write custom MapReduce code on this data, access the Hive table's underlying files (most likely under /user/hive/warehouse). After reading the data from stdin, parse it in the mapper (the field separator can be found using DESCRIBE EXTENDED <hive_table>) and emit key-value pairs.
I am dealing with a problem: I want to build a data visualization and prediction infrastructure.
I am considering Kibana + Elasticsearch on HDFS (with ES-Hadoop), and Spark (Python) on HDFS for modeling.
My question is: can I properly index data in HDFS with Elasticsearch, or should I put Hive or Spark between Elasticsearch and HDFS?
I don't know which architecture is the best way to go.
ES-Hadoop will allow you to index data in HDFS directly with Elasticsearch. If you need to manipulate the data on its way from HDFS to ES, for example, performing lookups or filtering out data based on some criteria, you could use a tool like StreamSets Data Collector - see the blog post for a bit more detail.
Full disclosure - I'm the community champion at StreamSets.
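For reference, a minimal Scala sketch of indexing HDFS data into Elasticsearch through the ES-Hadoop Spark connector (assuming the elasticsearch-spark jar is on the classpath; the path, host, and index names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HdfsToEs").getOrCreate()

// Read the raw data from HDFS (hypothetical path and format).
val df = spark.read.json("hdfs:///data/events")

// Write it to Elasticsearch via the ES-Hadoop data source.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host:9200")
  .option("es.resource", "events/doc")   // index/type
  .mode("append")
  .save()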
If your question is about the performance difference between indexing in Hive and in Hadoop, there will not be any difference. Even with Hive, the data is stored in HDFS and can be accessed through external Hive tables. How you want to use the indexes will determine your choice. Hive provides structure on top of the data, and you can apply many built-in functions to operate on it.
I am working on a use case where I have to transfer data from an RDBMS to HDFS. We have benchmarked this case using Sqoop and found that we can transfer around 20 GB of data in 6-7 minutes.
Whereas when I try the same with Spark SQL, the performance is very low (1 GB of records takes 4 minutes to transfer from Netezza to HDFS). I am trying to do some tuning to improve the performance, but it is unlikely I can tune it to Sqoop's level (around 3 GB of data per minute).
I agree that Spark is primarily a processing engine, but my main question is this: both Spark and Sqoop use a JDBC driver internally, so why is there such a big difference in performance (or maybe I am missing something)? I am posting my code here.
import org.apache.spark.{SparkConf, SparkContext}

object helloWorld {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname").option("dbtable", "POC_TEST")
      .option("user", "user").option("password", "password").option("driver", "org.netezza.Driver")
      .option("numPartitions", "14").option("lowerBound", "0").option("upperBound", "13")
      .option("partitionColumn", "id").option("fetchSize", "100000")
      .load()
      .registerTempTable("POC")
    val df2 = sqlContext.sql("select * from POC")
    val partitioner = new org.apache.spark.HashPartitioner(14)
    val rdd = df2.rdd.map(x => (String.valueOf(x.get(1)), x)).partitionBy(partitioner).values
    rdd.saveAsTextFile("hdfs://Hostname/test")
  }
}
I have checked many other posts but could not find a clear answer on the internal working and tuning of Sqoop, nor a Sqoop vs. Spark SQL benchmark. Kindly help me understand this issue.
You are using the wrong tools for the job.
Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see --num-mappers), and each will extract a part of the dataset. I don't think you can achieve this kind of read parallelism with Spark.
Get the dataset with Sqoop and then process it with Spark.
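For instance, once Sqoop has landed the table on HDFS as delimited text, picking it up in Spark could look like this minimal sketch (the path, delimiter, and field positions are hypothetical; sc is an existing SparkContext):

// Sqoop writes comma-delimited text files by default.
val imported = sc.textFile("hdfs:///user/etl/sqoop_out/POC_TEST")
val rows = imported.map(_.split(","))

// Continue processing in Spark, e.g. a simple count per value of the first field.
val countsByKey = rows.map(r => (r(0), 1L)).reduceByKey(_ + _)
countsByKey.saveAsTextFile("hdfs:///user/etl/poc_counts")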
You can try the following:
Read the data from Netezza without any partitions and with fetchSize increased to a million:
sqlContext.read.format("jdbc")
  .option("url", "jdbc:netezza://hostname:port/dbname").option("dbtable", "POC_TEST")
  .option("user", "user").option("password", "password").option("driver", "org.netezza.Driver")
  .option("fetchSize", "1000000")
  .load().registerTempTable("POC")
Repartition the data before writing the final file:
val df3 = df2.repartition(10) //to reduce the shuffle
The ORC format is more optimized than plain text. Write the final output as Parquet/ORC:
df3.write.format("orc").save("hdfs://Hostname/test")
@amitabh
Although marked as an answer, I disagree with it.
Once you provide the partitioning options while reading over JDBC, Spark will run a separate task for each partition. In your case the number of tasks should be 14 (you can confirm this in the Spark UI).
I notice that you are using local as the master, which provides only one core for the executor. Hence there is no parallelism, which is what is happening in your case.
Now, to get the same throughput as Sqoop, you need to make sure that these tasks run in parallel. Theoretically this can be done either by:
1. Using 14 executors with 1 core each
2. Using 1 executor with 14 cores (other end of the spectrum)
Typically I would go with 4-5 cores per executor, so I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for one core for the driver running in cluster mode).
Use spark.executor.cores and spark.executor.instances in sparkConf.set to play with these configurations, as sketched below.
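A minimal sketch of setting these properties (the values are illustrative and assume cluster mode; tune them against what you see in the Spark UI):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Netezza_Connection")
  .set("spark.executor.instances", "3")   // 3 executors ...
  .set("spark.executor.cores", "5")       // ... with 5 cores each, following the reasoning above
  .set("spark.executor.memory", "4g")     // adjust after observing the job
val sc = new SparkContext(conf)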
If this does not significantly increase performance, the next thing would be to look at the executor memory.
Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
I had the same problem: the piece of code you are using does not work for partitioning.
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")
You can check the number of partitions created in your Spark job with:
df.rdd.partitions.length
You can use the following code to connect to the database:
sqlContext.read.jdbc(url = db_url,
  table = tableName,
  columnName = "ID",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = numPartitions,
  connectionProperties = connectionProperties)
To optimize your Spark job, the following are the relevant parameters:
1. number of partitions
2. --num-executors
3. --executor-cores
4. --executor-memory
5. --driver-memory
6. fetch size
Options 2, 3, 4, and 5 depend on your cluster configuration.
You can monitor your Spark job in the Spark UI.
Sqoop and Spark SQL both use JDBC connectivity to fetch the data from RDBMS engines but Sqoop has an edge here since it is specifically made to migrate the data between RDBMS and HDFS.
Every single option available in Sqoop has been fine-tuned to get the best performance while doing the data ingestions.
You can start with the -m (or --num-mappers) option, which controls the number of mappers.
This is what you need in order to fetch data from the RDBMS in parallel. Can the same be done in Spark SQL?
Of course, but the developer would need to take care of the "multithreading" that Sqoop handles automatically.
The solution below helped me:
var df = spark.read.format("jdbc").option("url", "url")
  .option("user", "user").option("password", "password")
  .option("dbtable", "dbTable").option("fetchSize", "10000").load()
df.registerTempTable("tempTable")
var dfRepart=spark.sql("select * from tempTable distribute by primary_key") //this will repartition the data evenly
dfRepart.write.format("parquet").save("hdfs_location")
Apache Sqoop is retired now - https://attic.apache.org/projects/sqoop.html
Using Apache Spark is a good option. This link shows how Spark can be used instead of Sqoop - https://medium.com/zaloni-engineering/apache-spark-vs-sqoop-engineering-a-better-data-pipeline-ef2bcb32b745
Otherwise, one can choose a cloud service such as Azure Data Factory or Amazon Redshift.
I am new to Hadoop. I ran a MapReduce job on my data and now I want to query the output so I can display it on my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts the SQL statements into a bunch of MapReduce jobs with stages built in between. Also, Hive is a high-latency system, i.e. depending on your dataset sizes you are looking at minutes to hours (or even days) to process a complicated query.
So, if you want to serve the results of your MapReduce job output on your website, it is highly recommended that you export the results back to an RDBMS using Sqoop and serve them from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option you could consider is a NoSQL system such as HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data in and out of your HDFS cluster. The video is easy to watch and describes the advantages and disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to HDFS data. Sits on top of HDFS and applies structure to the data. Allows random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes only on the row key. Does not use the MapReduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you will benefit most from it if your data is already organized (for instance, in columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the occurrence of the word "gloubiboulga". You can achieve this simply with the following code:
CREATE TABLE data
(
word STRING,
count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, count from data where word = "gloubiboulga";
Please note that while this language looks very much like SQL, you will still have to learn a few of its specifics.