Extremely poor performance with Tableau + Spark + Cassandra
I am currently investigating the possibility of using Cassandra in combination with Spark and Tableau for data analysis. However, the performance I am seeing with this setup is so poor that I cannot imagine using it in production. Since I keep reading about how great the performance of Cassandra + Spark is supposed to be, I am obviously doing something wrong, yet I cannot find out what.
My test data:
All data is stored on a single node
Queries are performed on a single table holding 50 MB of interval data
Columns used in selection criteria have an index on them
My test setup:
MacBook 2015, 1.1 GHz, 8GB memory, SSD, OS X El Capitan
Virtual Box, 4GB memory, Ubuntu 14.04
Single node with DataStax Enterprise 4.8.4:
Apache Cassandra 2.1.12.1046
Apache Spark 1.4.2.2
Spark Connector 1.4.1
Apache Thrift 0.9.3
Hive Connector 0.2.11
Tableau (Connected through ODBC)
Findings:
When a change in Tableau requires loading data from the database, it takes anywhere between 40 seconds and 1.4 minutes to retrieve the data (which is basically unworkable)
When I use Tableau with Oracle instead of Cassandra + Spark, on the same VirtualBox VM, I get results almost instantaneously
Here is the table definition used for the queries:
CREATE TABLE key.activity (
    interval timestamp,
    id bigint,
    activity_name text,
    begin_ts timestamp,
    busy_ms bigint,
    container_code text,
    duration_ms bigint,
    end_location_code text,
    end_ts timestamp,
    pallet_code text,
    src_location_code text,
    start_location_code text,
    success boolean,
    tgt_location_code text,
    transporter_name text,
    PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"ALL"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activity_activity_name_idx ON key.activity (activity_name);
CREATE INDEX activity_success_idx ON key.activity (success);
CREATE INDEX activity_transporter_name_idx ON key.activity (transporter_name);
Here is an example of a query produced by Tableau:
INFO 2016-02-10 20:22:21 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Running query 'SELECT CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END AS `calculation_185421691185008640`,
AVG(CAST(`activity`.`busy_ms` AS DOUBLE)) AS `avg_busy_ms_ok`,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT) AS `qr_interval_ok`,
`activity`.`transporter_name` AS `transporter_name`,
YEAR(`activity`.`interval`) AS `yr_interval_ok`
FROM `key`.`activity` `activity`
GROUP BY CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT),
`activity`.`transporter_name`,
YEAR(`activity`.`interval`)'
Here is an example of the statistics for a query that took 52 seconds to complete:
[Screenshot: Spark statistics for a query taking 52 seconds to complete]
I've tried playing around with the partition keys as suggested in other posts, but did not see a significant difference. I've also tried enabling row caching (both in the Cassandra config and as a table property), but this had no effect either (although perhaps I have overlooked something there).
I would have expected performance at least 10x-20x better out of the box, even without fiddling with all these parameters, and I've run out of ideas about what to do.
What am I doing wrong? What performance should I expect?
Answering your questions will not be easy due to the variables you do not define in your post. You mention that the data is stored on one node, which is fine, but you don't describe how you have structured your tables/column families. You also don't mention the Cassandra cache hit ratios. You also have to consider Cassandra compaction: if compaction is running during heavy read/write operations, it will slow things down.
You also appear to have a single SSD, in which case the data directory and the commitlog and cache directories are on the same physical drive. Even though it is not a spinning disk, you will see degraded performance unless you split the data directory from the commitlog/cache directories. I saw a 50% increase in performance by splitting the data directory onto its own physical SSD.
Lastly, you're running in a VM on a laptop host, in VirtualBox no less. Your largest bottleneck here is the 1.1 GHz CPU. In my Cassandra environments on VMware, while running medium jobs I see almost 99% CPU use across 4 x 2 cores with 16 GB RAM. My data directories are on SSDs while my commitlog and cache directories are on a magnetic HDD. I get good performance, but I tuned my environments to get to this point, and I accept the latency my non-production environments provide.
Take a look HERE and try to get a better understanding of how Cassandra should be used and how to achieve better performance out of the box. Distributed systems are just that: distributed, and for a reason. They provide shared resources that you don't have available on a single machine.
Hope this explains a little more about where you're headed.
EDIT
Your table definition looks fine. Are you using the Tableau Spark connector? Your performance problem is likely on the Cassandra/Spark side of things.
Take a look at this article, which describes a compaction-related problem with reading from the cache. In Cassandra releases prior to 2.1.2, the cache was lost after compaction because Cassandra threw the file (and its cache) away once the compaction finished. As soon as you started reading, you immediately got a cache miss and Cassandra went back to disk. This is fixed in releases from 2.1.2 onward. Everything else looks normal with respect to running Spark/Cassandra.
While the query time does seem a little high, there's a few things I see that could cause issues.
I noticed you're using a MacBook. A beautiful computer, but not ideal for Spark. I believe those use the dual-core Intel Core M processors. If you go to your Spark Master UI, it'll show your available cores. It might show 4 (to include vCPUs).
The way you are running this query doesn't allow for much parallelism (if any). You basically don't get the advantages of Spark in this case because you're running in an extremely small VM on a single node (with limited CPUs). Visualization tools haven't really caught up to Spark yet.
One other thing to keep in mind is that Spark is not designed as an 'ad hoc query' tool. You can think of SparkSQL as an abstraction over proper Spark batch processing. Comparing it to Oracle at this scale won't yield the results you expect. There's a 'minimum' performance threshold that you'll notice with Spark. Once you scale data and nodes far enough, you'll see that time to completion does not grow linearly with the size of the data: as you add more data, the time to process remains relatively flat.
I suggest trying that query in the SparkSQL REPL (dse spark-sql) and seeing whether you get similar times. If you do, then you know that's the best you'll get with your current setup. If Tableau is MUCH slower than the REPL, I'd guess the problem is on their end at that point.
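As a rough sketch of that check (my addition, not part of the original answer), the same aggregation can also be timed from the Scala shell (dse spark), where DSE preconfigures a HiveContext as sqlContext. The query below is a simplified version of the Tableau-generated one:

// Minimal timing sketch; assumes the DSE-provided sqlContext (HiveContext)
// and the key.activity table from the question.
val start = System.nanoTime()
val rows = sqlContext.sql("""
  SELECT `transporter_name`,
         AVG(CAST(`busy_ms` AS DOUBLE)) AS avg_busy_ms,
         YEAR(`interval`) AS yr_interval
  FROM `key`.`activity`
  GROUP BY `transporter_name`, YEAR(`interval`)
""").collect()
val elapsedMs = (System.nanoTime() - start) / 1000000
println(s"Returned ${rows.length} rows in $elapsedMs ms")

If this takes on the order of the 40s-1.4min seen in Tableau, the bottleneck is the Spark/Cassandra layer rather than the ODBC path.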
Related
jdbc write to greenplum/postgres issue
Spark JDBC writes are giving us nightmares for data with many columns (400 columns and 200 rows), and even with fewer columns and more rows it takes quite long (200k records take 30 to 60 minutes). We don't have a primary key for partitioning, so we will use a somewhat relevant key (reading from JDBC and transformations are no problem; only writing is a problem).

Spark cluster configuration: 1 master, 2 workers, 8 cores and 32 GB each. spark-submit parameters: 'executor_cores': 2, 'executor_memory': '2G', 'num_executors': 2, 'driver_memory': '2G'.

I tried the approaches below, as per other Stack Overflow suggestions:

df.write.format('jdbc').options(url=url, driver=driver, dbtable=table, user=user, password=password, batchsize=20000, rewriteBatchedStatements=true).mode(mode).save()

df.repartition(15).write.format('jdbc').options(url=url, driver=driver, dbtable=table, user=user, password=password, batchsize=20000, rewriteBatchedStatements=true).mode(mode).save()

Writing to MySQL works fine. Writing to Greenplum and Postgres is the issue (verified in both). I couldn't find many options.
After trial and error I found some parameters that helped improve performance. For Postgres, reWriteBatchedInserts=true should be used instead of rewriteBatchedStatements=true (which is for MySQL only). This helps a lot in terms of performance. Reducing the batchsize also helped when writing to the database; with trial and error you can identify a suitable value for your environment.
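As a minimal sketch of those corrected options (the URL, table name, and credentials are placeholders, and the Scala API is used here rather than the question's PySpark), the write could look like this; reWriteBatchedInserts is a PostgreSQL JDBC driver parameter that Spark passes through to the connection:

// Hypothetical Postgres/Greenplum write; connection details are placeholders.
// Given an existing DataFrame df:
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/dbname") // placeholder URL
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "target_table")                   // placeholder table
  .option("user", "user")
  .option("password", "password")
  .option("reWriteBatchedInserts", "true") // Postgres counterpart of MySQL's rewriteBatchedStatements
  .option("batchsize", "5000")             // smaller than 20000; tune by trial and error
  .mode("append")
  .save()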
hive data processing taking longer time than expected
I'm facing an issue with ORC data in Hive and need some suggestions if someone has faced a similar problem. I have huge data stored in a Hive table (partitioned and ORCed); the ORC data size is around 4 TB. I'm trying to copy this data to an uncompressed normal Hive table (same table structure).

The process runs forever and occupies a huge amount of non-DFS storage in the pursuit. At present the process has been running for 12 hours and has occupied 130 TB of non-DFS storage, which is very abnormal for a Hadoop cluster with 20 servers.

My parameters:
Hadoop distribution: HDP 2.4
Hive: 0.13
Number of servers: 20 (2 NameNodes included)

I wonder what a simple join or a normal analytics operation on this ORCed table would do, since the theory says the ORC format increases performance for basic DML queries. Can someone please let me know whether I'm doing something wrong or whether this is normal behavior? This is my first experience with ORCed data. For starters, I saw that the YARN log files are being created in huge sizes, mostly containing only error logs. Thanks
How to make cached data from a finished Spark job still accessible to other jobs?
My project implements an interactive query for users to explore data: users choose from a list of columns, add columns to the list, and press 'view data'. The data is currently stored in Cassandra and we use Spark SQL to query it. The data flow is: a raw log is processed by Spark and stored into Cassandra. The data is time series with more than 20 columns and 4 metrics. Because I put more than 20 dimensions into the clustering key, writes to Cassandra are quite slow.

The idea is to load all the data from Cassandra into Spark, cache it in memory, and provide an API to the client that runs queries against the Spark cache. But I don't know how to keep that cached data persistent. I have tried spark-job-server, which has a feature called shared objects, but I am not sure it works.

We can provide a cluster with more than 40 CPU cores and 100 GB RAM, and we estimate the data to query at about 100 GB.

What I have already tried:

Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data, Spark first reads from Alluxio, which takes more than 1 minute, and then stores to disk (Spark shuffle), which costs another 2 or 3 minutes. That exceeds our target of under 1 minute. We tested one job on 8 CPU cores.

Storing the data in MemSQL, which is kind of costly: one day costs 2 GB of RAM, and I am not sure the speed stays good as we scale.

Querying Cassandra directly, but Cassandra does not support GROUP BY.

So what I really want to know is whether my direction is right, and what I can change to achieve the goal (queries like MySQL with a lot of GROUP BY, SUM, ORDER BY) returned to the client via an API.
If you explicitly call cache or persist on a DataFrame, it will be kept in memory (and/or on disk, depending on the storage level you choose) until the context is shut down. This also holds for sqlContext.cacheTable. So, since you are using Spark JobServer, you can create a long-running context (via REST or at server start-up) and use it for multiple queries on the same dataset, because the data will stay cached until the context or the JobServer service shuts down. With this approach, however, you should make sure you have a good amount of memory available for the context; otherwise Spark will spill a large portion of the data to disk, which will have some impact on performance. Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but it is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query the table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as those jobs are executed on the same context.
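A minimal sketch of that temp-table approach, assuming a long-running context owned by JobServer (the keyspace/table names are placeholders borrowed from the original question, and the data source string is the one used by the Spark Cassandra Connector 1.x):

// Job A: load from Cassandra once, register, and cache.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "key", "table" -> "activity")) // placeholder names
  .load()
df.registerTempTable("activity_cached")
sqlContext.cacheTable("activity_cached")

// Job B, submitted later to the SAME context: served from the in-memory cache.
val summary = sqlContext.sql(
  "SELECT transporter_name, SUM(busy_ms) FROM activity_cached GROUP BY transporter_name")

Note that cacheTable is lazy: the first query after it pays the load cost, and subsequent jobs on the same context read from memory.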
Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to hdfs
I am working on a use case where I have to transfer data from an RDBMS to HDFS. We benchmarked this case using Sqoop and found that we can transfer around 20 GB of data in 6-7 minutes. When I try the same with Spark SQL, the performance is much lower (1 GB of records takes 4 minutes to transfer from Netezza to HDFS). I am trying to tune it and increase its performance, but it's unlikely I can tune it to the level of Sqoop (around 3 GB of data in 1 minute).

I accept that Spark is primarily a processing engine, but my main question is this: both Spark and Sqoop use the JDBC driver internally, so why is there such a difference in performance (or am I missing something)? Here is my code:

import org.apache.spark.{SparkConf, SparkContext}

object helloWorld {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("numPartitions", "14")
      .option("lowerBound", "0")
      .option("upperBound", "13")
      .option("partitionColumn", "id")
      .option("fetchSize", "100000")
      .load()
      .registerTempTable("POC")
    val df2 = sqlContext.sql("select * from POC")
    val partitioner = new org.apache.spark.HashPartitioner(14)
    val rdd = df2.rdd.map(x => (String.valueOf(x.get(1)), x)).partitionBy(partitioner).values
    rdd.saveAsTextFile("hdfs://Hostname/test")
  }
}

I have checked many other posts but could not get a clear answer on the internal workings and tuning of Sqoop, nor did I find a Sqoop vs Spark SQL benchmark. Kindly help me understand this issue.
You are using the wrong tool for the job. Sqoop launches a slew of processes (on the datanodes) that each make a connection to your database (see --num-mappers), and each extracts a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark. Get the dataset with Sqoop and then process it with Spark.
You can try the following:

Read the data from Netezza without any partitions and with fetchSize increased to a million:

sqlContext.read.format("jdbc")
  .option("url", "jdbc:netezza://hostname:port/dbname")
  .option("dbtable", "POC_TEST")
  .option("user", "user")
  .option("password", "password")
  .option("driver", "org.netezza.Driver")
  .option("fetchSize", "1000000")
  .load()
  .registerTempTable("POC")

Repartition the data before writing it to the final file:

val df3 = df2.repartition(10) // to reduce the shuffle

ORC is more optimized than text; write the final output to Parquet/ORC:

df3.write.format("orc").save("hdfs://Hostname/test")
@amitabh Although this is marked as an answer, I disagree with it. Once you give the predicate to partition the data while reading from JDBC, Spark will run separate tasks for each partition. In your case the number of tasks should be 14 (you can confirm this in the Spark UI).

I notice that you are using local as the master, which provides only 1 core for executors. Hence there is no parallelism, which is what is happening in your case. To get the same throughput as Sqoop, you need to make sure these tasks run in parallel. Theoretically this can be done either by using 14 executors with 1 core each, or by using 1 executor with 14 cores (the other end of the spectrum).

Typically I would go with 4-5 cores per executor, so I would test performance with 15/5 = 3 executors (I added 1 to 14 to account for 1 core for the driver when running in cluster mode). Use spark.executor.cores and spark.executor.instances in sparkConf.set to play with the configs, as in the sketch below. If this does not significantly increase performance, the next thing to look at is executor memory. Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes, and shuffle sizes.
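As a sketch of that executor layout (3 executors x 5 cores; the memory value is a placeholder, and spark.executor.instances assumes a YARN cluster), the configuration could be set programmatically like this, or passed to spark-submit as --num-executors/--executor-cores/--executor-memory:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical config: 15 total cores split into 3 executors of 5 cores each.
val conf = new SparkConf()
  .setAppName("Netezza_Connection")       // no setMaster("local"): submit to the cluster instead
  .set("spark.executor.instances", "3")
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "4g")     // starting point; revisit after checking the Spark UI
val sc = new SparkContext(conf)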
I had the same problem: the piece of code you are using does not partition the read.

sqlContext.read.format("jdbc")
  .option("url", "jdbc:netezza://hostname:port/dbname")
  .option("dbtable", "POC_TEST")
  .option("user", "user")
  .option("password", "password")
  .option("driver", "org.netezza.Driver")
  .option("numPartitions", "14")
  .option("lowerBound", "0")
  .option("upperBound", "13")
  .option("partitionColumn", "id")
  .option("fetchSize", "100000")
  .load()
  .registerTempTable("POC")

You can check the number of partitions created in your Spark job with df.rdd.partitions.length. You can use the following code to connect to the database:

sqlContext.read.jdbc(url = db_url,
  table = tableName,
  columnName = "ID",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = numPartitions,
  connectionProperties = connectionProperties)

To optimize your Spark job, tune the following parameters: 1. number of partitions, 2. --num-executors, 3. --executor-cores, 4. --executor-memory, 5. --driver-memory, 6. fetch-size. Options 2, 3, 4, and 5 depend on your cluster configuration; you can monitor your Spark job in the Spark UI.
Sqoop and Spark SQL both use JDBC connectivity to fetch data from RDBMS engines, but Sqoop has an edge here since it is specifically made to migrate data between RDBMSes and HDFS. Every single option available in Sqoop has been fine-tuned to get the best performance during data ingestion. You can start with the option -m, which controls the number of mappers: this is what you need in order to fetch data in parallel from the RDBMS. Can you do the same in Spark SQL? Of course, but the developer has to take care of the 'multithreading' that Sqoop handles automatically.
The solution below helped me:

var df = spark.read.format("jdbc")
  .option("url", url)
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "dbTable")
  .option("fetchSize", "10000")
  .load()
df.registerTempTable("tempTable")
var dfRepart = spark.sql("select * from tempTable distribute by primary_key") // this will repartition the data evenly
dfRepart.write.format("parquet").save("hdfs_location")
Apache Sqoop is retired now: https://attic.apache.org/projects/sqoop.html. Using Apache Spark is a good option. This link shows how Spark can be used instead of Sqoop: https://medium.com/zaloni-engineering/apache-spark-vs-sqoop-engineering-a-better-data-pipeline-ef2bcb32b745. Otherwise, one can choose a cloud service such as Azure Data Factory or Amazon Redshift.
How to store Hadoop data into Oracle
My final table is in Hive (HDFS). To load the data into an SQL database I have tried 1) Sqoop, 2) SQL*Loader, and 3) OraOop. The performance of all of them is very discouraging: we have to import a 1 TB file, and 1 GB takes over 8 minutes overall (1,297,372,920 rows) on a 5-node cluster, with Sqoop, OraOop, and SQL*Loader alike.
Your Sqoop export speed to Oracle will be determined by various factors, including data size/characteristics, network performance, and, perhaps most importantly, the target database server's configuration. Since the current release of Sqoop doesn't allow the use of "direct" mode when exporting data to Oracle, the optimizations available in this use case are limited. I'd strongly encourage you to review the documentation (http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_export_literal) and try to get yourself into a position where you can work with incremental imports/exports, since you're displeased with the latency on your 1 TB dataset. Perhaps go with an initial full load of your entire desired dataset and find a way to update only incrementally from there.