Elasticsearch-hadoop & Elasticsearch-spark SQL - tracing of scan & scroll statements - elasticsearch

We are trying to integrate ES (1.7.2, 4-node cluster) with Spark (1.5.1, compiled with Hive and Hadoop with Scala 2.11, 4-node cluster). HDFS also comes into the equation (Hadoop 2.7, 4 nodes), along with the Thrift JDBC server and elasticsearch-hadoop-2.2.0-m1.jar.
There are thus two ways of executing a statement on ES.
Spark SQL with scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

val conf = new SparkConf().setAppName("QueryRemoteES").setMaster("spark://node1:37077").set("spark.executor.memory", "2g")
conf.set("spark.logConf", "true")
conf.set("spark.cores.max", "20")
conf.set("es.index.auto.create", "false")
conf.set("es.batch.size.bytes", "100mb")
conf.set("es.batch.size.entries", "10000")
conf.set("es.scroll.size", "10000")
conf.set("es.nodes", "node2:39200")
conf.set("es.nodes.discovery", "true")
conf.set("pushdown", "true")

val sc = new SparkContext(conf)
sc.addJar("executorLib/elasticsearch-hadoop-2.2.0-m1.jar")
sc.addJar("executorLib/scala-library-2.10.1.jar")

val sqlContext = new SQLContext(sc)
sqlContext.sql("CREATE TEMPORARY TABLE geoTab USING org.elasticsearch.spark.sql OPTIONS (resource 'geo_2/kafkain')")
val all: DataFrame = sqlContext.sql("SELECT count(*) FROM geoTab WHERE transmittersID='262021306841042'")
.....
Thrift server (code executed on Spark)
....
import com.mchange.v2.c3p0.ComboPooledDataSource

polledDataSource = new ComboPooledDataSource()
polledDataSource.setDriverClass("org.apache.hive.jdbc.HiveDriver")
polledDataSource.setJdbcUrl("jdbc:hive2://node1:30001")
polledDataSource.setMaxPoolSize(5)
dbConnection = polledDataSource.getConnection
dbStatement = dbConnection.createStatement
val dbResult = dbStatement.execute("CREATE TEMPORARY EXTERNAL TABLE IF NOT EXISTS geoDataHive6(transmittersID STRING,lat DOUBLE,lon DOUBLE) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'geo_2/kafkain','es.query'='{\"query\":{\"term\":{\"transmittersID\":\"262021306841042\"}}}','es.nodes'='node2','es.port'='39200','es.nodes.discovery' = 'false','es.mapping.include' = 'trans*,point.*','es.mapping.names' = 'transmittersID:transmittersID,lat:point.lat,lon:point.lon','pushdown' = 'true')")
dbStatement.setFetchSize(50000)
dbResultSet = dbStatement.executeQuery("SELECT count(*) FROM geoDataHive6")
.....
I have the following issues, and due to the fact that they are connected, I have decided to pack them into one question on Stack:
It seems that the Spark SQL method supports pushdown of what goes behind WHERE (whether es.query is specified or not); execution time is the same and acceptable. But solution number 1 definitely does not support pushdown of aggregating functions, i.e. the presented count(*) is not executed on the ES side, but only after all data are retrieved: ES returns the rows and Spark SQL counts them. Please confirm whether this is correct behaviour.
Solution number one behaves strangely in that whether pushdown is passed as true or false, the execution time is the same.
Solution number 2 seems to support no pushdown. It does not matter how I try to specify the sub-query, be it as part of the table definition or in the WHERE clause of the statement; it seems it just fetches the whole huge index and then does the maths on it. Is it the case that Thrift-Hive is not able to do pushdown to ES?
I'd like to trace the queries in Elasticsearch, so I set the following:
# logging.yml
logger:
  index.search.slowlog: TRACE, index_search_slow_log_file
  index.indexing.slowlog: TRACE, index_indexing_slow_log_file

additivity:
  index.search.slowlog: true
  index.indexing.slowlog: true
All of index.search.slowlog.threshold.query, index.search.slowlog.threshold.fetch and even index.indexing.slowlog.threshold.index are set to 0ms.
And I do see common statements executed from Sense in the slowlog file (so the logging works). But I don't see the Spark SQL or Thrift statements executed against ES. I suppose these are scan & scroll statements, because if I execute a scan & scroll from Sense, it is not logged either. Is it possible somehow to trace scan & scroll on the ES side?

As far as I know, this is expected behavior. All sources I know behave exactly the same way, and intuitively it makes sense. SparkSQL is designed for analytical queries, and it makes more sense to fetch the data, cache it and process it locally. See also Does spark predicate pushdown work with JDBC?
I don't think that conf.set("pushdown", "true") has any effect at all. If you want to configure connection-specific settings, they should be passed as an OPTIONS map, as in the second case. Using the es prefix should work as well.
This is strange indeed. Martin Senne reported a similar issue with PostgreSQL, but I couldn't reproduce it.
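For illustration, a hypothetical sketch of moving the settings into the table declaration itself; the resource and host are taken from the question, but the exact option-key syntax (quoting of dotted keys such as es.nodes) varies between Spark and connector versions, so check the ES-Hadoop configuration docs for yours:

```sql
-- sketch: connector settings declared per table instead of on SparkConf
CREATE TEMPORARY TABLE geoTab
USING org.elasticsearch.spark.sql
OPTIONS (resource 'geo_2/kafkain',
         nodes 'node2:39200',
         pushdown 'true')
```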

After a discussion I had with Costin Leau on the Elasticsearch discussion group, he pointed out the following, and I ought to share it with you:
There are a number of issues with your setup:
You mention using Scala 2.11 but are actually using Scala 2.10 libraries. Note that if you want to pick your Scala version, elasticsearch-spark should be used; elasticsearch-hadoop provides binaries for Scala 2.10 only.
The pushdown functionality is only available through Spark DataSource. If you are not using this type of declaration, the pushdown is not passed to ES (that's how Spark works). Hence declaring pushdown there is irrelevant.
Notice how all parameters in ES-Hadoop start with es.; the only exceptions are pushdown and location, which are Spark DataSource specific (following Spark conventions, as these are Spark-specific features in a dedicated DS).
Using a temporary table does count as a DataSource; however, you need to specify pushdown there. If you don't, it is activated by default, which is why you see no difference between your runs; you haven't changed any relevant parameter.
Count and other aggregations are not pushed down by Spark. There might be something in the future, according to the Databricks team, but there isn't anything currently. For count, you can do a quick call by using dataFrame.rdd.esCount. But it's an exceptional case.
I'm not sure whether Thrift server actually counts as a DataSource since it loads data from Hive. You can double check this by enabling logging on the org.elasticsearch.hadoop.spark package to DEBUG. You should see whether the SQL does get translated to the DSL.
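A minimal way to turn that on, assuming the stock Log4j 1.x setup that Spark 1.x ships with (the file location is deployment-specific), is a log4j.properties entry:

```
# conf/log4j.properties on the Spark driver/executors
log4j.logger.org.elasticsearch.hadoop.spark=DEBUG
```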
I hope this helps!

Related

Spark execution using jdbc

In a Spark DataFrame, suppose I fetch data from Oracle as below.
Will the query execute completely in Oracle? Assume the query is huge. Is it then an overhead for Oracle? Would a better approach be to read each filtered table's data into a separate DataFrame and join them using Spark SQL or the DataFrame API, so that the complete join happens in Spark? Can you please help with this?
df = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:1111",
    dbtable="(SELECT * FROM abc,bcd.... where abc.id= bcd.id.....) AS table1",
    user="test",
    password="******",
    driver="com.mysql.jdbc.Driver").load()
In general, actual data movement is the most time-consuming step and should be avoided. So, as a general rule, you want to filter as much as possible in the JDBC source (Oracle in your case) before the data is moved into your Spark environment.
Once you're ready to do some analysis in Spark, you can persist (cache) the result so as to avoid re-retrieving it from Oracle every time.
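Concretely, that usually means handing the JDBC reader a derived table instead of a bare table name, so the database executes the join/filter and ships only the result; a sketch with invented table and column names:

```sql
-- passed as the dbtable option; Oracle runs the join and returns only the result
(SELECT a.id, a.val
   FROM abc a
   JOIN bcd b ON a.id = b.id
  WHERE a.created >= DATE '2017-01-01') t
```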
That being said, @shrey-jakhmola is right: you want to benchmark for your particular circumstance. Is the Oracle environment choked somehow, perhaps?

Spark SQL partition awareness querying hive table

Given a Hive table partitioned by some_field (of int type), with data stored as Avro files, I want to query the table using Spark SQL in such a way that the returned DataFrame is already partitioned by some_field (the field used for partitioning).
Query looks like just
SELECT * FROM some_table
By default Spark doesn't do that; the returned data_frame.rdd.partitioner is None.
One way to get the result is via explicit repartitioning after querying, but there is probably a better solution.
HDP 2.6, Spark 2.
Thanks.
First of all, you have to distinguish between the partitioning of a Dataset and the partitioning of the converted RDD[Row]. No matter what the execution plan of the former is, the latter won't have a Partitioner:
scala> val df = spark.range(100).repartition(10, $"id")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.rdd.partitioner
res1: Option[org.apache.spark.Partitioner] = None
However, the internal RDD might have a Partitioner:
scala> df.queryExecution.toRdd.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3)
This, however, is unlikely to help you here because, as of today (Spark 2.2), the Data Source API is not aware of the physical storage information (with the exception of simple partition pruning). This should change in the upcoming Data Source API. Please refer to the JIRA ticket (SPARK-15689) and the design document for details.

Why does one action produce two jobs?

I use Spark 2.1.0.
Why does the following single action produce 2 identical jobs (with the same DAG in each)? Shouldn't it produce just 1? Here is the code:
import spark.implicits._

case class Person(name: String, age: Int)

val path = "/usr/lib/spark/examples/src/main/resources/people.txt"
val peopleDF = spark.
  sparkContext.
  textFile(path, 4).
  map(_.split(",")).
  map(attr => Person(attr(0), attr(1).trim.toInt)).
  toDF
peopleDF.show()
I see this in the web UI when checking what is going on. I suppose it has something to do with the DataFrame transformations.
Although, in general, a single SQL query may lead to more than one Spark job, in this particular case Spark 2.3.0-SNAPSHOT gives only one (contrary to what you see).
Job 12 is also pretty nice, i.e. just a single-stage, no-shuffle Spark job.
The reason you see more than one Spark job per Spark SQL structured query (using SQL or the Dataset API) is that Spark SQL offers a high-level API atop RDDs and uses RDDs and actions freely to make your life as a Spark developer and a Spark performance-tuning expert easier. In most cases (esp. when you want to build abstractions), you'd have to fire up the Spark jobs yourself to achieve comparable performance.

Cassandra aggregate to Map

I am new to Cassandra; I've mainly been using Hive for the past several months. Recently I started a project where I need to do some of the things I did in Hive with Cassandra instead.
Essentially, I am trying to find a way to do an aggregate of multiple rows into a single map on query.
In Hive, I simply do a GROUP BY with a "map" aggregate. Does a way exist in Cassandra to do something similar?
Here is an example of a working hive query that does the task I am looking to do:
select
  map(
    "quantity", count(caseid),
    "title", casesubcat,
    "id", casesubcatid,
    "category", named_struct("id", casecatid, 'title', casecat)
  ) as casedata
from caselist
group by named_struct("id", casecatid, 'title', casecat), casesubcat, casesubcatid
Mapping query results to a Map (or some other type/structure/class of your choice) is the responsibility of the client application and is usually a trivial task (though you didn't specify in what context this map is going to be used).
The actual question here is about GROUP BY in Cassandra. This is not supported out of the box. You can check Cassandra's standard aggregate functions or try creating a user-defined function, but the Cassandra way is knowing your queries in advance, designing your schema accordingly, doing the heavy lifting in the write phase and keeping queries simplistic afterwards. Thus, grouping/aggregation can often be achieved by using dedicated counter tables.
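As a sketch of that client-side step (plain Scala, independent of any Cassandra driver; the row shape and field names are invented for illustration):

```scala
// Group flat result rows into a nested Map on the client,
// mimicking Hive's map() aggregate per category.
case class CaseRow(casecat: String, casesubcat: String, caseid: String)

def toCaseData(rows: Seq[CaseRow]): Map[String, Map[String, Long]] =
  rows.groupBy(_.casecat).map { case (cat, rs) =>
    // per category: sub-category -> number of cases
    cat -> rs.groupBy(_.casesubcat).map { case (sub, xs) => sub -> xs.size.toLong }
  }
```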
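A counter-table sketch of that write-time aggregation (hypothetical schema; the names mirror the Hive query):

```sql
-- increment on every case written; read back pre-aggregated
-- counts per (category, sub-category) without any GROUP BY
CREATE TABLE case_counts (
    casecatid    int,
    casesubcatid int,
    quantity     counter,
    PRIMARY KEY ((casecatid), casesubcatid)
);

UPDATE case_counts SET quantity = quantity + 1
 WHERE casecatid = 1 AND casesubcatid = 10;

SELECT casesubcatid, quantity FROM case_counts WHERE casecatid = 1;
```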
Another option is to do the data processing in an additional layer (Apache Spark, for example). Have you considered using Hive on top of Cassandra?

how to set spark RDD StorageLevel in hive on spark?

In my Hive-on-Spark job, I get this error:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
Thanks to this answer (Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?), I know my Hive-on-Spark job may have the same problem.
Since Hive translates SQL into a Hive-on-Spark job, I don't know how to set it in Hive so that its Hive-on-Spark job changes from StorageLevel.MEMORY_ONLY to StorageLevel.MEMORY_AND_DISK.
Thanks for your help!
You can use CACHE [LAZY] TABLE <table_name> / UNCACHE TABLE <table_name> to manage caching. More details here.
If you are using DataFrames, you can use persist(...) to specify the StorageLevel. Look at the API here.
In addition to setting the storage level, you can optimize other things as well. SparkSQL uses a different caching mechanism, called columnar storage, which is a more efficient way of caching data (as SparkSQL is schema-aware). There is a different set of config properties that can be tuned to manage caching, as described in detail here (this is the latest-version documentation; refer to the documentation of the version you are using).
spark.sql.inMemoryColumnarStorage.compressed
spark.sql.inMemoryColumnarStorage.batchSize
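For example, a sketch of tuning these from a SQL session before caching (the values shown are the documented defaults, and the table name is invented):

```sql
SET spark.sql.inMemoryColumnarStorage.compressed=true;
SET spark.sql.inMemoryColumnarStorage.batchSize=10000;
CACHE TABLE my_table;
```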

Resources