SearchPhaseExecutionException with Basic Embedded Elasticsearch - elasticsearch

I am trying out Elasticsearch with a basic Scala program, using the Java API:
import org.elasticsearch.client.Client
import org.elasticsearch.index.query.QueryBuilders.queryString
import org.elasticsearch.node.{Node, NodeBuilder}

object TestES {
  var node: Node = NodeBuilder.nodeBuilder.node
  var client: Client = node.client

  def insertDoc(id: String, doc: String) =
    client.prepareIndex("myindex", "test", id).
      setSource(doc).execute.actionGet

  def countHits(qry: String) =
    client.prepareSearch("myindex").setTypes("test").
      setQuery(queryString(qry)).execute.actionGet.
      getHits.getTotalHits

  def waitForGreen = client.admin.cluster.prepareHealth().
    setWaitForGreenStatus.execute.actionGet

  def main(args: Array[String]): Unit = {
    insertDoc("1", """{"foo":"bar"}""")
    //waitForGreen
    println(countHits("bar"))
    node.close
  }
}
This works, and the insert + query run in under one second. If I comment out the insert, I get the following exception:
Exception in thread "main" org.elasticsearch.action.search.SearchPhaseExecutionException:
Failed to execute phase [query], total failure;
shardFailures {[_na_][myindex][0]: No active shards}
If I enable the waitForGreen line, it works again, but takes over half a minute to run both lines.
This seems quite odd. Is inserting a document required before running a query, or is there a better way?

Inserting a document is not needed to run a query, but having the index created is required.
When you insert a document and the index doesn't exist, it will be automatically created.
If you want to avoid creating a document, you can create just the index using the API:
client.admin.indices.prepareCreate("myindex").execute.actionGet
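As for the half-minute delay: on a single-node embedded cluster the default replica shards can never be allocated, so waiting for green status blocks until the health request times out (30 seconds by default). Waiting for yellow status is usually sufficient in this setup. A minimal sketch, assuming the same embedded node and index name as above:
// Create the index explicitly, then wait for YELLOW instead of GREEN;
// on a single node the replicas stay unassigned, so green never arrives.
client.admin.indices.prepareCreate("myindex").execute.actionGet
client.admin.cluster.prepareHealth("myindex").
  setWaitForYellowStatus.execute.actionGet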

Related

Akka Streams with Alpakka indexing in ES: index name is only evaluated on starting execution

I've written some code with Akka Streams and Alpakka that reads from Amazon SQS and indexes the events in Elasticsearch. Everything works smoothly and the performance is awesome, but I have a problem with index names. I have this code:
class ElasticSearchIndexFlow(restClient: RestClient) {

  private val elasticSettings = ElasticsearchSinkSettings(bufferSize = 10)

  def flow: Flow[IncomingMessage[DomainEvent, NotUsed], Seq[IncomingMessageResult[DomainEvent, NotUsed]], NotUsed] =
    ElasticsearchFlow.create[DomainEvent](index, "domain-event", elasticSettings)(
      restClient,
      DomainEventMarshaller.domainEventWrites
    )

  private def index = {
    val now = DateTime.now()
    s"de-${now.getYear}.${now.getMonthOfYear}.${now.getDayOfMonth}"
  }
}
The problem is that after the flow has been running for some days, the index name does not change. I imagine that Akka Streams creates a fused actor under the hood, and that the index function that builds the index name is only evaluated once, when execution starts.
Any idea what I can do to index events in ES with an index name based on the current date?
The solution to the problem is to set the index name in a previous step with IncomingMessage.withIndexName.
So:
def flow: Flow[(DomainEvent, Message), IncomingMessage[DomainEvent, Message], NotUsed] =
  Flow[(DomainEvent, Message)].map {
    case (domainEvent, message) =>
      IncomingMessage(Some(domainEvent.eventId), domainEvent, message)
        .withIndexName(indexName(domainEvent.ocurredOn))
  }
And:
def flow: Flow[IncomingMessage[DomainEvent, NotUsed], Seq[IncomingMessageResult[DomainEvent, NotUsed]], NotUsed] =
  ElasticsearchFlow.create[DomainEvent]("this-index-name-is-not-used", "domain-event", elasticSettings)(
    restClient,
    DomainEventMarshaller.domainEventWrites
  )
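The indexName helper used above is not shown in the answer; a minimal sketch, mirroring the index method from the question (and assuming org.joda.time.DateTime, as there), could look like this:
// Hypothetical helper: builds the index name from the event's own date,
// so it is evaluated per element rather than once at stream materialization.
private def indexName(occurredOn: DateTime): String =
  s"de-${occurredOn.getYear}.${occurredOn.getMonthOfYear}.${occurredOn.getDayOfMonth}"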

How to stabilize spark streaming application with a handful of super big sessions?

I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID. I then perform some analytics at the session level to compute an anomaly score.
I can't stabilize my application because a handful of sessions keep growing at each batch interval for an extended period (more than 1 h). My understanding is that a single session (key-value pair) is always processed by a single core in Spark. I want to know if I am mistaken, and whether there is a solution to mitigate this issue and make the streaming application stable.
I am using Hadoop 2.7.2 and Spark 1.6.1 on YARN. Changing the batch interval, block interval, number of partitions, number of executors, and executor resources didn't solve the issue, as one single task always makes the application choke. However, filtering out those super long sessions did solve it.
Below is the updateState function I am using:
val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String, Any]], state: State[Seq[scala.collection.Map[String, Any]]]) => {
  val session = Seq(value.getOrElse(scala.collection.Map[String, Any]())) ++
    state.getOption.getOrElse(Seq[scala.collection.Map[String, Any]]())
  if (state.isTimingOut()) {
    Option(null)
  } else {
    state.update(session)
    Some((key, value, session))
  }
}
and the mapWithState call:
def updateStreamingState(inputDstream: DStream[scala.collection.Map[String, Any]]): DStream[(String, Option[scala.collection.Map[String, Any]], Seq[scala.collection.Map[String, Any]])] = {
  val spec = StateSpec.function(updateState)
  spec.timeout(Duration(sessionTimeout))
  spec.numPartitions(192)
  inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}
Finally, I apply a session feature computation for each DStream, as defined below:
def computeSessionFeatures(sessionId: String, sessionRecords: Seq[scala.collection.Map[String, Any]]): Session = {
  val features = Functions.getSessionFeatures(sessionizationFeatures, recordFeatures, sessionRecords)
  val resultSession = new Session(sessionId, sessionizationFieldName, sessionRecords)
  resultSession.features = features
  resultSession
}
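Since filtering out the super long sessions is what ultimately stabilised the job, here is a minimal sketch of how such a cap could be folded into the update function (maxSessionLength is a hypothetical threshold to be tuned to the data):
// Hypothetical cap: keep only the most recent maxSessionLength records per key,
// so a handful of huge sessions cannot pin a single task for the whole batch.
val maxSessionLength = 1000

val cappedUpdateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String, Any]], state: State[Seq[scala.collection.Map[String, Any]]]) => {
  val previous = state.getOption.getOrElse(Seq[scala.collection.Map[String, Any]]())
  val session = (Seq(value.getOrElse(scala.collection.Map[String, Any]())) ++ previous).take(maxSessionLength)
  if (state.isTimingOut()) {
    Option(null)
  } else {
    state.update(session)
    Some((key, value, session))
  }
}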

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)
    stream = streamingContext.textFileStream("file:///test/input")
    stream.foreachRDD(bulk_load)
    streamingContext.start()
    streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
    # ???
I've made some progress previously, with many and various errors (as documented here and here).
So after much trial and error, I present here the best I have come up with. It works well and successfully bulk loads data (using Puts or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
    # Your configuration will likely be different. Insert your own quorum, parent znode and table name.
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    # Split the input into individual lines, then convert each CSV line to key-value pairs
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)
    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
    cols = row.split(",")  # Split on commas
    # Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
    # Works well for n >= 1 columns
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
The value converter we defined earlier will convert these tuples into HBase Puts.
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;

public class GatewayApplication {

    public static void main(String[] args)
    {
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
from py4j.java_gateway import JavaGateway

def bulk_load(rdd):
    # The output class changes, everything else stays
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)\
                  .sortByKey(True)
    # Don't process empty RDDs
    if not load_rdd.isEmpty():
        # saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
        # (startTime is assumed to be defined elsewhere, e.g. a batch timestamp string)
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
        # The file has now been written, but HBase doesn't know about it
        # Get a link to Py4J
        gateway = JavaGateway()
        # Convert conf to a fully fledged Configuration type
        config = dict_to_conf(conf)
        # Set up our HTable
        htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
        # Set up our path
        path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
        # Get a bulk loader
        loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
        # Load the HFile
        loader.doBulkLoad(path, htable)
    else:
        print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
    gateway = JavaGateway()
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    # Copy every entry of the Python dict into the Hadoop Configuration
    for key, value in conf.items():
        config.set(key, value)
    return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data volume it is probably worth it, since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical key order. This is not always guaranteed, especially since lexically "10" < "9" even though numerically 10 > 9. If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # Sort in ascending order

Starting Activity Indicator while Running a database download in a background thread

I am running a database download in a background thread. The threads work fine, and I execute a group wait before continuing.
The problem I have is that I need to start an activity indicator, and it seems that it gets blocked because of the group wait.
Is there a way to run such a heavy process and ensure that all threads complete, while still allowing the activity indicator to animate?
I start the activity indicator with (I also tried starting the indicator without the dispatch_async):
dispatch_async(dispatch_get_main_queue(), {
    activityIndicator.startAnimating()
})
After which, I start the thread group:
let group: dispatch_group_t = dispatch_group_create()
let queue: dispatch_queue_t = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0) // also tried QOS_CLASS_BACKGROUND

while iter > 0 {
    iter--
    dispatch_group_enter(group)
    dispatch_group_async(group, queue, {
        do {
            print("in queue \(iter)")
            temp += try query.findObjects()
            query.skip += query.limit
        } catch let error as NSError {
            print("Fetch failed: \(error.localizedDescription)")
        }
        dispatch_group_leave(group)
    })
}
// Wait for all threads to finish and proceed
dispatch_group_wait(group, DISPATCH_TIME_FOREVER) // the "group wait" described above; timeout value assumed
As I am using Parse, I have modified the code as follows (pseudo code for simplicity):
trigger the activity indicator with startAnimating()
call the function that hits Parse
set an observer in the Parse class on an int to trigger an action when the value reaches 0
get count of new objects in Parse
calculate how many loop iterations I need to pull all the data (using max objects per query = 1000 which is Parse max)
while iterations > 0 {
    create a Parse query object
    set the query skip value
    use query.findObjectsInBackgroundWithBlock ({
        pull objects and add to a temp array
        observer--
    })
    iterations--
}
When the observer hits 0, trigger a delegate to return to the caller
Works like a charm.

How to speed up a simple count operation in SORM?

I'm trying to perform a simple count operation on a large table with SORM. It runs inexplicably slowly. Code:
case class Tests(datetime: DateTime, value: Double, second: Double)

object DB extends Instance(
  entities = Set() + Entity[Tests](),
  url = "jdbc:h2:~/test",
  user = "sa",
  password = "",
  poolSize = 10
)

object TestSORM {
  def main(args: Array[String]) {
    println("Start at " + DB.now())
    println(" DB.query[Tests].count() = " + DB.query[Tests].count())
    println("Finish at " + DB.now())
    DB.close()
  }
}
Results:
Start at 2013-01-28T22:36:05.281+04:00
DB.query[Tests].count() = 3390132
Finish at = 2013-01-28T22:38:39.655+04:00
It took two and a half minutes!
In the H2 console, 'select count(*) from tests' completes in a tick. What is my mistake? Can I execute a raw SQL query through SORM?
Thanks in advance!
It's a known optimization issue. Currently SORM does not utilize a COUNT operation. Since the issue has finally come up, I'll address it right away. Expect a release fixing this in the coming days.
The fix for this issue has been delayed. You can monitor its status on the issue tracker.
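In the meantime, one possible workaround is to bypass SORM for the count and issue the query directly over JDBC against the same H2 database. A rough sketch, assuming the same url/user/password as the Instance above and that the table SORM generated for Tests is named "tests":
import java.sql.DriverManager

// Workaround sketch, not SORM API: run COUNT(*) straight over JDBC.
val connection = DriverManager.getConnection("jdbc:h2:~/test", "sa", "")
try {
  val rs = connection.createStatement().executeQuery("select count(*) from tests")
  rs.next()
  println("count = " + rs.getLong(1))
} finally {
  connection.close()
}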
