Elasticsearch-Hadoop: how to do a bulk search in a Spark program

I am writing a Spark program which is basically an RDD of Strings. What I need to do is create a query per string and run that query against an Elasticsearch index, so essentially the query differs per string. I wanted to use elasticsearch-hadoop to do the search so I can benefit from its optimizations. The RDD can be large and I'm looking for any optimizations possible.
For example, the RDD is List[India, IBM Company, Netflix, Lebron James]. We will create a more-like-this query for each of these terms, run the search against the Wikipedia index, and get back the results. That is, we will create four more-like-this queries, one each for India, IBM Company, Netflix, and Lebron James, and get back the hits for them.
I do have a workaround where I can use the HTTP REST API with a bulk/multi search call to get back the hits, but there I will be doing the optimizations on my own. I wanted to see if we can use the Spark Elasticsearch connector to create the queries and do the search in an optimized way.

This use case is not possible. Elasticsearch(-Hadoop) basically assumes one query (or a few), but it does not support a batch mode of n different queries.
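As a reference for the workaround mentioned in the question, here is a minimal, hypothetical Scala sketch of batching the per-string queries through the REST multi-search (_msearch) API from Spark. The endpoint localhost:9200, the wikipedia index, and the title/text fields are assumptions, not part of elasticsearch-hadoop; the Java 11 HTTP client is used so no extra dependency is needed:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

import org.apache.spark.sql.SparkSession

object MoreLikeThisBatchSearch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mlt-batch-search").getOrCreate()
    // Assumed endpoint: the wikipedia index on a local cluster
    val msearchUrl = "http://localhost:9200/wikipedia/_msearch"

    val terms = spark.sparkContext
      .parallelize(Seq("India", "IBM Company", "Netflix", "Lebron James"))

    // One _msearch HTTP call per partition instead of one call per term
    val responses = terms.mapPartitions { iter =>
      val batch = iter.toSeq
      if (batch.isEmpty) Iterator.empty
      else {
        // ndjson body: an (empty) header line plus a more_like_this query per term
        val body = batch.map { term =>
          val like = term.replace("\"", "\\\"")
          "{}\n" +
            s"""{"query":{"more_like_this":{"fields":["title","text"],"like":"$like","min_term_freq":1,"min_doc_freq":1}}}"""
        }.mkString("", "\n", "\n")

        val request = HttpRequest.newBuilder(URI.create(msearchUrl))
          .header("Content-Type", "application/x-ndjson")
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build()
        val response = HttpClient.newHttpClient()
          .send(request, HttpResponse.BodyHandlers.ofString())

        // The "responses" array in the reply lines up one-to-one with the terms in this batch
        Iterator((batch, response.body()))
      }
    }

    responses.collect().foreach { case (batch, json) =>
      println(s"${batch.mkString(", ")} -> ${json.take(200)}")
    }
    spark.stop()
  }
}

Batching one _msearch call per partition keeps the number of HTTP round trips proportional to the number of partitions rather than the number of strings, which is usually the main optimization available for this workaround.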

Related

Filter results in memory to search in Elasticsearch from multiple indexes

I have 2 indexes and they both have one common field (basically a relationship).
Now, since Elasticsearch does not support filtering across multiple indexes, should we store the results in memory in a variable and filter them in Node.js (which basically means that my application itself is now working as a database server)?
We were previously using MongoDB, which is also a NoSQL DB, but we were able to manage this through aggregation queries; it seems Elasticsearch does not provide that.
So even if we use both databases combined, we have to store their results somewhere to further filter the data, as we are giving users advanced search functionality where they are able to filter data from multiple collections.
So should we store results in memory to filter the data further? We currently offer advanced search over 100 million records to customers, but without the advanced text search that Elasticsearch provides; now we are planning to offer Elasticsearch text search to customers.
What approach do you suggest for using MongoDB and Elasticsearch together? We are using Node.js to serve the data.
Or which of these options should we choose?
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all, the simplest way to solve this problem is to simply make one combined index. It's very trivial to do this: just insert all the documents from index-2 into index-1, prefixing all fields coming from index-2 with something like "idx2" so that you won't overwrite any similarly named fields. You can use an ingest pipeline to do this, or just do it client side. You will only ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation so that everything that was previously going into index-2 also goes into index-1 (with the prefixed fields).
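For illustration only, a minimal sketch of that combined-index idea, assuming indexes named index-1 and index-2, a local cluster, and the _reindex API with a Painless script that prefixes every copied field with idx2_. Any HTTP client in any language would do; Scala is used here only for consistency with the rest of this page:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object MergeIndexes {
  def main(args: Array[String]): Unit = {
    // Copy every document from index-2 into index-1, prefixing each field with "idx2_"
    val body =
      """{
        |  "source": { "index": "index-2" },
        |  "dest":   { "index": "index-1" },
        |  "script": {
        |    "lang": "painless",
        |    "source": "def keys = new ArrayList(ctx._source.keySet()); for (def k : keys) { ctx._source['idx2_' + k] = ctx._source.remove(k); }"
        |  }
        |}""".stripMargin

    val request = HttpRequest.newBuilder(URI.create("http://localhost:9200/_reindex"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // reports how many documents were copied
  }
}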
100 million records is trivial for something like Elasticsearch. Doing any kind of "join" client side is NOT RECOMMENDED, as this will obviate the entire value of using ES.
If you need any further help on executing this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/Postgres or even MongoDB is to restructure the indices to suit the needs of data querying. Never try to work with multiple indices; ES is not built for that.
HTH.

How to bulk fetch data from ScyllaDB?

In our use case, we must fetch data from ScyllaDB and put it into Elasticsearch.
If we fetch records one by one, it takes too much time.
I found that ScyllaDB has no binlog, right?
So, do you have a better suggestion?
You might want to look at using Change Data Capture in Scylla, then using the CDC tables to feed a Kafka topic that will populate Elasticsearch.
ScyllaDB's CDC connector for Kafka is built on Debezium. You can read more about it here.
https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/
And if you want to read everything that already exists on top of the live additions captured via CDC, you can just write a sample Scala Spark application that loads everything needing full-text search from Scylla into Elastic (sample apps are on the internet, or have a look at the series of blogs around the Scylla Migrator, which explain how to properly leverage DataFrames).
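A minimal sketch of such a Scala Spark application, assuming the spark-cassandra-connector and the elasticsearch-hadoop (elasticsearch-spark) artifact are on the classpath; the keyspace, table, index, and host names below are placeholders:

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object ScyllaToElasticsearch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scylla-to-elasticsearch")
      // Connection settings are placeholders; point them at your own clusters
      .config("spark.cassandra.connection.host", "scylla-node-1")
      .config("es.nodes", "es-node-1")
      .config("es.port", "9200")
      .getOrCreate()

    // Read the ScyllaDB table that needs full-text search (keyspace/table are assumptions)
    val rows = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // Bulk-write the DataFrame into an Elasticsearch index via elasticsearch-hadoop
    rows.saveToEs("my_index")

    spark.stop()
  }
}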
FWIW, Scylla supports the LIKE operator, in case a simple search will cut it for you (and assuming your partitions are not huge), instead of the Lucene query language and inverted indexes that Elastic uses.
LINKS:
https://docs.scylladb.com/getting-started/dml/#like-operator
https://www.scylladb.com/2018/07/31/spark-scylla/
https://www.scylladb.com/2019/03/12/deep-dive-into-the-scylla-spark-migrator/
https://github.com/scylladb/scylla-code-samples/tree/master/spark3-scylla4-demo
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Not sure how useful this will be:
https://www.youtube.com/watch?v=9pfEVQ9te5E

Is it possible to convert an Elasticsearch query into something that can apply the same filtering logic over Hadoop?

We have an architecture that uses both Elasticsearch and Hadoop, for near-realtime and batch processing concerns respectively. We ingest data and write to both systems, so synchronization is already taken care of, allowing for some lag in storing the rows into HDFS.
Calls for UI content will query Elasticsearch using the query DSL. These queries use many of the bells and whistles of the Elasticsearch suite, like custom analyzers, match phrases, and others that don't have an equivalent in Hive or MapReduce.
One of our batch processes, which we are moving to HDFS, does a full export of all the rows matching an Elasticsearch query. It needs to generate the same resulting data as the query sent to Elasticsearch, but without calling Elasticsearch (as es-hadoop does), to avoid a performance hit on our Elasticsearch cluster.
Is there any generic tool or process for converting a complex Elasticsearch query into something that can apply the same filtering logic in Hadoop? We don't need to consider aggregations or anything like that, just query filtering.
We were dealing with a similar situation where we had to apply filtering on the client side with the received data, as well as on the back end with Elasticsearch. We came up with our own way of defining a filter as an expression.
For example, if((name == Jane && age > 18) || (name == John && age < 18)) would be represented as OR(AND(EQ(name:Jane),GT(age:18)),AND(EQ(name:John),LT(age:18))).
We then use this to get an ES query, or any other query format you want, by parsing the expression.
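As an illustration of that idea, here is a hypothetical Scala sketch of such an expression tree and a renderer that turns it into an Elasticsearch bool query; the same tree could just as well be walked to emit a Hive/Spark predicate for the HDFS side. All names here are made up for the example, not the poster's actual code:

// Hypothetical expression tree mirroring the notation above
sealed trait Expr
final case class Eq(field: String, value: String) extends Expr
final case class Gt(field: String, value: Double) extends Expr
final case class Lt(field: String, value: Double) extends Expr
final case class And(clauses: Expr*) extends Expr
final case class Or(clauses: Expr*) extends Expr

object EsQueryRenderer {
  // Render the expression as an Elasticsearch bool/term/range query body
  def toEs(e: Expr): String = e match {
    case Eq(f, v)     => s"""{"term":{"$f":"$v"}}"""
    case Gt(f, v)     => s"""{"range":{"$f":{"gt":$v}}}"""
    case Lt(f, v)     => s"""{"range":{"$f":{"lt":$v}}}"""
    case And(cs @ _*) => s"""{"bool":{"must":[${cs.map(toEs).mkString(",")}]}}"""
    case Or(cs @ _*)  => s"""{"bool":{"should":[${cs.map(toEs).mkString(",")}],"minimum_should_match":1}}"""
  }

  def main(args: Array[String]): Unit = {
    val expr = Or(And(Eq("name", "Jane"), Gt("age", 18)),
                  And(Eq("name", "John"), Lt("age", 18)))
    println(toEs(expr)) // the same tree could also be compiled to a client-side filter
  }
}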

What is Elasticsearch?

I just want to know what exactly Elasticsearch is.
It is said that it helps to search data, but when I watch some webinars it feels like I have to replicate my data into a kind of Elastic datastore... which does not seem very optimized to me. That way, every modification done on one side has to be mirrored on the other, and the data returned by Elasticsearch may not be in the right format.
Can Elasticsearch directly search in my database?
It is to be used with a Neo4j graph database. Has somebody already done something like that? Does it only replace the Cypher queries?
Thanks for any advice helping me realize what Elasticsearch can really help with in our project.
Elasticsearch is a database, however it's not a relational database like you may be used to. It is a NoSQL database.
You insert JSON documents into an index. You query that index to find documents that match a particular criterion.
It is also sharded and node distributed, which gives it resilience and scalability, and also - if you set it up right - performance.
This means it's really good at 'search engine' style database queries, but because it's not relational, it cannot do the equivalent of a SQL JOIN operation very easily.
One example use case is logstash and kibana - known as the ELK stack - where system event logs (syslog, httpd logs, that kind of thing) are processed by logstash to parse metadata - like log source, referrer, URL, session ID, etc. - and then inserted into elasticsearch.
As each event is a self contained piece of information, this is what elasticsearch does particularly well.
You can then use Kibana as a visualisation engine to display your logs, but also perform analysis - most hit pages, geographic distribution of requests, incoming referrers, time based distribution of requests, etc.
But it also collates these logs, so if you run a really large, geographically distributed website with multiple webserver nodes - or maybe you just have a lot of servers in your computer room and want to summarise the system logs - you can feed the whole lot into elastic search.
Its design is such that it's good at handling near-real-time data insertion and analysis. It also works quite well for 'forum style' data models, as essentially all you're doing is querying a list of posts with a particular forum name, and finding replies to a particular parent node - but they're standalone 'documents'.
So yes, you probably could use it to search an existing database, but you'll have to think about your data model - you can't just translate a conventional relational model, you would have to flatten it. Denormalisation is something of a sin in RDBMS terms, but it's actually quite good for search engines, because you can execute queries in parallel more efficiently.
There exist some ways to combine both approaches. Have a look at this blog post:
http://graphaware.com/neo4j/2015/09/30/recommendations-with-neo4j-and-graph-aided-search.html
Databases cannot be optimized for all use cases, but luckily there are many databases available so we can choose the best one for each task.
Elasticsearch is optimized for:
Filtering of documents (exact match)
Search ranking of documents (relevance of search terms)
Aggregation of results (sums, distinct counts, percentiles, ...)
Neo4j is optimized for:
Graph traversal (naturally)
High performance when operated on a "local" graph neighborhood (context)
Actually both databases use the same underlying library Lucene to "index" data to be searched later.
ES is an open-source, distributed, RESTful, JSON-based search engine. It is easy to use, scalable, and flexible. Its indexing enables fast retrieval of search results.

Need clarification about usage of Mahout with Hadoop

I currently have an implementation of a recommender in mahout using the in memory recommendation apis. However, I would like to move to a distributed solution using hadoop in order to calculate offline recommendations. This is my first time using hadoop and I'm looking for clarification on a few concepts and api usages.
Currently, my understanding of hadoop is minimal and I think that the correct approach is the following:
Use something like Apache Drill to populate HDFS with the user and item data.
Use the recommendation job in Mahout to train on the data from HDFS.
Transform the resultant data in HDFS into index shards to be used by Solr.
Use Solr to provide the recommendations to the user base.
However, I am looking for clarifications on a couple aspects of this design:
How would I utilize a rescorer in the manner that it is used in the in-memory live recommendations?
What is the best manner in which to invoke the recommendations job?
I have other questions besides these two but the answers to these would be a huge help.
You may be talking about the Mahout + Hadoop + Solr recommender. This method handles rescoring in a couple different ways.
The basic recommender can be put together in two ways:
After getting data into HDFS in the form of (user id, item id, preference weight), run the ItemSimilarityJob on the data (use LLR similarity, which is usually best). It will create what is called an indicator matrix: an item-id by item-id sparse matrix of values indicating the similarity magnitude between any two items. You must then convert this into values that Solr can index. That means translating the internal Mahout integer IDs into some unique string representation, which is probably what they were at the very beginning. This will look like (item123,item223 item643 item293 item445...) as a CSV, so two Solr fields: the first is an item id, the second is a list of similar items. All ids must be text tokens. Then the query for recommendations is a Solr query made up of item ids that a particular user has shown a preference for, so query = "item223 item344 item445...". Make the query against the field that holds the indicator matrix values. You will get back an ordered list of item IDs.
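For illustration, a hedged Spark (Scala) sketch of the ID-translation and CSV step described above, assuming the ItemSimilarityJob output is text in the form itemA<TAB>itemB<TAB>similarity and that a hypothetical dictionary file maps Mahout's internal integer IDs back to the original string IDs; all paths and file layouts are assumptions:

import org.apache.spark.sql.SparkSession

object IndicatorsToSolrCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("indicators-to-solr-csv").getOrCreate()
    val sc = spark.sparkContext

    // ItemSimilarityJob output, assumed to be text lines: itemA<TAB>itemB<TAB>similarity
    val pairs = sc.textFile("hdfs:///recs/indicators/part-*")
      .map(_.split("\t"))
      .map { case Array(itemA, itemB, _) => (itemA, itemB) }

    // Hypothetical dictionary mapping Mahout's internal integer IDs to original string IDs
    val dict = sc.textFile("hdfs:///recs/item-dictionary.csv")
      .map(_.split(","))
      .map { case Array(internalId, externalId) => (internalId, externalId) }

    // Translate both sides of every pair, then gather the similar-item list per item
    val csvLines = pairs
      .join(dict)                                        // (itemA, (itemB, externalA))
      .map { case (_, (itemB, externalA)) => (itemB, externalA) }
      .join(dict)                                        // (itemB, (externalA, externalB))
      .map { case (_, (externalA, externalB)) => (externalA, externalB) }
      .groupByKey()
      .map { case (item, similars) => s"$item,${similars.mkString(" ")}" }

    // Two "columns" per line: the item id and its space-separated indicator list, ready for Solr
    csvLines.saveAsTextFile("hdfs:///recs/solr-csv")
    spark.stop()
  }
}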
A much easier way that may work for you is to use a tool in the /examples folder of Mahout 1.0-SNAPSHOT or here: https://github.com/pferrel/solr-recommender. It takes in raw log files with unique strings for user and item ids. It does all the work on Hadoop to output CSVs that can be indexed by Solr directly or loaded into a DB as described above.
The way I did the demo site (https://guide.finderbots.com) is to use my Solr web app integration, putting the indicator matrix into a DB and attaching the similar-item list to my collection of items. So item123 got "item223 item643 item293 item445..." in its indicator field. After you index the collection, the query is then = "item223 item344 item445..." -- the user's preferred items.
Here are three ways to do rescoring:
Mix in metadata with the query. So you could do query = "item223 item344 item445..." against the indicator field AND "SciFi" against the "genre" field. This gives you blended collaborative filtering and metadata in your query and as you can imagine, the recs are based on the user's prefs but skewed towards "SciFi". There are lots of other interesting things you can do once you get item+indicators+metadata into an index.
Filter recs by metadata. You can get recs that are not skewed but filtered, if you want, using the Solr query = "item223 item344 item445..." against the indicator field AND "SciFi" as a filter against the "genre" field. In this case you get nothing but "SciFi", whereas with #1 you would get mostly "SciFi".
Get your ordered list of recs back and rescore them in any way you'd like based on other things you know about the user, context, or items. Often these can be encoded into a Solr query and done with one query, but reordering and filtering can be done after the recs are returned too. You would have to write that code; it is not built in.
The fun thing is you can mix filters, metadata fields, and user preferences with what Solr calls "boost" values to get all sorts of rescoring. Solr can even use location to query, skew, or filter.
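To make the three rescoring styles concrete, here is a hypothetical sketch of how the corresponding Solr query strings could be assembled; the indicators and genre field names, the items collection, and the localhost:8983 endpoint are all assumptions for the example:

import java.net.URLEncoder

object SolrRecQueries {
  private def enc(s: String) = URLEncoder.encode(s, "UTF-8")

  def main(args: Array[String]): Unit = {
    val solr = "http://localhost:8983/solr/items/select" // collection name is an assumption
    val userPrefs = "item223 item344 item445"            // the user's preferred item ids

    // 1. Blend: recs from the indicator field, skewed towards SciFi via a boosted optional
    //    clause (assumes the default OR operator, so the genre clause boosts rather than filters)
    val blended = s"$solr?q=" + enc(s"""indicators:($userPrefs) genre:"SciFi"^2.0""")

    // 2. Filter: recs restricted to SciFi with a filter query, no skewing of the ranking
    val filtered = s"$solr?q=" + enc(s"indicators:($userPrefs)") + "&fq=" + enc("genre:\"SciFi\"")

    // 3. Plain recs, to be rescored or reordered by your own code after they come back
    val plain = s"$solr?q=" + enc(s"indicators:($userPrefs)")

    Seq(blended, filtered, plain).foreach(println)
  }
}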
Note: You don't necessarily have to worry about Solr shards. Solr will index most DBs and HDFS directly, but only the index is sharded. You shard an index if you have a very big one; you replicate it if you have lots of queries per second (or for failover). Solr queries are generally very fast, so I'd worry about that after you have a functioning system, since it's a config thing and shouldn't be affected by the rest of your workflow.
