How can I replace the LIKE operator in Cassandra? - cassandra-2.0

I am a beginner with Cassandra. In MySQL you can use the LIKE operator.
For example, say I want to filter the names of all those people who have "Michael" in their full name.
How can I do this in Cassandra?
The output should be like
"Michael Jackson"
"Kamraan Micheal"
"Michael kern"

You can't. Cassandra is a key-based data store -- this kind of problem is solved by putting a full-text search engine beside Cassandra (like Solr or Elasticsearch) -- DataStax Enterprise offers exactly that combination, using Solr. HTH, Carlo
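To make that concrete, here is a minimal sketch of what the DSE Search route looks like (assuming DSE Search is enabled and a search index exists on the table; the keyspace, table, and column names here are hypothetical):

    -- requires DSE Search; full_name must be covered by the search index
    SELECT full_name FROM demo.people WHERE solr_query = 'full_name:Michael';

This returns rows whose full_name contains the term "Michael"; a fuzzy query like 'full_name:Michael~1' would also catch the misspelled "Micheal" in the example output above.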

Related

How to bulk fetch data from ScyllaDB?

In our use case, we must fetch data from ScyllaDB and put it into Elasticsearch.
If we fetch records one by one, it takes too much time.
I found that ScyllaDB has no binlog, right?
So, do you have a better suggestion?
You might want to look at using Change Data Capture in Scylla, then using the CDC tables to feed a Kafka topic that will populate Elasticsearch.
ScyllaDB's CDC connector for Kafka is built on Debezium. You can read more about it here.
https://debezium.io/blog/2021/09/22/deep-dive-into-a-debezium-community-connector-scylla-cdc-source-connector/
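For context, CDC is enabled per table in Scylla, and the resulting CDC log table is what the connector consumes. A minimal sketch (keyspace and table names are hypothetical):

    ALTER TABLE demo.users WITH cdc = {'enabled': true};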
And if you want to read everything in addition to the live changes coming through CDC, you can write a simple Scala Spark application that loads everything needing full-text search from Scylla to Elasticsearch (sample apps are on the internet, or have a look at the series of blogs around the Scylla Migrator, which explain how to properly leverage DataFrames); a sketch of such a job follows below.
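A minimal sketch of such a bulk-load job, assuming the spark-cassandra-connector and elasticsearch-spark artifacts are on the classpath (the hosts, keyspace, table, and index names are all hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrame

    object ScyllaToElastic {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("scylla-to-elastic")
          .config("spark.cassandra.connection.host", "scylla-node1") // hypothetical host
          .config("es.nodes", "elastic-node1")                       // hypothetical host
          .getOrCreate()

        // Full-table read through the spark-cassandra-connector
        // (Scylla speaks the same protocol as Cassandra).
        val users = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "demo", "table" -> "users")) // hypothetical names
          .load()

        // Bulk-index every row into Elasticsearch.
        users.saveToEs("users/user")

        spark.stop()
      }
    }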
FWIW, Scylla supports the LIKE operator, in case a simple search will cut it for you (and assuming your partitions are not huge), instead of the Lucene query language and inverted indexes that Elasticsearch uses; see the example just below.
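For reference, a minimal sketch of the LIKE route (the table and column names are hypothetical; LIKE is case-sensitive, and ALLOW FILTERING makes this a scan, hence the partition-size caveat):

    SELECT full_name FROM demo.users
    WHERE full_name LIKE '%Michael%' ALLOW FILTERING;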
LINKS:
https://docs.scylladb.com/getting-started/dml/#like-operator
https://www.scylladb.com/2018/07/31/spark-scylla/
https://www.scylladb.com/2019/03/12/deep-dive-into-the-scylla-spark-migrator/
https://github.com/scylladb/scylla-code-samples/tree/master/spark3-scylla4-demo
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Not sure how useful this will be:
https://www.youtube.com/watch?v=9pfEVQ9te5E

Cassandra table structure for query to get all usernames

In my Cassandra database I have a table with users, and I want a function to search for users by their unique usernames. For that I need to query all usernames from the user table so that I can filter them server-side, because for the input "nark" I should also find the usernames "Mark", "Narkis", and so on; so I can't just use the username as a partition key and search for the exact value.
If I give them all in the same partition, it results in a hot partition. If I distribute them over multiple partitions, I have to search in all of them.
How can I query that efficiently for millions of users? Is there a way to search like that without querying all usernames?
Thank you for your help!
Cassandra natively is not a good fit for such a use case. Even extensive use of secondary indexes will be of minimal help here.
Nevertheless, if you already have all your data on C*, then to achieve such functionality you essentially need an indexing framework on top of it; the most widely used is Apache Solr (built on Lucene). I have seen Solr work like magic for fuzzy searching on C*, though nothing beats having something like Elasticsearch for this use case.
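To illustrate the fuzzy part (a sketch assuming a Solr-backed search index on the users table, e.g. via DSE Search; all names here are hypothetical), Lucene's query syntax lets one query cover both close misspellings and prefixes:

    -- nark~1 matches within edit distance 1 ("Mark"); nark* matches prefixes ("Narkis")
    SELECT username FROM demo.users WHERE solr_query = 'username:(nark~1 OR nark*)';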

How to search for multiple strings in very large database

I want to search for multiple strings in a very large database. These strings are part of different attributes of a database table. I have tried string search using LIKE in a SQL query, but it takes a lot of time to get results. I am using an Oracle database.
Should I use database indexing? I found that Lucene can be used for this.
I also got some suggestions to use big data concepts. Which approach should I use?
The easiest way is:
1.) adding an index to the columns you would like to search through
2.) using Oracle Text, as #lalitKumarB wrote (see the sketch after this list)
The most powerful way is:
3.) using a separate search engine (Solr, Elasticsearch).
But you will probably have to change your application in order to explicitly use the search index when searching through the data.
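For option 2, a minimal sketch of Oracle Text (assuming the CTXSYS packages are installed; the table and column names are hypothetical):

    -- a CONTEXT full-text index; unlike a B-tree index it tokenizes the column
    CREATE INDEX articles_body_ctx ON articles (body) INDEXTYPE IS CTXSYS.CONTEXT;

    -- CONTAINS uses the full-text index instead of a LIKE table scan
    SELECT id, title FROM articles WHERE CONTAINS(body, 'lucene OR solr', 1) > 0;

Note that a CONTEXT index is not maintained synchronously on DML by default; it has to be synced (e.g. with CTX_DDL.SYNC_INDEX) or created with a sync option.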
I had the same situation some years ago: trying to search text in a big database. After a while I found out that database-based search will never reach the performance of a dedicated search engine. And you will have many more search features working out of the box if you use Solr (for example), like spelling correction, "More Like This", and so on.
One option is to keep the data in Oracle, search in Solr, and return the ID of the document, in order to load only the one row from Oracle that is referenced by that ID.
The second option is to keep Oracle as the base data pool for your search engine and search in Solr (or Elasticsearch) in order to return the whole document/row from Solr, not only the ID. That way you don't need to load the data from the database anymore.
The best option depends on your needs.
You have the choice between Elasticsearch, Solr, or plain Lucene.

elasticsearch-hadoop for Spark: send documents from an RDD to a different index (by day)

I work on a complex workflow using Spark (parsing, cleaning, machine learning, ...).
At the end of the workflow I want to send aggregated results to Elasticsearch so my portal can query the data.
There will be two types of processing: streaming, and the possibility of relaunching the workflow on all available data.
Right now I use elasticsearch-hadoop, and particularly the Spark part, to send documents to Elasticsearch with the saveJsonToEs("myindex/mytype") method.
The target is to have one index per day, using the proper template that we built.
AFAIK you cannot take a field of a document into account to route it to the proper index in elasticsearch-hadoop.
What is the proper way to implement this feature?
Have a special step using Spark and the bulk API, so that each executor sends documents to the proper index depending on the fields of each line?
Is there something that I missed in elasticsearch-hadoop?
I tried to send JSON to _bulk using saveJsonToEs("_bulk"), but the pattern has to follow the index/type format.
Thanks to Costin Leau, I have the solution.
Simply use dynamic indexing with something like saveJsonToEs("my-index-{date}/my-type"). "date" must be a field in the document being sent.
Discussion on elasticsearch google group: https://groups.google.com/forum/#!topic/elasticsearch/5-LwjQxVlhk
Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-write-dyn
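For illustration, a minimal sketch of such a Spark job (the host name and sample documents are made up; assumes the elasticsearch-spark artifact is on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._ // adds saveJsonToEs to RDDs

    object DailyIndexWriter {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("daily-index-writer")
          .set("es.nodes", "elastic-node1") // hypothetical host
        val sc = new SparkContext(conf)

        // Each JSON document carries a "date" field that picks its target index.
        val docs = sc.parallelize(Seq(
          """{"date":"2014-10-06","message":"first"}""",
          """{"date":"2014-10-07","message":"second"}"""
        ))

        // es-hadoop substitutes {date} per document, writing to
        // my-index-2014-10-06/my-type, my-index-2014-10-07/my-type, ...
        docs.saveJsonToEs("my-index-{date}/my-type")
        sc.stop()
      }
    }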
You can use ("custom-index-{date}/customtype") to create a dynamic index; "date" can be any field in the given RDD.
If you want to format the date: ("custom-index-{date|yyyy.MM.dd}/customtype") -- es-hadoop uses a pipe to separate the field name from its date format.
[This answers the question asked by Amit_Hora in the comments; as I don't have enough privileges to comment, I am adding it here.]

Internal data storage mechanism of elasticsearch

I have been working with Elasticsearch for the past 2 months. I have used both the REST approach and the API support in different languages to index, get, and search data. I have also read a lot about Elasticsearch and found out it is not considered a good option to use as a data store. Why is this? I'm also curious about how Elasticsearch internally stores the indexed data. Any good link or explanation?
Elasticsearch is built on top of Apache Lucene; here's a reference doc on the Lucene index file structure:
http://lucene.apache.org/core/4_7_2/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
Regarding whether or not it's a good option as a data store, I think that's more a matter of individual opinion and specific use cases than a fact that can be proven. It does not have the transaction support that something like MySQL does, if that's what you are looking for; in that respect it's roughly on a par with other NoSQL solutions. This is a pretty decent write-up on the trade-offs and issues: https://www.found.no/foundation/elasticsearch-as-nosql/
In the end it depends on what you are doing with your data and what level of robustness you require.
