Cassandra and advanced queries: Spark, ElasticSearch, Solr - elasticsearch

OK, so I'm developing an app and I'm using Cassandra as the database.
Everything has gone well so far, but now I need to run a query using the LIKE clause.
I know Cassandra doesn't support that, so after looking for a workaround I was thinking of maintaining the single table that I need to query with LIKE in another database. I was even considering a relational database, even though there wouldn't be any relations.
Then I started checking whether this is really the right approach, and came across things like Spark, Solr and ElasticSearch.
Just to make it clear: I have little to no knowledge about those frameworks. Really. I have only heard about them and that's all.
So, I'm not here to ask you guys 'hey, how do I do that using this framework?'. I just want to know, before I dig into any of them: would any of them satisfy my needs? I have no idea exactly how they work or what exactly they are for.
If so, then I'll study the framework properly. I just don't want to spend the time only to find out it has nothing to do with my problem.
Thanks!

Both Elasticsearch and Solr fit your needs. They use the Lucene library to build inverted indexes and much more. DataStax Enterprise (a commercial distribution of Cassandra) offers this solution by integrating Solr natively. One more solution (a little different, but it works) is to integrate Infinispan, which offers both integration with a Cassandra repository and inverted indexing ...
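To give a rough idea of what Lucene-style indexing buys you, here is a minimal sketch in plain Python. It is not Solr or Elasticsearch code: the `rows` data, the tokenizer and the function names are all made up for illustration, and real engines add analyzers, scoring and on-disk structures on top of this idea.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(rows):
    """Map each term to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in tokenize(text):
            index[term].add(row_id)
    return index

def search_prefix(index, prefix):
    """Rough analogue of LIKE 'prefix%', applied to each indexed term."""
    prefix = prefix.lower()
    hits = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits |= ids
    return sorted(hits)

# Hypothetical rows: id -> text column that you would otherwise query with LIKE.
rows = {
    1: "Cassandra data modeling",
    2: "Searching with Solr",
    3: "Elasticsearch and Cassandra",
}
index = build_inverted_index(rows)
print(search_prefix(index, "cass"))  # [1, 3]
```

The lookup goes from term to rows instead of scanning every row, which is why this kind of query is cheap in a search engine and unavailable in plain Cassandra.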
HTH,
Carlo

Related

Project with Laravel and NOSQL

I want to start a project on Laravel and want to go with NoSQL. I need extensive search in this project and was considering MongoDB, but I am not sure about its search capabilities.
A few related questions:
Is there enough support for NoSQL, in case I get stuck somewhere?
Is NoSQL flexible enough for searching on parameters?
If I need to import data from a previous project into NoSQL, will it be a challenge?
What about real time: does NoSQL support real-time use?
Thanks in advance.

Manage reports, when our database is Cassandra ...Spark or Solr...or BOTH?

My db is Cassandra (DataStax Enterprise => Linux). Since it doesn't support group-by, aggregates, etc., by its fundamentals Cassandra is simply not a good choice for reporting. I googled this shortcoming and found some results such as this, and this, and also this one.
But I really became confused! Hive uses additional tables of its own. Solr is better for full-text searching and the like. And Spark... it's useful for analysis, but I didn't understand whether it ultimately uses Hadoop or not.
I will have many reports, which need indexing and grouping, at least. But I don't want to use additional tables, which would impose overhead. Also, I'm a .NET (not Java) developer, and my application is based on the .NET Framework, too.
I am not exactly sure what your question is here and your confusion is understandable as with Cassandra and DSE there is a lot going on.
You are correct in stating that Cassandra does not support any aggregations or group by functionality that you would want to use for reporting.
Solr (DSE Search) is used for ad-hoc and full text searching of the data stored in Cassandra. This only works on a single table at a time.
Spark (DSE Analytics) provides analytics capabilities such as Map-Reduce as well as the ability to filter and join tables. This is not done in real-time though as the processing and shuffling of data can be expensive depending on the data load.
Spark does not require Hadoop. It performs many of the same jobs but is more efficient in many scenarios, as it allows for in-memory distributed processing of the data.
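For a concrete picture of the group-by style aggregation that a reporting job computes, here is a minimal map/reduce-style sketch in plain Python. It is not Spark or DSE code; the `orders` rows and the column names are hypothetical, and a real job would distribute the same two phases across the cluster.

```python
from collections import defaultdict

# Hypothetical rows, as they might be read from a Cassandra table.
orders = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 30.0},
]

def group_sum(rows, key, value):
    """Sum `value` per distinct `key`, like SELECT key, SUM(value) ... GROUP BY key."""
    totals = defaultdict(float)
    for row in rows:
        # "Map" step: pick out the (key, value) pair for this row.
        # "Reduce" step: fold the value into the running total for that key.
        totals[row[key]] += row[value]
    return dict(totals)

report = group_sum(orders, "region", "amount")
print(report)  # {'north': 150.0, 'south': 80.0}
```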
Since you are using DataStax Enterprise the advantage is that you have built in connectors to both Solr (DSE Search) to provide ad-hoc queries and Spark (DSE Analytics) to provide analytics on your data.
Since I don't know your exact reporting requirements it is difficult to give you a specific recommendation. If you can provide some additional details about what sort of reporting (scheduled versus ad-hoc etc.) you will be running I may be able to help you more.

Hbase vs Cassandra: Which is better for a timeseries data storage?

I use my API logs to extract information like:
In a given period of time, how many users does my API have?
Or, in a given period of time, which types of services are called the most?
Almost all the information I extract depends on the timestamp. Currently I use MongoDB, and I added the timestamp as an index (for 80GB of data, the index size is 12GB).
A migration to Cassandra or HBase was recommended to me, and I want to know which is better for my use case:
Analysis for timeseries data.
Both good write and read performance are required.
Possibility of using hadoop to do my data analysis.
Thanks for sharing your point of view or your experience.
Advantages of Cassandra:
Cassandra generally shows better performance (though both are excellent).
Cassandra is substantially easier to set up and manage from an operational standpoint (though there are tools that will help either way).
Advantages of HBase:
Native to the Hadoop ecosystem
HBase requires you to install Hadoop anyway, so you get a nice two-for-one. To use Cassandra you will probably need to either use DataStax Enterprise, a commercial, non-open-source product, or investigate using Spark for your analytics work, which has an open-source connector for Cassandra.
Chocolate or Vanilla ice cream - which is better?
I would suggest that you would be the best decision maker. Set up development environments for each option, and this will tell you much more about operational and tuning issues than, I think, anyone else might be able to give you. :)

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that makes it efficient to scale and replicate blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic w.r.t. its view on data) as a central data repository, and then push data from Hadoop to SQL, elasticsearch, etc...
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages that Elasticsearch brings, there are still some major disadvantages:
Security - Elasticsearch did not originally provide any authentication or access control functionality. (Update: this has been supported since Shield was introduced.)
Transactions - There is no support for transactions or processing on data manipulation. (Update: data manipulation can now be handled with Logstash.)
Durability - ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores.
Maturity of tools - ES is still relatively new and has not had time to develop mature client libraries and 3rd party tools, which can make development much harder. (Update: it can be considered quite mature now, with a variety of connectors and tools around it, like Kibana.)
Large computations - ES is still not suited for these: commands for searching data are not suited to "large" scans of data and advanced computation on the db side.
Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page where a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).
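The "near real-time" point is easy to misread, so here is a toy model of it in plain Python. It is only a sketch of the behaviour described above, not how Elasticsearch is implemented; the class and method names are made up for illustration.

```python
class NearRealTimeIndex:
    """Toy model: writes sit in a buffer until refresh() makes them searchable."""

    def __init__(self):
        self._buffer = []      # accepted writes, not yet visible to searches
        self._searchable = []  # documents that searches can see

    def index(self, doc):
        """Accept a write; it is acknowledged but not yet searchable."""
        self._buffer.append(doc)

    def refresh(self):
        """Make all buffered writes visible (in a real system this runs periodically)."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        """Return the visible documents containing `word`."""
        return [d for d in self._searchable if word in d]

idx = NearRealTimeIndex()
idx.index("great post")
before = idx.search("great")  # the new comment is not visible yet
idx.refresh()
after = idx.search("great")   # visible once the refresh has happened
print(before, after)  # [] ['great post']
```

An application that must show a user their own write immediately has to account for that gap, for example by reading the new item from wherever it was just written rather than from a search.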
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still stand to some extent for the 2.x series, but Elastic is working on them: the 2.x series comes with more mature tools, APIs and plugins, for example Shield for security, or data transport tools like Logstash or Beats.
I'd highly discourage most users from using elasticsearch as their primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes, which the ES pros always set, won't save you. See this excellent analysis by Aphyr in his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right, it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this and if elasticsearch does everything you need, you can look into Kafka for persisting all the events going into a cluster which would allow replaying if things go wrong. I like this approach as it provides an async ingestion pipeline into elasticsearch that also does the persistence.
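The replay idea can be sketched in a few lines of plain Python. This is a toy stand-in, not Kafka or Elasticsearch client code: `EventLog`, the event tuples and `rebuild` are hypothetical names, and the "index" is just a dict.

```python
class EventLog:
    """Append-only log standing in for a durable topic (hypothetical)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        """Yield every event from the beginning, in the order it was written."""
        return iter(list(self._events))

def apply_event(index, event):
    """Apply one (operation, doc_id, body) event to the search index."""
    op, doc_id, body = event
    if op == "upsert":
        index[doc_id] = body
    elif op == "delete":
        index.pop(doc_id, None)

def rebuild(log):
    """Recreate the index from scratch by replaying the whole log."""
    index = {}
    for event in log.replay():
        apply_event(index, event)
    return index

log = EventLog()
log.append(("upsert", "1", {"title": "first"}))
log.append(("upsert", "2", {"title": "second"}))
log.append(("delete", "1", None))

# If the search cluster is lost, the same events produce the same index again.
search_index = rebuild(log)
print(search_index)  # {'2': {'title': 'second'}}
```

The design choice here is that the log, not the search index, is the record of truth, so the index can always be thrown away and rebuilt.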

Are there any tools or best practices for documenting an ElasticSearch db

I'm looking for documentation about documenting an ElasticSearch deployment (you can see why this is challenging to Google!).
My question is really two questions. Are there best practices for documenting an ElasticSearch installation? -and- Are there any tools that aid in the visualization of an ElasticSearch installation. I guess it would be akin to a sql db schema (ER diagram or whatever). I've scoured Google without much luck and I didn't find anything on SO.
Documenting ES may be a bit more challenging than documenting a sql db since you probably want to be able to show the relationship between end user queries and indexing to explain the context of each mapping. Furthermore, it would be useful to visualize the cluster (perhaps a separate problem that could be solved with existing tools).
Thanks for any help.
