I have a transactional application where the reps want to enter the tickets and I got to store them immediately. The reason I picked ES is because the techs may enter some unstructured data and they want to search on it later.
Is it ok to store the data directly in ES instead of RDBMS?
I think probably 5-10 users will be using this application concurrently.
I have already built using DJango/ES but just want to make sure I don't have any issues later.
It is certainly 'ok' to store data in Elasticsearch instead of a traditional relational model, but that doesn't mean it's the right choice. Your use case sounds fairly simple, and more 'document' based that tabular. For this a NoSQL document store can be a good fit. Elasticsearch also offers shards as well that can replicate your data for both higher availability and resilience - for instance, if one of your concerns is backing up your data.
On the other hand, simply having some longer text fields is not a strong argument for choosing ES over a database system (RDBMS or otherwise) that you more familiar with or that has more built-in support for administrative functions.
If you have truly unstructured data - ie different tickets can have different fields - or you have a high volume of tickets, such that the full-text indexing and searching in ES provides a real performance gain, then it could be worth the learning curve.
The basic concepts page for ES is a good place to start. See the sections on Shards & Replicas.
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
This might also be useful: https://www.elastic.co/blog/found-uses-of-elasticsearch
Related
I develop for a relatively large online store with a PHP backend, and it uses elasticsearch for some things (like text search, logging... etc).
Now, I'd like to start storing all kinds of information about user activity in ES. For instance, every page view (for instance: user enter product page/category page ,etc).
Is ES optimized for such a heavy load of continuous inserts, or should I consider some alternatives, like for instance having some sort of a buffer layer where I store all of my immediate inserts in memory, and then every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid alternative for such a use case.
You might decide, however, to store that streaming data into another cluster which is different from your production cluster, so as to not impact the health of the production cluster too much.
There are a lot variables to arrive at the correct decision, and we don't have enough information here, but it's definitely a valid way.
I am developing a web app in Meteor, with Mongo, that will be running on cloud. Each user must belong to a Company.
Each Company can only access it's own data.
Each user can access it's own data and some data shared with other users of the same company.
Imagine 1.000 companies and 100 users per company, it could get very bad in performance and secutiry, if I use 1 Mongodb database for whole app.
So, because Mongo is "Schema-less and Database-less" I think I can define 1.000 dbs, lets say db_0001, db_0002, ... with same name collections, lets say tasks, messages, ..., so the app can be efficient and more secure (same code for every Company and isolation of data).
Also, on hosting side (let's say for example with Digital Ocean), I think its easier to distribute the dbs if the are already atomized.
Is this a good approach? Or should I not worry about it and let the hosting do this job?
Any thoughts are wellcome.
You are currently only looking at one side of the coin. That's fine to start with.
Think about how you are going to be displaying that data and what query does it translate to. Do a thorough due diligence on all the potential query. For example, how often would user/getbyid be called and how often would you have to show a user their info and their relationship with other users. What other meta data would be required beside user info, would you have to perform a join to get that data? or is it stored as an embedded document? What fields are you going to be searching and sorting by most? Which types of data are write heavy and what are read heavy?
Now lets get back to your database shading approach. It's great that you are thinking ahead of time on this front rather than having to rewrite your component later. Data volume/storage does not worry me here. How many concurrent users would be using at application and what are primary use cases should be the first place to look at to think about scale.
Additionally, you need to understand the nature of the business and project growth. Is it like Instragram type of hyper growth? or is it more predictable. A big Mongo cluster can handle thousands of concurrent read/write requests (assuming your design and query are optimized) so that does not bother me. If you want to keep it flexible MongoDB has a sharding mechanism and you can shard on a key and it takes care all the fancy stuff for ya.
MongoDB has eventual consistency (look up MongoDB CAP theorem) if you enable read from secondaries and you have a high volume business critical app you need to be careful because you can be reading out of date result.
As far as hosting is concerned, DO is fine but always have a backup in another region to maintain geographic redundancy so in case if a region goes down (Hello AWS!) you have something to fall back on.
Good luck on your project!
The idea is to redesign data structure and/or change DB.
I just started to review this project and plan to start optimization from this one.
Currently i have CouchDb with about 80GB of document data, around 30M records.
From that subset for the most of documents properties like id, group_id, location, type can be considered as generic, but unfortunately for now such are even stored with different property naming around the set. Also a lot of deeply nested can be found.
Structure isn't hardly defined, that's why NoSQL db was selected way before some picture was seen.
Data is calculated and populated in DB in a separate Job on powerful cluster. This isn't done too often. From that perspective i can conclude that general write/update performance isn't very important. Also size decrease would be great, but isn't most important. There are only like 1-10 active customers at a time.
Actually read performance with various filtering/grouping etc is most important.
But no heavy summary calculations should be done, this one is already done while population.
This one is a data analytical tool for displaying compare and other reports to quality engineers and data analyst, so they can browse the results, group them or filter from the Web UI.
Now such tasks like searching a subset of document properties for a text isn't possible due to performance.
For sure i've done some initial investigations(like http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf) and it looks Cassandra seems to be good choice among NoSql.
Also it's quite interesting trying to port this data into the new PostgreSQl.
Any ideas would be highly appreciated :-)
Hello please check the following articles:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL json(and jsonb!) capabilities allow to start schema-less, have transactions, indexes, grouping, aggregate functions with very good performance, just from the start. And when ready(and if needed), you can go for the schema, with internal data migration.
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck
I'm currently create a program that imports all groups and feeds from Facebook which the user wants.
I used to use the Graph API with OAuth and this works very well.
But I came to the point that I realized that one request can't handle the import of 1000 groups plus the feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
Restful
Questions
What's you experience about that?
Would hadoop the right solution?
You can use neo4j.
Neo4j is a graph database, reliable and fast for managing and querying highly connected data
http://www.neo4j.org/
1) Decide structure of nodes, relationships, and there properties and accordingly
You need to create API that will get data from facebook and store it in Neo4j.
I have used neo4j in 3 big projects, and it is best for graph data.
2) Create a cron jon that will get data from facebook and store into the neo4j.
I think implementing mysql for graph database is not a good idea. for large data neo4j is the good option.
Interestingly you designed the appropriate solution yourself already. So in fact you need following components:
a relational database, since you want to request data in a structured, quick way
-> from experiences I would pressure the fact to have a fully normalized data model (in your case with tables users, groups, users2groups), also have 4-Byte surrogate keys over larger keys from facebook (for back referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (eg. crc32(lower(STRING))) - an example select would than be this: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING))
-> never,ever establish unique columns based on strings with length > 8 Byte; unique bulk inserts can be done based on hashes+string checking via insert...select
-> once you got that settled you could also look into sparse matrices (see wikipedia) and bitmaps to get your users2groups optimized (however I have learned that this is an extra that should not hinder you to come up with a first version soon)
a cron job that is run periodically
-> ideally along the caps, facebook is giving you (so if they rule you to not request more often than once per second, stick to that - not more, but also try to come as close as possible to the cap) -> invest some time in getting the management of this settled, if different types of requests need to be fired (request for user records <> requests for group records, but maybe hit by the same cap)
-> most of the optimization can only be done with development - so if I were you I would stick to any high level programming language that does not bother to much with var type juggling and that also comes along with a broad support for associative arrays such as PHP and I would programm that thing myself
-> I made good experiences with setting up the cron job as web page with deactivated output buffering (for php look at ob_end_flush(void)) - easy to test and the cron job can be triggered via curl; if you channel status outputs via an own function (eg with time stamps) this could then also become flexible to either run viw browser or via command line -> which means efficient testing + efficient production running
your user ui, which only requests your database and never, ever, never the external system api
lots of memory, to keep your performance high (optimal: all your data+index data fits into database memory/cache dedicated to the database)
-> if you use mysql as database you should look into innodb_flush_log_at_trx_commit=0, and innodb_buffer_pool_size (just google, if interested)
Hadoop is a file system layer - it could help you with availability. However I would put this into the category of "sparse matrix", which is nothing that stops you from coming up with a solution. From my experience availability is not a primary constraint in data exposure projects.
-------------------------- UPDATE -------------------
I like neo4j from the other answer. So I wondered what I can learn for my future projects. My experiences with mysql is that RAM is usually the biggest constraint. So increasing your RAM to be able to load the full database can gain you performance improvements by a factor of 2-1000 - depending on from where you are coming from. Everything else such as index improvements and structure somehow follows. So if I would need to make up a performance prioritization list, it would be something like this:
MYSQL + enough RAM dedicated to the database to load all data
NEO4J + enough RAM dedicated to the database to load all data
I would still prefer MYSQL. It stores records efficiently, but needs to run joins for deriving relations (which neo4j does not require to that extend). Join-costs are usually low with the right indexes and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html neo4j does need to add extra management data to the property separation. For big data projects those management data sums up and in full load to memory set ups requires you buy more memory. Performance wise these both options are ultimate. Further, much further down the line you would find this:
NEO4J + not enough RAM dedicated to the database to load all data
MYSQL + not enough RAM dedicated to the database to load all data
In worst case MYSQL will even put indexes to disk (at least partly), which can result in massive read delay. In comparison with NEO4J you could perform a ' direct jump from node to node' exercise, which should - at least in theory - be faster.
I need to store large amount of small data objects (millions of rows per month). Once they're saved they wont change. I need to :
store them securely
use them to analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community - just a column-oriented, read-only storing mechanism for MySQL
On the other hand, people says that NoSQL approach could be better. Hadoop+Hive looks promissing, but the documentation looks poor and the version number is less than 1.0 .
I heard about Hypertable, Pentaho, MongoDB ....
Do you have any recommendations ?
(Yes, I found some topics here, but it was year or two ago)
Edit:
Other solutions : MonetDB, InfiniDB, LucidDB - what do you think?
Am having the same problem here and made researches; two types of storages for BI :
column oriented. Free and known : monetDB, LucidDb, Infobright. InfiniDB
Distributed : hTable, Cassandra (also column oriented theoretically)
Document oriented / MongoDb, CouchDB
The answer depends on what you really need :
If your millions of row are loaded at once (nighly batch or so), InfiniDB or other column oriented DB are the best; They have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
If the rows are added in real time.. then column oriented DB are bad. You can either choose two have two separate DB (that's my choice : one noSQL for real feeding of the stats by the front, and real time stats. The other DB column-oriented for BI). Or turn towards something that mixes column oriented (for out requests) and distribution (for writes) / like Cassandra.
Document oriented DBs are not suited for BI, they are more useful for CRM/CMS issues where you need frequent access to a particular row
As for the exact choice inside a category, I'm still undecided. Cassandra in distributed, and Monet or InfiniDB for CODB, are leaders. Monet is reported to have problem loading very big tables because it runs indexes in memory.
You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well, however you may want also to look at plain-old MySQL and the Archive table. Each record is compressed and stored and, IIRC, it's designed for your type of workload, however I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (E.g. NoSQL), yes, they can work as well and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any, so I don't know how well any of them work.
My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.