ELK index design according to development goal - elasticsearch

I used Elastic in the past to analyze logs, but I don't have any experience with Elastic "architecture". I have an application that I deployed to multiple machines (200+). I want to connect to each machine and gather metadata like logs, metrics, DB stats and so on.
With that data I want to be able to:
Find problems on each machine and notify about them (finding problems requires joining data between different sources; for example, finding an exception in log1 requires me to go check the DB)
Analyze common issues across all machines and implement an ML model that will be able to predict issues.
I need to create indexes, and I thought about 2 options:
Create one index per machine, so all the data related to each machine will be available in its index.
Create an index per data source. For example, all DB logs from all machines will be available in one dedicated index. Another index will contain only data related to machine metrics (CPU/RAM usage, ...).
What would be the best way to create those indexes?

OK, now that I have a better understanding of your needs, here's my suggestion:
I strongly recommend not creating an index per machine. I don't know much about your use case(s), but I assume you want to search the data either in Kibana or by implementing search requests in your application.
Let's say you are interested in the RAM usage of every machine. You would need to execute 200 search requests against Elasticsearch, since the data (RAM usage) is spread over 200 indices (of course one could create aliases, but these would have to be updated for every new machine). Furthermore, you wouldn't be able to do basic aggregations like "which machine has the highest RAM usage?" in a convenient way. In my opinion there are plenty more disadvantages, like index management, shard allocation, etc.
So what's a better solution?
As you have already suggested, you should create an index per data source. With that, your indices have a dedicated "purpose", e.g. one index stores database data, another stores system metrics, and so on. Referring to my examples above, you would only need to execute one search request to determine a) the RAM usage of every machine and b) which machine has the highest RAM usage. However, this requires that every document contains a field that references the particular host, like so:
PUT metrics/_doc/1
{
  "system": {
    "ram": {
      "usage": "45%",
      "free": "55%"
    }
  },
  "host": {
    "name": "YOUR HOSTNAME",
    "ip": "192.168.17.100"
  }
}
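For illustration, a minimal sketch of b) with the Python Elasticsearch client. The field names follow the example document above, but the client call style, the numeric system.ram.usage_pct field and the keyword mapping of host.name are assumptions - the percent strings shown above would have to be indexed as numbers before you can aggregate on them:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# One request against the single metrics index answers both questions:
# a terms aggregation groups by host, a max sub-aggregation yields the peak RAM usage.
resp = es.search(
    index="metrics",
    size=0,  # we only need the aggregation, not individual documents
    aggs={
        "per_host": {
            "terms": {"field": "host.name", "size": 200},
            "aggs": {"max_ram": {"max": {"field": "system.ram.usage_pct"}}},
        }
    },
)

for bucket in resp["aggregations"]["per_host"]["buckets"]:
    print(bucket["key"], bucket["max_ram"]["value"])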
In addition to that, I recommend using daily indices. So instead of creating one huge index for the system metrics, you would create an index for every day, like metrics-2020.01.01, metrics-2020.01.02 and so on. This approach has the following advantages:
your indices will be much smaller in size, making them easier to manage and (re-)allocate.
after some time, you can roughly estimate the daily data volume and define the number of shards much better. With only one huge index, you would constantly need to adjust the number of shards in order to keep your requests fast.
furthermore, you can search your data on a day-by-day basis in a convenient way.
you are able to set up ILM policies to automate the maintenance of your indices, e.g. delete metrics indices that are older than X days.
...
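A small sketch of how the daily routing could look on the ingest side (Python client again; the index prefix is just the metrics example from above, and the client call style is an assumption):

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_metric(doc: dict) -> None:
    # Route every document to the index for the current day,
    # e.g. metrics-2020.01.01, metrics-2020.01.02, ...
    index_name = f"metrics-{datetime.now(timezone.utc):%Y.%m.%d}"
    es.index(index=index_name, document=doc)

# Searching across days stays convenient thanks to wildcards:
#   es.search(index="metrics-*", ...)          # all days
#   es.search(index="metrics-2020.01.*", ...)  # a single month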
I hope I could help you!

Related

Is Elasticsearch optimized for inserts?

I develop for a relatively large online store with a PHP backend, and it uses Elasticsearch for some things (like text search, logging, etc.).
Now, I'd like to start storing all kinds of information about user activity in ES. For instance, every page view (e.g. user enters a product page/category page, etc.).
Is ES optimized for such a heavy load of continuous inserts, or should I consider some alternatives, like having some sort of buffer layer where I store all of my immediate inserts in memory and then, every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid alternative for such a use case.
You might decide, however, to store that streaming data into another cluster which is different from your production cluster, so as to not impact the health of the production cluster too much.
There are a lot of variables involved in arriving at the correct decision, and we don't have enough information here, but it's definitely a valid way to go.
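If you do go with a buffer layer, the idea is small enough to sketch; shown here with the Python client's bulk helper for brevity, even though your backend is PHP (the index name and flush threshold are made up):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
buffer = []  # in-memory queue of pending page-view events

def track_page_view(event: dict) -> None:
    # Instead of one index request per page view, queue the event...
    buffer.append({"_index": "pageviews", "_source": event})
    # ...and flush in one bulk request once enough events have piled up.
    if len(buffer) >= 1000:
        flush()

def flush() -> None:
    if buffer:
        helpers.bulk(es, buffer)
        buffer.clear()

Elasticsearch's own _bulk API is what any such buffer ends up calling anyway, so even without a buffer layer you should prefer bulk requests over single-document inserts.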

Mongo - quick exports of all documents without a performance hit on other queries?

I need a Mongo cluster doing 2 operations:
get/update a single document - Mongo is great for realtime changes, excellent speed.
export all documents into JSON files (one file per category; there are ca. 15 categories) - this is very slow when I use a regular query. Maybe I just don't know what command or options to use, or I would need to fit the whole thing into RAM, which is expensive. Even replication to a new Mongo instance is much faster (takes hours) than running a query and writing the data to disk (takes days).
I have about 10m documents. The Mongo data on disk is 250 GB. There are ca. 15 categories for which I need separate files (at the moment all documents are in 1 collection regardless of category).
Which command should I use to export all data into files in a couple of hours?
How large an AWS instance should I use to speed it up without paying too much for RAM? Would it help? Operation 2) must not cause a performance hit for operation 1) -- I cannot stop Mongo and use mongoexport.
I am not sure what kind of servers you are using, but this may provide some further insight regarding the export/file-creation performance without shutting off Mongo. One presumes you are working with a sharded and replicated cluster.
In my case I am on Azure VMs running Windows Server in a replicated and sharded cluster. So I would take a copy of the Azure blobs associated with the data disks on a secondary in each replica set. You should stop your balancer and lock the db on the secondary to do this. It should take a couple of minutes at most to copy only 250 GB. Then I would restore the blobs to disks on a new VM.
Then you could query data out of this VM without affecting your cluster's performance. You may additionally add indexes for this export process, since you are on a separate instance now.
Personally I use PowerShell to do this in Azure. Golang may be a better choice to write your queries in, due to its parallel capabilities, if JavaScript via the mongo shell fails you. I've had JS work faster than Python code, but it also depends on what you know.
This is just one way, but it does address some of the criteria you posted.
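Whether you point it at the restored VM or at a secondary via a read preference, the export loop itself can stay simple - a rough sketch in Python/pymongo (database, collection and field names are assumptions):

from pymongo import MongoClient
from bson import json_util

# readPreference=secondaryPreferred keeps the export away from the primary,
# so operation 1) (single-document reads/updates) is not affected.
client = MongoClient(
    "mongodb://host1,host2/?replicaSet=rs0&readPreference=secondaryPreferred"
)
coll = client["mydb"]["documents"]

for category in coll.distinct("category"):
    cursor = coll.find({"category": category}, no_cursor_timeout=True).batch_size(1000)
    try:
        # Stream to disk as JSON lines instead of building everything in RAM.
        with open(f"export_{category}.json", "w") as out:
            for doc in cursor:
                out.write(json_util.dumps(doc) + "\n")
    finally:
        cursor.close()

An index on the category field keeps each of the ~15 passes from scanning the whole collection.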

Handle huge data imported from Facebook

I'm currently creating a program that imports all the groups and feeds the user wants from Facebook.
I'm using the Graph API with OAuth, and this works very well.
But I've come to the point of realizing that one request can't handle the import of 1000 groups plus their feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
Restful
Questions
What's your experience with that?
Would Hadoop be the right solution?
You can use Neo4j.
Neo4j is a graph database, reliable and fast for managing and querying highly connected data:
http://www.neo4j.org/
1) Decide on the structure of nodes, relationships, and their properties, and accordingly create an API that will get data from Facebook and store it in Neo4j.
I have used Neo4j in 3 big projects, and it is best for graph data.
2) Create a cron job that will get the data from Facebook and store it into Neo4j.
I think implementing MySQL for graph data is not a good idea; for large data, Neo4j is the better option.
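A very rough sketch of steps 1) and 2) combined, in Python, using the Graph API and Neo4j's transactional Cypher-over-HTTP endpoint. All URLs, labels, property names and the access token are placeholders, and the {param} syntax is the legacy parameter form used by the 2.x REST endpoint:

import requests

GRAPH_API = "https://graph.facebook.com"
NEO4J_TX = "http://localhost:7474/db/data/transaction/commit"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # obtained via OAuth, as in the question

def import_group_feed(group_id: str) -> None:
    # Fetch one page of the group's feed from the Graph API.
    feed = requests.get(
        f"{GRAPH_API}/{group_id}/feed",
        params={"access_token": ACCESS_TOKEN},
    ).json()

    # MERGE makes the cron job idempotent: re-running it does not
    # create duplicate group or post nodes.
    statements = [
        {
            "statement": "MERGE (g:Group {fb_id: {gid}}) "
                         "MERGE (p:Post {fb_id: {pid}}) "
                         "MERGE (g)-[:HAS_POST]->(p)",
            "parameters": {"gid": group_id, "pid": post["id"]},
        }
        for post in feed.get("data", [])
    ]
    requests.post(NEO4J_TX, json={"statements": statements})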
Interestingly, you have already designed the appropriate solution yourself. So in fact you need the following components:
a relational database, since you want to request data in a structured, quick way
-> from experience I would stress having a fully normalized data model (in your case with the tables users, groups, users2groups); also prefer 4-byte surrogate keys over the larger keys from Facebook (for back-referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)) (see the sketch after this list)
-> never, ever establish unique columns based on strings longer than 8 bytes; unique bulk inserts can be done based on hash + string checking via insert...select
-> once you have that settled, you could also look into sparse matrices (see Wikipedia) and bitmaps to optimize your users2groups table (however, I have learned that this is an extra that should not keep you from coming up with a first version soon)
a cron job that is run periodically
-> ideally within the caps Facebook gives you (so if they rule that you must not request more often than once per second, stick to that - not more, but also try to come as close as possible to the cap) -> invest some time in getting the management of this settled if different types of requests need to be fired (requests for user records <> requests for group records, but maybe hit by the same cap)
-> most of the optimization can only be done during development - so if I were you, I would stick to any high-level programming language that does not bother too much with variable type juggling and that comes with broad support for associative arrays, such as PHP, and I would program the thing myself
-> I have had good experiences with setting up the cron job as a web page with deactivated output buffering (for PHP look at ob_end_flush()) - easy to test, and the cron job can be triggered via curl; if you channel status output through your own function (e.g. with timestamps), this also becomes flexible enough to run either via browser or via the command line -> which means efficient testing + efficient production running
your user UI, which only queries your database and never, ever, never the external system API
lots of memory, to keep your performance high (optimal: all your data + index data fits into the memory/cache dedicated to the database)
-> if you use MySQL as the database, you should look into innodb_flush_log_at_trx_commit=0 and innodb_buffer_pool_size (just google them if interested)
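A small illustration of the hash-indexed lookup from the first block above, in Python (table and column names are made up; any MySQL driver exposing a DB-API cursor would do):

import zlib

def name_hash(name: str) -> int:
    # Same idea as crc32(lower(STRING)) on the SQL side: a small integer
    # that is far cheaper to index than the string itself.
    return zlib.crc32(name.lower().encode("utf-8"))

def find_user(cursor, search_string: str):
    # The hash column narrows the search via a compact integer index;
    # the string comparison then weeds out the rare crc32 collisions.
    cursor.execute(
        "SELECT user_id, name FROM users WHERE name_hash = %s AND name = %s",
        (name_hash(search_string), search_string),
    )
    return cursor.fetchone()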
Hadoop is a file system layer - it could help you with availability. However, I would put this into the same category as the "sparse matrix" point: it is nothing that stops you from coming up with a solution. From my experience, availability is not a primary constraint in data exposure projects.
-------------------------- UPDATE -------------------
I like the Neo4j suggestion from the other answer, so I wondered what I can learn from it for my future projects. My experience with MySQL is that RAM is usually the biggest constraint. Increasing your RAM so that the full database can be loaded can gain you performance improvements by a factor of 2-1000, depending on where you are coming from. Everything else, such as index improvements and structure, somehow follows. So if I had to make up a performance prioritization list, it would be something like this:
MYSQL + enough RAM dedicated to the database to load all data
NEO4J + enough RAM dedicated to the database to load all data
I would still prefer MYSQL. It stores records efficiently, but needs to run joins for deriving relations (which NEO4J does not require to that extent). Join costs are usually low with the right indexes, and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html NEO4J needs to add extra management data for the property separation. For big-data projects that management data adds up, and in full-load-to-memory setups it requires you to buy more memory. Performance-wise, these two options are the ultimate ones. Further, much further down the line, you would find this:
NEO4J + not enough RAM dedicated to the database to load all data
MYSQL + not enough RAM dedicated to the database to load all data
In the worst case, MYSQL will even put indexes on disk (at least partly), which can result in massive read delays. In comparison, with NEO4J you could perform a 'direct jump from node to node', which should - at least in theory - be faster.

Neo4j - Using Java plugins with the REST API to improve performance?

I am building an application that requires a lot of data to be constantly extracted from a local MongoDB and put into Neo4j. Seeing as I also have many users accessing the Neo4j database, from both a Django webserver and other places, I decided on using the REST interface for Neo4j.
The problem I am having is that, even with batch insertion, the Neo4j server is active over 50% of the time just trying to insert all the data from MongoDB. As far as I can see there might be some waiting time because of the HTTP requests, but I have been trying to tweak it and have only gotten so far.
The question is: if I write a Java plugin (http://docs.neo4j.org/chunked/stable/server-plugins.html) that can handle inserting the MongoDB extractions directly, will I then bypass the REST API? Or will the Java plugin commands just be converted to regular REST API requests? Furthermore, will there be a performance boost from using the plugin?
The last question is how do I optimize the speed of the REST API (so far I am performing around 1500 read/write operations, which include many "get_or_create_in_index" operations)? Is there a sweet spot where the number of queries appended to one HTTP request will keep Neo4j busy until the next HTTP request arrives?
Update:
I am using Neo4j version 2.0
The data that I am extracting consists of Bluetooth observations, where the phone running the app I created scans all nearby phones. Each such observation is saved as a document in MongoDB and consists of the user's id, the time of the scan, and a list of the phones/users seen in that scan.
In Neo4j I model all the users as nodes and I also model an observation between two users as a node so that it will look like this:
(user1)-[observed]->(observation_node)-[observed]->(user2)
Furthermore I index all user nodes.
When moving the observations from MongoDB to Neo4j, I do the following for each document:
Check in the index whether the user doing the scan already has a node assigned, else create one
Then for each observed user in the scan: A) Check in the index whether the observed user has a node, else create one B) Create an observation node and relationships between the users and the observation node, if these don't already exist C) Make a relationship between the observation node and a timeline node (the timeline just consists of a tree of nodes so that I can quickly find observations at a certain time)
As can be seen, I am doing quite a few lookups in the user index (3), some normal reads (2-3), and potentially many writes for each observation.
Each Bluetooth scan averages around 5-30 observations, and I batch 100 scans in a single HTTP request. This means that each request usually contains 5000-10000 updates.
What version are you using?
An unmanaged extension would use the underlying Java API, so it is much faster; you can also decide on the format & protocol of the data that you push to it.
It is sensible to batch writes so that you don't incur transaction overhead for each tiny write. E.g. aggregating 10-50k updates in one operation helps a lot.
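As a rough illustration of that batching idea (not the unmanaged extension itself): Neo4j 2.0's transactional Cypher endpoint accepts many statements per HTTP request, and MERGE covers the get-or-create lookups. The labels, property names and endpoint below are placeholders modeled on the observation pattern in the question; the {param} syntax is the legacy parameter form of that endpoint:

import requests

NEO4J_TX = "http://localhost:7474/db/data/transaction/commit"

# One statement per observation, following
# (user1)-[:OBSERVED]->(observation)-[:OBSERVED]->(user2).
CYPHER = (
    "MERGE (u1:User {uid: {u1}}) "
    "MERGE (u2:User {uid: {u2}}) "
    "MERGE (u1)-[:OBSERVED]->(:Observation {time: {t}})-[:OBSERVED]->(u2)"
)

def push_observations(observations):
    # Aggregate thousands of updates into one request/transaction instead of
    # paying HTTP and transaction overhead for every tiny write.
    statements = [
        {"statement": CYPHER, "parameters": {"u1": a, "u2": b, "t": ts}}
        for (a, b, ts) in observations
    ]
    requests.post(NEO4J_TX, json={"statements": statements}).raise_for_status()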
What is the concrete shape of the updates you do? Can you edit your question to reflect that?
Some resources for this:
http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2013/12/31/the-power-of-open-source-software/

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database set up and use socket.io for real-time updates, which will either query the db again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information regarding what is being queried and what the real-time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
I am trying to figure out the fastest way for the user to filter it.
Also, if that is different from the most cost-effective way, then the most cost-effective way as well (which I am assuming means fewer db queries)...
It's hard to say because we don't see the exact conditions of the filter, but in general:
Mongo can use only 1 index per query condition. Thus whatever fields are covered by this index can be used for efficient filtering; otherwise it might do a full collection scan, which is slow. If you are using an index, then you are probably doing the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on the client side because Mongo can't do what you want or it would take too many queries.
The least efficient option is to store results somewhere, simply because IO is slow. It would only benefit you if you use the stored results as a cache and avoid recalculating.
Also consider the overhead and latency of networking. If you have to send lots of data back to the client, it will be slower. In general, Mongo will do a better job filtering stuff than you would do on the client.
Based on what you describe, if you can filter addresses within a time period, then you could have an index that cuts out lots of documents. You most likely need a compound index - multiple fields - as sketched below.
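A small sketch of such a compound index (shown with pymongo for brevity - the same index helps regardless of driver; collection name and filter values are made up):

from pymongo import MongoClient, ASCENDING

coll = MongoClient("mongodb://localhost:27017")["mydb"]["packages"]

# A compound index over the fields you filter on most often lets Mongo
# answer Method 1 (filter inside the query) without a full collection scan.
coll.create_index([("city", ASCENDING), ("time", ASCENDING), ("price", ASCENDING)])

# The server then does the filtering, and only matching documents
# cross the network to the client.
cursor = coll.find({
    "city": "Toronto",                      # hypothetical filter values
    "time": {"$gte": 1000, "$lt": 2000},
    "price": {"$lte": 50},
})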
