MongoDB embedded vs reference schema for large data documents - performance

I am designing my first MongoDB schema for a log management system. I want to store information from log files in MongoDB, and I can't decide which schema to use for large documents (embedded vs. reference).
Note: a project has many sources, and a source has many logs (in some cases over 1,000,000 logs).
{
    "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
    "name" : "projectName",
    "source" : [{
        "name" : "sourceName",
        "log" : [{
            "time" : ISODate("2012-07-20T13:15:37Z"),
            "host" : "127.0.0.1",
            "status" : 200.0,
            "level" : "INFO",
            "message" : "test"
        }, {
            "time" : ISODate("2012-07-20T13:15:37Z"),
            "host" : "127.0.0.1",
            "status" : 200.0,
            "level" : "ERROR",
            "message" : "test"
        }]
    }]
}
My focus is on read performance (NOT writing), e.g. filtering, searching, pagination, etc. Users can filter source logs by date, status, and so on, so I want to optimize for the case where a user searches or filters the data.
I know that MongoDB has a 16 MB document size limit, so I am worried about how this will work if one source has 1,000,000 logs (and one project can have many sources, each with many logs). Which is the better solution when working with large documents and read performance matters: an embedded or a referenced schema? Thanks.

The answer to your question is neither. Instead of embedding or using references, you should flatten the schema to one doc per log entry so that it scales beyond whatever can fit in the 16MB doc limit and so that you have access to the full power and performance of MongoDB's querying capabilities.
So get rid of the array fields and move everything up to top-level fields using an approach like:
{
    "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
    "name" : "projectName",
    "sourcename" : "sourceName",
    "time" : ISODate("2012-07-20T13:15:37Z"),
    "host" : "127.0.0.1",
    "status" : 200.0,
    "level" : "INFO",
    "message" : "test"
},
{
    "_id" : ObjectId("5141e051e2f56cbb680b77fa"),
    "name" : "projectName",
    "sourcename" : "sourceName",
    "time" : ISODate("2012-07-20T13:15:37Z"),
    "host" : "127.0.0.1",
    "status" : 200.0,
    "level" : "ERROR",
    "message" : "test"
}
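With one document per log entry, filtering, searching, and pagination become plain queries against indexed top-level fields. A minimal sketch in the mongo shell (the collection name logs and the exact index choices are assumptions for illustration, not part of the original answer; older shells spell createIndex as ensureIndex):
// Compound index matching the most common read pattern: per project/source, newest first.
db.logs.createIndex({ "name" : 1, "sourcename" : 1, "time" : -1 })

// Optional extra index if users often filter by level or status as well.
db.logs.createIndex({ "name" : 1, "sourcename" : 1, "level" : 1, "time" : -1 })

// Example: page 3 (20 per page) of ERROR logs for one source within a date range.
db.logs.find({
    "name" : "projectName",
    "sourcename" : "sourceName",
    "level" : "ERROR",
    "time" : { "$gte" : ISODate("2012-07-01T00:00:00Z"), "$lt" : ISODate("2012-08-01T00:00:00Z") }
}).sort({ "time" : -1 }).skip(40).limit(20)
Note that skip/limit pagination gets slower on deep pages; paginating by a time range on the indexed field (filtering on time less than the last value seen on the previous page) scales better for very large sources.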

I think having logs in an array might get messy. If the project and source entities don't have any attributes (keys) other than a name, and logs are not stored for long, you may use a capped collection with one log per document:
{
    "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
    "p" : "project_name",
    "s" : "source_name",
    "time" : ISODate("2012-07-20T13:15:37Z"),
    "host" : "127.0.0.1",
    "status" : 200.0,
    "level" : "INFO",
    "message" : "test"
}
See also: http://docs.mongodb.org/manual/use-cases/storing-log-data/
Capped collections maintain insertion (natural) order, so you don't need an index on the timestamp to return logs in that order. In your case, you may want to retrieve all logs from a particular project/source; you can create the index { p: 1, s: 1 } to speed up this query.
But I'd recommend doing some benchmarking to check performance. Try the capped-collection approach above, and also try bucketing documents with the fully embedded schema you suggested. This technique is used in the classic blog-comments problem: you store only a fixed number of logs per source inside a single document and overflow to a new document whenever that custom-defined size is exceeded. Both approaches are sketched below.
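Here is a minimal version of both in the mongo shell (the 1 GB cap and the 50-logs-per-bucket threshold are made-up numbers for illustration):
// Capped collection: fixed size, insertion (natural) order preserved, oldest entries overwritten first.
db.createCollection("logs", { capped : true, size : 1024 * 1024 * 1024 })
db.logs.createIndex({ p : 1, s : 1 })

// Bucketing: push into the current bucket for this source until it holds 50 logs;
// once every matching bucket is full, the upsert creates a fresh one automatically.
db.log_buckets.update(
    { p : "project_name", s : "source_name", count : { $lt : 50 } },
    {
        $push : { log : { time : new Date(), host : "127.0.0.1", status : 200, level : "INFO", message : "test" } },
        $inc : { count : 1 }
    },
    { upsert : true }
)
Keep in mind that capped collections can't be sharded, and on older MongoDB versions their documents can't grow via updates, which is why the capped approach pairs naturally with one log per document rather than with buckets.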

Related

How to store nested document as String in elastic search

Context:
1) We are building a CDC pipeline (using Kafka and the Connect framework)
2) We are using Debezium to capture MySQL transaction logs
3) We are using the Elasticsearch connector to add documents to an ES index
Sample change event generated by Debezium:
{
    "source" : {
        "before" : {
            "Id" : 97,
            "name" : "Northland",
            "code" : "NTL",
            "country_id" : 6,
            "is_business_mapped" : 0
        },
        "after" : {
            "Id" : 97,
            "name" : "Northland",
            "code" : "NTL",
            "country_id" : 6,
            "is_business_mapped" : 1
        },
        "source" : {
            "version" : "0.7.5",
            "name" : "__",
            "server_id" : 252639387,
            "ts_sec" : 1547805940,
            "gtid" : null,
            "file" : "mysql-bin-changelog.000570",
            "pos" : 236,
            "row" : 0,
            "snapshot" : false,
            "thread" : 614,
            "db" : "bazaarify",
            "table" : "state"
        },
        "op" : "u",
        "ts_ms" : 1547805939683
    }
}
What we want:
We want to visualize only 3 columns in Kibana:
1) before - containing the nested JSON as a string
2) after - containing the nested JSON as a string
3) source - containing the nested JSON as a string
I can think of these possibilities:
a) converting the nested JSON to a string, or
b) combining the column data in Elasticsearch.
I am a newbie to Elasticsearch. Can someone please guide me on how to do that?
I tried defining a custom mapping as well, but it gives me an exception.
You can always view your document as raw JSON in Kibana.
You don't need to manipulate it before indexing it in Elasticsearch.
As this is related to visualization, handle it in Kibana only.
Check this link for a screenshot.
Refer to this to add the columns you want to see to the results.
I don't fully understand your use case, but if you would like to turn some JSON objects into their string representations, you can use Logstash for that, or even Elasticsearch's ingest capabilities, to convert an object (JSON) to a string.
From the link above, an example:
PUT _ingest/pipeline/my-pipeline-id
{
    "description" : "converts the content of the source field to a string",
    "processors" : [
        {
            "convert" : {
                "field" : "source",
                "type" : "string"
            }
        }
    ]
}
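Once the pipeline exists, documents only get converted if they are indexed through it. A hedged sketch, with a made-up index name (and, if the documents arrive via the connector rather than a manual request, the index.default_pipeline index setting available from Elasticsearch 6.5 can route them through the same pipeline):
PUT my_index/_doc/1?pipeline=my-pipeline-id
{
    "source" : { "db" : "bazaarify", "table" : "state" },
    "op" : "u"
}
To treat the before and after fields the same way, add one processor per field to the processors array.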

Elasticsearch querying alias with routing giving partial results

In an effort to create a multi-tenant architecture for my project, I've created an Elasticsearch cluster with an index 'tenant':
"tenant" : {
"some_type" : {
"_routing" : {
"required" : true,
"path" : "tenantId"
},
Now, I've also created some aliases:
"tenant" : {
"aliases" : {
"tenant_1" : {
"index_routing" : "1",
"search_routing" : "1"
},
"tenant_2" : {
"index_routing" : "2",
"search_routing" : "2"
},
"tenant_3" : {
"index_routing" : "3",
"search_routing" : "3"
},
"tenant_4" : {
"index_routing" : "4",
"search_routing" : "4"
}
I've added some data with tenantId = 2.
After all that, I tried to query 'tenant_2', but I only got partial results, while querying the 'tenant' index directly returns the full results.
Why is that?
I was sure that routing is supposed to query all the shards that documents with tenantId = 2 reside on.
When you have created aliases in Elasticsearch, you have to do all operations through the aliases only, be it indexing, updates, or searches.
Try reindexing the data and check, if possible (I hope it is a test index).
Remove all the indices:
curl -XDELETE 'localhost:9200/_all' # Warning: don't use this in production; only on a test index.
Then create the index again, create the aliases again, and do all indexing, search, and delete operations via the alias name. Even the import of data should be done via the alias name, as sketched below.
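A rough sketch of that workflow with the names from the question (the commands themselves are illustrative, not from the original answer):
# Recreate the index with routing required on the type.
curl -XPUT 'localhost:9200/tenant' -d '{
    "mappings" : {
        "some_type" : {
            "_routing" : { "required" : true, "path" : "tenantId" }
        }
    }
}'

# Recreate the alias for tenant 2 with its routing value.
curl -XPOST 'localhost:9200/_aliases' -d '{
    "actions" : [
        { "add" : { "index" : "tenant", "alias" : "tenant_2", "routing" : "2" } }
    ]
}'

# Index and search strictly through the alias, never through the bare index.
curl -XPOST 'localhost:9200/tenant_2/some_type' -d '{ "tenantId" : "2", "field" : "value" }'
curl -XGET 'localhost:9200/tenant_2/_search?q=field:value'
Documents that were previously indexed against the bare 'tenant' index without a routing value end up spread across shards that the alias's search_routing never reaches, which is the most likely cause of the partial results you are seeing.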

nested field query for mongodb (using ruby)

Sup, good folks of the internet.
Does anyone know how to make nested queries for MongoDB? This is probably best explained by an example. To retrieve specific fields, I can use the :fields option (e.g. suppose the field is called "useful_field"):
collection.find({}, {:fields => {"useful_field" => 1}})
But suppose that useful_field itself contains an array of documents with further fields, i.e.
useful_field = [{"value_I_want" => "useful", "value_I_dont_want" => "not_useful"}]
My aim is to select "value_I_want". Any thoughts?
Here is a specific entry that I am trying to deal with (a reply to a tweet):
{ "_id" : ObjectId("51b6f71b0364718d71e4bca5"),
"annotations" : { },
"resultType" : "Tweet",
"score" : 1,
"groupName" : "TweetsWithConversation",
"results" : [
{
"kind" : "Tweet",
"score" : 1,
"annotations" : { "ConversationRole" : "Ancestor" },
"value" : { "created_at" : "Fri Jun 07 19:47:51 +0000 2013",
"id" : NumberLong("343091955196104704"),
"id_str" : "343091955196104704",
"text" : "THIS_IS_WHAT_I_WANT",
etc. etc. (Apologies for the odd formatting)
I'm trying to use a method that will let me do something like this:
db.collection.find({}, {:fields => { some_way_of_selecting(THIS_IS_WHAT_I_WANT) }})
(I'm querying as part of a Ruby script.)
Otherwise, I'll have to go back into the dark world of regex. No one wants that.
Try the following
db.collection.find({},{"useful_field.value_I_want": 1})
Maybe try this:
db.collection.find({"resultType" : "Tweet"}, {"results" : {$elemMatch : {"value.text" : "THIS_IS_WHAT_I_WANT"}}})
What you are trying to do is called "projection" - it's specifying what fields you want returned in the second argument to find.
In your case you simply want:
db.collection.find({}, {"results.value.text":1} )

Elasticsearch - how to return only data, not meta information?

When doing a search, Elasticsearch returns a data structure that contains various meta information.
The actual result set is contained within a "hits" field within the JSON result returned from the database.
Is it possible for Elasticsearch to return only the needed data (the contents of the "hits" field) without it being embedded in all the other metadata?
I know I could parse the result as JSON and extract it, but I don't want the complexity, hassle, and performance hit.
Thanks!
Here is an example of the data structure that Elasticsearch returns.
{
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 1,
        "hits" : [
            {
                "_index" : "twitter",
                "_type" : "tweet",
                "_id" : "1",
                "_source" : {
                    "user" : "kimchy",
                    "postDate" : "2009-11-15T14:12:12",
                    "message" : "trying out Elastic Search"
                }
            }
        ]
    }
}
You can at least filter the results, even if you cannot extract them. The "common options" page of the REST API explains the "filter_path" option. This lets you filter only the portions of the tree you are interested in. The tree structure is still the same, but without the extra metadata.
I generally add the query option:
&filter_path=hits.hits.*,aggregations.*
The documentation doesn't say anything about this making your query any faster (I doubt that it does), but at least you could return only the interesting parts.
Corrected to show only hits.hits.*, since the top level "hits" has metadata as well.
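For example, against the sample response above, a request along these lines (the index name and query are just illustrative) keeps only the documents themselves:
curl 'localhost:9200/twitter/_search?q=user:kimchy&filter_path=hits.hits._source'
which returns roughly:
{
    "hits" : {
        "hits" : [
            { "_source" : { "user" : "kimchy", "postDate" : "2009-11-15T14:12:12", "message" : "trying out Elastic Search" } }
        ]
    }
}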
No, it's not possible at the moment. If performance and parsing complexity are the main concerns, you might want to consider using a different client: the Java client or the Thrift plugin, for example.

Index huge data into Elasticsearch

I am new to Elasticsearch and have a lot of data (more than 16k large rows in a MySQL table). I need to push this data into Elasticsearch and am having problems indexing it.
Is there a way to make indexing faster? How should I deal with huge data sets?
Expanding on the Bulk API:
You make a POST request to the /_bulk endpoint.
Your payload follows this format, where \n is the newline character:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
...
Make sure your JSON is not pretty-printed.
The available actions are index, create, update, and delete.
Bulk Load Example
To answer your question: if you just want to bulk-load data into your index:
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
The first line contains the action and metadata. In this case, we are calling create. We will be inserting a document of type type1 into the index named test with a manually assigned id of 3 (instead of Elasticsearch auto-generating one).
The second line contains all the fields in your mapping, which in this example is just field1 with a value of value3.
You just concatenate as many of these as you'd like to insert into your index, as in the curl sketch below.
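To actually send such a payload (the file name here is an assumption), save the action/source lines to a file that ends with a newline and post it with curl; --data-binary preserves the newlines that -d would strip:
curl -XPOST 'localhost:9200/_bulk' \
     -H 'Content-Type: application/x-ndjson' \
     --data-binary @requests.json
The Content-Type header is required on newer Elasticsearch versions and harmless on older ones. For a large MySQL table, sending batches of a few thousand documents per bulk request usually works better than one enormous request.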
This may be an old thread, but I thought I would comment anyway for anyone looking for a solution to this problem. The JDBC river plugin for Elasticsearch is very useful for importing data from a wide array of supported databases.
Link to the JDBC river source here.
Using Git Bash's curl command, I PUT the following configuration document to allow communication between the ES instance and the MySQL instance:
curl -XPUT 'localhost:9200/_river/uber/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "strategy" : "simple",
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://localhost:3306/elastic",
        "user" : "root",
        "password" : "root",
        "sql" : "select * from tbl_indexed",
        "poll" : "24h",
        "max_retries" : 3,
        "max_retries_wait" : "10s"
    },
    "index" : {
        "index" : "uber",
        "type" : "uber",
        "bulk_size" : 100
    }
}'
Ensure you have the mysql-connector-java-VERSION-bin JAR in the river-jdbc plugin directory, alongside the JDBC river's other JAR files.
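To sanity-check the setup, two ordinary Elasticsearch calls (nothing river-specific) let you read back the stored configuration and watch the target index fill up:
# Read back the river configuration document that was PUT above.
curl -XGET 'localhost:9200/_river/uber/_meta?pretty'

# Count the documents imported into the target index so far.
curl -XGET 'localhost:9200/uber/_count?pretty'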
Try the bulk API:
http://www.elasticsearch.org/guide/reference/api/bulk.html
