How to store search data in Redis for caching to gain maximum performance improvement

We are implementing an online book store where the user can search books by filter, price, author, sorting, etc. Since we have millions of records in the database (MySQL), the search query
becomes slow. So, to improve performance, we are planning to implement a caching mechanism using Redis, but I am struggling to determine how to store the search data in Redis.
For example, if the search criteria is:
{
  "searchFilter": [
    {
      "field": "title",
      "operator": "LIKE",
      "value": "JAVA"
    }
  ]
}
We fetch the data from the MySQL database using the above JSON, then store the result in the Redis server, using the above JSON as the key and all the books as the value, i.e.:
{"searchFilter":[{"field":"title","operator":"LIKE","value":"JAVA"}]} : booksList
Now say another user fires a search request like this:
{
  "searchFilter": [
    {
      "field": "title",
      "operator": "LIKE",
      "value": "JAVA"
    },
    {
      "field": "price",
      "operator": "GREATER_THAN",
      "value": "500"
    }
  ]
}
Again we fetch the data from MySQL using the above JSON and store the result in Redis, so that the next request with the same search criteria is served from the Redis cache.
In Redis we again use the above JSON as the key and all the books as the value, i.e.:
{"searchFilter":[{"field":"title","operator":"LIKE","value":"JAVA"},{"field":"price","operator":"GREATER_THAN","value":"500"}]} : booksList
So my question is: is using the JSON search criteria as the Redis key a good idea?
If yes, then storing the data for every search request whose criteria differ even slightly will make the caching server consume a lot of memory.
So what is the ideal approach to designing our Redis cache to gain maximum performance?
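One way to make the idea described above concrete is a cache-aside pattern; below is a minimal sketch assuming the Jedis client, where the key prefix, the TTL value, the SHA-256 hashing, and the whitespace-stripping canonicalisation are illustrative assumptions rather than a prescribed design:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.function.Supplier;

import redis.clients.jedis.Jedis;

public class SearchResultCache {

    private static final int TTL_SECONDS = 600; // entries expire so the cache cannot grow without bound

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Returns the cached book list (as JSON) for this search criteria,
    // or runs the MySQL query, caches its result, and returns it.
    public String search(String searchFilterJson, Supplier<String> mysqlQuery) throws Exception {
        // Hash a canonical form of the criteria so equivalent requests map to
        // the same small key; canonicalisation here is just whitespace removal,
        // and sorting the filters before hashing would catch more duplicates.
        String key = "search:" + sha256(searchFilterJson.replaceAll("\\s+", ""));

        String cached = jedis.get(key);
        if (cached != null) {
            return cached;                            // cache hit
        }
        String booksJson = mysqlQuery.get();          // cache miss: query MySQL
        jedis.setex(key, TTL_SECONDS, booksJson);     // store the result with a TTL
        return booksJson;
    }

    private static String sha256(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Hashing a canonical form keeps the keys small even for large criteria, and a TTL like this (optionally combined with a Redis maxmemory limit and an LRU eviction policy) is one common way to keep memory bounded when many slightly different criteria get cached.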

Related

Protecting data in Elasticsearch

I have an Elasticsearch instance running locally with an index that contains data from multiple customers. When a customer makes a query, is there a way to dynamically add the customer ID to the filtering criteria so that a customer cannot access the records of other customers?
Yes, you can achieve that using filtered aliases. So you'd create one alias per customer like this:
POST /_aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "customer_index",
        "alias" : "customer_1234",
        "filter" : { "term" : { "customer_id" : "1234" } }
      }
    }
  ]
}
Then your customer can simply query the alias customer_1234 and only their data will come back.
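For completeness, a hedged sketch of what querying that alias could look like from the Java high-level REST client (the client version, host, and port here are assumptions):

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class AliasSearchExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Search the per-customer alias; the term filter attached to the
            // alias is applied server-side, so only customer 1234's documents
            // are returned even for a match_all query.
            SearchRequest request = new SearchRequest("customer_1234");
            request.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()));
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits());
        }
    }
}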

Is it possible to organize data between Elasticsearch shards based on stored data?

I want to build a data store with three nodes. The first one should keep all data, the second one the data of the last month, and the third the data of the last week. Is it possible to configure Elasticsearch shards to relocate themselves between nodes automatically so that this behaviour is achieved?
If you want to move existing shards from one node to another, you can use _cluster/reroute.
But using this with automatic allocation enabled can be dangerous, because just after moving an index to the target node the cluster will try to rebalance itself.
Alternatively, you can disable automatic allocation; in that case only custom allocations will apply, which can be really risky to handle for a large data set.
POST /_cluster/reroute
{
  "commands" : [
    {
      "move" : {
        "index" : "test", "shard" : 0,
        "from_node" : "node1", "to_node" : "node2"
      }
    },
    {
      "allocate_replica" : {
        "index" : "test", "shard" : 1,
        "node" : "node3"
      }
    }
  ]
}
source: Elasticsearch rerouting
Also, you should read this: Customize document routing
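If you do go the manual reroute route, rebalancing can be switched off first. A minimal sketch assuming the Java high-level REST client follows; the cluster.routing.rebalance.enable setting is the standard knob for this, but treat the surrounding code as illustrative only:

import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class DisableRebalanceExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Temporarily stop the cluster from rebalancing shards on its own,
            // so a manual _cluster/reroute move is not immediately undone.
            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.transientSettings(Settings.builder()
                    .put("cluster.routing.rebalance.enable", "none")
                    .build());
            client.cluster().putSettings(request, RequestOptions.DEFAULT);
            // ...perform the reroute commands, then set the value back to "all".
        }
    }
}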

How to read all data from a Druid datasource

I am using the JSON below to read all data from a Druid datasource.
But in the request the threshold field/value is mandatory, and the query returns only the number of rows specified in the threshold.
{
  "queryType" : "select",
  "dataSource" : "wikiticker",
  "granularity" : "day",
  "intervals" : [ "1000/3000" ],
  "filter" : null,
  "dimensions" : [ ],
  "metrics" : [ ],
  "descending" : "false",
  "pagingSpec" : {
    "threshold" : 10000,
    "pagingIdentifiers" : null
  },
  "aggregations" : [ ]
}
Is there any way to retrieve all the data by setting the threshold to some value that returns everything from the datasource?
For example, setting the intervals field to [ "1000/3000" ] gets data from all intervals.
The distributed nature of the system makes it hard to have an exact count of rows per interval of time, therefore the answer is no. Also keep in mind that the select query will materialize all the rows in memory, so you might want to avoid pulling all the data at once and use the paging spec instead.

MongoDB embedded vs reference schema for large data documents

I am designing my first MongoDB database schema for a log management system. I would like to store information from log files in MongoDB, and I can't decide which schema I should use for large documents (embedded vs. referenced).
Note: a project has many sources and a source has many logs (in some cases over 1,000,000 logs).
{
  "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
  "name" : "projectName",
  "source" : [{
    "name" : "sourceName",
    "log" : [{
      "time" : ISODate("2012-07-20T13:15:37Z"),
      "host" : "127.0.0.1",
      "status" : 200.0,
      "level" : "INFO",
      "message" : "test"
    }, {
      "time" : ISODate("2012-07-20T13:15:37Z"),
      "host" : "127.0.0.1",
      "status" : 200.0,
      "level" : "ERROR",
      "message" : "test"
    }]
  }]
}
My focus is on performance when reading data from the database (NOT writing), e.g. filtering, searching, pagination, etc. Users can filter source logs by date, status, and so on, so I want to focus on read performance when a user searches or filters the data.
I know that MongoDB has a 16 MB document size limit, so I am worried about how this will work if I have 1,000,000 logs for one source (and I can have many sources for one project, and sources can have many logs). What is the better solution when working with large documents if I want good read performance: should I use an embedded or a referenced schema? Thanks
The answer to your question is neither. Instead of embedding or using references, you should flatten the schema to one doc per log entry so that it scales beyond whatever can fit in the 16MB doc limit and so that you have access to the full power and performance of MongoDB's querying capabilities.
So get rid of the array fields and move everything up to top-level fields using an approach like:
{
  "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
  "name" : "projectName",
  "sourcename" : "sourceName",
  "time" : ISODate("2012-07-20T13:15:37Z"),
  "host" : "127.0.0.1",
  "status" : 200.0,
  "level" : "INFO",
  "message" : "test"
},
{
  "_id" : ObjectId("5141e051e2f56cbb680b77fa"),
  "name" : "projectName",
  "sourcename" : "sourceName",
  "time" : ISODate("2012-07-20T13:15:37Z"),
  "host" : "127.0.0.1",
  "status" : 200.0,
  "level" : "ERROR",
  "message" : "test"
}
I think having logs in an array might get messy. If the project and source entities don't have any attributes (keys) other than a name, and logs are not to be stored for long, you may use a capped collection with one log per document:
{
  "_id" : ObjectId("5141e051e2f56cbb680b77f9"),
  "p" : "project_name",
  "s" : "source_name",
  "time" : ISODate("2012-07-20T13:15:37Z"),
  "host" : "127.0.0.1",
  "status" : 200.0,
  "level" : "INFO",
  "message" : "test"
}
Refer to this as well: http://docs.mongodb.org/manual/use-cases/storing-log-data/
Capped collections maintain natural order, so you don't need an index on the timestamp to return the logs in natural order. In your case, you may want to retrieve all logs from a particular source/project, so you can create the index {p:1, s:1} to speed up this query.
But I'd recommend you do some benchmarking to check performance. Try the capped collection approach above, and also try bucketing of documents with the fully embedded schema that you have suggested. This technique is used in the classic blog-comments problem: you only store so many logs of each source inside a single document and overflow to a new document whenever the custom-defined size is exceeded.
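As a rough illustration of the read side with the flattened one-document-per-log shape, here is a sketch assuming the modern MongoDB Java driver; the database name, collection name, and filter values are made up for the example:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Date;

public class LogQueryExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs = client.getDatabase("logdb").getCollection("logs");

            // Compound index on project + source, as suggested above.
            logs.createIndex(Indexes.ascending("p", "s"));

            // Filter one source's ERROR logs since a given date, newest first,
            // with skip/limit for pagination.
            for (Document d : logs.find(Filters.and(
                            Filters.eq("p", "project_name"),
                            Filters.eq("s", "source_name"),
                            Filters.eq("level", "ERROR"),
                            Filters.gte("time", new Date(0))))
                    .sort(Sorts.descending("time"))
                    .skip(0)
                    .limit(50)) {
                System.out.println(d.toJson());
            }
        }
    }
}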

Index huge data into Elasticsearch

I am new to Elasticsearch and have a huge amount of data (more than 16k large rows in a MySQL table). I need to push this data into Elasticsearch and am facing problems indexing it.
Is there a way to make indexing faster? How should I deal with huge amounts of data?
Expanding on the Bulk API
You will make a POST request to the /_bulk endpoint.
Your payload will follow the format below, where \n is the newline character.
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
...
Make sure your JSON is not pretty-printed.
The available actions are index, create, update and delete.
Bulk Load Example
To answer your question, if you just want to bulk-load data into your index:
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
The first line contains the action and metadata. In this case, we are calling create. We will be inserting a document of type type1 into the index named test with a manually assigned id of 3 (instead of Elasticsearch auto-generating one).
The second line contains all the fields in your mapping, which in this example is just field1 with a value of value3.
You can concatenate as many of these as you'd like to insert into your index.
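If you are loading from Java rather than hand-building the NDJSON, a rough equivalent using the Java high-level REST client might look like the sketch below; the client version and the idea of looping over your MySQL rows are assumptions, and the index name test and id 3 simply mirror the example above:

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkIndexExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            BulkRequest bulk = new BulkRequest();
            // In practice you would loop over your MySQL rows and add one
            // IndexRequest per row, flushing every few thousand documents.
            bulk.add(new IndexRequest("test").id("3")
                    .source("{\"field1\":\"value3\"}", XContentType.JSON));
            BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
            System.out.println("Bulk had failures: " + response.hasFailures());
        }
    }
}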
This may be an old thread, but I thought I would comment anyway for anyone looking for a solution to this problem. The JDBC river plugin for Elasticsearch is very useful for importing data from a wide range of supported databases.
Link to the JDBC river source here.
Using curl from Git Bash, I PUT the following configuration document to allow communication between the ES instance and the MySQL instance:
curl -XPUT 'localhost:9200/_river/uber/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "strategy" : "simple",
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://localhost:3306/elastic",
    "user" : "root",
    "password" : "root",
    "sql" : "select * from tbl_indexed",
    "poll" : "24h",
    "max_retries" : 3,
    "max_retries_wait" : "10s"
  },
  "index" : {
    "index" : "uber",
    "type" : "uber",
    "bulk_size" : 100
  }
}'
Ensure you have the mysql-connector-java-VERSION-bin JAR in the river-jdbc plugin directory, which contains the jdbc-river plugin's required JAR files.
Try the bulk API:
http://www.elasticsearch.org/guide/reference/api/bulk.html
