Does unused data in Elasticsearch reduce performance?

I have an Elasticsearch server with log data; right now I have 3 years of data (50 GB). I have found that data older than 1 year is rarely needed.
If I change all the queries to fetch only the last 1 year of data, how will that impact performance? Or should I store data older than 1 year on another server?
I did some digging but could not find an exact answer.
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html

Related

Will Druid continue to work fast even if my SELECT COUNT(*) ... has no time boundaries?

I have a statement which looks like the following to compute a total of counts:
SELECT id, SUM("count") AS total FROM my_table GROUP BY id
Say Druid ingests about 1 million rows of data per day. The rows are relatively small (around 20 columns; the longest string is around 100 characters). Each row includes a date and an identifier. The data gets aggregated by id in 5-minute windows.
Will that SELECT statement continue to be fast after a few years of data ingestion?
Druid is surprisingly fast at group-bys because of the way the data is stored and the way the query engine is optimized for reading it.
Will that SELECT statement continue to be fast after a few years of data ingestion?
I think the question is how you are rolling up the data. If you have enabled compaction and are doing monthly rollups, then the above query should not pose an issue.
You can read more about automatic compaction here: https://druid.apache.org/docs/latest/data-management/automatic-compaction.html
If you have more doubts, please feel free to open an issue on GitHub (https://github.com/apache/druid/issues) or find us on the Druid Slack channel: https://druid.apache.org/community/
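As a rough illustration of what time boundaries buy you, here is the same statement restricted to recent data (the one-month window is an assumption for the example; __time is Druid's built-in timestamp column):

    SELECT id, SUM("count") AS total
    FROM my_table
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' MONTH
    GROUP BY id

With rollup and compaction in place, a bounded query like this reads only the segments covering that interval instead of every segment ever ingested.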

How to increase the speed of bulk inserts with GORM?

I am using GORM v2's CreateInBatches and set SkipDefaultTransaction: true as mentioned in the "Performance" page, but I find that 350k+ records inserted in batches of 1000 take almost 3 minutes.
I tried removing the gorm.Model{} fields but didn't see much improvement.
What can I do to increase bulk-insert speed?
EDIT for anyone reading this with the same problem: I ended up saving my data to CSV and importing it with pg_bulkload; I got 1 million rows imported in 1 second (on a server with lots of cores and RAM).
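If you would rather stay in plain SQL than reach for an external tool, PostgreSQL's built-in COPY gets much of the same benefit by loading the whole file in a single statement; pg_bulkload is faster still because it can skip parts of the normal write path. A minimal sketch, with placeholder table, columns, and file path:

    -- single-statement bulk load from a CSV file readable by the server
    COPY my_table (id, name, created_at)
    FROM '/path/to/data.csv'
    WITH (FORMAT csv, HEADER true);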

Initial Elasticsearch Bulk Index/Insert/Upload is really slow. How do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3, and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (I have no documents prior to this in the index).
I have a 3-node ES setup with 16GB of RAM and an 8GB JVM heap per node, 1 index, 5 shards.
I have turned off refresh (refresh_interval: -1), set replicas to 0, and increased the index buffer size to 30%.
On my upload side I have 22 threads running 150 docs per bulk request. This is just a basic Ruby script using PostgreSQL, ActiveRecord, and Net::HTTP (for the network call), hitting the ES Bulk API directly (no gem).
For all of my nodes and upload machines, CPU, memory, and SSD disk I/O are low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me, since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million it slows to a crawl.
I'm pretty new to ES, so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find, and I wonder why my upload speed is several times slower.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about running out of memory or the like.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in the 6.x line, and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database, and increasing that connection pool helped, but it still slows to a crawl at about 1 million records; I did get to 2 million over about 8 hours of running.
I also tried an experiment on the big machine that runs the upload job, with 80 threads uploading 1000 documents per request. I did some calculations and found that my documents are about 7-10KB each, so each bulk index is about 7-10MB. This reached the 1M document count faster, but once you get there, everything slows to a crawl. The machine's stats are still really low. I see output from the threads in the job logs about every 5 minutes, around the same time I see the ES count change.
The ES machines still have low CPU and memory use. Disk I/O is around 3.85MB/s, and network bandwidth sits at 55MB/s and then drops to about 20MB/s.
Any help would be appreciated. I'm not sure if I should try the ES gem and use its bulk insert, which maybe keeps a connection open, or try something totally different.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in the 6.x line, and settings that people were using are no longer supported.
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those, and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is a great blog post, no-offset, which explains:
Why it's bad: to get results 1,000 to 1,010, you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
How to avoid it: establish a unique ordering of your entries (for example by ID, or by date combined with ID; it must be something absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that one, order by ID again, but with the condition that the ID must be greater than the last one from your previous run, fetch the next 10 entries, and again remember the last ID. Repeat until done. A SQL sketch of both approaches is below.
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look first at the part that fetches the data.
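A minimal sketch of the two approaches, assuming an indexed id column (table and column names are placeholders, and :last_seen_id is a bind parameter):

    -- offset pagination: the database sorts, then throws away, every skipped row
    SELECT * FROM documents ORDER BY id LIMIT 1000 OFFSET 1000000;

    -- keyset pagination: the index lets the database seek straight to the start
    SELECT * FROM documents WHERE id > :last_seen_id ORDER BY id LIMIT 1000;

Each batch remembers the largest id it returned and passes it back in as :last_seen_id, so the cost per batch stays flat no matter how deep into the table you are.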

High volume data storage and processing

I am building a new application where I am expecting a high volume of geolocation data: a moving object sending geo coordinates every 5 seconds. This data needs to be stored in some database so that it can be used for tracking the moving object on a map at any time. I am expecting about 250 coordinates per moving object per route, and each object can run about 50 routes a day, with 900 such objects to track. That comes to about 11.5 million geo coordinates to store per day, and I have to keep at least one week of data in my database.
This data will basically be used for simple queries like "find all the geo coordinates for a particular object and a particular route", so the queries are not very complicated, and this data will not be used for any analysis purposes.
So, my question is: should I just go with a normal Oracle database like 12c distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance: each query has to respond within 1 second.
Since you know the volume of data (11.5 million rows per day), you can easily simulate your whole scenario in an Oracle DB and test it well beforehand.
My suggestion is to go for day-level partitions with two subpartition dimensions, objects and routes, so that all your business SQL always hits the right partitions.
You may also need to purge older days' data, or create some sort of aggregation over past days and then delete the raw data.
It's all well doable in 12c.
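A minimal DDL sketch of that layout, assuming daily interval partitioning with hash subpartitions on the object (all names, types, the initial date, and the subpartition count are placeholders, not from the question):

    CREATE TABLE geo_points (
      object_id   NUMBER       NOT NULL,
      route_id    NUMBER       NOT NULL,
      recorded_at TIMESTAMP    NOT NULL,
      latitude    NUMBER(9,6),
      longitude   NUMBER(9,6)
    )
    -- one partition per day, created automatically as data arrives
    PARTITION BY RANGE (recorded_at) INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
    -- spread each day's rows across subpartitions keyed on the object
    SUBPARTITION BY HASH (object_id) SUBPARTITIONS 16
    (PARTITION p_initial VALUES LESS THAN (TIMESTAMP '2018-01-01 00:00:00'));

Queries that filter on recorded_at and object_id then prune down to a single day's subpartitions, and purging a week-old day becomes a cheap partition drop instead of a bulk DELETE.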

Is it good design to store 500 MB of data in a JVM cache that can be searched like a SQL query?

I have a requirement to return search results for a database table within 1 second, and the table is returning results slowly at the moment. The table has to be moved to a cache and searched from there so that the results come back fast. I want a Google-style page refresh on my existing search page, meaning the page should update as the user types.
To achieve this, the search results should return within one second. My database is Teradata, and its queries take 2 to 3 seconds at least. Hence I want to look at other options, like caching, so that the results come back fast.
Columns are
company , Id , Industry, parent ...4 more
It's a search page, so if the user types "ja", all items starting with "ja" are shown, like:
company        Id    Industry   parent
jaico          222   paints     Jaico asia
Jammy fruits   232   food       jammy International
The table contains 3.2 million rows, and there are 8 columns; the search needs to return all 8 columns. Byte-wise, at about 150 characters per row, the total is 3.2 million * 150 chars = 480 megabytes. I need to store this much data in a cache and then run SQL-like search queries (grouping, LIKE, ORDER BY) across it. What would be the best option to use in this case?
ehcache
jboss cache
Infinispan
Apache Lucene
Please suggest which option is good. Is it better to cache in memory or to use Lucene?
What needs to be cached? A table of 3.2 million rows with 8 columns.
Why is it to be cached? So that search results come back faster than with a SQL query; if I use a SQL query it takes a very long time, hence I want to move towards caching the data. For concreteness, the query shape to accelerate is sketched below.
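A minimal sketch of that query shape in SQL (the table name is a placeholder; the columns are from the example above):

    SELECT company, Id, Industry, parent
    FROM company_search            -- placeholder table name
    WHERE company LIKE 'ja%'       -- prefix match on what the user has typed
    ORDER BY company;

Whatever store is chosen has to answer this kind of prefix query, plus grouping and ordering, in well under a second.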
Take a look at Apache Solr - you can get that kind of performance with the right deployment. You can shard to distribute queries, for one thing.
