How do we send data from Elasticsearch to Google BigQuery? Is there a specific connector?
I have been looking into various options, and I will need the data to be available in Google BigQuery in real time.
I found the google_bigquery output plugin, which might be useful, but I have never used it personally.
Experiment with the settings depending on how much log data you generate, how quickly you need to see "fresh" data, and how much data you could afford to lose in the event of a crash. For instance, if you want to see recent data in BQ quickly, you could configure the plugin to upload data every minute or so (provided you have enough log events to justify that).
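I haven't used the plugin either, but if it helps, here is a rough sketch of the same idea in Python (not the plugin itself): pull recently indexed documents from Elasticsearch on a schedule and stream them into BigQuery. The index pattern, timestamp field, and destination table are assumptions for illustration only.

```python
# Illustrative only: periodically copy recent Elasticsearch documents into BigQuery.
# Index pattern, table id, and the timestamp field are assumptions for this sketch.
from elasticsearch import Elasticsearch, helpers
from google.cloud import bigquery

es = Elasticsearch("http://localhost:9200")
bq = bigquery.Client()
TABLE_ID = "my-project.my_dataset.log_events"  # hypothetical destination table

def ship_recent_logs():
    # Pull documents indexed in roughly the last minute.
    query = {"query": {"range": {"@timestamp": {"gte": "now-1m"}}}}
    rows = [hit["_source"] for hit in helpers.scan(es, index="logs-*", query=query)]
    if rows:
        # Streaming insert; returns a list of per-row errors (empty means success).
        errors = bq.insert_rows_json(TABLE_ID, rows)
        if errors:
            print("BigQuery insert errors:", errors)

if __name__ == "__main__":
    ship_recent_logs()  # run from cron (or a loop) at whatever cadence you need
```

Run it as often as your freshness requirement demands; anything buffered but not yet shipped is what you would lose in a crash.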
I have a web app that displays analysis data in the browser, with Elasticsearch as the backend data store.
Everything was cool while Elasticsearch was handling about 1 TB of data and search queries were blazing fast.
Then came the decision to add data from all services into the app, close to a petabyte, and we switched to BigQuery. (Yes, we abandoned Elasticsearch and started querying BigQuery directly.)
Now users of my app are complaining that their queries are slow: they take seconds (anywhere from 4 to 15), whereas results used to display in under a second.
Naturally the huge amount of data is to blame, but I am wondering if there is a way to bring Elasticsearch back into the game and make Elasticsearch and BigQuery play together nicely, so that I get the petabytes of storage from BigQuery but still retain the lightning-fast search of Elasticsearch.
I am sure I am not the first one to face this issue; rather, I believe I am a bit late to the BigQuery party, so I should be able to reap the benefits of a delayed entry by finding these problems already solved.
Thanks in advance if you can point me in the right direction.
This is a common pattern I see deployed by customers:
Use Elasticsearch to display results from the latest day/week - whatever fits within Elasticsearch's RAM.
Use BigQuery for everything else.
This way your users will get sub-second results for 90% of their queries, and they will still be able to go wherever they need to go when Elasticsearch can't answer within its resources.
I'm not sure what your users' interface for getting data is, but that's where this routing logic would need to live.
(of course, expect improvements in the connections and speed as tech progresses)
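A rough sketch of that routing logic, in Python for illustration; the retention window, function names, and each backend's query details are placeholders:

```python
# Minimal sketch of "hot tier in Elasticsearch, everything else in BigQuery" routing.
# Function names, the retention window, and the backends' query details are hypothetical.
from datetime import datetime, timedelta, timezone

ES_RETENTION = timedelta(days=7)  # whatever fits comfortably in Elasticsearch's RAM

def search(text, start, end):
    hot_cutoff = datetime.now(timezone.utc) - ES_RETENTION
    if start >= hot_cutoff:
        return search_elasticsearch(text, start, end)  # sub-second, recent data only
    return search_bigquery(text, start, end)           # slower, but covers all history

def search_elasticsearch(text, start, end):
    # e.g. a match query combined with a range filter on the timestamp field
    ...

def search_bigquery(text, start, end):
    # e.g. a SQL query filtered on the same time range over the full dataset
    ...
```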
Recently I have been working on a project that produces a huge amount of data every day. The project has two functions: one stores data into HBase for future analysis, and the other pushes data into Elasticsearch for monitoring.
As the data is huge, we would have to store it in two platforms (HBase and Elasticsearch)!
I have no experience with either of them. I want to know: is it possible to use Elasticsearch instead of HBase as persistent storage for future analytics?
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is first and foremost a search engine. Whether that is acceptable depends on whether your data is critical, or whether you can afford to lose some of it, as with non-critical logs.
If you don't want to run an additional database for such a large volume of data, you could store it as files in something like HDFS.
You should also check out Phoenix (https://phoenix.apache.org/), which may provide the monitoring features you are looking for.
I am working on an application that is basically an event-management application, where users can post different types of events. Events can be short-notice events, like a user throwing a birthday party in the next 2 hours, or pre-planned events like a marathon (this is just a dummy example).
The requirement is that when other users run a free-text search, they should get both kinds of results. The search will be more like LinkedIn search; besides events, there are other categories as well.
I was thinking of using Elasticsearch or Solr, but both are Lucene-based search engines. When I write a lot of data to these engines in real time, the data is first written to an in-memory buffer and then periodically flushed to disk as segments.
To optimize search performance, Lucene-based engines try to co-locate data by merging segments. Under frequent writes these merges become frequent, consume a lot of resources, and ultimately degrade search performance.
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
Is there any way I can achieve real-time free-text search without paying too heavy a performance cost for segment merging? Please suggest an architectural design that best suits this scenario.
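For context, one setting I have seen mentioned for this trade-off is the index refresh interval; a minimal sketch of relaxing it on a hypothetical Elasticsearch index via the settings REST endpoint, trading a little "freshness" for fewer small segments:

```python
# Sketch only: lengthen the refresh interval so fewer, smaller segments are created
# under heavy indexing. "events" is a hypothetical index name.
import requests

resp = requests.put(
    "http://localhost:9200/events/_settings",
    json={"index": {"refresh_interval": "30s"}},  # default is 1s; -1 disables refresh
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true}
```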
I am going through the Lambda architecture and trying to understand how it can be used to build fault-tolerant big-data systems.
I am wondering how the batch layer is useful when everything could be stored in the realtime view and results generated from it. Is it because realtime storage can't hold all of the data, or because it would then no longer be realtime, since the time taken to retrieve data grows with the amount of data stored?
Why a batch layer?
To save time and money!
It basically has two functions:
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
"Everything can be stored in the realtime view and the results generated out of it" - NOT TRUE
The above is certainly possible, but not feasible, as the data could be hundreds or thousands of petabytes and generating results could take time... a lot of time!
The key here is to attain low-latency queries over a large dataset. The batch layer is used to create batch views (so queries are served with low latency), and the realtime layer handles recent/updated data, which is usually small. Any ad-hoc query can then be answered by merging results from the batch views and the realtime views, instead of computing over the entire master dataset.
Also, think of the same query running again and again over a huge dataset... a loss of time and money!
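To make the merge step concrete, a toy sketch with assumed data shapes (both "views" are plain dicts here; in practice the batch view lives in a serving database and the realtime view in a streaming store):

```python
# Toy illustration of serving a query by merging a precomputed batch view
# with a small realtime view. The data shapes, names, and numbers are assumptions.

batch_view = {          # precomputed, e.g. nightly, over the full master dataset
    "2024-01-01": 1_250_000,
    "2024-01-02": 1_310_000,
}
realtime_view = {       # incremental counts since the last batch run
    "2024-01-02": 4_200,
    "2024-01-03": 9_800,
}

def pageviews(day):
    # Query = f(batch view) + f(realtime view); never a scan of the raw master data.
    return batch_view.get(day, 0) + realtime_view.get(day, 0)

print(pageviews("2024-01-02"))  # 1314200
print(pageviews("2024-01-03"))  # 9800
```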
Further to the answer provided by #karthik manchala, data processing can be handled in three ways: batch, interactive, and real-time/streaming.
I believe your reference to real-time is more about interactive responses than streaming, as not all use cases are streaming-related.
Interactive responses are those where the response can be expected anywhere from sub-second to a few seconds to minutes, depending on the use case. The key here is to understand that processing is done on data at rest, i.e. data already stored on a storage medium. The user interacts with the system while it processes and hence waits for the response. All the efforts around Hive on Tez, Impala, Spark core, etc. are aimed at this problem: making the responses as fast as possible.
Streaming, on the other hand, is where data streams into the system in real time - for example Twitter feeds, click streams, etc. - and processing needs to happen as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.
The case for batch processing is to address scenarios where some heavy lifting needs to be done on a huge dataset beforehand, so that the user is led to believe the responses they see are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing can run for minutes or possibly hours depending on the dataset. However, a user who queries the Solr index gets a response with sub-second latency; the indexing itself cannot be done in real time because there may be huge amounts of data. The same is true of Google search, where indexing is done in batch mode and the results are presented interactively.
All three modes of data processing are likely involved in any organisation grappling with data challenges. The Lambda Architecture addresses this challenge effectively by using the same data sources for multiple data-processing requirements.
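To make the Solr example concrete, here is a minimal sketch of the two halves - batch indexing versus interactive querying - using the pysolr client; the core URL and document fields are assumptions:

```python
# Sketch: the batch half (indexing) is slow and runs offline; the interactive half
# (querying) is sub-second. The Solr core URL and field names are hypothetical.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/documents", timeout=30)

def batch_index(docs):
    # docs: an iterable of dicts, e.g. {"id": "42", "title": "...", "body": "..."}
    # For a huge collection, this is the part that takes minutes or hours.
    solr.add(list(docs))
    solr.commit()

def interactive_search(text):
    # Served from the already-built index; latency is typically sub-second.
    return [hit["id"] for hit in solr.search(f"body:{text}", rows=10)]
```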
You can check out the Kappa Architecture, where there is no separate batch layer.
Everything is analyzed in the stream layer. You can use Kafka, in the right configuration, as the master dataset storage and save the computed data in a database as your view.
If you want to recompute, you can start a new stream-processing job, recompute your view from Kafka into your database, and replace the old view.
It is possible to use only the realtime view as the main storage for ad-hoc queries, but as already mentioned in other answers, when you have a lot of data it is faster to keep batch processing and stream processing separate rather than running batch jobs as stream jobs. It depends on the size of your data.
It is also cheaper to use storage like HDFS instead of a database for batch computation.
And one last point: in many cases you have different algorithms for batch and stream processing, so you need to keep them separate. But basically it is possible to use only the "realtime view" as both your batch and stream layer, even without using Kafka as the master dataset. It depends on your use case.
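A minimal sketch of the recompute idea using the kafka-python client; the topic, broker address, and the in-memory "view" are assumptions (a real job would write the view into your database and swap it in for the old one):

```python
# Sketch: rebuild a view by replaying a Kafka topic from the beginning.
# Topic and broker are assumptions; the dict stands in for a database-backed view.
import json
from collections import defaultdict
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                              # hypothetical topic holding the master dataset
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",          # start from the oldest retained record
    enable_auto_commit=False,
    consumer_timeout_ms=5000,              # stop iterating once the topic is drained
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

new_view = defaultdict(int)
for record in consumer:
    event = record.value                   # e.g. {"user": "alice", "action": "click"}
    new_view[event["user"]] += 1           # whatever aggregation the view needs

# At this point a real job would persist new_view and atomically replace the old view.
print(dict(new_view))
```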
I am working with Node.js and MongoDB.
I am going to have a database set up and use socket.io for real-time updates, which will either trigger the db to be queried again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information regarding what is being queried and what the real-time updates are:
A document in the database will include information such as address, city, time, number of packages, name, and price.
Filters include city/price/name/time (meaning only seeing addresses within the same city, or within the same time period).
Real-time info: adding a new document to the database will essentially notify the admin on the website that a new address has been added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
I'm trying to figure out the fastest way for the user to filter.
Also, if that differs from the most cost-effective way, then I'd like to know the most cost-effective way as well (which I am assuming means fewer db queries)...
It's hard to say because we don't see the exact filter conditions, but in general:
Mongo can use only one index per query. Thus, whatever fields are covered by that index can be used for efficient filtering; otherwise it may do a full collection scan, which is slow. If you are using an index, you are probably doing the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on the client side because Mongo can't do what you want, or doing it in Mongo would take too many queries.
The least efficient option is to store results somewhere else, simply because I/O is slow. That only benefits you if you use it as a cache and avoid recalculating.
Also consider the overhead and latency of the network: if you have to send lots of data back to the client, it will be slower. In general, Mongo will do a better job of filtering than you would on the client.
From what you describe, if you filter addresses by city within a time period, an index covering those fields could cut out a lot of documents. You most likely need a compound index - one over multiple fields.
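A minimal sketch of that compound index and a query that can use it, shown with pymongo for illustration (the collection name is hypothetical; the field names come from your description). The same index and query shape applies from the Node.js driver.

```python
# Sketch: compound index on (city, time) plus a filter query that can use it.
# The collection name "addresses" is a hypothetical example.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Equality field first, then the range field, so one index supports both predicates.
db.addresses.create_index([("city", ASCENDING), ("time", ASCENDING)])

cursor = db.addresses.find({
    "city": "Toronto",                                  # equality match
    "time": {"$gte": datetime(2024, 1, 1),              # range match on the same index
             "$lt": datetime(2024, 2, 1)},
})
for doc in cursor:
    print(doc["address"], doc["price"])
```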