Possible to store images in Elasticsearch?

Is it possible to store images in Elasticsearch clusters? If so, is there a resource describing the workflow? I checked the following link: https://github.com/kzwang/elasticsearch-image
Since we have to handle a large volume of image files (over 500 GB), we are planning to use HDFS.

Storing whole images in Elasticsearch is not very useful on its own: if an image is scaled or cropped and then used as a query, a byte-for-byte comparison will give incorrect results. What you need depends on why you want to index these images.
In my case, I need to find whether an image, after some scaling or cropping, has a close match in my database. I am extracting local descriptors (SIFT/SURF) from the images and using them to build an Elasticsearch index. This keeps the index small, since only a few features are stored instead of the whole image. For now I will store the images themselves on S3, and Elasticsearch will store the ids of these images along with the features extracted from them.
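A minimal sketch of that workflow, assuming the opencv-python (4.4+, which includes SIFT) and elasticsearch packages; the index name, field names, and the descriptor cap are made up for illustration:

```python
# Sketch: extract local descriptors with OpenCV and index them alongside an
# S3 reference, instead of storing the raw image in Elasticsearch.
# Assumes opencv-python >= 4.4 (ships SIFT) and a recent elasticsearch client.
import cv2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

def index_image_features(image_path, s3_key, index_name="image-features"):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)  # descriptors: N x 128

    # Store only a capped number of descriptors to keep documents small;
    # the full image stays in S3 and is referenced by its key.
    doc = {
        "s3_key": s3_key,
        "num_keypoints": len(keypoints),
        "descriptors": descriptors[:50].tolist() if descriptors is not None else [],
    }
    es.index(index=index_name, document=doc)

index_image_features("query.jpg", "images/query.jpg")
```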
Regarding elasticsearch-image: this plugin has not been updated in a while, and the most recent responses to its issues are from last year. It integrates LIRE with Elasticsearch; LIRE does the actual work of extracting multiple kinds of image fingerprints (features).
Possible solutions:
Integrate OpenCV (to compute feature vectors for an image) with Elasticsearch and build your own index from these image features instead of storing the whole image. For the product architecture, you can get some hints here.
Use an older version of Elasticsearch with a compatible version of elasticsearch-image.
Upgrade elasticsearch-image to work with the latest version of Elasticsearch.
You can also use Solr with the LireSolr plugin, which integrates the LIRE library into Solr.
UPDATE: This concerns the image-retrieval task, where you need to search for close matches to a query image. I recommend going through https://paperswithcode.com/task/image-retrieval. The best-performing approach there, Deep Local Features (DELF), is already integrated in TensorFlow.

Related

Is there a way to import data (CSV data) into the Winlogbeat Kibana dashboard?

I have just started learning about Elasticsearch and Kibana. I created a Winlogbeat dashboard where the logs are working fine. I want to import additional data (CSV data) that I generated using Python. I tried uploading the CSV file, but I am only allowed to create a separate index, not merge it with the Winlogbeat data. Does anyone know how to do this?
Thanks in advance
In many use cases, you don't need to actually combine the data into a single index. Here are a few ways you can show combined data, in approximate order of complexity:
Straightforward methods, using separate indices:
Use multiple charts on a dashboard
Use multiple indices in a single chart
More complex methods that combine data into a single index:
Pivot indices using Data Transforms
Combine at ingest-time
Roll your own
Use multiple charts on a dashboard
This is the simplest way: ingest your data into separate indices, make separate visualizations for them, then add those visualizations to one dashboard. If the events are time-based, this simple approach could be all you need.
Use multiple indices in a single chart
Lens, TSVB and Timelion can all use multiple data sources. (Vega can too, but that's playing on hard mode)
Here's an official Elastic video about how to do it in Lens: youtube
Create pivot indices using Data Transforms
You can use Elasticsearch's Data Transforms functionality to fetch, combine and aggregate your disparate data sources into a combined data structure which is then available for querying with Kibana. The official tutorial on Transforming the eCommerce sample data is a good place to learn more.
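For reference, creating and starting a pivot transform comes down to two REST calls. A rough sketch in Python; the index names, the group-by field, and the aggregation are placeholders for your own data:

```python
# Sketch: define and start a continuous pivot transform that aggregates a
# source index into a summary destination index. Index and field names are
# placeholders; adjust them to your own data.
import requests

ES = "http://localhost:9200"

transform = {
    "source": {"index": ["winlogbeat-*"]},
    "dest": {"index": "winlogbeat-summary"},
    "pivot": {
        "group_by": {"host": {"terms": {"field": "host.name"}}},
        "aggregations": {"event_count": {"value_count": {"field": "event.code"}}},
    },
    "frequency": "5m",
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
}

requests.put(f"{ES}/_transform/winlogbeat-summary", json=transform).raise_for_status()
requests.post(f"{ES}/_transform/winlogbeat-summary/_start").raise_for_status()
```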
Combine at ingest-time
If you have (or can add) Logstash in the mix, you have several options for combining datasets during the filter phase of your pipelines:
Using a file-based lookup table and the translate filter plugin
By waiting for related documents to come in then outputting a combined document to Elasticsearch with the aggregate filter plugin
Using external lookups with filter plugins like elasticsearch or http
Executing arbitrary ruby code using the ruby filter plugin
Roll your own
If you're generating the CSV file with a Python program, you might want to think about using the Python Elasticsearch DSL library to query the Winlogbeat data, combine it with your own data in Python, and then ingest it in its combined state (whether via a CSV or by other means).
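A rough sketch of that approach, assuming the elasticsearch and elasticsearch-dsl packages; the CSV column used as a join key (hostname) and the destination index name are invented for illustration:

```python
# Sketch: query the winlogbeat data with Elasticsearch DSL, join it with rows
# from a local CSV in Python, and bulk-index the combined documents into a
# new index. Field names and the join key (host.name) are illustrative.
import csv
from elasticsearch import Elasticsearch, helpers
from elasticsearch_dsl import Search

es = Elasticsearch("http://localhost:9200")

# Load the CSV into a lookup table keyed on an assumed "hostname" column.
with open("extra_data.csv", newline="") as f:
    extra_by_host = {row["hostname"]: row for row in csv.DictReader(f)}

# Pull matching winlogbeat events and enrich them with the CSV fields.
search = Search(using=es, index="winlogbeat-*").filter("exists", field="host.name")

def combined_docs():
    for hit in search.scan():
        doc = hit.to_dict()
        doc.update(extra_by_host.get(doc["host"]["name"], {}))
        yield {"_index": "winlogbeat-enriched", "_source": doc}

helpers.bulk(es, combined_docs())
```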
Basically, Winlogbeat is a data shipper to Elasticsearch: it ships Windows-specific data to an index named winlogbeat with a specific schema and document structure.
You can't merge documents with a different schema into the winlogbeat index.
If your goal is to correlate different data points, use the Time Series Visual Builder (TSVB) to overlay the two datasets in one visualization.

Working with NLP tags in Elasticsearch

Working on a large data-oriented search product powered by elasticsearch. We've built a lot of machine learning functionality on top of this app, but currently we're having some difficulty deciding how to integrate fairly standard NLP-based word tags into our ES index.
Currently we have a tagging service that can annotate a word with a respective type (or types, but one may be enough for now). This function could be abstracted to: type = getWordType(word). I imagine there must be a way to integrate this tagging service into the analysis chain that is applied at index time, where, perhaps, we tell the index what type a particular word belongs to. However, doing this kind of advanced analysis is a bit beyond my Elasticsearch experience. Does anyone have pointers on this kind of advanced analysis in Elasticsearch?
Thanks!
You might want to take a look at the ingest node functionality introduced in Elasticsearch 5.0. It allows you to preprocess your documents and add fields to the JSON before the document is indexed in Elasticsearch.
I wrote an ingest processor that is using OpenNLP to enrich documents. You could take a look at that one and adapt it to your needs (also, pull requests are very welcome).
Check it out at https://github.com/spinscale/elasticsearch-ingest-opennlp
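Independent of that particular plugin, the general mechanism looks like this: define an ingest pipeline, then index documents through it. The sketch below uses only a built-in set processor as a stand-in; a custom processor such as the OpenNLP one above would go in the same processors list:

```python
# Sketch: a minimal ingest pipeline that enriches documents at index time.
# The "set" processor just adds a hard-coded field; an NLP processor would be
# added as another entry in the "processors" list. Names are placeholders.
import requests

ES = "http://localhost:9200"

pipeline = {
    "description": "add word-type tags before indexing",
    "processors": [
        {"set": {"field": "tags.source", "value": "nlp-tagger"}}
    ],
}
requests.put(f"{ES}/_ingest/pipeline/nlp-tags", json=pipeline).raise_for_status()

# Index a document through the pipeline.
doc = {"text": "Elasticsearch is developed in Java"}
requests.post(f"{ES}/my-index/_doc?pipeline=nlp-tags", json=doc).raise_for_status()
```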
This is achieved in Elasticsearch 6.5 with the type annotated_text: https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/mapper-annotated-text-usage.html
Essentially, kind of like synonyms, the tags (or named entity IDs, etc) can exist at the same position as the word you’re tagging.
It needs a plugin installed: the Mapper Annotated Text plugin.
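A small sketch of what that looks like in practice on a recent (7.x/8.x) cluster with the plugin installed; the index name, field name, and annotation values are arbitrary:

```python
# Sketch: map a field as annotated_text and index a document whose tags are
# embedded inline with a markdown-link-like syntax, so the annotation sits at
# the same token position as the annotated word (much like a synonym).
# Requires the mapper-annotated-text plugin; names here are arbitrary.
import requests

ES = "http://localhost:9200"

mapping = {"mappings": {"properties": {"body": {"type": "annotated_text"}}}}
requests.put(f"{ES}/tagged-docs", json=mapping).raise_for_status()

# "[Paris](LOCATION)" indexes the token "Paris" and the annotation "LOCATION"
# at the same position.
doc = {"body": "The conference was held in [Paris](LOCATION) last [May](DATE)."}
requests.post(f"{ES}/tagged-docs/_doc", json=doc).raise_for_status()
```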

Summarization in Elasticsearch

I am a newbie to Elasticsearch. We are currently using Splunk platform for our analytics application and looking to migrate to ELK. Splunk provides options to schedule searches to run in background periodically and to store the search results in a separate summary index. Is similar functionality available in Elasticsearch? If so, please point me to the documentation containing the process.
Thanks,
Keerthana
This is a great use case. Elasticsearch can certainly perform such tasks, but it is more manual: you have to write your own script. For example, if you want to summarize data, you can run Elasticsearch aggregations, take the result (which comes back as JSON), and store it in an index where you keep summary data. That way, even if you delete your raw data, your summary data lives on.
Elasticsearch comes with different clients. I like to use the Python Elasticsearch DSL library.
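A bare-bones sketch of that pattern with a recent (8.x) Python client; the index names, fields, and interval are placeholders, and the script would be run periodically from cron or any scheduler:

```python
# Sketch: run an aggregation over the raw index and write the bucketed result
# back into a summary index. Index, field, and interval names are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="raw-logs-*",
    size=0,
    aggs={
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {"errors": {"filter": {"term": {"level": "error"}}}},
        }
    },
)

# Turn each histogram bucket into a summary document.
summary_docs = (
    {
        "_index": "logs-summary",
        "_source": {
            "timestamp": bucket["key_as_string"],
            "doc_count": bucket["doc_count"],
            "error_count": bucket["errors"]["doc_count"],
        },
    }
    for bucket in resp["aggregations"]["per_hour"]["buckets"]
)
helpers.bulk(es, summary_docs)
```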

How to build lucene index in a mapReduce way?

I am building a small image-similarity search application with Hadoop. I decided to use LIRE, whose demo code uses a Lucene IndexWriter to write the index to local disk. What I have done so far is make my reducers generate the LIRE records, but how do I make the reducers write these records to a Lucene index in HDFS? I googled and found some tools like SolrCloud and Blur, but there is no good documentation or code example showing how to do it.
Does anyone know a good reference?
PS. I noticed there is a question with a similar title, but it is from 3 years ago and the answers are not clear.
If you are using Solr 4.7, there is an option to index into HDFS using the Kite Morphlines code. This is part of the Solr distribution now (4.7 and later). Look at this JIRA issue for more information: https://issues.apache.org/jira/browse/SOLR-5729
Also look at the earlier git repository https://github.com/markrmiller/solr-map-reduce-example

How can I copy hadoop data to SOLR

I have a Solr search setup which uses a Lucene index as a backend.
I also have some data in Hadoop I would like to use.
How do I copy this data into Solr?
Upon googling, the only links I can find tell me how to use an HDFS index instead of a local index in Solr.
I don't want to read the index directly from Hadoop; I want to copy the data to Solr and read it from there.
How do I copy it? And it would be great if there is some incremental copy mechanism.
If you have a standalone Solr instance, then you could face some scaling issues, depending on the volume of data.
I am assuming a high volume, given that you are using Hadoop/HDFS. In that case, you might need to look at SolrCloud.
As for reading from HDFS, here is a tutorial from LucidImagination that addresses this issue and recommends the use of Behemoth.
You might also want to look at the Katta project, which claims to integrate with Hadoop and provide near real-time read access to large datasets. The architecture is illustrated here.
EDIT 1
Solr has an open ticket for this. Support for HDFS is scheduled for Solr 4.9. You can apply the patch if you feel like it.
You cannot just copy custom data to Solr; you need to index* it. Your data may have any type and format (free text, XML, JSON or even binary data). To use it with Solr, you need to create documents (flat maps with key/value pairs as fields) and add them to Solr. Take a look at this simple curl-based example.
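If that curl example goes stale, here is roughly what it amounts to in Python; the Solr URL, core name, and document fields are placeholders:

```python
# Sketch: turn your records into flat documents and post them to Solr's JSON
# update handler. The Solr URL, core name, and document fields are placeholders.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/my_core/update?commit=true"

docs = [
    {"id": "1", "title": "first document", "body": "some text extracted from HDFS"},
    {"id": "2", "title": "second document", "body": "more text"},
]

# Solr accepts a JSON array of documents on /update when the content type is JSON.
resp = requests.post(SOLR_UPDATE, json=docs)
resp.raise_for_status()
```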
Note, that reading data from HDFS is a different question. For Solr, it doesn't matter where you are reading data from as long as you provide it with documents.
Storing index on local disk or in HDFS is also a different question. If you expect your index to be really large, you can configure Solr to use HDFS. Otherwise you can use default properties and use local disk.
* - "Indexing" is a common term for adding documents to Solr, but in fact adding documents to Solr internal storage and indexing (making fields searchable) are 2 distinct things and can be configured separately.
