Indexing .png/JPG/PDF files in Elasticsearch from a file server - elasticsearch

I have a search-based requirement. I am able to index Oracle database tables into Elasticsearch using Logstash. In the same way, I now have to index PNG/JPG/PDF files that currently sit on a file server.
I am using Elasticsearch version 6.2.3. Does anyone have an idea about indexing files from a file server into Elasticsearch?
Purpose - why I am looking at indexing PNG/JPG/PDF:
I have to search for and display products with their product information, and along with that I also have to display the product picture, which is stored on the file server.
I have a feature to search for documents (PDF), so if I search with any keyword, it should also search the contents of the documents and return those documents as search results. Here the documents' file paths are available in the DB; only the files themselves are on the file server.
For these two purposes, I am looking at indexing PNG/JPG/PDF files.

You just have to get the bytes from your image (you can do that in any programming language) and then save them in a field of the binary type. But that is not a good idea; it is better to save a link to the image.
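For illustration, here is a minimal Python sketch of both options using the official elasticsearch-py client against a 6.x cluster; the index name, field names, and file paths are made up for the example:

```python
import base64
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Option 1: embed the image bytes in a base64-encoded "binary" field.
# This assumes the index mapping declares "image_data" with type "binary";
# it works, but it bloats the index and every replica of it.
with open("/fileserver/products/p-1001.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

es.index(index="products", doc_type="_doc", id="1001", body={
    "name": "Sample product",
    "image_data": encoded,
})

# Option 2 (what the answer recommends): index only a link to the image
# and let the application fetch the picture from the file server.
es.index(index="products", doc_type="_doc", id="1002", body={
    "name": "Another product",
    "image_url": "http://fileserver/products/p-1002.png",
})
```

For the PDF full-text requirement, raw bytes are not searchable; Elasticsearch's ingest-attachment plugin (based on Apache Tika) can extract text from a base64-encoded PDF at index time via an ingest pipeline, and fscrawler (mentioned further down in this thread) automates the same idea for whole directories.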

Related

ElasticSearch as primary DB for document library

My task is a full-text search system for a really large number of documents. Right now I have the documents as RTF files plus their metadata, so all of this will be indexed in Elasticsearch. These documents are unchangeable (they can only be deleted) and I don't really expect many new documents per day. So is it a good idea to use Elasticsearch as the primary DB in this case?
Maybe I'll store the RTF files separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here, so it's a good case for Elasticsearch as the primary DB.
Elastic is better known as a distributed full-text search engine than as a database...
If you preserve the document _source it can be used as a database, since almost any time you decide to apply document changes or mapping changes you need to re-index the documents in the index (the equivalent of a table in the relational world); there is no way to update parts of the Lucene inverted index, you have to re-index the whole document...
Elasticsearch's index survival mechanism is one of the best: if you lose a node, the index's lost replicas are automatically replicated to some of the other available nodes, so you don't need to do any manual operations...
If you do regular backups and have no requirement for the data to be available 24/7, it is completely acceptable to hold the data and the full-text index in Elasticsearch as in a database...
But if you need a highly available combination, I would recommend keeping the documents in MongoDB (known as one of the best distributed document stores), for example, and using Elasticsearch only for its original purpose as a full-text search engine...
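As a rough sketch of what "primary DB" usage looks like in practice (Python client, hypothetical index name and document): the preserved _source is what you read back, and replica shards provide the survival behaviour described above.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Replica shards give the survival behaviour described above: if the node
# holding a primary shard is lost, a replica elsewhere is promoted.
es.indices.create(index="documents", body={
    "settings": {"number_of_shards": 3, "number_of_replicas": 2}
})

es.index(index="documents", doc_type="_doc", id="42", body={
    "title": "Contract 42",
    "body": "full text extracted from the RTF goes here",
})

# Using Elasticsearch as the primary store means the preserved _source
# is the system of record: GET returns the indexed JSON verbatim.
doc = es.get(index="documents", doc_type="_doc", id="42")
print(doc["_source"])
```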

How about including a JSON doc version? Is it possible for Elasticsearch to include different versions of JSON docs, to save and to search?

We are using Elasticsearch to save and manage information on complex transactions. We might need to add more information to every transaction in the near future.
How about including a JSON doc version?
Is it possible for Elasticsearch to save and search different versions of JSON docs?
How does this affect performance in Elasticsearch?
It's completely possible. By default Elasticsearch uses dynamic mapping for every new document, such as your JSON documents, to index them. For each field in your documents Elasticsearch builds an inverted index, and search queries are executed against it, so regardless of how your fields vary, as long as you know which field you want to query against, throughput and performance will not be affected.
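A small sketch of what that looks like (Python client, hypothetical "transactions" index and fields): two schema versions coexist in one index, and dynamic mapping picks up the new field the first time it appears.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# v1 of the transaction document
es.index(index="transactions", doc_type="_doc", id="t1", body={
    "doc_version": 1,
    "amount": 120.50,
})

# v2 adds a field; dynamic mapping extends the index automatically
es.index(index="transactions", doc_type="_doc", id="t2", body={
    "doc_version": 2,
    "amount": 75.00,
    "merchant_category": "grocery",
}, refresh="true")  # make the doc visible to search immediately (demo only)

# Queries against the new field simply never match the v1 documents
res = es.search(index="transactions", body={
    "query": {"match": {"merchant_category": "grocery"}}
})
print(res["hits"]["total"])
```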

How to use Elasticsearch to make files in a directory searchable?

I am very new to search engines and Elasticsearch, so please bear with me and apologies if this question sounds vague. I have a large directory with lots of .csv and .hdr files, and I want to be able to search the text within these files. I've done the tutorials and read some of the documentation, but I'm still struggling to understand the concept of indexing. It seems like all the tutorials show you how to index one document at a time, but this will take a long time as I have lots of files. Is there an easier way to make Elasticsearch index all the documents in this directory and be able to search for what I want?
Elasticsearch can only search documents it has indexed. Indexed means Elasticsearch has consumed the documents one by one and stored them internally.
Normally the internal structure matters, and you should understand what you're doing to get the best performance.
So you need a way to get your files into Elasticsearch; I'm afraid there is no "one click" way to achieve this...
You need:
1) a running cluster
2) an index designed for the documents
3) a way to get the documents from the filesystem into Elasticsearch
Your question is focused on 3).
For this, search for script examples or tools that can crawl your directory and feed the documents to Elasticsearch (a minimal script sketch follows after the links below).
5 seconds of using Google brought me to
https://github.com/dadoonet/fscrawler
https://gist.github.com/stevehanson/7462063
Theoretically it could be done with Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), but I would give fscrawler a try.
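If you would rather script it than run fscrawler, a minimal crawler is only a few lines. This sketch (Python client, hypothetical index name "files" and directory path) walks the directory and bulk-indexes each .csv/.hdr file's text as one document:

```python
import os
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def docs_from_dir(root):
    """Yield one bulk action per .csv/.hdr file found under `root`."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".csv", ".hdr")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                yield {
                    "_index": "files",
                    "_type": "_doc",
                    "_source": {"path": path, "content": f.read()},
                }

# One bulk call streams the whole directory in batches.
bulk(es, docs_from_dir("/data/my_directory"))
```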

Where is the original JSON document stored on disk in Lucene through Elasticsearch (Linux/Ubuntu)?

After reading blogs I came to know that the data is stored in Lucene and only the metadata is stored in Elasticsearch.
When we index a doc through Elasticsearch, we store the inverted index in a segment, but somewhere we also need to store the JSON doc that will be retrieved. I'm unable to figure out the location on disk, even after going through several blogs.
Note - /var/lib/elasticsearch/data (mentioned in the official docs and in a Stack Overflow question) does not exist on Ubuntu.
Thanks in advance
You're not supposed to store/retrieve the documents on disk directly; you have to go through one of the Elasticsearch APIs to store/retrieve your documents.
I think the documents are converted to a binary format. And I think the decision to hide the files behind the API is to protect the integrity of the content, especially when working with replicated content (so that all nodes have the same content), and to make sure the inverted index always reflects the real content on disk.
[citation needed]
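To make the API-only point concrete, here is a sketch (Python client; index name and id are made up) that asks the cluster where its data directories actually are, and then retrieves a document the supported way:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Ask the cluster for its real on-disk data paths instead of guessing;
# GET _nodes/stats/fs reports them per node.
stats = es.nodes.stats(metric="fs")
for node in stats["nodes"].values():
    for data_path in node["fs"]["data"]:
        print(data_path["path"])

# The segment files under that path are a binary Lucene format; to read
# a document back, go through the API, which returns the stored _source.
print(es.get_source(index="myindex", doc_type="_doc", id="1"))
```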

Can ElasticSearch create/store just the indexes while leaving the source document where it is?

Assuming I already have a set of documents living in some document store can I have ElasticSearch create its indexes and store them in its various replicated nodes while leaving the documents themselves where they are? In other words can I use ES just for search and not for storage? (I understand this might not be ideal but assume there are good reasons I need to keep the documents themselves where they are).
If I take this approach does it remove any functionality from search, for example showing where in a document the search term was found?
Thanks.
The link Konstantin referenced should show you how to disable _source.
There is another way to store fields (store=true). You are better off using _source and excluding any specific fields you don't want stored as part of _source, though.
Functionality removed:
Viewing fields that are returned from search
Highlighting
Easily rebuilding an index from _source. Probably not an issue, since data is stored elsewhere
There are probably other features I am missing.
The only case I've come across where I really don't need _source is when building an analytics engine where I am only returning aggregates (term and histogram).
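For reference, a sketch of the mapping choices discussed above (Python client, 6.x-style mappings with a hypothetical _doc type and made-up field names): either disable _source entirely, or keep it and exclude the heavy fields.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Search-only index: fields are indexed (searchable) but _source is not
# stored, so hits come back with ids and scores, without the original JSON
# and without highlighting.
es.indices.create(index="search_only", body={
    "mappings": {
        "_doc": {
            "_source": {"enabled": False},
            "properties": {
                "title": {"type": "text"},
                "body": {"type": "text"},
            },
        }
    }
})

# Usually better: keep _source but exclude the fields you don't want stored.
es.indices.create(index="mostly_source", body={
    "mappings": {
        "_doc": {
            "_source": {"excludes": ["huge_blob"]},
        }
    }
})
```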
