Is there some way to get elasticsearch index data from RAM? - elasticsearch

I have a 60 GB text file and I want to search by a text field in it. My plan is to put the file into Elasticsearch and set up search there.
But it might be that searching the text file would be quicker if the file were read from RAM.
So the question is: is there some way to load the Elasticsearch index into RAM and search it there? That would help me compare the speed of searching in Elasticsearch against searching a file held in memory (JSON, .pickle, or another format).
I have tried reading from the .pickle file using Python.
The version of Elasticsearch is 7.1.

No, it is not. In the first versions of ES (see https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-store.html) it was possible, but not anymore. You should rely on ES to cache the content that is used most frequently; there is nothing you can do to tell it to keep the index contents in memory.
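If the goal is simply to compare the two approaches, a rough timing sketch like the one below may be enough. It is only an illustration: the index name ("docs"), the "text" field, the records.pickle file and the query string are all assumptions, and it uses the official elasticsearch Python client against a local 7.x node.

    # Rough timing sketch (not the original poster's code). Assumes a local ES 7.x node,
    # an existing index "docs" with a "text" field, and a records.pickle file containing
    # a list of dicts -- all of these names are made up for illustration.
    import pickle
    import time

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # defaults to http://localhost:9200 in the 7.x client

    with open("records.pickle", "rb") as f:
        records = pickle.load(f)  # the whole dataset held in RAM

    query = "some phrase"

    start = time.perf_counter()
    in_memory_hits = [r for r in records if query in r.get("text", "")]
    print("in-memory scan:", time.perf_counter() - start, "s,", len(in_memory_hits), "hits")

    start = time.perf_counter()
    resp = es.search(index="docs", body={"query": {"match": {"text": query}}})
    print("elasticsearch:", time.perf_counter() - start, "s,",
          resp["hits"]["total"]["value"], "hits")

Keep in mind that after a few runs the operating system's file-system cache will hold the hot parts of the ES index anyway, so the comparison largely measures query overhead rather than raw disk speed.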

Related

Indexing .png/JPG/PDF files in elastic search from fileserver

I have a search-based requirement. I am able to index Oracle database tables into Elasticsearch using Logstash. In the same way, I have to index PNG/JPG/PDF files that currently sit on a file server.
I am using Elasticsearch version 6.2.3. Does anyone have an idea how to index files from a file server into Elasticsearch?
Purpose - why I am looking to index PNG/JPG/PDF:
I have to search for and display products with their product information, and along with that I also have to display the product picture, which is stored on the file server.
I have a feature to search for documents (PDF), so if I search with any keywords, it should also search the contents of the documents and return those documents as results. The documents' file paths are available in the DB; only the files themselves are on the file server.
For these two purposes, I am looking to index PNG/JPG/PDF files.
You just have to get the bytes of your image (you can do that in any programming language) and then save them in a field of the binary type. But that is not a good idea; it is better to store a link to the image.
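For illustration, here is a hedged sketch of both options with the Python client. The index name ("products"), the field names and the file path are invented; on a 6.x cluster the mapping would additionally need a mapping type (e.g. "_doc") around "properties". Extracting searchable text out of PDFs is a separate concern (typically handled with the ingest-attachment plugin or an external extractor).

    # Sketch only: the "products" index, field names and paths are placeholders.
    import base64

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.create(index="products", body={
        "mappings": {"properties": {
            "name":       {"type": "text"},
            "image_data": {"type": "binary"},   # stored base64 bytes, not searchable
            "image_url":  {"type": "keyword"},  # recommended: just a link/path
        }}
    })

    # Option 1 (works, but discouraged): store the raw bytes as base64 in a binary field.
    with open("/fileserver/products/1234.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    es.index(index="products", id=1, body={"name": "Sample product", "image_data": encoded})

    # Option 2 (recommended): store only the link and let the application fetch the file.
    es.index(index="products", id=2, body={
        "name": "Sample product",
        "image_url": "https://fileserver.example.com/products/1234.png",
    })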

Elasticsearch lucene, understand code path for search

I want to understand how each of the Lucene index files (nvd, dvd, tim, doc - mainly these four) is used in an ES query.
For example, say my index has ten docs and I am running an aggregation query. I would like to understand how ES/Lucene accesses these four files for a single query.
I am trying to see whether I can make some optimizations in my system, which is mostly disk-heavy, to speed up query performance.
I looked at the ES code and understand that the QueryPhase is the most expensive part; it seems to do a lot of random disk access for the log-oriented data I have.
I now want to dive deeper at the Lucene level as well, possibly debug the code and see it in action. The Lucene code has zero log messages in the IndexReader-related classes. Also, debugging the Lucene code directly seems unhelpful, since the unit tests don't create indexes with tim, doc, nvd, dvd files.
Any pointers?
As far as I know, ES doesn't expose much about search internals. If you want to optimize search, my experience is to optimize your data layout. Here is a description of some important Lucene files
(see http://lucene.apache.org/core/7_2_1/core/org/apache/lucene/codecs/lucene70/package-summary.html#package.description):
Term Index (.tip) # IN MEMORY.
Term Dictionary (.tim) # ON DISK.
Frequencies (.doc) # ON DISK.
Per-Document Values (.dvd, .dvm), very useful for aggregations. # ON DISK.
Field Index (.fdx) # IN MEMORY.
Field Data (.fdt), where the document data is finally fetched from disk. # ON DISK.
And there are some points that can improve performance (a minimal mapping sketch combining them follows this list):
Try to use small data types, for example INTEGER or LONG values instead of STRING.
Disable DocValues on fields that don't need them, and enable DocValues on the fields you want to sort or aggregate on.
Include only the necessary fields in the source, like "_source": { "includes": ["some_necessary_field"] }.
Only index the fields you need, using explicitly defined ES mappings.
Split your data across multiple indices.
Add SSDs.
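As referenced above, here is a minimal mapping sketch illustrating the DocValues, _source and "only index what you need" points. The index name ("logs") and every field name are invented for illustration, and it assumes the Python client against a 7.x cluster.

    # Sketch only: "logs" and all field names are placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.create(index="logs", body={
        "mappings": {"properties": {
            "status":   {"type": "integer"},                       # numeric instead of string
            "trace_id": {"type": "keyword", "doc_values": False},  # never sorted/aggregated on
            "host":     {"type": "keyword"},                       # doc_values on (default), used in aggs
            "message":  {"type": "text", "index": False},          # kept in _source, not searchable
        }}
    })

    # Fetch only the fields you actually need from _source at query time.
    resp = es.search(index="logs", body={
        "_source": {"includes": ["status", "host"]},
        "query": {"term": {"host": "web-01"}},
        "aggs": {"by_status": {"terms": {"field": "status"}}},
    })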

How to use Elasticsearch to make files in a directory searchable?

I am very new to search engines and Elasticsearch, so please bear with me and apologies if this question sounds vague. I have a large directory with lots of .csv and .hdr files, and I want to be able to search text within these files. I've done the tutorials and read some of the documentation but I'm still struggling to understand the concept of indexing. It seems like all the tutorials show you how to index one document at a time, but this will take a long time as I have lots of files. Is there an easier way to make elasticsearch index all the documents in this directory and be able to search for what I want?
Elasticsearch can only search documents it has indexed. Indexed means Elasticsearch has consumed the documents one by one and stored them internally.
Normally the internal structure matters, and you should understand what you're doing to get the best performance.
So you need a way to get your files into Elasticsearch; I'm afraid there is no "one-click way" to achieve this...
You need:
1. A running cluster
2. An index designed for the documents
3. A way to get the documents from the filesystem into Elasticsearch
Your question is focused on 3).
For this, search for script examples or tools that can crawl your directory and provide Elasticsearch with documents.
5 seconds of using Google brought me to
https://github.com/dadoonet/fscrawler
https://gist.github.com/stevehanson/7462063
Theoretically it could be done with Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), but I would give fscrawler a try.
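If a hand-rolled script is enough, a minimal sketch along these lines (using the Python client's bulk helper) could crawl the directory and index the file contents as plain text. The directory path and the index name ("files") are placeholders; fscrawler remains the more complete option.

    # Sketch only: walk a directory, read each .csv/.hdr file as text, bulk-index the contents.
    from pathlib import Path

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()

    def generate_docs(root):
        for path in Path(root).rglob("*"):
            if path.suffix.lower() in {".csv", ".hdr"}:
                yield {
                    "_index": "files",
                    "_source": {
                        "path": str(path),
                        "filename": path.name,
                        "content": path.read_text(errors="ignore"),
                    },
                }

    bulk(es, generate_docs("/data/my_directory"))

    # Full-text search across the indexed file contents.
    resp = es.search(index="files", body={"query": {"match": {"content": "some text"}}})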

where is the original json document stored in lucene through elasticsearch on disk(linux-ubuntu)

After reading blogs I came to know that the data is stored in Lucene and only metadata is stored in Elasticsearch.
When we index a doc through Elasticsearch, we store the inverted index in a segment, but somewhere we also need to store the JSON doc that will be retrieved. I'm unable to figure out the location on disk, even after going through several blogs.
Note: /var/lib/elasticsearch/data (mentioned in the official docs and in Stack Overflow questions) does not exist on my Ubuntu system.
Thanks in advance
You're not supposed to store/retrieve the document directly on disk; you have to go through one of the Elasticsearch APIs to store/retrieve your document.
I think the documents are converted to a binary format. And I think the decision to hide the files behind the API is to protect the integrity of the content, especially when working with replicated content (so that all nodes have the same content) and in order to be sure that the inverted index always reflects the real content on disk.
[citation needed]
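To make the point concrete, this is a minimal sketch of fetching the original JSON back through the API instead of looking for it in the data directory; the index name and document id are examples only.

    # Retrieve the original JSON via the GET API; "my-index" and the id are placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    doc = es.get(index="my-index", id="1")
    print(doc["_source"])  # the original JSON document, kept internally in the _source field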

How can I copy hadoop data to SOLR

I have a Solr search which uses a Lucene index as a backend.
I also have some data in Hadoop I would like to use.
How do I copy this data into Solr?
Upon googling, the only links I can find tell me how to use an HDFS index instead of a local index in Solr.
I don't want to read the index directly from Hadoop; I want to copy the data to Solr and read it from there.
How do I copy it? And it would be great if there were some incremental copy mechanism.
If you have a standalone Solr instance, then you could face some scaling issues, depending on the volume of data.
I am assuming high volume given you are using Hadoop/HDFS, in which case you might need to look at SolrCloud.
As for reading from HDFS, here is a tutorial from LucidImagination that addresses this issue and recommends the use of Behemoth.
You might also want to look at the Katta project, which claims to integrate with Hadoop and provide near real-time read access to large datasets. The architecture is illustrated here.
EDIT 1
Solr has an open ticket for this. Support for HDFS is scheduled for Solr 4.9. You can apply the patch if you feel like it.
You cannot just copy custom data to Solr; you need to index* it. Your data may be of any type and format (free text, XML, JSON or even binary data). To use it with Solr, you need to create documents (flat maps with key/value pairs as fields) and add them to Solr. Take a look at this simple curl-based example.
Note that reading data from HDFS is a different question. For Solr, it doesn't matter where you are reading data from as long as you provide it with documents.
Storing index on local disk or in HDFS is also a different question. If you expect your index to be really large, you can configure Solr to use HDFS. Otherwise you can use default properties and use local disk.
* - "Indexing" is a common term for adding documents to Solr, but in fact adding documents to Solr internal storage and indexing (making fields searchable) are 2 distinct things and can be configured separately.
