Given the following dataset input format: TextA TextB
Is it possible to use a single Hadoop MapFile to provide indexing (binary search support) on the first column (TextA) and also on the second one (TextB)?
The idea would be to have the same data folder, but with different index files.
You can't: the data file MUST be sorted by key.
If you visualize how MapFile is implemented, you will see why it cannot work:
- The large data file is sorted by key.
- The index file contains every Nth key and is loaded into memory.
- When you do a get, the two neighboring index keys that bracket the requested key are found in memory; the sorted data file is then searched between those positions (that's why it must be sorted by key).
How would you meet the sorting requirement for two different columns with a single data file?
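A minimal sketch of that constraint, using Hadoop's classic MapFile.Writer / MapFile.Reader API (the directory name and keys below are made up): append() itself enforces a single ascending key order, so one data file can only ever serve one index.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);

            // Writing: keys must arrive in ascending order; an out-of-order
            // append() throws an IOException. One data file, one sort order.
            MapFile.Writer writer =
                new MapFile.Writer(conf, fs, "demo.map", Text.class, Text.class);
            writer.append(new Text("alpha"), new Text("value1"));
            writer.append(new Text("beta"), new Text("value2"));
            writer.close();

            // Reading: get() consults the in-memory index, then searches the
            // sorted data file for the exact key.
            MapFile.Reader reader = new MapFile.Reader(fs, "demo.map", conf);
            Text value = new Text();
            reader.get(new Text("beta"), value);
            System.out.println(value); // prints "value2"
            reader.close();
        }
    }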
Suppose there is an index in Elasticsearch. I want that, as soon as I enter data into that index, only 10 records are stored per file; after that, a new file is created and the next 10 records are stored in the newly created file, and so on, automatically. How can I achieve that? I am storing data in one file, but I can't figure out how to split it after a certain amount of data.
I have an issue with my index, and on ES startup I get an

    org.elasticsearch.index.mapper.MapperParsingException -- tried to parse field [null] as object, but found a concrete value

thus ES is not starting at all.
The data I have is of no importance. Is there a way to manually delete the index altogether (mapping and data)? Or, if not, to just update the index mapping?
I am not sure if that's a good idea, but you can try deleting the 'indices' folder. This will delete all the indices, so be careful.
I have two Elasticsearch clusters, one with 3 indexes and the other empty, so the folder structures look like this.
The one with 3 indexes:

    ls "the data directory path from elasticsearch.yml"/nodes/0/indices
    11RicU32QMK1r5Hu89ktKg  FViegU6eTWOti8_bMQSMww  YVw4MImcSlCeM5lqWlXW3w

As you can see, the index names are obfuscated.
The one with no indexes:

    ls "the data directory path from elasticsearch.yml"/nodes/0/
    node.lock  _state

The second one has no 'indices' folder.
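For completeness, a minimal sketch of scripting that deletion (the data path below is made up; use the path.data value from your elasticsearch.yml, and stop Elasticsearch first):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Comparator;
    import java.util.stream.Stream;

    public class WipeIndices {
        public static void main(String[] args) throws IOException {
            // Hypothetical location; adjust to your path.data setting.
            Path indices = Paths.get("/var/lib/elasticsearch/nodes/0/indices");
            try (Stream<Path> walk = Files.walk(indices)) {
                // Delete children before parents (deepest paths first).
                walk.sorted(Comparator.reverseOrder())
                    .forEach(p -> p.toFile().delete());
            }
        }
    }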
HTH.
I want to use this method in a script that sets up a sync between Elasticsearch and Firebase. I am using it to skip children that have already been synced and to index only new ones. Is this method efficient when I have millions of records in my Firebase?
It is efficient if you have an index defined on the field you're using to sort.
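For illustration only, since the question doesn't show the method itself: with the Firebase Realtime Database Java SDK, a "fetch only new children" query usually looks something like the sketch below (the field name and cursor value are hypothetical). It stays efficient on millions of records only if the sort field is covered by an .indexOn rule; otherwise Firebase downloads the data and sorts it client-side.

    import com.google.firebase.database.*;

    public class SyncNewChildren {
        public static void listen(DatabaseReference ref, long lastIndexedAt) {
            // Assumes ".indexOn": "createdAt" in the database security rules.
            ref.orderByChild("createdAt")
               .startAt(lastIndexedAt + 1) // skip already-synced children
               .addChildEventListener(new ChildEventListener() {
                   @Override
                   public void onChildAdded(DataSnapshot snap, String prev) {
                       // Index snap.getValue() into Elasticsearch here.
                   }
                   @Override public void onChildChanged(DataSnapshot s, String p) {}
                   @Override public void onChildRemoved(DataSnapshot s) {}
                   @Override public void onChildMoved(DataSnapshot s, String p) {}
                   @Override public void onCancelled(DatabaseError e) {}
               });
        }
    }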
I am using SOLR 4.5 (a standalone instance) and I am trying to use external file fields to improve the ranking of documents. I have two external file fields for two different parameters that change daily, which I use in the "bf" and "boost" params of the edismax parser. Previously, these fields were part of the SOLR index.
I am facing a serious performance issue after moving these fields out of the index into external files. The CPU usage of the SOLR machine reaches 100% under peak load, and the average response time has risen from 13 milliseconds to almost 150 milliseconds.
Is there anything I can do to improve the performance of SOLR when using external file fields? Are there any things to take care of while using external file field values within boost/bf functions?
As described in the SO question Relevancy boosting very slow in Solr, the key=value pairs the external file consists of should be sorted by that key. This is also stated in the Javadoc of ExternalFileField:
The external file may be sorted or unsorted by the key field, but it will be substantially slower (untested) if it isn't sorted.
So if the content of your file looked like this (just an example):

    300=3.8294805903e-07
    5=3.8294805903e-07
    20=3.8294805903e-07

you would need a script that sorts the contents into (a sketch of such a script follows the example):

    5=3.8294805903e-07
    20=3.8294805903e-07
    300=3.8294805903e-07
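A minimal sketch of that sorting step (the file name is hypothetical; keys are assumed to be numeric document ids, as in the example above):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Comparator;
    import java.util.List;

    public class SortExternalFile {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get("external_myfield.txt"); // hypothetical name
            List<String> lines = Files.readAllLines(file);
            // Sort numerically by the key before '='; a plain string sort
            // would put "300" before "5".
            lines.sort(Comparator.comparingLong(l -> Long.parseLong(l.split("=", 2)[0])));
            Files.write(file, lines);
        }
    }

A numeric sort is what matters here; rerun it each time the daily file is regenerated.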
When I save the same document, say, 10 times, does it need ten times as much disk space? Or are the individual field values stored in an index, with each document merely referencing that index when more than one document has the same value for a field?
Well, the answer is yes and no :).
By default the data is stored in an aggregated data structure called the Lucene inverted index.
In addition to this, the data you submit for indexing is also stored in a field called _source. So we can safely say the data is stored in two different formats: the inverted index can only be used for searching, while retrieving the actual document requires fetching it from _source.
So if _source is explicitly disabled, you won't see linear growth in disk usage (given that the segments are merged down to a single segment).
If it is not disabled, the data is stored both in _source (as raw JSON) and in the inverted index (where it is tokenized before being stored).
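If you want to experiment with that trade-off, a minimal sketch using only the JDK (assuming a local cluster on localhost:9200, an Elasticsearch 7+ mapping format, and a hypothetical index name): it creates an index with _source disabled, so only the inverted index is kept.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class DisableSource {
        public static void main(String[] args) throws Exception {
            // Mapping that turns off storage of the raw JSON per document.
            String mapping = "{\"mappings\":{\"_source\":{\"enabled\":false}}}";
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://localhost:9200/my_index").openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            try (OutputStream os = conn.getOutputStream()) {
                os.write(mapping.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println(conn.getResponseCode()); // 200 on success
        }
    }

Keep in mind that disabling _source also disables features that depend on it, such as the update and reindex APIs, so it is usually a last resort.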