Elasticsearch with Hadoop: data duplication issue

I have the following requirement:
Whatever data is in Hadoop, I need to make it searchable (and vice versa).
For this, I use Elasticsearch, where the elasticsearch-hadoop plugin can send data from Hadoop to Elasticsearch, making real-time search possible.
But my question is: isn't there a duplication of data? Whatever data is in Hadoop is duplicated in Elasticsearch along with its index. Is there any way to get rid of this duplication, or is my concept wrong? I have searched a lot but found no clue about this duplication issue.

If you specify an immutable ID for each row in Elasticsearch (e.g., a customer ID), all inserts of existing data will only be updates.
Extract from the official documentation about the write operation (see http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/configuration.html#_operation):
index (default): new data is added while existing data (based on its
id) is replaced (reindexed).
If you have a "customer" dataset in Pig, just store the data like this:
A = LOAD '/user/hadoop/customers.csv' USING PigStorage()
....;
B = FOREACH A GENERATE customerid, ...;
STORE B INTO 'foo/customer' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = localhost',
    'es.http.timeout = 5m',
    'es.index.auto.create = true',
    'es.input.json = true',
    'es.mapping.id = customerid',
    'es.batch.write.retry.wait = 30',
    'es.batch.size.entries = 500');
    -- ,'es.mapping.parent = customer');
To search the data from Hadoop again, just use the custom loader:
A = LOAD 'foo/customer' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*');
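The deduplication behaviour that es.mapping.id gives you can be sketched in plain Python: an index keyed by an immutable ID behaves like a dictionary, so re-sending an existing row replaces it rather than duplicating it (the customer data below is invented for illustration):

```python
# Simulate Elasticsearch's default "index" operation: documents are
# keyed by their ID, so re-sending an existing row replaces it
# instead of creating a duplicate.
index = {}

def index_doc(doc, id_field="customerid"):
    # es.mapping.id tells es-hadoop which field to use as the document ID
    index[doc[id_field]] = doc

# First load from Hadoop
index_doc({"customerid": 1, "name": "Alice"})
index_doc({"customerid": 2, "name": "Bob"})

# Re-running the Pig job re-sends row 1 with updated data
index_doc({"customerid": 1, "name": "Alice Smith"})

print(len(index))        # still 2 documents, no duplicate
print(index[1]["name"])  # "Alice Smith"
```

This is why choosing a stable, immutable ID matters: with an auto-generated ID, every re-run of the job would create new documents instead of updating existing ones.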

Related

How to modify data table column values in Kibana

In Elasticsearch we store events. I have built a data table which aggregates based on event types, with a filter which checks for event.keyword : "job-completed". I am getting the count as 1 or 0, but I want to display it as completed / in-progress.
How can I achieve this in Kibana?
The best and most efficient way to do this is to add another field and populate it at ingest time.
That is the best solution performance-wise, but it can mean heavy rework.
You can also use a scripted field to do this without touching your data.
Go to Stack Management > Kibana > Index Patterns and select your index.
Select the Scripted fields tab and fill in the form:
Name: your_field
Language: painless
Type: string
Format: string
Script:
if (doc['event.keyword'].value == 'job-completed') {
    return "completed";
} else {
    return "in progress";
}
I have too little information on your real data to give you working code, so you'll have to modify it to fit your needs.
Then refresh your visualization and you can use your new field.
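For reference, the Painless script above is just a value mapping; the equivalent logic expressed in Python (field value and labels taken from the question) would be:

```python
def event_status(event_keyword):
    # Mirrors the Painless scripted field: map the raw event type
    # to a human-readable status label for the Kibana data table.
    if event_keyword == "job-completed":
        return "completed"
    else:
        return "in progress"

print(event_status("job-completed"))  # completed
print(event_status("job-started"))    # in progress
```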

Elasticsearch + Logstash: How to add a fields based on existing data at importing time

Currently I'm importing data into Elasticsearch through Logstash, at this time by reading CSV files.
Now let's say I have two numeric fields in the CSV, age and weight.
I need to add a third field on the fly, computed from the age, the weight, and some other external data (or function result), and I need that third field to be created when importing the data.
Is there any way to do this?
What would be the best practice?
In any Logstash filter section you can add fields via add_field, but that is typically static data.
Math calculations need a separate plugin.
As mentioned there, the ruby filter plugin would probably be your best option. Here is an example snippet for your pipeline:
filter {
  # add a calculated field, for example BMI, from height and weight
  # (event.get/event.set is the current Logstash Event API)
  ruby {
    code => "event.set('bmi', event.get('weight').to_f / (event.get('height').to_f ** 2))"
  }
}
Alternatively, in Kibana there are scripted fields, which are meant for visualization but cannot be queried.
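The calculation inside the ruby filter is ordinary arithmetic on two existing fields; a plain Python sketch of the same derived-field logic (using the standard BMI formula, weight in kilograms divided by height in metres squared) looks like this:

```python
def add_bmi(event):
    # Derive a third field from two existing numeric fields,
    # as the ruby filter does inside the Logstash pipeline.
    event["bmi"] = round(event["weight"] / event["height"] ** 2, 1)
    return event

row = {"age": 30, "weight": 70.0, "height": 1.75}
print(add_bmi(row)["bmi"])  # 22.9
```

The same pattern works for any computed field: read the source fields from the event, compute, and set the new field before the event is shipped to Elasticsearch.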

How to query on hbase json string value

I am saving a list of data as below in HBase, with a unique ID along with a column family name.
I can query the address column family with a specific ID, but I want to query on the JSON value, something like
where homenumber = 4
Can we do that? Any example would be helpful.
Thanks
You can use an HBase filter for this. See these possible duplicate questions:
Scan with filter using HBase shell
Scan HTable rows for specific column value using HBase shell
To start working with HBase filters, refer to:
http://hbase.apache.org/0.94/book/client.filter.html
http://www.hadooptpoint.org/filters-in-hbase-shell/
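Because HBase treats the cell value as an opaque byte string, filtering on a field inside the JSON means either a substring/regex-style ValueFilter on the server or parsing the JSON client-side. Here is a minimal client-side sketch in Python (the row keys and JSON layout are assumptions, since the sample data isn't shown in the question):

```python
import json

# Assumed shape: each row stores a JSON string in the address column family.
rows = {
    "id1": '{"street": "Main St", "homenumber": 4}',
    "id2": '{"street": "Oak Ave", "homenumber": 7}',
}

def scan_where(rows, field, value):
    # Parse each cell's JSON and keep rows where field == value,
    # the client-side equivalent of "where homenumber = 4".
    return [k for k, v in rows.items() if json.loads(v).get(field) == value]

print(scan_where(rows, "homenumber", 4))  # ['id1']
```

A server-side ValueFilter avoids shipping every row to the client, but it can only match the serialized string, not the typed JSON field, so exact-match semantics like the above usually require client-side parsing or a secondary index.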

Neo4j Elastic Search Integration

I have a data set (questions) which are mapped to multiple tags, and these tags are hierarchical in nature.
So there is a question A which is mapped to the tags t1 and t2.
t1 has parent p1, and p1 has parent p2 (p2 -> p1 -> t1 --mapped to--> A).
I store my data in Neo4j and want to get A as the result for the tag p2. I get that result easily using Cypher. But now I have to sort and limit in the same query, and since Neo4j can't use an index in such queries, I am thinking of integrating Neo4j with Elasticsearch; however, I can't work out how to query:
$query = "MATCH p=(n:messages)-[r:TAGGED_TO]->(k:tags{tag_id:{tag_id}}) RETURN p,n ORDER by n.msgId desc limit 5";
$params['tag_id'] = (int)$tag_id;
$result = $this->dbHandle->run($query,$params);
The sort and limit are not using an index; I want to run this query in an optimized way.
You can use the GraphAware plugin for connecting Neo4j to Elasticsearch, or the APOC plugin, specifically the apoc.es.* procedures; see the documentation for more.
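One common way to split the work, sketched here in plain Python with stand-in data (the field names mirror the Cypher query above, but the data and integration details are assumptions): let Elasticsearch serve the filtered, sorted, limited list of message IDs, then hydrate those IDs from Neo4j.

```python
# Stand-in for an Elasticsearch index of messages tagged (transitively) with a tag.
messages = [
    {"msgId": 3, "tag_id": 42, "text": "question A"},
    {"msgId": 9, "tag_id": 42, "text": "question B"},
    {"msgId": 5, "tag_id": 42, "text": "question C"},
    {"msgId": 1, "tag_id": 7,  "text": "other tag"},
]

def search_sorted(tag_id, limit=5):
    # Elasticsearch's job: filter by tag, sort by msgId desc, limit --
    # exactly the part Neo4j could not serve from an index.
    hits = [m for m in messages if m["tag_id"] == tag_id]
    hits.sort(key=lambda m: m["msgId"], reverse=True)
    return [m["msgId"] for m in hits[:limit]]

ids = search_sorted(42)
print(ids)  # [9, 5, 3] -- then fetch these nodes from Neo4j by msgId
```

The graph traversal (resolving p2 down to its descendant tags) stays in Neo4j; only the sort/limit over the matched messages moves to Elasticsearch.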

Passing parameters to a couchbase view

I'm looking to search for a particular JSON document in a bucket, and I don't know its document ID; all I know is the value of one of its sub-keys. I've looked through the API documentation but I'm still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With Couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey / endKey, but this feels wrong: couldn't you still end up with multiple documents being returned? It would be nice to be able to pass a parameter to the view so that an exact match could be performed. Also, how would we handle multi-dimensional searches, i.e. name and category?
I'd like to do as much of the filtering as possible on the Couchbase instance, and ideally narrow it down to one record rather than having to filter when it comes back to the app tier. Something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with Couchbase to filter, but if I've read the docs correctly this filtering is still done client-side. At least if we could narrow the returned dataset down to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
You are correct on one point: you need to create a view (an index, indeed) to be able to query on the content of a JSON document.
So in your case you have to create a view with this kind of code:
function (doc, meta) {
  if (doc.type == "yourtype") { // just a good practice to type the doc
    emit(doc.name);
  }
}
This will create an index, distributed across all the nodes of your cluster, that you can now use in your application. You can point to a specific value using the "key" parameter, which gives you an exact match rather than a startKey / endKey range.
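Conceptually, the view's map function builds a key-to-document-ID index, and the "key" parameter then performs an exact lookup against it. A minimal Python sketch of that mechanism (the document contents are invented for illustration):

```python
# Simulate a Couchbase view: emit(doc.name) builds an index
# from the emitted key to the document ID.
bucket = {
    "doc::1": {"type": "yourtype", "name": "alpha"},
    "doc::2": {"type": "yourtype", "name": "beta"},
    "doc::3": {"type": "other",    "name": "alpha"},
}

def build_view(bucket):
    index = {}
    for doc_id, doc in bucket.items():
        if doc.get("type") == "yourtype":  # the type check from the map function
            index.setdefault(doc["name"], []).append(doc_id)
    return index

view = build_view(bucket)
# Equivalent of querying the view with key="alpha" for an exact match
print(view.get("alpha"))  # ['doc::1']
```

Note that an exact key match can still return multiple document IDs if several documents emit the same key; for a compound lookup like name and category, you would emit an array key, e.g. emit([doc.name, doc.category]).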
