Elasticsearch Connector as Source in Flink

I used the Elasticsearch connector as a sink to insert data into Elasticsearch (see: https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/elasticsearch.html).
However, I did not find any connector that reads data from Elasticsearch as a source.
Is there a connector or an example that uses Elasticsearch documents as a source in a Flink pipeline?
Regards,
Ali

I don't know of an explicit ES source for Flink. I did see one user talking about using elasticsearch-hadoop as a HadoopInputFormat with Flink, but I don't know if that worked for them (see their code).

I finally defined a simple function that reads from Elasticsearch:
public static class ElasticsearchFunction
        extends ProcessFunction<MetricMeasurement, MetricPrediction> {
    // Both objects are created in open() and kept transient, because Flink serializes the function instance.
    private transient TransportClient client;
    private transient ObjectMapper mapper;
    @Override
    public void open(Configuration parameters) throws Exception {
        // Placeholders: replace YOUR_CLUSTER_NAME, YOUR_IP and PORT_NUMBER with your own values.
        Settings settings = Settings.builder().put("cluster.name", "YOUR_CLUSTER_NAME").build();
        client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("YOUR_IP"), PORT_NUMBER));
        mapper = new ObjectMapper();
    }
    @Override
    public void close() {
        // Release the connection when the task shuts down.
        if (client != null) {
            client.close();
        }
    }
    @Override
    public void processElement(MetricMeasurement in, Context context, Collector<MetricPrediction> out) throws Exception {
        MetricPrediction metricPrediction = new MetricPrediction();
        metricPrediction.setMetricId(in.getMetricId());
        metricPrediction.setGroupId(in.getGroupId());
        metricPrediction.setBucket(in.getBucket());
        // Get the metric measurement from Elasticsearch
        SearchResponse response = client.prepareSearch("YOUR_INDEX_NAME")
                .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                .setQuery(QueryBuilders.termQuery("YOUR_TERM", in.getMetricId())) // Query
                .setPostFilter(QueryBuilders.rangeQuery("value").from(0L).to(50L)) // Filter
                .setFrom(0).setSize(1).setExplain(true)
                .get();
        SearchHit[] results = response.getHits().getHits();
        for (SearchHit hit : results) {
            String sourceAsString = hit.getSourceAsString();
            if (sourceAsString != null) {
                // Map the JSON document back to the measurement type and copy its value
                MetricMeasurement obj = mapper.readValue(sourceAsString, MetricMeasurement.class);
                metricPrediction.setPredictionValue(obj.getValue());
            }
        }
        out.collect(metricPrediction);
    }
}

Another option is Flink's Hadoop compatibility layer combined with Elasticsearch-Hadoop:
https://github.com/cclient/flink-connector-elasticsearch-source

Related

Simple aggregation fails with the Elasticsearch 8.0+ Java client

I have a simple method that performs a terms aggregation using Elasticsearch 8.0.
It works with RestHighLevelClient, but with ElasticsearchClient I get empty buckets.
Can someone please help me resolve this?
public void aggregate(ElasticsearchClient client) throws ElasticsearchException, IOException {
    String field = "loglevel";
    Map<String, Long> buckets = new HashMap<String, Long>();
    SearchResponse<SspDevLog> response = client.search(fn -> fn
            .aggregations("loglevel", a -> a.terms(v -> v.field(field))), SspDevLog.class);
    Map<String, Aggregate> aggrs = response.aggregations();
    for (Map.Entry<String, Aggregate> entry : aggrs.entrySet()) {
        Aggregate aggregate = entry.getValue();
        StringTermsAggregate sterms = aggregate.sterms();
        Buckets<StringTermsBucket> sbuckets = sterms.buckets();
        List<StringTermsBucket> bucArr = sbuckets.array();
        for (StringTermsBucket bucObj : bucArr) {
            buckets.put(bucObj.key(), bucObj.docCount());
        }
    }
    System.out.println(buckets);
}

Get Aggregate Information from Elasticsearch using Spring-data-elasticsearch, ElasticsearchRepository

I would like to get aggregate results from ES, such as avgSize (the average of a field named 'size'), the total hits for documents that match a term, and other aggregates in the future. I don't think ElasticsearchRepository has any methods for this. I built the query and aggregation builders as shown below. I want to use my Repository interface, but I am not sure what the return type should be. Should it be a document type in my DTOs? I have also seen examples where the searchQuery is passed directly to ElasticsearchTemplate, but then what is the point of having a Repository interface that extends ElasticsearchRepository?
Repository Interface
public interface CCFilesSummaryRepository extends ElasticsearchRepository<DataReferenceSummary, UUID> {
}
Elastic configuration
@Configuration
@EnableElasticsearchRepositories(basePackages = "com.xxx.repository.es")
public class ElasticConfiguration {
    @Bean
    public ElasticsearchOperations elasticsearchTemplate() throws UnknownHostException {
        return new ElasticsearchTemplate(elasticsearchClient());
    }
    @Bean
    public Client elasticsearchClient() throws UnknownHostException {
        Settings settings = Settings.builder().put("cluster.name", "elasticsearch").build();
        TransportClient client = new PreBuiltTransportClient(settings);
        client.addTransportAddress(new TransportAddress(InetAddress.getLocalHost(), 9200));
        return client;
    }
}
Service Method
public DataReferenceSummary createSummary(final DataSet dataSet) {
    try {
        QueryBuilder queryBuilder = QueryBuilders.matchQuery("type", dataSet.getDataSetCreateRequest().getContentType());
        AvgAggregationBuilder avgAggregationBuilder = AggregationBuilders.avg("avg_size").field("size");
        ValueCountAggregationBuilder valueCountAggregationBuilder = AggregationBuilders.count("total_references")
                .field("asset_id");
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(queryBuilder)
                .addAggregation(avgAggregationBuilder)
                .addAggregation(valueCountAggregationBuilder)
                .build();
        return ccFilesSummaryRepository.search(searchQuery).iterator().next();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
DataReferenceSummary is just a POJO for now, and my build fails with an error that says: Unable to build Bean CCFilesSummaryRepository, IllegalArgumentException: DataReferenceSummary is not a managed Object
First, DataReferenceSummary must be a class annotated with @Document.
In Spring Data Elasticsearch 3.2.0 (the current version), you need to define the repository return type as AggregatedPage<DataReferenceSummary>; the returned object will contain the aggregations.
From the upcoming version 4.0 on, you will have to define the return type as SearchHits<DataReferenceSummary> and find the aggregations in this returned object.

How to use BasicAuth with ElasticSearch Connector on Flink

I want to use the Elasticsearch sink in Flink, but I have trouble with authentication:
I have nginx in front of my Elasticsearch cluster, and I use basic auth in nginx.
But with the Elasticsearch connector I can't add the basic auth credentials to my URL (because of InetSocketAddress).
Do you have any idea how to use the Elasticsearch connector with basic auth?
Thanks for your time.
Here is my code:
val configur = new java.util.HashMap[String, String]
configur.put("cluster.name", "cluster")
configur.put("bulk.flush.max.actions", "1000")
val transportAddresses = new java.util.ArrayList[InetSocketAddress]
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("cluster.com"), 9300))
jsonOutput.filter(_.nonEmpty).addSink(new ElasticsearchSink(configur,
  transportAddresses,
  new ElasticsearchSinkFunction[String] {
    def createIndexRequest(element: String): IndexRequest = {
      val jsonMap = parse(element).values.asInstanceOf[java.util.HashMap[String, String]]
      Requests.indexRequest()
        .index("flinkTest")
        .source(jsonMap)
    }
    override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
      indexer.add(createIndexRequest(element))
    }
  }))
Flink uses the Elasticsearch Transport Client which connects using a binary protocol on port 9300.
Your nginx proxy is sitting in front of the HTTP interface on port 9200.
Flink isn't going to use your proxy, so there's no need to provide authentication.
If you need to use an HTTP client to connect Flink with Elasticsearch, one solution is to use the Jest library.
You have to create a custom SinkFunction, like this basic Java class:
package fr.gfi.keenai.streaming.io.sinks.elasticsearch5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import io.searchbox.client.JestClient;
import io.searchbox.client.JestClientFactory;
import io.searchbox.client.config.HttpClientConfig;
import io.searchbox.core.Index;
public class ElasticsearchJestSinkFunction<T> extends RichSinkFunction<T> {
    private static final long serialVersionUID = -7831614642918134232L;
    private JestClient client;
    @Override
    public void invoke(T value) throws Exception {
        String document = convertToJsonDocument(value);
        Index index = new Index.Builder(document).index("YOUR_INDEX_NAME").type("YOUR_DOCUMENT_TYPE").build();
        client.execute(index);
    }
    @Override
    public void open(Configuration parameters) throws Exception {
        // Construct a new Jest client according to configuration via factory
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(new HttpClientConfig.Builder("http://localhost:9200")
                .multiThreaded(true)
                // Per default this implementation will create no more than 2 concurrent
                // connections per given route
                .defaultMaxTotalConnectionPerRoute(2)
                // and no more than 20 connections in total
                .maxTotalConnection(20)
                // Basic username and password authentication
                .defaultCredentials("YOUR_USER", "YOUR_PASSWORD")
                .build());
        client = factory.getObject();
    }
    private String convertToJsonDocument(T value) {
        //TODO
        return "{}";
    }
}
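For context, the basic auth that the nginx proxy in the question expects is carried in a single HTTP request header. The sketch below, which uses only the JDK, shows how such a header value is formed from a user name and password. The class and method names are hypothetical and are not part of Jest or Flink; this is an illustration of the header format, not of how Jest sends credentials internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of Jest, Flink or Elasticsearch. */
final class BasicAuthHeader {
    /** Builds the value of the Authorization header for HTTP Basic authentication. */
    static String basicAuthValue(String user, String password) {
        // The credentials are joined with a colon and then Base64-encoded.
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder credentials, as in the sink above.
        System.out.println("Authorization: " + basicAuthValue("YOUR_USER", "YOUR_PASSWORD"));
    }
}
```

Because Base64 is an encoding and not encryption, such credentials are only protected when the proxy is reached over HTTPS.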
Note that you can also use bulk operations for more speed.
An example of a Jest implementation for Flink is described in the section "Connecting Flink to Amazon RS" of this post.
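The bulk idea mentioned above can be sketched independently of any client library: documents are buffered and handed over in batches instead of one request per element. The class below is a hypothetical, JDK-only illustration; the flusher callback stands in for wherever a real bulk request would be sent, and maxActions plays the same role as the bulk.flush.max.actions setting seen in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batching buffer for illustration; not part of Jest or Flink. */
final class BulkBuffer<T> {
    private final int maxActions;
    private final Consumer<List<T>> flusher;
    private final List<T> pending = new ArrayList<>();

    BulkBuffer(int maxActions, Consumer<List<T>> flusher) {
        if (maxActions < 1) {
            throw new IllegalArgumentException("maxActions must be at least 1");
        }
        this.maxActions = maxActions;
        this.flusher = flusher;
    }

    /** Buffers one document and flushes once maxActions documents are pending. */
    void add(T document) {
        pending.add(document);
        if (pending.size() >= maxActions) {
            flush();
        }
    }

    /** Hands all pending documents to the flusher as one batch. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Pass a copy so the callback can keep the batch after the buffer is cleared.
        flusher.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

In a sink, add would be called from invoke and flush from close, so that a partially filled batch is not lost at shutdown. A production version would also need to flush on checkpoints or on a timer, which this sketch leaves out.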

Client is not connected to any Elasticsearch nodes in Flink

I am using Flink 1.1.2 and have added the Elasticsearch dependency in Maven as follows:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch2_2.10</artifactId>
    <version>1.2.0</version>
</dependency>
My program contains the following code, which reads data from Kafka and inserts it into Elasticsearch:
public class ReadFromKafka {
    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "test");
        DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
                new JoinSchema(), properties));
        System.out.println("reading form kafka ");
        message.print();
        Map<String, String> config = new HashMap<>();
        config.put("bulk.flush.max.actions", "1"); // flush inserts after every event
        config.put("cluster.name", "elasticsearch_amar"); // default cluster name
        List<InetSocketAddress> transports = new ArrayList<>();
        // set default connection details
        transports.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300));
        message.addSink(new ElasticsearchSink<>(config, transports, new ElasticInserter()));
        env.execute();
    } //main
    public static class ElasticInserter implements ElasticsearchSinkFunction<JoinedStreamEvent> {
        @Override
        public void process(JoinedStreamEvent record, RuntimeContext runtimeContext, RequestIndexer requestIndexer) {
            Map<String, Integer> json = new HashMap<>();
            json.put("Time", record.getPatient_id());
            json.put("heart Rate ", record.getHeartRate());
            json.put("resp rete", record.getRespirationRate());
            IndexRequest rqst = Requests.indexRequest()
                    .index("nyc-places") // index name
                    .type("popular-locations") // mapping name
                    .source(json);
            requestIndexer.add(rqst);
        } //process
    } //ElasticInserter
} //ReadFromKafka
I have installed Elasticsearch using Homebrew and then started it using the elasticsearch command as shown below
However, when I start my program I get the following error
I have a few suggestions:
First, check whether ES is up;
see "Can't Connect to Elasticsearch (through Curl)".
I recommend using a Docker container to start ES, e.g. docker run -d --name es -p 9200:9200 elasticsearch:2 -Des.network.host=0.0.0.0
By the way, you can also try changing the es.network.host value to 0.0.0.0 in the ES config file elasticsearch.yml.
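The "is ES up" check can also be done from Java. Note that curl tests the HTTP port 9200, while the sink in the question connects to the transport port 9300, so both are worth probing. The following is a minimal JDK-only sketch; the class name, host and timeout are assumptions for illustration, and a successful TCP connection only shows that something is listening, not that the cluster name matches.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical connectivity probe for illustration; not part of Flink or Elasticsearch. */
final class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout or unknown host all count as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and ports taken from the question's setup; adjust to your environment.
        System.out.println("transport 9300 reachable: " + isReachable("127.0.0.1", 9300, 1000));
        System.out.println("http 9200 reachable: " + isReachable("127.0.0.1", 9200, 1000));
    }
}
```

If 9200 is reachable but 9300 is not, the node is running but the transport port is not exposed to the client, for example because only 9200 was published from a container.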

How to balance Elasticsearch nodes using TransportClient Java code

I am looking for expert help (I am new to Elasticsearch). I have multiple Elasticsearch nodes.
I am using the Elasticsearch Java library to index JSON documents. I would like to know how to handle node balancing. Is it possible to handle that from the client side?
Elasticsearch transport client code:
public static Client getTransportClient(String host, int port) {
    Settings settings = ImmutableSettings.settingsBuilder()
            .put("cluster.name", "ccw_cat_es")
            .put("node.name", "catsrch-pdv1-01")
            .build();
    return new TransportClient(settings).addTransportAddress(new InetSocketTransportAddress(host, port));
}
public static IndexResponse doIndex(Client client, String index, String type, String id, Map<String, Object> data) {
    return client
            .prepareIndex(index, type, id)
            .setSource(data)
            .execute()
            .actionGet();
}
public static void main(String[] args) {
    Client client = getTransportClient("catsrch-pdv1-01", 9200);
    String index = "orderstatussearch";
    String type = "osapi";
    String id = null;
    Map<String, Object> data = new HashMap<String, Object>();
    data.put("OrderNumber", "444");
    data.put("PO", "123");
    data.put("WID", "ab234");
    id = "444";
    IndexResponse result = doIndex(client, index, type, id, data);
}
The TransportClient will automatically use a round-robin strategy to load balance across the nodes that it is connected to. In your case, you are only connecting to one node, so there is nothing to balance. You can add other nodes to the list and it will balance across them appropriately.
Alternatively, you can "sniff" out the data nodes automatically by just connecting to one of them with an extra setting applied:
Settings settings = ImmutableSettings.settingsBuilder()
        // ...
        .put("client.transport.sniff", true)
        // ...
        .build();
This will then round robin against all data nodes that it finds in the cluster state.
This probably leads to the question: why isn't this the default? The reason is that, if you have standalone client nodes, then they are better proxies to the cluster rather than directly communicating with data nodes. For smaller clusters, this is a perfectly acceptable strategy though.
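To make the round-robin strategy described above concrete, here is a small JDK-only sketch of rotating through a list of node addresses. It is not the TransportClient's actual implementation; the class name and the node addresses are hypothetical and only illustrate the idea that each request goes to the next node in the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin selector for illustration; not part of the Elasticsearch API. */
final class RoundRobinNodes {
    private final List<String> nodes;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinNodes(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        // Defensive copy so later changes to the caller's list do not affect rotation.
        this.nodes = new ArrayList<>(nodes);
    }

    /** Returns the node that should receive the next request. */
    String next() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        // Assumed example addresses on the transport port.
        RoundRobinNodes nodes = new RoundRobinNodes(
                Arrays.asList("es-node-1:9300", "es-node-2:9300", "es-node-3:9300"));
        for (int i = 0; i < 4; i++) {
            System.out.println(nodes.next());
        }
    }
}
```

With a single configured node, as in the question, every call returns the same address, which is why there is nothing to balance until more nodes are added or sniffing is enabled.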
