Stormcrawler not fetching/indexing pages for Elasticsearch

I am using StormCrawler with the Elasticsearch example, and while crawling http://books.toscrape.com/ no pages show up with the FETCHED status in Kibana.
Yet on the console the web pages appear to be fetched and parsed:
48239 [Thread-26-fetcher-executor[3 3]] INFO c.d.s.b.FetcherBolt - [Fetcher #3] Threads : 0 queues : 1 in_queues : 1
48341 [FetcherThread #7] INFO c.d.s.b.FetcherBolt - [Fetcher #3] Fetched http://books.toscrape.com/catalogue/category/books_1/index.html with status 200 in msec 86
48346 [Thread-46-parse-executor[5 5]] INFO c.d.s.b.JSoupParserBolt - Parsing : starting http://books.toscrape.com/catalogue/category/books_1/index.html
48362 [Thread-46-parse-executor[5 5]] INFO c.d.s.b.JSoupParserBolt - Parsed http://books.toscrape.com/catalogue/category/books_1/index.html in 13 msec
Also, the Elasticsearch index does receive some items, even though these have no title.
Screenshot of Kibana
I extended com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt to also store the metadata of each web page in a local file, and it seems it does not receive any tuples at all. Since the IndexerBolt is also what marks the status of a URL as FETCHED, that would explain the observation in Kibana mentioned above.
Is there any explanation for this behaviour? I have already reverted the crawler configuration to the standard one, except for the index bolt in crawler.flux, which points to my class.
The Topology Configuration:
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "es-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
parallelism: 10
bolts:
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 1
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 1
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 1
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 1
- id: "index"
className: "de.hpi.bpStormcrawler.IndexerBolt"
parallelism: 1
- id: "status"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
parallelism: 1
- id: "status_metrics"
className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
parallelism: 1
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "spout"
to: "status_metrics"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["url"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
The reconfigured IndexerBolt
package de.hpi.bpStormcrawler;
/**
* Licensed to DigitalPebble Ltd under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* DigitalPebble licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import static com.digitalpebble.stormcrawler.Constants.StatusStreamName;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import java.io.*;
import java.util.Iterator;
import java.util.Map;
import org.apache.storm.metric.api.MultiCountMetric;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.elasticsearch.ElasticSearchConnection;
import com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt;
import com.digitalpebble.stormcrawler.persistence.Status;
import com.digitalpebble.stormcrawler.util.ConfUtils;
/**
* Sends documents to ElasticSearch. Indexes all the fields from the tuples or a
* Map <String,Object> from a named field.
*/
#SuppressWarnings("serial")
public class IndexerBolt extends AbstractIndexerBolt {
private static final Logger LOG = LoggerFactory
.getLogger(IndexerBolt.class);
private static final String ESBoltType = "indexer";
static final String ESIndexNameParamName = "es.indexer.index.name";
static final String ESDocTypeParamName = "es.indexer.doc.type";
private static final String ESCreateParamName = "es.indexer.create";
private OutputCollector _collector;
private String indexName;
private String docType;
// whether the document will be created only if it does not exist or
// overwritten
private boolean create = false;
File indexFile;
private MultiCountMetric eventCounter;
private ElasticSearchConnection connection;
@SuppressWarnings({ "unchecked", "rawtypes" })
@Override
public void prepare(Map conf, TopologyContext context,
OutputCollector collector) {
super.prepare(conf, context, collector);
_collector = collector;
indexName = ConfUtils.getString(conf, IndexerBolt.ESIndexNameParamName,
"fetcher");
docType = ConfUtils.getString(conf, IndexerBolt.ESDocTypeParamName,
"doc");
create = ConfUtils.getBoolean(conf, IndexerBolt.ESCreateParamName,
false);
try {
connection = ElasticSearchConnection
.getConnection(conf, ESBoltType);
} catch (Exception e1) {
LOG.error("Can't connect to ElasticSearch", e1);
throw new RuntimeException(e1);
}
this.eventCounter = context.registerMetric("ElasticSearchIndexer",
new MultiCountMetric(), 10);
indexFile = new File("/Users/jonaspohlmann/code/HPI/BP/stormCrawlerSpike/spikeStormCrawler2/index.log");
}
@Override
public void cleanup() {
if (connection != null)
connection.close();
}
@Override
public void execute(Tuple tuple) {
String url = tuple.getStringByField("url");
// Distinguish the value used for indexing
// from the one used for the status
String normalisedurl = valueForURL(tuple);
Metadata metadata = (Metadata) tuple.getValueByField("metadata");
String text = tuple.getStringByField("text");
//BP: added Content Field
String content = new String(tuple.getBinaryByField("content"));
boolean keep = filterDocument(metadata);
if (!keep) {
eventCounter.scope("Filtered").incrBy(1);
// treat it as successfully processed even if
// we do not index it
_collector.emit(StatusStreamName, tuple, new Values(url, metadata,
Status.FETCHED));
_collector.ack(tuple);
return;
}
try {
XContentBuilder builder = jsonBuilder().startObject();
// display text of the document?
if (fieldNameForText() != null) {
builder.field(fieldNameForText(), trimText(text));
}
// send URL as field?
if (fieldNameForURL() != null) {
builder.field(fieldNameForURL(), normalisedurl);
}
// which metadata to display?
Map<String, String[]> keyVals = filterMetadata(metadata);
Iterator<String> iterator = keyVals.keySet().iterator();
while (iterator.hasNext()) {
String fieldName = iterator.next();
String[] values = keyVals.get(fieldName);
if (values.length == 1) {
builder.field(fieldName, values[0]);
try {
saveStringToFile(indexFile, fieldName + "\t" + values[0]);
} catch (IOException e) {
e.printStackTrace();
}
} else if (values.length > 1) {
builder.array(fieldName, values);
}
}
builder.endObject();
String sha256hex = org.apache.commons.codec.digest.DigestUtils
.sha256Hex(normalisedurl);
IndexRequest indexRequest = new IndexRequest(indexName, docType,
sha256hex).source(builder);
DocWriteRequest.OpType optype = DocWriteRequest.OpType.INDEX;
if (create) {
optype = DocWriteRequest.OpType.CREATE;
}
indexRequest.opType(optype);
connection.getProcessor().add(indexRequest);
eventCounter.scope("Indexed").incrBy(1);
_collector.emit(StatusStreamName, tuple, new Values(url, metadata,
Status.FETCHED));
_collector.ack(tuple);
} catch (IOException e) {
LOG.error("Error sending log tuple to ES", e);
// do not send to status stream so that it gets replayed
_collector.fail(tuple);
}
}
private void saveStringToFile(File file, String stringToWrite) throws IOException {
String pathName = file.getPath();
File folder = file.getParentFile();
if (!folder.exists() && !folder.mkdirs()) {
throw new IOException("Couldn't create the storage folder: " + folder.getAbsolutePath() + " does it already exist ?");
}
try (PrintWriter out = new PrintWriter(new FileOutputStream(file, true))) {
out.append(stringToWrite + '\n');
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}

Have you merged all your configs, i.e. the generic StormCrawler one and the specific ES one, into a single es-conf.yaml? If not, then your Flux file is probably missing
  - resource: false
    file: "crawler-conf.yaml"
    override: true
where the indexer config typically looks like:
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain
Not having any md mappings defined would explain why your modified indexer does not write to the files and why the index contains urls but no additional fields.
Please note that the 'index' index (excuse the terminology) does not contain the status of the URL. See https://stackoverflow.com/a/49316316/432844 for an explanation of status vs index.
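If you want to double-check where documents are actually landing, you can query the two indices directly; a quick sketch, assuming the default index names from the ES example configuration ("status" for the status index, "index" for the content index) and Elasticsearch on localhost:
# how many URLs the status index records as FETCHED
curl "http://localhost:9200/status/_search?q=status:FETCHED&size=0&pretty"
# sample documents from the content index, to see which fields actually got indexed
curl "http://localhost:9200/index/_search?size=2&pretty"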

Related

How can I troubleshoot an error: lib/graphql has no exported mutation - for a mutation I have defined and which appears in graphql.tsx

I'm trying to figure out what I need to do in order to have lib/graphql recognise the mutations I have made.
I have an issue.tsx (which is a form). It imports:
import {
IssueInput,
useUpdateIssueMutation,
useAllIssuesQuery,
useCreateIssueMutation,
useDeleteIssueMutation,
Issue as IssueGQLType,
} from "lib/graphql"
Other than IssueInput and Issue, I'm getting errors in my terminal that say these queries and mutations are not exported members.
However, when I try to load the issue page on localhost, I get an error that says:
error - GraphQLError [Object]: Syntax Error: Expected Name, found .
It points to the line where Issue is imported.
I made all of these queries and mutations in my resolver as follows:
import { Arg, Mutation, Query, Resolver } from "type-graphql"
import { Issue } from "./issue.model"
import { IssueService } from "./issue.service"
import { IssueInput } from "./inputs/create.input"
import { Inject, Service } from "typedi"
import { UseAuth } from "../shared/middleware/UseAuth"
import { Role } from "#generated"
@Service()
@Resolver(() => Issue)
export default class IssueResolver {
  @Inject(() => IssueService)
  issueService: IssueService

  @Query(() => [Issue])
  async allIssues() {
    return await this.issueService.getAllIssues()
  }

  @Query(() => [Issue])
  async futureRiskIssues() {
    return await this.issueService.getFutureRiskIssues()
  }

  @Query(() => Issue)
  async issue(@Arg("id") id: string) {
    return await this.issueService.getIssue(id)
  }

  @UseAuth([Role.ADMIN])
  @Mutation(() => Issue)
  async createIssue(@Arg("data") data: IssueInput) {
    return await this.issueService.createIssue(data)
  }

  @UseAuth([Role.ADMIN])
  @Mutation(() => Issue)
  async deleteIssue(@Arg("id") id: string) {
    return await this.issueService.deleteIssue(id)
  }

  @UseAuth([Role.ADMIN])
  @Mutation(() => Issue)
  async updateIssue(@Arg("id") id: string, @Arg("data") data: IssueInput) {
    return await this.issueService.updateIssue(id, data)
  }
}
I can also see from my graphql.tsx file, that these functions are recognised as follows:
export type Mutation = {
__typename?: 'Mutation';
createIssue: Issue;
createUser: User;
deleteIssue: Issue;
destroyAccount: Scalars['Boolean'];
forgotPassword: Scalars['Boolean'];
getBulkSignedS3UrlForPut?: Maybe<Array<SignedResponse>>;
getSignedS3UrlForPut?: Maybe<SignedResponse>;
login: AuthResponse;
register: AuthResponse;
resetPassword: Scalars['Boolean'];
updateIssue: Issue;
updateMe: User;
};
export type MutationCreateUserArgs = {
data: UserCreateInput;
};
export type MutationDeleteIssueArgs = {
id: Scalars['String'];
};
export type MutationUpdateIssueArgs = {
data: IssueInput;
id: Scalars['String'];
};
I have run the codegen several times and can't think of anything else to try to force these mutations and queries to be recognised. Can anyone see a way to troubleshoot this?
My codegen.yml has:
schema: http://localhost:5555/graphql
documents:
  - "src/components/**/*.{ts,tsx}"
  - "src/lib/**/*.{ts,tsx}"
  - "src/pages/**/*.{ts,tsx}"
overwrite: true
generates:
  src/lib/graphql.tsx:
    config:
      withMutationFn: false
      addDocBlocks: false
      scalars:
        DateTime: string
    plugins:
      - add:
          content: "/* eslint-disable */"
      - typescript
      - typescript-operations
      - typescript-react-apollo
When I look at the mutations available on the authentication objects (provided with the boilerplate app, https://github.com/NoQuarterTeam/boilerplate, that I am trying to use), I can see that there are mutations and queries that are represented differently in the lib/graphql file. I just can't figure out how to force the ones I write to be included in this way:
export function useLoginMutation(baseOptions?: Apollo.MutationHookOptions<LoginMutation, LoginMutationVariables>) {
const options = {...defaultOptions, ...baseOptions}
return Apollo.useMutation<LoginMutation, LoginMutationVariables>(LoginDocument, options);
}
Instead, I get all of the following types, but none of them look like the hook above, and I can't figure out which one to import into my front-end form so that I can create an entry in the database. None of them look like the queries or mutations I defined in my resolver:
export type IssueInput = {
description: Scalars['String'];
issueGroup: Scalars['String'];
title: Scalars['String'];
};
export type IssueListRelationFilter = {
every?: InputMaybe<IssueWhereInput>;
none?: InputMaybe<IssueWhereInput>;
some?: InputMaybe<IssueWhereInput>;
};
export type IssueRelationFilter = {
is?: InputMaybe<IssueWhereInput>;
isNot?: InputMaybe<IssueWhereInput>;
};
export type IssueWhereInput = {
AND?: InputMaybe<Array<IssueWhereInput>>;
NOT?: InputMaybe<Array<IssueWhereInput>>;
OR?: InputMaybe<Array<IssueWhereInput>>;
createdAt?: InputMaybe<DateTimeFilter>;
description?: InputMaybe<StringFilter>;
id?: InputMaybe<UuidFilter>;
issueGroup?: InputMaybe<IssueGroupRelationFilter>;
issueGroupId?: InputMaybe<UuidFilter>;
subscribers?: InputMaybe<UserIssueListRelationFilter>;
title?: InputMaybe<StringFilter>;
updatedAt?: InputMaybe<DateTimeFilter>;
};
export type IssueWhereUniqueInput = {
id?: InputMaybe<Scalars['String']>;
};
I do have this record in my graphql.tsx file:
export type Mutation = {
__typename?: 'Mutation';
createIssue: Issue;
createIssueGroup: IssueGroup;
createUser: User;
deleteIssue: Issue;
deleteIssueGroup: IssueGroup;
destroyAccount: Scalars['Boolean'];
forgotPassword: Scalars['Boolean'];
getBulkSignedS3UrlForPut?: Maybe<Array<SignedResponse>>;
getSignedS3UrlForPut?: Maybe<SignedResponse>;
login: AuthResponse;
register: AuthResponse;
resetPassword: Scalars['Boolean'];
updateIssue: Issue;
updateIssueGroup: IssueGroup;
updateMe: User;
};
but I can't use createIssueMutation as an import in my issue.tsx, where I'm trying to build a form to post to the database.
In the issue form, I get an error that says:
"resource": "/.../src/pages/issue.tsx", "owner": "typescript",
"code": "2305", "severity": 8, "message": "Module '"lib/graphql"'
has no exported member 'useCreateIssueMutation'.", "source": "ts",
"startLineNumber": 7, "startColumn": 27, "endLineNumber": 7,
"endColumn": 54 }]
and the same thing for the query
check your codegen.yml
overwrite: true
schema: "http://localhost:4000/graphql"
documents: "src/graphql/**/*.graphql"
generates:
  src/generated/graphql.tsx:
    plugins:
      - "typescript"
      - "typescript-operations"
      - "typescript-react-apollo"
  ./graphql.schema.json:
    plugins:
      - "introspection"
or try something like @Resolver(Issue)
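Whichever documents layout you use, the generator only emits a use...Mutation hook for an operation it actually finds under documents; in the asker's setup the globs match .ts/.tsx files, so a gql-tagged operation in one of those files is enough. A minimal sketch (the file path, operation name, and the selected id field are assumptions):
// src/lib/graphql/createIssue.ts -- hypothetical path, any file matched by the `documents` globs works
import { gql } from "@apollo/client"

export const CREATE_ISSUE = gql`
  mutation CreateIssue($data: IssueInput!) {
    createIssue(data: $data) {
      id
      title
    }
  }
`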
It seems like you are not generating the hooks that you are trying to import.
You can update your codegen.yml file to add the generated hooks:
schema: http://localhost:5555/graphql
documents:
  - "src/components/**/*.{ts,tsx}"
  - "src/lib/**/*.{ts,tsx}"
  - "src/pages/**/*.{ts,tsx}"
overwrite: true
generates:
  src/lib/graphql.tsx:
    config:
      withMutationFn: false
      addDocBlocks: false
      scalars:
        DateTime: string
      withHooks: true # <--------------------- this line
    plugins:
      - add:
          content: "/* eslint-disable */"
      - typescript
      - typescript-operations
      - typescript-react-apollo
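Once such an operation exists and the hooks are generated, the import in issue.tsx should resolve; a minimal usage sketch (the component name and field values are placeholders, and the variables shape follows the IssueInput and MutationCreateIssueArgs types shown above):
import { useCreateIssueMutation } from "lib/graphql"

export function IssueForm() {
  const [createIssue, { loading, error }] = useCreateIssueMutation()

  // MutationCreateIssueArgs is { data: IssueInput }
  const onSubmit = () =>
    createIssue({
      variables: { data: { title: "My issue", description: "details", issueGroup: "issue-group-id" } },
    })

  return null // render the real form here and call onSubmit from it
}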

OpenApi - Is there a way to have a ComposedSchema with a discriminator part in a contract generated with springdoc-openapi-maven-plugin?

I have a sample SpringBoot API with the following features:
- 1 controller that exposes a single endpoint, invokable with a GET request, that returns a custom class (ContainerClass in my example)
- ContainerClass contains a property List<ParentClass>
- ParentClass is an abstract class that has 2 sub-classes: ChildA and ChildB
I try to generate an OpenApi contract from this API with springdoc-openapi-maven-plugin.
In my pom.xml, I have the following elements:
- SpringBoot version: 2.2.6
- org.springdoc:springdoc-openapi-ui:1.4.1
- org.springdoc:springdoc-openapi-maven-plugin:1.0
Here are the classes I generate the schema from.
import io.swagger.v3.oas.annotations.media.ArraySchema;
import io.swagger.v3.oas.annotations.media.Schema;

public class ContainerClass {

    @ArraySchema(
        arraySchema = @Schema(discriminatorProperty = "classType"),
        schema = @Schema(implementation = ParentClass.class)
    )
    public List<ParentClass> elements;

    // + Getter/Setter
}

@JsonTypeInfo(
    use = JsonTypeInfo.Id.NAME,
    include = JsonTypeInfo.As.EXISTING_PROPERTY,
    property = "classType",
    defaultImpl = ParentClass.class,
    visible = true)
@JsonSubTypes({
    @JsonSubTypes.Type(value = ChildA.class, name = "CHILD_A"),
    @JsonSubTypes.Type(value = ChildB.class, name = "CHILD_B")})
@Schema(
    description = "Parent description",
    discriminatorProperty = "classType",
    discriminatorMapping = {
        @DiscriminatorMapping(value = "CHILD_A", schema = ChildA.class),
        @DiscriminatorMapping(value = "CHILD_B", schema = ChildB.class)
    }
)
public abstract class ParentClass {

    public String classType;

    // + Getter/Setter
}

@io.swagger.v3.oas.annotations.media.Schema(description = " Child A", allOf = ParentClass.class)
public class ChildA extends ParentClass {
}

@io.swagger.v3.oas.annotations.media.Schema(description = " Child B", allOf = ParentClass.class)
public class ChildB extends ParentClass {
}
When I run springdoc-openapi-maven-plugin, I get the following contract file.
openapi: 3.0.1
info:
  title: OpenAPI definition
  version: v0
servers:
  - url: http://localhost:8080
    description: Generated server url
paths:
  /container:
    get:
      tags:
        - hello-controller
      operationId: listElements
      responses:
        "200":
          description: OK
          content:
            '*/*':
              schema:
                $ref: '#/components/schemas/ContainerClass'
components:
  schemas:
    ChildA:
      type: object
      description: ' Child A'
      allOf:
        - $ref: '#/components/schemas/ParentClass'
    ChildB:
      type: object
      description: ' Child B'
      allOf:
        - $ref: '#/components/schemas/ParentClass'
    ContainerClass:
      type: object
      properties:
        elements:
          type: array
          description: array schema description
          items:
            oneOf:
              - $ref: '#/components/schemas/ChildA'
              - $ref: '#/components/schemas/ChildB'
    ParentClass:
      type: object
      properties:
        classType:
          type: string
      description: Parent description
      discriminator:
        propertyName: classType
        mapping:
          CHILD_A: '#/components/schemas/ChildA'
          CHILD_B: '#/components/schemas/ChildB'
Actually, in my context, in order not to introduce any breaking change for existing consumers, I need the items property in the ContainerClass schema to contain the discriminator part that is defined in the ParentClass schema, like this:
ContainerClass:
  type: object
  properties:
    elements:
      type: array
      description: array schema description
      items:
        discriminator:
          propertyName: classType
          mapping:
            CHILD_A: '#/components/schemas/ChildA'
            CHILD_B: '#/components/schemas/ChildB'
        oneOf:
          - $ref: '#/components/schemas/ChildA'
          - $ref: '#/components/schemas/ChildB'
I have not managed to achieve this by setting properties in the annotations, and when debugging the code of io.swagger.v3.core.jackson.ModelResolver I could not find a way to do it either. So far I have not found a code example that helps me.
Is there a way to make a ComposedSchema (the array contained in ContainerClass in my case) include a discriminator part in the contract generated by the springdoc-openapi-maven-plugin execution?
This is the default generation structure, handled directly by swagger-api (and not by springdoc-openapi). The generated OpenAPI description looks correct.
With springdoc-openapi, you can define an OpenApiCustomiser Bean, where you can change the elements of the components element defined on the OpenAPI level:
https://springdoc.org/faq.html#how-can-i-customise-the-openapi-object-
Here is my solution by defining an OpenApiCustomiser Bean:
@Bean
public OpenApiCustomiser myCustomiser() {
    Map<String, String> classTypeMapping = Map.ofEntries(
        new AbstractMap.SimpleEntry<String, String>("CHILD_A", "#/components/schemas/ChildA"),
        new AbstractMap.SimpleEntry<String, String>("CHILD_B", "#/components/schemas/ChildB")
    );
    Discriminator classTypeDiscriminator = new Discriminator().propertyName("classType")
        .mapping(classTypeMapping);
    return openApi -> openApi.getComponents().getSchemas().values()
        .stream()
        .filter(schema -> "ContainerClass".equals(schema.getName()))
        .map(schema -> schema.getProperties().get("elements"))
        .forEach(arraySchema -> ((ArraySchema) arraySchema).getItems().discriminator(classTypeDiscriminator));
}
I get the expected result in my contract file.

Flink is not adding any data to Elasticsearch but no errors

Folks, I'm new to all this data streaming process but I was able to build and submit a Flink job that will read some CSV data from Kafka and aggregate it then put it in Elasticsearch.
I was able to do the first two parts and print my aggregation to STDOUT. But when I added the code to write it to Elasticsearch, nothing seems to happen there (no data is being added). I looked at the Flink job manager log and it looks fine (no errors) and says:
2020-03-03 16:18:03,877 INFO org.apache.flink.streaming.connectors.elasticsearch7.Elasticsearch7ApiCallBridge - Created Elasticsearch RestHighLevelClient connected to [http://elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200]
Here is my code at this point:
/*
* This Scala source file was generated by the Gradle 'init' task.
*/
package flinkNamePull
import java.time.LocalDateTime
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaProducer010}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.{DataTypes, Table}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.descriptors.{Elasticsearch, Json, Schema}
object Demo {
/**
* MapFunction to generate Transfers POJOs from parsed CSV data.
*/
class TransfersMapper extends RichMapFunction[String, Transfers] {
private var formatter = null
@throws[Exception]
override def open(parameters: Configuration): Unit = {
super.open(parameters)
//formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
}
@throws[Exception]
override def map(csvLine: String): Transfers = {
//var splitCsv = csvLine.stripLineEnd.split("\n")(1).split(",")
var splitCsv = csvLine.stripLineEnd.split(",")
val arrLength = splitCsv.length
val i = 0
if (arrLength != 13) {
for (i <- arrLength + 1 to 13) {
if (i == 13) {
splitCsv = splitCsv :+ "0.0"
} else {
splitCsv = splitCsv :+ ""
}
}
}
var trans = new Transfers()
trans.rowId = splitCsv(0)
trans.subjectId = splitCsv(1)
trans.hadmId = splitCsv(2)
trans.icuStayId = splitCsv(3)
trans.dbSource = splitCsv(4)
trans.eventType = splitCsv(5)
trans.prev_careUnit = splitCsv(6)
trans.curr_careUnit = splitCsv(7)
trans.prev_wardId = splitCsv(8)
trans.curr_wardId = splitCsv(9)
trans.inTime = splitCsv(10)
trans.outTime = splitCsv(11)
trans.los = splitCsv(12).toDouble
return trans
}
}
def main(args: Array[String]) {
// Create streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// Set properties per KafkaConsumer API
val properties = new Properties()
properties.setProperty("bootstrap.servers", "kafka.kafka:9092")
properties.setProperty("group.id", "test")
// Add Kafka source to environment
val myKConsumer = new FlinkKafkaConsumer010[String]("raw.data3", new SimpleStringSchema(), properties)
// Read from beginning of topic
myKConsumer.setStartFromEarliest()
val streamSource = env
.addSource(myKConsumer)
// Transform CSV (with a header row per Kafka event) into a Transfers object
val streamTransfers = streamSource.map(new TransfersMapper())
// create a TableEnvironment
val tEnv = StreamTableEnvironment.create(env)
println("***** NEW EXECUTION STARTED AT " + LocalDateTime.now() + " *****")
// register a Table
val tblTransfers: Table = tEnv.fromDataStream(streamTransfers)
tEnv.createTemporaryView("transfers", tblTransfers)
tEnv.connect(
new Elasticsearch()
.version("7")
.host("elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local", 9200, "http") // required: one or more Elasticsearch hosts to connect to
.index("transfers-sum")
.documentType("_doc")
.keyNullLiteral("n/a")
)
.withFormat(new Json().jsonSchema("{type: 'object', properties: {curr_careUnit: {type: 'string'}, sum: {type: 'number'}}}"))
.withSchema(new Schema()
.field("curr_careUnit", DataTypes.STRING())
.field("sum", DataTypes.DOUBLE())
)
.inUpsertMode()
.createTemporaryTable("transfersSum")
val result = tEnv.sqlQuery(
"""
|SELECT curr_careUnit, sum(los)
|FROM transfers
|GROUP BY curr_careUnit
|""".stripMargin)
result.insertInto("transfersSum")
// Elasticsearch elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200
env.execute("Flink Streaming Demo Dump to Elasticsearch")
}
}
I'm not sure how I can debug this beast... Wondering if somebody can help me figure out why the Flink job is not adding data to Elasticsearch :(
From my Flink cluster, I'm able to query Elasticsearch just fine (manually) and add records to my index:
curl -XPOST "http://elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200/transfers-sum/_doc" -H 'Content-Type: application/json' -d'{"curr_careUnit":"TEST123","sum":"123"}'
A kind soul on the Flink mailing list pointed out that it could be Elasticsearch buffering my records... Well, it was. ;)
I have added the following options to the Elasticsearch connector:
.bulkFlushMaxActions(2)
.bulkFlushInterval(1000L)
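For reference, those two options go on the Elasticsearch() descriptor used in the question's connect() call; a sketch with only the flush settings added (all other values copied from the code above):
tEnv.connect(
  new Elasticsearch()
    .version("7")
    .host("elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local", 9200, "http")
    .index("transfers-sum")
    .documentType("_doc")
    .keyNullLiteral("n/a")
    // flush buffered actions aggressively so small test batches become visible quickly
    .bulkFlushMaxActions(2)
    .bulkFlushInterval(1000L)
)
  // .withFormat(...), .withSchema(...), .inUpsertMode() and .createTemporaryTable(...) stay as in the original code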
Flink Elasticsearch Connector 7 using Scala
Please find a working and detailed answer which I have provided here.

ValidationException on Update: Validation error whilst flushing entity on AbstractPersistenceEventListener

In my environment, I have grails.gorm.failOnError = true in Config.groovy.
package org.example
class Book {
String title
String author
String email
static constraints = {
title nullable: false, blank: false
email nullable: false, blank: false, unique: true //apparently this is the problem..
}
}
And in the controller, I have:
package org.example
class BookController {
def update() {
def bookInstance = Book.get(params.id)
if (!bookInstance) {
flash.message = message(code: 'default.not.found.message', args: [message(code: 'book.label', default: 'Book'), params.id])
redirect(action: "list")
return
}
if (params.version) {
def version = params.version.toLong()
if (bookInstance.version > version) {
bookInstance.errors.rejectValue("version", "default.optimistic.locking.failure",
[message(code: 'book.label', default: 'Book')] as Object[],
"Another user has updated this Book while you were editing")
render(view: "edit", model: [bookInstance: bookInstance])
return
}
}
bookInstance.properties = params
bookInstance.validate()
if(bookInstance.hasErrors()) {
render(view: "edit", model: [bookInstance: bookInstance])
} else {
bookInstance.save(flush: true)
flash.message = message(code: 'default.updated.message', args: [message(code: 'book.label', default: 'Book'), bookInstance.id])
redirect(action: "show", id: bookInstance.id)
}
}
}
Saving works fine. But when updating without setting the title field, I get:
Message: Validation error whilst flushing entity [org.example.Book]:
- Field error in object 'org.example.Book' on field 'title': rejected value []; codes [org.example.Book.title.blank.error.org.example.Book.title,org.example.Book.title.blank.error.title,org.example.Book.title.blank.error.java.lang.String,org.example.Book.title.blank.error,book.title.blank.error.org.example.Book.title,book.title.blank.error.title,book.title.blank.error.java.lang.String,book.title.blank.error,org.example.Book.title.blank.org.example.Book.title,org.example.Book.title.blank.title,org.example.Book.title.blank.java.lang.String,org.example.Book.title.blank,book.title.blank.org.example.Book.title,book.title.blank.title,book.title.blank.java.lang.String,book.title.blank,blank.org.example.Book.title,blank.title,blank.java.lang.String,blank]; arguments [title,class org.example.Book]; default message [Property [{0}] of class [{1}] cannot be blank]
Line | Method
->> 46 | onApplicationEvent in org.grails.datastore.mapping.engine.event.AbstractPersistenceEventListener
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 895 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
| 918 | run . . . . . . . in ''
^ 680 | run in java.lang.Thread
As I understand it, the problem occurs when Hibernate flushes the session: it tries to save the object again and the exception is thrown.
When saving the object again, book.validate() is called once more, which runs a new query against the database to ensure the uniqueness of the email field, and at that point the ValidationException is thrown.
But when I removed the unique constraint from the email property, the update worked normally.
My question is: is this behaviour correct? Does Hibernate call book.save automatically?
This is the sample project, and the steps to simulate the error are:
Source: https://github.com/roalcantara/grails_app_validation_exception
- grails run-app
- navigate to http://localhost:8080/book/book/create
- create a new instance, filling in all fields
- then edit this instance at http://localhost:8080/book/book/edit/1
- finally, clear the 'Title' field and click Update; the exception is then thrown
In my environment, this behaviour has occurred on Grails versions 2.0.3 and 2.2.1.
Thanks for any help! And sorry for my poor English.
You are essentially validating twice, first with:
bookInstance.validate()
and second with:
bookInstance.save(flush: true)
When you call bookInstance.save(flush: true), the call returns the saved instance on success and null when validation fails, so it can be used directly as a truthy check. Grails takes advantage of this by default when a controller is generated, but it appears you have changed the generated controller for some reason.
Just replace this:
bookInstance.validate()
if(bookInstance.hasErrors()) {
render(view: "edit", model: [bookInstance: bookInstance])
} else {
bookInstance.save(flush: true)
flash.message = message(code: 'default.updated.message', args: [message(code: 'book.label', default: 'Book'), bookInstance.id])
redirect(action: "show", id: bookInstance.id)
}
With this:
if( !bookInstance.save( flush: true ) ) {
render(view: "edit", model: [bookInstance: bookInstance])
return
}
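One caveat with the replacement above: the question's Config.groovy sets grails.gorm.failOnError = true, and with that setting save() throws a grails.validation.ValidationException on validation failure instead of returning null, so the if (!...) check never sees a falsy value. A sketch of one way to keep rendering the edit view under that setting (the rest of the controller stays as in the question):
try {
    bookInstance.save(flush: true)
    flash.message = message(code: 'default.updated.message', args: [message(code: 'book.label', default: 'Book'), bookInstance.id])
    redirect(action: "show", id: bookInstance.id)
} catch (grails.validation.ValidationException e) {
    // validation failed (e.g. blank title or duplicate email), re-render the edit form
    render(view: "edit", model: [bookInstance: bookInstance])
}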

Authentication token always null in kernel.request event in Symfony 2?

I'm trying to write a basic listener for the kernel.request event in Symfony 2. The service definition is pretty simple and the annotations come from JMSDiExtraBundle.
The problem is that $context->getToken() is always null, even when the user is fully authenticated:
/**
 * @Service("request.set_messages_count_listener")
 *
 */
class RequestListener
{
    /**
     * @var \Symfony\Component\DependencyInjection\ContainerInterface
     */
    private $container;

    /**
     * @InjectParams({"container" = @Inject("service_container")})
     *
     */
    public function __construct(ContainerInterface $container)
    {
        $this->container = $container;
    }

    /**
     * @Observe("kernel.request", priority = 255)
     */
    public function onKernelRequest(GetResponseEvent $event)
    {
        $context = $this->container->get('security.context');
        var_dump($context->getToken()); die();
    }
}
I think my security setup is working fine. What could be the problem then?
secured_area:
    pattern: ^/app/
    switch_user: true
    form_login:
        check_path: /app/login_check
        login_path: /app/login
        default_target_path: /app/dashboard
        always_use_default_target_path: true
    logout:
        path: /demo/secured/logout # TODO
        target: /demo/ # TODO
access_control:
    - { path: ^/app/login, roles: IS_AUTHENTICATED_ANONYMOUSLY }
    - { path: ^/app/users, roles: ROLE_MNG_USERS }
    - { path: ^/app/messages, roles: ROLE_MNG_USERS }
    - { path: ^/app/roles, roles: ROLE_MNG_PACKAGES_FEATURES }
    - { path: ^/app/packages, roles: ROLE_MNG_PACKAGES_FEATURES }
    - { path: ^/app/, roles: ROLE_USER }
With priority = 255, your listener is called BEFORE the security firewall (priority = 8, look here).
Try to change your priority.
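For example, any priority strictly below 8 runs after the firewall listener, at which point the token is populated for authenticated requests; a sketch of just the changed method from the listener above (the value 7 is an arbitrary example):
/**
 * @Observe("kernel.request", priority = 7)
 */
public function onKernelRequest(GetResponseEvent $event)
{
    // the firewall (priority 8) has already authenticated the request at this point
    $token = $this->container->get('security.context')->getToken();
}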
