StormCrawler Topology class is not working the way es-injector.flux does - apache-storm

I have a few URLs I need to scrape using StormCrawler. Following all the steps in https://medium.com/analytics-vidhya/web-scraping-and-indexing-with-stormcrawler-and-elasticsearch-a105cb9c02ca, I scraped and loaded content into my Elasticsearch.
In that blog post, the author uses a Flux definition to inject the topology into ES:
spouts:
  - className: com.digitalpebble.stormcrawler.spout.FileSpout
    constructorArgs:
      - "stormcrawlertest-master/"
      - seeds.txt
      - true
    id: spout
    parallelism: 1

streams:
  - from: spout
    grouping:
      customClass:
        className: com.digitalpebble.stormcrawler.util.URLStreamGrouping
        constructorArgs:
          - byHost
      streamId: status
      type: CUSTOM
    to: status
This injects the URLs into ES.
I translated the same Flux definition into a Java topology with a main class:
String[] argsa = new String[] { "-conf", "/crawler-conf.yaml", "-conf", "/es-conf.yaml", "-local" };
ConfigurableTopology.start(new InjectorTopology(), argsa);

public class InjectorTopology extends ConfigurableTopology {

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new FileSpout("stormcrawlertest-master/", "seeds.txt", true), 1);
        builder.setBolt("status", new StatusUpdaterBolt(), 1)
                .customGrouping("spout", new URLStreamGrouping(Constants.PARTITION_MODE_HOST));
        return submit("ESInjectorInstance", conf, builder);
    }
}
Then I clean and package with Maven and run:
python storm.py jar target/stormcrawlertest-1.0-SNAPSHOT.jar com.my.sitescraper.main.SiteScraper
This does not inject any URLs into ES.
What am I missing?
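One difference from the Flux file stands out: the Flux stream subscribes to the spout's status stream (streamId: status), whereas customGrouping("spout", ...) with no stream id subscribes to the default stream. A minimal sketch of wiring the bolt to the status stream instead (an assumption, using Constants.StatusStreamName, StormCrawler's constant for the "status" stream id):
// Subscribe the status bolt to the spout's "status" stream, mirroring
// the streamId: status entry in es-injector.flux (assumption, not a
// confirmed fix for this question).
builder.setBolt("status", new StatusUpdaterBolt(), 1)
        .customGrouping("spout", Constants.StatusStreamName,
                new URLStreamGrouping(Constants.PARTITION_MODE_HOST));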

Related

Spring Cloud Stream - Testing Functional Producer

I wrote a Spring Cloud Stream producer according to the new functional model introduced with version 3.1.
@EnableAutoConfiguration
@Component
public class Producer {

    private final BlockingQueue<Message<Object>> messageQueue = new LinkedBlockingQueue<>();

    public void produce(int messageId, Object message) {
        Message<Object> toProduce = MessageBuilder
                .withPayload(message)
                .setHeader(PARTITION_KEY, messageId)
                .build();
        messageQueue.offer(toProduce);
    }

    @Bean
    public Supplier<Message<Object>> produceMessage() {
        return () -> messageQueue.poll();
    }
}
From a REST controller I can call the produce(int, Object) method, which puts data into the BlockingQueue.
The Supplier annotated with @Bean is polled by default every second.
This is a snippet of the application.yml:
spring:
  cloud:
    function:
      definition: produceMessage
    stream:
      bindings:
        produceMessage-out-0:
          destination: test-topic
          contentType: application/json
          producer:
            partitionKeyExpression: headers['partitionKey']
            partitionCount: 1
            errorChannelEnabled: true
      ...
      kafka:
        bindings:
          produceMessage-out-0:
            producer:
              configuration:
                retries: 10
                max.in.flight.requests.per.connection: 1
                request.timeout.ms: 20000
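As an aside, the one-second default comes from the built-in poller, which is configurable; a minimal sketch, assuming the standard spring.cloud.stream.poller properties:
spring:
  cloud:
    stream:
      poller:
        fixed-delay: 1000  # ms between polls of the Supplier; 1000 is the default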
Finally I wrote this class in order to test my code:
@SpringBootTest
class ProducerTest {

    @Test
    void producerTest() {
        try (ConfigurableApplicationContext context = new SpringApplicationBuilder(
                TestChannelBinderConfiguration.getCompleteConfiguration(Producer.class))
                .web(WebApplicationType.NONE)
                .run("--spring.jmx.enabled=false")) {
            OutputDestination output = context.getBean(OutputDestination.class);
            Producer producer = context.getBean(Producer.class);
            producer.produce(1, new MyMessage(1, "Hello Message"));
            Message<byte[]> received = output.receive();
            Assertions.assertNotNull(received);
        }
    }
}
When I run the test, it fails because received is null.
I read a lot of examples that show this is the way to test this type of Producer.
What am I doing wrong? Can you help me please?
Thanks
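One thing worth checking, as an assumption based on the polling behavior described above rather than a confirmed fix: the @Bean Supplier is only polled about once per second, while output.receive() with no arguments does not wait, so the test may read the destination before the first poll fires. OutputDestination has an overload that takes a timeout and a binding name:
// Wait up to 1500 ms (an arbitrary value for illustration) for the poller
// to invoke the Supplier, reading from the produceMessage-out-0 binding.
Message<byte[]> received = output.receive(1500, "produceMessage-out-0");
Assertions.assertNotNull(received);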

Speed up YAML file processing in Spring Boot

I'm trying to read a YAML properties file using the @PropertySource annotation in Spring Boot.
The configuration.yaml file has about 7.5K lines of data in the format below:
# Configuration for privilege management
---
role-configuration:
  roles:
    - name: Agent
      groups:
        - name: privilege-group
group-configuration:
  groups:
    - name: privilege-group
privilege-configuration:
  privileges:
    - name: admin-dashboard-view
      description: View Admin dashboard
      groups:
        - name: privilege-group
    - name: admin-dashboard-edit
      description: Edit Admin dashboard
      groups:
        - name: privilege-group
...
The configuration bean and a YAML-specific PropertySourceFactory were implemented by following this link:
PrivilegeProvider.java
@Configuration
@ConfigurationProperties(prefix = "privilege-configuration")
@PropertySource(value = "classpath:security/configuration.yaml", factory = YamlPropertySourceFactory.class)
@Data // Lombok annotation generating getters/setters and other helpers
public class PrivilegeProvider {
    private List<Privilege> privileges;
}
YamlPropertySourceFactory.java
public class YamlPropertySourceFactory implements PropertySourceFactory {

    @Override
    public PropertySource<?> createPropertySource(String name, EncodedResource encodedResource) throws IOException {
        YamlPropertiesFactoryBean factory = new YamlPropertiesFactoryBean();
        factory.setResources(encodedResource.getResource());
        Properties properties = factory.getObject();
        return new PropertiesPropertySource(encodedResource.getResource().getFilename(), properties);
    }
}
All the data is successfully loaded from the YAML file, but it takes ~5-7 minutes to load, which is a lot of time given the file size.
Can this be optimized, or is there another way to implement the same thing?
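One alternative worth sketching, as an assumption rather than a verified fix for this case: bind the file straight to the POJO with a YAML-aware Jackson ObjectMapper instead of going through a PropertySource, which avoids flattening 7.5K lines into individual properties and relaxed-binding them back. This assumes the jackson-dataformat-yaml dependency and would replace the @PropertySource/@ConfigurationProperties wiring on PrivilegeProvider:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;

import java.io.InputStream;

@Configuration
public class PrivilegeConfig {

    // Parse the YAML once and bind the privilege-configuration subtree
    // directly to the config class, bypassing per-property flattening.
    @Bean
    public PrivilegeProvider privilegeProvider() throws Exception {
        ObjectMapper yamlMapper = new ObjectMapper(new YAMLFactory());
        try (InputStream in = new ClassPathResource("security/configuration.yaml").getInputStream()) {
            JsonNode root = yamlMapper.readTree(in);
            return yamlMapper.treeToValue(root.get("privilege-configuration"), PrivilegeProvider.class);
        }
    }
}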

StateStore is never added on Spring Cloud

Any help on how I can add a state store on Spring Cloud?
I always receive this error: "nested exception is org.springframework.kafka.KafkaException: Could not start stream: ; nested exception is org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore myStore is not added yet."
Here is the bean definition; however, it never works:
@Bean
public StoreBuilder storeBuilder() {
    KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore("mystore");
    StoreBuilder<KeyValueStore<String, MyData>> storeBuilder =
            Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), new StreamsSerde.MyDataSerde());
    return storeBuilder;
}
Here is the Serde:
public static final class MyDataSerde extends Serdes.WrapperSerde<MyData> {
    public MyDataSerde() {
        super(new JsonSerializer<>(), new JsonDeserializer<>(MyData.class));
    }
}
Here is the data class:
public class MyData {
    private String name;
    private String course;
}
Here are the Spring Cloud dependencies:
springBootVersion = "2.2.5.RELEASE"
set('springCloudVersion', "Hoxton.SR3")
implementation group:"org.springframework.cloud", name: "spring-cloud-stream"
implementation group: "org.springframework.cloud", name: "spring-cloud-stream-binder-kafka-streams"
implementation group: "org.springframework.cloud", name: "spring-cloud-starter-stream-kafka"
You need to add state stores like this when you use the lower-level processor or transformer API. Did you try adding the state store to your process or transform method call? Here is a test that works; take a look at the process call and the way the state stores are passed along.
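For illustration, a minimal sketch of what passing the store name to a process call can look like with the Kafka Streams binder; mystore and MyData come from the question, everything else here is assumed:
import java.util.function.Consumer;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.springframework.context.annotation.Bean;

@Bean
public Consumer<KStream<String, MyData>> process() {
    // Passing "mystore" as the second argument to process(...) is what
    // attaches the registered state store to this node of the topology.
    return stream -> stream.process(() -> new Processor<String, MyData>() {

        private KeyValueStore<String, MyData> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, MyData>) context.getStateStore("mystore");
        }

        @Override
        public void process(String key, MyData value) {
            store.put(key, value);
        }

        @Override
        public void close() {
        }
    }, "mystore");
}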
I found a solution that adds the store programmatically in this article:
public void initializeStateStores() throws Exception {
    StreamsBuilderFactoryBean streamsBuilderFactoryBean =
            applicationContext.getBean("&stream-builder-requestListener", StreamsBuilderFactoryBean.class);
    StreamsBuilder streamsBuilder = streamsBuilderFactoryBean.getObject();
    StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder =
            Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(stateStoreName), Serdes.String(), Serdes.Long());
    streamsBuilder.addStateStore(keyValueStoreBuilder);
}
https://medium.com/@daniyaryeralin/utilizing-kafka-streams-processor-api-and-implementing-custom-aggregator-6cb23d00eaa7

Spring Cloud Stream 3 + RabbitMQ configuration for an existing queue

I'm learning Spring Cloud Stream and cannot map a Function<String, String> onto an existing queue.
I'm just building the hello world app from the Spring Cloud documentation, but I don't really understand the part about binding names.
I have an existing queue q.test in my RabbitMQ broker, but with this code and configuration my app always creates a new queue, q.test.anonymous.someRandomString.
Does anybody have a configuration example for this?
@SpringBootApplication
public class CloudstreamApplication {

    public static void main(String[] args) {
        SpringApplication.run(CloudstreamApplication.class, args);
    }

    @Bean
    public Function<String, String> uppercase() {
        return value -> {
            System.out.println("Received: " + value);
            return value.toUpperCase();
        };
    }
}
application.yml
spring.cloud.stream:
  function.bindings:
    uppercase-in-0: q.test
  bindings:
    uppercase-in-0.destination: q.test
Thanks
See the binder documentation - Using Existing Queues/Exchanges.
If you have an existing exchange/queue that you wish to use, you can completely disable automatic provisioning as follows, assuming the exchange is named myExchange and the queue is named myQueue:
spring.cloud.stream.bindings.<binding name>.destination=myExchange
spring.cloud.stream.bindings.<binding name>.group=myQueue
spring.cloud.stream.rabbit.bindings.<binding name>.consumer.bindQueue=false
spring.cloud.stream.rabbit.bindings.<binding name>.consumer.declareExchange=false
spring.cloud.stream.rabbit.bindings.<binding name>.consumer.queueNameGroupOnly=true
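Applied to the uppercase-in-0 binding from the question, with q.test as the existing queue, that could look like this (a sketch; myExchange is an assumed exchange name):
spring:
  cloud:
    stream:
      bindings:
        uppercase-in-0:
          destination: myExchange    # existing exchange (assumed name)
          group: q.test              # existing queue
      rabbit:
        bindings:
          uppercase-in-0:
            consumer:
              bindQueue: false         # don't bind the queue to the exchange
              declareExchange: false   # don't declare the exchange
              queueNameGroupOnly: true # consume from q.test, not destination.group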

Spring Cloud Stream Kafka Channel Not Working in Spring Boot Application

I have been attempting to get an inbound SubscribableChannel and an outbound MessageChannel working in my Spring Boot application.
I have successfully set up the Kafka channel and tested it.
Furthermore, I have created a basic Spring Boot application that tests adding to and receiving from the channel.
The issue I am having is that when I put the equivalent code into the application it belongs in, the messages never seem to get sent or received. From debugging it's hard to ascertain what's going on, but the only thing that looks different to me is the channel name: in the working implementation the channel name is like application.channel, while in the non-working app it is localhost:8080/channel.
I was wondering if there is some Spring Boot configuration blocking or altering the creation of the channels into a different channel source?
Has anyone had similar issues?
application.yml
spring:
  datasource:
    url: jdbc:h2:mem:dpemail;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE
    platform: h2
    username: hello
    password:
    driverClassName: org.h2.Driver
  jpa:
    properties:
      hibernate:
        show_sql: true
        use_sql_comments: true
        format_sql: true
  cloud:
    stream:
      kafka:
        binder:
          brokers: localhost:9092
      bindings:
        email-in:
          destination: email
          contentType: application/json
        email-out:
          destination: email
          contentType: application/json
Email
public class Email {

    private long timestamp;
    private String message;

    public long getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public String getMessage() {
        return message;
    }

    public void setMessage(String message) {
        this.message = message;
    }
}
Binding Config
@EnableBinding(EmailQueues.class)
public class EmailQueueConfiguration {
}
Interface
public interface EmailQueues {

    String INPUT = "email-in";
    String OUTPUT = "email-out";

    @Input(INPUT)
    SubscribableChannel inboundEmails();

    @Output(OUTPUT)
    MessageChannel outboundEmails();
}
Controller
@RestController
@RequestMapping("/queue")
public class EmailQueueController {

    private EmailQueues emailQueues;

    @Autowired
    public EmailQueueController(EmailQueues emailQueues) {
        this.emailQueues = emailQueues;
    }

    @RequestMapping(value = "sendEmail", method = POST)
    @ResponseStatus(ACCEPTED)
    public void sendToQueue() {
        MessageChannel messageChannel = emailQueues.outboundEmails();
        Email email = new Email();
        email.setMessage("hello world: " + System.currentTimeMillis());
        email.setTimestamp(System.currentTimeMillis());
        messageChannel.send(MessageBuilder.withPayload(email)
                .setHeader(MessageHeaders.CONTENT_TYPE, MimeTypeUtils.APPLICATION_JSON).build());
    }

    @StreamListener(EmailQueues.INPUT)
    public void handleEmail(@Payload Email email) {
        System.out.println("received: " + email.getMessage());
    }
}
I'm not sure whether one of the inherited configuration projects using Spring Cloud or Spring Cloud Sleuth might be preventing it from working, but even when I remove it, it still doesn't work. Unlike my application that does work with the above code, I never see the ConsumerConfig being logged, e.g.:
o.a.k.clients.consumer.ConsumerConfig : ConsumerConfig values:
    auto.commit.interval.ms = 100
    auto.offset.reset = latest
    bootstrap.servers = [localhost:9092]
    check.crcs = true
    client.id = consumer-2
    connections.max.idle.ms = 540000
    enable.auto.commit = false
    exclude.internal.topics = true
(This is what I see in my basic Spring Boot application when running the above code, and there writing to and reading from the Kafka channel works.)
I assume some other Spring Boot configuration from one of the libraries I'm using is creating a different type of channel; I just cannot find what that configuration is.
What you posted contains a lot of unrelated configuration, so it is hard to determine whether anything gets in the way. Also, when you say "it appears that the messages never get sent or received", are there any exceptions in the logs? Please also state the version of Kafka you're using as well as the version of Spring Cloud Stream.
Now, I did try to reproduce it based on your code (after cleaning up a bit to only leave relevant parts) and was able to successfully send/receive.
My Kafka version is 0.11 and Spring Cloud Stream 2.0.0.
Here is the relevant code:
spring:
  cloud:
    stream:
      kafka:
        binder:
          brokers: localhost:9092
      bindings:
        email-in:
          destination: email
        email-out:
          destination: email
@SpringBootApplication
@EnableBinding(KafkaQuestionSoApplication.EmailQueues.class)
public class KafkaQuestionSoApplication {

    public static void main(String[] args) {
        SpringApplication.run(KafkaQuestionSoApplication.class, args);
    }

    @Bean
    public ApplicationRunner runner(EmailQueues emailQueues) {
        return new ApplicationRunner() {
            @Override
            public void run(ApplicationArguments args) throws Exception {
                emailQueues.outboundEmails().send(new GenericMessage<String>("Hello"));
            }
        };
    }

    @StreamListener(EmailQueues.INPUT)
    public void handleEmail(String payload) {
        System.out.println("received: " + payload);
    }

    public interface EmailQueues {

        String INPUT = "email-in";
        String OUTPUT = "email-out";

        @Input(INPUT)
        SubscribableChannel inboundEmails();

        @Output(OUTPUT)
        MessageChannel outboundEmails();
    }
}
Okay, so after a lot of debugging I discovered that something was creating a Test Support Binder (how, I don't know yet), which is obviously meant to keep messages from being added to a real channel.
After adding
@SpringBootApplication(exclude = TestSupportBinderAutoConfiguration.class)
the Kafka channel configuration works and messages are being added. It would be interesting to know what on earth is setting up this test support binder; I'll find that sucker eventually.
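A likely culprit, offered as an assumption rather than something confirmed here: the spring-cloud-stream-test-support artifact auto-configures that test binder whenever it sits on the main runtime classpath, so restricting it to the test scope would make the exclude unnecessary. A Maven sketch, if the project uses Maven:
<!-- Assumption: keeping the test binder off the runtime classpath
     stops TestSupportBinderAutoConfiguration from kicking in. -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-test-support</artifactId>
    <scope>test</scope>
</dependency>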
