How can I bring data from static websites to HDFS? (hadoop)

What other frameworks are available for this, such as Spring XD or Flume? Which one is the best among them? Please advise on the steps to bring the data in.

Using Nutch
Using Kafka/Flume
Using Spring XD
Using a scraper such as import.io
Writing a Java producer/consumer program (see the sketch below)
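For the last option, here is a minimal sketch of pulling a static page over HTTP and streaming it into HDFS with the Hadoop FileSystem API. The URL, NameNode address, and target path are placeholders, not values from the question.

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WebToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = new URL("https://example.com/page.html").openStream();
             FSDataOutputStream out = fs.create(new Path("/data/web/page.html"))) {
            // Stream the page body into the HDFS file in 4 KB chunks.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```

For recurring ingestion you would schedule this (or a Flume/Nutch pipeline) rather than running it by hand.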

Related

Spring Batch and Kafka

I am a junior programmer in banking. I want to build a microservice system that gets data from Kafka and processes it, then saves the result to a database and sends the final data to a client app. What technology can I use? I plan to use Spring Batch and Kafka. Can this technology be implemented in my project, or is there a better alternative?
To process data from a Kafka topic, I recommend using the Kafka Streams API, in particular Spring Kafka Streams.
Kafka Streams and Spring
And to store the data in a database, you should use a Kafka Sink Connector.
Kafka Connect
This approach is very common and straightforward if your company has a Kafka ecosystem.
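As a minimal illustration of the Kafka Streams route, here is a sketch using Spring for Apache Kafka's @EnableKafkaStreams support. The topic names and the transformation are placeholders; Spring Boot builds the topology from the StreamsBuilder once the spring.kafka.streams.* properties (application id, bootstrap servers, default serdes) are configured.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafkaStreams;

@Configuration
@EnableKafkaStreams
public class PaymentStreamConfig {

    @Bean
    public KStream<String, String> process(StreamsBuilder builder) {
        // Read from an input topic (placeholder name).
        KStream<String, String> input = builder.stream("payments-in");
        // Placeholder processing step: normalize the payload,
        // then write the result to an output topic.
        input.mapValues(value -> value.trim().toUpperCase())
             .to("payments-out");
        return input;
    }
}
```

From there, a sink connector (as noted above) can move "payments-out" into the database without any further application code.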
In terms of alternatives, you will find an interesting comparison here:
https://scramjet.org/blog/welcome-to-the-family
3-in-1 serverless
Scramjet takes a slightly different approach: three platforms in one.
Both the free product for installation on your own server (https://hub.scramjet.org/) and the cloud platform are available; the latter is currently also free in its beta version (https://scramjet.org/#join-beta).

Spring Boot and Apache Kafka complex examples

I am looking for examples of complex Spring Boot and Apache Kafka topics/scenarios. Whatever I found on the web was very basic and all similar demos.
Does anyone have a Spring Boot and Apache Kafka example?
I was also looking for solid examples, and I could not find anything more concrete than "hello world"... Then I realized that I had been overlooking Confluent's documentation.
How to Work with Apache Kafka in Your Spring Boot Application
This is literally "hello world", but it might be a good warm-up for beginners.
Spring for Apache Kafka Deep Dive – Part 1: Error Handling, Message Conversion and Transaction Support
Spring for Apache Kafka Deep Dive – Part 2: Apache Kafka and Spring Cloud Stream
Spring for Apache Kafka Deep Dive – Part 3: Apache Kafka and Spring Cloud Data Flow
Spring for Apache Kafka Deep Dive – Part 4: Continuous Delivery of Event Streaming Pipelines
I think the Confluent Platform blog is a real hidden gem for developers. There are lots of interesting topics besides examples/tutorials. To name a few of them:
Spring for Apache Kafka – Beyond the Basics: Can Your Kafka Consumers Handle a Poison Pill?
Advanced Testing Techniques for Spring for Apache Kafka
How to Use Schema Registry and Avro in Spring Boot Applications
Please refer to this tutorial; it walks through the steps:
https://howtodoinjava.com/kafka/spring-boot-with-kafka/
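In the same spirit as that tutorial, a minimal Spring Boot consumer can be as small as the sketch below. The topic and group id are placeholders; spring-kafka's auto-configuration wires the listener container from the spring.kafka.* properties.

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class DemoListener {

    // Placeholder topic and group id; set spring.kafka.bootstrap-servers
    // in application.properties for the connection details.
    @KafkaListener(topics = "demo-topic", groupId = "demo-group")
    public void listen(String message) {
        // Replace with real processing; this just logs the payload.
        System.out.println("Received: " + message);
    }
}
```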

Data stream between a Kerberized Kafka cluster and a Hadoop cluster using Spring Boot

I have a streaming use case: I am developing a Spring Boot application that should read data from a Kafka topic and write it to an HDFS path. I have two distinct clusters, one for Kafka and one for Hadoop.
The application worked fine when the Kafka cluster had no Kerberos authentication and only Hadoop was Kerberized.
The issues started when both clusters were Kerberized; at that point I could only authenticate against one cluster at a time.
I did some analysis and googling but could not find much help.
My theory is that we cannot log in to / authenticate against two Kerberized clusters from the same JVM instance, because the REALM and KDC details we set in code are JVM-wide rather than client-specific.
It may also be that I did not use the proper APIs; I am very new to Spring Boot.
I know this can be done by setting up a cross-realm trust between the clusters, but I am looking for an application-level solution if possible.
I have a few questions:
Is it possible to log in to / authenticate against two separate Kerberized clusters from the same JVM instance? If so, please help me; the use of Spring Boot is preferred.
What would be the best solution to stream data from the Kafka cluster to the Hadoop cluster?
To answer the second question (the best way to stream data from the Kafka cluster to the Hadoop cluster):
Kafka's Connect API is designed for streaming integration of sources and targets with Kafka using just configuration files - no coding. The HDFS connector is what you want, and it supports Kerberos authentication. It is open source and available standalone or as part of Confluent Platform.
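As for the first question, one application-level approach that is often cited is to give the Kafka client its own per-client sasl.jaas.config while logging in to HDFS explicitly through Hadoop's UserGroupInformation, all in one JVM. The sketch below is assumption-heavy: it presumes the single krb5.conf visible to the JVM lists both realms and their KDCs, and every principal, keytab path, and address is a placeholder.

```java
import java.security.PrivilegedExceptionAction;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DualKerberosSketch {
    public static void main(String[] args) throws Exception {
        // Kafka side: per-client JAAS config, assumed realm KAFKA.EXAMPLE.COM.
        // This avoids relying on a JVM-wide java.security.auth.login.config.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.kerberos.service.name", "kafka");
        props.put("sasl.jaas.config",
                "com.sun.security.auth.module.Krb5LoginModule required "
                + "useKeyTab=true storeKey=true "
                + "keytab=\"/etc/security/keytabs/app-kafka.keytab\" "
                + "principal=\"app@KAFKA.EXAMPLE.COM\";");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Hadoop side: explicit keytab login against the assumed
        // HADOOP.EXAMPLE.COM realm, separate from the Kafka login.
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("fs.defaultFS", "hdfs://namenode:8020");
        hadoopConf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(hadoopConf);
        UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                "app@HADOOP.EXAMPLE.COM", "/etc/security/keytabs/app-hdfs.keytab");

        // All HDFS calls run inside doAs so they carry the Hadoop credentials.
        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
            try (FileSystem fs = FileSystem.get(hadoopConf)) {
                // ... poll the consumer and write records to an HDFS path ...
            }
            return null;
        });
    }
}
```

Whether this works in your environment depends on the krb5.conf assumption above; the Connect-based route remains the lower-effort option.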

How to get the Kafka bootstrap configuration setting from a Kafka connector

A question for Kafka experts:
Does anyone know how to get the worker's 'bootstrap.servers' config setting from either a SinkConnector or a SinkTask? Or how to get Kafka cluster information from a connector?
Thank you
AFAIK, the Connect API doesn't expose this information right now.
If you need this functionality, perhaps your best bet is to open a JIRA on the Apache Kafka project and explain the use case (i.e. what you are planning to do with this information).
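Until then, a common workaround is to duplicate the worker's value in the connector's own configuration and read it back in the task. In the sketch below, my.bootstrap.servers is a hypothetical key that whoever deploys the connector sets to match the worker config; it is not provided by the framework.

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BootstrapAwareSinkTask extends SinkTask {
    private String bootstrapServers;

    @Override
    public String version() {
        return "0.1.0";
    }

    @Override
    public void start(Map<String, String> props) {
        // The deployer duplicates the worker's bootstrap.servers under this
        // hypothetical key in the connector config; Connect passes connector
        // config entries through to the task.
        bootstrapServers = props.get("my.bootstrap.servers");
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // ... use bootstrapServers when the sink needs cluster information ...
    }

    @Override
    public void stop() {
    }
}
```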

Is Spring XD the right tool choice?

We're building an M2M IoT platform, and part of the ecosystem is a Big Data storage and analytics component.
The platform connects devices at one end and provides a streaming data output, using ActiveMQ to interface with the Big Data application layer.
I'm currently designing this middle layer, which accepts machine data, runs real-time processes, and stores the data in a Hadoop storage module.
From what I can see, Spring XD seems able to orchestrate this process from ingestion through filtering, processing, and analytics to export into Hadoop.
However, I do not know anyone who has done something like this. Has anyone here executed something similar? I need your feedback on the choice of tool for the middleware.
Spring XD works great with RabbitMQ; for ActiveMQ you can use the JMS connector (see the sketch below).
For more information, take a look at Spring Integration, which provides Spring XD's main underpinnings and has been around forever.
Spring XD runs on YARN or ZooKeeper, both of which are very solid.
I have seen it used for orchestration of big data in a few places.
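To make the JMS route concrete, here is a hedged sketch of Spring Integration (5.x Java DSL) consuming machine data from an ActiveMQ destination. The broker URL, queue name, and handler are placeholders; in Spring XD itself the equivalent would be a stream definition built from its jms source module.

```java
import javax.jms.ConnectionFactory;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.jms.dsl.Jms;

@Configuration
public class MachineDataIngestConfig {

    @Bean
    public ConnectionFactory connectionFactory() {
        // Placeholder broker address for the ActiveMQ interface layer.
        return new ActiveMQConnectionFactory("tcp://activemq-host:61616");
    }

    @Bean
    public IntegrationFlow machineDataFlow(ConnectionFactory connectionFactory) {
        return IntegrationFlows
                .from(Jms.messageDrivenChannelAdapter(connectionFactory)
                        .destination("machine.data"))
                // Placeholder processing step before handing off to
                // filtering/analytics and the Hadoop export stage.
                .handle(message -> System.out.println("Ingested: " + message.getPayload()))
                .get();
    }
}
```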
