Change single to cluster hadoop installation keeping persisted data - hadoop

I'm going to do a Hadoop POC in a production environment. The POC consists of:
1. Receive lots of (real life) events
2. Accumulate them to have a set of events with enough size
3. Persist the set of events in a single file HDFS
In case the POC is successful, I want to install a cluster environment but I need to keep the data persisted in the single cluster installation (POC).
Then, the question: How difficult is to migrate the data already persisted in HDFS single cluster to a real cluster HDFS environment?
Thanks in advance (and sorry for my bad english)
Regards

You don't need to migrate anything.
If you're running Hadoop in Pseudo distributed mode, all you need to do is add datanodes that are pointing at your existing namenode and that's it!
I would like to point out
Persist the set of events in a single file HDFS
I'm not sure about making "a single file", but I suggest you do periodic checkpointing. What if the stream fails? How do you catch dropped events? Spark, Flume, Kafka Connect, NiFi, etc can allow you to do this.
And if all you're doing is streaming events, and want to store them for a variable time period, then Kafka is more built for that use case. You don't necessarily need Hadoop. Push events to Kafka, consume them where it makes sense, for example, a search engine or a database (Hadoop is not a database)

Related

Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
We have applications that produce log files ranging from a few MB to hundreds of MB per file.
We do not want to stream all the log events as they occur.
Pushing the log files in their entirety after they have written completely is what we are looking for (written completely = got moved into another folder for example... this is not a problem for us).
This should be handled by some kind of lightweight agents on the edge nodes to the HDFS directly or - if necessary - an intermediate "sink" that will push the data to HDFS afterwards.
Centralized Pipeline Management (= configuring all edge nodes in a centralized manner) would be great
I came up with the following evaluation:
Elastic's Logstash and FileBeats
Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
Configuration is easy, WebHDFS output sink exists for Logstash (using FileBeats would require an intermediate solution with FileBeats + Logstash that outputs to WebHDFS)
Both tools are proven to be stable in production-level environments
Both tools are made for tailing logs and streaming these single events as they occur rather than ingesting a complete file
Apache NiFi w/ MiNiFi
The use case of collecting logs and sending the entire file to another location with a broad number of edge nodes that all run the same "jobs" looks predestined for NiFi and MiNiFi
MiNiFi running on the edge node is lightweight (Logstash on the other hand is not so lightweight)
Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
Centralized pipeline management within the NiFi UI
writing to a HDFS sink is available out-of-the-box
Community looks active, development is lead by Hortonworks (?)
We have made good experiences with NiFi in the past
Apache Flume
writing to a HDFS sink is available out-of-the-box
Looks like Flume is more of a event-based solution rather than a solution for streaming entire log files
No centralized pipeline management?
Apache Gobblin
writing to a HDFS sink is available out-of-the-box
No centralized pipeline management?
No lightweight edge node "agents"?
Fluentd
Maybe another tool to look at? Looking for your comments on this one...
I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.
Have I forgotten any broadly used tool that is able to solve this use case?
I experience similar pain when choosing open source big data solutions, simply that there are so many paths to Rome. Though "asking for technology recommendations is off topic for Stackoverflow", I still want to share my opinions.
I assume you already have a hadoop cluster to land the log files. If you are using an enterprise ready distribution e.g. HDP distribution, stay with their selection of data ingestion solution. This approach always save you lots of efforts in installation, setup centrol managment and monitoring, implement security and system integration when there is a new release.
You didn't mention how you would like to use the log files once they lands in HDFS. I assume you just want to make an exact copy, i.e. data cleansing or data trasformation to a normalized format is NOT required in data ingestion. Now I wonder why you didn't mention the simplest approach, use a scheduled hdfs commands to put log files into hdfs from edge node?
Now I can share one production setup I was involved. In this production setup, log files are pushed to or pulled by a commercial mediation system that makes data cleansing, normalization, enrich etc. Data volume is above 100 billion log records every day. There is an 6 edge nodes setup behind a load balancer. Logs are firstly land on one of the edge nodes, then hdfs command put to HDFS. Flume was used initially but replaced by this approach due to performance issue.(it can very likely be that engineer was lack of experience in optimizing Flume). Worth to mention though, the mediation system has a managment UI for scheduling ingestion script. In your case, I would start with cron job for PoC then use e.g. Airflow.
Hope it helps! And would be glad to know your final choice and your implementation.

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build architecture for following use case:
I have an Web application where users can upload files(pdf&pptx) and directories to be processed. After upload is complete web application put this files and directories in HDFS, then send a messages on kafka with path to this files.
Spark Application read messages from kafka streaming, collect them on master(driver), and after that process them. I collect messages first because i need to move the code to data, and not move data where the message is received. I understood that spark assign job to executor which already have file locally.
I have issues with kafka because i was forced to collect them first for the above reason, and when want to create checkpoint app crash "because you are attempting to reference SparkContext from a broadcast variable" even if the code run before adding checkpointing( I use sparkContext there because i need to save data to ElasticSearch and PostgreSQL. I don't know how exactly i can do code upgrading in this conditions.
I read about hadoop small files problems, and I understand what problems are in this case. I read that HBase is a better solution to save small files than just save in hdfs. Other problem in hadoop small files problems is big number of mappers and reducers created for computation, but i don't understand if this problem there in spark.
What is the best architecture for this use case?
How to do Job Scheduling? It's kafka good for that? or I need to use other service like rabbitMQ or something else?
Exist some method to add jobs to an running Spark application through some REST API?
How is the best way to save files? Is better to use Hbase because i have small files(<100MB)? Or I need to use SequenceFile? I think SequenceFile isn't for my use case because i need to reprocess some files randomly.
What is the best architecture do you think for this use case?
Thanks!
There is no one single "the best" way to build architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible, supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure source that monitors the directory and extracts the newly received files.
For data processing/transformation - it depends what task you are solving. You probably decided on Spark Streaming. Spark streaming can be integrated with Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html) There are other options available, e.g. Apache Storm. Flume combines very well with Storm. Some transformations can also be applied in Flume.
For data archival - do not store/archive the files directly in Hadoop, unless they are bigger than few hundredths of megabytes. One solution would be to put them in HBase.
Make your architecture more flexible. I would place processed files in a temporary HDFS location and have some job regualarly archive them into zip, HBase, Hadoop Archive (there is such an animal) or any other solution.
Consider using Apache NiFi (aka HDF - Hortonworks Data Flow). It uses internally queues, provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is nice Hortonworks tutorial which , combined with HDP Sandbox running on a virtual machine/Docker, can bring you up to speed in very short time (1-2 hours?).

Does Apache Kafka Store the messages internally in HDFS or Some other File system

We have a project requirement of testing the data at Kafka Layer. So JSON files are moving into hadoop area and kafka is reading the live data in hadoop(Raw Json File). Now I have to test whether the data sent from the other system and read by kafka should be same.
Can i validate the data at kafka?. Does kafka store the messages internally on HDFS?. If yes then is it stored in a file structure similar to what hive saves internally just like a single folder for single table.
Kafka stores data in local files (ie, local file system for each running broker). For those files, Kafka uses its own storage format that is based on a partitioned append-only log abstraction.
The local storage directory, can be configured via parameter log.dir. This configuration happens individually for each broker, ie, each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered-storage, that will allow brokers to no only use local disks, but to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed, is a Kafka internal implementation detail. Thus you should now rely on it. To get the current state of your cluster, you can request meta data about topics and partitions etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for an code example). Also keep in mind, that partitions are replicated and if you write, you always need to write to the partition leader (if you create a KafkaProducer is will automatically find the leader for each partition you write to).
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do that manually. You can let kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the followings:
Assuming you have all servers are running (check the confluent
website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: The approach assumes that you are using their platform
(confluent platform) which I haven't use.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens with most of the beginner. Let's first understand that component you see in Big Data processing may not be at all related to Hadoop.
Yarn, MapReduce, HDFS are 3 main core component of Hadoop. Hive, Pig, OOOZIE, SQOOP, HBase etc work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop, they are independent entities. Spark supports Hadoop, like Yarn, can be used for Spark's Cluster mode, HDFS for storage.
Same way Kafka as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and much interested in Hadoop Administration,so i tried to install Hadoop 2.2.0 in Ubuntu 12.04 as pseudo distributed mode and installed successfully and run some example jar files also ,now i am trying learn further ,trying to learn data back up and recovery part now,can anyone tell ways to take data back back up and recovery it in hadoop 2.2.0 ,and also please suggest any good books for Hadoop Adminstration and steps to learn Hadoop Adminstration.
Thanks in Advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economic to backup to disk, rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you can create one manually yourself. You would need some mechanism of accessing the data (NFS gateway, webFS, etc), and could then use tape libraries, VTLs, etc. to create backups.

Hadoop Disaster Recovery

Anyone could help me to know about Hadoop Disaster Recovery ?
should i replicate data from cluster to another cluster as backup use distcp?
Or i can use copyToLocal to copy my data to my localmachine ?
Anyone idea about it ?
DRP plan goes beyond just the technology and the requirements can greatly affect the solution.
for instance if you can't afford to lose any data you'd want an active/active setup and send data to two hadoop clusters simultaneously. on the other side of the spectrum hadoop's replication (default is 3 copies but you can change that) and rack awareness can give you a copy on a secondary rack. In between you can use things like distcp that you mention to copy data from cluster to cluster.
Additionally you might want to follow project falcon which is a new initiative for hadoop data life-cycle management

Resources