Getting data in and out of Hadoop

I need a system to analyze large log files. A friend pointed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit the temp file once it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs against that same file at the same time?

The Fluentd log collector recently released its WebHDFS plugin, which lets users stream data into HDFS as it arrives. It is easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can also import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
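To give a flavour of that route, here is a minimal hedged sketch using the fluent-logger-java library; the tag, host, port, and record fields are assumptions, and the fluentd agent is assumed to be configured with the WebHDFS output plugin.

import java.util.HashMap;
import java.util.Map;
import org.fluentd.logger.FluentLogger;

public class AccessLogSender {
    // Assumes a local fluentd agent listening on the default forward port 24224
    private static final FluentLogger LOG = FluentLogger.getLogger("app", "localhost", 24224);

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("path", "/index.html");
        record.put("status", 200);
        // Emitted with tag "app.access"; a fluentd webhdfs match section then streams it on to HDFS
        LOG.log("access", record);
    }
}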

A Hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.
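For example, the job driver can point a single job at a whole directory or glob of closed log files; a minimal hedged sketch (the paths are hypothetical and the mapper/reducer classes are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log analysis");
        job.setJarByClass(LogAnalysisDriver.class);
        // Mapper and reducer classes omitted for brevity
        // A glob pulls in every closed log file under the directory; you can also
        // call FileInputFormat.addInputPath repeatedly, once per node's upload directory
        FileInputFormat.setInputPaths(job, new Path("/logs/incoming/*.log"));
        FileOutputFormat.setOutputPath(job, new Path("/logs/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}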

HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
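In driver terms that is just the job's output path; a hedged fragment continuing the kind of driver sketched in the previous answer (the folder naming is an assumption):

// Hypothetical fragment of the driver above: each periodic run writes to its own timestamped folder
String ts = new java.text.SimpleDateFormat("yyyyMMdd_HHmm").format(new java.util.Date());
FileOutputFormat.setOutputPath(job, new Path("processed_logs_" + ts));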

I'd recommend using Flume to collect the log files from your servers into HDFS.

Related

Storing small files in HDFS and archiving them in a NiFi flow

I have an issue with small files and HDFS.
Scenario: I am using NiFi to read messages from a Kafka topic, and these messages are all really small.
Requirement: store these raw messages in HDFS (for replay capability) before doing further processing on them.
I was thinking of using Hadoop Archive (HAR) on them periodically. Is that something I can do through NiFi? The har command seems like a command-line tool rather than something I could execute through NiFi. I would love to know a solution that achieves my requirement without bringing down HDFS due to the small files.
Ginil
You can execute a command line inside NiFi with the ExecuteProcess processor:
http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ExecuteProcess/
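ExecuteProcess essentially just shells out, so the command you would configure there is the same one you could test from plain Java first. A hedged sketch (the archive name and paths are made up):

import java.io.IOException;

public class HarRoller {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Packs everything under /data/raw_messages/2024-01-01 into a single HAR file in /data/archive
        Process p = new ProcessBuilder(
                "hadoop", "archive",
                "-archiveName", "raw_2024-01-01.har",
                "-p", "/data/raw_messages", "2024-01-01",
                "/data/archive")
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}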
You can also take a look at the Kafka Connect HDFS connector for putting Kafka records into HDFS.

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a total beginner with Hadoop, so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure whether my solution is a valid one.
So my thought was to move the log files to HDFS, create an external table on the directory plus an internal table, and copy the data via Hive (insert into ... select from) from the external table to the internal table (with some filtering because of the comment lines beginning with #).
Once the data is stored in the internal table, I delete the previously moved files from HDFS.
Technically it works, I have already tried it, but is this a reasonable approach?
And if so, how would I automate these steps, since for now I did everything manually via Ambari?
Thanks for your input
BW
Yes, this is a perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop related).
Note: if it's only log file collection that you care about, I would probably use Elasticsearch instead of Hadoop to store the data, Filebeat to continuously watch the log files, Logstash to apply per-message filtering, and Kibana to do visualizations. If you combine Elasticsearch for fast indexing/searching with Hadoop for archival, you can insert Kafka between the log message ingestion and the message writers/consumers.

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build an architecture for the following use case:
I have a web application where users can upload files (PDF and PPTX) and directories to be processed. After the upload is complete, the web application puts these files and directories in HDFS, then sends a message on Kafka with the path to these files.
The Spark application reads the messages from Kafka via Spark Streaming, collects them on the master (driver), and after that processes them. I collect the messages first because I need to move the code to the data rather than move the data to where the message is received. I understood that Spark assigns tasks to executors that already have the file locally.
I have issues with Kafka because I was forced to collect the messages first for the above reason, and when I want to create a checkpoint the app crashes "because you are attempting to reference SparkContext from a broadcast variable", even though the code ran fine before adding checkpointing. (I use sparkContext there because I need to save data to Elasticsearch and PostgreSQL.) I don't know exactly how I can do code upgrades under these conditions.
I read about the Hadoop small files problem, and I understand what the problems are in this case. I read that HBase is a better solution for saving small files than just saving them in HDFS. Another aspect of the small files problem is the large number of mappers and reducers created for the computation, but I don't understand whether this problem exists in Spark.
What is the best architecture for this use case?
How should I do job scheduling? Is Kafka good for that, or do I need to use another service like RabbitMQ or something else?
Is there a way to add jobs to a running Spark application through some REST API?
What is the best way to save the files? Is it better to use HBase because I have small files (<100 MB)? Or should I use SequenceFiles? I think SequenceFile isn't right for my use case because I need to reprocess some files randomly.
What do you think is the best architecture for this use case?
Thanks!
There is no single "best" way to build an architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider the following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As the retrieval component, I would use Flume. It is flexible and supports a lot of sources, channels (including Kafka), and sinks. In your case you can configure a source that monitors a directory and picks up newly received files.
For data processing/transformation, it depends on what task you are solving. You have probably decided on Spark Streaming, and Spark Streaming can be integrated with a Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html). There are other options available, e.g. Apache Storm, and Flume combines very well with Storm. Some transformations can also be applied in Flume itself.
For data archival, do not store/archive the files directly in HDFS unless they are bigger than a few hundred megabytes. One solution would be to put them in HBase.
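A hedged sketch of that HBase route, assuming a pre-created table (the table name and column family are made up; very large payloads may need HBase MOB or chunking rather than a single cell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileStore {
    public static void storeFile(String path, byte[] content) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("uploaded_files"))) {
            // Row key = original file path; one cell in family "f" holds the raw bytes
            Put put = new Put(Bytes.toBytes(path));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content);
            table.put(put);
        }
    }
}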
Make your architecture more flexible. I would place the processed files in a temporary HDFS location and have some job regularly archive them into a zip, HBase, a Hadoop Archive (there is such an animal), or any other solution.
Consider using Apache NiFi (a.k.a. HDF, Hortonworks Data Flow). It uses queues internally and provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is a nice Hortonworks tutorial which, combined with the HDP Sandbox running on a virtual machine/Docker, can bring you up to speed in a very short time (1-2 hours?).

Spark Streaming unable to read files created by Flume in HDFS

I have created a real-time application in which I am writing data streams from weblogs to HDFS using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. By contrast, as you have experienced, if you put a batch of data (that's your file, a batch, not a stream), once it is uploaded it is ready to be read.
Anyway, not being an expert on Spark Streaming, it seems from the Overview section of the Spark Streaming Programming Guide that you are not using the intended deployment. From the picture shown there, the stream (in this case generated by Flume) should be sent directly to the Spark Streaming engine; the results are then written to HDFS.
Nevertheless, if you want to keep your deployment, i.e. Flume -> HDFS -> Spark, then my suggestion is to create mini-batches of data in temporary HDFS folders: once a mini-batch is ready, start storing new data in a second mini-batch and pass the first one to Spark for analysis.
HTH
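A hedged sketch of that hand-off in Java, assuming Flume writes into a staging directory and Spark watches a separate incoming directory (both paths are made up). An HDFS rename is atomic, so Spark Streaming only ever sees files that are already complete:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchPromoter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Promote every closed file; the Flume HDFS sink keeps a .tmp suffix on files it is still writing
        for (FileStatus status : fs.listStatus(new Path("/flume/staging"))) {
            if (!status.getPath().getName().endsWith(".tmp")) {
                fs.rename(status.getPath(), new Path("/spark/incoming", status.getPath().getName()));
            }
        }
        fs.close();
    }
}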
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an AvroSink that points at your Spark Streaming instance.
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs) and then write to HDFS once at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])

Hadoop with Hive

We want to develop a simple Java EE web application that does log file analysis using Hadoop. The following is the approach we are following to develop the application, but we are unable to get all the way through it.
Log files would be uploaded to the Hadoop server from client machines using SFTP/FTP.
Call a Hadoop job to fetch the log file and process it into the HDFS file system.
While processing the log file, the content will be stored in a Hive database.
Search the log content by using a Hive JDBC connection from the client web application.
We browsed many samples to fulfil some of the steps, but we could not find any concrete sample application.
Please suggest whether the above approach is correct or not, and point us to a sample application developed in Java.
I would point out a few things:
a) You need to merge log files, or in some other way make sure that you do not have too many of them. Consider Flume (http://flume.apache.org/), which is built to accept logs from various sources and put them into HDFS.
b) If you go with FTP, you will need some scripting to take data from the FTP server and put it into HDFS.
c) The main problem I see is running a Hive job as the result of a client's web request. A Hive query is not interactive; it will take at least dozens of seconds, and probably much more.
I would also be wary of concurrent requests; you probably cannot run more than a few in parallel.
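For the search step the questioner mentions, the query itself is plain JDBC against HiveServer2; a minimal hedged sketch (the URL, credentials, table and column names are all assumptions), keeping the latency caveat above in mind:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LogSearch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 URL, credentials, table and column names below are assumptions for this sketch
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT log_time, message FROM web_logs WHERE message LIKE ?")) {
            stmt.setString(1, "%" + args[0] + "%");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("log_time") + " " + rs.getString("message"));
                }
            }
        }
    }
}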
In my opinion, you can do one thing:
1) Instead of accepting logs from various sources and putting them into HDFS, you can put them into one database, say SQL Server, and from there import your data into Hive (or HDFS) using Sqoop.
2) This will reduce your effort in writing the various jobs to bring the data into HDFS.
3) Once the data is in Hive, you can do whatever you want.
