How to handle small file issue in NiFi - hadoop

I'm referring to the article below:
https://community.cloudera.com/t5/Community-Articles/Create-Dynamic-Partitions-based-on-FlowFile-Con...
I'm trying to build a NiFi pipeline for data that arrives as a real-time stream (from Kafka, for example). When the data is written to a partitioned location in HDFS, it can end up as many small files, and I'm seeing slow query performance as a result. Can you suggest some approaches to resolve the small-files issue within NiFi itself, using the ORC file format?

Related

Storing small files in hdfs and archiving them in Nifi Flow

I have an issue with small files and HDFS.
Scenario: I am using NiFi to read messages from a Kafka topic, and these messages are all really small.
Requirement: store these raw messages in HDFS (for replay capability) before doing further processing on them.
I was thinking of running Hadoop Archive (HAR) on them periodically. Is that something I can do through NiFi? The har command seems like a command-line tool rather than something I could execute through NiFi. I would love to know a solution that meets my requirement without bringing down HDFS because of the small files.
Ginil
You can run command-line tools from within NiFi with the ExecuteProcess processor:
http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ExecuteProcess/
You can also take a look at Kafka Connect HDFS for putting Kafka records into HDFS.
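As a rough illustration of the ExecuteProcess suggestion, here is a minimal Python sketch that assembles the `hadoop archive` command line one might hand to the processor (its Command and Command Arguments properties). The archive name and the HDFS paths are hypothetical placeholders, not values from the question.

```python
import shlex


def build_har_command(archive_name, parent_dir, sources, dest_dir):
    """Return the argv list for: hadoop archive -archiveName NAME -p PARENT SRC... DEST."""
    if not archive_name.endswith(".har"):
        archive_name += ".har"  # archive names are expected to end in .har
    return [
        "hadoop", "archive",
        "-archiveName", archive_name,
        "-p", parent_dir,   # parent path that the sources are relative to
        *sources,           # one or more source directories under the parent
        dest_dir,           # HDFS directory where the .har is created
    ]


# Hypothetical example: archive one day's worth of raw Kafka messages.
cmd = build_har_command("raw-2018-06-01", "/data/raw", ["2018-06-01"], "/data/archive")
print(shlex.join(cmd))
```

Building the command as a list keeps the arguments separate, which avoids quoting problems if a path ever contains spaces.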

Streaming live data from HDFS to Hive

I am new to the Hadoop ecosystem and am teaching myself through online articles.
I am working on a very basic project so that I can get hands-on practice with what I have learned.
My use case is extremely simple: I want to show the app admin the location of each user who logs in to the portal. I have a server that continuously generates logs, and each log line has a user ID, IP address, and timestamp. All fields are comma separated.
My idea is to have a Flume agent stream the live log data and write it to HDFS, have a Hive process in place that reads incremental data from HDFS and writes it to a Hive table, and use Sqoop to continuously copy data from Hive to an RDBMS SQL table that I can then work with.
So far I have successfully configured a Flume agent that reads logs from a given location and writes them to an HDFS location. But after this I am confused about how I should move data from HDFS to a Hive table. One idea that comes to mind is a MapReduce program that reads the files in HDFS and writes to Hive tables programmatically in Java. But I also want to delete files that have already been processed and make sure that MapReduce reads no duplicate records. I searched online and found a command that can copy file data into Hive, but that is a manual, one-time activity. In my use case I want to push data as soon as it's available in HDFS.
Please guide me how to achieve this task. Links will be helpful.
I am working on Version: Cloudera Express 5.13.0
Update 1:
I just created an external Hive table pointing to the HDFS location where Flume is dumping logs. I noticed that as soon as the table is created, I can query it and fetch data. This is awesome. But what will happen if I stop the Flume agent for a while, let the app server keep writing logs, and then start Flume again? Will Flume read only the new logs and ignore the ones it has already processed? Similarly, will Hive read only the new logs and ignore the ones it has already processed?
"How should I move data from HDFS to a Hive table?"
This isn't how Hive works. Hive is a metadata layer over existing HDFS storage. In Hive, you would define an EXTERNAL TABLE over wherever Flume writes your data.
As data arrives, Hive "automatically knows" that there is new data to be queried, since it reads all files under the given path.
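To make the external-table idea concrete, here is a minimal Python sketch that generates the HiveQL DDL for a comma-separated log directory. The table name, column names, column types, and HDFS path are all assumptions for illustration; only the three fields (user ID, IP address, timestamp) come from the question.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for delimited text files."""
    cols = ",\n  ".join(f"{name} {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical names: adjust to wherever the Flume sink actually writes.
ddl = external_table_ddl(
    "login_logs",
    [("user_id", "STRING"), ("ip_address", "STRING"), ("event_time", "STRING")],
    "/flume/login_logs",
)
print(ddl)
```

Because the table is external, dropping it later removes only the metadata and leaves the files that Flume wrote in place.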
"What will happen if I stop the Flume agent for a while, let the app server write logs, and then start Flume again? Will Flume read only new logs and ignore logs that are already processed?"
It depends on how you've set up Flume. AFAIK, it will checkpoint all processed files and only pick up new ones.
"Will Hive read new logs that are not processed and ignore the ones it has already processed?"
Hive has no concept of unprocessed records. All files in the table location will always be read, limited by your query conditions, upon each new query.
Bonus: Remove Flume and Sqoop. Make your app produce records into Kafka. Have Kafka Connect (or NiFi) write to both HDFS and your RDBMS from a single location (the Kafka topic). If you actually need to read log files, Filebeat or Fluentd use fewer resources than Flume (or Logstash).
Bonus 2: Remove HDFS & RDBMS and instead use a more real-time ingestion pipeline like Druid or Elasticsearch for analytics.
Bonus 3: Presto / SparkSQL / Flink-SQL are faster than Hive (note: the Hive metastore is actually useful, so keep the RDBMS around for that)

Oracle to Oracle data pipeline using Apache NiFi

In our project we load data from one Oracle database into another Oracle database and run some batch-level analytics on it.
Right now this is done via PL/SQL jobs that pull three years of data into the destination DB.
I have been given the task of automating the flow using Apache NiFi.
Cluster info:
1. Apache Hadoop cluster of 5 nodes
2. All of the software in use is open source.
I have tried creating a flow with the processors QueryDatabaseTable -> PutDatabaseRecord, but as far as I know QueryDatabaseTable outputs Avro format.
Please suggest how to convert the data and what the processor sequence should be. I also need to handle incremental loads/change data capture.
Thanks in advance :)
PutDatabaseRecord configured with an Avro reader will be able to read the Avro produced by QueryDatabaseTable.
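On the incremental-load part of the question: QueryDatabaseTable has a Maximum-value Columns property that tracks the largest value seen so far and fetches only rows above it. The sketch below illustrates that high-water-mark idea in Python, using an in-memory SQLite database as a stand-in for Oracle. The `orders` table and its columns are hypothetical, and this is not NiFi's actual implementation.

```python
import sqlite3


def fetch_new_rows(conn, table, max_value_column, last_seen):
    """Return rows whose max_value_column exceeds last_seen, plus the new high-water mark."""
    # Table and column names are interpolated for brevity; use only trusted identifiers.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {max_value_column} > ? ORDER BY {max_value_column}"
    )
    rows = conn.execute(query, (last_seen,)).fetchall()
    new_mark = rows[-1][max_value_column] if rows else last_seen
    return rows, new_mark


# Stand-in source database with a hypothetical table.
source = sqlite3.connect(":memory:")
source.row_factory = sqlite3.Row  # allows access to columns by name
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

rows, mark = fetch_new_rows(source, "orders", "id", 0)
first_batch = [r["id"] for r in rows]

source.execute("INSERT INTO orders VALUES (3, 30.0)")
rows, mark = fetch_new_rows(source, "orders", "id", mark)
second_batch = [r["id"] for r in rows]
print(first_batch, second_batch, mark)
```

This approach only picks up inserts (and updates that bump the tracked column); deletes need a different change-data-capture mechanism.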

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a total beginner with Hadoop, so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure if my solution is a valid one.
My plan was to move the log files to HDFS, create an external table on the directory and an internal table, and copy the data via Hive (insert into ... select from) from the external table to the internal table (with some filtering because of the comment lines beginning with #).
Once the data is stored in the internal table, I delete the previously moved files from HDFS.
Technically it works, I have already tried it, but is this a reasonable approach?
And if yes, how would I automate these steps, since so far I have done everything manually via Ambari?
Thanks for your input
BW
Yes, this is a perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop related).
Note, if it's only log file collection that you care about, I would probably have used Elasticsearch instead of Hadoop to store data, Filebeat to continuously watch log files, Logstash to apply per-message level filtering, and Kibana to do visualizations. If combining Elasticsearch for fast indexing/searching and Hadoop for archival, you can insert Kafka between the log message ingestion and message writers/consumers
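The comment-line filtering described in the question (dropping IIS header lines that begin with #) can also be done before the data reaches Hive. The following is a minimal Python sketch of that step; the sample log lines are made up for illustration.

```python
def strip_comment_lines(lines):
    """Drop IIS/W3C header lines, which start with '#', and keep data lines."""
    return [line for line in lines if not line.startswith("#")]


# Made-up sample in the W3C extended log format that IIS writes.
sample = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2018-06-01 12:00:00 GET /index.html 200",
    "2018-06-01 12:00:01 GET /missing 404",
]
data_lines = strip_comment_lines(sample)
print(data_lines)
```

In Hive itself, the same filter would go in the WHERE clause of the insert ... select statement that copies rows from the external table to the internal one.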

Getting data in and out of Hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they get it? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?
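The second option raised above (buffer locally, then submit once a size threshold is reached) can be sketched as follows. This is a minimal illustration in Python; the `flush` callback is a hypothetical stand-in for whatever actually uploads the batch to HDFS.

```python
class SizeRollingBuffer:
    """Accumulate records and hand them off in one batch once a byte threshold is hit."""

    def __init__(self, threshold_bytes, flush):
        self.threshold_bytes = threshold_bytes
        self.flush = flush      # callable that receives the combined bytes
        self._chunks = []
        self._size = 0

    def append(self, record: bytes):
        self._chunks.append(record)
        self._size += len(record)
        if self._size >= self.threshold_bytes:
            self.roll()

    def roll(self):
        """Flush whatever is buffered (call this on shutdown or on a timer as well)."""
        if self._chunks:
            self.flush(b"".join(self._chunks))
            self._chunks = []
            self._size = 0


# Usage: collect flushed batches in a list instead of uploading them.
flushed = []
buf = SizeRollingBuffer(threshold_bytes=10, flush=flushed.append)
for rec in [b"aaaa\n", b"bbbb\n", b"cccc\n"]:
    buf.append(rec)
buf.roll()  # flush the remainder
print(flushed)
```

A real collector would usually pair the size threshold with a time limit, so that a quiet server still ships its logs eventually.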
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly. It's easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can import data directly from your applications. Here's a Java example to post logs against Fluentd.
Fluentd: Data Import from Java Applications
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output results to a "processed_logs_#{timestamp}" folder.
Another job can later take these processed logs and push them to a database etc. so they can be queried online.
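The timestamped output folder mentioned above could be named along these lines. This is a small Python sketch; the base path and the exact timestamp format are assumptions, not details from the answer.

```python
from datetime import datetime, timezone


def processed_logs_dir(base, now=None):
    """Return a per-run output folder such as <base>/processed_logs_<UTC timestamp>."""
    now = now or datetime.now(timezone.utc)
    return f"{base}/processed_logs_{now:%Y%m%dT%H%M%SZ}"


# Fixed time so the example is reproducible; a real job would omit `now`.
example_dir = processed_logs_dir("/logs", datetime(2018, 6, 1, 12, 0, 0, tzinfo=timezone.utc))
print(example_dir)
```

A sortable timestamp makes it easy for the downstream job to pick up folders in the order they were produced.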
I'd recommend using Flume to collect the log files from your servers into HDFS.
