NiFi failed to write to FileSystemRepository Stream - apache-nifi

I have a flow where I am using the GetFile processor. The input directory is a network mount point. When I test the flow on small files (less than 1 GB), it works well. When I test it on bigger files (more than 1 GB), I get the following error:
GetFile[id=f1a533fd-1959-16d3-9579-64e64fab1ac6] Failed to retrieve
files due to
org.apache.nifi.processor.exception.FlowFileAccessException: Failed to
import data from /path/to/directory for
StandardFlowFileRecord[uuid=f8389032-c6f5-43b9-a0e3-7daab3fa115a,claim=,offset=0,name=490908299598990,size=0]
due to java.io.IOException: Failed to write to FileSystemRepository
Stream [StandardContentClaim
[resourceClaim=StandardResourceClaim[id=1486976827205-28,
container=default, section=28], offset=0, length=45408256]]
Do you have any idea about the origin of this error?
Thank you for your answers.

Based on the comments, the answer in this specific case was found by Andy and confirmed by the asker:
The content repositories were too small in proportion to the file size.
Another thing for future readers to check is whether the memory of the NiFi node is large enough to hold individual messages.
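For future readers, the relevant knobs live in nifi.properties; a minimal sketch of the settings worth reviewing (the path and values below are illustrative, not recommendations):
# Location of the content repository; point it at a disk large enough for your biggest files
nifi.content.repository.directory.default=/data/nifi/content_repository
# When disk usage exceeds this percentage, archived content claims are purged
nifi.content.repository.archive.max.usage.percentage=50%
# How long archived content claims are retained before removal
nifi.content.repository.archive.max.retention.period=12 hours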

Although the answer provided by Dennis correctly analyses the root cause, it does not offer a solution, so let me provide one.
Answer/Solution for Containers
Since you can't specify a size for a Docker volume, volumes are of no use for this task if you are lacking the required space for your FlowFile content.
Instead, I recommend using bind mounts. This way you can (theoretically) use all of your machine's disk.
# Create a folder where to locate the NiFi Flowfiles content in the host
mkdir -p /tmp/data/nifi/nifi-data-content_repository
Then edit your docker-compose.yml to change the type of storage to be used. Specifically, look for the content repository volume:
- type: volume
source: nifi-data-content_repository
target: /opt/nifi/nifi-current/content_repository
And replace it with a bind mount targeting the folder we just created above:
- type: bind
source: /tmp/data/nifi/nifi-data-content_repository
target: /opt/nifi/nifi-current/content_repository
That's it: you can now re-deploy NiFi with a mount that can use your host's disk space.
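For example (the service name here is an assumption; adjust the commands to your setup):
# Recreate the NiFi container so it picks up the new bind mount
docker compose up -d --force-recreate nifi
# Verify the mount from inside the container
docker exec -it nifi df -h /opt/nifi/nifi-current/content_repository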
Read more on bind mounts in the official docs.

Related

How can I access MinIO files on the file system?

On the underlying server filesystem, MinIO seems to store the content of an uploaded file (e.g. X) in a file called xl.meta in a directory bearing the original file name (e.g. X/xl.meta).
However, the file xl.meta is encoded. How can I access the original file content on the server file system itself (i.e. see the text inside a plain-text file, or be able to play a sound file with the respective application)?
This is not possible, because the object you see on the backend filesystem is not the actual object; it is only the erasure-coded part(s) split across all the disks in a given erasure set. You could do it if you were running in fs mode (single node, single disk), but in an erasure-coded environment you need quorum to download the object, and only via an S3-supported method, not directly from the backend. Technically it's not full quorum but n/2 if you just want to read the object; as a rule, though, you should avoid doing anything in the backend filesystem.
If you just want to see the contents of xl.meta, and not recover the file itself, you can use something like mc support inspect myminio/test/syslog/xl.meta --export=json (or build a binary from https://github.com/minio/minio/tree/master/docs/debugging/xl-meta, but using mc is probably easier).
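To retrieve the object itself you have to go through an S3-compatible path instead; for example, with the mc client (the alias, bucket and object names are just illustrative):
# Download the reconstructed object via the S3 API, not from the backend disks
mc cp myminio/test/syslog ./syslog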

Apache NIFI: Recovering from Flowfile repository issue

I am currently trying to recover my flows from the below exception.
failed to process session due to Cannot update journal file
/data/disk1/nifi/flowfile_repository/journals/90620570.journal because
no header has been written yet.; Processor Administratively Yielded
for 1 sec: java.lang.IllegalStateException: Cannot update journal file
/data/disk1/nifi/flowfile_repository/journals/90620570.journal because
no header has been written yet.
I have seen some answers on best practices for handling large files in NiFi, but my question is more about how to recover from this exception. My observation is that once the exception appears, it starts showing up in several processors across all the flows in our NiFi instance. How do we recover without a restart?
It seems like your disk is full, which is preventing the processors from updating or modifying the data.
You can either increase your disk or delete the contents of your NiFi repositories.
First, check the logs folder. If it's the logs folder that's taking up the space, you can directly do a
rm -rf logs/*
otherwise just delete all the content:
rm -rf logs/* content_repository/* provenance_repository/* flowfile_repository/* database_repository/*
PS: Deleting the content will also delete all the data on your canvas, so make sure you're not deleting data that can't be reproduced.
Most likely it's the logs that are eating up the space. Also check your log rotation interval!
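Before deleting anything, it's worth confirming what is actually filling the disk; a quick check (paths follow the layout from the error above):
# Overall usage of the partition holding the repositories
df -h /data/disk1/nifi
# Per-directory breakdown to see whether the logs or one of the repositories is the culprit
du -sh /data/disk1/nifi/* | sort -h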
Let me know if you need further assistance!

fastest way to retrieve CIFS file metadata

Situation:
I am scanning a directory using NtQueryDirectoryFile(..., FileBothDirectoryInformation, ...). In addition to data returned by this call I need security data (typically returned by GetKernelObjectSecurity) and list of alternate streams (NtQueryInformationFile(..., FileStreamInformation)).
Problem:
To retrieve security and alternate stream info I need to open (and close) each file. In my tests this slows down the operation by a factor of 3. Adding GetKernelObjectSecurity and NtQueryInformationFile slows it down by another factor of 4 (making it 12x overall).
Question:
Is there a better/faster way to get this information (by either opening files faster or avoiding file open altogether)?
Ideas:
If the target file system is local, I could access it directly and (knowing the NTFS/FAT/etc. on-disk details) extract the info from the raw data. But that isn't going to work for remote file systems.
A custom SMB client seems to be the answer; skipping the Windows/NT API layer opens all doors.

Write time series data into hdfs partitioned by month and day?

I'm writing a program which saves time series data from Kafka into Hadoop, and I designed the directory structure like this:
event_data
|-2016
  |-01
    |-data01
    |-data02
    |-data03
|-2017
  |-01
    |-data01
Because this is a daemon task, I wrote an LRU-based manager to keep track of the open files and close inactive files in time to avoid resource leaks. But the incoming data stream is not sorted by time, so it is very common to reopen an existing file to append new data.
I tried using the FileSystem#append() method to open an OutputStream when the file exists, but it failed on my HDFS cluster (sorry, I can't provide the specific error here because it was several months ago and I have since tried another solution).
Then I used another way to achieve my goal: adding a sequence suffix to the file name when a file with the same name exists. Now I have a lot of files in my HDFS, and it looks very messy.
My question is: what's the best practice for the circumstances?
Sorry that this is not a direct answer to your programming problem, but if you're open to options other than implementing it yourself, I'd like to share our experience with fluentd and its HDFS (WebHDFS) output plugin.
Fluentd is an open-source, pluggable data collector with which you can build your data pipeline easily: it reads data from inputs, processes it, and writes it to the specified outputs. In your scenario, the input is Kafka and the output is HDFS. What you need to do is:
Configure the fluentd input following the fluentd Kafka plugin; you'll configure the source part with your Kafka/topic info
Enable WebHDFS and the append operation on your HDFS cluster; you can find out how in the HDFS (WebHDFS) output plugin docs
Configure your match part to write the data to HDFS; there's an example on the plugin docs page. To partition your data by month and day, you can configure the path parameter with time-slice placeholders, something like:
path "/event_data/%Y/%m/data%d"
With this option for collecting your data, you can then write your MapReduce job to do ETL or whatever you like.
I don't know whether this is suitable for your problem; I'm just providing one more option here.

Hadoop Spark (Mapr) - AddFile how does it work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files which I want to process with Spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a
problem for a local mode. When in a distributed mode, you will want to use Spark's
addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this: will Spark create a copy of the file on each node?
What I want is for it to read the files present in that directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
Regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
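On the worker side you can resolve the local copy with SparkFiles rather than hard-coding a path; a small Scala sketch (assuming the same GeoIP.dat name passed to addFile):
import org.apache.spark.SparkFiles

// Resolve the per-node local path of the file shipped with sc.addFile(...)
@transient lazy val geoIp =
  new LookupService(SparkFiles.get("GeoIP.dat"),
    LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)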
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")
