Kafka Streams - RocksDB - max open files - apache-kafka-streams

If we define max open files as 300 and the number of .sst files exceeds that, I assume the files in the cache will be evicted. But if the data in an evicted file needs to be accessed again, will it be reloaded, or is that file lost forever?
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

From the link you posted:
max_open_files -- RocksDB keeps all file descriptors in a table cache. If number of file descriptors exceeds max_open_files, some files are evicted from table cache and their file descriptors closed. This means that every read must go through the table cache to lookup the file needed. Set max_open_files to -1 to always keep all files open, which avoids expensive table cache calls.
This only means that if the number of open files is exceeded, some files will be closed. If you want to access a closed file, it will simply be re-opened (and possibly another file will be closed first to make room).
Hence, the config is not about creating/deleting files, but just about how many files to keep open in parallel.
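In Kafka Streams this RocksDB option is set through a RocksDBConfigSetter. A minimal sketch, assuming the RocksDB Java API bundled with Kafka Streams (the class name and the value 300 are illustrative, taken from the question):

import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

public class BoundedOpenFilesConfigSetter implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        // Cap the number of file descriptors RocksDB keeps open per store.
        // Files evicted from the table cache are simply re-opened on the next read.
        options.setMaxOpenFiles(300);
    }
}

The setter is registered in the streams configuration via StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, e.g. props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedOpenFilesConfigSetter.class).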

Related

How to merge small blocks of a file in HDFS?

We plan to append data to our files in NEW_BLOCK mode. This gives us more flexibility with respect to DataNode status.
After running the process for days, we find that our 2 MB file has too many blocks.
Is there a way to merge the blocks of a file, say, to bring the 100 blocks of a file down to 4?
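HDFS has no in-place operation to merge the blocks of an existing file; the usual workaround is to rewrite the file, since the copy is laid out in full-size blocks regardless of how many blocks the original accumulated through NEW_BLOCK appends. A minimal sketch, assuming the Hadoop Java client and illustrative paths:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CompactHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path original = new Path("/data/events.log");      // illustrative path
        Path compacted = new Path("/data/events.log.tmp"); // temporary rewrite target

        // Stream the fragmented file into a fresh file; the new copy uses
        // regular block-sized blocks (e.g. a single block for a 2 MB file).
        try (InputStream in = fs.open(original);
             OutputStream out = fs.create(compacted, true)) {
            IOUtils.copyBytes(in, out, conf, false);
        }

        // Swap the compacted copy into place.
        fs.delete(original, false);
        fs.rename(compacted, original);
    }
}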

Writing a file larger than block size in hdfs

Suppose I am writing a 200 MB file into HDFS where the block size is 128 MB, and the write fails after 150 MB of the 200 MB have been written. Will I be able to read the portion of the data that was written? What if I try to write the same file again? Will that create a duplicate? What happens to the 150 MB written before the failure?
The HDFS default block size is 128 MB. If the write fails partway through, the in-progress file shows up in the Hadoop administration UI with a "copying" file extension.
Only the 150 MB of data that was written will have been copied.
Yes, you can read that portion of the data (the 150 MB).
Once you restart the copy, it will continue from the previous point (provided both the path and the file name are the same).
Every piece of data is replicated according to your replication factor.
The data written before the failure remains available in HDFS.
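A quick way to verify what actually landed in HDFS after such a failure, assuming the Hadoop Java client and an illustrative path, is to inspect the visible file length and its block layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckPartialWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/bigfile.dat"); // illustrative path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Bytes visible: " + status.getLen()); // roughly 150 MB in the scenario above

        // With a 128 MB block size, a 150 MB prefix appears as two blocks
        // (128 MB + 22 MB), each replicated per the replication factor.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " replicas=" + block.getHosts().length);
        }
    }
}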

How does backup work when flow.xml size exceeds max storage?

I have checked the properties below, which are used for the backup operations in NiFi 1.0.0, with respect to this JIRA:
https://issues.apache.org/jira/browse/NIFI-2145
nifi.flow.configuration.archive.max.time=1 hours
nifi.flow.configuration.archive.max.storage=1 MB
There are two backup operations: the first writes "conf/flow.xml.gz" and the second writes "conf/archive/flow.xml.gz".
Archived workflows (conf/archive/flow.xml.gz) are saved hourly, per the "max.time" property.
At some point the archive reached the 1 MB limit set as the max storage.
NiFi then deleted the existing conf/archive/flow.xml.gz entirely and did not write new flow files to conf/archive/flow.xml.gz because the size limit was exceeded.
No log message indicates that the new flow.xml.gz is larger than the configured storage limit.
Why does it delete the existing flows and not write new flows when the storage limit is exceeded?
In this case, has one of the backup operations failed or not?

Hadoop Avro file size concern

I have a cron job that downloads zip files (200 bytes to 1 MB) from a server on the internet every 5 minutes. If I import the zip files into HDFS as is, I run into the infamous Hadoop small-file problem. To avoid the build-up of small files in HDFS, I process the text data in the zip files, convert it into Avro files, and wait 6 hours before adding each Avro file to HDFS. Using this method, I have managed to get Avro files into HDFS with a file size larger than 64 MB; the file sizes range from 50 MB to 400 MB. What I'm concerned about is what happens if the file sizes start getting into the 500 MB range or larger. Will this cause issues for Hadoop? How does everyone else handle this situation?
Assuming that you have some Hadoop post-aggregation step and that you're using some splittable compression type (sequence, snappy, none at all), you shouldn't face any issues from Hadoop's end.
If you would like your Avro file sizes to be smaller, the easiest way to do this would be to make your aggregation window configurable and lower it when needed (6 hours => 3 hours?). Another way to ensure more uniformity in file sizes would be to keep a running count of lines seen across the downloaded files and then combine and upload once a certain line threshold has been reached.
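A minimal sketch of that second suggestion, where the threshold value and the combineIntoAvro/uploadToHdfs helpers are hypothetical stand-ins for the conversion and upload steps the cron job already performs:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class LineThresholdAggregator {
    // Hypothetical threshold; tune it so the combined Avro files land near the desired size.
    private static final long LINE_THRESHOLD = 2_000_000;

    private final List<Path> pendingTextFiles = new ArrayList<>();
    private long pendingLines = 0;

    // Called by the cron job for each downloaded and unzipped text file.
    public void addDownloadedFile(Path textFile) throws Exception {
        pendingTextFiles.add(textFile);
        try (Stream<String> lines = Files.lines(textFile)) {
            pendingLines += lines.count();
        }
        if (pendingLines >= LINE_THRESHOLD) {
            Path avroFile = combineIntoAvro(pendingTextFiles); // hypothetical: existing text-to-Avro step
            uploadToHdfs(avroFile);                            // hypothetical: existing HDFS upload step
            pendingTextFiles.clear();
            pendingLines = 0;
        }
    }

    private Path combineIntoAvro(List<Path> files) throws Exception { /* existing conversion logic */ return null; }

    private void uploadToHdfs(Path avroFile) throws Exception { /* existing upload logic */ }
}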

Transfer of oracle dump file through mail which allows only 5mb max upload

I want to transfer my Oracle database dump file from one place to another. The size of the database is 80 MB; even after 7-Zip compression it is 9 MB, but mail allows me to upload a maximum of 5 MB, so can I split my dump file? At the same time, I don't want to lose the key structure of the database.
P.S. All other mail services are blocked, and cloud storage is also blocked.
To meet the constraints of your network, you can create dump files of a smaller size, which lets you keep each dump file at 5 MB or smaller:
exp user/pass FILE=D:P1.dmp,E:P2.dmp FILESIZE=5m LOG=splitdump.log
I have not tried the above syntax, but I have tried the one below, where a substitution variable is used so that you need not worry about how many dump files to specify beforehand. It automatically generates as many dump files of the requisite size as needed:
expdp user/pass tables=test directory=dp_dir dumpfile=dump%u.dmp filesize=5m
