Ordered mode behavior in journaling file system - linux-kernel

In the following article, it says "ordered mode is metadata journaling only but writes the data before journaling the metadata." Does this mean the data is physically written to disk before the metadata is written? From what I understand, data written to a disk is first placed in a cache and then flushed to disk. Or is whether the data was actually written to disk irrelevant to the journaling service?
Is the metadata that is placed in the journal written directly to disk without first being written to a cache?

Related

How do I decide where I should locate my TimesTen database files?

I am setting up a TimesTen In-Memory database and I am looking for guidance on the storage and location that I should use for the database's persistence files.
A TimesTen database consists of two types of files: checkpoint files (two) and transaction log files (always at least one, usually many).
There are 3 criteria to consider:
a) Data safety and availability (regular storage versus RAID). The database files are critical to the operation of the database; if they become inaccessible or are lost or damaged, your database will become inoperable and you will likely lose data. One way to protect against this is to use TimesTen's built-in replication to implement high availability, but even if you do that you may also want to protect your database files using some form of RAID storage. For performance reasons, RAID-1 is preferred over RAID-5 or RAID-6. Use of NFS storage is not recommended for database files.
b) Capacity. Both checkpoint files are located in the same directory (DataStore attribute) and hence in the same filesystem. Each file can grow to a maximum size of PermSize + ~64 MB (a quick sizing sketch follows this list). Normally the space for these files is pre-allocated when the files are created, so it is less likely you will run out of space for them. By default, the transaction log files are also located in the same directory as the checkpoint files, though you can (and almost always should) place them in a different location by using the LogDir attribute. The filesystem where the transaction logs are located should have enough space that you never run out. If the database is unable to write data to the transaction logs, it will stop processing transactions and operations will start to receive errors.
c) Performance. If you are using traditional spinning magnetic media, then I/O contention is a significant concern. The checkpoint files and the transaction log files should be stored on separate devices and separate from any other files that are subject to high levels of I/O activity. I/O loading and contention is less of a consideration for SSD storage and usually irrelevant for PCIe/NVMe flash storage.
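To make the capacity point in (b) concrete, here is a small back-of-envelope sizing sketch in Python. The 32 GB PermSize is purely hypothetical; the two checkpoint files and the roughly 64 MB of per-file overhead come from the description above.

    # Back-of-envelope sizing for the TimesTen checkpoint filesystem.
    GB = 1024 ** 3
    MB = 1024 ** 2

    perm_size = 32 * GB                              # hypothetical PermSize setting
    checkpoint_file_max = perm_size + 64 * MB        # each file: PermSize + ~64 MB
    checkpoint_dir_needed = 2 * checkpoint_file_max  # two checkpoint files

    print(f"DataStore filesystem needs at least "
          f"{checkpoint_dir_needed / GB:.1f} GB for checkpoint files")
    # The LogDir filesystem is sized separately, based on transaction
    # volume and how quickly old log files are purged.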

Offloading unstructured data saved in RDBMS to Hadoop

My organization is thinking about offloading unstructured data (text, images, etc.) that is stored in table columns in an Oracle database into Hadoop. The database is around 10 TB and growing; the CLOB/BLOB columns account for about 3 TB. Right now these columns are queried for certain kinds of reports through a web application. They are also written to, but not very frequently.
What approach can we take to offload this data properly while ensuring that the offloaded data remains available for reads through the existing web application?
You can find part of the answer on the Oracle blog (link).
If the data needs to be pulled into HDFS via Sqoop, then you should first read the following from the Sqoop documentation.
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
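As a rough illustration of the --inline-lob-limit behaviour described above, a Sqoop import invocation (wrapped in Python here only for consistency with the other sketches) might look something like this. The JDBC URL, username, table, and target directory are placeholders; the 16 MB value is simply the documented default made explicit.

    # Sketch of a Sqoop import that controls where large CLOB/BLOB values go.
    import subprocess

    sqoop_cmd = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB",  # placeholder JDBC URL
        "--username", "app_user",                                # placeholder credentials
        "--table", "DOCUMENTS",                                  # table holding the CLOB/BLOB columns
        "--target-dir", "/data/offload/documents",               # placeholder HDFS path
        # LOBs larger than this many bytes spill into the _lobs subdirectory
        # instead of being stored inline (16 MB is the default).
        "--inline-lob-limit", str(16 * 1024 * 1024),
    ]

    subprocess.run(sqoop_cmd, check=True)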
Reading via the web application is possible if you use an MPP query engine like Impala; it works well and is production-ready technology. We make heavy use of complex Impala queries to render content for a Spring Boot application. Since Impala runs everything in memory, there is a chance of slowness or failure on a multi-tenant Cloudera cluster; for smaller user groups (a 1,000-2,000 user base) it works perfectly fine (a small query sketch follows the recommendations below).
Do let me know if you need more input.
My recommendations would be:
Use the Cloudera distribution (read here)
Give enough memory to the Impala daemons
Make sure YARN is configured correctly to schedule (fair share or priority share) the ETL load versus the web application load
If required, keep the Impala daemons separate from YARN
Define a memory quota for Impala so that it allows concurrent queries
Flatten your queries so Impala runs fast without joins and shuffles
If you are reading just a few columns, store the data in Parquet; it works very fast
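To sketch the "read through the web application" path mentioned above, here is roughly what querying Impala from application code could look like, using the impyla client as one example. The host, port, table, and column names are made up; a Spring Boot application would use its JDBC/ODBC equivalent.

    # Minimal sketch of querying Impala from application code with impyla
    # (pip install impyla). Host, port, and table/column names are illustrative.
    from impala.dbapi import connect

    def fetch_report_rows():
        # 21050 is Impala's default HiveServer2-compatible port; adjust to your cluster.
        conn = connect(host="impala-coordinator.example.com", port=21050)
        try:
            cur = conn.cursor()
            # Keep the query flat and select only the columns you need so
            # Impala can serve it quickly from Parquet.
            cur.execute(
                "SELECT doc_id, doc_title, doc_text "
                "FROM offloaded_documents "
                "LIMIT 100"
            )
            return cur.fetchall()
        finally:
            conn.close()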

Writing a file larger than block size in hdfs

Suppose I am trying to write a 200 MB file into HDFS where the HDFS block size is 128 MB, and the write fails after 150 MB of the 200 MB have been written. Will I be able to read the portion of the data that was written? What if I try to write the same file again: will that create a duplicate? And what happens to the 150 MB of data written before the failure?
The HDFS default block size is 128 MB. If the write fails partway through, you can see the status in the Hadoop administration UI (the in-flight file carries a copying extension, e.g. ._COPYING_ when using hdfs dfs -put).
Only the 150 MB that was already written will be copied.
Yes, you can read that portion of the data (the 150 MB).
Once you restart the copy, it will continue from the previous point (provided the path and the file name are the same).
For every block written you will find replicas according to your replication factor.
The previously written data will still be available in HDFS.
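As a rough sketch of what "only 150 MB will be copied" means at the block level, here is the arithmetic with the 128 MB block size from the question.

    # How a 200 MB write that fails after 150 MB maps onto 128 MB blocks.
    BLOCK_SIZE_MB = 128
    written_mb = 150

    full_blocks, partial_mb = divmod(written_mb, BLOCK_SIZE_MB)
    print(f"{full_blocks} complete {BLOCK_SIZE_MB} MB block(s) and "
          f"one partial block of {partial_mb} MB were written")
    # -> 1 complete 128 MB block and one partial block of 22 MB; each
    #    block written so far is replicated per the replication factor.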

Caching vs Tempview

I have a Parquet file which I read at least 4-5 times within my application, and I was wondering what the most efficient approach is.
Option 1. While writing the Parquet file, read it back into a Dataset and call cache. I am assuming that by doing an immediate read I might reuse some existing HDFS/Spark cache left over from the write process.
Option 2. In my application, when I need the Dataset for the first time, cache it right after reading it.
Option 3. While writing the Parquet file, create a temp view from it after completion. In all subsequent usage, use the view.
I am also not very clear about the efficiency of reading from a temp view versus a Parquet Dataset.
The datasets do not fit entirely into memory.
You should cache the Dataset (Option 2):
writing to disk will provide no improvement over Spark's in-memory format;
temporary views don't cache anything by themselves.
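A minimal PySpark sketch of Option 2, with a made-up file path and column names. Since the data does not fit entirely in memory, persisting with MEMORY_AND_DISK keeps what fits in memory and spills the rest to local disk instead of recomputing it.

    # Option 2: read the Parquet file once, persist it, reuse it.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("cache-vs-tempview").getOrCreate()

    df = spark.read.parquet("/data/events/events.parquet")  # illustrative path
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # The first action materializes the cache; later uses hit the cached data.
    total_rows = df.count()
    df.groupBy("event_date").count().show()
    df.select("event_id", "event_date").show(10)

    df.unpersist()   # release the cache when finished
    spark.stop()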

Name Node stores what?

In the case of the NameNode:
What gets stored in main memory and what gets stored in secondary memory (hard disk)?
What do we mean by "file to block mapping"?
What exactly are the fsimage and edit logs?
In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)?
The file-to-block mapping, the locations of blocks on DataNodes, the set of active DataNodes, and a bunch of other metadata are all stored in memory on the NameNode. Pretty much everything you see when you check the NameNode status web page is held in memory somewhere.
The only things stored on disk are the fsimage, the edit log, and status logs. Interestingly, the NameNode never really uses these files on disk except when it starts: the fsimage and edits files pretty much only exist so that the NameNode can be brought back up if it is stopped or crashes.
What do we mean by "file to block mapping"?
When a file is put into HDFS, it is split into blocks (of configurable size). Let's say you have a file called "file.txt" that is 201MB and your block size is 64MB. You will end up with three 64MB blocks and a 9MB block (64+64+64+9 = 201). The NameNode keeps track of the fact that "file.txt" in HDFS maps to these four blocks. DataNodes store blocks, not files, so the mapping is important to understanding where your data is and what your data is.
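A quick sketch of the 201 MB / 64 MB split described above:

    # Split a 201 MB file into 64 MB HDFS blocks, as in the example above.
    BLOCK_SIZE_MB = 64
    file_size_mb = 201

    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB

    print(blocks)        # [64, 64, 64, 9]
    print(sum(blocks))   # 201
    # The NameNode records that "file.txt" maps to these four blocks and
    # which DataNodes hold each block's replicas.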
What exactly are the fsimage and edit logs?
The fsimage is a recent checkpoint of the NameNode's in-memory state. The NameNode's state at that checkpoint (i.e. the file-to-block mapping, file properties, etc.) can be restored from this file.
The edits file contains all the updates made since the last checkpoint, such as files being added or deleted. This matters if your NameNode goes down, because it holds the most recent changes not yet captured in the fsimage. When the NameNode starts, it materializes the fsimage into memory and then applies the edits in the order they appear in the edits file.
The fsimage and edits files exist this way because rewriting the potentially massive fsimage on every HDFS operation would be hard on the system; instead, the edits file is simply appended to. However, for NameNode startup time and storage reasons, rolling the edits into the fsimage every now and then is a good thing.
The SecondaryNameNode is the process that periodically takes the fsimage and edits file and merges them into a new checkpointed fsimage file. This process is important to prevent the edits file from growing huge.
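Purely as an analogy for the checkpoint-plus-edit-log mechanism described above (this is not actual NameNode code), the startup sequence looks roughly like this:

    # Toy analogy: load the checkpoint (fsimage), then replay the
    # append-only edit log in order. Illustrative only.
    fsimage = {                       # last checkpointed state
        "/data/file1.txt": ["blk_1", "blk_2"],
        "/data/file2.txt": ["blk_3"],
    }

    edits = [                         # operations logged since the checkpoint
        ("add",    "/data/file3.txt", ["blk_4", "blk_5"]),
        ("delete", "/data/file2.txt", None),
    ]

    namespace = dict(fsimage)         # 1. materialize the fsimage into memory
    for op, path, blocks in edits:    # 2. apply edits in log order
        if op == "add":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)

    print(namespace)
    # {'/data/file1.txt': ['blk_1', 'blk_2'], '/data/file3.txt': ['blk_4', 'blk_5']}
    # A checkpoint (the SecondaryNameNode merge) writes this merged state
    # out as a new fsimage and truncates the edit log.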
