UBIFS volumes vs. MTD partitions - linux-kernel

I'm migrating a product from the JFFS2 file system to UBIFS.
The previous JFFS2 design contains 3 MTD partitions (2 read-only and 1 read-write).
Moving to UBIFS, should I create:
One MTD partition and 3 volumes
3 MTD partitions, 1 volume each
Basically, I'm asking whether I should replace partitions with volumes when moving to UBIFS.
(My understanding is that the UBI layer will manage the entire flash if I do so.)
Thanks,
Ran

Both options exist; here are the trade-offs of each.
One mtd partition and 3 volumes
The UBI layer will manage the entire partition. UBI is a flash virtualization layer that turns unreliable flash into reliable storage, and it performs wear leveling. Even for read-only data, occasionally rewriting the data is beneficial: it recharges the floating gates, etc., so that the data remains readable for longer. For read-write data it is highly beneficial to longevity. With a single partition, UBI wear leveling takes place across all volumes, which substantially increases the erase-write cycles the file systems can handle.
3 mtd partitions, 1 volume each
This is usually less desirable, but it has benefits and may suit some users. The main one is that separate partitions increase the odds of mounting at least one volume: if something happens to a single shared MTD partition, your entire flash might become unusable, whereas with separate MTD partitions a read-only MTD/UBI/UBIFS stack may remain usable even after the read-write file system has failed.
This really is more beneficial for a third option: multiple MTD partitions with mixed file systems.
It is possible to put CramFS or RomFS on flash devices where a device block is guaranteed reliable by the manufacturer. This may be a boot file system that is all a system needs to minimally function. The tools for manipulating these file systems are quite simple (compared to UBI/UBIFS) and can be implemented in minimal code space. Some systems have large DDR with smaller on-chip SRAM, and loaders/flashers may have restricted code space.
That said, recently (in the last two years) mtd-utils has gained UBI parsing code. This might need to be ported to a flasher, recovery code, etc. Recovery code might live in an attached initrd partition, which does the mounting and fail-safe recovery of the UBI/UBIFS partitions.
U-Boot contains code to manage and manipulate UBI/UBIFS, and on many platforms it uses a two-phase boot (running from internal SRAM, configuring DDR, and then migrating) to offer rich functionality in the boot loader. U-Boot itself will need to be on another device, or in a separate MTD partition as per above.
The second option (3 MTD partitions, 1 volume each) is probably the least desirable. The first will favour system/flash lifetime; the third will offer the simplicity of higher reliability/recovery. The best choice depends on what data is on the partitions and the non-Linux resources you have available to recover it. The happy medium is to give as much NAND flash space as possible to UBI and use volumes when you want logical partitioning.
Usually I would question why you need volumes at all and would just put all the data together in such a case, but again, it depends on the nature of the data.
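To make the first option concrete, here is a minimal ubinize configuration sketch for carving one MTD partition into three UBI volumes. The volume names, image file names, and sizes are all invented for illustration:

    # ubinize.cfg -- one MTD partition carved into three UBI volumes
    # (names, images, and sizes are illustrative)

    # Raw read-only image; static volumes are integrity-checked by UBI.
    [kernel]
    mode=ubi
    image=kernel.img
    vol_id=0
    vol_type=static
    vol_name=kernel
    vol_size=8MiB

    # Root file system, mounted read-only; UBIFS volumes are dynamic.
    [rootfs]
    mode=ubi
    image=rootfs.ubifs
    vol_id=1
    vol_type=dynamic
    vol_name=rootfs
    vol_size=64MiB

    # Read-write data volume; autoresize grows it to fill the free space.
    [data]
    mode=ubi
    image=data.ubifs
    vol_id=2
    vol_type=dynamic
    vol_name=data
    vol_flags=autoresize

At runtime you would attach the partition (e.g. ubi.mtd=1 on the kernel command line, or ubiattach) and mount a volume with something like mount -t ubifs ubi0:data /data.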

Related

Is it possible to "gracefully" use virtual memory in a program whose regular use would consume all physical RAM?

I am intending to write a program to create huge relational networks out of unstructured data - the exact implementation is irrelevant but imagine a GPT-3-style large language model. Training such a model would require potentially 100+ gigabytes of available random access memory as links get reinforced between new and existing nodes in the graph. Only a small portion of the entire model would likely be loaded at any given time, but potentially any region of memory may be accessed randomly.
I do not have a machine with 512 GB of physical RAM. However, I do have one with a 512 GB NVMe SSD that I can dedicate to the purpose. I see two potential options for making this program work without specialized hardware:
I can write my own memory manager that swaps pages between "hot" resident memory and "cold" storage on the disk, probably using memory-mapped files or some similar construct (a rough sketch of this appears after the question). This would require coding all memory accesses in the modeling program against this custom memory manager, plus writing the page cache, concurrent-access handlers, and all of the other low-level machinery that comes along with it, which would take days and very likely introduce bugs. Performance would also likely be poor. Or,
I can configure the operating system to use the entire SSD as a page file / swap space, and then just have the program reserve as much virtual memory as it needs, the same as any other normal program, relying on the kernel's memory manager, which already does the page mapping, swapping, and caching for me.
The problem I foresee with #2 is making the operating system understand what I am trying to do in a "cooperative" way. Ideally I would like to hint to the OS that I would only like a specific fraction of resident memory and swap the rest, to keep overall system RAM usage below 90% or so. Otherwise the OS will allocate 99% of physical RAM and then start aggressively compacting and cutting down memory from other background programs, which ends up making the whole system unresponsive. Linux apparently just starts sacrificing entire processes if it gets too bad.
Is there a kernel API, in any language or operating system, that would let me tell the OS to back off and proactively swap my user memory to disk? I have looked through the VMM functions in kernel32.dll and the Linux paging and swap daemon (kswapd) documentation, but nothing looks like what I need. Perhaps some way to reserve, say, 1 GB of pages and then "donate" them back to the kernel to make sure they get used for processes that aren't my own? Some way to configure memory pressure or limits, or to make kswapd work more aggressively for just my process?
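For what it's worth, option 1 does not necessarily mean writing a full custom pager: a memory-mapped file already makes the kernel do the swapping, against a file you choose on the device you choose. A minimal sketch in C, where the backing path /mnt/nvme/model.bin and the 400 GiB size are invented for illustration:

    /* Back the model with a file on the NVMe SSD; the kernel pages
     * "cold" regions out to the file on demand, so resident memory
     * stays bounded by available RAM rather than by model size. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t model_size = 400ULL << 30;   /* 400 GiB, far larger than RAM */

        int fd = open("/mnt/nvme/model.bin", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, (off_t)model_size) != 0) { perror("ftruncate"); return 1; }

        /* MAP_SHARED lets the kernel write dirty pages back to the file. */
        unsigned char *model = mmap(NULL, model_size, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
        if (model == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint that access is random so read-ahead does not waste RAM. */
        madvise(model, model_size, MADV_RANDOM);

        model[123456789012ULL] = 42;   /* touching a page faults it in */

        munmap(model, model_size);
        close(fd);
        return 0;
    }

This is essentially option 2's mechanism applied selectively: the kernel's page cache does the hot/cold management, but the pressure lands on a file of your choosing rather than on system-wide swap.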

In ESP32 / ESP-IDF, when to use EEPROM vs NVS vs SPIFFS?

I'm fairly new to doing production work on ESP32 microcontrollers, and I want a little context and nuance from people who've been around the block a few times. So this question is more about that kind of thing than a "how do I code X" kind of question.
I have lots of data storage needs on my current project.
larger blobs of data that need to be stored less often
smaller blobs of data that need to be updated more often
factory settings (like serial number, board revision, etc) that are particular to a given device, but aren't going to be encoded in C.
etc
I'm familiar with storing data in "blobs", and I'm familiar with encoding / decoding data with protocol buffers.
So given all that, I'm trying to gain context on the differences between my various storage options on the ESP32, and when to use each.
EEPROM
NVS
SPIFFS / LittleFS
other options...
What use cases make you pick one of these options over another?
There's no EEPROM on the ESP32, just the flash.
NVS is a simple non-volatile key-value store with different data types (integers of 8-64 bits, strings, blobs). It's reasonably convenient to use, does wear levelling, and supports flash encryption (although that is a bit of a hassle). I'd use it for storing factory settings and anything else that is reasonably small (there's a 4,000-byte limit on strings and a 508,000-byte limit on blobs). If the device needs to write often, you might want to create a separate, dedicated, read-only NVS partition for storing device attributes (serial, hardware info) so it's guaranteed not to get clobbered by power failures during a write.
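As an illustration of what using NVS looks like, here is a minimal ESP-IDF sketch in C; the "factory" namespace and "serial" key are invented for the example:

    #include "esp_err.h"
    #include "nvs.h"
    #include "nvs_flash.h"

    void store_factory_serial(void)
    {
        nvs_handle_t h;

        ESP_ERROR_CHECK(nvs_flash_init());                  /* init the NVS partition */
        ESP_ERROR_CHECK(nvs_open("factory", NVS_READWRITE, &h));
        ESP_ERROR_CHECK(nvs_set_str(h, "serial", "SN-000123"));
        ESP_ERROR_CHECK(nvs_commit(h));                     /* make sure it hits flash */
        nvs_close(h);
    }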
ESP-IDF supports the SPIFFS and FAT file systems.
SPIFFS is lightweight and much better than FAT in terms of wear levelling and reliability. I'd use it for storing any larger files (see the sketch below). It doesn't support flash encryption, unfortunately.
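For comparison, registering and using SPIFFS is also only a few lines; the "storage" partition label and file name below are invented:

    #include <stdio.h>
    #include "esp_err.h"
    #include "esp_spiffs.h"

    void write_large_blob(const void *buf, size_t len)
    {
        esp_vfs_spiffs_conf_t conf = {
            .base_path = "/spiffs",
            .partition_label = "storage",
            .max_files = 4,
            .format_if_mount_failed = true,
        };
        ESP_ERROR_CHECK(esp_vfs_spiffs_register(&conf));  /* mount SPIFFS at /spiffs */

        /* After registration, ordinary stdio calls work on the mount point. */
        FILE *f = fopen("/spiffs/blob.bin", "wb");
        if (f != NULL) {
            fwrite(buf, 1, len, f);
            fclose(f);
        }
    }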
The FAT file system is probably the worst choice because it is neither natively flash-friendly nor reliable. Espressif has built some kind of layer between FAT and the flash to accommodate wear levelling. The only critical advantage of FAT is that it supports flash encryption.
Then there are third party options which I haven't used, unfortunately.
As always, consider the number of page erases your writes are going to cause in the flash - this gives you an estimate of how many times you can write before the chip's lifetime is reached.
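As a rough illustration (all numbers invented): with 4 KiB sectors rated for 100,000 erase cycles, a 1 MiB wear-levelled region has 256 sectors and therefore about 25.6 million sector erases of total life. At one sector erase per minute that is roughly 48 years; at one per second it is well under a year.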

Read files by device/inode order?

I'm interested in an efficient way to read a large number of files on the disk. I want to know whether, if I sort the files by device and then by inode, I'll get some speed improvement over reading the files in their natural order.
There are vast speed improvements to be had from reading files in physical order from rotating storage. Operating system I/O scheduling mechanisms only do any real work if there are several processes or threads contending for I/O, because they have no information about what files you plan to read in the future. Hence, other than simple read-ahead, they usually don't help you at all.
Furthermore, Linux worsens your access patterns during directory scans by returning directory entries to user space in hash table order rather than physical order. Luckily, Linux also provides system calls to determine the physical location of a file, and whether or not a file is stored on a rotational device, so you can recover some of the losses. See for example this patch I submitted to dpkg a few years ago:
http://lists.debian.org/debian-dpkg/2009/11/msg00002.html
This patch does not incorporate a test for rotational devices, because this feature was not added to Linux until 2012:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ef00f59c95fe6e002e7c6e3663cdea65e253f4cc
I also used to run a patched version of mutt that would scan Maildirs in physical order, usually giving a 5x-10x speed improvement.
Note that inodes are small, heavily prefetched and cached, so opening files to get their physical location before reading is well worth the cost. It's true that common tools like tar, rsync, cp and PostgreSQL do not use these techniques, and the simple truth is that this makes them unnecessarily slow.
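To make the directory-scan step concrete, here is a minimal C sketch that reads a directory and sorts the entries by inode number before the files are opened. Sorting by the first physical block (e.g. via the FIEMAP ioctl) recovers more of the loss, but inode order alone is a cheap approximation:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    struct entry { ino_t ino; char name[256]; };

    static int by_inode(const void *a, const void *b)
    {
        const struct entry *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *dir = opendir(path);
        if (!dir) { perror("opendir"); return 1; }

        struct entry *v = NULL;
        size_t n = 0, cap = 0;
        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;                       /* skip hidden entries */
            if (n == cap) {
                cap = cap ? cap * 2 : 64;
                v = realloc(v, cap * sizeof *v);
                if (!v) { perror("realloc"); return 1; }
            }
            v[n].ino = de->d_ino;
            snprintf(v[n].name, sizeof v[n].name, "%s", de->d_name);
            n++;
        }
        closedir(dir);

        qsort(v, n, sizeof *v, by_inode);       /* approximate on-disk order */

        /* Open/read the files in this order instead of readdir() order. */
        for (size_t i = 0; i < n; i++)
            printf("%llu\t%s\n", (unsigned long long)v[i].ino, v[i].name);

        free(v);
        return 0;
    }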
Back in the 1970s I proposed to our computer center that reading/writing from/to disk would be faster overall if they organized the queue of disk reads and writes so as to minimize seek time. I was told that, based on their own experiments and on information from IBM, many studies had been made of several techniques, and that the overall throughput of jobs (not just a single job) was best if disk reads/writes were done in first-come, first-served order. This was an IBM batch system.
In general, optimisation techniques for file access are too tied to the architecture of your storage subsystem for them to be something as simple as a sorting algorithm.
1) You can effectively multiply the read data rate if your files are spread into multiple physical drives (not just partitions) and you read two or more files in parallel from different drives. This one is probably the only method that is easy to implement.
2) Sorting the files by name or inode number does not really change anything in the general case. What you'd want is to sort the files by the physical location of their blocks on the disk, so that they can be read with minimal seeking. There are quite a few obstacles however:
Most filesystems do not provide such information to userspace applications, unless it's for debugging reasons.
The blocks of each file can themselves be spread all over the disk, especially on a mostly full filesystem, so there is no way to read multiple files sequentially without seeking back and forth.
You are assuming that your process is the only one accessing the storage subsystem. Once there is at least someone else doing the same, every optimisation you come up with goes out of the window.
You are trying to be smarter than the operating system and its own caching and I/O scheduling mechanisms. It's very likely that by trying to second-guess the kernel, i.e. the only one that really knows your system and your usage patterns, you will make things worse.
Don't you think that, e.g., PostgreSQL or Oracle would have used a similar technique if they could? When the DB is installed on a proper filesystem, they let the kernel do its thing and don't try to second-guess its decisions. Only when the DB is on a raw device do the specialised optimisation algorithms that take physical blocks into account come into play.
You should also take the specific properties of your storage devices into account. Modern SSDs, for example, make traditional seek-time optimisations obsolete.

Estimation of commodity hardware for an application

Suppose I wanted to develop the Stack Overflow website. How do I estimate the amount of commodity hardware required to support this website, assuming 1 million requests per day? Are there any case studies that explain the performance improvements possible in this situation?
I know the I/O bottleneck is the major bottleneck in most systems. What are the possible options to improve I/O performance? A few that I know of are:
caching
replication
You can improve I/O performance in several ways depending upon what you use for your storage setup:
Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
Use fast disks (performance-wise: SSD > FC > SATA).
Segregate workloads at different times of day. e.g. Backup during night, normal app I/O during day.
Turn off atime updates in your filesystem.
Cache NFS file handles a.k.a. Haystack (Facebook), if storing data on NFS server.
Combine small files into larger chunks, a.k.a BigTable, HBase.
Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
Use a clustered storage system (yeah not exactly commodity hardware).
Optimize/design your application for sequential disk accesses whenever possible.
Use memcached. :)
You may want to look at "Lessons Learned" section of StackOverflow Architecture.
check out this handy tool:
http://www.sizinglounge.com/
and another guide from dell:
http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555
if you want your own stackoverflow-like community, you can sign up with StackExchange.
you can read some case studies here:
High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study
1 million requests per day is about 12 per second. Stack Overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in the RAM of a 64 GB Dell PowerEdge 2970. I'm not sure where caching and replication should play a role.
If you have trouble normalizing aggressively enough, a PowerEdge R900 with 256 GB is available.
If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K/second should not be a problem for a main-memory system.
The best way to avoid the I/O bottleneck is to not do I/O (as much as possible). That means a Prevayler-like architecture with batched writes (losing a few seconds of data is not a problem), basically a log file, and for replication also writing them out to a socket.
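A minimal C sketch of such a batched, append-only log; the record format, file name, and flush policy are invented for illustration:

    #include <stdio.h>
    #include <string.h>

    #define BATCH_CAP (1 << 20)              /* 1 MiB in-memory batch buffer */

    static char   batch[BATCH_CAP];
    static size_t batch_len;

    /* Record an update in memory: cheap, no disk I/O. */
    static void log_append(const char *rec)
    {
        size_t n = strlen(rec);
        if (batch_len + n <= BATCH_CAP) {
            memcpy(batch + batch_len, rec, n);
            batch_len += n;
        }
    }

    /* Called every few seconds: one sequential write plus one flush,
     * instead of one random I/O per update. */
    static void log_flush(FILE *logf)
    {
        if (batch_len == 0)
            return;
        fwrite(batch, 1, batch_len, logf);
        fflush(logf);                        /* or fsync(fileno(logf)) for durability */
        batch_len = 0;
    }

    int main(void)
    {
        FILE *logf = fopen("updates.log", "ab");
        if (logf == NULL) { perror("fopen"); return 1; }

        log_append("vote question=42 delta=+1\n");
        log_append("comment post=7 user=99\n");
        log_flush(logf);                     /* in practice, driven by a timer */

        fclose(logf);
        return 0;
    }

Replication then amounts to writing the same batch to a socket before (or after) the local flush.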

Question about hard drives, 'seek' and 'read' in Windows OS

Does anyone know how the hard drive is physically affected when calling 'seek' and 'read'?
To be more specific: I know that the hard drive has some kind of magnetic needle that is used to read the data from the magnetic platters. So my question is, when is the needle actually moved to the reading location?
Is it moved when we call the 'seek' Windows API method (whether or not an actual read is performed), or does 'seek' just remember a virtual pointer, with the physical movement of the needle performed only when the 'read' method is called?
Edit: Assume that the data requested from the hard drive doesn't exist in any of the caches (hard drive cache, OS cache, RAM, and whatever else it could be).
Wanted to break out this question from your post
When is the needle actually moved to the reading location?
I think the simple answer is "whenever data is requested that is not already present in any number of caches". The problem with predicting hard drive movement is you have to consider all of the different places that cache data read from the hard drive. If the data is present in those caches and accessible in the context requesting the data, the cache will be used instead of actually reading the hard drive. Here are just some of the places that can and do cache hard drive data
Hard Drive's internal cache
OS level caches
Program level caches
API level cache
In the case where none of the data is present then it will likely be read from the hard drive during a read call. A seek call is unlikely to cause the hard drive to move because you're not changing the physical hard drive pointer but a virtual pointer to the file within your program.
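A small Win32 sketch in C illustrates the distinction; the path is hypothetical, and FILE_FLAG_NO_BUFFERING is used to keep the OS cache out of the picture (it requires sector-aligned offsets, sizes, and buffers):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("C:\\data\\big.bin", GENERIC_READ,
                               FILE_SHARE_READ, NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING,  /* bypass the OS cache */
                               NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "open failed: %lu\n", GetLastError());
            return 1;
        }

        /* Pure bookkeeping: updates the file position, no head movement. */
        SetFilePointer(h, 1024 * 1024, NULL, FILE_BEGIN);

        /* The physical seek + read happen here (assuming no cache hit). */
        void *buf = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE,
                                 PAGE_READWRITE);       /* sector-aligned buffer */
        DWORD got = 0;
        ReadFile(h, buf, 4096, &got, NULL);
        printf("read %lu bytes\n", (unsigned long)got);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }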
The hard drive head (needle) starts moving and the disk starts spinning up (unless already spinning) at the read operation. There is no head move or spinup at the seek operation.
Please note that the head may move non-sequentially above the disk even if you are reading a file sequentially, i.e. the read of the 2nd, 3rd, etc. 512-byte block may cause the head to move far away as well, even if there are no intervening seeks. This happens partly because the file is fragmented on the filesystem, and partly because the firmware remaps sector numbers (i.e. logical sector 5 is not between logical sectors 4 and 6) to compensate for bad-block errors.
The assumption in the question, "Assume that the data requested from the Hard-Drive doesn't exist in any of the caches (hard-drive cache, OS cache, RAM and whatever else it could be)", is hard to guarantee and relatively rarely holds. Even in this case, there is only a loose association between user-mode file I/O operations and physical storage device operations.
There are many user-mode file I/O functions in various Windows libraries. Some of the oldest are the C library low-level I/O functions. There are also the C library stream I/O functions, the C++ iostreams classes, and the managed I/O classes. There are other I/O interfaces as well that are part of other packages.
In general, all the user mode I/O Libraries are built on top of the Win32 file I/O functions including CreateFile(), SetFilePointer(), ReadFile(), and WriteFile().
Unless a file is opened in unbuffered mode, the operating system can cache the file's contents. This is done system-wide, and not on a per-file basis. So even if your program has not read or written a file, I/O to that file may be served from the cache and not result in any physical storage device I/Os.
There are many factors that determine how file I/Os map to actual I/O operations on a physical device. These include library-level buffering, OS caching, device driver caching, hardware-level caching, device block size, file size, hardware block/sector remapping, and other factors.
The short story here is that you cannot assume that individual file level read or seek operations correspond to physical device operations, such as disk head seeking.
This gets even trickier when writes are considered. Often writes are accompanied by a flush - which the application developer assumes will push the data all the way to the physical media. Developers often assume that when a flush call returns success, that the data is guaranteed to be persistent on the storage device. This is far from true as devices and drivers often ignore flush calls.
There is more complexity with solid state drives which are not mechanical and therefore do not have 'seek' operations. Here, other physical characteristics manifest themselves such as the necessity to erase blocks before they are written to.
