How to calculate storage when FTPing to a mainframe - ftp

How can I calculate storage when FTPing to a mainframe? I was told LRECL will always remain '80'. I'm not sure how to calculate PRI and SEC dynamically based on the file size...
QUOTE SITE LRECL=80 RECFM=FB CY PRI=100 SEC=100

If the site has SMS, you shouldn't need to. But if you do need to calculate it: the number of tracks is the size of the file in bytes divided by 56,664, and the number of cylinders is the size of the file in bytes divided by 849,960. In either case, round up.
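For illustration, a minimal sketch of that round-up arithmetic in C (the file size is a hypothetical example):

    #include <stdio.h>

    /* 3390 geometry from the answer above: 56,664 usable bytes per track,
       849,960 per cylinder (15 tracks). */
    #define BYTES_PER_TRACK    56664UL
    #define BYTES_PER_CYLINDER 849960UL

    int main(void) {
        unsigned long file_bytes = 75000000UL;  /* hypothetical file size */

        /* Integer round-up: (n + d - 1) / d */
        unsigned long tracks    = (file_bytes + BYTES_PER_TRACK - 1) / BYTES_PER_TRACK;
        unsigned long cylinders = (file_bytes + BYTES_PER_CYLINDER - 1) / BYTES_PER_CYLINDER;

        printf("PRI in tracks: %lu, or in cylinders: %lu\n", tracks, cylinders);
        return 0;
    }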

Unfortunately, IBM's FTP server does not support the newer space allocation specifications in number of records (the JCL parameter AVGREC=U/M/K plus the record length as the first specification in the SPACE parameter).
However, there is an alternative: fall back on one of the lesser-used SPACE parameters, the block-size specification. I will assume 3390 disk types for simplicity, and standard data sets.
For fixed-length records, you want to calculate the largest block size that will fit in half a track (27998 bytes on a 3390); a full-track block would exceed the 32760-byte maximum block size z/OS supports. Since you are dealing with 80-byte records, that number is 27920 (the largest multiple of 80 that fits in 27998). Divide your file size by that number and round up to get the number of blocks. Then, in a SITE server command, specify
SITE BLKSIZE=27920 LRECL=80 RECFM=FB BLOCKS=27920 PRI=calculated# SEC=a_little_extra
This is equivalent to SPACE=(27920,(calculated#,a_little_extra)).
z/OS space allocation calculates the number of tracks required and rounds up to the nearest track boundary.
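As a sketch of that block-count arithmetic (FB/80 on 3390, half-track blocking as above; the file size and the 10% secondary pad are assumptions for illustration):

    #include <stdio.h>

    #define BLKSIZE 27920UL  /* largest multiple of 80 <= 27998 (3390 half track) */

    int main(void) {
        unsigned long file_bytes = 75000000UL;  /* hypothetical file size */
        unsigned long blocks = (file_bytes + BLKSIZE - 1) / BLKSIZE;  /* round up */
        unsigned long extra  = blocks / 10 + 1;  /* assumed pad for SEC */

        printf("SITE BLKSIZE=%lu LRECL=80 RECFM=FB BLOCKS=%lu PRI=%lu SEC=%lu\n",
               BLKSIZE, BLKSIZE, blocks, extra);
        return 0;
    }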
For variable-length records, if your reading application can handle it, always use BLKSIZE=27998. The reason for the warning about the reading application is that even today there are ISV applications that still have strange hard-coded maximum variable-length block sizes, such as 12K.
If you are dealing with PDSEs, always use BLKSIZE=32760 for variable-length records and the closest block size to 32760 for fixed-length (32720 for FB/80), but calculate your space requirements based on BLKSIZE=4096. PDSEs are strange in their underlying layout: the physical records are 4096 bytes, because linear data set VSAM code handles the physical I/O.

Related

Procedural generation and permutations question

I probably won't pursue this, but I had this idea of generating a procedural universe in the most memory-efficient way possible.
Like in the game Elite, you could use a random number generator based on a seed, so each star system can be represented by a single seed number instead of lists of stats and other info. But if each star system is a 64-bit number, then for the Milky Way's 100 billion stars that is 800 gigabytes of memory. And if you use only 8 bits per star system, you'll only have 256 unique star systems in your game. So my other idea was to have each star system represented by 8 bits, but simply grab the next 7 star systems' bytes in memory and use that combination to form a 64-bit number for the planet's seed. Obviously there would be 7 extra bytes at the end to account for the last star system in memory.
So is there any way to organize the values in these bytes such that every set of 8 bytes over the entire file covers all 64-bit values (hypothetically) with no repeats? Or is it impossible, and I should just accept repeats? Or could I possibly use the address of the byte itself as part of the seed? And how would that work in C? If I have a file of 100 billion bytes, does that actually take up exactly 100 billion bytes in memory, or is it more, and how are the addresses for those bytes stored? Also, is accessing large files like that (100 GB+) in a server-client relationship practical? Thank you.
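For illustration only, here is a minimal sketch of the mechanics of the overlapping-window idea described above (not an answer to the repeats question; the galaxy array and system_seed name are hypothetical):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* 10 star systems, padded with 7 extra bytes so the last window is full. */
    static const uint8_t galaxy[10 + 7] = {3, 14, 15, 92, 65, 35, 89, 79, 32, 38};

    /* Build the 64-bit seed for system i from its byte plus the next 7 bytes. */
    static uint64_t system_seed(const uint8_t *bytes, size_t i) {
        uint64_t seed;
        memcpy(&seed, bytes + i, sizeof seed);  /* overlapping 8-byte window */
        return seed;
    }

    int main(void) {
        for (size_t i = 0; i < 10; i++)
            printf("system %zu seed: %llu\n", i,
                   (unsigned long long)system_seed(galaxy, i));
        return 0;
    }

Note that consecutive windows share 7 of their 8 bytes, which is exactly what makes the question of arranging the bytes so that no two windows repeat nontrivial.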

Why can't two lines that differ in their address by precisely 65,536 bytes be stored in the cache at the same time?

I am reading Andrew Tanenbaum's Structured Computer Organization (6th edition, 2012), and I don't understand this passage:
"This mapping scheme puts consecutive memory lines in consecutive cache entries.In fact, up to 64 KB of contiguous data can be stored in the cache.However,two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value).For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or anyother location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there.If this happens often enough, itcan result in poor behavior.In fact, the worst-case behavior of a cache is worsethan if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why do they have the same Line value?
This is because of two concepts in cache design. The first is associativity: for every possible input cache-line address (64-byte aligned on a modern x86-64 system) there are only N possible slots in the cache where it may be stored.
The second is a problem much like the one encountered with the hash function in a hashmap: some scheme has to be used to convert input addresses to slots in the cache. Notice that the book says the cache can hold 64 KB, i.e. 65,536 bytes, and the magical cache-ruining distance in question is also 65,536! So in this case the address-to-slot function is a simple AND operation, and the author is describing a direct-mapped cache (1-way associative: each line may only be stored in one location inside the cache), which leads to the conflict mentioned.
Why would microprocessor designers choose a simple AND function? Mainly because it's simple: instead of spending transistors on more complex logic, a basic operation like AND suffices.
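A minimal sketch of that address-to-slot mapping, assuming a direct-mapped 64 KB cache with 64-byte lines (the constants match the book's example; the code itself is illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE  64u                       /* bytes per cache line */
    #define CACHE_SIZE 65536u                    /* 64 KB, direct-mapped */
    #define NUM_SLOTS  (CACHE_SIZE / LINE_SIZE)  /* 1024 slots */

    /* Direct-mapped slot selection: drop the offset bits, then AND with
       (NUM_SLOTS - 1) to keep only the index bits. */
    static unsigned slot_for(uint64_t addr) {
        return (addr / LINE_SIZE) & (NUM_SLOTS - 1);
    }

    int main(void) {
        uint64_t x = 0x12340;
        printf("slot(X)       = %u\n", slot_for(x));
        printf("slot(X+65536) = %u\n", slot_for(x + 65536));  /* same slot */
        return 0;
    }

Both calls print the same slot, so the two lines evict each other: exactly the conflict the book describes.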

How is the disk seek faster in a column-oriented database

I have recently started working with BigQuery. I've learned that it is a column-oriented database and that disk seeks are much faster in this type of database.
Can anyone explain to me how the disk seek is faster in a column-oriented database compared to a relational (row-oriented) DB?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some numeric (stored in binary) and others fixed-width text, with a total record size of 1024 bytes. The number of rows is around 10 million, which gives a total size of around 10 GB, and we're working on a PC with 4 GB of RAM. (While such tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity.)
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector, usually 4 kB. So for every sector read, we get only 4 of the values we need. It also means that in order to sum the whole column, we have to read the whole 10 GB file.
In a column store, on the other hand, data is stored in separate columns. This means that for our integer column we get 1024 values in a 4096-byte sector instead of 4 (and sometimes those values can be compressed further). The total data we need to read is now around 40 MB instead of 10 GB, and it will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming the data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
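A rough sketch of the two layouts, scaled down from the example's 10 million rows (the column offset and row count are assumptions for illustration):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define NUM_ROWS    100000  /* scaled down from the 10 million above */
    #define RECORD_SIZE 1024    /* row-store record size from the example */
    #define COL_OFFSET  16      /* assumed byte offset of our integer column */

    /* Row store: one useful integer every 1024 bytes -- the scan touches
       the entire file. */
    static int64_t sum_row_store(const uint8_t *records) {
        int64_t sum = 0;
        for (size_t i = 0; i < NUM_ROWS; i++) {
            int32_t v;
            memcpy(&v, records + i * RECORD_SIZE + COL_OFFSET, sizeof v);
            sum += v;
        }
        return sum;
    }

    /* Column store: the same integers packed contiguously -- the scan
       touches 1/256th of the data and is friendly to the L1 cache. */
    static int64_t sum_column_store(const int32_t *column) {
        int64_t sum = 0;
        for (size_t i = 0; i < NUM_ROWS; i++)
            sum += column[i];
        return sum;
    }

    int main(void) {
        uint8_t *records = calloc(NUM_ROWS, RECORD_SIZE);
        int32_t *column  = calloc(NUM_ROWS, sizeof(int32_t));
        if (!records || !column) return 1;
        int64_t a = sum_row_store(records);    /* both sum zeros here; */
        int64_t b = sum_column_store(column);  /* the point is bytes touched */
        free(records);
        free(column);
        return (a == b) ? 0 : 1;
    }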
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Why does Unix block size increase with bigger memory size?

I am profiling binary data which shows an increasing Unix block size (the one you get from stat > Blocks) as the number of events increases, as in the following figure, while the byte distance between events stays constant. I have noticed some changes in other fields of the file which may explain the increasing Unix block size. The Unix block size appears to be a dynamic measure. I am interested in why it increases with bigger memory units on some systems; I had assumed it would be constant.
I used different environments to produce the stat output:
Debian Linux 8.1 with its default stat
OSX 10.8.5 with Xcode 6 and its default stat
Greybeard's comment may have the answer to the blocks behaviour:
"The stat (1) command used to be a thin CLI to the stat (2) system call, which used to transfer relevant parts of a file's inode. Pretty early on, the meaning of the st_blksize member of the C struct returned by stat (2) was changed to 'preferred' blocksize for efficient file system I/O, which carries well to file systems with mixed block sizes or non-block oriented allocation."
How can you measure the block size in cases (1) and (2) separately?
Why can the Unix block size increase with bigger memory size?
"Stat blocks" is not a block size. It is number of blocks the file consists of. It is obvious that number of blocks is proportional to size. Size of block is constant for most file systems (if not all).

Sort serial data with buffer

Are there any algorithms to sort data from a serial input using a buffer that is smaller than the data length?
For example, I have 100 bytes of serial data, which can be read only once, and a 40-byte buffer, and I need to print out the sorted bytes.
I need it in JavaScript, but any general ideas are appreciated.
This kind of sorting is not possible in a single pass.
Using your example: suppose you have filled your 40 byte buffer, so you need to start printing out bytes in order to make room for the next one. In order to print out sorted data, you must print the smallest byte first. However, if the smallest byte has not been read, you can't possibly print it out yet!
The closest relevant fit to your question may be external sorting algorithms, which take multiple passes in order to sort data that can't fit into memory. That is, if you have peripherals that can store the output of a processing pass, you can sort data larger than your memory in O(log(N/M)) passes, where N is the size of the problem, and M is the size of your memory.
The classic storage peripheral for external sorting is the tape drive; however, the same algorithms work for disk drives (of whatever kind). Also, as cache hierarchies grow in depth, the principles of external sorting become more relevant even for in-memory sorts -- try taking a look at cache-oblivious algorithms.
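For a concrete (if toy-sized) illustration, here is a minimal two-phase external merge sort sketch in C rather than the requested JavaScript, using the 40-byte buffer from the example; the run limit and error handling are simplified assumptions:

    #include <stdio.h>
    #include <stdlib.h>

    #define BUF_SIZE 40  /* the 40-byte buffer from the example */
    #define MAX_RUNS 16  /* plenty for 100 bytes of input */

    static int cmp(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    int main(void) {
        unsigned char buf[BUF_SIZE];
        FILE *runs[MAX_RUNS];
        int head[MAX_RUNS];
        int nruns = 0;
        size_t n;

        /* Pass 1: read the serial input in buffer-sized chunks, sort each
           chunk, and spill it to a temporary file (the "peripheral"). */
        while (nruns < MAX_RUNS && (n = fread(buf, 1, BUF_SIZE, stdin)) > 0) {
            qsort(buf, n, 1, cmp);
            if ((runs[nruns] = tmpfile()) == NULL) return 1;
            fwrite(buf, 1, n, runs[nruns]);
            rewind(runs[nruns]);
            nruns++;
        }

        /* Pass 2: k-way merge -- repeatedly emit the smallest head byte,
           holding only one byte per run in memory. */
        for (int i = 0; i < nruns; i++)
            head[i] = fgetc(runs[i]);
        for (;;) {
            int min = -1;
            for (int i = 0; i < nruns; i++)
                if (head[i] != EOF && (min < 0 || head[i] < head[min]))
                    min = i;
            if (min < 0) break;            /* all runs exhausted */
            putchar(head[min]);
            head[min] = fgetc(runs[min]);  /* advance the winning run */
        }
        return 0;
    }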
