How does the fdisk utility use CHS addressing on SSDs?

In my free time I am working on a personal project for reading and writing all kinds of disk structures like MBR, GPT, EXT2/3/4, NTFS, etc. While working on the MBR part I noticed that the fdisk utility populates the CHS address fields of the MBR partition entries when formatting a device... even if the target device is an SSD or a normal file like a disk image. The lack of any actual cylinders and heads/platters in an SSD or file made me curious: What does fdisk write to those CHS address fields?
Well, after some testing (i.e. formatting a normal file multiple times with different start sectors for the first partition), I am fairly certain that fdisk actually writes CHS addresses into those fields. But how does fdisk determine the cylinder/head sizes for an SSD or for a normal file? I can't imagine the HDIO_REQ and HDIO_GETGEO ioctls working on normal files.

After some more testing (see below) I am confident that the CHS addresses produced by fdisk always satisfy sector % (255 * 63) == 63 * chs.head + (chs.sector - 1) and chs.cylinder == 0 with chs.head only taking values in [0,254] and chs.sector only taking values in [1,63], regardless of the underlying (device) file's size.
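If that rule is right, the LBA-to-CHS conversion is plain integer arithmetic. Here is a minimal sketch of it; the 255-head/63-sector fake geometry is my inference from the tests above, not documented fdisk behavior:

def lba_to_fdisk_chs(lba):
    # Inferred rule: pretend the disk has 255 heads and 63 sectors
    # per track, and always leave the cylinder at 0.
    r = lba % (255 * 63)
    return (0, r // 63, r % 63 + 1)  # (cylinder, head, sector)

For example, lba_to_fdisk_chs(2048) gives (0, 32, 33), the familiar 0/32/33 triple written for partitions starting at sector 2048.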
I used the following Python script to quickly convert from LBAs to fdisk-CHSs to generate test cases. Tested on normal files as well as device files with different sector sizes.
import ctypes, subprocess

class chs_struct (ctypes.LittleEndianStructure):
    # Unused: No system modern enough to run Python uses CHS addressing
    _pack_ = 1
    _fields_ = [
        ( "head", ctypes.c_uint8 ),
        ( "sector", ctypes.c_uint8, 6 ),
        ( "cylinder_hi", ctypes.c_uint8, 2 ),
        ( "cylinder_lo", ctypes.c_uint8 )
    ]

def chs_of_sector (first_sector, filename):
    """
    Formats the file pointed to by filename with a new MBR whose first and
    only partition is a primary partition starting at the specified
    first_sector. This function returns the corresponding CHS address the
    fdisk utility calculated.
    """
    with subprocess.Popen(
        ["fdisk", filename],
        universal_newlines = True,
        stdin = subprocess.PIPE,
        stdout = subprocess.DEVNULL,
        stderr = subprocess.DEVNULL
    ) as proc:
        #                            +--------------------------------------------- Create new MBR
        #                            |    +---------------------------------------- Create new partition
        #                            |    |    +----------------------------------- Make it a primary partition...
        #                            |    |    |    +------------------------------ and make it the first partition...
        #                            |    |    |    |   +-------------------------- starting at this sector
        #                            |    |    |    |   |                  +------- Max out partition size
        #                            |    |    |    |   |                  |    +-- Write MBR to file
        #                            v    v    v    v   v                  v    v
        proc.communicate("\n".join(["o", "n", "p", "1", str(first_sector), "", "w"]))
    with open(filename, "rb") as file:
        file.seek(447)  # Offset of the first partition entry's first-sector CHS address
        chs = chs_struct.from_buffer_copy(file.read(3))
    # Note the parentheses: << binds more loosely than +
    return (chs.head, (chs.cylinder_hi << 8) + chs.cylinder_lo, chs.sector)
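Hypothetical usage (the image name and 64M size are arbitrary; truncate creates a sparse file for fdisk to partition):

import subprocess
subprocess.run(["truncate", "-s", "64M", "disk.img"])
print(chs_of_sector(2048, "disk.img"))  # expect (32, 0, 33), i.e. (head, cylinder, sector)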

Related

Reading freely available UVR data using gfortran on mac OSX

I would like to use fortran to read ultraviolet radiation data that has been produced by the Japan Aerospace Exploration Agency. This data is at a daily and monthly temporal resolution from 2000-2010 at a ~5 km spatial resolution. This question is worth answering as the data could be useful for a number of environment/health projects and is freely available, with proper acknowledgement of source and sharing of preprint of any subsequent publications, from:
ftp://suzaku.eorc.jaxa.jp/pub/GLI/glical/Global_05km/monthly/uvb/
There is a readme file available, which provides instructions on how to read data using fortran as follows:
Instructions for _le files
Header
Read header (size= pixel size *2byte):
character head*14400
read(10,rec=1) head
read(head,'(2i6,2f8.2,f8.4,2e12.5,a1,a8,a1,a40)')
& npixel,nline,lon_min,lat_max,reso,slope,offset,',',
& para,',',outfile
Read data (e.g., fortran77)
parameter(nl=7200, ml=3601)
... open file by "unformatted", "recl=nl*2(byte)" (,"bytereclen")
integer*2 i2buf(nl,ml)
do m=1,ml
  read(10,rec=1+m) (i2buf(n,m), n=1,nl)
  do n=1,nl
    par=i2buf(n,m)*slope+offset
    write(6,*) 'PAR[Ein/m^2/day]=',par
  enddo
enddo
slope values
par__le : daily PAR [Ein/m^2/day] = DN * 0.01
dpar_le : direct PAR = DN * 0.01
swr__le : daily mean shortwave radiation [W/m^2] = DN * 0.01
tip__le : transmittance of instantaneous PAR at noon = DN * 0.0001
uva__le : daily mean UVA [W/m^2] = DN * 0.001
uvb__le : daily mean UVB [W/m^2] = DN * 0.0001
rpar_le : PAR-range surface reflectance (TOP of canopy/solid surfaces) = DN * 0.0001 (monthly data only)
error values
-1 as signed short integer (int16)
65535 as unsigned short integer (uint16)
Progress so far
I have downloaded and installed gfortran successfully on mac OSX. I have downloaded a test file (MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le.gz) and decompressed it. I have created a program file:
PROGRAM readuvr
  IMPLICIT NONE
  !some code
END PROGRAM
I will then type the following into the command line to create an executable and run it to extract the data.
gfortran -o executable
./executable
As a complete beginner to fortran, my question is: how can I use the instructions provided to build a program that can read the data and output it into a text file?
Well, that file expands to 51,868,800 bytes. The comments imply the header is 14,400 bytes, which leaves 51,854,400 bytes of actual data payload.
There seem to be 3601 lines of data, so that means there are 14,400 bytes per line. The samples are 2 bytes (16-bit) wide, so that makes 7200 samples per line, which matches nl=7200 (and the 3601 lines match ml=3601).
So basically, you need to read 14,400 bytes of header, then 3601 lines of data, each line consisting of 7200 values, each of those being 2 bytes wide...
Actually, if you are that unfamiliar with FORTRAN, you may like to extract the data with Perl, which is already installed and available on OS X anyway. I have started a VERY SIMPLISTIC Perl program that reads the data and prints the first 2 values on each line:
#!/usr/bin/perl
use strict;
use warnings;

# Read 14,400 bytes of header
my $buffer;
my $nBytes = 14400;
my $bytesRead = read (STDIN, $buffer, $nBytes);
my ($npixel,$nline,$lon_min,$lat_max,$reso,$slope,$offset,$junk)=split(' ',$buffer);
print "npixel:$npixel\n";
print "nline:$nline\n";
print "lon_min:$lon_min\n";
print "lat_max:$lat_max\n";
print "reso:$reso\n";
print "slope:$slope\n";
$offset =~ s/,.*//;   # strip trailing comma and junk
print "offset:$offset\n";

# Read actual lines of data
my $line;
for(my $m=1;$m<=$nline;$m++){
    read(STDIN,$line,$npixel*2);
    my $x=$npixel*2;
    my @values=unpack("S$x",$line);
    printf "Line: %d",$m;
    for(my $j=0;$j<2;$j++){
        printf ",%f",$values[$j]*$slope+$offset;
    }
    printf "\n";   # newline
}
Save it as go.pl and then in the Terminal, type the following once to make it executable
chmod +x go.pl
and then run it like this
./go.pl < MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le
Sample output extract:
npixel:7200
nline:3601
lon_min:0.00
lat_max:90.00
reso:0.0500
slope:0.10000E-03
offset:0.00000E+00
...
...
Line: 3306,0.099800,0.099800
Line: 3307,0.099900,0.099900
Line: 3308,0.099400,0.074200
Line: 3309,0.098900,0.098900
Line: 3310,0.098400,0.098400
Line: 3311,0.074300,0.074200
Line: 3312,0.071300,0.071200
fortran (f2003 or so) solution. (The linked instructions are awful, by the way.)
      implicit none
      character*80 para,outfile
      character(len=:),allocatable::header,infile
      integer npixel,nline,blen,i
c note kind=2 is not standard. This needs to be a 2-byte integer.
      integer(kind=2),allocatable :: data(:,:)
      real lon_min,lat_max,reso,slope,off
c header is plain text, so first open formatted and
c directly read header data
      infile='MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le'
      open(10,file=infile)
      read(10,*)npixel,nline,lon_min,lat_max,reso,slope,off,
     $ para,outfile
      close(10)
      write(*,*)npixel,nline,lon_min,lat_max,reso,slope,off,
     $ trim(para),' ',trim(outfile)
      blen=2*npixel
      allocate(character(len=blen)::header)
      allocate(data(npixel,nline))
      if( sizeof(data(1,1)).ne.2 )then
         write(*,*)'error kind=2 did not give a 2 byte integer'
         stop
      endif
c now close and reopen for binary read.
c direct access approach:
      open(20,file=infile,access='direct',recl=blen/4)
c note the granularity of the recl= specifier is not standard.
c ifort uses 4 bytes. (note this will break if npixel is not even )
      read(20,rec=1)header
      write(*,*)trim(header)
      do i=1,nline
         read(20,rec=i+1)data(:,i)
      enddo
c note streams if available is simpler: (we don't need to know rec len )
c     open(20,file=infile,access='stream')
c     read(20)header,data
      end
This is not actually validated because I don't have known file content to compare against.
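For cross-checking, here is a Python sketch under the same layout assumptions (a 14,400-byte text header whose field widths come from the readme's format string, followed by 3601 records of 7200 little-endian 16-bit integers; I read them as signed since the readme lists -1 as an int16 error value, where the Perl above used unsigned):

import struct

fname = "MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le"
with open(fname, "rb") as f:
    header = f.read(14400).decode("ascii")
    # fixed-width fields per '(2i6,2f8.2,f8.4,2e12.5,a1,a8,a1,a40)'
    npixel, nline = int(header[0:6]), int(header[6:12])
    slope, offset = float(header[36:48]), float(header[48:60])
    for m in range(nline):
        row = struct.unpack("<%dh" % npixel, f.read(2 * npixel))
        print(m + 1, row[0] * slope + offset, row[1] * slope + offset)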

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later call into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the file is a simple byte array. Here is an example of how to read it in Python:
# Python 2
import struct
import sys
from array import array as binarray

inputfile = sys.argv[1]
data = open(inputfile, "rb").read()   # open in binary mode
a = binarray('c')
a.fromstring(data)
s = struct.Struct("f")
l = len(a)
print "%d bytes" % l
n = l / 4
for i in xrange(0, n):
    x = s.unpack_from(a, i * 4)[0]
    print ("%d %f" % (i, x))
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac or Linux, the following command works to print the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called, e.g., filename.4B.vout, then some command-line Perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'
Edited to add: this is for the assignments of connected component ID and community ID, which are written implicitly: the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1, etc. I am copy-pasting here, so I'm not sure how it would need to change for floats, but it works great for the integer values per node.

How to print the contents of a memory address using LLDB?

I am using LLDB and wondering how to print the contents of a specific memory address, for example 0xb0987654.
To complement Michael's answer.
I tend to use:
memory read -s1 -fu -c10000 0xb0987654 --force
That will print in the debugger.
-s sets the byte grouping, so use 1 for uint8, for example, or 4 for int
-f sets the format. I always forget the right symbol; just issue the statement with -f and it will snap back at you with the list of all the options
-c is the count of bytes
if you are printing more than 1024 bytes, append --force
Hope this helps.
Xcode has a very nice Memory Browser window, which displays the contents of memory addresses. It also lets you control the byte grouping and the number of bytes displayed, and move back or forward a memory page:
You can access it by pressing ⌘^⌥⇧M. After entering it, press enter to open the memory browser in the main editor.
or
Debug --> Debug Workflow --> View Memory
Notice the field on its bottom left corner where you can paste the memory address you want to inspect!
Documentation here: https://developer.apple.com/library/ios/recipes/xcode_help-debugger/articles/viewing_memory.html
Related answer here: How do I open the memory browser in Xcode 4?
"me" is the command you're looking for.
For example, this lldb command:
me -r -o /tmp/mem.txt -c512 0xb0987654
will copy 512 bytes from your memory address into a file at /tmp/mem.txt.
For example, to print memory of length 16x4 bytes:
x/16 0xb0987654
Here's a simple trick for displaying typed arrays of fixed-length in lldb. If your program contains a long* variable that points to 9 elements, you can declare a struct type that contains a fixed array of 9 long values and cast the pointer to that type:
long *values = new long[9]{...};
(lldb) expr typedef struct { long values[9]; } l9; *(l9 *)values
(l9) $1 = {
  values = {
    [0] = 0
    [1] = 1
    [2] = 4
    [3] = 9
    [4] = 16
    [5] = 25
    [6] = 36
    [7] = 49
    [8] = 64
  }
}
I use the typedef when I'm coding in C; it's not needed in C++.

reading file with UPC

I'm starting to learn UPC, and I have the following piece of code to read a file:
upc_file_t *fileIn;
int n;
fileIn = upc_all_fopen("input_small", UPC_RDONLY | UPC_INDIVIDUAL_FP , 0, NULL);
upc_all_fread_local(fileIn, &n, sizeof(int), 1, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
upc_barrier;
printf("%d\n", n);
upc_all_fclose(fileIn);
However, the output (the value of n) is always 808651319, which means something is wrong, and I can't figure out what it is. The first line of the file I'm giving as input is '7', so the result of the printf should be 7...
Any idea why this happens?
Thanks in advance!
The UPC Parallel I/O library performs unformatted (binary) input/output, not formatted I/O like what you get with (f)printf(3)/(f)scanf(3) from the standard C library. Parallel I/O cannot handle text files because of their intrinsic properties, such as variable-length records.
upc_all_fread_local(fileIn, &n, sizeof(int), 1, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC)
behaves like the following call to the standard C library function for unformatted read from a file:
fread(&n, sizeof(int), 1, fh)
You are just reading 1 element of sizeof(int) bytes from the file (4 bytes on most platforms) into the address of n. The number you got, 808651319, is 0x30330A37 in hexadecimal. On little-endian systems like x86/x64 this is stored in memory and on disk as 0x37 0x0A 0x33 0x30 (reversed byte order). These are the ASCII codes of the first 4 bytes of the string "7\n30" (\n, or LF, is the line-feed/newline symbol), so I'd guess your input_small file looked like:
7
30...
...
You should prepare your input data in binary format using fwrite(3) instead of using (f)printf(3) or your text editor of choice.
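A few lines of Python (standing in for fwrite(3)) both verify the byte-order explanation and produce a binary input_small; the values 7 and 30 are the guess above:

import struct
assert struct.unpack("<i", b"7\n30")[0] == 808651319  # the mystery number
with open("input_small", "wb") as f:
    f.write(struct.pack("<2i", 7, 30))  # raw little-endian 32-bit integers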

Parallel Reading/Writing File in c

The problem is to read a file of about 20 GB simultaneously with n processes. The file contains one string per line, and the string lengths may vary; a string can be at most 10 bytes long.
I have a cluster of 16 nodes. Each node is a uni-processor with 6 GB RAM. I am using MPI to write parallel code.
What is an efficient way to partition this big file so that all resources can be utilized?
Note: the constraint on the partitions is that the file should be read in chunks of a fixed number of lines.
Assume the file contains 1600 lines (i.e. 1600 strings). Then the first process should read lines 1 to 100, the second process lines 101 to 200, and so on...
I think a file can't be read by more than one process at a time, because we have only one file handle, which points to just one place. How, then, can other processes read in parallel from different chunks?
So as you're discovering, text file formats are poor for dealing with large amounts of data; not only are they larger than binary formats, but you run into formatting problems like here (searching for newlines), and everything is much slower (data must be converted into strings). There can easily be a 10x difference in IO speeds between text-based formats and binary formats for numerical data. But we'll assume for now you're stuck with the text file format.
Presumably, you're doing this partitioning for speed. But unless you have a parallel filesystem -- that is, multiple servers serving from multiple disks, and a FS that can keep those coordinated -- it's unlikely you're going to get a significant speedup from having multiple MPI tasks reading from the same file, as ultimately these requests are all going to get serialized anyway at the server/controller/disk level.
Further, reading in large blocks of data is going to be much faster than fseek()ing around and doing small reads looking for newlines.
So my suggestion would be to have one process (perhaps the last) read all the data in as few chunks as it can and send the relevant lines to each task (including, finally, itself). If you know how many lines the file has at the start, this is fairly simple; read in say 2 GB of data, search through memory for the end of the N/Pth line, and send that to task 0, send task 0 a "completed your data" message, and continue.
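A rough mpi4py sketch of that one-reader scheme (mpi4py, the file name, and reading everything in one go instead of 2 GB chunks are my simplifications):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == size - 1:                        # the last task does all the reading
    with open("input.txt") as f:
        lines = f.readlines()
    per = (len(lines) + size - 1) // size
    chunks = [lines[i * per:(i + 1) * per] for i in range(size)]
else:
    chunks = None
mine = comm.scatter(chunks, root=size - 1)  # every task receives its block of lines
print("rank %d got %d lines" % (rank, len(mine)))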
You don't specify if there are any constraints on the partitions, so I'll assume there are none. I'll also assume that you want the partitions to be as close to equal in size as possible.
The naïve approach would be to split the file into chunks of size 20GB/n. The starting position of chunk i would be i*20GB/n for i=0..n-1.
The problem with that is, of course, that there's no guarantee that chunk boundaries would fall between the lines of the input file. In general, they won't.
Fortunately, there's an easy way to correct for this. Having established the boundaries as above, shift them slightly so that each of them (except i=0) is placed after the following newline.
That'll involve reading 15 small fragments of the file, but will result in a very even partition.
In fact, the correction can be done by each node individually, but it's probably not worth complicating the explanation with that.
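A sketch of that correction in plain Python (no MPI; every boundary except the first is pushed just past the next newline):

import os

def partition_offsets(filename, n):
    size = os.path.getsize(filename)
    offsets = [0]
    with open(filename, "rb") as f:
        for i in range(1, n):
            f.seek(i * size // n)    # naive boundary
            f.readline()             # skip ahead to just past the next newline
            offsets.append(f.tell())
    offsets.append(size)
    return offsets                   # chunk i is bytes [offsets[i], offsets[i+1])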
I think it would be better to write a piece of code that gets the line lengths and distributes the lines to processes. That distributing function would work not with the strings themselves but only with their lengths.
Finding an algorithm for an even distribution of sources of fixed size is not a problem.
After that, the distributing function will tell the other processes which pieces they have to work on. Process 0 (the distributor) reads a line; it already knows that line no. 1 should be handled by process 1. ... P.0 reads line no. N and knows which process has to work on it.
Oh! We needn't optimize the distribution from the start. Simply have the distributor process read a new line from the input and give it to a free process. That's all.
So you actually have two solutions: a heavily optimized one and an easy one.
We could achieve even more optimization if the distributor process re-optimizes the as-yet-unread strings from time to time.
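The easy one is a classic master/worker loop. A hedged mpi4py sketch (the tags and file name are assumptions):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_READY, TAG_WORK = 1, 2

if rank == 0:                                   # the distributor
    status = MPI.Status()
    with open("input.txt") as f:
        for line in f:
            comm.recv(source=MPI.ANY_SOURCE, tag=TAG_READY, status=status)
            comm.send(line, dest=status.Get_source(), tag=TAG_WORK)
    for _ in range(size - 1):                   # tell every worker to stop
        comm.recv(source=MPI.ANY_SOURCE, tag=TAG_READY, status=status)
        comm.send(None, dest=status.Get_source(), tag=TAG_WORK)
else:                                           # a worker
    while True:
        comm.send(None, dest=0, tag=TAG_READY)  # "I am free"
        line = comm.recv(source=0, tag=TAG_WORK)
        if line is None:
            break
        # ... process the line here ...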
Here is a function in Python that uses MPI and the pypar extension to count the number of lines in a big file, splitting the work among a number of hosts.
def getFileLineCount( file1 ):
    """
    uses pypar and mpi to speed up counting lines
    parameters:
        file1 - the file name to count lines
    returns:
        (line count)
    """
    import logging, mmap, os, pypar

    p1 = open( file1, "r" )
    f1 = mmap.mmap( p1.fileno(), 0, access = mmap.ACCESS_READ )
    # work out file size
    fSize = os.stat( file1 ).st_size
    # divide up to farm out line counting
    chunk = ( fSize / pypar.size() ) + 1
    lines = 0
    # set start and end locations
    seekStart = chunk * ( pypar.rank() )
    seekEnd = chunk * ( pypar.rank() + 1 )
    if seekEnd > fSize:
        seekEnd = fSize
    # find start of next line after chunk
    if pypar.rank() > 0:
        f1.seek( seekStart )
        l1 = f1.readline()
        seekStart = f1.tell()
    # tell previous rank my seek start to make their seek end
    if pypar.rank() > 0:
        # logging.info( 'Sending to %d, seek start %d' % ( pypar.rank() - 1, seekStart ) )
        pypar.send( seekStart, pypar.rank() - 1 )
    if pypar.rank() < pypar.size() - 1:
        seekEnd = pypar.receive( pypar.rank() + 1 )
        # logging.info( 'Receiving from %d, seek end %d' % ( pypar.rank() + 1, seekEnd ) )
    f1.seek( seekStart )
    logging.info( 'Calculating line lengths and positions from file byte %d to %d' % ( seekStart, seekEnd ) )
    l1 = f1.readline()
    prevLine = l1
    while len( l1 ) > 0:
        lines += 1
        l1 = f1.readline()
        if f1.tell() > seekEnd or len( l1 ) == 0:
            break
        prevLine = l1
    # while
    f1.close()
    p1.close()
    if pypar.rank() == 0:
        logging.info( 'Receiving line info' )
        for p in range( 1, pypar.size() ):
            lines += pypar.receive( p )
    else:
        logging.info( 'Sending my line info' )
        pypar.send( lines, 0 )
    lines = pypar.broadcast( lines )
    return ( lines )
