HDFS fsck command output - hadoop

I got this in the output, and I want to know what BP and Blk are. Can you explain what each part of this output means? I know the
BP-929597290- len=2 repl=3 [DatanodeInfoWithStorage[,DS-730a75d3-046c-4254-990a-4eee9520424f,DISK], DatanodeInfoWithStorage[,DS-fc6ee5c7-e76b-4faa-b663-58a60240de4c,DISK], DatanodeInfoWithStorage[,DS-8ab81b26-309e-42d6-ae14-26eb88387cad,DISK]]
I guess this is the IP of the first replica of the data.

This is the Block Pool ID. A block pool is a set of blocks that belong to a single namespace. For simplicity, you can say that all the blocks managed by a Name Node are under the same block pool.
The Block Pool ID is formed as:
String bpid = "BP-" + rand + "-"+ ip + "-" + Time.now();
rand = Some random number
ip = IP address of the Name Node
Time.now() - Current system time
Read about Block Pools here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html
Block number of the block. Each block in HDFS is given a unique identifier.
The block ID is formed from the following parts:
blockid = ID of the block
genstamp = an incrementing integer that records the version of a particular block
Read about generation stamp here: http://blog.cloudera.com/blog/2009/07/file-appends-in-hdfs/
Length of the block: Number of bytes in the block
There are 3 replicas of this block
Where: => IP address of the Data Node holding this block
1000 => Data streaming port
DS-730a75d3-046c-4254-990a-4eee9520424f => Storage ID. It is an internal ID of the Data Node. It is assigned, when the Data Node registers with Name Node
DISK => storageType. It is DISK here. Storage type can be: RAM_DISK, SSD, DISK and ARCHIVE
The description of point 5 also applies to the remaining 2 replicas.
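The Block Pool ID format described above ("BP-" + rand + "-" + ip + "-" + Time.now()) can be taken apart again with a few lines of Python. This is only a sketch: the helper name parse_block_pool_id and the sample ID are made up for illustration, and the IP and timestamp in it are placeholders, not values from the fsck output in the question.

```python
from collections import namedtuple

BlockPoolId = namedtuple("BlockPoolId", ["rand", "ip", "timestamp"])


def parse_block_pool_id(bpid):
    """Split 'BP-<rand>-<ip>-<time>' into its three parts."""
    # Split off the "BP" prefix and the random number; keep the rest intact.
    prefix, rand, rest = bpid.split("-", 2)
    if prefix != "BP":
        raise ValueError("not a block pool ID: %r" % bpid)
    # The timestamp is the last dash-separated field; what remains is the IP.
    ip, timestamp = rest.rsplit("-", 1)
    return BlockPoolId(int(rand), ip, int(timestamp))


# Hypothetical ID, for illustration only.
print(parse_block_pool_id("BP-12345-10.0.0.1-1400000000000"))
```

Splitting from the right for the timestamp means the IP address is taken as whatever sits between the random number and the last dash.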


PySpark performance tuning - to cache or not to cache?

I am trying to speed up calculations that add multiple columns to a PySpark data frame, and I found the sparkbyexamples article on performance tuning. I am considering whether to use cache and spark.sql.shuffle.partitions.
Would cache be appropriate for code that first joins multiple data frames and then adds calculations over different windows?
What happens when the cached data frame is reassigned (see below)?
from pyspark.sql import Window
from pyspark.sql import functions as F

df = dfA.join(dfB, on=['key'], how='left')  # should I add .cache() here?
w_u = Window.partitionBy('user')
w_m = (Window.partitionBy(['user', 'month']).orderBy('month')
       .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing))
MLAB = ['val1', 'val2']  # example to indicate that I run similar operations multiple times
for mlab in MLAB:
    percent_50 = F.expr('percentile_approx(' + mlab + ', 0.5)')
    df = df.withColumn(mlab + '_md', percent_50.over(w_u))  # what happens to the cache when I reassign df?
Afterwards I am adding additional operations that include aggregations, such as:
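# w is assumed to be a per-user window such as Window.partitionBy('userId');
# distance() is a helper of mine that is not shown here
radius_df = (df
    # number of visits per stop
    # (assumption: the coordinates are carried along with F.first so later steps can use them)
    .groupby('userId', 'locationId').agg(
        F.count(F.lit(1)).alias('n_i'),
        F.first('locationLatitude').alias('locationLatitude'),
        F.first('locationLongitude').alias('locationLongitude'))
    # compute center of mass (lat/lon) per user
    .withColumn('center_lon', F.avg(F.col('locationLongitude')).over(w))
    .withColumn('center_lat', F.avg(F.col('locationLatitude')).over(w))
    # compute total visits
    .withColumn('N', F.sum(F.col('n_i')).over(w))
    # compute (r_i - r_cm)
    .withColumn('distance', distance(F.col('locationLatitude'), F.col('locationLongitude'), F.col('center_lat'), F.col('center_lon')))
    # compute n_i(r_i - r_cm)^2 / N
    .withColumn('distance2', F.col('n_i') * (F.col('distance') * F.col('distance')) / F.col('N'))
    # compute sum(n_i(r_i - r_cm)^2)
    # (assumption: a per-user aggregation producing sum_dist2)
    .groupBy('userId').agg(F.sum(F.col('distance2')).alias('sum_dist2'))
    # square root
    .withColumn('radius_gyr', F.sqrt(F.col('sum_dist2')))
)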
df_f = df.join(radius_df.dropDuplicates(), on='userId', how='left')
I am open to any suggestions on how to speed up the code. Many thanks.

Obtaining the sum of correct message byte lengths on the network layer by the Omnet++ result collection

Suppose that in a wireless network with 25 nodes, some nodes send messages to other nodes according to a routing protocol such as AODV. We simulate this network. After the simulation finishes, how can we obtain the sum of message byte lengths on the network layer using the OMNeT++ result collection?
For each node, we must have two metrics, a metric for sent message byte lengths (e.g. totalSentMessageByteLengths) and a metric for received message byte lengths (e.g. totalReceivedMessageByteLengths).
By correct messages, I mean messages received by a node whose destination address field is the address of that same node. If a retransmission occurs, the message should be counted once on the receiver side, while on the sender side both the incorrect and the correct message byte lengths should be counted. If a node has more than one application, the message byte lengths generated by all applications of that node must be included. Message byte length means the total size, in bytes, of header and data on the network layer.
An example configuration for a node in omnetpp.ini:
*.hostA.numApps = 2
*.hostA.app[0].typename = "UdpBasicApp"
*.hostA.app[0].destAddresses = "hostB"
*.hostA.app[0].destPort = 5000
*.hostA.app[0].messageLength = 1000B
*.hostA.app[0].sendInterval = exponential(12ms)
*.hostA.app[0].packetName = "UDPData"
*.hostA.app[1].typename = "TcpBasicApp"
*.hostA.app[1].destAddresses = "hostC"
*.hostA.app[1].destPort = 5001
*.hostA.app[1].messageLength = 1024B
*.hostA.app[1].sendInterval = exponential(45ms)
*.hostA.app[1].packetName = "TCPData"
The Ipv4 module has several signals, such as packetSentToLower or packetReceivedFromLower, that can be used to create statistics at either the node or the network level. Just use these signals in your @statistic declarations.

Omnetpp.ini - How to create a loop for the host parameters

I have 1000 hosts. I need to simulate a situation in which host[0] connects to the other 999 hosts via PingApp according to a timetable.
For example:
**.host[0]*.numPingApps = 999 #number of hosts
**.host[0]*.pingApp[*].typename = "PingApp"
**.host[0]*.pingApp[*].packetSize = 42 B
**.host[0]*.pingApp[*].sendInterval = 1 s
**.host[0]*.pingApp[*].srcAddr = "host[0]"
**.host[0]*.pingApp[0].destAddr = "host[1]"
**.host[0]*.pingApp[0].startTime = 0 s
**.host[0]*.pingApp[0].stopTime = 5s
**.host[0]*.pingApp[1].destAddr = "host[2]"
**.host[0]*.pingApp[1].startTime = 0.1 s
**.host[0]*.pingApp[1].stopTime = 5.1 s
**.host[0]*.pingApp[2].destAddr = "host[3]"
**.host[0]*.pingApp[2].startTime = 0.2 s
**.host[0]*.pingApp[2].stopTime = 5.2 s
**.host[0]*.pingApp[3].destAddr = "host[4]"
**.host[0]*.pingApp[3].startTime = 0.3 s
**.host[0]*.pingApp[3].stopTime = 5.3 s
and so on...
How can I create a loop that automatically sets the parameters startTime, stopTime and destAddr, and the number of pingApps?
I need to increase startTime and stopTime by 0.1 s each time the pingApp index and destAddr increase by one.
Help me please!
Thank you!
Actually, every host should have only one Ping Application. To achieve your goal you can use the following settings:
**.host[*].numApps = 1
**.host[*].app[0].typename = "PingApp"
**.host[999].app[0].destAddr = "host[0]"
**.host[*].app[0].destAddr = "host[" + string(parentIndex()+1) + "]"
**.host[*].app[0].startTime = replaceUnit (0.1*(parentIndex()), "s")
**.host[*].app[0].stopTime = replaceUnit (5 + 0.1*(parentIndex()), "s")
parentIndex() returns the index of the host in the vector of hosts (see the OMNeT++ Manual). For the last node (i.e. host[999]), destAddr is set by hand because parentIndex()+1 would evaluate to 1000, and there is no host[1000].
The second NED function, replaceUnit(), is used to add the unit to the result of the calculation.
Here is another quasi-solution:
From the PingApp's documentation:
string destAddr = default(""); // destination address(es), separated by spaces, "*" means all IPv4/IPv6 interfaces in entire simulation
Specifying '*' allows pinging ALL configured network interfaces in the
whole simulation. This is useful to check if a host can reach ALL other
hosts in the network (i.e. routing tables were set up properly).
To specify the number of ping requests sent to a single destination address,
use the 'count' parameter. After the specified number of ping requests was
sent to a destination address, the application goes to sleep for 'sleepDuration'.
Once the sleep timer has expired, the application switches to the next destination
and starts pinging again. The application stops pinging once all destination
addresses were tested or the simulation time reaches 'stopTime'.
So if you have only these hosts in the network and you don't mind that the host also pings itself at the beginning, you can set destAddr="*" and count=1.
I combined the answers of @Rudi and @JerzyD. and got a workable solution:
**.host[0]*.numPingApps = 999
**.host[0]*.pingApp[*].typename = "PingApp"
**.host[0]*.pingApp[*].sendInterval = 1 s
**.host[0]*.pingApp[*].packetSize = 42 B
**.host[0]*.pingApp[0..998].destAddr = "host[" + string(index()+1) + "]"
**.host[0]*.pingApp[0..998].startTime = replaceUnit (0.1 * (index()), "s")
**.host[0]*.pingApp[0..998].stopTime = replaceUnit (5 + 0.1 * (index()), "s")
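To sanity-check what these expressions evaluate to, the same arithmetic can be reproduced outside the simulator. The Python sketch below is not part of OMNeT++ or INET; ping_schedule is a hypothetical helper that simply lists the destAddr, startTime and stopTime each pingApp[i] would receive from the three lines above.

```python
def ping_schedule(num_apps, step=0.1, duration=5.0):
    """Return (index, destAddr, startTime, stopTime) for each pingApp index."""
    rows = []
    for i in range(num_apps):
        start = round(step * i, 6)             # mirrors 0.1 * index()
        stop = round(duration + step * i, 6)   # mirrors 5 + 0.1 * index()
        dest = "host[%d]" % (i + 1)            # mirrors "host[" + string(index()+1) + "]"
        rows.append((i, dest, start, stop))
    return rows


# Print the first few entries of the timetable.
for i, dest, start, stop in ping_schedule(4):
    print("pingApp[%d]: destAddr=%s startTime=%ss stopTime=%ss" % (i, dest, start, stop))
```

Rounding is only there to keep floating-point noise out of the printed times; the simulator does its own unit handling through replaceUnit().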

Parallel Reading/Writing File in C

The problem is to read a file of about 20 GB simultaneously with n processes. The file contains one string per line, and the strings may differ in length. A string is at most 10 bytes long.
I have a cluster of 16 nodes. Each node is a uniprocessor with 6 GB of RAM. I am using MPI to write parallel code.
What is an efficient way to partition this big file so that all resources can be utilized?
Note: the constraint on the partitions is that the file must be read in chunks of a fixed number of lines.
Assume the file contains 1600 lines (i.e. 1600 strings). Then the first process should read lines 1 to 100, the second process lines 101 to 200, and so on.
I think that a file can't be read by more than one process at a time, because we have only one file handle, which points to only one position. How, then, can the other processes read different chunks in parallel?
So as you're discovering, text file formats are poor for dealing with large amounts of data; not only are they larger than binary formats, but you run into formatting problems like here (searching for newlines), and everything is much slower (data must be converted into strings). There can easily be a 10x difference in IO speeds between text-based formats and binary formats for numerical data. But we'll assume for now you're stuck with the text file format.
Presumably, you're doing this partitioning for speed. But unless you have a parallel filesystem -- that is, multiple servers serving from multiple disks, and a FS that can keep those coordinated -- it's unlikely you're going to get a significant speedup from having multiple MPI tasks reading from the same file, as ultimately these requests are all going to get serialized anyway at the server/controller/disk level.
Further, reading in large blocks of data is going to be much faster than fseek()ing around and doing small reads looking for newlines.
So my suggestion would be to have one process (perhaps the last) read all the data in as few chunks as it can and send the relevant lines to each task (including, finally, itself). If you know how many lines the file has at the start, this is fairly simple; read in say 2 GB of data, search through memory for the end of the N/Pth line, and send that to task 0, send task 0 a "completed your data" message, and continue.
You don't specify if there are any constraints on the partitions, so I'll assume there are none. I'll also assume that you want the partitions to be as close to equal in size as possible.
The naïve approach would be to split the file into chunks of size 20GB/n. The starting position of chunk i would be i*20GB/n for i=0..n-1.
The problem with that is, of course, that there's no guarantee that chunk boundaries would fall between the lines of the input file. In general, they won't.
Fortunately, there's an easy way to correct for this. Having established the boundaries as above, shift them slightly so that each of them (except i=0) is placed after the following newline.
That'll involve reading 15 small fragments of the file, but will result in a very even partition.
In fact, the correction can be done by each node individually, but it's probably not worth complicating the explanation with that.
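The boundary correction described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a tuned MPI implementation: aligned_boundaries and read_chunk are hypothetical helper names, and in a real MPI program each rank would pick the (start, end) pair matching its own rank instead of looping over all of them.

```python
import os


def aligned_boundaries(path, n):
    """Return n+1 byte offsets that split the file into n line-aligned chunks."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(i * size // n)   # naive boundary: i * size / n
            f.readline()            # move to just after the following newline
            bounds.append(f.tell())
    bounds.append(size)
    return bounds


def read_chunk(path, start, end):
    """Read the lines stored between two aligned byte offsets."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()
```

For n chunks this performs n-1 small reads (15 for 16 nodes), after which chunk i is the byte range from bounds[i] to bounds[i+1]. Two naive boundaries that fall inside the same line produce an empty chunk, which is harmless for a file with many short lines.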
I think it would be better to write a piece of code that gets the line lengths and distributes lines to processes. That distributing function would work not with the strings themselves, but only with their lengths.
Finding an algorithm for the even distribution of pieces of known size is not a problem.
After that, the distributing function tells the other processes which pieces they have to work on. Process 0 (the distributor) reads a line. It already knows that line 1 should be handled by process 1. ... Process 0 reads line N and knows which process has to work on it.
Oh! We needn't optimize the distribution from the start. The distributor process simply reads a new line from the input and gives it to a free process. That's all.
So you have two solutions: a heavily optimized one and an easy one.
We could optimize even further if the distributor process re-optimizes the distribution of the not-yet-read strings from time to time.
Here is a Python function that uses MPI through the pypar extension to count the lines in a big file, splitting the work among a number of hosts.
import logging
import mmap
import os

import pypar


def getFileLineCount(file1):
    """
    Uses pypar and MPI to speed up counting lines.
    file1 - the name of the file whose lines are counted
    Returns the line count.
    """
    p1 = open(file1, "r")
    f1 = mmap.mmap(p1.fileno(), 0, access=mmap.ACCESS_READ)
    # work out file size
    fSize = os.stat(file1).st_size
    # divide up to farm out line counting
    chunk = (fSize // pypar.size()) + 1
    lines = 0
    # set start and end locations
    seekStart = chunk * pypar.rank()
    seekEnd = chunk * (pypar.rank() + 1)
    if seekEnd > fSize:
        seekEnd = fSize
    # find start of next line after chunk
    if pypar.rank() > 0:
        f1.seek(seekStart)
        l1 = f1.readline()
        seekStart = f1.tell()
    # tell previous rank my seek start to make their seek end
    if pypar.rank() > 0:
        # logging.info('Sending to %d, seek start %d' % (pypar.rank() - 1, seekStart))
        pypar.send(seekStart, pypar.rank() - 1)
    if pypar.rank() < pypar.size() - 1:
        seekEnd = pypar.receive(pypar.rank() + 1)
        # logging.info('Receiving from %d, seek end %d' % (pypar.rank() + 1, seekEnd))
    f1.seek(seekStart)
    logging.info('Calculating line lengths and positions from file byte %d to %d' % (seekStart, seekEnd))
    l1 = f1.readline()
    while len(l1) > 0:
        lines += 1
        l1 = f1.readline()
        # stop once the line just read lies beyond this rank's chunk, or at end of file
        if f1.tell() > seekEnd or len(l1) == 0:
            break
    f1.close()
    p1.close()
    # rank 0 adds up the per-rank counts; every other rank sends its own count
    if pypar.rank() == 0:
        logging.info('Receiving line info')
        for p in range(1, pypar.size()):
            lines += pypar.receive(p)
    else:
        logging.info('Sending my line info')
        pypar.send(lines, 0)
    lines = pypar.broadcast(lines)
    return lines

Sorting and Balancing Across Multiple Columns

I have a Hash of data that looks something like this.
{ "GROUP_A" => [22, 440],
"GROUP_B" => [14, 70],
"GROUP_C" => [60, 620],
"GROUP_D" => [174, 40],
"GROUP_E" => [4, 12]
# ...few hundred more
}
GROUP_A has 22 accounts and they are using 440GB of data...and so on. There are a couple hundred of these groups. Some have a lot of accounts but use very little storage and some have only a few users and use A LOT of storage, some are just average.
I have X number of buckets (servers) that I want to put these groups of accounts into, and I want there to be approximately the same number of accounts per bucket and have each bucket also contain approximately the same amount of data. Number of groups is not important, so if a bucket had 1 group of 1000 accounts using 500GB of data and the next bucket had 10 groups of 97 accounts (970 total) using 450GB of data...I'd call it good.
So far I've not come up with an algorithm that will do this. In my mind I'm thinking of something like this perhaps?
Bucket 1: Group with largest data, 60 users.
Bucket 2: Next largest data group, 37 users.
Bucket 3: Next largest data group, 72 users.
Bucket 4: etc....
Bucket 1: Add a group with small amount of data, but more users than average.
# There's probably a ratio I can calculate to figure this out...divide users/data maybe?
Bucket 2: Find a "small data" group where sum of users in Bucket 1 ~= sum of users in Bucket 2
# But then there's no guarantee that the data usages will be close enough
Bucket 3: etc...
Bucket 1: Now what? Back to next largest data group?
I still think there's a better way to figure this out but it's not coming to me. If anyone has any thoughts I'm open to suggestions.
Solution 1.1 - Brute Force Update
Well....here's an update to the first attempt. This is still not a "knapsack-problem" solution. Just brute forcing the data so the accounts balance across buckets. This time I added some logic so that if a bucket has a higher full percentage of accounts vs. data...it will find the largest group (by data) that fits best based on number of accounts. I get a lot better distribution of data now vs. my first attempt (see the edit history if you want to look at the first attempt).
Right now I load each bucket in sequence, filling bucket one, then bucket two, etc... I think if I was to modify the code so that I filled them simultaneously (or nearly so) I'd get a better data balance.
e.g. 1st department into bucket 1, 2nd department into bucket 2, etc...until all buckets have one department... Then start back with bucket 1 again.
# Assumes dept_hsh, server_names, max_accts, avg_accts, max_size and avg_data
# are defined earlier, and that awesome_print (ap) is loaded.
dept_arr_sorted_by_acct = dept_hsh.sort_by { |key, value| value[0] }
ap "MAX ACCTS: #{max_accts} AVG ACCTS: #{avg_accts}"
ap "MAX SIZE: #{max_size} AVG SIZE: #{avg_data}"
# puts dept_arr_sorted_by_acct
# exit
bucket_arr = Array.new
used_hsh = Hash.new
server_names.each do |s|
  bucket_hsh = Hash.new
  # running totals for this bucket (assumed to start at zero)
  this_accts = 0
  this_data = 0
  accts_space_pct_used = 0
  data_space_pct_used = 0
  while this_accts < avg_accts
    # declared here so the values set inside the blocks are visible afterwards
    my_key = my_val = nil
    accts = data = 0
    if accts_space_pct_used <= data_space_pct_used
      # This loop runs if the % used of accts is less than % used of data
      dept_arr_sorted_by_acct.each do |val|
        # Sorted by num accts - ascending. Loop until we find the last entry in the array that has <= accts than what we need
        next if used_hsh.has_key?(val[0])
        if val[1][0] <= avg_accts - this_accts
          my_key = val[0]
          my_val = val[1]
          accts = val[1][0]
          data = val[1][1]
        end
      end
    else
      # This loop runs if the % used of data is less than % used of accts
      dept_arr_sorted_by_data = dept_arr_sorted_by_acct.sort { |a, b| b[1][1] <=> a[1][1] }
      dept_arr_sorted_by_data.each do |val|
        # Sorted by size - descending. Find the first (largest data) entry where accts <= what we need
        next if used_hsh.has_key?(val[0])
        if val[1][0] <= avg_accts - this_accts
          my_key = val[0]
          my_val = val[1]
          accts = val[1][0]
          data = val[1][1]
          break # assumed: stop at the first entry that fits
        end
      end
    end
    break if my_key.nil? # assumed guard: no remaining group fits this bucket
    used_hsh[my_key] = my_val
    bucket_hsh[my_key] = my_val
    this_accts = this_accts + accts
    this_data = this_data + data
    accts_space_pct_used = this_accts.to_f / avg_accts * 100
    data_space_pct_used = this_data.to_f / avg_data * 100
  end
  bucket_arr << [this_accts, this_data, bucket_hsh]
end
x = 0
while x < bucket_arr.size do
  th = bucket_arr[x][2]
  list_of_depts = []
  th.each_key do |key|
    list_of_depts << key
  end
  ap "Bucket #{x}: #{bucket_arr[x][0]} accounts :: #{bucket_arr[x][1]} data :: #{list_of_depts.size} departments"
  #ap list_of_depts
  x = x + 1
end
...and the results...
"MAX ACCTS: 2279 AVG ACCTS: 379"
"MAX SIZE: 1693315 AVG SIZE: 282219"
"Bucket 0: 379 accounts :: 251670 data :: 7 departments"
"Bucket 1: 379 accounts :: 286747 data :: 10 departments"
"Bucket 2: 379 accounts :: 278226 data :: 14 departments"
"Bucket 3: 379 accounts :: 281292 data :: 19 departments"
"Bucket 4: 379 accounts :: 293777 data :: 28 departments"
"Bucket 5: 379 accounts :: 298675 data :: 78 departments"
(379 * 6 <> 2279) I still need to figure out how to account for when the MAX_ACCTS are not evenly divisible by the number of buckets. I tried adding a 1% pad to the AVG_ACCTS value, which in this case means the average would be 383 I think, but then all the buckets say they have 383 accounts in them...which can't be true because then there are more accounts in the buckets than MAX_ACCTS. I've got a mistake in the code somewhere that I haven't found yet.
This is an example of the knapsack problem. There are a few solutions, but it's a really tricky problem and it's better to research a good solution than to try and make your own.
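For comparison, here is a small greedy heuristic for the same balancing goal, written in Python. It is neither an exact knapsack or partitioning solver nor the Ruby code posted above: the function name balance and the scoring rule (place the largest groups first, each into the bucket whose fuller dimension stays lowest) are assumptions made for this sketch. The sample data is the five-group Hash from the question.

```python
def balance(groups, num_buckets):
    """Greedily spread {name: (accounts, data)} over num_buckets buckets."""
    total_accts = sum(a for a, _ in groups.values()) or 1
    total_data = sum(d for _, d in groups.values()) or 1
    buckets = [{"accounts": 0, "data": 0, "groups": []} for _ in range(num_buckets)]

    def share(item):
        # Combined size of a group as a fraction of both totals.
        accts, data = item[1]
        return accts / total_accts + data / total_data

    # Largest groups first, so the small ones can even things out at the end.
    for name, (accts, data) in sorted(groups.items(), key=share, reverse=True):
        # Choose the bucket whose worse-of-the-two load stays lowest after the add.
        target = min(
            buckets,
            key=lambda b: max((b["accounts"] + accts) / total_accts,
                              (b["data"] + data) / total_data),
        )
        target["accounts"] += accts
        target["data"] += data
        target["groups"].append(name)
    return buckets


groups = {"GROUP_A": (22, 440), "GROUP_B": (14, 70), "GROUP_C": (60, 620),
          "GROUP_D": (174, 40), "GROUP_E": (4, 12)}
for i, bucket in enumerate(balance(groups, 2)):
    print(i, bucket["accounts"], bucket["data"], bucket["groups"])
```

Because every group is placed exactly once, the totals always add up, which avoids the rounding issue with AVG_ACCTS described above. The balance is only approximate, and a proper solver may do better.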
