How to add multiple columns in Apache Spark - hadoop

Here is my input data: four columns, with a space as the delimiter. I want to add the second and third columns and print the result.
sachin 200 10 2
sachin 900 20 2
sachin 500 30 3
Raju 400 40 4
Mike 100 50 5
Raju 50 60 6
My code so far:
from pyspark import SparkContext
sc = SparkContext()
def getLineInfo(lines):
    spLine = lines.split(' ')
    name = str(spLine[0])
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    cash3 = int(spLine[3])
    return (name,cash,cash2)
myFile = sc.textFile(r"D:\PYSK\cash.txt")
rdd = myFile.map(getLineInfo)
print(rdd.collect())
From here I got the result as
[('sachin', 200, 10), ('sachin', 900, 20), ('sachin', 500, 30), ('Raju', 400, 40), ('Mike', 100, 50), ('Raju', 50, 60)]
The final result I need is shown below: the 2nd and 3rd columns added together, followed by the remaining field.
sachin 210 2
sachin 920 2
sachin 530 3
Raju 440 4
Mike 150 5
Raju 110 6

Use this:
def getLineInfo(lines):
    spLine = lines.split(' ')
    name = str(spLine[0])
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    cash3 = int(spLine[3])
    return (name, cash + cash2, cash3)
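As a quick check that does not need a Spark cluster, the same per-line function can be run over the sample rows in plain Python. This is only a sketch: the name get_line_info and the in-memory list sample_lines are illustrative stand-ins for the function and the text file from the question.

```python
def get_line_info(line):
    # Split one space-delimited record into its four fields.
    name, cash, cash2, cash3 = line.split()
    # Add the 2nd and 3rd columns, keep the 1st and 4th as they are.
    return (name, int(cash) + int(cash2), int(cash3))

# Sample rows copied from the question (stand-in for cash.txt).
sample_lines = [
    "sachin 200 10 2",
    "sachin 900 20 2",
    "sachin 500 30 3",
    "Raju 400 40 4",
    "Mike 100 50 5",
    "Raju 50 60 6",
]

rows = [get_line_info(line) for line in sample_lines]
for name, total, last in rows:
    print(name, total, last)
```

With Spark, the same function would be passed to map on the RDD, as in the question's myFile.map(getLineInfo).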

Related

How can I create a specific time interval in Ruby?

What I have tried so far ...
start_hour = 7
start_minute = 0 * 0.01
end_hour = 17
end_minute = 45 * 0.01
step_time = 25
start_time = start_hour + start_minute
end_time = end_hour + end_minute
if step_time > 59
  step_time = 1 if step_time == 60
  step_time = 1.3 if step_time == 90
  step_time = 2 if step_time == 120
else
  step_time *= 0.01
end
hours = []
(start_time..end_time).step(step_time).map do |x|
  next if (x-x.to_i) > 0.55
  hours << '%0.2f' % x.round(2).to_s
end
puts hours
With a step interval of 0, 5, 10, or 20 I get the times I want. But with 15, 25, or 90 I don't get the right range.
You currently have:
end_hour = 17
end_minute = 45 * 0.01
end_time = end_hour + end_minute
#=> 17.45
Although 17.45 looks like the correct value, it isn't. 45 minutes is 3 quarters (or 75%) of an hour, so the correct decimal value is 17.75.
You could change your code accordingly, but working with decimal hours is a bit strange. It's much easier to just work with minutes. Instead of turning the minutes into hours, you turn the hours into minutes:
start_hour = 7
start_minute = 0
start_time = start_hour * 60 + start_minute
#=> 420
end_hour = 17
end_minute = 45
end_time = end_hour * 60 + end_minute
#=> 1065
The total number of minutes can easily be converted back to hour-minute pairs via divmod:
420.divmod(60) #=> [7, 0]
1065.divmod(60) #=> [17, 45]
Using the above, we can traverse the range without having to convert the step interval:
def hours(start_time, end_time, step_time)
  (start_time..end_time).step(step_time).map do |x|
    '%02d:%02d' % x.divmod(60)
  end
end
hours(start_time, end_time, 25)
#=> ["07:00", "07:25", "07:50", "08:15", "08:40", "09:05", "09:30", "09:55",
# "10:20", "10:45", "11:10", "11:35", "12:00", "12:25", "12:50", "13:15",
# "13:40", "14:05", "14:30", "14:55", "15:20", "15:45", "16:10", "16:35",
# "17:00", "17:25"]
hours(start_time, end_time, 90)
#=> ["07:00", "08:30", "10:00", "11:30", "13:00", "14:30", "16:00", "17:30"]
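For comparison, the minutes-based approach carries over to other languages almost unchanged. The following is a rough Python sketch of the same idea, not a translation of any particular library; the function name hours simply mirrors the Ruby method above.

```python
def hours(start_time, end_time, step_time):
    # start_time and end_time are minutes since midnight; the end is inclusive.
    # divmod(m, 60) turns total minutes back into an (hour, minute) pair.
    return ["%02d:%02d" % divmod(m, 60)
            for m in range(start_time, end_time + 1, step_time)]

start_time = 7 * 60 + 0    # 07:00 as minutes
end_time = 17 * 60 + 45    # 17:45 as minutes

print(hours(start_time, end_time, 90))
```

Because everything stays in whole minutes, any positive integer step works without special cases for 60, 90, or 120.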

How to write a program in gwbasic for adding the natural numbers for 1 to 100?

I am trying to write a program for adding the natural numbers from 1 to n (1 + 2 + 3 + ... + n). However, the sum comes out as 1 when I use an if statement, and when I use a for-next statement I get a syntax error that I don't understand.
if:
30 let s = 0
40 let i = 1
50 s = s + i
60 i = i + 1
70 if i<=n, then goto 50
80 print s
for-next:
30 let i, s
40 s = 0
50 for i = 1 to n
60 s = s + i
70 next i
80 print n
With n = 10, the if version prints 1, but it should print 55.
The for-next version prints no result and reports a syntax error in line 30.
Why is this happening?
The following code works in this online Basic interpreter.
10 let n = 100
30 let s = 0
40 let i = 1
50 s = s + i
60 i = i + 1
70 if i <= n then goto 50 endif
80 print s
I initialised n on the line labelled 10, removed the comma on the line labelled 70 and added an endif on the same line.
This is the for-next version:
30 let n = 100
40 let s = 0
50 for i = 1 to n
60 s = s + i
70 next i
80 print s
(btw, the sum of the first n natural numbers is n(n+1)/2:
10 let n = 100
20 let s = n * (n + 1) / 2
30 print s
)
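To see that the loop and the closed form n(n+1)/2 agree, here is a small Python sketch of both; the function names sum_loop and sum_formula are made up for illustration.

```python
def sum_loop(n):
    # Same logic as the for-next version: accumulate 1 + 2 + ... + n.
    s = 0
    for i in range(1, n + 1):
        s = s + i
    return s

def sum_formula(n):
    # Closed form for the sum of the first n natural numbers.
    return n * (n + 1) // 2

print(sum_loop(10), sum_formula(10))
print(sum_loop(100), sum_formula(100))
```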
Why is this happening? Where am I going wrong?
30 let s = 0
40 let i = 1
50 s = s + i
60 i = i + 1
70 if i<=n, then goto 50
80 print s
Fix #1: Initialize variable 'n':
20 let n = 10
Fix #2: Remove comma from line 70:
70 if i<=n then goto 50
30 let i, s
40 s = 0
50 for i = 1 to n
60 s = s + i
70 next i
80 print n
Fix #1: Replace the invalid 'let i, s' on line 30 (the source of the syntax error) with an initialization of 'n':
30 let n = 10
Fix #2: Print 's' instead of 'n':
80 print s
10 cls
20 let x=1
30 for x=1 to 100
40 print x
50 next x
60 end

D3 Filter Issue

I am trying to filter my data list using D3. What I am trying to do is filter my data based on date I specify and threshold value for precipitation.
Here is my code:
$(function() {
  $("#datepicker").datepicker();
  $("#datepicker").on("change",function(){
    //var currentDate = $( "#datepicker" ).datepicker( "getDate" )/1000;
    //console.log(currentDate)
  });
});
function GenerateReport() {
  d3.csv("/DataTest.csv", function(data) {
    var startdate = $( "#datepicker" ).datepicker( "getDate" )/1000;
    var enddate = startdate + 24*60*60
    var data_Date = d3.values(data.filter(function(d) { return d["Date"] >= startdate && d["Date"] <= enddate} ))
    var x = document.getElementById("threshold").value
    console.log(data_Date)
    var data_Date_Threshold = data_Date.filter(function(d) {return d.Precipitation > x});
My data set looks like
ID Date Prcip Flow Stage
1010 1522281000 0 0 0
1010 1522281600 0 0 0
1010 1522285200 10 0 0
1010 1522303200 12 200 1.2
1010 1522364400 6 300 2
1010 1522371600 4 400 2.5
1010 1522364400 6 500 2.8
1010 1522371600 4 600 3.5
2120 1522281000 0 0 0
2120 1522281600 0 0 0
2120 1522285200 10 100 1
2120 1522303200 12 1000 2
2120 1522364400 6 2000 3
2120 1522371600 4 2500 3.2
2290 1522281000 0 0 0
2290 1522281600 4 0 0
2290 1522285200 5 200 1
2290 1522303200 10 800 1.5
2290 1522364400 6 1500 3
2290 1522371600 0 1000 2
6440 1522281000 0 0 0
6440 1522281600 4 0 0
6440 1522285200 5 200 0.5
6440 1522303200 10 800 1
6440 1522364400 6 1500 2
6440 1522371600 0 100 1.4
When I use the filter function, I run into a problem.
With x = 2 as the precipitation threshold, the filter does not catch precipitation = 10 or 12. With x = 1, it works fine. My guess is that only the first digit is compared (e.g., with x = 2, precipitation values of 10 and 12 are treated as less than 2 because only the leading 1 is considered). Has anyone had the same issue? Can anyone help me solve this problem?
Thanks.
You are comparing strings. This comparison is therefore done lexicographically.
In order to accomplish what you want, you need to first convert these strings to numbers:
var x = Number(document.getElementById("threshold").value)
var data_Date_Threshold = data_Date.filter(function(d) {return Number(d.Precipitation) > x});
Alternatively, parse them as floats:
var x = parseFloat(document.getElementById("threshold").value)
var data_Date_Threshold = data_Date.filter(function(d) {return parseFloat(d.Precipitation) > x});
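The effect is easy to reproduce outside the browser. Python also compares strings lexicographically, so this small sketch shows the same symptom and the same cure; the rows list and the threshold value are made-up sample data, not the asker's CSV.

```python
# Values as they would arrive from a CSV parser: all strings.
rows = [{"Precipitation": "10"}, {"Precipitation": "12"},
        {"Precipitation": "6"}, {"Precipitation": "0"}]
threshold = "2"  # a form field value is a string too

# String comparison: "10" > "2" is False because "1" sorts before "2".
as_strings = [r["Precipitation"] for r in rows
              if r["Precipitation"] > threshold]

# Numeric comparison after converting both sides.
as_numbers = [r["Precipitation"] for r in rows
              if float(r["Precipitation"]) > float(threshold)]

print(as_strings)
print(as_numbers)
```

Only the numeric version keeps 10 and 12, which matches the behaviour described in the question.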

Converting SFrames into input dataset Sframes

I have a pretty bad way to convert my input logs to the input dataset.
I have an SFrame sf with the following format:
user_id int
timestamp datetime.datetime
action int
reasoncode str
The action column takes one of 9 values, ranging from 1 to 9.
So every user_id can perform more than one action, each more than once.
I am trying to obtain all unique user_id values from sf and create an op_sf in the following manner:
y = 225
def calc_class(a,x):
    # dte is a reference date defined elsewhere (not shown in the question)
    diffd = a['timestamp'].apply(lambda x: (dte - x).days)
    g = 0
    b = 0
    for i in diffd:
        if i > y:
            g += 1
        else:
            b += 1
    if b>= x:
        return 4
    elif b!= 0:
        return 3
    elif g>= 0:
        return 2
    else:
        return 1
l1 = []
ids = sf['user_id'].unique()
for idd in ids:
    temp = sf[sf['user_id']== idd]
    zero1 = temp[temp['action'] == 1]
    zero2 = temp[temp['action'] == 2]
    zero3 = temp[temp['action'] == 3]
    zero4 = temp[temp['action'] == 4]
    zero5 = temp[temp['action'] == 5]
    zero6 = temp[temp['action'] == 6]
    zero7 = temp[temp['action'] == 7]
    zero8 = temp[temp['reasoncode'] == 'xyz']
    zero9 = temp[temp['reasoncode'] == 'abc']
    # I'm getting clas1 to clas9 from function calc_class for each action
    # clas1 to clas9 are integers ranging from 1 to 4
    clas1 = calc_class(zero1,2)
    clas2 = calc_class(zero2,2)
    clas3 = calc_class(zero3,2)
    clas4 = calc_class(zero4,2)
    clas5 = calc_class(zero5,2)
    clas6 = calc_class(zero6,2)
    clas7 = calc_class(zero7,2)
    clas8 = calc_class(zero8,2)
    clas9 = calc_class(zero9,2)
    l1.append([idd,clas1,clas2,clas3,clas4,clas5*(-1),clas6*(-1),clas7*(-1),clas8*(-1),clas9])
I would like to know whether this is the fastest way of doing this. Specifically, is it possible to do the same thing without generating the zero1 to zero9 SFrames?
An example sf:
user_id timestamp action reasoncode
574 23/09/15 12:43 1 None
574 23/09/15 11:15 2 None
574 06/10/15 11:20 2 None
574 06/10/15 11:21 3 None
588 04/11/15 10:00 1 None
588 05/11/15 10:00 1 None
555 15/12/15 13:00 1 None
585 22/12/15 17:30 1 None
585 15/01/16 07:44 7 xyz
588 06/01/16 08:10 7 abc
l1 corresponding to the above sf:
574 1 2 2 0 0 0 0 0 0
588 3 0 0 0 0 0 0 0 3
555 3 0 0 0 0 0 0 0 0
585 3 0 0 0 0 0 0 3 0
I think your logic is relatively complex, but it's still more efficient to use column-wise operations on the whole dataset, rather than extracting the subset of rows for each user. The key tools are SFrame.groupby, SFrame.apply, SFrame.unstack, and SFrame.unpack. API docs here:
https://dato.com/products/create/docs/generated/graphlab.SFrame.html
Here's a solution that uses slightly simpler data than your example and slightly simpler logic to code the old vs. new actions.
# Set up and make the data
import graphlab as gl
import datetime as dt
sf = gl.SFrame({'user': [574, 574, 574, 588, 588, 588],
                'timestamp': [dt.datetime(2015, 9, 23), dt.datetime(2015, 9, 23),
                              dt.datetime(2015, 10, 6), dt.datetime(2015, 11, 4),
                              dt.datetime(2015, 11, 5), dt.datetime(2016, 1, 6)],
                'action': [1, 2, 3, 1, 1, 7]})
# Count old vs. new actions.
sf['days_elapsed'] = (dt.datetime.today() - sf['timestamp']) / (3600 * 24)
sf['old_threshold'] = sf['days_elapsed'] > 225
aggregator = {'total_count': gl.aggregate.COUNT('user'),
'old_count': gl.aggregate.SUM('old_threshold')}
grp = sf.groupby(['user', 'action'], aggregator)
# Code the actions according to old vs. new. Use your own logic here.
grp['action_code'] = grp.apply(
    lambda x: 2 if x['total_count'] > x['old_count'] else 1)
grp = grp[['user', 'action', 'action_code']]
# Reshape the results into columns.
sf_new = (grp.unstack(['action', 'action_code'], new_column_name='action_code')
             .unpack('action_code'))
# Fill in zeros for entries with no actions.
for c in sf_new.column_names():
    sf_new[c] = sf_new[c].fillna(0)
print sf_new
+------+---------------+---------------+---------------+---------------+
| user | action_code.1 | action_code.2 | action_code.3 | action_code.7 |
+------+---------------+---------------+---------------+---------------+
| 588 | 2 | 0 | 0 | 2 |
| 574 | 1 | 1 | 1 | 0 |
+------+---------------+---------------+---------------+---------------+
[2 rows x 5 columns]
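The group-then-pivot idea behind that answer can be sketched without the SFrame API, using only the standard library. Everything here is illustrative: code_actions is a hypothetical helper, the rows are the sample data from the answer, and the fixed reference date 2016-06-01 is an assumption chosen so that the result is reproducible (the answer itself used the current date).

```python
from collections import defaultdict
import datetime as dt

def code_actions(rows, today, old_days=225):
    # Count total and "old" occurrences per (user, action) pair.
    counts = defaultdict(lambda: [0, 0])  # (user, action) -> [total, old]
    for user, ts, action in rows:
        c = counts[(user, action)]
        c[0] += 1
        if (today - ts).days > old_days:
            c[1] += 1
    # Pivot into one row per user, with 0 for actions the user never took.
    actions = sorted({action for _, action in counts})
    table = {}
    for (user, action), (total, old) in counts.items():
        row = table.setdefault(user, {a: 0 for a in actions})
        # Same simplified coding as the answer: 2 if any action is recent, else 1.
        row[action] = 2 if total > old else 1
    return table

rows = [(574, dt.datetime(2015, 9, 23), 1), (574, dt.datetime(2015, 9, 23), 2),
        (574, dt.datetime(2015, 10, 6), 3), (588, dt.datetime(2015, 11, 4), 1),
        (588, dt.datetime(2015, 11, 5), 1), (588, dt.datetime(2016, 1, 6), 7)]

# Assumed reference date, fixed for reproducibility.
table = code_actions(rows, today=dt.datetime(2016, 6, 1))
for user, codes in sorted(table.items()):
    print(user, codes)
```

The point is the same as in the SFrame version: one pass over all rows replaces the per-user, per-action subsetting from the question.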

Processing Chromosomal Data in Ruby

Say I have a file of chromosomal data I'm processing with Ruby,
#Base_ID Segment_ID Read_Depth
1 100
2 800
3 seg1 1900
4 seg1 2700
5 1600
6 2400
7 200
8 15000
9 seg2 300
10 seg2 400
11 seg2 900
12 1000
13 600
...
I'm sticking each row into a hash of arrays, with my keys taken from column 2, Segment_ID, and my values from column 3, Read_Depth, giving me
mr_hashy = {
"seg1" => [1900, 2700],
"" => [100, 800, 1600, 2400, 200, 15000, 1000, 600],
"seg2" => [300, 400, 900],
}
A primer, which is a small segment that consists of two consecutive rows in the above data, precedes and follows each regular segment. Regular segments have a non-empty-string value for Segment_ID and vary in length, while rows with an empty string in the second column are parts of primers. Primer segments always have the same length, 2. In the data above, Base_IDs 1, 2, 5, 6, 7, 8, 12, and 13 are parts of primers. In total, there are four primer segments present in the above data.
What I'd like to do is, upon encountering a line with an empty string in column 2, Segment_ID, add the READ_DEPTH to the appropriate element in my hash. For instance, my desired result from above would look like
mr_hashy = {
"seg1" => [100, 800, 1900, 2700, 1600, 2400],
"seg2" => [200, 15000, 300, 400, 900, 1000, 600],
}
hash = Hash.new{|h,k| h[k]=[] }
# Throw away the first (header) row
rows = DATA.read.scan(/.+/)[1..-1].map do |row|
  # Throw away the first (entire row) match
  row.match(/(\d+)\s+(\w+)?\s+(\d+)/).to_a[1..-1]
end
last_segment = nil
last_valid_segment = nil
rows.each do |base,segment,depth|
  if segment && !last_segment
    # Put the last two values onto the front of this segment
    hash[segment].unshift( *hash[nil][-2..-1] )
    # Put the first two values onto the end of the last segment
    hash[last_valid_segment].concat(hash[nil][0,2]) if last_valid_segment
    hash[nil] = []
  end
  hash[segment] << depth
  last_segment = segment
  last_valid_segment = segment if segment
end
# Put the first two values onto the end of the last segment
hash[last_valid_segment].concat(hash[nil][0,2]) if last_valid_segment
hash.delete(nil)
require 'pp'
pp hash
#=> {"seg1"=>["100", "800", "1900", "2700", "1600", "2400"],
#=> "seg2"=>["200", "15000", "300", "400", "900", "1000", "600"]}
__END__
#Base_ID Segment_ID Read_Depth
1 100
2 800
3 seg1 1900
4 seg1 2700
5 1600
6 2400
7 200
8 15000
9 seg2 300
10 seg2 400
11 seg2 900
12 1000
13 600
Second-ish refactor. I think this is clean, elegant, and most of all complete. It's easy to read with no hardcoded field lengths or ugly RegEx. I vote mine as the best! Yay! I'm the best, yay! ;)
def parse_chromo(file_name)
  last_segment = ""
  segments = Hash.new {|segments, key| segments[key] = []}
  IO.foreach(file_name) do |line|
    next if !line || line[0] == "#"
    values = line.split
    if values.length == 3 && last_segment != (segment_id = values[1])
      segments[segment_id] += segments[last_segment].pop(2)
      last_segment = segment_id
    end
    segments[last_segment] << values.last
  end
  segments.delete("")
  segments
end
puts parse_chromo("./chromo.data")
I used this as my data file:
#Base_ID Segment_ID Read_Depth
1 101
2 102
3 seg1 103
4 seg1 104
5 105
6 106
7 201
8 202
9 seg2 203
10 seg2 204
11 205
12 206
13 207
14 208
15 209
16 210
17 211
18 212
19 301
20 302
21 seg3 303
21 seg3 304
21 305
21 306
21 307
Which outputs:
{
"seg1"=>["101", "102", "103", "104", "105", "106"],
"seg2"=>["201", "202", "203", "204", "205", "206", "207", "208", "209", "210", "211", "212"],
"seg3"=>["301", "302", "303", "304", "305", "306", "307"]
}
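The same pop-the-last-two-rows approach can be sketched in Python. This is a loose port, with two assumptions of its own: it takes an iterable of lines rather than a file name, and it converts read depths to integers. The sample data is the file from the question.

```python
from collections import defaultdict

def parse_chromo(lines):
    segments = defaultdict(list)
    last_segment = ""  # rows before the first segment collect under ""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        values = line.split()
        # Three fields means the row names a segment; two means a primer row.
        if len(values) == 3 and values[1] != last_segment:
            segment_id = values[1]
            previous = segments[last_segment]
            # The last two rows seen so far are this segment's leading primer.
            segments[segment_id].extend(previous[-2:])
            del previous[-2:]
            last_segment = segment_id
        segments[last_segment].append(int(values[-1]))
    segments.pop("", None)  # drop the placeholder bucket
    return dict(segments)

data = """#Base_ID Segment_ID Read_Depth
1 100
2 800
3 seg1 1900
4 seg1 2700
5 1600
6 2400
7 200
8 15000
9 seg2 300
10 seg2 400
11 seg2 900
12 1000
13 600
"""

result = parse_chromo(data.splitlines())
print(result)
```

Like the Ruby version, it relies on whitespace splitting, so it only needs primer rows to have an empty second column, not fixed-width fields.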
Here's some Ruby code (nice practice example :P). I'm assuming fixed-width columns, which appears to be the case with your input data. The code keeps track of which depth values are primer values until it finds 4 of them, after which it will know the segment id.
require 'pp'
mr_hashy = {}
primer_segment = nil
primer_values = []
while true
  line = gets
  if not line
    break
  end
  base, segment, depth = line[0..11].rstrip, line[12..27].rstrip, line[28..-1].rstrip
  primer_values.push(depth)
  if segment.chomp == ''
    if primer_values.length == 6
      for value in primer_values
        (mr_hashy[primer_segment] ||= []).push(value)
      end
      primer_values = []
      primer_segment = nil
    end
  else
    primer_segment = segment
  end
end
PP::pp(mr_hashy)
Output on input provided:
{"seg1"=>["100", "800", "1900", "2700", "1600", "2400"],
"seg2"=>["200", "15000", "300", "400", "900", "1000"]}
