Merge CSV rows by ID in Ruby

I have a .csv file that, for simplicity, has two fields: ID and comments. IDs are duplicated across rows wherever a comment hit the maximum character length of the table it was generated from and another row was necessary. I now need to merge the associated comments together so that there is one row for each unique ID, using Ruby.
To illustrate, I'm trying in Ruby to turn this:
ID | COMMENT
1 | fragment 1
1 | fragment 2
2 | fragment 1
3 | fragment 1
3 | fragment 2
3 | fragment 3
into this:
ID | COMMENT
1 | fragment 1 fragment 2
2 | fragment 1
3 | fragment 1 fragment 2 fragment 3
I've come close to finding a way to do this using inject({}) and a hash, but I'm still working on getting all the data merged correctly. Meanwhile, my code seems to be getting too complicated, with multiple hashes and arrays just to merge selected rows.
What's the best/simplest way to achieve this type of row merge? Could it be done with just arrays?
Would appreciate advice on how one would normally do this in Ruby.

Keep the headers and use group_by on ID:
require 'csv'
rows = CSV.read 'comment.csv', :headers => true
rows.group_by{|row| row['ID']}.values.each do |group|
  puts [group.first['ID'], group.map{|r| r['COMMENT']} * ' '] * ' | '
end
You can use the indexes 0 and 1 instead, but I think it's clearer to use the header field names.
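A minimal self-contained sketch of the same group_by approach, assuming the data is comma-separated with an ID,COMMENT header row. The sample string and the merge_rows helper are made up for illustration; a real script would read the file instead. It builds the merged rows as CSV text rather than printing them with a custom separator:

```ruby
require 'csv'

# Group rows by ID and join each group's comments with a single space.
# Returns the merged table as CSV text, header row included.
def merge_rows(csv_text)
  rows = CSV.parse(csv_text, headers: true)
  CSV.generate do |out|
    out << rows.headers
    rows.group_by { |row| row['ID'] }.each do |id, group|
      out << [id, group.map { |r| r['COMMENT'] }.join(' ')]
    end
  end
end

# Hypothetical sample input mirroring the table in the question.
input = "ID,COMMENT\n1,fragment 1\n1,fragment 2\n2,fragment 1\n" \
        "3,fragment 1\n3,fragment 2\n3,fragment 3\n"
puts merge_rows(input)
```

Because group_by keeps keys in first-seen order, the merged rows come out in the same ID order as the input.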

With the following csv file, tmp.csv
1,fragment 11
1,fragment 21
2,fragment 21
2,fragment 22
3,fragment 31
3,fragment 32
3,fragment 33
Try this (demonstrated using irb):
irb> require 'csv'
=> true
irb> h = Hash.new
=> {}
irb> CSV.foreach("tmp.csv") {|r| h[r[0]] = h.key?(r[0]) ? h[r[0]] + r[1] : r[1]}
=> nil
irb> h
=> {"1"=>"fragment 11fragment 21", "2"=>"fragment 21fragment 22", "3"=>"fragment 31fragment 32fragment 33"}
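Note that the irb session above concatenates fragments with no separator between them. A possible variant, assuming the same headerless two-column layout as tmp.csv, collects the fragments for each ID in an array and joins them with a space. The name merge_fragments is made up for this sketch, and the data is parsed from a string so the example runs on its own:

```ruby
require 'csv'

# Collect the second column per ID (first column), then join with spaces.
def merge_fragments(csv_text)
  merged = Hash.new { |hash, key| hash[key] = [] }
  CSV.parse(csv_text) { |id, fragment| merged[id] << fragment }
  merged.transform_values { |fragments| fragments.join(' ') }
end

# Hypothetical sample data in the same shape as tmp.csv.
data = "1,fragment 11\n1,fragment 21\n2,fragment 21\n"
p merge_fragments(data)
```

With a file on disk, CSV.foreach could take the place of CSV.parse here, as in the session above.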

Related

Reshape data in pig - change row values to column names

Is there a way to reshape the data in pig?
The data looks like this -
id | p1 | count
1 | "Accessory" | 3
1 | "clothing" | 2
2 | "Books" | 1
I want to reshape the data so that the output would look like this--
id | Accessory | clothing | Books
1 | 3 | 2 | 0
2 | 0 | 0 | 1
Can anyone please suggest some way around?
If it's a fixed set of product lines, the code below might help; otherwise you can write a custom UDF to achieve the same result.
Input : a.csv
1|Accessory|3
1|Clothing|2
2|Books|1
Pig Snippet :
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long,product_name:chararray,rec_cnt:long);
req_stats = FOREACH (GROUP test BY product_id) {
    accessory = FILTER test BY product_name=='Accessory';
    clothing = FILTER test BY product_name=='Clothing';
    books = FILTER test BY product_name=='Books';
    GENERATE group AS product_id,
        (IsEmpty(accessory) ? '0' : BagToString(accessory.rec_cnt)) AS a_cnt,
        (IsEmpty(clothing) ? '0' : BagToString(clothing.rec_cnt)) AS c_cnt,
        (IsEmpty(books) ? '0' : BagToString(books.rec_cnt)) AS b_cnt;
};
DUMP req_stats;
Output :
(1,3,2,0)
(2,0,0,1)

Validating against a variable number of columns in Spark

I have a set of codes indicating the stages a person has been through, laid out horizontally in my data as shown below.
Name code1 code2 code3 code4
A 2 3 4 Null
B 2 5 4 7
C 1 3 4 5
D 0 9 Null Null
I have another file which has all the valid codes.
ID Value
1 3
2 4
3 5
4 6
5 7
What I would like to do is validate all the columns cell by cell against this lookup and indicate 0 if they are valid and null if they are not valid.
I'm using Apache Spark 1.5.2 and would like to do this efficiently. I've tried a bunch of combinations, and the closest I've come is to concat the cells, explode the result into a normalized table, and then perform the lookups.
You can do this very simply with a single pass through the data, without any joins or explode, by code-generating a validation expression:
// Simulate the data
case class Record(Name: String, code1: Option[Int], code2: Option[Int])
val dfData = sc.parallelize(Seq(
  Record("A", Some(3), Some(4)),
  Record("B", Some(3), None)
)).toDF.registerTempTable("my_data")
// Simulate the lookup table
val dfLookup = sc.parallelize(Seq((1,3), (2,4))).toDF("ID", "Value")
// Build a validation expression
val validationExpression = dfLookup.collect.map{ row =>
  s"code${row.getInt(0)} = ${row.getInt(1)}"
}.mkString(" and ")
// Add an is_valid column to the data
sql(s"select *, nvl($validationExpression, false) as is_valid from my_data").show
This produces:
defined class Record
dfData: Unit = ()
dfLookup: org.apache.spark.sql.DataFrame = [ID: int, Value: int]
validationExpression: String = code1 = 3 and code2 = 4
+----+-----+-----+--------+
|Name|code1|code2|is_valid|
+----+-----+-----+--------+
| A| 3| 4| true|
| B| 3| null| false|
+----+-----+-----+--------+

Number of string value occurrences for distinct another column value

I have a model Counter which returns the following records:
name.....flowers.....counter
vino.....rose.........1
vino.....lily.........1
gaya.....rose.........1
rosi.....lily.........1
vino.....lily.........1
rosi.....rose.........1
rosi.....rose.........1
I want to display in the table like:
name | Rose | Lily |
---------------------
Vino | 1 | 2 |
---------------------
Gaya | 1 | 0 |
---------------------
Rosi | 2 | 1 |
I want to display the count of flowers for each distinct name. I have tried the following and wondering how can I do it elegantly?
def counter_results
  @counter_results = {}
  Counter.each do |name|
    rose = Counter.where(flower: 'rose').count
    lily = Counter.where(flower: 'lily').count
    @counter_results['name'] = name
    @counter_results['rose_count'] = rose
    @counter_results['lily_count'] = lily
  end
  return @counter_results
end
but I don't get the hash values.
This will give you slightly different output, but I think it is probably closer to what you want than what you showed.
You can use the query:
Counter.group([:name, :flowers]).sum(:counter)
To get a result set that looks like:
{ ["vino", "rose"] => 1, ["vino", "lily"] => 2, ["gaya", "rose"] => 1, ["rosi", "rose"] => 2, ... }
Combinations with no rows, such as gaya and lily, do not appear in the result.
And you can do something like this to generate your hash:
def counter_results
  @counter_results = {}
  Counter.group([:name, :flowers]).sum(:counter).each do |k, v|
    @counter_results[k.join("_")] = v
  end
  @counter_results
end
The resulting hash would look like this:
{
  "vino_rose" => 1,
  "vino_lily" => 2,
  "gaya_rose" => 1,
  "rosi_rose" => 2,
  ...
}
Somebody else may have a better way to do it, but seems like that should get you pretty close.
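To get from that flat result to the name-by-flower table in the question, one option is to pivot it in plain Ruby. The sketch below assumes the grouped sums arrive as a hash keyed by [name, flower] pairs, as in the result set shown above. The sums literal and the pivot_counts helper are illustrative stand-ins, not part of the Counter model:

```ruby
# Turn {[name, flower] => total} into {name => {flower => total}}.
# Inner hashes default to 0, so a flower a name never had reads as 0.
def pivot_counts(sums)
  table = Hash.new { |hash, name| hash[name] = Hash.new(0) }
  sums.each { |(name, flower), total| table[name][flower] += total }
  table
end

# Stand-in for the hash returned by the grouped sum query (assumed shape).
sums = {
  ['vino', 'rose'] => 1,
  ['vino', 'lily'] => 2,
  ['gaya', 'rose'] => 1,
  ['rosi', 'rose'] => 2,
  ['rosi', 'lily'] => 1
}

pivot_counts(sums).each do |name, counts|
  puts [name.capitalize, counts['rose'], counts['lily']].join(' | ')
end
```

The default of 0 on the inner hashes covers combinations that the grouped query does not return at all.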

Ruby CSV re-arranging Array

I'm not sure what the appropriate title for this question is, so if someone could help me with that too, it would be nice.
I have a CSV file that looks something like
ID | Num
a | 1
a | 2
a | 3
b | 4
b | 5
c | 6
c | 7
I need the result to be:
ID | Num
a | 1,2,3
b | 4,5
c | 6,7
Currently, my solution is:
ary = CSV.open('some_file')
final = Array.new
id = ary[1][0] # ary[0] is "id"
numJoin = ary[1][1]
(1..ary.length).each do |i|
  if id == ary[i+1][0]
    numJoin = numJoin + "," + ary[i+1][1]
  else
    final << [id,numJoin]
    id = ary[i+1][0]
    numJoin = ary[i+1][1]
  end
end
It works, but I would like to learn other ways to solve this, as I think there should be simpler ways to do it.
Thanks in advance.
You can use group_by, which groups by the return value of the block passed to it, in this case, it's the ID.
ary = ary.group_by { |v| v[0] }
P.S. That file doesn't look like a CSV.
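On its own, group_by returns a hash that maps each ID to its whole rows, so one more step is needed to get the joined numbers. A minimal sketch, assuming the file really is pipe-delimited with a header row as shown in the question; the sample string and the join_nums name are made up here:

```ruby
require 'csv'

# Group data rows by their first column and join the second column with commas.
# Fields are stripped because the sample has spaces around the "|" separator.
def join_nums(text)
  rows = CSV.parse(text, col_sep: '|').map { |row| row.map { |f| f.to_s.strip } }
  header, *body = rows
  merged = body.group_by(&:first).map { |id, group| [id, group.map(&:last).join(',')] }
  [header] + merged
end

# Hypothetical sample input copied from the table in the question.
sample = "ID | Num\na | 1\na | 2\na | 3\nb | 4\nb | 5\nc | 6\nc | 7\n"
join_nums(sample).each { |id, nums| puts "#{id} | #{nums}" }
```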

ruby multiple loop sets but with limited rows per set

Alrightie, so I'm building a CSV file this time with Ruby. The outer loop should run up to num_of_loops, but it writes an entire set of days on each pass rather than stopping at the specified row. I want to change the first column of the CSV file to a new name for each row.
If I do this:
class_days = %w[Wednesday Thursday Friday]
num_of_loops = (num_of_loops / class_days.size).ceil
num_of_loops.times {
  ["Wednesday","Thursday","Friday"].each do |x|
    data[0] = x
    data[4] = classname()
    # Write all to file
    #
    csv << data
  end
}
Then the loop will run only 3 times for a 5 row request.
I'd like it to run the full 5 rows such that instead of stopping at Wed/Thurs/Fri it goes to Wed/Thurs/Fri/Wed/Thurs instead.
class_days = %w[Wednesday Thursday Friday]
num_of_loops.times do |i|
  data[0] = class_days[i % class_days.size]
  data[4] = classname
  csv << data
end
The interesting part is here:
class_days[i % class_days.size]
We need an index into class_days that is between 0 and class_days.size - 1. We can get that with the % (modulo) operator. That operator yields the remainder after dividing i by class_days.size. This table shows how it works:
i i % 3
0 0
1 1
2 2
3 0
4 1
5 2
...
The other key part is that the times method yields indices starting with 0.
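The modulo indexing can be checked in isolation. This sketch leaves out the CSV writing and the classname call from the question and just collects the day chosen for each row; days_for_rows is a hypothetical helper name:

```ruby
# Return the day assigned to each of `count` rows, wrapping around the list.
def days_for_rows(days, count)
  Array.new(count) { |i| days[i % days.size] }
end

class_days = %w[Wednesday Thursday Friday]
puts days_for_rows(class_days, 5).join('/')
# prints Wednesday/Thursday/Friday/Wednesday/Thursday
```

If only the list of days is needed, class_days.cycle.first(5) should give the same five values without explicit index arithmetic.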
