Validating against a variable number of columns in Spark

I have a bunch of codes indicating the stages a person has been through, displayed horizontally in my data as shown below.
Name  code1  code2  code3  code4
A     2      3      4      Null
B     2      5      4      7
C     1      3      4      5
D     0      9      Null   Null
I have another file which has all the valid codes.
ID  Value
1   3
2   4
3   5
4   6
5   7
What I would like to do is validate all the columns cell by cell against this lookup, indicating 0 if a value is valid and null if it is not.
I'm using Apache Spark 1.5.2 and I would like to do this efficiently. I've tried a bunch of combinations, and the closest I've come to what I want is concatenating the cells, exploding the result into a normalized table, and then performing the lookups.

You can do this very simply with a single pass through the data, without any joins or explode, by code-generating a validation expression:
// Simulate the data
case class Record(Name: String, code1: Option[Int], code2: Option[Int])
val dfData = sc.parallelize(Seq(
  Record("A", Some(3), Some(4)),
  Record("B", Some(3), None)
)).toDF.registerTempTable("my_data")

// Simulate the lookup table
val dfLookup = sc.parallelize(Seq((1,3), (2,4))).toDF("ID", "Value")

// Build a validation expression of the form "code1 = 3 and code2 = 4"
val validationExpression = dfLookup.collect.map{ row =>
  s"code${row.getInt(0)} = ${row.getInt(1)}"
}.mkString(" and ")

// Add an is_valid column to the data; nvl() turns a null comparison result into false
sql(s"select *, nvl($validationExpression, false) as is_valid from my_data").show
This produces:
defined class Record
dfData: Unit = ()
dfLookup: org.apache.spark.sql.DataFrame = [ID: int, Value: int]
validationExpression: String = code1 = 3 and code2 = 4
+----+-----+-----+--------+
|Name|code1|code2|is_valid|
+----+-----+-----+--------+
| A| 3| 4| true|
| B| 3| null| false|
+----+-----+-----+--------+
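If the goal is the per-cell validation described in the question (0 when a code is valid, null when it is not), the same code-generation trick works with a generated IN list instead of per-column equality. Here is a minimal sketch against the simulated tables above; codeColumns and the *_valid column names are illustrative:
// Collect the set of valid codes from the lookup table, e.g. "3,4"
val validCodes = dfLookup.collect.map(_.getInt(1)).mkString(",")
// Generate one CASE expression per code column: 0 when valid, null otherwise
val codeColumns = Seq("code1", "code2")
val perColumnChecks = codeColumns.map { c =>
  s"case when $c in ($validCodes) then 0 else null end as ${c}_valid"
}.mkString(", ")
sql(s"select Name, $perColumnChecks from my_data").show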

Related

Multi-level counter iteration

I'm stuck on creating an algorithm as follows. I know this shouldn't be too difficult, but I simply can't get my head around it, and can't find the right description of this kind of pattern.
Basically I need a multi-level counter, where, when a combination exists in the database, the next value is tried by incrementing from the right.
1 1 1 - Start position. Does this exist in database? YES -> Extract this and go to next
1 1 2 - Does this exist in database? YES -> Extract this and go to next
1 1 3 - Does this exist in database? YES -> Extract this and go to next
1 1 4 - Does this exist in database? NO -> Reset level 1, move to level 2
1 2 1 - Does this exist in database? YES -> Extract this and go to next
1 2 2 - Does this exist in database? NO -> Reset level 2 and 1, move to level 3
2 1 1 - Does this exist in database? YES -> Extract this and go to next
2 1 2 - Does this exist in database? YES -> Extract this and go to next
2 1 3 - Does this exist in database? NO -> Reset level 1 and increment level 2
2 2 1 - Does this exist in database? YES -> Extract this and go to next
2 2 2 - Does this exist in database? YES -> Extract this and go to next
2 2 3 - Does this exist in database? YES -> Extract this and go to next
2 3 1 - Does this exist in database? NO -> Reset levels 1 and 2, move to level 3
3 1 1 - Does this exist in database? NO -> Reset level 1, move to level 2
3 2 1 - Does this exist in database? NO -> End, as all increments tried
There could be more than three levels, though.
In practice, each value like 1, 2, etc. is actually a $value1, $value2, etc. containing a runtime string being matched against an XML document. So it's not just a case of pulling out every combination already existing in the database.
Assuming the length of the DB key is known upfront, here's one way it can be implemented. I'm using TypeScript, but similar code can be written in your favorite language.
First, I declare some type definitions for convenience.
export type Digits = number[];
export type DbRecord = number;
Then I initialize a fakeDb object which works as a mock data source. The function I wrote will work against this object. The object's keys represent the database records' keys (of type string). The values are simple numbers (intentionally sequential); they represent the database records themselves.
export const fakeDb: { [ dbRecordKey: string ]: DbRecord } = {
  '111': 1,
  '112': 2,
  '113': 3,
  '211': 4,
  '212': 5,
  '221': 6,
  '311': 7,
};
Next comes the fun part: the function that uses the counterDigits array of "digits", incrementing them depending on each record's presence or absence.
Please do NOT treat this as production-ready code! A) There are unnecessary console.log() invocations which only exist for demo purposes. B) It's a good idea not to read a whole lot of DbRecords from the database into memory, but rather to use yield/return or some kind of buffer or stream.
export function readDbRecordsViaCounter(): DbRecord[] {
  const foundDbRecords: DbRecord[] = [];
  const counterDigits: Digits = [1, 1, 1];
  let currentDigitIndex = counterDigits.length - 1;
  do {
    console.log(`-------`);
    if (recordExistsFor(counterDigits)) {
      // Hit: extract the record and keep incrementing the rightmost digit
      foundDbRecords.push(extract(counterDigits));
      currentDigitIndex = counterDigits.length - 1;
      counterDigits[currentDigitIndex] += 1;
    } else {
      // Miss: move one digit to the left, resetting every digit to its right
      currentDigitIndex--;
      for (let priorDigitIndex = currentDigitIndex + 1; priorDigitIndex < counterDigits.length; priorDigitIndex++) {
        counterDigits[priorDigitIndex] = 1;
      }
      if (currentDigitIndex < 0) {
        // All digits exhausted -- nothing left to try
        console.log(`------- (no more records expected -- ran out of counter's range)`);
        return foundDbRecords;
      }
      counterDigits[currentDigitIndex] += 1;
    }
    console.log(`next key to try: ${ getKey(counterDigits) }`);
  } while (true);
}
The rest are "helper" functions for constructing a string key from a digits array and for accessing the fake database.
export function recordExistsFor(digits: Digits): boolean {
  const keyToSearch = getKey(digits);
  const result = Object.getOwnPropertyNames(fakeDb).some(key => key === keyToSearch);
  console.log(`key=${ keyToSearch } => recordExists=${ result }`);
  return result;
}

export function extract(digits: Digits): DbRecord {
  const keyToSearch = getKey(digits);
  const result = fakeDb[keyToSearch];
  console.log(`key=${ keyToSearch } => extractedValue=${ result }`);
  return result;
}

export function getKey(digits: Digits): string {
  return digits.join('');
}
Now, if you run the function like this:
const dbRecords = readDbRecordsViaCounter();
console.log(`\n\nDb Record List: ${ dbRecords }`);
you should see the following output, which shows the iteration steps and reports the final result at the very end.
-------
key=111 => recordExists=true
key=111 => extractedValue=1
next key to try: 112
-------
key=112 => recordExists=true
key=112 => extractedValue=2
next key to try: 113
-------
key=113 => recordExists=true
key=113 => extractedValue=3
next key to try: 114
-------
key=114 => recordExists=false
next key to try: 121
-------
key=121 => recordExists=false
next key to try: 211
-------
key=211 => recordExists=true
key=211 => extractedValue=4
next key to try: 212
-------
key=212 => recordExists=true
key=212 => extractedValue=5
next key to try: 213
-------
key=213 => recordExists=false
next key to try: 221
-------
key=221 => recordExists=true
key=221 => extractedValue=6
next key to try: 222
-------
key=222 => recordExists=false
next key to try: 231
-------
key=231 => recordExists=false
next key to try: 311
-------
key=311 => recordExists=true
key=311 => extractedValue=7
next key to try: 312
-------
key=312 => recordExists=false
next key to try: 321
-------
key=321 => recordExists=false
next key to try: 411
-------
key=411 => recordExists=false
------- (no more records expected -- ran out of counter's range)
Db Record List: 1,2,3,4,5,6,7
It is strongly recommended to read the code. If you want me to describe the approach or any specific detail(s), let me know. Hope it helps.
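Since the question notes there could be more than three levels, one small (hypothetical) generalization is to build the starting digits array from a runtime width instead of hardcoding [1, 1, 1]; the rest of the function can stay exactly as it is:
// Hypothetical helper: derive the counter's starting digits from a runtime width
export function initialDigits(width: number): Digits {
  return new Array<number>(width).fill(1); // e.g. width 4 -> [1, 1, 1, 1]
}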

Reshape data in pig - change row values to column names

Is there a way to reshape the data in Pig?
The data looks like this -
id | p1 | count
1 | "Accessory" | 3
1 | "clothing" | 2
2 | "Books" | 1
I want to reshape the data so that the output would look like this:
id | Accessory | clothing | Books
1 | 3 | 2 | 0
2 | 0 | 0 | 1
Can anyone please suggest a way to do this?
If it's a fixed set of product lines, the code below might help; otherwise you can go for a custom UDF to achieve the objective.
Input: a.csv
1|Accessory|3
1|Clothing|2
2|Books|1
Pig snippet:
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long,product_name:chararray,rec_cnt:long);
req_stats = FOREACH (GROUP test BY product_id) {
    accessory = FILTER test BY product_name=='Accessory';
    clothing = FILTER test BY product_name=='Clothing';
    books = FILTER test BY product_name=='Books';
    GENERATE group AS product_id,
             (IsEmpty(accessory) ? '0' : BagToString(accessory.rec_cnt)) AS a_cnt,
             (IsEmpty(clothing) ? '0' : BagToString(clothing.rec_cnt)) AS c_cnt,
             (IsEmpty(books) ? '0' : BagToString(books.rec_cnt)) AS b_cnt;
};
DUMP req_stats;
Output:
(1,3,2,0)
(2,0,0,1)

Linq - Get last entry from different customers

I'm trying to create a LINQ query but unfortunately have no idea how to solve my problem.
I would like to get the latest entry for each customer, and from that result take only 5 entries, sorted by date.
ID Date ID_Costumer
1 - 01.01.2014 - 1
2 - 02.01.2014 - 2
3 - 02.01.2014 - 1
4 - 03.01.2014 - 1 --> this value
5 - 04.01.2014 - 3
6 - 05.01.2014 - 3 --> this value
7 - 05.01.2014 - 4
8 - 06.01.2014 - 4 --> this value
9 - 08.01.2014 - 5 --> this value
10 - 09.01.2014 - 6 --> this value
I tried it with this query:
var query = from g in context.Geraete
            where g.Online && g.AltGeraet == false
            select g;
query.GroupBy(g => g.ID_Anbieter).Select(g => g.Last());
query.Take(5);
but it doesn't work.
You should assign the result of selecting the last item from each group back to the query variable:
query = query.GroupBy(g => g.ID_Anbieter).Select(g => g.Last());
var result = query.Take(5);
Keep in mind that the Last() operator is not supported by LINQ to Entities. Also, I think you should add ordering when selecting the latest item from each group, and when selecting the top 5 latest items:
var query = from g in context.Geraete
            where g.Online && !g.AltGeraet
            group g by g.ID_Anbieter into grp
            select grp.OrderByDescending(g => g.Date).First();
var result = query.OrderByDescending(x => x.Date).Take(5);
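For comparison, here is a sketch of the same query in method syntax; note that, depending on the provider, First() inside the projection may need to be FirstOrDefault() (LINQ to Entities historically rejects First() in that position):
var query = context.Geraete
    .Where(g => g.Online && !g.AltGeraet)
    .GroupBy(g => g.ID_Anbieter)
    .Select(grp => grp.OrderByDescending(g => g.Date).FirstOrDefault());
var result = query.OrderByDescending(x => x.Date).Take(5);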

merge rows csv by id ruby

I have a .csv file that, for simplicity, has two fields: ID and comment. Rows share an ID wherever the comment field hit the max character limit of whatever table it was generated from and another row was necessary. I now need to merge the associated comments together, creating one row for each unique ID, using Ruby.
To illustrate, I'm trying in Ruby, to make this:
ID | COMMENT
1 | fragment 1
1 | fragment 2
2 | fragment 1
3 | fragment 1
3 | fragment 2
3 | fragment 3
into this:
ID | COMMENT
1 | fragment 1 fragment 2
2 | fragment 1
3 | fragment 1 fragment 2 fragment 3
I've come close to finding a way to do this using inject({}) and a hash, but I'm still working on getting all the data merged correctly. Meanwhile my code seems to be getting too complicated, with multiple hashes and arrays just to do a merge on selected rows.
What's the best/simplest way to achieve this type of row merge? Could it be done with just arrays?
Would appreciate advice on how one would normally do this in Ruby.
Keep the headers and group by ID:
require 'csv'

rows = CSV.read 'comment.csv', :headers => true
rows.group_by{|row| row['ID']}.values.each do |group|
  puts [group.first['ID'], group.map{|r| r['COMMENT']} * ' '] * ' | '
end
You can use 0 and 1 but I think it's clearer to use the header field names.
With the following CSV file, tmp.csv:
1,fragment 11
1,fragment 21
2,fragment 21
2,fragment 22
3,fragment 31
3,fragment 32
3,fragment 33
Try this (demonstrated using irb)
irb> require 'csv'
=> true
irb> h = Hash.new
=> {}
irb> CSV.foreach("tmp.csv") {|r| h[r[0]] = h.key?(r[0]) ? h[r[0]] + r[1] : r[1]}
=> nil
irb> h
=> {"1"=>"fragment 11fragment 21", "2"=>"fragment 21fragment 22", "3"=>"fragment 31fragment 32fragment 33"}

ruby multiple loop sets but with limited rows per set

Alrighty, so I'm building a CSV file this time with Ruby. The outer loop should produce num_of_loops rows, but it runs in whole sets of class days rather than stopping at the requested row count. I want to change the first column of the CSV file to a new name for each row.
If I do this:
class_days = %w[Wednesday Thursday Friday]
num_of_loops = (num_of_loops / class_days.size).ceil
num_of_loops.times {
  ["Wednesday","Thursday","Friday"].each do |x|
    data[0] = x
    data[4] = classname()
    # Write all to file
    #
    csv << data
  end
}
Then the loop produces only 3 rows for a 5-row request.
I'd like it to run the full 5 rows, so that instead of stopping at Wed/Thurs/Fri it goes Wed/Thurs/Fri/Wed/Thurs instead.
class_days = %w[Wednesday Thursday Friday]
num_of_loops.times do |i|
  data[0] = class_days[i % class_days.size]
  data[4] = classname
  csv << data
end
The interesting part is here:
class_days[i % class_days.size]
We need an index into class_days that is between 0 and class_days.size - 1. We can get that with the % (modulo) operator, which yields the remainder after dividing i by class_days.size. This table shows how it works:
i    i % 3
0    0
1    1
2    2
3    0
4    1
5    2
...
The other key part is that the times method yields indices starting with 0.
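As a quick standalone check of the wrap-around behavior (a sketch; only class_days comes from the answer above):
class_days = %w[Wednesday Thursday Friday]
5.times { |i| puts class_days[i % class_days.size] }
# Prints: Wednesday, Thursday, Friday, Wednesday, Thursday (one per line)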
