Ruby storing data for queries

I have a string
"4813243948,1234433948,1.3,Type2
1234433948,4813243948,1.3,Type1
1234433948,6345635414,1.3,Type1
4813243948,2435677524,1.3,Type2
4813243948,5245654367,1.3,Type2
2345243524,6754846756,1.3,Type1
1234512345,2345124354,1.3,Type1
1342534332,4565346546,1.3,Type1"
This is telephone outbound call data where each new line represents a new phone call.
(Call From, Call To, Duration, Line Type)
I want to save this data in a way that allows me to query a specific number and get a string output of the number, its type, its total minutes used, and all the calls that it made (outbound calls). I just want to do this in a single ruby file.
Thus typing in this
4813243948
Returns
4813243948, Type2, 3.9 Minutes total
1234433948, 1.3
2435677524, 1.3
5245654367, 1.3
I am wondering whether I should store the values in arrays, or create a custom class, make each number an object of that class, and append its calls to it; I'm not sure how to structure the class approach. Keeping a separate array for each number seems like it would get cluttered, as there are thousands of numbers and millions of calls. Of course, the provided input string is only a very small portion of the real source.

This looks like a CSV. If you slap some headers on top, you can parse it into an array of hashes.
str = "4813243948,1234433948,1.3,Type2
1234433948,4813243948,1.3,Type1"
require 'csv'
calls = CSV.parse(str, headers: %w[from to length type], header_converters: :symbol).map(&:to_h)
# => [{:from=>"4813243948", :to=>"1234433948", :length=>"1.3", :type=>"Type2"},
# {:from=>"1234433948", :to=>"4813243948", :length=>"1.3", :type=>"Type1"}]
This is essentially the same as your original string, only it trades some memory for ease of access. You can now "query" this dataset like this:
calls.select{ |c| c[:from] == '4813243948' }
And then aggregate for presentation however you wish.
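For instance, here's a minimal sketch of a query that produces the output format you asked for (the helper name and the exact labels are my own choices; it assumes the calls array parsed above and that the number has at least one outbound call):
# Hypothetical helper: summarize one number's outbound calls.
def report(calls, number)
  outbound = calls.select { |c| c[:from] == number }
  total = outbound.sum { |c| c[:length].to_f }
  header = "#{number}, #{outbound.first[:type]}, #{total.round(1)} Minutes total"
  ([header] + outbound.map { |c| "#{c[:to]}, #{c[:length]}" }).join("\n")
end

puts report(calls, '4813243948')
# 4813243948, Type2, 3.9 Minutes total
# 1234433948, 1.3
# 2435677524, 1.3
# 5245654367, 1.3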
Naturally, searching through this array takes linear time, so if you have millions of calls you might want to organize them in a more efficient search structure (like a B-Tree) or move the whole dataset to a real database.
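For example, here is a rough sketch of the database route using the sqlite3 gem (the gem choice and the table and column names are mine; from_num avoids the SQL keyword FROM):
require 'sqlite3'

db = SQLite3::Database.new(':memory:') # use a file path to persist between runs
db.execute 'CREATE TABLE calls (from_num TEXT, to_num TEXT, length REAL, type TEXT)'
calls.each do |c|
  db.execute 'INSERT INTO calls VALUES (?, ?, ?, ?)',
             [c[:from], c[:to], c[:length].to_f, c[:type]]
end
db.execute 'CREATE INDEX idx_from ON calls (from_num)' # keeps per-number lookups fast

db.execute 'SELECT to_num, length FROM calls WHERE from_num = ?', ['4813243948']
# => [["1234433948", 1.3], ["2435677524", 1.3], ["5245654367", 1.3]]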

If you only want to make queries for the number the call originated from, you could store the data in a hash where the keys are the "call from" numbers and the value is an array, or another hash, containing the rest of the data. For example:
{ '4813243948' => [ { call_to: '1234433948', duration: 1.3, line_type: 'Type2' }, ... ], ... }
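A quick sketch of how such an index could be built from the calls array parsed in the previous answer (the key names follow that parse; group_by is standard Ruby):
# Index calls by originating number: each key maps to the array of
# that number's outbound calls, so a query is a single hash lookup.
calls_by_number = calls.group_by { |c| c[:from] }

calls_by_number['4813243948']
# => [{:from=>"4813243948", :to=>"1234433948", :length=>"1.3", :type=>"Type2"}, ...]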
If the dataset is very large, or you need more complex queries, it might be better to store it in a database and just query it directly.

Related

Performance: Replacing Series values with keys from a Dictionary in Python

I have a data series that contains various names of the same organizations. I want to harmonize these names into a given standard using a mapping dictionary. I am currently using a nested for loop: I iterate through each series element, and if it is within the dictionary's values, I update the series value with the dictionary key.
# For example, corporation_series is:
0 'Corp1'
1 'Corp-1'
2 'Corp 1'
3 'Corp2'
4 'Corp--2'
dtype: object
# Dictionary is:
mapping_dict = {
    'Corporation_1': ['Corp1', 'Corp-1', 'Corp 1'],
    'Corporation_2': ['Corp2', 'Corp--2'],
}
# I use this logic to replace the values in the series
for index, value in corporation_series.items():
    for key, names in mapping_dict.items():
        if value in names:
            corporation_series = corporation_series.replace(value, key)
So, if the series has a value of 'Corp1' and it exists in the dictionary's values, the logic replaces it with the corresponding key. However, it is an extremely expensive method. Could someone recommend a better way of doing this operation? Much appreciated.
I found a solution by using pandas' .map method. In order to use .map, I had to invert my dictionary:
# Inverted dict: each variant maps directly to its standard name
inverted_dict = {
    'Corp1': 'Corporation_1',
    'Corp-1': 'Corporation_1',
    'Corp 1': 'Corporation_1',
    'Corp2': 'Corporation_2',
    'Corp--2': 'Corporation_2',
}
# use .map (note: values missing from the dict become NaN)
corporation_series = corporation_series.map(inverted_dict)
Instead of 5 minutes of processing, it took around 5 seconds. While this works, I'm sure there are better solutions out there. Any suggestions would be most welcome.

Performance difference map() vs withColumn()

I have a table with over 100 columns. I need to remove double quotes from certain columns. I found 2 ways to do it, using withColumn() and map()
Using withColumn()
cols_to_fix = ["col1", ..., "col20"]
for col in cols_to_fix:
    df = df.withColumn(col, regexp_replace(df[col], "\"", ""))
Using map()
def remove_quotes(row: Row) -> Row:
    row_as_dict = row.asDict()
    cols_to_fix = ["col1", ..., "col20"]
    for column in cols_to_fix:
        if row_as_dict[column]:
            row_as_dict[column] = re.sub("\"", "", str(row_as_dict[column]))
    return Row(**row_as_dict)
df = df.rdd.map(remove_quotes).toDF(df.schema)
Here is my question: I found that using map() takes about 4 times longer than withColumn() on a table that has ~25M records. I would really appreciate it if a fellow Stack Overflow user could explain the reason for the performance difference, so that I can avoid similar pitfalls in the future.
Firstly, one piece of advice: do not convert the DataFrame to an RDD; just do df.map(your function here) directly. This may save a lot of time.
The following page:
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds
would save us a lot of time; its main conclusion is that RDDs are remarkably slower than DataFrames/Datasets, not to mention the time spent converting a DataFrame to an RDD.
Now let's talk about map and withColumn without any conversion between DataFrame and RDD.
Conclusion first: map is usually 5x slower than withColumn. The reason is that a map operation always involves deserialization and serialization, while withColumn can operate directly on the column of interest. To be specific, map has to deserialize each Row into the individual values the operation will act on.
Here is an example: assume we have a DataFrame which looks like
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
Then, say we want to increment all the values in the users_count column by 1; we can do it like this:
df.map(row => {
  val usersCount = row.getInt(1) + 1
  (row.getString(0), usersCount)
}).toDF("language", "users_count_incremented_by_1")
In the code above, we first need to deserialize every row to extract the value in the 2nd column; after that we output the modified values and save them as a DataFrame (this step requires serializing (a, b) into Row(a, b), since a DataFrame is nothing but a Dataset of Rows).
For a more detailed explanation, check the following excellent article:
https://medium.com/@fqaiser94/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44
map cannot operate on a column itself; it has to operate on the values of the column. Getting the values requires deserialization, and saving them back as a DataFrame requires serialization.
But map is still of great use: with map you can implement very sophisticated operations, whereas withColumn limits you to the built-in operations.
To sum it up, map is slower but more flexible; withColumn is surely the most efficient, but its functionality is limited.

Exporting data into Excel using an iterative loop

I am doing an iterative calculation in Maple and I want to store the resulting data (which comes as a column matrix) from each iteration in a specific column of an Excel file. For example, my data is
mydat||1:= <<11,12,13,14>>:
mydat||2:= <<21,22,23,24>>:
mydat||3:= <<31,32,33,34>>:
and so on.
I am trying to export each of them into an Excel file, and I want each dataset stored in consecutive columns of the same file. For example, mydat||1 goes to column A, mydat||2 goes to column B, and so on. I tried something like the following.
with(ExcelTools):
for k from 1 to 3 do
    Export(mydat||k, "data.xlsx", "Sheet1", "A:C"): # The problem is selecting the range.
end do:
How do I select the range appropriately here? Is there any other method to export the data and store in the way that I explained above?
There are a couple of ways to do this. The easiest is certainly to put all of your data into one data structure and then export that. For example:
mydat1:= <<11,12,13,14>>:
mydat2:= <<21,22,23,24>>:
mydat3:= <<31,32,33,34>>:
mydata := Matrix( < mydat1 | mydat2 | mydat3 > );
This stores your data in a Matrix where mydat1 is the first column, mydat2 is the second column, etc. With the data in this form, either ExcelTools:-Export or the more generic Export command will work:
ExcelTools:-Export( mydata, "data.xlsx" );
Export( "data.xlsx", mydata );
Now since you mention that you are doing an iterative calculation, you may want to write the results out column by column. Here's another method that doesn't involve creating another data structure to house the results. This does assume that the data in mydat||i has been created before the loop.
for i to 3 do
    ExcelTools:-Export( cat(`mydat`, i), "data.xlsx", 1, ["A1", "B1", "C1"][i] );
end do;
If you want to write the data out to a file as you are building it, then just do the Export call after the creation of each of the columns, i.e.
ExcelTools:-Export( mydat1, "data.xlsx", 1, "A1" );
Note that I removed the "||" characters. These are used in Maple for concatenation and caused some issues with the second method.

removing duplicates from list of strings (output from Twilio call - Ruby)

I'm trying to display total calls from a Twilio object as well as unique calls.
The total calls is simple enough:
# set up a client to talk to the Twilio REST API
@sub_account_client = Twilio::REST::Client.new(@account_sid, @auth_token)
@subaccount = @sub_account_client.account
@calls = @subaccount.calls
@total_calls = @calls.list.count
However, I'm really struggling to figure out how to display unique calls (people sometimes call back from the same number and I only want to count calls from the same number once). I'm thinking this is a pretty simple method or two, but I've burnt quite a few hours trying to figure it out (still a Ruby noob).
Currently I've been working it in the console as follows:
@sub_account_client = Twilio::REST::Client.new(@account_sid, @auth_token)
@subaccount = @sub_account_client.account
@subaccount.calls.list({}).each do |call|
  # "from" returns the phone number that called
  print call.from
end
This returns the following strings:
+13304833615+13304833615+13304833615+13304833615+13304567890+13304833615+13304833615+13304833615
There are only two unique numbers there so I'd like to be able to return '2' for this.
Calling .class on that output shows String. I've used insert to add a space and then done a split(" ") to turn them into arrays, but the output is the following:
[+13304833615][+13304833615][+13304833615][+13304833615][+13304567890][+13304833615][+13304833615][+13304833615]
I can't call 'uniq' on that and I've tried to 'flatten' as well.
Please enlighten me! Thanks!
If what you have is a string that you want to manipulate, the below works:
%{+13304833615+13304833615+13304833615+13304833615+13304567890+13304833615+13304833615+13304833615}.split("+").uniq.reject { |x| x.empty? }.count
=> 2
However this is more ideal:
@subaccount.calls.list({}).map(&:from).uniq.count
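As a side note, if you also want per-number call counts, a small sketch (tally requires Ruby 2.7+; the counts shown assume the example output above):
numbers = @subaccount.calls.list({}).map(&:from)
numbers.uniq.count  # => 2
numbers.tally       # => {"+13304833615"=>7, "+13304567890"=>1}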
Can you build an array directly instead of converting it into a string first? Try something like this perhaps?
@calllist = []
@subaccount.calls.list({}).each do |call|
  # "from" returns the phone number that called
  @calllist.push call.from
end
You should then be able to call uniq on @calllist to shorten it to the unique members.
Edit: What type of object is @subaccount.calls.list anyway?
uniq should work for creating a unique list of strings. I think you may be getting confused by unrelated things. You don't want .split; that's for turning a single string into an array of word strings (by default it splits on spaces), which has turned each single number string into an array containing only that number. You may also have been confused by running your each call in the irb console, which will return the full array being iterated over even if your inner loop did the right thing. Try the following:
unique_numbers = @subaccount.calls.list({}).map { |call| call.from }.uniq
puts unique_numbers.inspect

How to Connect Logic with Objects

I have a system that contains x number of strings. These strings are shown in a UI based on some logic. For example, string number 1 should only show if the current time is past midday, and string 3 only shows if a randomly generated number between 0 and 1 is less than 0.5.
What would be the best way to model this?
Should the logic just be in code and be linked to a string by some sort of ID?
Should the logic somehow be stored with the strings?
NOTE The above is a theoretical example before people start questioning my logic.
It's usually better to keep resources (such as strings) separate from logic, so referring to strings by ID is a good idea.
It seems that you have a bunch of rules which you have to link to the display of strings. I'd keep all three as separate entities: rules, strings, and the linking between them.
An illustration in Python, necessarily simplified:
STRINGS = {
    'morning': 'Good morning',
    'afternoon': 'Good afternoon',
    'luck': 'you must be lucky today',
}

# predicates
import datetime, random

def showMorning():
    return datetime.datetime.now().hour < 12

def showAfternoon():
    return datetime.datetime.now().hour >= 12

def showLuck():
    return random.random() < 0.5

# interconnection
RULES = {
    'morning': showMorning,
    'afternoon': showAfternoon,
    'luck': showLuck,
}

# usage
for string_id, predicate in RULES.items():
    if predicate():
        print(STRINGS[string_id])
