I am dealing with a CSV file that has some customer information (email, name, address, amount, [shopping_list: item 1, item 2]).
I would like to work with the data and produce some labels for printing, as well as gather some extra information (total amounts, total of item 1, ...).
My main concern is to find the appropriate structure to store the data in Ruby for future manipulation. For now I have thought about the following possibilities:
multidimensional arrays: pretty simple to build, but pretty hard to access the data in a beautiful Ruby way.
hashes: having the email as key, and storing the information in different hashes (one hash for name, another hash for address, another hash for the shopping list...)
(getting the CSV data into a database and working with the data from Ruby??)
I would really appreciate your advice and guidance!!
Once you have more than a couple of pieces of information that you need to group together, it's time to consider moving from a generic hash/array to something more specialized. A good candidate for what you've described is Ruby's Struct class:
Customer = Struct.new(:email, :name, :address) # etc.
bill = Customer.new('bill@asdf.com', 'Bill Foo', '123 Bar St.')
puts "#{bill.name} lives at #{bill.address} and can be reached at #{bill.email}"
Output:
Bill Foo lives at 123 Bar St. and can be reached at bill@asdf.com
Struct.new simply creates a class with an attr_accessor for each symbol you pass in. Well, it actually creates a bit more than that, but for starters, that's all you need to worry about.
Once you've got the data from each row packed into an object of some sort (whether it's a struct or a class of your own), then you can worry about how to store those objects.
A hash will be ideal for random access by a given key (perhaps the customer's name or other unique ID)
A one-dimensional array works fine for iterating over the entire set of customers in the same order they were inserted
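Putting those pieces together, here is a minimal sketch, assuming Ruby's standard csv library and a customers.csv file whose header names match the fields above (the file name and column names are just placeholders):

require 'csv'

Customer = Struct.new(:email, :name, :address, :amount, :shopping_list)

customers = []   # keeps the customers in insertion order for iteration
by_email  = {}   # random access by email

CSV.foreach('customers.csv', headers: true) do |row|
  c = Customer.new(row['email'], row['name'], row['address'],
                   row['amount'].to_f, row['shopping_list'])
  customers << c
  by_email[c.email] = c
end

total_amount = customers.inject(0) { |sum, c| sum + c.amount }

From there, printing labels is a matter of iterating over customers, while lookups such as by_email['bill@asdf.com'] stay O(1).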
I'm a bit new to Redis, so please forgive if this is basic.
I'm working on an app that sends automatic replies to users for certain events. I would like to use Redis to store who has received what event.
Essentially, in Ruby, the data structure could look like this, where you have a map of users to events and the dates each event was sent.
{
  "mary@example.com" => {
    "sent_comment_reply" => ["12/12/2014", "3/6/2015"],
    "added_post_reply" => ["1/4/2006", "7/1/2016"]
  }
}
What is the best way to represent this in a Redis data structure so you can ask: did Mary get a sent_comment_reply, and if so, when was the latest?
In short, the question is: how (if possible) can you have a hash structure that holds an array in Redis?
The rationale, as opposed to using a set or list with a compound key, is that hashes have O(1) lookup time, whereas lookups on lists (LRANGE) and sets (SMEMBERS) are O(S+N) and O(N), respectively.
One way of structuring it in Redis, assuming you know the user's events and want the latest to be fresh in memory:
A sorted set per user. The members of the sorted set will be event codes (sent_comment_reply, added_post_reply), with the score of the latest occurrence as the highest. You can use ZRANK to get the answer to the question:
Did Mary get a sent_comment_reply?
A hash for the user as well. This time the field is the event (sent_comment_reply) and the value is its latest content, which should be updated with the newest body, date, etc. This will answer the question:
and if so, when was the latest?
Note: sorted sets are really fast, and in this example the events themselves are the data.
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non-repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
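A minimal sketch of the sorted-set-plus-hash approach described above, using the redis-rb gem (the key names here are just placeholders):

require 'redis'
redis = Redis.new

# Record an event: bump its score in the per-user sorted set and
# store the latest details in the per-user hash.
def record_event(redis, user, event, date)
  redis.zadd("events:#{user}", Time.now.to_i, event)
  redis.hset("latest:#{user}", event, date)
end

# "Did Mary get a sent_comment_reply?"  => a non-nil rank means yes
got_it = !redis.zrank("events:mary@example.com", "sent_comment_reply").nil?

# "If so, when was the latest?"
latest = redis.hget("latest:mary@example.com", "sent_comment_reply")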
A possible approach to using a hash to emulate an array is as follows (shown here with the redis-rb client):
def add_element(redis, key, value)
  len = redis.hlen(key)        # the current number of fields becomes the next index
  redis.hset(key, len, value)
end
This maps element array[i] to field i of the hash stored at key.
This will work for some cases, but I would probably go with the answer suggested in https://stackoverflow.com/a/34886801/2868839
I am working on an untrained classifier model in Python 2.7. I have a loop that looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine but I am looking at a data set of 13000 reviews, for each review to loop through this code is going to take for eeever (if my computer doesnt run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to python so I was hoping a more experienced could help with condensing it or perhaps point me in the right direction towards a library that will contain the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # do something if the bigram is not in the dictionary for some reason
        pass
This should convert what was an O(n) scan through dictionary into an O(1) hash lookup.
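For example, if dictionary currently exists as a list of bigrams, the mapping could be built like this (a sketch; bigram_list stands for whatever your existing collection is called):

# hypothetical: bigram_list is the existing ordered collection of bigrams
dictionary = {bigram: i for i, bigram in enumerate(bigram_list)}
features = [0] * len(dictionary)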
Hope this helps.
I'm downloading content from a webpage that seems to be in JSON. It is a large file with the following format:
"address1":"123 Street","address2":"Apt 1","city":"City","state":"ST","zip":"xxxxx","country":"US"
There are about 1000 of these entries, where each entry is contained within brackets. When I download the page using RestClient.get (open-uri was throwing an HTTP 500 error for some reason), the data is in the following format:
\"address\1":\"123 Street\",\"address2\":\"Apt 1\",\"city\":\"City\",\"state\":\"ST\",\"zip\":\"xxxxx\",\"country\":\"US\"
When I then use the JSON class:
parsed = JSON.parse(data_out)
it completely scrambles both the order of entries within the data structure, and also the order of the objects within each entry, for example:
"address1"=>"123 Street", "city"=>"City", "country"=>"US", "address2"=>"Apt 1"
If instead I use
data_j=data_out.to_json
then I get:
\\\"address\\\1":\\\"123 Street\\\",\\\"address2\\\":\\\"Apt 1\\\",\\\"city\\\":\\\"City\\\",\\\"state\\\":\\\"ST\\\",\\\"zip\\\":\\\"xxxxx\\\",\\\"country\\\":\\\"US\\\"
Further, only using the json class seems to allow me to select the entries I want:
parsed[1]["address1"]
=> "123 Street"
data_j[1]["address1"]
TypeError: can't convert String into Integer
from (irb):17:in `[]'
from (irb):17
from :0
Any idea what's going on? I guess since the JSON commands are working I can use them, but it is disconcerting that it's scrambling the entries and the order of the objects.
Although the data appears ordered in string form, it represents an unordered dataset. The line:
parsed = JSON.parse(data_out)
which you use is the correct way to convert the string form into something usable in Ruby. I cannot see the full structure from your example, so I don't know whether the top level is an array or an id-based hash. I suspect the latter, since you say it becomes unordered when you view it from Ruby. Therefore, if you knew which part of the address you were interested in, you might have code like this:
# Writes all the cities
parsed.each do |id, data|
  puts data["city"]
end
If the outer structure is an array, you'd do this:
# Writes all the cities
parsed.each do |data|
  puts data["city"]
end
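If you're not sure which case you have, a quick check (just a sketch) is:

require 'json'
parsed = JSON.parse(data_out)
puts parsed.class            # => Array or Hash
puts parsed.first.inspect    # the first entry, to see what its keys look like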
I have a list of objects that are returned as a sequence, and I would like to retrieve the keys of each object so as to be able to display it correctly. At the moment I try data?first?keys, which seems to return something more like the queries that produced the objects (not sure how to explain that last sentence better).
The number of objects returned is correct (7), but displaying the keys for each object is my aim. The macro that attempts this comes from the Apache OFBiz development book, chapter 8.
It seems my sequence is a list of hashes, and as explained by Daniel Dekany in this post:
The original problem is that someHash[key] expects a string as the key, because the hash type of FTL, by definition, maps string keys to arbitrary values. It's not the same as Java's Map. (Note that, to further complicate matters, in FTL someSequenceOrString[index] expects an integer index. So the [] thing is used for that too.) Now someBeanWrappedMap(key) has technically nothing to do with all the []-s; it's just a method call, so it accepts all kinds of keys. If you have a Map with non-string keys, you must use that.
Thanks, D. Dekany, if you're on Stack Overflow; this ended my half-day frustration with the FTL template.
My problem:
I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alpha-numeric characters, be collision-proof, and be represented in the smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important; it doesn't need to be one-way), and I'd prefer to find something that would produce a shorter id. Does anyone know of an existing algorithm that fits this problem?
In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?
>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734
Application Context (for those that are interested):
This will be used for a record linkage app. Given an input name and address, we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match, I need to generate this pseudo/derived id from the search input (the entity's name and address data). Every search record should result in an output record with either a real id (the actual database id resulting from a match/link) or this generated pseudo/derived id. The pseudo id will be prefixed with a character (e.g. N) to differentiate it from a real id.
I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well-studied hashing algorithms, their length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones are generally less collision-prone than something you could come up with yourself.
Use a cryptographic hash for its collision resistance, not its other qualities
Use as many bytes from the hash as you want (truncate)
Convert to alpha-numeric characters
You can also truncate the alpha-numeric string instead of the hash
An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.
import hashlib, re

N_HASH_CHARS = 11

def digest(name, address):
    hash = hashlib.md5(name + "|" + address).digest().encode("base64")
    alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", hash)
    return alnum_hash[:N_HASH_CHARS]
How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log2(62)). 11 characters give you 65.5 bits of entropy, which should be enough to avoid a collision among the first 2**32.7 users (about 7 billion).
A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.
I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:
Create a SHA1 hash from the data, and store it in a table with a serial number field.
Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.
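A minimal sketch of that table-backed approach, assuming SQLite and Python 2.7 to match the rest of the thread (the table and column names are just placeholders):

import hashlib
import sqlite3

conn = sqlite3.connect("ids.db")
conn.execute("CREATE TABLE IF NOT EXISTS ids "
             "(hash TEXT UNIQUE, serial INTEGER PRIMARY KEY AUTOINCREMENT)")

def id_for(name, address):
    h = hashlib.sha1(name + "|" + address).hexdigest()
    row = conn.execute("SELECT serial FROM ids WHERE hash = ?", (h,)).fetchone()
    if row is not None:
        return row[0]            # hash already on the table: reuse its serial
    cur = conn.execute("INSERT INTO ids (hash) VALUES (?)", (h,))
    conn.commit()
    return cur.lastrowid         # new hash: the fresh serial is the id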
I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?
So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle Sequence or SQL Server AutoNumber (also easy).
If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.
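For example, a minimal canonicalization step (the function name here is just a placeholder) might be:

def canonicalize(s):
    # lowercase and collapse runs of whitespace; real address canonicalization
    # (street abbreviations, unit/apartment formats, etc.) needs more work
    return " ".join(s.lower().split())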
EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.
Is that feasible in your situation?
Well, if there's more than one person at the same address with the same name, you're toast here (without adding code to detect this and add a discriminator of some kind).
But assuming that isn't an issue, the street address and zip code portion of the full address is sufficient to guarantee uniqueness there, so adding enough data from the name should take care of the issue...
Do you have access to a database, or other persistence mechanism, where you could generate and maintain key values for each address? Then keep the addresses and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address or person encountered... and then use the autogenerated alpha-numeric key...
You could use AAAAA01 for the first person at the first address,
AAAAA02 for the second person at the first address,
AAAAB07 for the seventh resident at the second address, etc.
If you don't have any way to generate and maintain these entity-key mappings, then you need to use the full street address/zip and full name, or a hash value of the same, although the hash-value approach has a small chance of generating duplicates...