I'm looking for a way to generate codes/ids in my specs. I found barcodes, codes, and id numbers. None of these quite fit my purpose, and if I use them they will be misleading for the type of code I'm actually generating. Is there a generator that accepts a format specifier? For example, I'd like to generate a string with digits and dashes in a specified sequence, like #####-###-####-#####.
Faker's numerify, letterify, and bothify seem like what you're looking for:
Faker::Base.numerify('###-###') # "203-099"
Faker::Base.letterify('???-???') # "ADB-VMZ"
Faker::Base.bothify('???-###') # "ISE-485"
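To make the idea of a format specifier concrete outside of Faker, here is a minimal Python sketch of the same substitution. It is not Faker's implementation; the helper name fill_pattern is hypothetical, and it assumes '#' means a random digit and '?' a random uppercase letter, as in the examples above.

```python
import random
import string

def fill_pattern(pattern, rng=random):
    """Replace '#' with a random digit and '?' with a random uppercase letter.

    Hypothetical helper for illustration; every other character is kept as-is.
    """
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "?":
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)
    return "".join(out)

code = fill_pattern("#####-###-####-#####")
print(code)  # random on every run, same shape as the pattern
```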
I'm using AWS Textract to pull information from PDF documents. After the scanned text is returned from AWS and persisted to a var, I'm doing this:
phone_number = '(555) 123-4567'
scanned_pdf_text.should have_text phone_number
But this fails about 20% of the time because of the non-deterministic way that AWS is returning the scanned PDF text. On occasion, the phone numbers can appear either of these two ways:
(555)123-4567 or (555) 123-4567
Some of this scanned text is very large, and I'd prefer not to go through the exercise of sanitizing the text coming back if I can avoid it (I'm also not good at regex usage). I also think using or logic to handle both cases seems to be a little heavy handed just to check text that is so similar (and clearly near-identical to the human eye).
Is there an rspec matcher that'll allow me to check on this text? I'm also using Capybara.default_normalize_ws = true but that doesn't seem to help in this case.
Assuming scanned_pdf_text is a string and the only differences you're seeing are in spaces, you can just get rid of the spaces and compare:
scanned_pdf_text.gsub(/\s+/, '').should eq('(555)123-4567') # exact
scanned_pdf_text.gsub(/\s+/, '').should include('(555)123-4567') # partial (match would treat the string as a regex, turning the parentheses into a group)
scanned_pdf_text.gsub(/\s+/, '').should have_text('(555)123-4567') # partial
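The same idea, stripping whitespace from both sides before comparing, can be sketched in a few lines of Python. The function name and the sample text are made up for illustration; only the technique comes from the answer above.

```python
import re

def contains_ignoring_whitespace(haystack, needle):
    """True if needle occurs in haystack once all whitespace is removed from both."""
    strip = lambda s: re.sub(r"\s+", "", s)
    return strip(needle) in strip(haystack)

scanned = "Call us at (555) 123-4567 during business hours"
print(contains_ignoring_whitespace(scanned, "(555)123-4567"))  # True
```

One thing to keep in mind: removing every space also joins neighbouring words, so for short search strings this can match across what used to be word boundaries.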
So I have a big csv file, over 1gb. There's a column with IP addresses in ipv4 and ipv6. I want to convert the ipv6 addresses into numbers, but there are too many rows for libre calc. So I'm wondering if it's possible to use python in the terminal to convert all the ipv6 addresses.
Also, I could split the file up into smaller pieces, then use libre calc, but same problem--I wouldn't know how to script that either.
EDIT:
I don't mind, it might get more complicated though. Also not sure how this should be formatted, but I hope people get the idea...So I have one table with IPv6 addresses like these examples:
2001:db8::cafe:1111
2001:db8:0:a:1:2:3:4
2001:db8:aaaa::c
2001:db8:0:0:1::4
There are a bunch of different rules that govern the formatting--way too hard for me. I've heard that python has a function that will specifically return the conversion, but not sure about the rest (how to get the returned values back into the csv correctly, with formatting unbroken, etc.). Anyway, here's a row from the other table:
"58569107296622255421594597096899477504","58569107375850417935858934690443427839","NG","Nigeria","Abuja Federal Capital Territory","Abuja","9.057350","7.489760"
So the part I need to match is the first two numbers (first two columns), where there are several ranges from
"0","340282366920938463463374607431768211455"
So I wanted to take the IPv6 addresses, convert them to IP numbers, then sort them into their respective ranges.
Yes, this is something you can do in Python. I'll demonstrate with a few short snippets and links to documentation that will fall short of a full solution in favor of empowering you with the resources that you need to put the pieces together yourself.
First off, if you want to load one CSV file line-by-line and write to a second one this is how you would do it:
>>> import csv
>>> with open('eggs.csv', newline='') as infile, open('omellette.csv', 'w', newline='') as outfile:
...     r = csv.reader(infile)
...     w = csv.writer(outfile)
...     for row in r:
...         print(', '.join(row)) # print unmodified
...         row[0] = str(ipToNum(row[0]))
...         row[1] = str(ipToNum(row[1]))
...         print(', '.join(row)) # print modified
...         w.writerow(row)
The original on which this example was based and additional information about python's built-in CSV capabilities can be found here:
https://docs.python.org/3/library/csv.html
You will probably need to make adjustments depending on the exact formatting of your particular CSV file. Now, to convert IP addresses to numbers you can do something like the following:
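import socket, struct
def ipToNum(ip):
    "convert ipv4/6 string to integer"
    if ':' in ip:  # IPv6: 16 bytes, unpacked as two 64-bit halves
        hi, lo = struct.unpack('>QQ', socket.inet_pton(socket.AF_INET6, ip))
        return (hi << 64) | lo
    return struct.unpack('>L', socket.inet_pton(socket.AF_INET, ip))[0]
def numToDottedip(n, family=socket.AF_INET):
    "convert integer to ipv4/6 string"
    if family == socket.AF_INET6:
        packed = struct.pack('>QQ', n >> 64, n & 0xFFFFFFFFFFFFFFFF)
    else:
        packed = struct.pack('>L', n)
    return socket.inet_ntop(family, packed)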
This example is adapted from what I found here:
https://www.oreilly.com/library/view/python-cookbook/0596001673/ch10s06.html
You will have to modify it
Also, if you want to learn more about the socket and struct modules here is the documentation:
https://docs.python.org/3/library/socket.html
https://docs.python.org/3/library/struct.html
You shouldn't need to split the file up, since the CSV reader object returns only one line at a time rather than reading in the whole file at once. Of course, you probably also want to actually do something with those numbers once you've read them in, but since you didn't specify, I'll leave figuring that out to you.
Also note that I haven't tried any of this code. It's worth repeating here in the form of a metaphor: I'm trying to teach you to fish rather than just giving you fish. It's in your best interest to take this advice and wrestle with getting it to work yourself as that would be your first step toward actually being a programmer.
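For the last step in the question, sorting each converted address into its range, a minimal sketch could look like the following. It uses the standard-library ipaddress module instead of socket/struct, and a binary search over the range starts. The function names, the demo range built from 2001:db8::/32, and the "ZZ"/"Example" fields are all made up for illustration; the real range rows would be read with csv.reader as shown earlier.

```python
import bisect
import ipaddress

def ip_to_int(text):
    """Convert an IPv4 or IPv6 string to its integer value."""
    return int(ipaddress.ip_address(text))

def load_ranges(rows):
    """Build a list of (start, end, row) sorted by start from the range table."""
    ranges = [(int(r[0]), int(r[1]), r) for r in rows]
    ranges.sort(key=lambda t: t[0])
    return ranges

def find_range(ranges, starts, number):
    """Return the row whose [start, end] contains number, or None."""
    i = bisect.bisect_right(starts, number) - 1
    if i >= 0 and ranges[i][0] <= number <= ranges[i][1]:
        return ranges[i][2]
    return None

# Made-up range row standing in for the real range table.
network = ipaddress.ip_network("2001:db8::/32")
demo_rows = [[str(int(network[0])), str(int(network[-1])), "ZZ", "Example"]]

ranges = load_ranges(demo_rows)
starts = [r[0] for r in ranges]

for address in ["2001:db8::cafe:1111", "2001:db8:0:a:1:2:3:4"]:
    number = ip_to_int(address)
    match = find_range(ranges, starts, number)
    print(address, number, match[2] if match else "no range")
```

This assumes the ranges do not overlap. The range table has to fit in memory, but the large file with the addresses can still be streamed one row at a time.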
I would like to use the newest version of ELKI, but I get errors leading to a NullPointerException and the task fails. When using 0.6.0 it works fine.
Here is some toy arff-data:
@ATTRIBUTE 'var_0032' real
@ATTRIBUTE 'id' real
@ATTRIBUTE 'outlier' {'no','yes'}
@DATA
0.185185185185,1.0,'no'
0.0740740740741,2.0,'no'
But I get the failure in 0.6.5:
Invalid quoted line in input: no closing quote found in: @ATTRIBUTE 'outlier' {'no','yes'}
Task failed
java.lang.NullPointerException
at de.lmu.ifi.dbs.elki.visualization.VisualizerContext.processNewResult(VisualizerContext.java:300)
at de.lmu.ifi.dbs.elki.visualization.VisualizerContext.<init>(VisualizerContext.java:141)
at de.lmu.ifi.dbs.elki.visualization.VisualizerParameterizer.newContext(VisualizerParameterizer.java:193)
at de.lmu.ifi.dbs.elki.visualization.gui.ResultVisualizer.processNewResult(ResultVisualizer.java:116)
at de.lmu.ifi.dbs.elki.workflow.OutputStep.runResultHandlers(OutputStep.java:70)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:120)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:60)
at [...]
In 0.6.0 this just seems to be a warning:
Invalid quoted line in input: no closing quote found in: @ATTRIBUTE 'outlier' {'no','yes'}
It still produces the ROC curve.
Should I be worried?
Should I change my arff file, and how?
The ARFF file format (https://weka.wikispaces.com/ARFF+%28developer+version%29) doesn't use quotes there.
@RELATION example
@ATTRIBUTE var_0032 NUMERIC
@ATTRIBUTE id NUMERIC
@ATTRIBUTE outlier {no,yes}
@DATA
0.185185185185,1.0,no
0.0740740740741,2.0,no
Also, if your id column is really an id, don't give it the real datatype (which is only an alias for numeric). It's not a numerical column, and if you aren't careful it may be misused in the analysis.
So it may be better to use something like this:
@RELATION example
@ATTRIBUTE var_0032 NUMERIC
@ATTRIBUTE id STRING
@ATTRIBUTE class {no,yes}
@DATA
0.185185185185,'1',no
0.0740740740741,'2',no
to get a proper ARFF file. I haven't tested this; does it work better?
First of all, definitely use 0.6.5. ELKI is not at a 1.0 release yet, there are bugs. They will not be fixed in old versions, only in the new version, because we still need to be able to do larger API changes. Essentially, there should be no reason to use anything but the latest version. ELKI 0.7 will appear end of August at VLDB 2015.
ARFF is not used a lot. There may be errors in the parser, and ARFF support for categorical data is very limited right now. The strengths of the ARFF format show when you have lots of categorical attributes, but that is mostly used in classification, and ELKI doesn't include many classification algorithms yet (since Weka is a strong tool for that already, we focus on algorithms that are not available/good in Weka).
Batik errors like this are usually due to NaN or infinite values. There are still some errors in the visualization code because SVG doesn't give good type safety, unfortunately. You can easily build SVG documents that are invalid, or that contain invalid characters such as ∞ in some coordinate, and then the Batik renderer will fail with such an error message.
What exactly are you trying to do? It looks a bit as if you are trying to compute the ROC curve for the existing output of an algorithm. I don't think there is an easy way to read an ARFF file containing (score, id, label) rows and compute a ROC curve using the MiniGUI. It's not hard to do in Java code, but it's not a use case of the KDD process workflow of the UI.
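To illustrate how little code the ROC computation itself needs once the (score, label) pairs are in memory, here is a minimal sketch in Python. It does not use ELKI's API; the function names are made up, and it assumes a higher score means "more likely an outlier".

```python
def roc_points(scores, labels):
    """Return ROC curve points (fpr, tpr), sweeping the threshold from high to low.

    labels are True for outliers (the positive class); tied scores are
    processed together so they produce a single point.
    """
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

pts = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(pts, auc(pts))
```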
Can I determine if the user entered a phone number that can be safely formatted into E164?
For Germany, this requires that the user started his entry with a local area code. For example, 123456 may be a subscriber number in his city, but it cannot be formatted into E164, because we don't know his local area code. Then I would like to keep the entry as it is. In contrast, the input 089123456 is independent of the area code and could be formatted into E164, because we know he's from Germany and we could convert this into +4989123456.
You can simply convert your number into E164 using libphonenumber and, after conversion, check whether the two strings are the same. If they are the same, the number cannot be formatted; otherwise the string you get from the library is the number formatted in E164.
Here's how you can convert:
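PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
// format() takes a parsed PhoneNumber; "DE" is the default region for input without a country code
PhoneNumber number = phoneUtil.parse(inputNumber, "DE"); // throws NumberParseException
String formattedNumber = phoneUtil.format(number, PhoneNumberFormat.E164);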
Finally compare formattedNumber with inputNumber
It looks as though you'll need to play with isValidNumber and isPossibleNumber for your case. format is certainly not guaranteed to give you something actually dialable, see the javadocs. This is suggested by the demo as well, where formatting is not displayed when isValidNumber is false.
I also am dealing with this FWIW. In the context of US numbers: The issue is I'd like to parse using isPossibleNumber in order to be as lenient as possible, and store the number in E164. However then we accept, e.g. +15551212. This string itself even passes isPossibleNumber despite clearly (I think) not being dialable anywhere.
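The rule described in the question (a leading 0 means the area code is present, so the number can be completed with the country code) can be sketched without any library. This is a deliberately simplified illustration for German numbers, not a replacement for libphonenumber's validation; the function name and defaults are hypothetical.

```python
def to_e164_or_keep(raw, country_code="49"):
    """Return an E.164 string when the input carries an area code, else the input unchanged.

    Simplified rule taken from the question (German numbers): a leading '0'
    is the trunk prefix, so the area code is present; without it the entry
    is only a local subscriber number and cannot be completed safely.
    """
    text = raw.strip()
    digits = "".join(ch for ch in text if ch.isdigit())
    if text.startswith("+"):
        return "+" + digits                      # already international
    if digits.startswith("00"):
        return "+" + digits[2:]                  # international call prefix "00"
    if digits.startswith("0"):
        return "+" + country_code + digits[1:]   # drop trunk prefix, add country code
    return raw                                   # local subscriber number: keep as entered

print(to_e164_or_keep("089123456"))  # +4989123456
print(to_e164_or_keep("123456"))     # 123456
```

The sketch does not check lengths or whether the area code exists, so the result is not guaranteed to be dialable.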
I'm trying to generate UUIDs with the same style as bit.ly urls like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller
how can I do it?
I'm now using the UUID gem for Ruby, but I'm not sure if it's possible to limit the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0] => b9386070
But I would like something even shorter that I know will be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link
You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36-character string (32 hexadecimal digits and 4 hyphens). You cannot chop off the first 8 characters and expect them to be unique.
Bitly, tinyurl et al. store links and generate a short code to represent each link. They do not reconstruct the URL from the code; they look it up in a data store and return the corresponding URL. These are not UUIDs.
Without knowing your application it is hard to advise on what method you should use, however you could store whatever you are pointing at in a data-store with a numeric key and then rebase the key to base32 using the 10 digits and 22 lowercase letters, perhaps avoiding the obvious typo problems like 'o' 'i' 'l' etc
EDIT
On further investigation, there is a Ruby base32 gem available that implements Douglas Crockford's Base 32 encoding.
A 5 character Base32 string can represent over 33 million integers and a 6 digit string over a billion.
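As a rough illustration of rebasing a numeric key, here is a small Python sketch using Crockford's Base 32 alphabet, which leaves out I, L, O and U. The function names are made up. Unlike the full Crockford scheme, this sketch does not map look-alike characters (such as O to 0) when decoding; it only accepts upper or lower case.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford Base 32: no I, L, O, U

def encode(n):
    """Encode a non-negative integer as a Base 32 string."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 32)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode(code):
    """Decode a Base 32 string (either case) back to the integer."""
    n = 0
    for ch in code.upper():
        n = n * 32 + ALPHABET.index(ch)  # raises ValueError on I, L, O, U
    return n

print(encode(32**5 - 1))             # ZZZZZ, the largest 5-character code
print(decode(encode(6175601989)))    # 6175601989
```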
If you are working with numbers, you can use the built in ruby methods
6175601989.to_s(30)
=> "8e45ttj"
to go back
"8e45ttj".to_i(30)
=>6175601989
So you don't have to store anything, you can always decode an incoming short_code.
This works ok for proof of concept, but you aren't able to avoid ambiguous characters like: 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and transfer from one medium to another, like reading it on someone's presentation slide, or hearing it over the phone. If you need to avoid characters that are hard to read or hard to 'hear', you might need to switch to a process where you generate an acceptable code, and store it.
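The "generate an acceptable code and store it" approach could look roughly like this in Python. The alphabet, the function name, and the dict standing in for a database table are all assumptions for illustration; in a real application the uniqueness check would be a unique index in the data store.

```python
import secrets

# Assumption: an alphabet without the easily confused characters 0, 1, i, j, l, o.
SAFE_ALPHABET = "23456789abcdefghkmnpqrstuvwxyz"

def new_short_code(store, target, length=6, max_tries=100):
    """Pick a random unused code, record code -> target in store, and return the code."""
    for _ in range(max_tries):
        code = "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))
        if code not in store:
            store[code] = target
            return code
    raise RuntimeError("could not find a free code; consider a longer length")

store = {}  # stand-in for a database table with a unique index on the code
code = new_short_code(store, "https://example.com/some/long/path")
print(code, "->", store[code])
```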
I found this to be short and reliable:
def create_uuid(prefix=nil)
  time = (Time.now.to_f * 10_000_000).to_i
  jitter = rand(10_000_000)
  key = "#{jitter}#{time}".to_i.to_s(36)
  [prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed unique (use SecureRandom.uuid for that), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000
The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.