Remove Signature from Received Message - sms

I have a python script that receives text messages from users, and processes them as a query. However, some users have signatures automatically appended to their messages, and the script incorrectly treats them as actual content. What's the best programmatic way to recognize and remove these signatures?
(I'd prefer Python, but I'm fine with any other language, or with pseudocode.)

If the signatures are appended to the body of the message such that they're actually part of the body text, then there are only two ways to remove them:
Heuristics, such as "anything following three dashes must be a signature". These may be effective if you spend some time tuning them (see the sketch after this list).
A classifier. This is a lot of work to set up, and requires that you "train" it by marking some message parts as signatures. These can also be very effective, but like heuristics will never work 100% of the time.
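For what it's worth, here is a minimal Python sketch of the heuristic option; the marker patterns and the strip_signature name are made up for illustration and would need tuning against your real traffic:

import re

# Hypothetical markers that often introduce a signature block.
SIGNATURE_MARKERS = [
    r"^--\s*$",          # the conventional "-- " delimiter on its own line
    r"^-{3,}\s*$",       # three or more dashes
    r"^Sent from my ",   # common mobile-client footer
]

def strip_signature(body):
    """Return the message text with anything after a signature marker removed."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if any(re.match(pattern, line) for pattern in SIGNATURE_MARKERS):
            return "\n".join(lines[:i]).rstrip()
    return body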

If the signature always follows a specific pattern, you should be able to just use a regular expression to trim it off.
However, if the user can set up their signature any way they wish, and there are no leading characters (e.g. "--" at the beginning), this is going to be very difficult. The only reliable way to do it would be to know the content of each user's signature in advance so you can strip it out. Imagine a worst-case scenario: somebody could always send a blank message, with a signature that was a fully valid "query". There would be no way for the script to differentiate that from a "query" message with no signature.
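For instance, if every signature in your data happens to start with a "--" line, a one-line regular expression is enough; this is just a sketch in Python under that assumption:

import re

message = "What time does the store open?\n--\nJohn Smith\nSent from my phone"

# Drop everything from a line consisting of "--" onwards (illustrative pattern only).
query = re.sub(r"\n--\s*\n.*", "", message, flags=re.DOTALL).rstrip()
print(query)  # -> What time does the store open?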

Related

Validating FirstName in a web application

I do not want to be too strict as there may be thousands of possible characters in a possible first name
Normal English letters, accented letters, non-English letters, numbers (??), common punctuation symbols
e.g.
D'souza
D'Anza
M.D. Shah (dots and space)
Al-Rashid
Jatin "Tom" Shah
However, I do not want to accept HTML tags, semicolons, etc.
Is there a list of such characters which are absolutely bad from a web application perspective?
I can then use RegEx to blacklist these characters
Background on my application
It is a Java Servlet-JSP based web app.
Tomcat on Linux with MySQL (and sometimes MongoDB) as a backend
What I have tried so far
String regex = "[^<>~##$%;]*";
if (!fname.matches(regex))
    throw new InputValidationException("Invalid FirstName");
My question is more on the design than the coding ... I am looking for an exhaustive (well, to a good degree of exhaustiveness) list of characters that I should blacklist.
A better approach is to accept anything anyone wants to enter and then escape any problematic characters in the context where they might cause a problem.
For instance, there's no reason to prohibit people from using <i> in their names (although it might be highly unlikely that it's a legit name), and it only poses a potential problem (XSS) when you are generating HTML for your users. Similarly, disallowing quotes, semicolons, etc. only makes sense in other scenarios (SQL queries, etc.). If the rules are different in different places and you want to sanitize input, then you need all the rules in the same place (what about whitespace? Are you going to create filenames including the user's first name? If so, maybe you'll have to add that to the blacklist).
Assume that you are going to get it wrong in at least one case: maybe there is something you haven't considered for your first implementation, so you go back and add the new item(s) to your blacklist. You still have users who have already registered with tainted data. So, you can either run through your entire database sanitizing the data (which could take a very very long time), or you can just do what you really have to do anyway: sanitize data as it is being presented for the current medium. That way, you only have to manage the sanitization at the relevant points (no need to protect HTML output from SQL injection attacks) and it will work for all your data, not just data you collect after you implement your blacklist.
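As a rough illustration of "escape at the point of use" (sketched in Python rather than the asker's Servlet/JSP stack; html.escape and the parameterized query simply stand in for whatever templating and JDBC mechanisms you actually use):

import html
import sqlite3

first_name = 'Jatin "Tom" <i>Shah</i>'   # stored exactly as the user entered it

# Escaping only matters at the moment you render HTML...
cell = "<td>{}</td>".format(html.escape(first_name))

# ...and SQL injection is handled by parameterized queries, not by a character blacklist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT)")
conn.execute("INSERT INTO users (first_name) VALUES (?)", (first_name,))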

Converting NSString to keyCode+modifiers for AXUIElementPostKeyboardEvent

Edit: Turns out, I was misled during my initial explorations of the accessibility APIs. Once I found the secure text field in the AX hierarchy, I was easily able to set the value. Not sure what to do with this question beyond that, but I wanted to update this for future searchers.
I'm working on some code that will post keyboard events to targeted applications using the Accessibility APIs. So far, I have been able to write a trivial app that allows me to type in a string value and then post keyboard events with those key codes to the targeted application. In reality, the strings would be read from another location.
What I have not yet been able to figure out is how to ascertain whether and which modifier keys should also be posted. For instance, when I type Hello, world! into my test application, the input is sent to the other application as hello, world1 because I am not yet including the modifier keys to create the upper case H and the exclamation point. This is made doubly complicated by multi-keystroke characters like é or ü. Sending é sends a raw e with no accent for example.
Is there a simple method I am overlooking for discerning the modifiers to combine with a keycode for creating a particular NSString or unichar? If not, does anyone have a suggestion of how to proceed? So far, the best I have come up with is calling UCKeyTranslate with all possible modifier combinations until I find one that matches the unichar I get using -[NSString characterAtIndex:]. I'm not sure this is scalable or reliable, though, given the multi-keystroke nature of some characters as noted above.
Thanks in advance!
This probably won't help. But just in case: Is it really necessary to send keyboard events? Because that is going to get really difficult if you need to support, say, Kotoeri.
It's a simple matter to override insertText: and doCommandBySelector: and send the results of the key sequence, rather than the individual keystrokes.
I have found an example which does the trick, but it's incomplete: it will not be a general solution in any case... how can this handle multiple keyboard layouts?
There is an obsolete Quartz (Core Graphics) function that does this: CGPostKeyboardEvent (not sure whether it's possible to pass only the char?). It may still be usable (it's marked as undocumented, with some side effects, but...).
EDIT: UCKeyTranslate could be used as a way to build a dictionary. Interesting, but how does the OS do this? A better answer must be hidden somewhere!

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones and emails - they could probably be caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a full-text index on resumes, all vulnerable information should be stripped out of them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, the information relating to proper names could probably best be found by searching for items that are either grammatically important or significant, i.e. English capitalizes only the first word of a sentence and proper nouns. For the grammatical rules, you could look for all of the words that have their first letter capitalized and check them against a database that contains the word and its type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use regexes, plus an algorithm to assess the quality of the matches (a small sketch follows this answer).
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.
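A rough Python sketch of the "formulaic" half; the patterns are deliberately simple placeholders (real phone and postcode formats vary by country), and the redact helper is just a name made up for illustration:

import re

# Illustrative patterns only; tune them to the formats you actually see.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
}

def redact(text):
    """Replace anything matching a known personal-data pattern."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub("[{} removed]".format(label), text)
    return text

print(redact("Contact me at jane.doe@example.com or +44 20 7946 0958."))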

Are unescaped user names incompatible with BNF?

I've got a (proprietary) output from a piece of software that I need to parse. Sadly, it contains unescaped user names, and I'm scratching my head trying to work out whether or not I can describe the files I need to parse using a BNF (or EBNF or ABNF).
The problem, oversimplified (it's really just an example), may look like this:
(data) ::= <username>
<username> ::= (other type of data)
And in some case, instead of appearing at the left or at the right, the username can also appear in the middle of a line.
The problem is that the username is unescaped and there are not enough restrictions on user names (they're printable ASCII, max 20 chars, and they can't contain a line break). So "=" would be a perfectly valid username, for example. And so would "= 1 = john = 2" (because users, at sign-on, were allowed to choose any user name they wanted, and these appear unescaped in the output I've got).
I'm asking because my parser choked on some very creative usernames (once again, not in my control; they're "weird" and I need to deal with it) and I cannot find an easy way to deal with this. Also note that I do not know the user names in advance (for example, I don't have access to a database that would contain all the user names that the users created).
So are unrestricted and unescaped user names incompatible with BNF?
P.S: be cool with me if I made mistakes, it's my first post on stackoverflow :)
BNF doesn't "care" about user names per se. It works on the token level. If you define a username token, you can describe a grammar using BNF based on it.
Your problem should be solved on the lexer level. The lexer should be smart enough to recognize user names, even when they're not escaped, and pass username tokens to the parser.
In theory you could describe all kinds of user names with a grammar, but this heavily depends on the other things in your language. Is = a valid token in its own right? If it is, how can you tell apart a username that has = in it? I think you'll have to describe the rest of the rules and valid tokens in your language to get a fuller answer here.
It might be possible to work by recognising things that are not usernames and then declaring everything else a username, even if this means parsing from right to left instead of left to right or doing something equally eccentric.
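A toy Python sketch of that "recognise everything that isn't a username" idea, assuming a made-up line format where a username is followed by "=" and a numeric value; matching greedily so the username ends at the last "=" that precedes a number lets even "= 1 = john = 2" survive as a username:

import re

# Hypothetical line format: <username> = <number>, with the username unescaped.
LINE = re.compile(r"^(?P<username>.*)=\s*(?P<value>\d+)\s*$")

def parse_line(line):
    match = LINE.match(line)
    if match is None:
        raise ValueError("not a <username> = <number> line: %r" % line)
    return match.group("username").strip(), int(match.group("value"))

print(parse_line("= 1 = john = 2 = 42"))   # -> ('= 1 = john = 2', 42)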
It may be worth looking to see if your input is actually ambiguous: can you find two different situations that lead to identical output being generated? If so, you need to go back and get requirements for which of them to favour, or what sort of error to produce, or whatever. If not, the reason why not might help you work out what your parser or lexer or whatever needs to do.

Is it acceptable to normalize text box content when it loses focus?

I have received requirements that ask to normalize text box content when the user changes the focus to another control on the same data input form. Example normalizations:
whitespace at the start and end of the input is trimmed
If the text box was made empty and this is not valid, replace the content of the text box with the default value
I have a feeling that this is not in line with good GUI design. I have read the Windows UX Guidelines for text boxes but I did not immediately find any relevant rules.
Is normalizing text box content in this way acceptable?
I have definitely seen this before (examples elude me right now) but I personally don't like it when the UI changes my input.
If the UI is smart enough to change my input on me then it should accept it as is and change the value when it needs to process it.
When the input changes auto-magically you are now forcing the user to stop and ask themselves why it changed and if they did something wrong or if the application has an error. Don't make the user think!
Generally, you should accept user input exactly as they entered it. Chances are users did it that way for a good reason. For example, imagine a user entering a foreign address, and then your app screwing it up by trying to format it like a domestic address. At the very least, users entered the input the way they're used to seeing it, so changing it can make it hard for them to cross-check it.
However, there are several exceptions:
Add defaults to incomplete input. Adding input the user left off (e.g., years to dates, units to dimensions) provides good feedback on how the app is interpreting the input that would otherwise be ambiguous. This also encourages the user to use defaults, making their input more efficient.
Resolve other ambiguities. Change to an unambiguous format if the user’s format is open to interpretation. For example, if you have international users, you may want to change “9-8-09” to “Sep 8 2009” (or “9 Aug 2009”) to provide feedback on what your app considers the month and day to be.
Add delimiters when none provided. Automagically adding standard or even arbitrary delimiters to long alphanumeric strings (e.g., phone numbers, credit card numbers, serial numbers) provides an input display that the users can crosscheck more easily. Sometimes users may enter a string without delimiters in order to go faster or because they are the victim of web abuse by sites that refuse to accept even standard delimiters.
Spelling, grammar, and capitalization correction. Users often appreciate this, but only if there’s also a means to override it. Some users like to use "i" as the first person pronoun.
If the field is used by more than one user, then you probably should automatically format the value in some standard way that accommodates the majority of your users, but that should be done when the value is stored on the backend, not when focus leaves the field. For example, if a user enters a time of 15:30 it should remain as 15:30 as long as the user views the page. However, the next time a user (any user) retrieves the data, it should appear as 3:30pm (if that’s how most of your users are used to seeing time).
Such backend formatting applies to trimming whitespace so that all users can search, find, and sort on the field consistently. It’s probably not a good idea to replace a blank value (or any invalid value) with the default because users are unlikely to anticipate getting that value. An exception would perhaps be changing blank to 0 for numeric fields in situations where obviously blank == none == zero, but again this probably should be done when storing in the backend, not in the field itself. If blank is ambiguous, (e.g., may mean 0 or may mean "I don't know") then the second bullet above applies, and you may want to autocorrect in the field when focus is lost.
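A trivial sketch of doing that normalization at store time rather than on blur (Python pseudocode for the backend; the function name and example fields are made up):

def normalize_for_storage(raw_value, default=""):
    # Trim whitespace before storing; substitute a default only when a blank
    # unambiguously means none/zero for the field in question.
    value = raw_value.strip()
    return value if value else default

stored_time = normalize_for_storage("  15:30  ")           # -> "15:30"
stored_qty = normalize_for_storage("   ", default="0")     # -> "0"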
Of course, if your users vary in how they need to have a data type formatted, then you can have different variants of the app that display the data type in different ways for different user groups, or you can make the format of the data type a user preference, but that’s really another issue.
If the user wants it, and the stakeholders ask for it, then it is perfectly safe.
Trimming is very common, and the replacement is common when you are talking about filling a textbox with numbers (a 0 instead of a blank).
It's a fairly standard feature, especially the whitespace trimming. The default value replacement raises a larger flag just because it is less common.
I'm pretty sure that I've seen versions of Microsoft Office that do this - putting "pt." after a value in points, for instance. Microsoft's endorsement should be a good sign.
We have quite a few of these kinds of requirements. The reason given for forcing a default value rather than a blank space is that it looks better in reports or when the client wants to see the live system. A blank looks a bit like "couldn't be bothered to enter anything". For a similar reason, we often upper-case the text for consistency, as the users never use consistent formatting.

Resources