Hypothesis parameter in Pocketsphinx

I use Pocketsphinx in my Android app and recognize speech using addGrammarSearch(String name, File file). When there is background noise, Pocketsphinx picks it up and recognizes it as a word or phrase from the grammar, even though the word was never uttered. Is there a parameter in the hypothesis that shows how closely the recognized sound matches a word from the grammar, and which method returns it?
I want to filter out recognized sounds with a low value, that is, ones that do not really resemble a word in the grammar.

This answer has an approach you may find useful -- that is, if you can tolerate using a keyword list instead of a full grammar. (Note that keywords do not have to be single words.) With SpeechRecognizer.addKeywordSearch(), you can set "thresholds" for each keyword, which is critical to reducing false positives.
The thresholds are usually found through experimentation. The closer a threshold is to 1e-50, the more likely you are to get a false positive; the closer it is to 1e0, the more likely you are to miss a valid utterance.
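If the keyword-search route works for you, the setup might look roughly like the sketch below. The acoustic model, dictionary and keywords.list names follow the usual pocketsphinx-android demo layout rather than anything stated in the question, and the phrases and thresholds are placeholders to tune by experiment.

import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.Assets;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

public class KwsSetup {

    // keywords.list (one phrase per line, threshold between slashes), e.g.:
    //   open browser /1e-20/
    //   start recording /1e-30/
    private static final String KWS_SEARCH = "keywords";

    // Hypothetical helper; model and dictionary file names depend on your assets.
    public static SpeechRecognizer createRecognizer(android.content.Context context)
            throws IOException {
        Assets assets = new Assets(context);
        File assetsDir = assets.syncAssets();

        SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
                .getRecognizer();

        // Each phrase in keywords.list carries its own detection threshold,
        // so noise that only loosely resembles a phrase is not reported.
        recognizer.addKeywordSearch(KWS_SEARCH, new File(assetsDir, "keywords.list"));
        recognizer.startListening(KWS_SEARCH);
        return recognizer;
    }
}

The idea is that your RecognitionListener only ever sees hypotheses for phrases that cleared their threshold, so noise that merely resembles a phrase never reaches your code.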

Related

Why do Julia programmers need to prefix macros with the at-sign?

Whenever I see a Julia macro in use, like @assert or @time, I always wonder about the need to distinguish a macro syntactically with the @ prefix. What should I be thinking of when using @ for a macro? For me it adds noise and distraction to an otherwise very nice language (syntactically speaking).
I mean, for me '@' has a meaning of reference, i.e. a location like a domain or address. In the location sense, @ does not have a meaning for macros other than that it marks a different compilation step.
The @ should be seen as a warning sign which indicates that the normal rules of the language might not apply. E.g., a function call
f(x)
will never modify the value of the variable x in the calling context, but a macro invocation
@mymacro x
(or @mymacro f(x) for that matter) very well might.
Another reason is that macros in Julia are not based on textual substitution as in C, but substitution in the abstract syntax tree (which is much more powerful and avoids the unexpected consequences that textual substitution macros are notorious for).
Macros have special syntax in Julia, and since they are expanded after parse time, the parser also needs an unambiguous way to recognise them (without knowing which macros have been defined in the current scope).
ASCII characters are a precious resource in the design of most programming languages, Julia very much included. I would guess that the choice of @ mostly comes down to the fact that it was not needed for something more important, and that it stands out pretty well.
Symbols always need to be interpreted within the context they are used. Having multiple meanings for symbols, across contexts, is not new and will probably never go away. For example, no one should expect #include in a C program to go viral on Twitter.
Julia's Documentation entry Hold up: why macros? explains pretty well some of the things you might keep in mind while writing and/or using macros.
Here are a few snippets:
Macros are necessary because they execute when code is parsed, therefore, macros allow the programmer to generate and include fragments of customized code before the full program is run.
...
It is important to emphasize that macros receive their arguments as expressions, literals, or symbols.
So, if a macro is called with an expression, it gets the whole expression, not just the result.
...
In place of the written syntax, the macro call is expanded at parse time to its returned result.
It actually fits quite nicely with the semantics of the @ symbol on its own.
If we look up the Wikipedia entry for 'At symbol' we find that it is often used as a replacement for the preposition 'at' (yes it even reads 'at'). And the preposition 'at' is used to express a spatial or temporal relation.
Because of that we can use the @ symbol as an abbreviation for the preposition 'at' to refer to a spatial relation, i.e. a location like @tony's bar, @france, etc., to some memory location @0x50FA2C (e.g. for pointers/addresses), or to the receiver of a message (@user0851, which Twitter and other forums use), but also for a temporal relation, i.e. @05:00 am, @midnight, @compile_time or @parse_time.
And since macros are processed at parse time (here you have it), which is totally distinct from the rest of the code that is evaluated at run time (yes, there are many different phases in between, but that's not the point here), we use @ to explicitly direct the programmer's attention to the fact that the following code fragment is processed at parse time, as opposed to run time.
For me this explanation fits nicely in the language.
thanks@all ;)

Efficient differentiation between keywords and identifiers

I am building a compiler, and in the lexical analyzer phase I am following this approach:
Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in Fig. 3.14. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function getToken examines the symbol table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents, either id or one of the keyword tokens that was initially installed in the table.
But now every time I recognize a keyword, I will have to go through the entire symbol table; it's like comparing n elements for every keyword/identifier recognition.
Won't that be too inefficient? What else can I do?
Kindly help.
If you build a finite state automaton to identify lexemes, then its terminal states should correspond to language lexemes.
You can leave keywords out of the FSA, and you'll end up with only a single terminal state for strings that look like identifiers. This is a common implementation when coding the FSA by hand, and it leaves you with the problem you have now. As a practical matter, no matter what you do with keywords, the symbol table needs extremely fast identifier lookup, which pretty much means a hashing solution. If you have that, then you can do a lookup quickly and check your "it must be a keyword" bit. There are plenty of good hash schemes in existence; as usual, Wikipedia's entry on hash functions is a pretty good place to start. This is a practical solution; I use it in my PARLANSE compiler (see my bio), which processes million-line files in a few tens of seconds.
This isn't really the fastest solution, though. It is better to include the keywords in the FSA (this tends to encourage the use of a lexer generator, because adding all the keywords to a manually coded FSA is inconvenient, though not hard). If you do that, and you have keywords that look like identifiers, e.g., goto, there will be terminal states that in effect indicate that you have recognized an identifier that happens to be spelled as a specific keyword.
How you interpret that end state is up to you. One obvious choice is that such end states indicate you have found a keyword. No hash table lookup required.
You can use a hash table for the list of keywords. It makes your search O(1).
You could use a perfect hash like one generated with gperf.
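As a rough sketch of the hash-based lookup the answers above describe (the token names and the installId helper are invented for illustration, not taken from any particular compiler):

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of "install the keywords first, then hash every lexeme".
final class SymbolTable {
    enum TokenKind { ID, IF, ELSE, WHILE }

    // One entry per distinct lexeme; keywords are pre-installed with their token kind.
    private final Map<String, TokenKind> entries = new HashMap<>();

    SymbolTable() {
        entries.put("if", TokenKind.IF);
        entries.put("else", TokenKind.ELSE);
        entries.put("while", TokenKind.WHILE);
    }

    // Expected O(1) per lookup -- no linear scan over the whole table.
    TokenKind installId(String lexeme) {
        return entries.computeIfAbsent(lexeme, l -> TokenKind.ID);
    }
}

Here installId("while") returns WHILE because the keyword was pre-installed, while installId("counter") inserts a fresh entry and returns ID; either way the cost is a single hash lookup rather than a scan over n entries.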

What is an example of a lexical error and is it possible that a language has no lexical errors?

For our compiler theory class, we are tasked with creating a simple interpreter for a programming language of our own design. I am using JFlex and CUP as my generators, but I'm a bit stuck on what a lexical error is. Also, is it recommended that I use the state feature of JFlex? It feels wrong, as it seems like the parser is better suited to handling that aspect. Do you recommend any other tools for creating the language? I'm sorry if I'm impatient, but it's due on Tuesday.
A lexical error is any input that can be rejected by the lexer. This generally results from token recognition falling off the end of the rules you've defined. For example (in no particular syntax):
[0-9]+ ===> NUMBER token
[a-zA-Z] ===> LETTERS token
anything else ===> error!
If you think about a lexer as a finite state machine that accepts valid input strings, then errors are going to be any input strings that do not result in that finite state machine reaching an accepting state.
The rest of your question was rather unclear to me. If you already have some tools you are using, then perhaps you're best to learn how to achieve what you want to achieve using those tools (I have no experience with either of the tools you mentioned).
EDIT: Having re-read your question, there's a second part I can answer. It is possible that a language could have no lexical errors - it's the language in which any input string at all is valid input.
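As a tiny, hand-rolled illustration of the rule set above (this uses plain java.util.regex rather than a JFlex spec, and the token names are made up): any character that no rule matches is reported as a lexical error.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    // Mirrors the example rules: digits -> NUMBER, letters -> LETTERS,
    // whitespace is skipped, anything else is a lexical error.
    private static final Pattern TOKEN =
            Pattern.compile("(?<num>[0-9]+)|(?<word>[a-zA-Z]+)|(?<ws>\\s+)");

    public static void lex(String input) {
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                // No rule applies here: the automaton cannot reach an accepting state.
                throw new IllegalArgumentException(
                        "Lexical error at offset " + pos + ": '" + input.charAt(pos) + "'");
            }
            if (m.group("num") != null)  System.out.println("NUMBER  " + m.group());
            if (m.group("word") != null) System.out.println("LETTERS " + m.group());
            pos = m.end();
        }
    }

    public static void main(String[] args) {
        lex("abc 123");            // tokenises fine
        try {
            lex("abc $ 123");      // '$' matches no rule
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}

Running it prints the NUMBER/LETTERS tokens for the first call and a lexical error at the '$' for the second.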
A lexical error could be a character that is invalid or unacceptable in the language, like '#', which is rejected as a lexical error for identifiers in Java (it's reserved).
Lexical errors are the errors thrown by your lexer when it is unable to continue, which means that there's no way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the other hand, are thrown by your parser when a given sequence of already recognised valid tokens doesn't match any of the right-hand sides of your grammar rules.
it feels wrong as it seems like the parser is better suited to handling that aspect
No. It only seems that way because context-free languages include regular languages (meaning that a parser can do the work of a lexer). But consider that a parser is a stack automaton, and you would be employing extra computer resources (the stack) to recognise something that doesn't require a stack to be recognised (a regular expression). That would be a suboptimal solution.
NOTE: by regular expression, I mean... regular expression in the Chomsky Hierarchy sense, not a java.util.regex.* class.
A lexical error is when the input doesn't belong to any of these lists:
key words: "if", "else", "main"...
symbols: '=','+',';'...
double symbols: ">=", "<=", "!=", "++"
variables: [a-zA-Z]+[0-9]*
numbers: [0-9]*
examples: 9var: error, a number before characters, so not a variable and not a keyword either.
$: error
What I don't know is whether something like more than one symbol in a row is accepted, like "+-".
A compiler can catch an error when it has the grammar in it!
It depends on the compiler itself whether it has the capacity (scope) to catch lexical errors or not.
What types of lexical error are caught, and how they are handled (according to the grammar), is decided during the development of the compiler.
Most well-known and widely used compilers have this capability.

How to do case conversion in Prolog?

I'm interfacing with WordNet, and some of the terms I'd like to classify (various proper names) are capitalised in the database, but the input I get may not be capitalised properly. My initial idea here is to write a predicate that produces the various capitalisations possible of an input, but I'm not sure how to go about it.
Does anyone have an idea how to go about this, or even better, a more efficient way to achieve what I would like to do?
It depends on what Prolog implementation you're using, but there may be library functions you can use.
e.g. from the SWI-Prolog reference manual:
4.22.1 Case conversion
There is nothing in the Prolog standard for converting case in textual data. The SWI-Prolog predicates code_type/2 and char_type/2 can be used to test and convert individual characters. We have started some additional support:
downcase_atom(+AnyCase, -LowerCase)
Converts the characters of AnyCase into lowercase as char_type/2 does (i.e. based on the defined locale if Prolog provides locale support on the hosting platform) and unifies the lowercase atom with LowerCase.
upcase_atom(+AnyCase, -UpperCase)
Converts, similar to downcase_atom/2, an atom to upper-case.
Since this just downcases whatever's passed to it, you can easily write a simple predicate to sanitise every input before doing any analysis.

Are semantics and syntax the same?

What is the difference in meaning between 'semantics' and 'syntax'? What are they?
Also, what's the difference between things like "semantic website vs. normal website", "semantic social networking vs. normal social networking" etc.
Syntax is the grammar. It describes the way to construct a correct sentence. For example, "this water is triangular" is syntactically correct.
Semantics relates to the meaning. "This water is triangular" does not mean anything, though the grammar is fine.
Talking about the semantic web has become trendy recently. The idea is to enhance the markup (structural, with HTML) with additional data so that computers can make sense of web pages more easily.
Syntax is the grammar of a language - the rules by which to form sentences or expressions.
Semantics is the meaning you are trying to express with your code.
A program that is syntactically correct will compile and run.
A program that is semantically correct will actually do what you as the programmer intended it to do. i.e. it doesn't have any bugs in it.
Two programs written to perform the same task in different languages will use different syntaxes, but they would be the same semantically.
If you are talking about web (rather than programming languages):
The syntax of the language is whatever the browser (or processing program) can legally recognize and handle, and render to you. For example, your browser can render HTML, while your API can parse XML trees.
Semantics involve what is actually being represented. There's a lot of buzz now about semantic webs and all that stuff, but it essentially means that each entity is also associated with some human-readable information or metadata, so that a certain tag would have a supposed meaning and refer you to it.
Social networks are the same story: you put the knowledge in the links.
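To make the programming-language side of the distinction concrete, here is a small, hypothetical Java sketch: both methods are syntactically valid and compile, but only one of them is semantically correct with respect to the intent of computing an average.

public class Average {
    // Semantically correct: does what the name promises.
    static double average(double a, double b) {
        return (a + b) / 2.0;
    }

    // Syntactically fine, semantically wrong: operator precedence makes this
    // a + (b / 2.0), which is not the average of a and b.
    static double averageWrong(double a, double b) {
        return a + b / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(average(2, 4));      // 3.0 -- the intended meaning
        System.out.println(averageWrong(2, 4)); // 4.0 -- a bug, not a syntax error
    }
}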
"An ant ate an aunt." has a correct syntax, but will not make sense semantically. A syntax is a set of rules that can be combined to produce infinite number of gramatically valid sentences, but few, very few of which has a semantics.
Syntax is the word order of a sentence. In English it would be the subject-verb-object form.
Semantics is the meaning behind the words. E.g.: "She ate a saw." The word "saw" doesn't fit the meaning of the sentence, but the sentence is grammatically correct, so its syntax is correct. =)
Specifically, semantic social networking means embedding the actual social relationships within the page markup. The standard format for doing this as defined by microformats is XFN, XHTML Friends Network. In regards to the semantic web in general, microformats should be the go-to guide for defining embedded semantic content.
Semantic web sites use the concept of the semantic web, which aims to bring meaning to web content by using special annotations to identify certain concepts in a page. This makes possible the automatic (by a computer, not a human) reasoning about the content, which improves its aggregation, extraction, indexing and searching.
The explanations above are vague on the semantics side. Semantics could mean the different elements at one's disposal for building arguments of value (these being comprehensible to the end user and digestible to the machine).
Of course this puts semantics, and the programmer-editor-writer-communicator, in the middle: he decides on the semantics that should ideally be defined for his public, comprehended by his public, held as a general convention by his public, and digestible to the machine. Semantics should be agreed upon, are conceptual, and must be implementable on both sides.
Say footnotes, inline and block quotes, titles and so on, ending up in a well-defined and finite list. MediaWiki wikitext, as an example, fails in that perspective: it defines syntax for elements whose semantic meaning is left undefined, with no finite list agreed upon. "Meaning by form" comes in addition to whatever a title, to take that example again, carries as textual content. "This is a title" becomes semantics only by supposition within a set of agreed-upon semantics, and there can be more than one such set, say "This is important and will be detailed".
Asciidoc and pandoc markup are quite different in their semantics, regardless of how each translates this, by convention of syntax, to output formats.
In programming, output formats such as HTML, PDF and EPUB can consequently carry meaning by form, by semantics, the syntax having disappeared as a temporary tool of translation. As one more consequence, the output can be scanned robotically for meaning, the playing field of 'grep'-style algorithms: Google. Looking for the meaning of "what" in "What is it that is looked for" depends on whether a title, a footnote or a link is considered.
Semantics (and there can be more than one layer; even the textual message itself carries semantics, in Chomsky's sense) can thus be translated as meaning by form, creating functional differences for everything else in the output chain, including a human being, the reader.
As a conclusion, programmers and academics should be integrated: no academic should be without knowledge of his tools, like any bread-and-butter carpenter, and programmers should be academics in the sense that the other end of the bridge they build is the end user; the bridge itself, very much so, is semantics.
m.
