Running Sudoku Solver Example in Hadoop

I am trying to run the sudoku solver provided with the Hadoop examples jar file. I am not sure of the format in which the input is supposed to be given. Can anybody guide me?
Thanks,
Tapan

From the documentation:
The sudoku solver is so fast, I didn't bother making a distributed
version. (All of the puzzles that I've tried, including a 42x42, have
taken around a second to solve.) On the command line, give the solver
a list of puzzle files to solve. Puzzle files have a line per row
and columns separated by spaces. The squares either have numbers or
'?' to mean unknown.
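For example, a hypothetical 9x9 puzzle file in that format (one row per line, columns separated by spaces, '?' for unknown squares) would look something like this:

? 2 3 4 ? 6 7 8 9
4 ? 6 7 8 9 1 ? 3
7 8 ? 1 2 3 ? 5 6
2 3 4 ? 6 7 8 9 ?
5 6 7 8 ? 1 2 3 4
? 9 1 2 3 ? 5 6 7
3 ? 5 6 7 8 9 ? 2
6 7 8 ? 1 2 3 4 5
9 1 ? 3 4 5 ? 7 8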

Related

Plink - reverse complement

I am using Plink for the first time and am checking my data against some previously genotyped samples (these are the same samples, so the genotypes should match up).
My data is nearly correct, in that it has called homs and hets correctly, but for some SNPs my data has the reverse complement.
What command do I need in Plink to tell it to call the reverse complement when needed?
I think I have got to the source of the problem, should anyone else stumble across this post with the same issue.
I was using an Illumina SNP chip. Apparently there is some discrepancy in how the strand is defined during post-processing of genotype data. These two links explain it beautifully:
http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
https://academic.oup.com/bioinformatics/article/33/15/2399/3204987
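If the mismatches really are strand flips, one way to handle them (a sketch, not from the original answer; it assumes a PLINK binary fileset named mydata and a hypothetical file snps_to_flip.txt listing the affected SNP IDs, one per line) is PLINK's --flip option, which swaps the alleles of the listed SNPs to their reverse complement:

plink --bfile mydata --flip snps_to_flip.txt --make-bed --out mydata_flipped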

Algorithm for a one-character checksum

I am desperately searching for an algorithm to create a checksum that is at most two characters long and can detect transposed characters in the input sequence. When testing different algorithms, such as Luhn, CRC-24 or CRC-32, the checksums were always longer than two characters. If I truncate the checksum to two or even one character, then not all transpositions are detected any more.
Does any of you know an algorithm that meets my needs? Even just a name with which I can continue my search would help. I would be very grateful for your help.
Given that your data is alphanumeric, that you want to detect all transpositions (in the perfect case), and that you can afford a binary checksum (i.e. the full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by @Paul Hankin in the comments), as it is more information-dense than check-digit algorithms like Luhn or Damm, and more "generic" when it comes to the possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT); there are online calculators where you can give it a try to see how it works for you.
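A minimal Python sketch of CRC-16-CCITT (the common "CCITT-FALSE" parameters: polynomial 0x1021, initial value 0xFFFF; a bit-by-bit implementation, not tuned for speed):

def crc16_ccitt(data: bytes) -> int:
    # CRC-16-CCITT ("CCITT-FALSE"): polynomial 0x1021, initial value 0xFFFF.
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Transposing two characters changes the checksum:
print(hex(crc16_ccitt(b"AB12CD")))
print(hex(crc16_ccitt(b"AB21CD")))  # differs from the line above

The 16-bit result fits in two bytes ("two characters") if binary storage is acceptable. If the checksum must itself be two printable alphanumeric characters, you only get 36^2 = 1296 distinct values, so some errors will inevitably collide, whatever algorithm you pick.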

What is the relationship between LCS and string similarity?

I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed that tool. For example, the authors say that it is based on the GNU diff C library and analyze.c; maybe that refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation with the article. From what I read, the article shows an algorithm for finding the LCS (longest common subsequence) of a pair of strings, using a modification of the dynamic programming algorithm used for solving this problem. The modification is the use of a shortest-path algorithm to find the LCS with the minimum number of modifications.
At this point I am lost, because I do not know how the authors of the tool I first mentioned used the LCS to find how similar two sequences are. Also, they have put a limit value of 0.4; what does that mean? Can anybody help me with this, or have I misunderstood the article?
Thanks
I think the description on the string similarity tool is not being entirely honest, because I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that those characters that are not in the LCS are precisely the characters you would need to add, delete or substitute in order to convert the first string to the second, and the number of these differing characters is usually used as the difference score, as in the Levenshtein Distance.
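I don't know the exact formula the tool uses, but a common normalization (the same shape as Python's difflib.SequenceMatcher.ratio()) is 2 * LCS(a, b) / (len(a) + len(b)): 1.0 for identical strings, 0.0 for strings with nothing in common. A minimal sketch:

def lcs_length(a: str, b: str) -> int:
    # Classic dynamic programming, O(len(a) * len(b)) time, O(len(b)) space.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Characters outside the LCS are the ones you would have to edit.
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))

print(similarity("kitten", "sitting"))  # ~0.62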

Generation of sudoku questions

I am making a sudoku game. My problem is generating the sudoku puzzles. I want to generate puzzles at three difficulty levels. Any ideas on how to generate them?
If you go for pre-generated sudoku puzzles, maybe you could have a look at this:
http://www.setbb.com/phpbb/viewtopic.php?t=102&mforum=sudoku
We used the terminal sudoku program shipped with Linux distributions; it has a batch generator mode. Its website is down, but it is still packaged for some Linux distributions.
Generate puzzles for each level (easy, medium and hard):
sudoku -fcompact -ceasy -g5 > sudoku_easy.txt
sudoku -fcompact -cmedium -g5 > sudoku_medium.txt
sudoku -fcompact -chard -g5 > sudoku_hard.txt
Solve the puzzles:
sudoku -fcompact -v sudoku_easy.txt > sudoku_easy-resolved.txt
sudoku -fcompact -v sudoku_medium.txt > sudoku_medium-resolved.txt
sudoku -fcompact -v sudoku_hard.txt > sudoku_hard-resolved.txt
I checked some of them and they had only one solution.
Generate full (filled) sudokus, and before printing a sudoku out, make some percentage of the fields empty again for the human to fill.
Select random fields to empty, and raise the percentage of empty fields with each difficulty level; see the sketch below.
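A minimal Python sketch of that approach. The per-level percentages are made up, and note that blanking random cells does not guarantee the puzzle has a unique solution:

import random

def full_grid():
    # A valid completed sudoku built from a shifted base pattern.
    return [[(i * 3 + i // 3 + j) % 9 + 1 for j in range(9)] for i in range(9)]

def relabel(grid):
    # Renaming the digits with a random permutation keeps the grid valid.
    digits = list(range(1, 10))
    random.shuffle(digits)
    return [[digits[v - 1] for v in row] for row in grid]

LEVELS = {"easy": 0.40, "medium": 0.55, "hard": 0.65}  # hypothetical percentages

def make_puzzle(level):
    solution = relabel(full_grid())
    puzzle = [row[:] for row in solution]
    cells = [(r, c) for r in range(9) for c in range(9)]
    for r, c in random.sample(cells, round(81 * LEVELS[level])):
        puzzle[r][c] = 0  # 0 marks an empty field
    return puzzle, solution

puzzle, solution = make_puzzle("medium")
for row in puzzle:
    print(" ".join(str(v) if v else "?" for v in row))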

Word Suggestion program

Suggest a program or a way to handle a word correction / suggestion system.
- Let's say the input is given as 'Suggset'; it should suggest 'Suggest'.
Thanks in advance. I'm using Python and AJAX. Please don't suggest any jQuery modules, because I need the algorithmic part.
The algorithm that solves your problem is called "edit distance". Given a list of words in some language and a mistyped/incomplete word, you need to build a list of words from the given dictionary closest to it. For example, the distance between "suggest" and "suggset" is equal to 2 - you need one deletion and one insertion. As an optimization you can assign different weights to each operation - for example, you can say that substitution is cheaper than deletion, and that substitution between two letters that lie close on the keyboard (for example 'v' and 'b') is cheaper than between those that are far apart (for example 'q' and 'l').
The first description of an algorithm for spelling correction appeared in 1964. In 1974 an efficient algorithm based on dynamic programming appeared in the paper "The String-to-String Correction Problem" by Robert A. Wagner and Michael J. Fischer. Any algorithms book has a more or less detailed treatment of it.
For Python there is a library to do that: the Levenshtein distance library
Also check this earlier discussion on Stack Overflow
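A minimal sketch of the dynamic-programming (Wagner-Fischer) edit distance, with a toy hypothetical dictionary; the per-operation weights described above would replace the constant 1 costs:

def edit_distance(a: str, b: str) -> int:
    # Wagner-Fischer dynamic programming, O(len(a) * len(b)), row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

words = ["banana", "suggest", "suggestion"]  # hypothetical dictionary
print(edit_distance("suggset", "suggest"))                    # 2
print(min(words, key=lambda w: edit_distance("suggset", w)))  # suggest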
It will take a lot of work to make one of those yourself. There is a spell checker library written in Python called PyEnchant that I've found to be quite nice. Here's an example from their website:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
