Segment multilanguage parallel text - algorithm

I have multi-language text that contains a message translated to several languages.
For example:
English message
Russian message
Ukrainian message
The order of the languages is not fixed.
I would like to devise some kind of supervised/unsupervised learning algorithm to do the segmentation automatically, and extract each translation in order to create a parallel corpus of data.
Could you suggest any papers/approaches?
I am not able to get the proper keywords for googling.

The most basic approach to your problem would be to generate a bag of words from your document. In short, a bag of words is a matrix where each row is a line in your document and each column is a distinct term.
For instance, if your document looks like this:
hello world
привет мир
привіт світ
You will have this matrix:
   | hello | world | привет | мир | привіт | світ
l1 | 1     | 1     | 0      | 0   | 0      | 0
l2 | 0     | 0     | 1      | 1   | 0      | 0
l3 | 0     | 0     | 0      | 0   | 1      | 1
You can then apply clustering or classification algorithms (such as k-means or SVMs) according to your needs.
For more details, I would suggest reading this paper, which provides a great summary of techniques.
Regarding keywords for googling, I would say text analysis, text mining or information retrieval are a good start.
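For illustration, here is a minimal sketch of the bag-of-words approach above in Python with scikit-learn (the sample lines are the ones from the example; clustering with k-means and n_clusters=3 assumes you already know three languages are present):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# One row per line of the document
lines = ["hello world", "привет мир", "привіт світ"]

# Build the bag-of-words matrix: rows = lines, columns = distinct terms
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lines)

# Cluster the lines; n_clusters=3 assumes three languages in the document
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for line, label in zip(lines, labels):
    print(label, line)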

Why not try some language identification software? These tools report >90% accuracy and can be applied line by line, as sketched below:
langid.py https://github.com/saffsd/langid.py
TextCat http://odur.let.rug.nl/~vannoord/TextCat/
Linguine http://www.jmis-web.org/articles/v16_n3_p71/index.html
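For example, a minimal sketch with langid.py (the sample lines are the ones from the question above; the returned labels are ISO 639-1 language codes):

import langid  # pip install langid

lines = ["hello world", "привет мир", "привіт світ"]
for line in lines:
    lang, score = langid.classify(line)  # e.g. ('en', <confidence score>)
    print(lang, line)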


Missing results after reducing the visualization size

I would like to count identical log messages in Kibana. With the Size set to 200, it turns out that there are two results that occurred twice.
But if I lower the Size to 5, I don't see those two.
It should show me the top 5 rows, ordered by count. I expected something like this:
| LogMessage | Count |
|------------|-------|
| xx | 2 |
| yy | 2 |
| zz | 1 |
| qq | 1 |
| ww | 1 |
What am I missing?
The issue is the little warning about Analyzed Field. You should use a keyword field.
With analyzed fields, the analyzer breaks the original string down during indexing into sub-strings to facilitate search use cases (handling things like word boundaries, punctuation, case insensitivity, declension, etc.).
A keyword field is just a simple string.
What's probably happening is that you have data like this:
| LogMessage | Count |
|------------|-------|
| a | 1 |
| b | 1 |
| c x | 1 |
| d x | 1 |
With an analyzed field, a terms aggregation of size 2 might (depending on the sort order) return a and b.
With a larger terms aggregation, the top term will be x, since the analyzer splits "c x" and "d x" into separate sub-strings.
This is a simplified example, but I hope it gets the issue across.
The Terms Aggregation docs have a good section about how to avoid/solve this issue.
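For reference, the kind of terms aggregation Kibana builds for a keyword sub-field looks roughly like this (the LogMessage.keyword field name is an assumption; shown here as a Python dict that could be sent as the search body):

# Hypothetical field name; the point is aggregating on the keyword
# sub-field instead of the analyzed text field.
query_body = {
    "size": 0,
    "aggs": {
        "top_messages": {
            "terms": {
                "field": "LogMessage.keyword",  # keyword, not the analyzed field
                "size": 5,
                "order": {"_count": "desc"},
            }
        }
    },
}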

Is there any calibration tool between two languages' performance?

I'm measuring the performance of two programs, A and B. A is written in Golang, B is written in Python. The important point here is that I'm interested in how the performance value increases over time, not in the absolute performance values of the two programs.
For example,
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 3 | 500 |
+------+-----+-----+
| 2 | 5 | 800 |
+------+-----+-----+
| 3 | 9 | 1300|
+------+-----+-----+
| 4 | 13 | 1800|
+------+-----+-----+
The values in columns A and B (A: 3, 5, 9, 13 / B: 500, 800, 1300, 1800) are the execution times of each program. This execution time can be seen as performance, and the difference between the absolute performance values of A and B is very large. Therefore, comparing the slopes of the two performance graphs directly would be meaningless (Python is very slow compared to Golang).
I want to compare the performance of program A, written in Golang, with program B, written in Python, and I'm looking for a calibration tool or formula, based on benchmarks, that estimates the execution time program A would have if it were written in Python.
Is there any way to solve this problem?
If you are interested in the relative change, you should normalize the data for each programming language. In other words, divide the Golang values by 3 and the Python values by 500 (the first measurement of each series):
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 1 | 1 |
+------+-----+-----+
| 2 | 1.66| 1.6 |
+------+-----+-----+
| 3 | 3 | 2.6 |
+------+-----+-----+
| 4 |4.33 | 3.6 |
+------+-----+-----+
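A minimal sketch of that normalization in Python (the numbers are the ones from the question):

# Divide each series by its first measurement so both programs are
# compared on relative growth rather than absolute runtime.
a = [3, 5, 9, 13]           # Golang execution times
b = [500, 800, 1300, 1800]  # Python execution times

norm_a = [v / a[0] for v in a]  # [1.0, 1.67, 3.0, 4.33]
norm_b = [v / b[0] for v in b]  # [1.0, 1.6, 2.6, 3.6]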

Looping through a subset of an Oracle Collection

I would like some help with a bit of recursive code I need to traverse a graph stored as a collection in PL/SQL.
---------
|LHS|RHS|
---------
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
| 3 | 5 |
---------
Assuming 1 is the start node, I would like to be able to find 2-3 and 2-4 without looping through the entire collection to check each LHS. I know one solution is to use a global temporary table instead of a collection, but I would really like to avoid reading and writing to and from disk if at all possible.
Edit: The expected output for the above example would be XML like this:
<1>
  <2>
    <3>
      <5>
      </5>
    </3>
    <4>
    </4>
  </2>
</1>
Thanks.
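To make the intent concrete, here is the shape of the traversal sketched in Python rather than PL/SQL (purely illustrative; the edge data and expected output are the ones above, and the idea is to index the collection by LHS once so each node's children are looked up directly instead of scanning every row):

edges = [(1, 2), (2, 3), (2, 4), (3, 5)]

# Build a one-time index: LHS -> list of RHS children
children = {}
for lhs, rhs in edges:
    children.setdefault(lhs, []).append(rhs)

def emit(node):
    # Recursively emit nested tags for a node and its children
    inner = "".join(emit(child) for child in children.get(node, []))
    return "<{0}>{1}</{0}>".format(node, inner)

print(emit(1))  # <1><2><3><5></5></3><4></4></2></1>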

Best suitable Machine Learning algorithm for classifying bank transactions

I am new to the field of Machine Learning, so I hope there are people here who can help me with my case...
I want to apply Machine Learning to bank transactions in order to determine whether a particular transaction belongs to groceries, insurance, mortgage, etc.
A typical transaction looks something like this:
accountnumber | counterpart accountnumber | amount | type | code | date | time | description
12345678 | 09876543 | 100.00 | c | bg | 01-02-2001 | 10:01:22 | Hema hermanruiterstraat pasnr:78 10:01:22
12345678 | 12343278 | 45.95 | d | ba | 02-02-2001 | 18:34:54 | Albert Hein janvangalenstraar
I looked at Naive Bayes classifiers but I am not very pleased with the results I got. I trained my model on 5 attributes:
amount
type (c = 1, d = 2)
code (bg = 1, ba = 2)
date (converted to long)
time (converted to long)
My question is which algorithm would be best to classify these transactions? If possible please provide some background about the algorithm of choice.
cheers!
Martijn
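For reference, a minimal sketch of the setup described above using scikit-learn (the numeric encoding follows the question; the rows and category labels are made up for illustration):

from sklearn.naive_bayes import GaussianNB

# Features per the question: amount, type (c=1, d=2), code (bg=1, ba=2),
# date and time converted to numbers. Labels are hypothetical categories.
X = [
    [100.00, 1, 1, 20010201, 100122],
    [45.95,  2, 2, 20010202, 183454],
]
y = ["department_store", "grocery"]  # made-up labels

model = GaussianNB().fit(X, y)
print(model.predict([[60.00, 2, 2, 20010203, 120000]]))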

Simulate output with 3 cases

Is it physically possible to simulate such a situation on a board, using electronic components?
I have 2 inputs, A and B, each with 3 possible values (-1, 0, 1). My final aim is to achieve the following truth table:
 A |  B | result
-1 | -1 | +1
-1 | +1 |  0
 0 |  0 |  0
 0 | +1 | +1
+1 | -1 |  0
+1 |  0 | +1
+1 | +1 | -1
In pseudo code:
if (A equals B)
    result = A * -1
else
    result = A + B
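A quick Python check (not part of the original question) that the rule matches the truth table:

def result(a, b):
    # Equal inputs: flip the sign; otherwise: sum the inputs
    return -a if a == b else a + b

table = [(-1, -1, 1), (-1, 1, 0), (0, 0, 0), (0, 1, 1),
         (1, -1, 0), (1, 0, 1), (1, 1, -1)]
for a, b, expected in table:
    assert result(a, b) == expected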
Yes, it is absolutely possible, and this is what today's CPUs are built from: so-called logic gates.
Of course it depends on your project, but you probably won't need an Intel processor to do this; much simpler components can do just that. See the link above for examples of components that do it.
