I am new to the field of Machine Learning, so I hope there are people here who can help me with my case...
I want to apply Machine Learning to bank transactions in order to determine whether a particular transaction belongs to groceries, insurance, mortgage, etc.
A typical transaction looks something like this:
accountnumber | counterpart accountnumber | amount | type | code | date | time | description
12345678 | 09876543 | 100.00 | c | bg | 01-02-2001 | 10:01:22 | Hema hermanruiterstraat pasnr:78 10:01:22
12345678 | 12343278 | 45.95 | d | ba | 02-02-2001 | 18:34:54 | Albert Hein janvangalenstraar
I looked at Naive Bayes classifiers, but I am not very pleased with the results I got. I trained my model on 5 attributes (a simplified sketch of this setup follows the list):
amount
type (c = 1, d = 2)
code (bg = 1, ba = 2)
date (converted to long)
time (converted to long)
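Roughly, my current setup looks like the sketch below (toy data and simplified encoding, not my exact code):

# Simplified sketch of my current setup (toy data, not my real pipeline).
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# A few labelled transactions (category assigned by hand).
df = pd.DataFrame({
    "amount":   [100.00, 45.95, 1200.00],
    "type":     ["c", "d", "d"],
    "code":     ["bg", "ba", "ba"],
    "date":     ["01-02-2001", "02-02-2001", "28-02-2001"],
    "time":     ["10:01:22", "18:34:54", "09:00:00"],
    "category": ["grocery", "grocery", "mortgage"],
})

# Encode the five attributes as described above.
df["type"] = df["type"].map({"c": 1, "d": 2})
df["code"] = df["code"].map({"bg": 1, "ba": 2})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y").map(pd.Timestamp.toordinal)
df["time"] = pd.to_timedelta(df["time"]).dt.total_seconds()

X = df[["amount", "type", "code", "date", "time"]]
y = df["category"]

model = GaussianNB().fit(X, y)   # Gaussian Naive Bayes on the numeric features
print(model.predict(X))          # sanity check on the training data itself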
My question is: which algorithm would be best for classifying these transactions? If possible, please provide some background on the algorithm of choice.
cheers!
Martijn
Let's say I have multiple columns coming from two different files, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
And another one like this:
USERNAME | AGE |
Jonathan | 33 |
Mike | 41 |
And I want to merge the data from the columns that have the same name, while keeping the data from the columns that are unique to each file, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
Jonathan | 33 | |
Mike | 41 | |
Sorry if the answer is obvious; I'm new to Talend. Thanks.
What tool is available to you?
The APPEND procedure in SAS, for example, can do this for you.
You can use the same append approach in Python, R, or any other language you intend to use.
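For instance, a minimal pandas sketch (the file names are just placeholders for your two inputs):

# Minimal append sketch: pandas aligns columns with the same name and
# leaves the columns missing from one file empty (NaN) for that file's rows.
import pandas as pd

df1 = pd.read_csv("file1.csv")   # USERNAME, AGE, GENDER, CHILDREN
df2 = pd.read_csv("file2.csv")   # USERNAME, AGE

merged = pd.concat([df1, df2], ignore_index=True, sort=False)
merged.to_csv("merged.csv", index=False)
print(merged)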
For Talend:
Copy the complete "subjob1 – copy me" sub job and paste it to create a second sub job.
Link the two sub jobs using an onSubjobOK link.
Open tFixedFlowInput and change "Records from first subjob" to "Records from second subjob".
Open tFileOutputDelimited on the new sub job and tick Append.
Use a tUnite component to accomplish that.
Here is the link to the documentation: https://help.talend.com/r/fr-FR/8.0/orchestration/tunite
Your flow would be:
tFileInput1 (Excel or CSV) ----------------------------------------------> tUnite -> tLogRow
tFileInput2 (Excel or CSV) -> tMap (add empty GENDER & CHILDREN fields) -->
I have data in two different linked tables in Airtable and I need to join them together. See example:
The PERSON table looks like:
Name | Classes
----------------
John | A,B,C,F
Sally | B,F
Max | B,C
While the linked CLASSES table looks like:
Class | Date | People
---------------------------
A | 1975 | John
B | 2000 | John,Sally,Max
C | 1823 | John,Max
D | 1492 |
E | 2020 |
F | 2010 | John,Sally
What I need is:
Person | Class | Date
---------------------
John | A | 1975
John | B | 2000
John | C | 1823
John | F | 2010
Sally | B | 2000
Sally | F | 2010
Max | B | 2000
Max | C | 1823
How do I get this view / table as output?
The more I see questions like this with no answer, the more I realise that Airtable just isn't a database in any real sense.
This is a perfectly reasonable question about how to join two tables after those tables have been normalised. The answer? It can't be done, not easily!
So what is Airtable supposed to be used for? Building non-normalised databases, otherwise known as spreadsheets!
If you click a "Class" value like "A" or "B" in the "Person" table, it will show a popup in which you can see the class details.
Or, if you really need that kind of table, my suggestion is this:
Create a new table called "xxx", write a script in the Scripting block, and populate the new table with the joined data from the "Person" and "Class" tables.
PS: The Scripting block is only supported on the "Pro" plan.
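If you only need the joined view outside Airtable, the same join is easy to express on a CSV export of the classes table. Here is a rough pandas sketch (the file name and column names are assumptions based on the example above, and I am assuming the linked "People" field exports as comma-separated names):

# Rough sketch on a CSV export of the CLASSES table (assumed file/column names).
import pandas as pd

classes = pd.read_csv("classes.csv")              # columns: Class, Date, People
classes["Person"] = classes["People"].str.split(",")
joined = (classes.explode("Person")               # one row per person/class pair
                 .dropna(subset=["Person"])       # drop classes nobody attends
                 [["Person", "Class", "Date"]]
                 .sort_values(["Person", "Class"]))
print(joined.to_string(index=False))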
I have two tables: one (NLIST) contains about 17K records, while the other (FNAMES) contains about 57K.
I would like to join the two by comparing the records using the Levenshtein distance.
Here is an example of the tables' contents:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON (true)
WHERE levenshtein(FNAMES.NICKNAMES, NLIST.S_NAME) <= 4
The above code runs for a very long time, so I stopped it.
How can I make it run in a reasonable time?
In addition, I think the Levenshtein distance depends on the length of the words. How can I find the optimal value for the distance threshold (in this case I chose 4 arbitrarily)?
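For illustration, this is the kind of matching I am after, written as a tiny in-memory Python sketch (not the Hive query itself); the length-normalised score at the end is only an idea I am considering for the threshold question:

# Tiny in-memory sketch of the intended matching (illustration only).
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

s_names = ["Avi", "Moshe", "David"]
nicknames = ["Avile", "Dudi", "Moshiko", "Avi", "DAVE"]

for nick in nicknames:
    # Pick the closest S_NAME rather than keeping every pair under a fixed cut-off of 4.
    best = min(s_names, key=lambda s: levenshtein(nick.lower(), s.lower()))
    dist = levenshtein(nick.lower(), best.lower())
    # A length-aware score that could be thresholded instead of the raw distance
    # (an assumption on my side, not a tuned rule):
    score = dist / max(len(nick), len(best))
    print(f"{nick} -> {best}  (distance {dist}, normalised {score:.2f})")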
Hive table performance depends on various points:
Query engine
File format
Vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server, you can try Impala, which is definitely faster than Hive.
You can also fine-tune Impala, which will give you an edge in executing this query faster: Tuning Impala for Performance
Using Derby, is it possible to offset by a value from the query rather than an integer literal?
When I run this query, it complains about the value I've given to the offset clause.
select
    PRIZE."NAME" as "Prize Name",
    PRIZE."POSITION" as "Position",
    (select PARTICIPANT."NAME"
     from PARTICIPANT
     order by POINTS desc
     offset PRIZE."POSITION" rows fetch next 1 row only   <-- notice I'm trying to pass in a value to offset by
    ) as "Participant"
from PRIZE
With the expectation that the results would look like this:
| Prize Name | Position | Participant |
|--------------|----------|---------------|
| Gold medal | 1 | Mari Loudi |
| Silver medal | 2 | Keesha Vacc |
| Bronze medal | 3 | Melba Hammit |
| Hundredth | 100 | James Thornby |
The documentation suggests that it's possible to pass in a value from Java code, but I'm trying to use a value from the query itself.
By the way, this is just an example schema to illustrate the point.
I know there are other ways to achieve the ranking, but I'm specifically interested in whether there's a way to pass values to the offset clause.
Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):
+-----+-----+-------+--------+
|  i  |  j  | value | matrix |
+-----+-----+-------+--------+
|  1  |  1  |  10   |   A    |
|  1  |  2  |  20   |   A    |
| ... | ... |  ...  |  ...   |
|  1  |  1  |   5   |   B    |
|  1  |  2  |   7   |   B    |
| ... | ... |  ...  |  ...   |
+-----+-----+-------+--------+
And we want to calculate the product of these matrices: C = A x B
By definition: C_i_j = sum over k ( A_i_k * B_k_j )
And here is a two-step MapReduce algorithm for calculating this product (in pseudocode):
First step:
function Map (input is a single row of the file from above):
    i = row[0]
    j = row[1]
    value = row[2]
    matrix = row[3]
    if (matrix == 'A')
        emit(j, {i, value, 'A'})   // key on the column index of A, i.e. the shared index k
    else
        emit(i, {j, value, 'B'})   // key on the row index of B, i.e. the same shared index k
Complexity of this Map function is O(1)
function Reduce (Key, List of tuples from the Map function):
    Matrix_A_tuples = filter(List of tuples from the Map function, where matrix == 'A')
    Matrix_B_tuples = filter(List of tuples from the Map function, where matrix == 'B')
    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]
        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]
            emit({i, j}, {value_A * value_B, 'C'})
Complexity of this Reduce function is O(N^2), since every A tuple for a given key is paired with every B tuple for that key.
After the first step we will get something like the following file (which contains O(N^3) lines):
+-----+-----+-------+--------+
|  i  |  j  | value | matrix |
+-----+-----+-------+--------+
|  1  |  1  |  50   |   C    |
|  1  |  1  |  45   |   C    |
| ... | ... |  ...  |  ...   |
|  2  |  2  |  70   |   C    |
|  2  |  2  |  17   |   C    |
| ... | ... |  ...  |  ...   |
+-----+-----+-------+--------+
So all we have to do is sum the values from the lines that contain the same i and j.
Second step:
function Map (input is a single row of the file produced in the first step):
i = row[0]
j = row[1]
value = row[2]
emit({i, j}, value)
function Reduce(Key, List of values from the Map function)
i = Key[0]
j = Key[1]
result = 0;
for each Value from List of values from the Map function
result += Value
emit({i, j}, result)
After the second step we will get the file which contains the cells of the matrix C.
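To make the two passes concrete, here is a small plain-Python simulation of the algorithm (dictionaries stand in for the shuffle phase; this is only an illustration, not Hadoop code):

# Plain-Python simulation of the two MapReduce passes (illustration only).
from collections import defaultdict

# (i, j, value, matrix) rows: A and B are both 2x2.
rows = [
    (1, 1, 1, 'A'), (1, 2, 2, 'A'), (2, 1, 3, 'A'), (2, 2, 4, 'A'),
    (1, 1, 5, 'B'), (1, 2, 6, 'B'), (2, 1, 7, 'B'), (2, 2, 8, 'B'),
]

# --- Step 1: key both matrices by the shared index k ---
shuffle1 = defaultdict(list)
for i, j, value, matrix in rows:
    if matrix == 'A':
        shuffle1[j].append(('A', i, value))   # k = column index of A
    else:
        shuffle1[i].append(('B', j, value))   # k = row index of B

partial = []                                  # the O(N^3)-line intermediate "file"
for k, tuples in shuffle1.items():
    a_tuples = [t for t in tuples if t[0] == 'A']
    b_tuples = [t for t in tuples if t[0] == 'B']
    for _, i, value_a in a_tuples:
        for _, j, value_b in b_tuples:
            partial.append((i, j, value_a * value_b))

# --- Step 2: sum the partial products per (i, j) ---
shuffle2 = defaultdict(int)
for i, j, value in partial:
    shuffle2[(i, j)] += value

for (i, j), value in sorted(shuffle2.items()):
    print(f"C[{i}][{j}] = {value}")
# Expected: C = A x B = [[19, 22], [43, 50]]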
So the question is:
Taking into account that there are multiple instances in the MapReduce cluster, what is the most correct way to estimate the complexity of the provided algorithm?
The first approach that comes to mind is this:
Assume that the number of instances in the MapReduce cluster is K.
Then, because the number of lines in the file produced after the first step is O(N^3), the overall complexity can be estimated as O(N^3 / K).
But this estimate doesn't take many details into account, such as the network bandwidth between instances of the MapReduce cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.
So I would like to know: what is the best approach for estimating the efficiency of the provided MapReduce algorithm, and does it make sense to use Big-O notation to estimate the efficiency of MapReduce algorithms at all?
As you said, Big-O estimates the computational complexity and does not take into consideration networking issues such as bandwidth, congestion, delay, and so on.
If you want to measure how efficient the communication between instances is, you need other networking metrics.
However, note that if your file is not big enough, you will not see an improvement in terms of execution speed. This is because MapReduce works efficiently only with BIG data. Moreover, your algorithm has two steps, which means two jobs. Between one job and the next, MapReduce takes time to write the intermediate file and start the next job, and this can slightly affect performance.
I think you can evaluate efficiency in terms of speed and time, as the MapReduce approach is certainly faster when it comes to big data, compared to sequential algorithms.
Moreover, efficiency can also be considered with regard to fault tolerance: MapReduce handles failures by itself, so programmers do not need to handle instance failures or networking failures.