I've been trying to match a name received from 2 sources against each other and check whether the two versions are almost a match or not (string comparison)

In the sample data, I've listed the names of the employers of a particular person (a prospective customer), which we received from 2 different sources.
I've been trying to find a way to match the two names more reliably and get good results. (Currently, this is done manually.)
I don't think I'm trying to do the impossible... but if it's not achievable, please don't be harsh!
Below is the dataset; each pair is a "match" as per manual verification.
Source 1 | Source 2
ADDUS | Addus Home Care
Amazon.com, Inc. and its affiliates | Amazon.com
Aon | Aon Service Corporation
ARAMARK Food & Support Svc. | Aramark
AT&T Mobility Services LLC | AT&T Mobility
CDW, LLC | CDW Corporation
Lurie Children's Hospital of Chicago | Lurie Childrens Hospital
Securitas Security Services USA, Inc | Securitas security
The PNC Financial Services Group, Inc. | PNC NA
United States Department of Homeland Security | US Homeland Securiti
TCS | Tata Consultancy Services
Although they are almost obvious, let me state the difficulties for the sake of emphasis.
There might be spelling mistakes in names from either of these sources.
There might be abbreviations (e.g. TCS in one place and Tata Consultancy in another).
Please suggest an algorithm or a way to do this with the least number of "wrong acceptance cases", by which I mean non-matching pairs that nevertheless get high match ratios from different algorithms.
Please try to suggest a way of doing this.

I see only one option, but one that becomes progressively more accurate over time:
(1) First the caveat: you have your 'manual job' and you will stick with it.
(2) Now the better part: the manual job gets shorter and shorter the more data you have classified over time, a kind of self-learning machine. See the following description of the approach; if you are interested, we can discuss the details later.
1. Your current workflow
   1. Create an initial employer list of triplets.
      1. employer1 (string)
      2. employer2 (string)
      3. equivalence (values {VALID|INVALID}), default: INVALID
      Result: AllEmployersList, unverified.
   2. Process the AllEmployersList manually.
      1. For each AllEmployersList member (triplet)
         1. set the value of the equivalence element to VALID or INVALID respectively.
      Result: VerifiedEmployersList, triplets with verified equivalence value.
   3. Use the VerifiedEmployersList as required for downstream processing.
2. The adapted (advanced) new workflow
   1. Create an initial employer list of triplets.
      1. employer1 (string)
      2. employer2 (string)
      3. equivalence (values {VALID|INVALID}), default: INVALID
      Result: AllEmployersList, unverified.
   2. Feed the unverified AllEmployersList into the matchKnownEmployers process (described later).
      Result: two lists, AllKnownEmployers and AllUnknownEmployers.
   3. Process the AllUnknownEmployers list manually.
      Result: VerifiedEmployersList with verified equivalence value.
   4. Feed the VerifiedEmployersList into the importKnownEmployers process.
   5. Feed (again) the AllEmployersList (Result 2.1) into the matchKnownEmployers process.
      Result: two lists, AllKnownEmployers and AllUnknownEmployers.
   6. Use the AllKnownEmployers as required for downstream processes.
3. Required investments (instances you have to establish)
   1. Create the KnownEmployers database.
      1. Create table knownEmployerNames
         1. columns:
            1. id
            2. employerName
            3. aliasIdValue
      2. Create table lastAliasIdValue
         1. columns:
            1. aliasIdValue
      3. Init table lastAliasIdValue
         1. insert one initial row, aliasIdValue = 0
   2. Create matchKnownEmployersProcess with these characteristics:
      1. Input data: employerList (triplets)
      2. Init empty lists knownEmployers and unknownEmployers.
      3. For each member in employerList do:
         1. if employer1 and employer2 are in table knownEmployerNames and employer1::aliasIdValue equals employer2::aliasIdValue
            1. then set member::equivalence value to VALID and append the member to the knownEmployers list
            2. else append the member to the unknownEmployers list
      4. Output data: two lists, knownEmployers and unknownEmployers.
   3. Create importKnownEmployersProcess with these characteristics:
      1. Input data: employerList (triplets)
      2. For each element in employerList do:
         1. if the equivalence element value is VALID
            1. insert new pattern
               1. if employer1 or employer2 is in table knownEmployerNames
                  1. then
                     1. function isUnknown(employer1, employer2) {
                            retVal = {}
                            retVal['aliasIdValue'] =
                                employer1::aliasIdValue ||
                                employer2::aliasIdValue
                            retVal['newEmployer'] =
                                (!employer1 || !employer2)
                            return retVal
                        }
                     2. aliasIdValue, newEmployer = isUnknown(employer1, employer2)
                     3. insert aliasIdValue, newEmployer into knownEmployerNames table
                  2. else
                     1. fetch and increment aliasIdValue from lastAliasIdValue table
                     2. insert into knownEmployerNames (employer1, aliasIdValue) and (employer2, aliasIdValue)
                     3. update incremented lastAliasIdValue in the lastAliasIdValue table
      3. Output data: none
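A minimal Python sketch of the match and import processes described above, assuming an in-memory dictionary stands in for the knownEmployerNames table and a plain counter for lastAliasIdValue. The class and method names are made up for illustration, names are compared as exact strings, and merging two already-known alias groups is an added assumption, since the steps above do not cover that case.

```python
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (employer1, employer2, equivalence)


class KnownEmployers:
    """In-memory stand-in for the knownEmployerNames / lastAliasIdValue tables."""

    def __init__(self) -> None:
        self.alias_of: Dict[str, int] = {}  # employerName -> aliasIdValue
        self.last_alias_id = 0              # single row of lastAliasIdValue

    def match(self, triplets: Iterable[Triplet]) -> Tuple[List[Triplet], List[Triplet]]:
        """Split triplets into (known, unknown) using the stored aliases."""
        known, unknown = [], []
        for e1, e2, _ in triplets:
            a1, a2 = self.alias_of.get(e1), self.alias_of.get(e2)
            if a1 is not None and a1 == a2:
                known.append((e1, e2, "VALID"))
            else:
                unknown.append((e1, e2, "INVALID"))
        return known, unknown

    def import_verified(self, triplets: Iterable[Triplet]) -> None:
        """Store every manually verified VALID pair under one shared alias id."""
        for e1, e2, equivalence in triplets:
            if equivalence != "VALID":
                continue
            a1, a2 = self.alias_of.get(e1), self.alias_of.get(e2)
            if a1 is None and a2 is None:
                # Neither name is known yet: allocate a fresh alias id.
                self.last_alias_id += 1
                alias = self.last_alias_id
            else:
                alias = a1 if a1 is not None else a2
                if a1 is not None and a2 is not None and a1 != a2:
                    # Assumption (not in the steps above): merge both groups.
                    for name, current in self.alias_of.items():
                        if current == a2:
                            self.alias_of[name] = a1
            self.alias_of[e1] = alias
            self.alias_of[e2] = alias
```

After importing one verified pair, a second call to `match` on the same input moves that pair from the unknown list to the known list, so only the remaining pairs need manual review.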

Related

What is the best approach for large scale Paths and Funnels Analysis?

We have a big dataset of user actions on our internal apps. I am trying to create an algorithm for Paths & Funnels analytics that takes parameters for Paths (i.e. a start and an end point) and a defined sequence of actions for a Funnel. What is the best algorithm to implement this for large data? The output should be just counts of users for specific sets of actions (see the Path/Count listing below).
Format of the file to scan:
UserID  Action  TS
1       A       06/04/2022
1       B       06/04/2022
1       C       06/04/2022
1       D       06/04/2022
2       G       06/04/2022
2       H       06/04/2022
2       K       06/04/2022
Algorithm input parameters:
For Path : User statistics on the start point A and end point F
For Funnel: User statistics on the defined steps A->B->C->D
Path        Count
A->B->C->D  385
G->H->K     89
where A,B,C,D,... are nodes for user actions or pages.
This should be easy using Python for a smaller set, but the issue is, I am worried about performance, as I am dealing with millions of records like this. Please help!
Assuming that
...
1 A ts
1 B ts
...
in the input data means user 1 went A -> B
the algorithm is
CREATE new table paths_users_followed
CREATE new path
LOOP over data input rows
    ADD action in row to path
    IF row is last row OR user in row differs from user in row+1
        ADD user, path to paths_users_followed
        CREATE new path
ENDLOOP
LOOP P over input of "path statistics"
    COUNT occurrences of P in paths_users_followed
This can be most easily and efficiently implemented using a high performance database engine - I would use SQLite.
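For a quick check on a small sample, the same algorithm can be sketched in plain Python. This assumes the rows are already sorted by user and then by timestamp, and that a path is written as actions joined with "->"; the function names are made up for illustration, and this is not a replacement for the database approach on millions of rows.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def paths_followed(rows):
    """Map each user to the path they followed.

    rows: iterable of (user_id, action, ts), sorted by user and then time.
    """
    return {
        user: "->".join(action for _, action, _ in group)
        for user, group in groupby(rows, key=itemgetter(0))
    }


def count_paths(rows, wanted_paths):
    """Count how many users followed each requested path exactly."""
    counts = Counter(paths_followed(rows).values())
    return {path: counts[path] for path in wanted_paths}
```

Called with the sample rows from the question and `["A->B->C->D", "G->H->K"]`, `count_paths` returns one count per requested path, with 0 for a path nobody followed.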

How to get level(depth) number of two connected nodes in neo4j

I'm using Neo4j as a graph database to store details of users' connections. I want to show the level of one user with respect to another in their connections, like LinkedIn does: for example first-level connection, second-level connection, third-level, and anything beyond the third level shown as 3+. I don't know how to do this in Neo4j; I searched but couldn't find a solution. If anybody knows how, please help me implement this functionality.
To find the shortest "connection level" between 2 specific people, just get the shortest path and add 1:
MATCH path = shortestpath((p1:Person)-[*..]-(p2:Person))
WHERE p1.id = 1 AND p2.id = 2
RETURN LENGTH(path) + 1 AS level
NOTE: You may want to put a reasonable upper bound on the variable-length relationship pattern (e.g., [*..6]) to avoid the query taking too long or running out of memory in a large DB. You should probably ignore very distant connections anyway.
It would be something like this:
// get all persons (or users)
MATCH (p:Person)
// create a set of unique combinations , assuring that you do
// not do double work
WITH COLLECT(p) AS personList
UNWIND personList AS personA
UNWIND personList AS personB
WITH personA,personB
WHERE id(personA) < id(personB)
// find the shortest path between any two nodes
MATCH path=shortestPath( (personA)-[:LINKED_TO*]-(personB) )
// return the distance ( = path length) between the two nodes
RETURN personA.name AS nameA,
       personB.name AS nameB,
       CASE WHEN length(path) > 3 THEN '3+'
            ELSE toString(length(path))
       END AS distance
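To make it clearer what the shortest-path query computes, here is a rough Python sketch of the same idea outside Neo4j: a breadth-first search over an adjacency dictionary, followed by the same '3+' bucketing as the CASE expression above. The graph layout, the function names and the optional `max_depth` bound are assumptions for illustration only.

```python
from collections import deque


def connection_level(graph, start, goal, max_depth=None):
    """Return the number of hops on a shortest path, or None if unreachable.

    graph: dict mapping each person to an iterable of directly linked persons.
    max_depth: optional upper bound on the path length that is explored.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if max_depth is not None and dist >= max_depth:
            continue  # do not expand beyond the bound
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def level_label(hops):
    """Mirror the CASE expression: anything longer than 3 hops becomes '3+'."""
    return "3+" if hops > 3 else str(hops)
```

The `max_depth` argument plays the same role as an upper bound such as [*..6] in the relationship pattern: distant connections are simply reported as not found.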

Split test groups based on GUID

Users in the system are identified by GUID, and for a new feature I want to divide users into two groups: test and control.
Is there an easy way to split users into one of the two groups with a 50/50 chance, based on their GUID?
E.g. if the nth character's ASCII code is odd -> test group, otherwise control group.
What about 70/30, or another ratio?
The reason I want to classify users based on GUID is that later I can easily tell which users are in which group and compare the performance of the two groups without having to keep track of the group assignment; I simply calculate it again.
As Derek Li notes, the GUID's bits might be based on a timestamp, so you shouldn't use them directly.
The safest solution is to hash the GUID using a hash function like MurmurHash. This produces a pseudo-random number (but the same number every time for any given GUID), which you can then use to do the split.
For example, you could do a 30/70 split like this:
function isInTestGroup(user) {
    var hash = murmurHash(user.guid);
    // for a non-negative hash, hash % 100 is in 0..99, so about 30% pass
    return (hash % 100) < 30;
}
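The same idea as a small Python sketch. MurmurHash is not in the Python standard library, so this swaps in SHA-256 from hashlib, which serves the same purpose here (a stable, well-mixed number per GUID). The function name, the default of 30 percent and the lower-casing of the GUID are assumptions for illustration.

```python
import hashlib


def is_in_test_group(guid: str, test_percent: int = 30) -> bool:
    """Deterministically place about test_percent% of GUIDs in the test group."""
    # Normalise first so "ABC..." and "abc..." land in the same group (assumption).
    digest = hashlib.sha256(guid.strip().lower().encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and reduce it to 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < test_percent
```

Because the result depends only on the GUID, the assignment never has to be stored and can be recomputed later when the two groups are compared.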
If some character in the GUID is equally likely (a 1 in 16 chance) to be any of the hex digits "0123456789ABCDEF", then you could use a scheme that determines placement by that character.
Say the last character of the GUID, called c, has a 1/16 chance of being any hex digit:
for a 50/50 distribution -> c <= 7 for group 1, c > 7 for group 2
for roughly 70/30 -> c <= A for group 1, c > A for group 2 (11 of the 16 values, so about 69/31)
etc...
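A short Python sketch of this character-based scheme, assuming the chosen character really is uniformly distributed over the 16 hex digits (which depends on how the GUIDs are generated). The function name and parameters are made up for illustration.

```python
def group_by_hex_char(guid: str, cutoff: str = "7", position: int = -1) -> int:
    """Return 1 if the hex digit at `position` is <= cutoff, otherwise 2."""
    value = int(guid[position], 16)  # raises ValueError for a non-hex character
    return 1 if value <= int(cutoff, 16) else 2
```

With the default cutoff "7", 8 of the 16 digits fall into group 1; with cutoff "A", 11 of 16 do, which is as close to 70/30 as a single hex digit allows.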

Sorting part of a spreadsheet alphabetically

I am trying to sort one section of my data alphabetically, based on the company names in one column.
The relevant columns are the company name in column 7 and the specification number in column 9. The script that I wrote so far finds the highest and lowest row with the correct specification number and then inserts all of the company's data in a new row following the last row with the right specification.
I then want to sort only the rows with that specification number by the company name so that the company name is in alphabetical order. This must sort the full row with all of the company's info, not just the name.
The bit of code that I had tried to use to accomplish this is as follows:
Range(Cells(firstSpec, 7), Cells(lastSpec, 7)).Sort Key1:=Target, Order1:=xlAscending, Header:=xlGuess, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom
This just gives me a runtime error of 1004
What do I need to do to make this sort properly?
Good news!
Your line of code works in this context:
Sub qwerty()
    Dim firstSpec As Long, lastSpec As Long
    Dim target As Range
    Set target = Cells(1, 7)    ' sort key: a cell in the company-name column
    firstSpec = 1
    lastSpec = 3
    Range(Cells(firstSpec, 7), Cells(lastSpec, 7)).Sort Key1:=target, Order1:=xlAscending, Header:=xlGuess, OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom
End Sub
If you are getting errors, check the values of firstSpec, lastSpec and target.

Visual Basic Function Procedure

I need help with the following H.W. problem. I have done everything except the instructions I numbered. Please help!
A furniture manufacturer makes two types of furniture—chairs and sofas.
The cost per chair is $350, the cost per sofa is $925, and the sales tax rate is 5%.
Write a Visual Basic program to create an invoice form for an order.
After the data on the left side of the form are entered, the user can display an invoice in a list box by pressing the Process Order button.
The user can click on the Clear Order Form button to clear all text boxes and the list box, and can click on the Quit button to exit the program.
1. The invoice number consists of the capitalized first two letters of the customer’s last name, followed by the last four digits of the zip code.
2. The customer name is input with the last name first, followed by a comma, a space, and the first name. However, the name is displayed in the invoice in the proper order.
3. The generation of the invoice number and the reordering of the first and last names should be carried out by Function procedures.
Seeing as this is homework and you haven't provided any code to show what effort you have made on your own, I'm not going to provide any specific answers, but I will try to point you in the right direction.
Your first 2 numbered items look to be variations on the same theme: string manipulation. Assuming you have the customer's name and address information from the order form, you just need to write 2 separate functions that take the parts of the name and address, extract the data you need, and return the value (which covers your 3rd item).
To get the parts of the name and address needed to generate the invoice number, think about using the Left() and Right() functions.
Something like:
Dim first as String, last as String, word as String
word = "Foo"
first = Left(word, 1)
last = Right(word, 1)
Debug.Print(first) 'prints "F"
Debug.Print(last) 'prints "o"
Once you get the parts you need, then you just need to worry about joining the parts together in the order you want. The concatenation operator for strings is &. So using the above example, it would go something like:
Dim concat as String
concat = first & last
Debug.Print(concat) 'prints "Fo"
Your final item, using a Function procedure to generate the desired values, is very easily google-able (is that even a word?). The syntax is very simple, so here's a quick example of a common function that is not built into VB6:
Private Function IsOdd(value As Integer) As Boolean
    If (value Mod 2) = 0 Then   ' determines if value is odd or even by checking
                                ' if the value divided by 2 has a remainder or not
                                ' (aka Mod operator)
        IsOdd = False           ' if remainder is 0, set IsOdd to False
    Else
        IsOdd = True            ' otherwise set IsOdd to True
    End If
End Function
Hopefully this gets you going in the right direction.

Resources