Submitted by Super-Martingale t3_y4w0sw in MachineLearning
We are trying to standardize a long list (millions of rows) of company name strings. The same company can appear in different rows because of abbreviations, nicknames, subsidiaries, business units, typos, etc., so we need a way to group rows that refer to the same company. Given the size of our data, is there an efficient way to do this standardization?
Below is an example in which all strings should be grouped as a single company:
JPMorgan Chase & Co.
JPMorgan Chase
JPM Chase
JPM
J.P. Morgan
The JPM Company
Global Technology at JPMorgan Chase
JPM Company
JPM Chase Bank
JPM CHASE
JP Morgan Chase
J.P. Morgan Asset Management
JPMorgan Chase Bank, N.A.
JPMorgan
JPMorganChase
J.P. Morgan Chase
JPMorgan Chase Bank
J.P. Morgan Private Bank
InstaMed, a J.P. Morgan company
J.P. Morgan Chase Bank, N.A.
JPMorgan Private Bank
JP Morgan Asset Management
Jpmorgan Chase Bank National Association
J.P. Morgan Retirement Plan Services
JPMorgan Retirement Plan Services
JPMorgan Chase & Company
JP Morgan Chase (formerly Washington Mutual)
Washington Mutual/JP Morgan Chase
J.P. Morgan Investment Bank
JPMorgan Chase (formerly WaMu)
JPMorgan Chase Commercial Banking
JP Morgan Chase & NSPCC
JP Morgan Chase / Bank One
JP Morgan & Company Real Estate Appraisers And Con
WaMu/JPMorgan Chase
JP Morgan & Chase Co. (Formerly Washington Mutual
Bank One (JP Morgan Chase)
hjmb t1_isg9v1h wrote
Fuzzy matching will help with the typos, but in our experience the nicknames had to be mapped by hand.
If your jurisdiction(s) have accessible company records then you could match on those names to determine which rows are official names. This solves half your problem, as you then just need to match the remaining rows to an accepted official name.
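A rough sketch of that two-step idea in Python, using only the standard library. Everything named here is an assumption for illustration: `OFFICIAL_NAMES` stands in for whatever registry you can get, `ALIASES` is the hand-made nickname table, and the 0.8 cutoff is an arbitrary starting point you would need to tune.

```python
import difflib
import re

# Hypothetical data: a tiny stand-in for an official company register
# and a hand-crafted nickname table (keys are already normalized).
OFFICIAL_NAMES = ["JPMorgan Chase & Co.", "Bank of America Corporation"]
ALIASES = {"jpm": "JPMorgan Chase & Co.", "wamu": "JPMorgan Chase & Co."}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, collapse runs of whitespace."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return re.sub(r"\s+", " ", name).strip()


# Map each normalized official name back to its original spelling.
NORMALIZED = {normalize(n): n for n in OFFICIAL_NAMES}


def match_official(raw: str, cutoff: float = 0.8):
    """Return the official name for a raw row, or None if nothing is close."""
    key = normalize(raw)
    if key in ALIASES:        # hand-made nicknames take priority
        return ALIASES[key]
    if key in NORMALIZED:     # exact match after normalization
        return NORMALIZED[key]
    # Fuzzy fallback for typos and small spelling differences.
    hits = difflib.get_close_matches(key, list(NORMALIZED), n=1, cutoff=cutoff)
    return NORMALIZED[hits[0]] if hits else None
```

For example, `match_official("J.P. Morgan Chase & Co")` resolves to the official JPMorgan entry through the fuzzy fallback, while `match_official("JPM")` goes through the alias table. Note that `get_close_matches` compares against every candidate, so for millions of rows you would probably want to block candidates first (for instance by first token) rather than scan the full register each time.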
You could also modify Levenshtein distance so that dropping characters is free in an attempt to match full names with shorter names, but this will be computationally expensive.
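One possible reading of that modified distance, as a minimal sketch: deleting a character from the full name costs nothing, while insertions and substitutions still cost 1. The function name `free_deletion_distance` and the argument order are my own choices, not a standard API.

```python
def free_deletion_distance(full: str, short: str) -> int:
    """Cost of turning `full` into `short` when deletions from `full` are free.

    Insertions and substitutions cost 1 each. The result is 0 exactly when
    `short` can be obtained from `full` by only dropping characters.
    """
    # prev[j] = cost of turning the processed prefix of `full` into short[:j].
    # With nothing processed yet, building short[:j] takes j insertions.
    prev = list(range(len(short) + 1))
    for ch in full:
        curr = [0]  # dropping every character seen so far gives "" for free
        for j, target in enumerate(short, start=1):
            curr.append(min(
                prev[j],                        # delete ch from full: free
                curr[j - 1] + 1,                # insert target: cost 1
                prev[j - 1] + (ch != target),   # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]
```

With lowercased inputs, `free_deletion_distance("jpmorgan chase", "jpm chase")` is 0 because the short form only drops characters. The measure is asymmetric, so the longer name has to go first. Each comparison takes time proportional to `len(full) * len(short)`, which is why running it across all pairs of millions of rows gets expensive; restricting it to candidate pairs that already share a token is one way to keep it manageable.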