Article is about parameter estimation from text string. I explain the algorithm and show detail results on Chicago marathon 2011 runners and total results on real Czech data.
Czech version published - http://zdrojak.root.cz/clanky/odhad-pohlavi-z-celeho-jmena/
One of the problems of data quality is incomplete data. For example for some clients we have information about full name and address but no information about gender (male/female), legal type (individual/corporate) or country (domestic/foreign countries) Our company need this data to correctly calculate risk level of client, for reporting, to compute segment etc.
Czech version published - http://zdrojak.root.cz/clanky/odhad-pohlavi-z-celeho-jmena/
One of the problems of data quality is incomplete data. For example for some clients we have information about full name and address but no information about gender (male/female), legal type (individual/corporate) or country (domestic/foreign countries) Our company need this data to correctly calculate risk level of client, for reporting, to compute segment etc.
Let's consider situation when we need to estimate missing parameter with 2 possibilities for clients with some text information and we have Oracle and some data with filled parameter (standard situation in DWH).
Algorithm
Without loss of generality let's estimate parameter gender from full name.
From filled data we create 2 dictionaries (one for female and one for male) with words after data standardization (hereinafter referred to as the “word/s”). To create words we use Oracle functions upper, trim, translate, convert and regexp_replace [1] and we consider space, comma and dot like word separators (you can change it according to your data – add dash, question mark etc.).
Without loss of generality let's estimate parameter gender from full name.
From filled data we create 2 dictionaries (one for female and one for male) with words after data standardization (hereinafter referred to as the “word/s”). To create words we use Oracle functions upper, trim, translate, convert and regexp_replace [1] and we consider space, comma and dot like word separators (you can change it according to your data – add dash, question mark etc.).
words:=upper(convert(Trim(regexp_replace(translate(full_name,',.',' '),'( ){2,}',' ')),'US7ASCII'))
We compute score for each word from dictionary by using Oracle percent_rank function [3]. The most typical words (like Isabella for women and Jacob for men [2]) have high scoring and influence decision rules most.
Select word, PERCENT_RANK () OVER (order by count(*)) score
….
group by word
For better understanding see attachment chicago_marathon.xls.
To estimate gender from full name we join full name through words to both dictionaries and sum scores for male and female dictionary. If difference for male and female sum of score is bigger then epsilon (epsilon is 0 or little positive number), we make the estimate according bigger sum of score.
Precondition for using the algorithm are:
- Strong dependency between string, which we use and parameter, which we want to estimate (there is a correlation between the name and gender)
- The same distribution for words and parameter for both groups (with filled parameter and without filled parameter - it’s not a good idea
to use dictionary from English full names to estimate gender from Russians full names) - To have enough data volume with good quality to generate dictionary (more than 10 000 records for each dictionary)
This algorithm doesn't consider data meaning, construction dependencies of string, combination with other entities, experiences and knowledge of other sources and so. Because of that the results of this algorithm are worse than in full analysis and systematic method. Thus I recommend, when possible, you use standard methods.
On the other hand there are situations when we have no information about data meaning or we have to work with untypical environment (e.g. there is no customization for such data, we don't have good dictionaries, way of data entry is unusual, distribution of data is untypical, we have no special tool, etc) it still will work - and we can get relatively good results without exceptionally hard work and your data distribution will be respected as well!
Results and comparison with specialized tools
On the other hand there are situations when we have no information about data meaning or we have to work with untypical environment (e.g. there is no customization for such data, we don't have good dictionaries, way of data entry is unusual, distribution of data is untypical, we have no special tool, etc) it still will work - and we can get relatively good results without exceptionally hard work and your data distribution will be respected as well!
Results and comparison with specialized tools
For test I used data with filled parameter (gender or legal type) and I divided this group into 2 random disjoint groups. One I used to generate dictionary, the other one to estimate gender from full name and control estimate value with filled value.
For gender test from Chicago marathon results I used 80% of 35 747 runners names to create dictionary and 20% (every 5th place – see attachment) to check results.
For the best match I need biggest positive (estimate male for man and female for woman) results and the smallest negative results (estimate male for woman and female for man).
The algorithm is positive for 6 543 of 7 157 runners (for epsilon = 0 - see attachments for details).
Chicago marathon | Algorithm | |||
epsilon = 0 | epsilon = 0.2 | |||
Positive | Negative | Positive | Negative | |
Gender | 91.421% | 6.371% | 72.684% | 2.124% |
For other test I used real data from Czech bank. Data volume was much bigger than in Chicago marathon. Most of clients are from the Czech Republic but there are also significant groups of client from Slovakia, Russia, Ukraine, Germany, etc. I tested gender and legal type estimation. These combinations meet the algorithm preconditions.
Using the same test data I used SAS Dataflux for Czech Republic [4] for comparison with specialized tool.
For gender test I used data with filled gender and I divided this group into 2 random disjoint groups. One I used to generate dictionary, the other one to estimate gender from full name and control estimate value with filled value.
For legal type test I used data with filled legal type and I divided this group into 2 random disjoint groups (one for dictionary, another for estimate and control).
For legal type test I used data with filled legal type and I divided this group into 2 random disjoint groups (one for dictionary, another for estimate and control).
Every group contained tens of thousands records
For the best match I need biggest positive (estimate male for man and female for woman) results and the smallest negative results (estimate male for woman and female for man).
For the best match I need biggest positive (estimate male for man and female for woman) results and the smallest negative results (estimate male for woman and female for man).
For details I measured results for Czech and for foreign clients.
Algorithm | Dataflux | |||||
epsilon = 0 | epsilon = 0.2 | |||||
Positive | Negative | Positive | Negative | Positive | Negative | |
Gender - Total | 98.278% | 0.236% | 95.055% | 0.117% | 94.050% | 0.210% |
Gender - Czech | 99.597% | 0.052% | 97.095% | 0.026% | 99.007% | 0.057% |
Gender - foreign | 88.993% | 1.531% | 80.695% | 0.754% | 59.156% | 1.292% |
Legal type - Total | 98.676% | 0.400% | 88.232% | 0.211% | 98.749% | 0.890% |
Legal type - Czech | 99.397% | 0.233% | 89.392% | 0.121% | 98.881% | 0.842% |
Legal type - foreign | 91.611% | 2.037% | 76.869% | 1.091% | 97.450% | 1.363% |
The algorithm is for bigger epsilon more conservative (returns smaller positive number and smaller negative number). The algorithm better meets conditions for majority group (better dictionary and stronger dependency and also better results for Czech clients). On the other hand Dataflux shows its bigger power on foreign legal types. But Czech Dataflux doesn't work for foreign gender (under 60% of positive).
There is a possibility to change the algorithm. To ignore the words with very small scoring in dictionary, to find the best epsilon like function of count of dictionaries and count of words in full name, to consider Bayes’ theorem for different gender cardinality, to consider more than only 2 results for estimate etc. But these steps also reduce the strongest quality of algorithm – relatively good results for short time without analyze necessity.
PL/SQL
I only want to show basic principles. For details see attachments.
For brevity I don't consider exceptions and special situations.
We have tables TABLE_OF_WOMEN_FULL_NAMES and TABLE_OF_MEN_FULL_NAMES with one varchar2 column. We use these tables to create dictionaries.
Run create_dictionary('TABLE_OF_WOMEN_FULL_NAMES', ‘DICTIONARY_WOMEN’) and create_dictionary('TABLE_OF_MEN_FULL_NAMES',‘DICTIONARY_MEN’)
Run create_dictionary('TABLE_OF_WOMEN_FULL_NAMES', ‘DICTIONARY_WOMEN’) and create_dictionary('TABLE_OF_MEN_FULL_NAMES',‘DICTIONARY_MEN’)
where
CREATE OR REPLACE Procedure
create_dictionary(table_name_in In user_tables.table_name%Type,
CREATE OR REPLACE Procedure
create_dictionary(table_name_in In user_tables.table_name%Type,
table_name_out In varchar2)
Is
Is
Type tabtype Is Table Of
Varchar(2000);
Varchar(2000);
input_table tabtype;
words Varchar(2000);
count_of_words Number;
Begin
--create table for dictionary
Execute Immediate 'create table '|| table_name_out ||' (word VARCHAR2(200))';
Execute Immediate 'select * from ' || table_name_in
Bulk Collect
Bulk Collect
Into input_table;
For i In input_table.first ..
input_table.last
input_table.last
Loop --(1)
--space, comma and dot are separated symbols for
words we replace them with space by using translate
words we replace them with space by using translate
--multiple space we replace with single space by
using regexp_replace
using regexp_replace
--standartization of word by using upper and
convert oracle functions
convert oracle functions
words := upper(convert(Trim(regexp_replace(translate(input_table(i),',.',' '),'( ){2,}',' ')),'US7ASCII')) || ' ';
If words = ' '
Then
count_of_words := 0;
Else
count_of_words := length(words) - length(Replace(words, ' '));
End If;
For j In 1 .. count_of_words
Loop --(2)
--inserted j-th standarted word from full name to
dictionary
dictionary
If j = 1
ThenExecute Immediate 'insert into ' || table_name_out || ' values (''' ||substr(words,1,instr(words, ' ', 1,
1) - 1) ||''')';
Else
Execute Immediate 'insert into ' || table_name_out || ' values (''' ||
substr(words,instr(words, ' ', 1,
j - 1) + 1,
instr(words, ' ', 1, j) - instr(words, ' ', 1, j - 1) - 1) || ''')';
End If;
End Loop; --(2)
Commit;
End Loop; --(1)
End;To values from dictionary_men we assign score by using percent_rank. I use also union all twice to give every word from dictionary positive score.
select word, PERCENT_RANK () OVER (order by count(*)) score
from
(select word from DICTIONARY_MEN union all select word from DICTIONARY_MEN union all select null from dual)
group by word
To estimate gender we use following select:
select t1.full_name,
select t1.full_name,
case when
t1.men_score-t2.women_score>&epsilon then 'male'
t1.men_score-t2.women_score>&epsilon then 'male'
when t2.women_score-t1.men_score>&epsilon then 'female'
else 'no result' end result
from
--full_name_without_gender
is table with full names we need to estimate gender
is table with full names we need to estimate gender
--men and
woman are tables with words from dictionaries and score
woman are tables with words from dictionaries and score
(select t.full_name ,sum(nvl(men.score,0)) men_score
from full_name_without_gender t
left join (select word, PERCENT_RANK () OVER (order by count(*)) score
from (select word from DICTIONARY_MEN union all select word from DICTIONARY_MEN union all select null from dual)
group by word)
men
men
on ' '||upper(convert(translate(full_name,',.-',' '),'US7ASCII'))||' ' like '% '||men.word||' %'
group by
t.full_name) t1
t.full_name) t1
join
(select t.full_name ,sum(nvl(women.score,0)) women_score
from full_name_without_gender t
left join (select word, PERCENT_RANK () OVER (order by count(*)) score
from (select word from DICTIONARY_WOMEN union all select word from DICTIONARY_WOMEN union all select null from dual)
group by word)
women
women
on ' '||upper(convert(translate(full_name,',.-',' '),'US7ASCII'))||' ' like '% '||women.word||' %'
group by
t.full_name) t2
t.full_name) t2
on t1.full_name=t2.full_name
This select with like join condition works really slowly for big data volume. But it takes only few minutes for Chicago marathon results (see attachments).
To accelerate way to result you separate words from full name (in the same way like in procedure create dictionary) and left join it to dictionaries with scores or use other ways to tune select.
Attachments:
chicago_marathon.xls
inserting script.sql
chicago.sql
Sources:
[1] Oracle Database SQL Reference 10g Release 1 (10.1), Chapter SQL Functions
http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions001.htm#i88893
[2] Popular baby nameshttp://www.ssa.gov/OACT/babynames/
[3] Oracle Database Data Warehousing Guide 10g Release 2 (10.2), Chapter 21 - SQL for Analysis andReporting
http://download.oracle.com/docs/cd/B19306_01/server.102/b14223.pdf
[4] Dataflux Technology OverviewCourse Notes
[5] Bank of America Chicago
marathon 2011 results
No comments:
Post a Comment