Peter Brejcak: Estimate gender from full name with Oracle

Estimate gender from full name with Oracle

Article is about parameter estimation from text string. I explain the algorithm and show detail results on Chicago marathon 2011 runners and total results on real Czech data.

Czech version published - http://zdrojak.root.cz/clanky/odhad-pohlavi-z-celeho-jmena/

One of the problems of data quality is incomplete data. For example for some clients we have information about full name and address but no information about gender (male/female), legal type (individual/corporate) or country (domestic/foreign countries) Our company need this data to correctly calculate risk level of client, for reporting, to compute segment etc.

Let's consider situation when we need to estimate missing parameter with 2 possibilities for clients with some text information and we have Oracle and some data with filled parameter (standard situation in DWH).

Algorithm

Without loss of generality let's estimate parameter gender from full name.

From filled data we create 2 dictionaries (one for female and one for male) with words after data standardization (hereinafter referred to as the “word/s”). To create words we use Oracle functions upper, trim, translate, convert and regexp_replace [1] and we consider space, comma and dot like word separators (you can change it according to your data – add dash, question mark etc.).

words:=upper(convert(Trim(regexp_replace(translate(full_name,',.',' '),'( ){2,}',' ')),'US7ASCII'))

We compute score for each word from dictionary by using Oracle percent_rank function [3]. The most typical words (like Isabella for women and Jacob for men [2]) have high scoring and influence decision rules most.

Select word, PERCENT_RANK () OVER (order by count(*)) score

….

group by word

For better understanding see attachment chicago_marathon.xls.

To estimate gender from full name we join full name through words to both dictionaries and sum scores for male and female dictionary. If difference for male and female sum of score is bigger then epsilon (epsilon is 0 or little positive number), we make the estimate according bigger sum of score.

Precondition for using the algorithm are:

Strong dependency between string, which we use and parameter, which we want to estimate (there is a correlation between the name and gender)
The same distribution for words and parameter for both groups (with filled parameter and without filled parameter - it’s not a good idea
to use dictionary from English full names to estimate gender from Russians full names)
To have enough data volume with good quality to generate dictionary (more than 10 000 records for each dictionary)

This algorithm doesn't consider data meaning, construction dependencies of string, combination with other entities, experiences and knowledge of other sources and so. Because of that the results of this algorithm are worse than in full analysis and systematic method. Thus I recommend, when possible, you use standard methods.

On the other hand there are situations when we have no information about data meaning or we have to work with untypical environment (e.g. there is no customization for such data, we don't have good dictionaries, way of data entry is unusual, distribution of data is untypical, we have no special tool, etc) it still will work - and we can get relatively good results without exceptionally hard work and your data distribution will be respected as well!

Results and comparison with specialized tools

For test I used data with filled parameter (gender or legal type) and I divided this group into 2 random disjoint groups. One I used to generate dictionary, the other one to estimate gender from full name and control estimate value with filled value.

For gender test from Chicago marathon results I used 80% of 35 747 runners names to create dictionary and 20% (every 5^th place – see attachment) to check results.

For the best match I need biggest positive (estimate male for man and female for woman) results and the smallest negative results (estimate male for woman and female for man).

The algorithm is positive for 6 543 of 7 157 runners (for epsilon = 0 - see attachments for details).

Chicago marathon	Algorithm
	epsilon = 0		epsilon = 0.2
	Positive	Negative	Positive	Negative
Gender	91.421%	6.371%	72.684%	2.124%

For other test I used real data from Czech bank. Data volume was much bigger than in Chicago marathon. Most of clients are from the Czech Republic but there are also significant groups of client from Slovakia, Russia, Ukraine, Germany, etc. I tested gender and legal type estimation. These combinations meet the algorithm preconditions.

Using the same test data I used SAS Dataflux for Czech Republic [4] for comparison with specialized tool.

For gender test I used data with filled gender and I divided this group into 2 random disjoint groups. One I used to generate dictionary, the other one to estimate gender from full name and control estimate value with filled value.

For legal type test I used data with filled legal type and I divided this group into 2 random disjoint groups (one for dictionary, another for estimate and control).

Every group contained tens of thousands records

For the best match I need biggest positive (estimate male for man and female for woman) results and the smallest negative results (estimate male for woman and female for man).

For details I measured results for Czech and for foreign clients.

	Algorithm				Dataflux
	epsilon = 0		epsilon = 0.2		Dataflux
	Positive	Negative	Positive	Negative	Positive	Negative
Gender - Total	98.278%	0.236%	95.055%	0.117%	94.050%	0.210%
Gender - Czech	99.597%	0.052%	97.095%	0.026%	99.007%	0.057%
Gender - foreign	88.993%	1.531%	80.695%	0.754%	59.156%	1.292%
Legal type - Total	98.676%	0.400%	88.232%	0.211%	98.749%	0.890%
Legal type - Czech	99.397%	0.233%	89.392%	0.121%	98.881%	0.842%
Legal type - foreign	91.611%	2.037%	76.869%	1.091%	97.450%	1.363%

The algorithm is for bigger epsilon more conservative (returns smaller positive number and smaller negative number). The algorithm better meets conditions for majority group (better dictionary and stronger dependency and also better results for Czech clients). On the other hand Dataflux shows its bigger power on foreign legal types. But Czech Dataflux doesn't work for foreign gender (under 60% of positive).

There is a possibility to change the algorithm. To ignore the words with very small scoring in dictionary, to find the best epsilon like function of count of dictionaries and count of words in full name, to consider Bayes’ theorem for different gender cardinality, to consider more than only 2 results for estimate etc. But these steps also reduce the strongest quality of algorithm – relatively good results for short time without analyze necessity.

PL/SQL

I only want to show basic principles. For details see attachments.

For brevity I don't consider exceptions and special situations.

We have tables TABLE_OF_WOMEN_FULL_NAMES and TABLE_OF_MEN_FULL_NAMES with one varchar2 column. We use these tables to create dictionaries.

Run create_dictionary('TABLE_OF_WOMEN_FULL_NAMES', ‘DICTIONARY_WOMEN’) and create_dictionary('TABLE_OF_MEN_FULL_NAMES',‘DICTIONARY_MEN’)

where

CREATE OR REPLACE Procedure
create_dictionary(table_name_in In user_tables.table_name%Type,

table_name_out In varchar2)
Is

Type tabtype Is Table Of
Varchar(2000);

input_table tabtype;

words Varchar(2000);

count_of_words Number;

Begin

--create table for dictionary

Execute Immediate 'create table '|| table_name_out ||' (word VARCHAR2(200))';

Execute Immediate 'select * from ' || table_name_in
Bulk Collect

Into input_table;

For i In input_table.first ..
input_table.last

Loop --(1)

--space, comma and dot are separated symbols for
words we replace them with space by using translate

--multiple space we replace with single space by
using regexp_replace

--standartization of word by using upper and
convert oracle functions

words := upper(convert(Trim(regexp_replace(translate(input_table(i),',.',' '),'( ){2,}',' ')),'US7ASCII')) || ' ';

If words = ' '

Then

count_of_words := 0;

Else

count_of_words := length(words) - length(Replace(words, ' '));

End If;

For j In 1 .. count_of_words

Loop --(2)

--inserted j-th standarted word from full name to
dictionary

If j = 1

Then
Execute Immediate 'insert into ' || table_name_out || ' values (''' ||substr(words,1,instr(words, ' ', 1,
1) - 1) ||''')';

Else

Execute Immediate 'insert into ' || table_name_out || ' values (''' ||

substr(words,instr(words, ' ', 1,
j - 1) + 1,
instr(words, ' ', 1, j) - instr(words, ' ', 1, j - 1) - 1) || ''')';

End If;

End Loop; --(2)

Commit;

End Loop; --(1)

End;

To values from dictionary_men we assign score by using percent_rank. I use also union all twice to give every word from dictionary positive score.

select word, PERCENT_RANK () OVER (order by count(*)) score

from

(select word from DICTIONARY_MEN union all select word from DICTIONARY_MEN union all select null from dual)

group by word

To estimate gender we use following select:

select t1.full_name,

case when
t1.men_score-t2.women_score>&epsilon then 'male'

when t2.women_score-t1.men_score>&epsilon then 'female'

else 'no result' end result

from

--full_name_without_gender
is table with full names we need to estimate gender

--men and
woman are tables with words from dictionaries and score

(select t.full_name ,sum(nvl(men.score,0)) men_score

from full_name_without_gender t

left join (select word, PERCENT_RANK () OVER (order by count(*)) score

from (select word from DICTIONARY_MEN union all select word from DICTIONARY_MEN union all select null from dual)

group by word)
men

on ' '||upper(convert(translate(full_name,',.-',' '),'US7ASCII'))||' ' like '% '||men.word||' %'

group by
t.full_name) t1

join

(select t.full_name ,sum(nvl(women.score,0)) women_score

from full_name_without_gender t

left join (select word, PERCENT_RANK () OVER (order by count(*)) score

from (select word from DICTIONARY_WOMEN union all select word from DICTIONARY_WOMEN union all select null from dual)

group by word)
women

on ' '||upper(convert(translate(full_name,',.-',' '),'US7ASCII'))||' ' like '% '||women.word||' %'

group by
t.full_name) t2

on t1.full_name=t2.full_name

This select with like join condition works really slowly for big data volume. But it takes only few minutes for Chicago marathon results (see attachments).

To accelerate way to result you separate words from full name (in the same way like in procedure create dictionary) and left join it to dictionaries with scores or use other ways to tune select.

Attachments:
chicago_marathon.xls
inserting script.sql
chicago.sql

Sources:

Peter Brejcak

31 Oct 2011

Estimate gender from full name with Oracle

[1] Oracle Database SQL Reference 10g Release 1 (10.1), Chapter SQL Functions

http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions001.htm#i88893

http://www.ssa.gov/OACT/babynames/

http://download.oracle.com/docs/cd/B19306_01/server.102/b14223.pdf

http://results.public.chicagomarathon.com/2011/

No comments:

Post a Comment