Gender Prediction for Expert Finding Task

Predicting gender from names is an interesting problem in Information Retrieval and, in particular, in the expert finding task. In this paper, we propose a machine learning approach to gender prediction. We introduce a new feature, combination of letters in names, which gives 86.54% accuracy. Our data collection consists of about 3000 Urdu names written in the English alphabet. The technique can also be applied to names extracted from email addresses. To the best of our knowledge, this is the first attempt to predict gender from Pakistani (Urdu) names written in the English alphabet.
Keywords: Urdu; Semantic Web; Gender Prediction; Expert Profiling; Machine Learning


I. INTRODUCTION
As the internet becomes an intrinsic part of our lives, organizations increasingly focus on automated solutions that can exploit the information available on the web. As the volume of information on the web grows, so does the motivation to turn it into knowledge. However, if technology is meant to bring benefits, it has to support access not only to documented knowledge but also, most importantly, to knowledge held by individuals [1]. To find and process such knowledge, the Information Retrieval research community has proposed the expert finding task.
The objective of an expert finding system is to help find people with the appropriate expertise through intelligent automated techniques [2]. This task is challenging because of the rich set of information needs associated with it, for example, finding experts with a particular set of skills within a particular domain, or finding an expert from a specific geographical location. One of the most interesting and challenging subtasks is gender prediction from experts' names. When searching for experts using data available on the web, finding experts of a particular gender can be a pertinent information need.
Gender prediction from names (or emails) is important not only for expert finding but also for tasks such as co-reference resolution, machine translation, textual entailment, question answering, contextual advertising and information extraction [3]. In the literature, most work on gender prediction falls under either author profiling [4,5] or gender prediction from names [3,6] for expert finding.
In this paper, we propose a machine learning approach to predict gender from Urdu names written in the English alphabet, since this is the form in which they are mostly found on the web. We propose a feature, named combination of letters, which gives better results when combined with existing features proposed for other languages. To the best of our knowledge, there is no previous work on this problem.

II. RELATED WORK
Generally, four types of work on gender prediction can be found:
• gender prediction using text,
• gender prediction using names,
• gender prediction using images,
• gender prediction using voice.
Gender prediction using images [7,8] and voice [9,10] is beyond the scope of our work, so we discuss only the first two categories in this section.

A. Gender Prediction Using Text
Gender prediction using text is a sub-task of author profiling. Author profiling, in general, is used to determine an author's gender, age, native language, personality type, etc. [11]. It is a problem of growing importance in a variety of areas, including forensics, security and marketing. For this reason, it was introduced as part of PAN (CLEF) in 2013 and remained one of its core tasks through 2016. Gender prediction from text has been performed on several kinds of text, such as blogs [12], electronic discourse [13], online social networks [14], and email [15].
www.ijacsa.thesai.org
Researchers have used style-based features (n-grams of POS tags in documents, punctuation symbols, number of href links, etc. [16,17]) as well as topic-based features for gender prediction from text (for example, males usually use words like 'daily life' to describe their work, whereas females use 'daily life' to describe their love or spiritual life).

B. Gender Prediction from Names
Gender prediction from names is a challenging task, and relatively little work has been done on it. One of the earliest works addressed North American names [6]. In that work, the researchers used morphological features of the English language and identified several useful features of sound and language. Similarly, Tripathi and Faruqui [3] used a support vector machine (SVM) to classify gender from Indian names. They used n-gram suffixes along with other morphological features to classify male and female names.

C. What Makes Our Work Different
As discussed above, we could find work on gender prediction only for English and Indian names. To the best of our knowledge, no such work exists for Urdu names. Morphological analysis of American, Indian and Urdu names reveals their differences [3]. This is the first aspect that distinguishes our work from existing work.
The second aspect that distinguishes our approach is the use of a new feature, combination of letters.
Another difference is the size of the data collection. We use around 3047 names (1729 female and 1308 male), whereas the work on Indian names [3] used a collection of 2000 names (890 female and 1110 male) and the work on North American names [6] included 489 names (222 female and 267 male). Last but not least, we use only textual features to identify gender. We did not use sound-based features such as number of syllables and sonorant consonant ending.
The following table shows a feature-based analysis of our training data. We consider a name long if it contains six or more letters.

III. MACHINE LEARNING FEATURES
Previous works [3,6] have used the following features to differentiate between male and female names. We use only a subset of these features because our focus is on text-based features.
Vowel Ending: Female names generally end in a vowel, while male names generally end in a consonant.

Number of Syllables: A syllable is a unit of pronunciation uttered without interruption, loosely a single sound. Female names tend to have more syllables than male names.

Sonorant Consonant Ending: A sonorant is a sound that is produced without turbulent airflow in the vocal tract. Hindi possesses eight sonorant consonants [19]. Compared to female names, male names generally end with a sonorant consonant.

Length of the Word: Even though the length of a name does not inherently relate to gender, our data shows that female names are longer than male names among Pakistani names, whereas the opposite trend has been reported for Indian names [3].
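The text-only features in this list can be computed directly from the spelling of a name. The following is a minimal sketch, not the authors' implementation. It assumes that the vowels are the five letters a, e, i, o, u (so "y" counts as a consonant), that "long" means six or more letters as stated earlier in the paper, and that the n-gram features are the last n letters of the name. The function name `textual_features` and the dictionary keys are hypothetical.

```python
# Assumption: only a, e, i, o, u are treated as vowels in romanized names.
VOWELS = set("aeiou")


def textual_features(name):
    """Return simple text-only features for a name written in Roman script."""
    n = name.strip().lower()
    return {
        # True when the last letter is a vowel (False for an empty string).
        "vowel_ending": bool(n) and n[-1] in VOWELS,
        # Number of characters in the cleaned name.
        "length": len(n),
        # The paper treats names with six or more letters as long.
        "is_long": len(n) >= 6,
        # Assumption: n-gram features are suffixes of length 1, 2 and 3.
        "suffix_1": n[-1:],
        "suffix_2": n[-2:],
        "suffix_3": n[-3:],
    }


if __name__ == "__main__":
    for example in ("Danyal", "Faryal", "Ayesha"):
        print(example, textual_features(example))
```

Syllable counts and sonorant endings are left out on purpose, since they depend on pronunciation rather than spelling.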

A. Issues with Previously Used Features
Previously used features for gender prediction from names are a mixture of textual and speech-based characteristics. We focus on textual features only, which is more practical when predicting gender from names in real time. Each language has its own conventions, inherited from the region where it is spoken, so textual features such as vowel ending and length of the word might not behave the same way for Urdu as for other languages. We therefore propose a new feature called "combination of letters" for gender prediction from Urdu names. This feature tries to capture consecutive or nonconsecutive combinations of letters in names. It could prove very useful when Urdu names (written in Roman Urdu) share the same ending letters and length, because in that case those two features cannot accurately distinguish between male and female names. Table 2 gives some examples of Urdu names (in Roman Urdu) with all three textual features. In this table, the names "Danyal" and "Faryal" have the same length and the same 1-gram ending, and it is "combination of letters" that helps recognize the gender of the name. To compute the "combination of letters" feature, we developed an algorithm that extracts this information automatically.
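The extraction algorithm itself is not spelled out in this section. As one possible reading of "consecutive or nonconsecutive combination of letters", the sketch below collects every ordered pair of letters that occurs in a name, whether or not the two letters are adjacent. The pair size of 2 and the function name `letter_combinations` are assumptions made for illustration and may differ from the authors' algorithm.

```python
from itertools import combinations


def letter_combinations(name, size=2):
    """Return the set of ordered letter groups found in a name.

    Assumption: a "combination" is any group of `size` letters that appear
    in left-to-right order, adjacent or not (a skip-gram over characters).
    """
    letters = [c for c in name.lower() if c.isalpha()]
    # combinations() keeps positional order, so "dn" means d occurs before n.
    return {"".join(group) for group in combinations(letters, size)}


if __name__ == "__main__":
    danyal = letter_combinations("Danyal")
    faryal = letter_combinations("Faryal")
    print("shared:", sorted(danyal & faryal))
    print("only in Danyal:", sorted(danyal - faryal))
    print("only in Faryal:", sorted(faryal - danyal))
```

Under this reading, "Danyal" and "Faryal" share pairs such as "al" and "ya" but differ in pairs such as "dn" and "fr", which gives a classifier something to separate them on even though their length and final letter are identical.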

A. Data Set
We prepared a dataset of Urdu (Pakistani) names ourselves from web sites containing Urdu names and their meanings and from old PTCL (Pakistan Telecommunication Company Limited) telephone directories. All names are written in Roman Urdu script, i.e., using the English alphabet. Note that most Urdu linguistic resources have been developed by the Centre of Language Engineering, but we could not find a collection of Urdu names on their web site, even though they report having developed one [18]. Our data collection consists of 3047 Pakistani names: 1729 female and 1308 male.

B. Classifiers
We use Decision Tree (J48), Support Vector Machine (SVM), K-nearest neighbor (Lazy-IBK) and Random Forest classifiers, both on individual features and on different combinations of features, and compare their performance on the test data. We use 1828 name instances (almost 60 percent of the data) to build the training model, and the remaining 1219 instances as test data. We use the Weka 3 toolkit for our experiments.
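The split sizes follow from the 60 percent figure: 60 percent of 3047 is 1828.2, which gives 1828 training instances and leaves 1219 for testing. The sketch below reproduces those sizes with a shuffled hold-out split. How the instances were actually assigned to each part (randomly, stratified by gender, or otherwise) is not stated, so the shuffle, the seed and the function name `split_dataset` are assumptions.

```python
import random


def split_dataset(instances, train_fraction=0.6, seed=0):
    """Shuffle the instances and split them into training and test lists.

    Assumption: a plain random hold-out split; the paper does not say how
    the 1828 training and 1219 test instances were chosen.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # int(1828.2) is 1828
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Placeholder integers stand in for the 3047 labelled names.
    train, test = split_dataset(range(3047))
    print(len(train), len(test))
```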

C. Results and Discussions
In this section, we describe the results obtained with the different classifiers using individual features and combinations of features.
a) Decision Tree: The following tables show the results for the decision tree classifier. Among individual features, unigram outperforms all other individual features for all classifiers. When the unigram feature is combined with combination of letters, accuracy rises from 83.92% to 86.55% for the decision tree, from 83.92% to 84.74% for SVM, and from 83.92% to 86.30% for KNN and Random Forest. The accuracy of the different classifiers for this combination is also shown in Figure 1 (at the end of the document). It is also interesting to observe that 2-gram features are more effective than 3-gram features. We think this is because names are relatively short compared with general words. Another positive aspect of these results is that combining 2-gram features with combination of letters further improves on the accuracy of 2-gram features alone.
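To put these percentages in perspective, the short sketch below turns the reported accuracies into gains in percentage points and into approximate counts of correctly classified names. It assumes that every accuracy was measured on the 1219 test instances described earlier; the helper names `gain` and `approx_correct` are hypothetical.

```python
# Accuracies reported in the text, in percent.
TEST_INSTANCES = 1219  # assumption: all accuracies use this test set
UNIGRAM_ONLY = 83.92
COMBINED = {"J48": 86.55, "SVM": 84.74, "KNN": 86.30, "Random Forest": 86.30}


def gain(combined, baseline=UNIGRAM_ONLY):
    """Improvement in percentage points over the unigram-only accuracy."""
    return round(combined - baseline, 2)


def approx_correct(accuracy_percent, total=TEST_INSTANCES):
    """Approximate number of correctly classified test names."""
    return round(accuracy_percent / 100 * total)


if __name__ == "__main__":
    for classifier, accuracy in COMBINED.items():
        print(
            f"{classifier}: +{gain(accuracy):.2f} points, "
            f"about {approx_correct(accuracy)} of {TEST_INSTANCES} correct"
        )
```

Under that assumption, the decision tree gains 2.63 points, which corresponds to roughly 32 more names classified correctly (about 1055 instead of about 1023).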

V. CONCLUSIONS AND FUTURE WORK
The results show that the newly proposed feature, combination of letters in a name, together with the unigram feature, is the best feature set for predicting gender from names written in Urdu (in Roman script, as in English). We obtained the highest accuracy, 86.54%, with the J48 classifier. This shows that textual features alone can achieve good gender prediction from names, whereas existing works have achieved a similar level of accuracy by using both speech and textual features.
As mentioned above, gender prediction from names is very important for the expert finding task. We have focused on gender prediction in this work, and we leave for future work the task of predicting the geographical background of an author from his/her text. For example, we might build a collection of texts on the same topic by authors from different locations and then find hidden patterns in their writing to determine their geographical background automatically. This task can be very helpful in determining political orientations or extremist attitudes on different topics.

TABLE III.
RESULTS FOR DECISION TREE CLASSIFIER

TABLE IV.
RESULTS FOR SVM CLASSIFIER

TABLE VI.
RESULTS FOR RANDOM FOREST CLASSIFIER