
Seniority Classification of Job Titles

May 14, 2022

Transcript
Page 1: Seniority Classification of Job Titles

INTRODUCTION/BACKGROUND/MOTIVATION

RESEARCH METHODOLOGY

CONCLUSIONS

REFERENCE/ACKNOWLEDGEMENTS

Seniority Classification of Job Titles

The Data Mine Corporate Partners Symposium 2021

• Before feeding the dataset into our models, our team did some preprocessing on the raw data:

• We use one-hot encoding for the classification labels: instead of representing each label as a string, we represent it as a vector, which is more efficient for computation.

• We used two types of encoder that turn raw text data into a vector (the encoding process):

• Universal Sentence Encoder: from Google; we use it to encode text into high-dimensional vectors that can be used for text classification

• spaCy: performs tokenizing and encoding with pretrained word vectors

• Below is an example of how text can be encoded into vectors
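The label side of this preprocessing can be sketched in a few lines of NumPy. The poster names only some of the six categories (individual contributor, senior individual contributor as the 5th label, student as the last), so the remaining category names below are placeholders, not the team's actual label set:

```python
import numpy as np

# Hypothetical 6-category label set; only "individual_contributor" (4th),
# "senior_individual_contributor" (5th), and "student" (last) are named
# on the poster -- the first three are illustrative placeholders.
CATEGORIES = ["c_level", "vp", "director", "individual_contributor",
              "senior_individual_contributor", "student"]

def one_hot(label, categories=CATEGORIES):
    """Encode a classification label as a one-hot vector."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(label)] = 1
    return vec

print(one_hot("student"))  # -> [0 0 0 0 0 1], the last of the 6 categories
```

Each job title's label becomes one such row vector, which is what the model trains against.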

Thanks to our mentor Reuben Wilson for his help

CATHERINE MAO, EVAN SHAW, ADRIENNE ZHANG, JACOB ZHANG

• These are the six categories our team uses to classify job titles, with a description of each category and some common examples that would be classified under it

• Eventually we moved on to the evaluation step

• For our Deep Neural Network model, the average K-fold accuracy (the cross-validation accuracy of our model) is 0.895 with a standard error of 0.003

• To the left are two graphs of model accuracy and loss during training

• We did several rounds of tuning so that model accuracy gets higher and the loss gets lower

• Considering we are classifying among 6 labels, we think the accuracy we got is quite high

• From Azure Machine Learning, we found that the best model is stochastic gradient descent with MaxAbsScaler

• MaxAbsScaler scales each feature by its maximum absolute value
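MaxAbsScaler's behavior is simple enough to sketch directly in NumPy (this mirrors what scikit-learn's `MaxAbsScaler` does; the sample matrix is made up for illustration):

```python
import numpy as np

def max_abs_scale(X):
    """Scale each feature (column) by its maximum absolute value,
    mapping values into [-1, 1] without shifting or centering the data."""
    max_abs = np.abs(X).max(axis=0)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # avoid divide-by-zero
    return X / max_abs

X = np.array([[1.0, -2.0],
              [2.0,  4.0]])
print(max_abs_scale(X))  # columns divided by 2 and 4 -> [[0.5 -0.5] [1. 1.]]
```

Because it only divides and never re-centers, it preserves sparsity, which is why it pairs well with high-dimensional text features.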

TMap

• A small company that uses technology and targeted marketing to identify and engage qualified employees at scale

• The motivation for this project is to classify job titles based on seniority, to make job titles more accessible

• Match candidates with roles or job opportunities

• This is the resulting confusion matrix. We can see that senior individual contributors (the 5th label) are the hardest category to classify.

• This is because our dataset is not balanced, so many instances are misclassified as individual contributors.
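A confusion matrix like the one on the poster can be built with a few lines of NumPy; the true/predicted indices below are hypothetical, chosen only to show the senior-IC-misclassified-as-IC pattern described above:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=6):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical indices: 3 = individual contributor, 4 = senior individual contributor
y_true = [4, 4, 3, 5]
y_pred = [3, 4, 3, 5]  # one senior IC predicted as plain IC
cm = confusion_matrix(y_true, y_pred)
print(cm[4, 3])  # -> 1: the off-diagonal cell the poster calls out
```

On an imbalanced dataset, the off-diagonal mass concentrates in exactly this kind of cell: the minority class leaking into its larger neighbor.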

FUTURE GOALS

• Keep improving the ontology to recognize "skill"-related terms

• Keep cleaning up and normalizing the dataset

• This is an example from our dataset. We started with 5,000 instances at first and have expanded it to 11K instances so far.

• On the left is a partial 2-D array. Each row vector is the one-hot label for one instance; for the first instance, the 1 falls in the last of our 6 categories, which is student.

After preprocessing, we started to train models:

• We implement a simple deep neural network with TensorFlow

• To the right is a summary of the chosen model after a few trials of parameter tuning

• It has 2 layers, with dropout to prevent overfitting.

• At the last layer, we use softmax activation, since each job title is assigned exactly one of the six classes (multi-class classification).

• We used the Microsoft Azure machine learning platform to check which model is best for our given input data

• This platform automatically samples part of the data and runs training/testing for each model
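A minimal TensorFlow/Keras sketch of the kind of network described above. The layer widths, dropout rate, and 512-dimensional input (the size of Universal Sentence Encoder embeddings) are assumptions for illustration, not the poster's exact configuration:

```python
import tensorflow as tf

# Sketch: 2 dense layers with dropout, ending in a 6-way softmax.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),             # assumed embedding size
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # assumed dropout rate
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(6, activation="softmax"),  # one of six seniority classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",       # pairs with one-hot labels
              metrics=["accuracy"])
```

With one-hot labels, `categorical_crossentropy` plus a softmax output is the standard pairing for this kind of single-label, multi-class setup.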

This is the softmax activation function we refer to
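The softmax function shown on the poster maps the last layer's raw scores to a probability distribution over the six classes, softmax(z)_i = exp(z_i) / Σ_j exp(z_j); a NumPy sketch with made-up scores:

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, 0.0, -1.0, -2.0])  # hypothetical logits
p = softmax(scores)
print(p.sum())  # probabilities sum to 1; the largest score wins
```

The predicted class is simply the index with the highest probability, which is why softmax suits a problem where each title gets exactly one label.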
