Seniority Classification of Job Titles
The Data Mine Corporate Partners Symposium 2021
CATHERINE MAO, EVAN SHAW, ADRIENNE ZHANG, JACOB ZHANG

REFERENCE/ACKNOWLEDGEMENTS
Thanks to our mentor, Reuben Wilson, for his help.

INTRODUCTION/BACKGROUND/MOTIVATION
• These are the six categories that our team uses to classify job titles, with a description of each category and some common examples that would be classified under it

RESEARCH METHODOLOGY
Before training models on the dataset, our team preprocessed the raw data:
• We use one-hot encoding for the classification labels for computational efficiency, because this is a multi-class classification problem (each job title receives exactly one of the six labels)
• Instead of using classification labels as strings, we represent them as vectors
• We used two types of encoder to turn raw text into vectors (the encoding process):
  • Universal Sentence Encoder: from Google; we use it to encode text into high-dimensional vectors that can be used for text classification
  • SpaCy: performs tokenization and encoding with pretrained word vectors
• Below is an example of how texts can be encoded into vectors

CONCLUSIONS
• Eventually we moved on to the evaluation step
• For our deep neural network model, the average K-fold validation accuracy is 0.895 with a standard error of 0.003
• To the left are two graphs of model accuracy and loss during training
• We tuned the model several times to raise its accuracy and lower its loss
• Considering that we are classifying among 6 labels, we think this accuracy score is fairly high
• Using Azure Machine Learning, we found that the best model is stochastic gradient descent with MaxAbsScaler
• MaxAbsScaler scales each feature by its maximum absolute value
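To make the MaxAbsScaler description concrete, here is a minimal pure-Python sketch of the same per-feature arithmetic. It is not the Azure or scikit-learn implementation; the function name `max_abs_scale` and the sample feature matrix are hypothetical and chosen only for illustration.

```python
def max_abs_scale(rows):
    """Scale each feature (column) by its maximum absolute value."""
    if not rows:
        return []
    n_features = len(rows[0])
    # Largest absolute value seen in each column.
    max_abs = [max(abs(row[j]) for row in rows) for j in range(n_features)]
    # All-zero columns are left unchanged to avoid dividing by zero
    # (an assumption of this sketch).
    scales = [m if m != 0 else 1.0 for m in max_abs]
    return [[row[j] / scales[j] for j in range(n_features)] for row in rows]


# Hypothetical feature matrix: 3 instances, 3 features.
features = [
    [1.0, -2.0, 0.0],
    [2.0, 4.0, 0.0],
    [-4.0, 1.0, 0.0],
]
scaled = max_abs_scale(features)
for row in scaled:
    print(row)
```

After scaling, every feature lies in the range [-1, 1], and the signs and zeros of the original values are preserved.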
TMap
• A small company that uses technology and targeted marketing to identify and engage qualified employment at scale
• The motivation for this project is to classify job titles by seniority to make job titles more accessible
• Match candidates with roles or job opportunities

• This is the resulting confusion matrix. It shows that senior individual contributors (5th label) are the hardest category to classify.
• This is because our dataset is not balanced, so many instances are misclassified as individual contributors.

FUTURE GOALS
• Keep improving the Ontology to recognize "skill"-related terms
• Keep cleaning up and normalizing the dataset

• This is an example from our dataset.
• We started with 5,000 instances and have expanded the dataset to 11K instances so far. On the left is part of a 2-D array in which each row vector is the label for one instance. For example, the first label corresponds to the last of our 6 categories, which is student.

After preprocessing, we started to train models:
• We implemented a simple deep neural network with TensorFlow
• To the right is a summary of the chosen model after a few trials of parameter tuning
  • It has 2 layers with dropout to prevent overfitting
  • At the last layer, we use softmax activation since this is a multi-class classification
• We used the Microsoft Azure Machine Learning platform
  • To check which model is best for our input data
  • This platform automatically samples part of the data and runs training/testing for each model

This is the softmax activation function we refer to.
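For reference, softmax turns the final-layer scores z_1, ..., z_K into probabilities: softmax(z)_i = exp(z_i) / (exp(z_1) + ... + exp(z_K)). The sketch below shows this together with a one-hot label for the "student" category in plain Python. It is an illustration only, not the TensorFlow code used in the project; the helper names and the example scores are hypothetical.

```python
import math

NUM_CLASSES = 6  # six seniority categories, as stated on the poster


def one_hot(index, num_classes=NUM_CLASSES):
    """Return a one-hot label vector with 1.0 at the given class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing;
    # it does not change the resulting probabilities.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# "student" is the last of the six categories, so its label
# has the 1.0 in the last position.
student_label = one_hot(NUM_CLASSES - 1)

# Hypothetical final-layer scores for one job title.
logits = [0.2, -1.0, 0.5, 0.1, 1.3, 3.0]
probs = softmax(logits)

# The predicted category is the one with the highest probability.
predicted = probs.index(max(probs))
print(student_label, predicted)
```

Because softmax outputs sum to 1 across the six categories, the model assigns each job title to exactly one category, which matches the one-hot label format.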