Deep Learning for Object Classification in Retail Stores
Idawati Bustan, Mervyn Wee, Timothy Chong
{idawati, wyrm, timchong}@stanford.edu
Stanford University

Introduction
Being able to recognize products from images would be useful for automating inventory management in the retail space. We took pictures of products on shelves in stores and investigated the use of a Convolutional Neural Network (CNN) to accurately recognize these products. By varying the way in which we constructed the training data, the CNN obtained good accuracy not only on products it had seen before, but also on products it had never seen before.

Model
The retail stores we took images from carried a limited variety of products, leading to a small training set. We used transfer learning to overcome this problem. Transfer learning allowed us to take a CNN fully trained on another set of classes and retrain just the final layer for our task (Torrey & Shavlik, 2009). In effect, this reduced training time and allowed us to work with a small dataset (Soekhoe et al., 2016).

Data
Retail stores carry thousands of products. To start, we targeted classifying eight classes of products. We constructed three CNN models (M1, M2, M3) that each received a different training set: M1 was trained on in-store images, M2 on images from the Internet, and M3 on a mix of both. The rationale was to see how a mix of in-store and outside data could help generalize the model (for future products) while keeping it specific enough to classify current products. All three models were tested on the same test set in order to compare which model performed better.

Result
M3 made the best predictions on seen and unseen products, followed by M1, and then M2. Despite M3 performing the best, we found that many predictions had a probability of only 20% to 40%, even when these predictions were correct. The first step we would take would be to increase the confidence of these predictions so that the model is better trained. This could be done by training it on more data or increasing the number of epochs when training the CNN.
Discussion
We had expected M3 to perform the best, followed by M1 and then M2. We were worried that the limited variety of products in retail stores meant that M1 would have low accuracy on products it had not seen before. M3 showed that adding images of products from the Internet to the training set increases the accuracy of such predictions.

Confusion Matrices (25 seen & 25 unseen items per model; rows = predicted class, columns = actual class)

Model 1, seen items:
        Can  Tape  Drill  Hanger  Paint  Hammer  Axe  Shower
Can       9     0      0       0      0       0    0       0
Tape     11    25      0       0      0       0    0       0
Drill     0     0     25       2      0       0    0       0
Hanger    0     0      0      23      0       0    0       0
Paint     4     0      0       0     25       0    0       0
Hammer    1     0      0       0      0      23    2       0
Axe       0     0      0       0      0       2   23       0
Shower    0     0      0       0      0       0    0      25

Model 1, unseen items:
        Can  Tape  Drill  Hanger  Paint  Hammer  Axe  Shower
Can      22     0      0       1      0       2    0       0
Tape      0    14      0       0      0       0    0       0
Drill     0     0     25       1      0       3    0       0
Hanger    0     0      0      18      0       0    0       0
Paint     3     0      0       0     25       0    0       1
Hammer    0     3      0       5      0      20    1       0
Axe       0     0      0       0      0       0   24       0
Shower    0     8      0       0      0       0    0      24

Model 3, seen items:
        Can  Tape  Drill  Hanger  Paint  Hammer  Axe  Shower
Can      14     0      0       0      0       0    0       0
Tape      9    25      0       0      0       0    0       0
Drill     0     0     25       2      0       0    0       0
Hanger    0     0      0      23      0       0    0       0
Paint     2     0      0       0     25       0    0       0
Hammer    0     0      0       0      0      24    1       0
Axe       0     0      0       0      0       1   24       0
Shower    0     0      0       0      0       0    0      25

Model 3, unseen items:
        Can  Tape  Drill  Hanger  Paint  Hammer  Axe  Shower
Can      24     0      0       0      0       0    0       0
Tape      0    16      0       0      0       0    0       0
Drill     0     0     25       1      0       0    1       0
Hanger    0     1      0      21      0       0    0       0
Paint     1     0      0       0     25       0    0       1
Hammer    0     2      0       2      0      25   12       0
Axe       0     0      0       0      0       0   12       0
Shower    0     6      0       1      0       0    0      24

Model 2, seen items:
        Can  Tape  Drill  Hanger  Paint  Hammer  Axe  Shower
Can      18     0      0       0      5       3    1       2
Tape      7    25      3       2      0       0    3       1
Drill     0     0     22       0      0       0    0       0
Hanger    0     0      0       8      0       0    0       0
Paint     0     0      0       0     20       0    0       0
Hammer    0     0      0       1      0      12    4       0
Axe       0     0      0       0      0       9   17       0
Shower    0     0      0      14      0       1    0      22

Model 2, unseen items:
        Can  Tape  Drill  Hanger  Paint  Hammer  Axe  Shower
Can      25     0      0       3      2       1    0       2
Tape      0    24      0       0      0       2    0       6
Drill     0     0     25       0      0       0    1       0
Hanger    0     0      0       3      0       0    0       0
Paint     0     0      0       0     23       0    0       0
Hammer    0     1      0       1      0      19   11       0
Axe       0     0      0       1      0       3   13       0
Shower    0     0      0      17      0       0    0      17

[Figure: Overall results (50 test images per model) for Model 1, Model 3, and Model 2.]
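The summary statistics behind the model ranking can be recovered directly from the confusion matrices above: overall accuracy is the diagonal sum over the total, and per-class recall divides each diagonal entry by its column (actual-class) total. A minimal sketch using the Model 3 seen-items matrix, with NumPy as an assumed tool (the poster does not describe its analysis code):

```python
import numpy as np

# Model 3, seen items: rows = predicted class, columns = actual class.
labels = ["Can", "Tape", "Drill", "Hanger", "Paint", "Hammer", "Axe", "Shower"]
cm = np.array([
    [14,  0,  0,  0,  0,  0,  0,  0],
    [ 9, 25,  0,  0,  0,  0,  0,  0],
    [ 0,  0, 25,  2,  0,  0,  0,  0],
    [ 0,  0,  0, 23,  0,  0,  0,  0],
    [ 2,  0,  0,  0, 25,  0,  0,  0],
    [ 0,  0,  0,  0,  0, 24,  1,  0],
    [ 0,  0,  0,  0,  0,  1, 24,  0],
    [ 0,  0,  0,  0,  0,  0,  0, 25],
])

accuracy = np.trace(cm) / cm.sum()               # correct / total = 185/200
recall = np.diag(cm) / cm.sum(axis=0)            # per actual class (columns)

print(f"accuracy = {accuracy:.3f}")              # 0.925
print(f"Can recall = {recall[labels.index('Can')]:.2f}")  # 14/25 = 0.56
```

The low recall on Can (often confused with Tape or Paint) illustrates why per-class recall is worth reporting alongside the overall accuracy that the ranking is based on.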
We also expected that M2 would give poor predictions because images from the Internet were significantly different from images from the store. M3 was a balanced model which did well on both seen and unseen test images. For future consideration, M2 performed best on products that had a specific shape with low variation.

References
Soekhoe, D., van der Putten, P., and Plaat, A. (2016). On the impact of data set size in transfer learning using deep neural networks.
Torrey, L. and Shavlik, J. (2009). Transfer learning. In E. Soria, J. Martin, R. Magdalena, M. Martinez, and A. Serrano, editors, Handbook of Research on Machine Learning Applications. IGI Global.

December 13, 2016. Machine Learning, Computer Science.