Unstructured data is everywhere: in posts, status updates, bloglets and news feeds on social media, and in customer interactions recorded in call center CRM systems. While many organizations study and monitor social media to track brand value and target specific customer segments, in our experience blending unstructured data with structured data to supplement data science models has been far more effective than working with it independently.
In this talk we will showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models that predict commodity futures from tweets, and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
Data Parallelism: Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
– Ex: Build one Churn model for each state in the US simultaneously, when customer data is distributed by state code.
Task Parallelism: Split the problem into independent sub-tasks which can be executed in parallel.
– Ex: Build one Churn model in parallel for the entire US, though customer data is distributed by state code.
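The data-parallel example above (one churn model per state) can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, the `fit_line` helper and the variable names are hypothetical, a thread pool stands in for the database segments, and a straight-line fit stands in for a real churn model.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy rows: (state_code, monthly_usage, churned)
rows = [
    ("CA", 10.0, 1), ("CA", 20.0, 0), ("CA", 30.0, 0),
    ("NY", 5.0, 1), ("NY", 15.0, 1), ("NY", 25.0, 0),
]


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x on one group's rows."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Group the rows by state code, mirroring data distributed by state.
groups = defaultdict(list)
for state, usage, churned in rows:
    groups[state].append((usage, churned))

# Each group is fitted independently; no task needs another task's data.
with ThreadPoolExecutor() as pool:
    fitted = pool.map(fit_line, groups.values())
    models_by_state = dict(zip(groups.keys(), fitted))
```

Because no group depends on any other, the number of models that can run at once is bounded only by the number of groups and workers.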
Going Beyond Data Parallelism
Data-parallel computation via PL/Python libraries only allows us to run ‘n’ models in parallel.
This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.
For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
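To make the idea of a single model built across distributed data concrete, here is a minimal sketch of one common approach: each segment reduces its own rows to a small set of partial sums, the partial sums are merged, and one model is solved from the merged result. This is not MADlib's API or a description of its internals; the data and the helper names (`partial_sums`, `merge`, `finalize`) are hypothetical, and simple linear regression is used only because its sufficient statistics are easy to show.

```python
from functools import reduce

# Hypothetical rows (x, y) held on two segments, distributed by state code.
segments = {
    "CA": [(10.0, 1), (20.0, 0), (30.0, 0)],
    "NY": [(5.0, 1), (15.0, 1), (25.0, 0)],
}


def partial_sums(points):
    """Per-segment step: reduce local rows to (n, sum x, sum y, sum x^2, sum xy)."""
    return (
        len(points),
        sum(x for x, _ in points),
        sum(y for _, y in points),
        sum(x * x for x, _ in points),
        sum(x * y for x, y in points),
    )


def merge(a, b):
    """Combine the partial sums of two segments by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def finalize(state):
    """Solve y = intercept + slope * x from the merged sums."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Only the five-number summaries leave each segment, never the raw rows.
partials = [partial_sums(points) for points in segments.values()]
intercept, slope = finalize(reduce(merge, partials))
```

The result is identical to fitting one model on the concatenated rows, which is what separates this from running one model per group.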
– Identify customers who are likely to churn and prevent them from leaving.
Challenges
– Cost of acquiring new customers is high
– Acquisition cost is hard to recoup if the customer is not retained long enough
– Barrier to switching is low for subscribers
– With mobile number portability, the barrier to switching is even lower
Good News
– Cost of retaining existing customers is lower!
• Tweets alone had significant predictive power for the commodity of interest to us. We expect much better results when they are blended with structured features like weather data.