Show Notes: http://www.superdatascience.com/107
SDS PODCAST
EPISODE 107
WITH
GABOR SOLYMOSI
Kirill: This is episode number 107 with data scientist at Utopus
Insights, Gabor Solymosi.
Welcome to the SuperDataScience podcast, my name is Kirill
Eremenko, data science coach and lifestyle entrepreneur, and
each week we bring inspiring people and ideas to help you
build your successful career in data science. Thanks for being
here today, and now let’s make the complex simple.
[Background music plays]
Kirill: Welcome everybody to the SuperDataScience podcast, super
pumped to have you on board and today I’ve got a special
guest, a friend whom Hadelin and I met during our European
road trip, Gabor.
Gabor is from Budapest, Hungary, and that was I think our
third stop during our road trip. It was very exciting to meet
everybody there and Gabor’s story especially resonated with
me because of his dreams and passions and how he works to
accomplish them, how he works towards them. And I’m very
excited to hear that since the road trip which was a couple of
months ago, Gabor has made progress in his career, he’s
got a new job and he’s actually working towards his goals as
you’ll see from this podcast.
Gabor is a very interesting person, very passionate about data
science. We’ll talk about three of the roles that he’s had to
date in the space of data science, we’ll talk about things like
text analytics, survival analysis, and jumping into an industry
completely foreign to him, how you’re able to switch from one
industry to another, being a data scientist and transferring
those data science skills and what is the experience of that,
what he is going through as he’s moving to something
completely different. As you’ll hear from the podcast it’s very,
very, exciting, the industry that he’s just jumped into. He’s
working with solar and wind turbine energy. Who would have
thought they also need data scientists there?
So, quite a lot of interesting things we talked about here, but
probably the main thing I’d like you to focus on is the path,
the way that Gabor has intentionally chosen the roles in his
data science career, and how he’s working towards his
dreams. I can’t wait for you to hear his story, and let’s get
started. Without further ado, I bring to you Gabor Solymosi
who is a data scientist at Utopus Insights.
[Background music plays]
Kirill: Welcome everybody to the SuperDataScience podcast, today
I’ve got a very exciting guest, a friend of mine from Budapest,
Hungary, Gabor Solymosi. How are you doing, Gabor?
Gabor: Thank you very much, Kirill. I’m really excited to be here
actually.
Kirill: Awesome. Cool to hear you. Can you remind us how we met,
where did we meet? So that the listeners can get a bit
acquainted with our story.
Gabor: I’d known you before because I took several of your classes on
Udemy, but then once I got an email that Kirill is coming to
Hungary, Budapest, I took the opportunity and we met there.
It was quite a nice dinner and we went on a few bar trips, let’s
say. It was quite interesting and we had a lot of good talks
that time.
Kirill: Yeah, exactly, it was fun. It was during the road trip, a lot of
you listening to this podcast might know that Hadelin and I
did a road trip this summer through Europe. One of our stops
was Budapest, Hungary and so we met quite a few of our
students there.
It was interesting talking because you or someone else was
saying that you were very surprised that we came to
Budapest. We started with Italy then we went to Munich and
then the next email that came out was, we’re going to
Budapest. Were you a bit surprised at that or were you
expecting it?
Gabor: Yeah, actually I didn’t expect it. I was checking the mails that
you were doing this Europe trip and I didn’t think that you
were going to come to Budapest. I thought okay, maybe
Prague or some other bigger cities. But I think you had a good
time here also.
Kirill: Yeah it was good and thanks a lot for showing us around, a
very interesting city. If somebody hasn’t been to Budapest, we
liked it quite a bit, Hadelin really fell in love with the city. It’s
got this big massive river. What’s the river called again?
Gabor: It’s the Danube.
Kirill: The Danube, the river, it’s a big river. It goes through a lot of
countries in Europe, but in Budapest it’s really wide and
you’ve got two parts to the city, you’ve got Buda and Pest. I
think the story goes that they were two separate cities that
were growing on both sides of the river and at some point,
they just decided to become one city, is that right?
Gabor: Yeah, it’s kind of like the short summary of how we got
together.
Kirill: Then Gabor and some other students showed us around the
city. It’s got quite a lot of monuments and we even saw the
statue to Gabriel and that’s where you told me that your name
Gabor is a derivative of Gabriel, is that right?
Gabor: Yeah, that’s correct.
Kirill: That was very interesting to learn, I never knew that before.
Anyway, so we’re here to talk about data science and your
journey into the space of data science, so tell us a bit about
what you do. You told me just before the podcast, you got a
new job, congratulations.
Gabor: Thank you very much. I have recently changed my job from
one company to another. I’m working as a contractor data
scientist for an exciting new energy analytics company called
Utopus Insights. As I told you, it’s actually a spinoff from IBM
Research and it’s headquartered in New York but I work from
Budapest, Hungary. It’s kind of a remote job.
Kirill: That’s so cool. Let’s just realign a bit. Where did you work
before Utopus Insights?
Gabor: I was also working as a data scientist at XAPT. I will go into
details with that also because it was really interesting. So,
that’s my two first data science jobs. Before that, I was a data
analyst but it was something a bit different. Now I’m really
happy to be here because it’s really cool. Basically, I’m
working from home most of the time, which has its benefits of
course, and disadvantages as well since I have a lot of time
dealing with things around the house or waking up a little bit
later or doing work outs every morning.
Kirill: Getting distracted.
Gabor: Yeah, that’s true but on the other hand, of course, it can be a
bit boring sometimes. I have regular Skype meetings with the
others in the States and here we have a team in Budapest
with whom I regularly meet. Actually, it’s quite nice.
Kirill: You said it’s a contract. When you were working with XAPT
was it also a contract or was it a full-time job?
Gabor: No, that was a full-time job. Actually, it’s also like kind of full-
time but I’m working through a major company.
Kirill: You traded in a full-time secure job for a contract, is that
right?
Gabor: Yeah. It’s like that. But of course, it’s really interesting and
it’s really exciting for me to work with this now because I really
wanted to do something with the energy industry. How to help
the future, doing something with renewables and these
things. It’s really interesting.
Kirill: I can totally imagine. But the whole concept is very interesting
because a lot of people wouldn’t do that. They would think,
this is a full-time job versus a contract, a contract can expire,
versus a full-time job, I’m very secure in what I’m doing. Was
it a hard decision to make, to give up that security of your job,
of your income, and to go for something more exciting but
something that’s a contract, that can end and might not be
renewed?
Gabor: When I was thinking about it, I didn’t want to change jobs at
the time, but it was quite an opportunity for me because I
have a friend who said that they have an opening in the
Budapest office and I really wanted to do something with this
renewable energy data science stuff. And for this, you know,
yeah, I kind of traded my secure things for a contract but it
was worth it, I think. Of course, it’s a contract but I wouldn’t
change, it’s not for more money and it’s really exciting.
Kirill: Yeah, I know. That’s really cool and very inspiring to hear as
well because, after our chat in Budapest … Probably I should
mention this to the listeners. This was one of my most
inspiring conversations that I had on the road trip because
when we were talking, you said that, look, there’s certain
dreams that I have and goals, and ideally, I’d love to live … Do
you remember that conversation about Spain?
Gabor: Of course, yeah.
Kirill: What did you say? Tell us about your dream. What is your
dream in relation to Spain?
Gabor: I lived in Spain because I did my Erasmus semester in
Barcelona and I really loved the spirit of the city and it’s just
extremely cool and I’ve always wanted to go back there since
I was there for this one semester. My dream was to just find
a job there with this hotness and everything. Here, I know
everything in Hungary, in Budapest, and it’s just not that
exciting. I wanted something more, of course I’ve always
wanted more. With this one now I think it’s kind of great.
Kirill: Yeah. Do you like the Spanish language?
Gabor: Yeah, of course. I actually learned Spanish so I know a couple
of things. I’m not perfect but I know the basics and I really
like it. I learned Catalan also because of Barcelona.
Kirill: That’s really cool. All right, what I was just going to say is that
Gabor’s dream is to live in Spain and work in Spain and so
on, and one of the things that we discussed during our catch
up was that, you remember you said you were a bit
disappointed that unfortunately the economy in Spain isn’t
the best right now and it might be hard to find a job and so
on. And I mentioned that you don’t really have to find a job in
Spain, you can live in Spain but you can work as a freelancer
through Upwork or through other websites, and it’s very
exciting to hear that now you have a remote job. You just got
a remote job where you are working from home and in my
view, it’s like a step towards that goal and it’s very inspiring
to hear that you are on that journey already.
Gabor: Thank you very much. Actually, it’s really exciting and I also
feel like it’s kind of an improvement since we last talked.
Kirill: Awesome. So, you just moved from XAPT to Utopus Insights,
tell us a bit about the work that you do. In what space of data
science are you at the moment?
Gabor: I’m kind of a data scientist/analytics engineer. I’m involved in
multiple projects that focus on forecasting the performance of
renewable energy farms, like solar farms, wind farms,
turbines and so on. That’s what I currently do. It involves a
lot of statistical learning methods and a lot of mathematics
also and a lot of engineering. I actually don’t have an electrical
engineering industry background but with these people here,
they help and I bring the data science knowledge also, so it’s
kind of cool. I really like it.
Kirill: Okay. That’s pretty awesome. What does an analytics
engineer do? I’ve never heard of that profession before.
Gabor: It’s kind of data scientist stuff also. It’s just about building the
analytics platform, like in the databases and how to extract
the data, how to put it into the analytics platforms and these
things. Behind also I know the science stuff so that’s what it
is. It’s just how they call us.
Kirill: Okay. It’s a mix of a data scientist, a database architect, that
type? Like you do …
Gabor: Yeah. Kind of that. We’re working with a lot of software
developers who actually do this backend stuff, the deep
backend. But of course, I have to be involved in these things.
Kirill: Interesting. So, you work with wind turbines, what other
forms of energy? Is it solar?
Gabor: Yeah, it’s solar and wind.
Kirill: Solar and wind. Out of curiosity, which one is the most
efficient right now, out of the ones that you work with, not the
world standards or the leading world ones. The ones that you
work with, what do you find is more efficient, solar or wind?
Gabor: Actually, I’m not quite into that one yet so I don’t have too
much insight on which one is better, but probably in a few
months I could give some insights on this also.
Kirill: Okay, gotcha. All right, cool. When you say you do analytics
for solar and wind, what exactly do you do? Do you calculate
how much is consumed or do you calculate how much, the
maintenance requirements, what part of that analytics are
you involved in?
Gabor: Now, I’m actually involved in validating machine learning
models like forecasts and choosing the right metrics for
evaluation. Communicating with the other analysts and
engineers, software developers on what and how to improve,
this kind of stuff. Of course, it involves a lot of research.
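As a rough illustration of what choosing a metric for forecast evaluation can look like, the sketch below compares two common error measures on a power forecast. The numbers, the variable names and the choice of MAE and RMSE are assumptions made for illustration; the transcript does not say which metrics the team uses.

```python
import math

def mae(actual, forecast):
    """Mean absolute error, in the same units as the data (here kW)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Made-up hourly power output of a wind farm (kW) and a model's forecast.
actual_kw = [100.0, 120.0, 90.0, 110.0]
forecast_kw = [110.0, 115.0, 100.0, 100.0]

print("MAE: ", mae(actual_kw, forecast_kw))
print("RMSE:", rmse(actual_kw, forecast_kw))
```

RMSE is never smaller than MAE on the same data, so a large gap between the two is one hint that a few hours were forecast very badly.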
Kirill: What are you forecasting?
Gabor: We are forecasting the performance, the power of the wind
turbines, like wind and solar panels, how much power they
give.
Kirill: So how much energy we’ll have in the future?
Gabor: Yeah.
Kirill: That’s very interesting because like we all use energy, we all
use electricity, and we all hear about solar and wind and so
on, but I’ve never actually spoken to someone in this space.
It’s good to have an example that even in these industries, you
still need data science, you still need data scientists.
I was thinking originally maybe there’s very historical types of
roles and types of calculation like scientists or engineers that
are performing these estimates and forecasts but
nevertheless, you are a data scientist who’s working in this
space. And this is something new for you, right? When you
were working at XAPT, were you doing the same thing or was
your role related to something different?
Gabor: Well, it was a bit different but there I was working with
predictive algorithms, predictive maintenance analytics,
which is kind of involved because here we are also planning
to do something like that. As you mentioned, you thought that
it was scientists and these kinds of people who are doing these
forecasts and these things, well we have a meteorologist on
the team also, who is doing the weather forecast. And we have
a lot of electrical engineers and they have a vast background
of science, of the field.
Kirill: Okay. In XAPT, were you working with energy as well or
something else?
Gabor: No. Actually, there I was working on a project which was
called predictive maintenance for heavy machines. It was kind
of interesting, we were doing survival analysis there,
predictive algorithms through R server. I was creating like
these web APIs with R which was really cool, I really liked it.
Kirill: All right. We’ll get to that in a second. I just wanted to, again,
stress for those listening that before … When did you start at
Utopus, was this a few months ago?
Gabor: Yeah, it was a few months ago.
Kirill: Okay, so literally a few months ago, Gabor … How much
knowledge did you have about solar and wind energy and
their consumption and stuff like that? Were you an expert in
that field?
Gabor: Not too much. You know, if you’re really interested in
something, you just do the research, and that’s what I did before
applying for these things.
Kirill: And that’s why it’s so exciting because, like two months ago
or so you had no knowledge of that industry, or very little
knowledge about what solar turbines, how they work, what
their energy flow is, efficiency and so on and the same thing
for … Sorry not solar turbines, solar farms and solar panels,
and then wind turbines, and the same thing. But all you had
was like your data science skills, your machine learning skills
and so on, and you brought that, and now two months later,
you’re in a completely new field, something very interesting. I
think it’s a very inspiring example for those listening that if
you’re interested in something, even as complex as solar
energy, you can just go and become a data scientist there. If
you’re interested in wind turbines, you can go and become a
data scientist there, regardless of your background.
What I’m getting to, is that somebody might think that you
have to be an expert in solar to even be considered for a role
in solar. No, you don’t. Like Gabor here is showing by
example, you just have to be a data scientist or like be
confident in your skills, do some research, and then go
there. I think it’s a good testament as well to the
transferability of data science skills that you can go from one
industry to another very quickly. Like in your case, from
heavy machinery which doesn’t have that much to do with
solar in the first place, and you can just move to solar energy
or wind turbines or whatever. So, basically, guys, dream big
and wherever you want to work, whatever is your passion, you
will be able to get in there quite quickly.
Gabor: Yeah, that’s true. You summed it up really good. Actually, the
funny thing that I will go into in a bit is that before doing data
science for heavy machinery, I was doing text analytics.
Kirill: Text analytics. That’s awesome. There you go, that’s a jump.
And that’s when you were a data analyst? What was the
company called there?
Gabor: Yeah, it was Sykes.
Kirill: Sykes. Before XAPT you were at Sykes and it’s text analytics,
that’s so cool. Such a big change from text analytics to
working with heavy machinery to now working in the space of solar
and stuff like that. You touched on a very interesting topic, I
think we should expand on that more because I haven’t heard
anybody on the podcast talk about it yet. Survival analysis.
I’ve heard a little bit about it, I’ve read a bit about survival
analysis. Could you give us an overview, what it’s all about
and how does it work?
Gabor: Yeah. Actually, I wanted to talk about it of course because it’s
really interesting. Did you know that it’s actually one of the
oldest statistical disciplines? It has roots in demography and
actuarial science like economics, and it dates back to the 17th
century. I did a lot of research on it because it’s so interesting.
In the beginning, it was most importantly used in
demographical analysis like vital statistics that deals with
statistics on birth, deaths, marriages, divorces and these
kinds of things. Since then of course, it became widely used
in other fields as well, like economics, failure analysis and
mechanical systems, like what I did for heavy machines. It’s
actually about analysing data where your outcome variable is
the time until an event. For example, it can be death or
marriage, or failure or something.
Kirill: Sorry to interrupt. So, in marriage, survival is how long you
can survive until you get married?
[Laughter]
Gabor: Yeah.
Kirill: That’s so funny. Puts marriage in a bad light. But okay,
gotcha. It’s just a term. I guess it comes from where it
originated. It originated with like how long people live, like
before they get sick, or before they die, or something like that.
Gabor: Yeah, that’s it. If you think about it, that the outcome is kind
of like a continuous variable, but it’s not continuous because
it’s time. It’s actually a generalized form of a high dimensional
regression analysis and it’s really interesting. It’s really cool
and this is the one thing, a good example that we were talking
about a few minutes before. You can just apply it on anything.
Kirill: When you say the outcome is not continuous, what do you
mean by that? Like it’s time, right, it can be …
Gabor: Yeah. It’s kind of like continuous. For example, if you want I
can talk a bit more about it of course.
Kirill: However you want to structure this. You’re the expert, just
tell us about survival analysis. What do we need to know?
What’s the most important fun stuff?
Gabor: Okay. Actually, what I did it’s also like this time that you’re
measuring or time to event, or the survival time, it can be
measured in whatever you want, so in days, weeks, years. It’s
a continuous variable, let’s say.
For example, if the event of interest is like a failure, then the
survival time can be the time in days or hours or even years
until for example a machine develops a failure, let’s say. It has
a lot of interesting terms also, like censored and uncensored
observations. There are two kinds of subsets of the data that
you can deal with, the censored and the uncensored one. In
the censored one the event hasn’t happened yet: you don’t have
any observation of the event for that machine, and then
it becomes hard to define the survival time at the end.
Kirill: Okay, and how do you go about it then?
Gabor: Yeah. Actually, there I incorporated some averages and other
statistics where you don’t have the exact time. I can’t talk
too much about these things.
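One standard way to make use of censored observations, not necessarily the approach Gabor took, is the Kaplan-Meier estimator. A censored machine counts as "at risk" up to the moment observation stopped, and simply drops out afterwards without counting as a failure. The sketch below is a minimal plain-Python version; the function name and the six-machine data set are made up for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with a failure.

    durations: how long each machine was followed (e.g. days in service)
    observed:  True if the failure was seen at that time, False if the
               machine was still running when observation stopped (censored)
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Machines still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Failures actually observed at time t.
        failures = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve

# Made-up data for six machines; two were still running when the study ended.
days = [5, 8, 8, 12, 15, 20]
failed = [True, True, False, True, False, True]

curve = kaplan_meier(days, failed)
for t, s in curve:
    print(f"day {t}: estimated survival {s:.3f}")
```

Dropping the censored machines entirely, or treating them as failures, would both bias the curve, which is why the status flag is carried alongside the time.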
Kirill: Okay. But let’s say in real life, would you use survival analysis
if you’re testing some sort of medicine? You have a population
of people and they’re … Or let’s say not even people. You have
like this group of mice, you want to see if this medicine helps
them live longer. Is that an example when you would use
survival analysis?
Gabor: Yes. Actually, if you have a good number of observations, of
course. In biostatistics, it’s really commonly used.
Kirill: All right. And so, what makes survival analysis stand out? Is
it just the fact that we are counting backwards, we’re looking
at how much time until these mice start unfortunately, dying,
or is it something else? Is there a certain reason why survival
analysis is so interesting, it has its own kind of domain?
Gabor: Unlike in ordinary regression models, here the dependent
variable in survival analysis is composed of two parts. One
is the time to the event of interest and the other is the event
status, which records whether the event of interest occurred or not.
From this you can define the censored and uncensored
observations of course. For example, here you can estimate
two functions that are dependent on time, the survival
function and the hazard function. These two functions are the
key concepts in survival analysis describing the distribution
of event times. For example, the survival function gives, for
every time, the probability of surviving, that is, of not
experiencing the event up to that time. It’s
starting at 1, it’s a positive valued monotone decreasing
function. So, when you’re going through time, of course, you
will get the score at every timespan let’s say, and as you’re
going forward in time, probably the score that you will survive
will decrease, that’s why it’s starting at 1 and it’s a positive
valued monotone decreasing function.
On the other hand, the hazard function gives the current
potential that the event will occur per time unit, given that
the individual has survived up to that specific time. It’s part
of the survival function also so it can change over time, for
example it’s increasing as components age, so it’s the
difference. It’s actually kind of the opposite, so the survival
function is going decreasingly and the hazard function is
going upwards.
Kirill: They’re not like, one is not the complement of the other, it’s
just that it’s normal that the older a person is or a machine
is, the more likely that something will go wrong.
Gabor: Yeah, it just gives you a hazard score but it’s not the 1-minus
this of the other function.
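For reference, the two functions discussed here have standard textbook definitions for a continuous time to event T. They are written out below to show why the hazard is not "1 minus" the survival function, and how the survival function can be rebuilt from the hazard. The hazard does not have to increase in general; it does for components that wear out with age, which is the case in this conversation.

```latex
% T is the (random) time until the event, e.g. an engine failure.
\begin{align*}
S(t) &= P(T > t), \qquad S(0) = 1, \\
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = -\frac{d}{dt} \ln S(t), \\
S(t) &= \exp\!\left(-\int_0^t h(u)\,du\right), \\
F(t) &= 1 - S(t) = P(T \le t).
\end{align*}
```

The complement of the survival function is F(t), the probability that the event has already happened by time t. The hazard is a rate per unit of time, so it can exceed 1, which a probability cannot.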
Kirill: Gotcha. So, just to recap, so you got the survival function
which gives you a score. Give us an example of the machine,
right, how would you apply that score in the case of heavy
machinery?
Gabor: For example, if you have an engine that you start the engine
and if it’s a brand-new engine, that of course at zero time your
survival score will be 1 because you just started the engine
and you probably think that it’s going to survive.
Kirill: Yeah, 100%.
Gabor: As you’re going through time, you’re going forward in time,
this score will just decrease and of course it could decrease
based on the features that you’re using, based on the
correlation between those features and how they correlate
with the output, of course with the time and the status and
everything that’s there.
Kirill: And so, for example if your survival function gives you like a
value of 0.6, 30 days after you started using the engine, that
means there’s a 60% chance that it survives, is that right?
Gabor: Yeah. Exactly.
Kirill: Okay. That’s like for one engine, you can think of it as
probability of survival. But in terms of like let’s say if you have
1,000 flowers, and then if on day 30 they have a 60% score or
0.6 score from the survival function, you can say that out of
those 1,000 flowers, only 600 will survive up to day 30.
Gabor: Yeah, you can put it that way also.
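To make the 0.6 example concrete, here is a small sketch that assumes the lifetimes follow a Weibull distribution, a common parametric choice for wear-out failures. The scale and shape values are hypothetical, picked only so that the survival score on day 30 comes out close to 0.6; nothing in the conversation says which model was actually used.

```python
import math

def weibull_survival(t, scale, shape):
    """S(t) = exp(-(t / scale) ** shape): probability of no failure up to t."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, scale, shape):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1); rises with t when shape > 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)

SCALE_DAYS = 42.0  # hypothetical characteristic life in days
SHAPE = 2.0        # hypothetical; a value above 1 models wear-out

s30 = weibull_survival(30, SCALE_DAYS, SHAPE)
expected_survivors = 1000 * s30

print(f"S(30) = {s30:.3f}")
print(f"expected survivors out of 1,000 on day 30: {expected_survivors:.0f}")
print(f"hazard on day 10:  {weibull_hazard(10, SCALE_DAYS, SHAPE):.4f} per day")
print(f"hazard on day 100: {weibull_hazard(100, SCALE_DAYS, SHAPE):.4f} per day")
```

The survival score starts at 1 and falls, while the hazard rate for this choice of shape keeps rising, matching the picture of a new engine versus an old one.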
Kirill: Okay, cool. And then with the hazard function, how does it
work?
Gabor: Actually with that you will get a hazard score which, while
that score is including the survival function of course, it’s an
increasing function not like the survival function. So, then
you will get a hazard score. For example, it’s almost like
minus the … 1-minus the survival function, but it just
incorporates. The survival function incorporates the hazard
function inside of it.
Kirill: Okay makes sense. Like at the start you have an engine, it’s
brand new, everything is okay, so the hazard will be like very
low. But then if you go forward and go further and like 100
days later you have that same engine, your hazard score will
be higher meaning that there’s a higher chance that it will
break down.
Gabor: Yes, true. And of course, these scores are assigned based on
the historical data that you are using, and the features …
Kirill: Okay, gotcha. I see now. So, the hazard function will take into
account that let’s say, the engine won’t survive for another,
like, one day or for another two days, is that right?
Gabor: Yes, it’s true.
Kirill: So it depends on the time that you want to evaluate. It’s
already lived. It’s survived like 100 days, now you want to see
will the engine survive another five days, then you’ll have one
hazard score. But if you want to see will the engine survive
another 50 days, then you’ll have a worse hazard score,
because it’s less likely to survive longer.
Gabor: Yeah.
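The question of whether an engine that has already run 100 days will last another 5 or another 50 is usually written as a conditional survival probability, S(age + extra) / S(age), rather than as the hazard itself, which is an instantaneous rate. The sketch below assumes a hypothetical Weibull survival curve with made-up parameters, purely to show the calculation.

```python
import math

def survival(t, scale=150.0, shape=2.0):
    """Hypothetical Weibull survival curve for an engine, t in days."""
    return math.exp(-((t / scale) ** shape))

def prob_survive_more(age, extra):
    """P(T > age + extra | T > age) = S(age + extra) / S(age)."""
    return survival(age + extra) / survival(age)

p5 = prob_survive_more(100, 5)
p50 = prob_survive_more(100, 50)

print(f"survives 5 more days after day 100:  {p5:.3f}")
print(f"survives 50 more days after day 100: {p50:.3f}")
```

The longer the look-ahead window, the lower the probability, which is the point made in the exchange above.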
Kirill: Okay, that’s pretty cool. There is a whole mathematical
apparatus behind this survival theory, I think that’s why I
found it very interesting in the first place. Very well defined
and it can be applied to many different problems in life, like
you say, machinery and other areas as well.
Gabor: Yeah, that’s true. I’m glad you like it also. It’s really interesting
actually and fascinating how it can be applied for anything.
Kirill: Awesome. Well, guys, if you found this quick overview of
survival analysis interesting, have a look into it, I think it’s a
pretty cool area of data science which is good to at least know
about.
Okay, so you said you did survival analysis at your previous
job, now you’re working with solar and wind, but before that
you mentioned text analytics. What were you doing in the
space of text analytics?
Gabor: Yeah. Well actually I got that job because of my thesis work,
because during the university, my thesis was based on social
media analytics and it counted as one year of laboratory
work. Actually, there I broadened my knowledge in natural
language processing like, text mining, classifying the
sentiment and emotions of reviews from social networking
sites like Yelp, Foursquare and Twitter also. I really enjoyed
doing it, I did some parts of my work in Python, SPSS,
RapidMiner, but the heavy part, the computation and running
the algorithms was done in R. I don’t know, I just find it quite
handy doing this in R.
I got that job because of this project. I started working as a
social media data analyst, where I had to rely on my
knowledge of text analysis, of keyword-based information
extraction and so on, and I was analysing the sentiments of
posts, comments, messages, forum questions for tech
companies. If they were positive or negative mentions, or if
they were about a specific product like hardware or software
related issue, etc. Well it was quite interesting. I still believe
that businesses really can make a difference and improve
themselves by gaining knowledge on their customers from
social media. Because everybody’s posting and texting a lot of
things and if you analyse it well, you can really improve your
business.
Of course, you know what, the hardest part was like, it was
getting harder and harder when we extended the regions, we
were not just analysing the English content but when you
have European languages like Hungarian, Polish, Greek, and
so on, that’s the heavy part. Actually, there I’ve been looking
into some of the best sentiment analytics tools but I haven’t
found the best that have a great accuracy. So, in this case you
will need the help of manual categorization by language
professionals or someone who can do it for you. Actually, now
I think back that I really liked that job, I really liked text
analytics because maybe for all data scientists it’s one of the
most common things to like, analysing texts. I really liked it.
There, I became a senior data analyst very quickly as they
actually saw that I know what I’m doing, and they trusted my
insights and during those two years of work, I kind of helped
setting up a team of like multilingual international data
analysts, like 5-10 people. It was really interesting.
Kirill: That’s so cool. What tool did you use for that, was it Python
or something else?
Gabor: First in some parts I was using R, some parts I was using
Visual Basic and Excel but the most part of it was done by …
Salesforce has quite a good marketing tool for this
sentiment analysis and kind of like web scraping, it’s called
Radian6. I’m not sure if they still call it like that, at that time
it was Radian6 by Salesforce. It’s a very powerful tool that you
can just collect a lot of things from blogs, technical forums,
you can just give the URLs and so on and you can just collect
any kind of information by … Like, you know it also relies on
this Boolean keyword extraction where you can define what
keywords you want to use with “and”s or “or”s. It was really
cool. And you can just download a lot of things, it can be
connected even to your Facebook business pages or Twitter
business pages, with Google+, everything, and you can even
see the inbox messages, you can just analyse the inbox
messages also. It’s really cool. We were doing like a customer
service job, I wasn’t involved in answering the customers but
I was the one analysing and gathering the insights for our
clients.
Kirill: All right. Can you walk us through this? Once you let’s say
use this tool that you mentioned or some other tool and you
connect it, what happens? It goes to the webpage, it finds a
comment, it downloads it into a file, or into a database and
then what? It restructures it, because all comments have
different structure? Can you walk us through the process
from start to finish, please?
Gabor: You can define what kind of platforms you want to make your
search on and then you can define the keywords that you
want to look for, for example if you’re looking for Apple, you
can just put Apple and plus you can put iPhone and plus if
you’re analysing the different models, you can put like, 4, 5,
6, 7, 8 and so on, and of course then you can for example add
like the keyword for camera or hardware or something. Then,
this tool will find all the posts, all the content on those sites
that you were including in the beginning with these keywords
and then you can download them via, like, Excel, CSV files,
even XML, anything you want. You can probably even connect
it to a database and just put all the data there. It actually kind
of structures your data, of course the textual data will not be
structured in the beginning but it can also, before
downloading the data, you can just say that I want to see the
positive and negative mentions also. And it has these prebuilt
algorithms for finding the positive and the negative mentions
based on, probably it’s also based on something like lexicon
or something that’s behind it but it doesn’t really work with
other languages, you can tweak it manually to work with it
but it takes a lot of time. But with English it’s working quite
well. Of course, with these things, you will always have the
problem of sarcasm in everything, because you might find it’s
negative but then it might be positive and your ratio will not
be the one that you were looking for.
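The keyword search and the lexicon-based scoring described above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not how Radian6 works internally: the function names, the sample posts, and the tiny word lists are all made up for the example.

```python
import re

# Hypothetical mini-lexicon; a real tool would use a far larger one.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "broken", "terrible"}


def tokenize(text):
    """Lowercase a post and return its word and number tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(text, all_of=(), any_of=()):
    """Boolean keyword filter: every term in all_of must appear (AND),
    and, if any_of is given, at least one of its terms must appear (OR)."""
    tokens = set(tokenize(text))
    has_all = all(term in tokens for term in all_of)
    has_any = not any_of or any(term in tokens for term in any_of)
    return has_all and has_any


def sentiment(text):
    """Count positive minus negative lexicon hits; the sign gives the label."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up posts standing in for downloaded content.
posts = [
    "The iPhone 7 camera is great in low light",
    "Apple pie recipe with extra cinnamon",
    "My iPhone 6 hardware is broken after a year",
    "Oh great, the iPhone camera app crashed again",
]

# Query: "iphone" AND ("camera" OR "hardware")
hits = [p for p in posts if matches(p, all_of=["iphone"], any_of=["camera", "hardware"])]
labels = [sentiment(p) for p in hits]

for post, label in zip(hits, labels):
    print(label, "|", post)
```

The last post is sarcastic, yet a word-counting approach like this labels it positive, which is the kind of error that skews the positive-to-negative ratio.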
Kirill: Okay, gotcha. So, how do you deal with the fact that some
comments might be very long, some comments might be very
short? Does it matter or not at all?
Gabor: It doesn’t really matter. I was doing a lot of text mining and
analytics with textual data and one of the easiest things to do
is like build a term-document matrix, which contains the
frequency of each word in each of the documents. For
example, one document can be just one line of text or one
sentence. You can just structure it really well and it doesn’t
matter how long it is. Of course, it will probably take more
time to do it but it doesn’t really matter.
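A term-document matrix like the one described here can be built with the Python standard library alone. The sketch below is a minimal version under simple assumptions (lowercasing and a basic regular-expression tokenizer); the function name and the two sample documents are hypothetical.

```python
import re
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is how often
    vocabulary[i] occurs in documents[j]."""
    counts = [Counter(re.findall(r"[a-z0-9']+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# A short and a longer document: length only changes the counts,
# not the shape of a column.
documents = [
    "the camera is great",
    "the battery is bad and the screen is bad too",
]

vocabulary, matrix = term_document_matrix(documents)
for term, row in zip(vocabulary, matrix):
    print(f"{term:10s} {row}")
```

Each row is a term and each column is a document, so a one-line comment and a long review both end up as one column of counts.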
Kirill: Okay, gotcha. How long would you say it would take
somebody who has never used text analytics before to get into
this field and be able to perform their first text analysis?
Gabor: Well, I would probably say that not too much because really,
if you’re interested in it just go put like, text mining in Python
or R, there are a lot of packages that you can just download.
It’s really easy to use. Once when I started getting into it
deeper during the university, I was using R because it had a
text mining package, and I was using Python also for web
scraping, and actually once I created, through a web API of
Twitter, I was able to easily download a lot of posts, like an
automated job. It downloaded my posts during the night and
then I was analysing them during the day. It’s really cool, it’s
really easy. If you guys out there are really interested, just
take a look at it.
Kirill: Awesome. Thanks a lot for sharing those insights into text
analytics. It sounds like a very exciting space for people to get
into. You are right, a lot of companies will need to do more
and more of that because there is so much unstructured data
floating around and now companies are just getting good at
using their structured data, the ones that are using it, and
the next frontier for competitiveness is unstructured data,
and part of that is text analytics.
All right, you’ve done quite a lot of different stuff. You’ve done
text analytics, you’ve done survival analysis, building some
forecasting predictive models. What would you say is the most
exciting for you and also what are you looking forward to
learning the most? What is the next type of data science that
you can’t wait to get your hands on?
Gabor: Well, probably as I’m working in the energy industry now, I
will go deeper into energy forecasting and how we can use all
the other features from weather insights like for example we
can even use the precipitation to forecast wind power, and this
kind of thing. I would like to go into detail on, like, what
else to use in these algorithms.
Kirill: Okay. That’s fair enough and sounds like a big area of data
science that I didn’t even know existed until today. What I
wanted to ask you is you said that you were doing a thesis on
text analytics, that was part of your research and that’s how
you got your job. What exactly did you study at university?
Gabor: Well, that’s a funny story actually. It might take a bit longer to
tell you but actually I graduated as a Business Informatics
Engineer but it has kind of a longer story how I got there.
Kirill: All right, tell us.
Gabor: My professional career started with an internship at banking,
where I worked under the supervision of seniors like business
consultants dealing with IT demand management for
business departments and I got to handle smaller and larger
Excel sheets for administrative purposes. Seriously, I
remember my first day at work and my boss gave me an Excel
sheet containing the project backlog and I was just so afraid
to touch it.
[laughter]
It was during my bachelor’s but I was like, wow I have to do
something with this. But you know, it was that job that
pushed me into analytics. At that time, I had a conversation
with my boss about staying there full-time after my bachelor’s
and she said that, “Yeah, I would like to hire someone with a
master’s,” but at that time I didn’t think I would want to have
one. And the funny part here is that a few days later, I actually
had a dream where I was talking to my supervisor at the
university, I was handing in some papers to him when he
looked at me and he said, “Hey Gabor, I think you made a
mistake here because these papers are for bachelor’s and you
need the application for the master’s.” I woke up, I
immediately checked the application deadline for master’s
and it was a week from that day and it was my birthday.
I was like, okay, that’s a sign. And I got into one of the best
universities in Hungary for master’s and my specialization
here was like Business Intelligence and Analytics, and it was
mainly about data mining, customer analytics, going deeper
into machine learning algorithms. I also remember that I was
even building and calculating decision trees and neural
networks on paper for smaller data sets.
Kirill: That’s so cool. You got into data science because of a
dream, that’s so awesome.
Gabor: Yeah, kind of. If nothing else, then that was a sign for sure.
Kirill: Why did you choose data science for master’s even though
you studied something else as a bachelor’s?
Gabor: For my bachelor’s I was studying Business Information Systems
and there I got to study some decision science, business
intelligence and then it sounded really cool to study business
intelligence and that’s what the advertisement was saying for
this specialization because it was in English and it said,
“Business Intelligence and Analytics” and I really wanted to
do some analytics jobs also in the future. And it happened
and I’m really glad that I did it because I’m very happy now.
Kirill: That’s cool. What I also wanted to ask you is, you said your
thesis helped you get your job. How did you use your project,
your research at uni, how did you use that to show to
employers? Did they accidentally find it themselves, or were
you proactively sending it out to people? How did that
happen?
Gabor: No. I had a friend in university and her aunt or someone
worked at the company, at Sykes, and she was saying that it
would be nice if you could go to an interview there because
they’re looking for someone with what we are doing at the
university and my thesis was about it. Then I went there and
I was talking about my project and at that time in Hungary,
they hadn’t seen anyone else before who was doing exactly
what they needed.
Kirill: So it was very easy for you to get it then, if you were the only
one in the whole country.
Gabor: Probably not the only one in the country, but the only one of
those they were checking.
Kirill: Who they had seen before. Okay, that’s a very interesting
story. All right, so I’ve got a few rapid-fire questions for you
which I would like to pose, are you ready for these?
Gabor: Yeah, sure.
Kirill: Okay. What’s the biggest challenge you’ve ever had as a data
scientist?
Gabor: It’s a hard one. Actually, we haven’t talked too much about
feature engineering and data wrangling until now. But they
are kind of one of the most important and most stressful
parts of being a data scientist, when you have to transform
and map the data from one format to another or to create a
long data structure from wide data or wide spatial. Or when
you have to gather or spread the data based on some specific
key value pairs. To do it well it can be really stressful and
time consuming sometimes. That’s the biggest challenge that
I have.
I think the biggest challenge, generally, being a data scientist
is preparing the data in the right format to be used as an
input for the machine learning algorithms, because this is
what they don’t teach in much depth in the universities. It’s
easy to use some prepared tidy data that they give you in the
university, but in real-life scenarios, that’s really not the case.
You have to be careful how to make that input data that you
will feed your algorithms with.
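The wide-to-long reshaping mentioned above, and the reverse "spread" back to wide, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the field names, and the failure counts per year are all hypothetical, and in practice a library such as pandas or tidyr would usually do this job.

```python
def wide_to_long(rows, id_field, var_name="variable", value_name="value"):
    """Gather: turn every non-id column of each wide row into its own
    (id, variable, value) record."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_field:
                continue
            long_rows.append({id_field: row[id_field], var_name: key, value_name: value})
    return long_rows


def long_to_wide(long_rows, id_field, var_name="variable", value_name="value"):
    """Spread: collect the key-value pairs of each id back into one wide row."""
    wide = {}
    for record in long_rows:
        row = wide.setdefault(record[id_field], {id_field: record[id_field]})
        row[record[var_name]] = record[value_name]
    return list(wide.values())


# Made-up wide data: one row per machine, one column per year.
wide = [
    {"machine": "A", "2016": 3, "2017": 5},
    {"machine": "B", "2016": 1, "2017": 0},
]

long_rows = wide_to_long(wide, "machine", var_name="year", value_name="failures")
for record in long_rows:
    print(record)

# Spreading the long data again recovers the original wide rows.
round_trip = long_to_wide(long_rows, "machine", var_name="year", value_name="failures")
```

The long form has one row per machine and year, which is often the shape a modelling function expects as input.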
Kirill: Yeah. What I also find is it’s really hard to find courses or
even books that can prepare you for that because it’s hard to
come up with a data set that’s dirty intentionally, right? You
commonly find them, you come across them during your
work and you get caught by surprise, but when you want to
find one, or when you want to create one yourself, it’s not an
easy task for exercise purposes. I think that’s why a lot of
courses miss out on that side of things.
Gabor: Yeah, I think so.
Kirill: Okay. Thanks for your answer to that question and now let’s
move on to the next one. What is a recent win you can share
with us that you’ve had in your role? Something that you can
disclose, I know that there’s certain things that you cannot
talk about, classified things maybe, but let’s talk about
something that you can share, something you’re proud of
that you’ve achieved or accomplished.
Gabor: Back in my previous job when I was doing survival analysis,
I actually got a hold of some historical GPS data of the
machines, like coming from sensors and IoT devices, and I
analysed if there is any correlation between the machine’s
location and the failures or survival time, and it showed that
actually there was. The survival time in one region differed
from the survival time in another.
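One common way to compare survival times between regions is a Kaplan-Meier estimate per group. The sketch below is an illustration under stated assumptions, not the method used in the project described here: the estimator is written out by hand, and the region names, durations, and failure flags are invented for the example.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with a failure.

    durations: time until failure or until the machine left observation
    observed:  True/1 if a failure was seen, False/0 if censored
    """
    failures = Counter(t for t, seen in zip(durations, observed) if seen)
    curve = []
    survival = 1.0
    for t in sorted(failures):
        # Machines still running (not failed, not censored) just before t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1 - failures[t] / at_risk
        curve.append((t, survival))
    return curve


# Invented data: months between services for machines in two regions.
mountain_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 1, 1, 0])
urban_curve = kaplan_meier([4, 6, 9, 10, 12], [1, 0, 1, 0, 0])

print("mountain:", mountain_curve)
print("urban:   ", urban_curve)
```

With this toy data the mountain curve drops faster than the urban one, which is the kind of regional difference described above. A real analysis would add a significance test and the extra terrain and weather features.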
Kirill: Wow, that’s so interesting.
Gabor: Yeah, and I don’t need to say that I got so happy because I
got to analyse more and of course I had some discussions
with the business side, the industry side, and I started
extracting and gathering terrain data, like land cover from
raster files and also data from weather stations to get more
features to use in the survival analysis. At the end, I got
better results with these new features, so I included them in
the solutions. I think it was quite a win.
Kirill: That’s really cool. Did you ever get to the reason behind why
in a certain region the survival was lower?
Gabor: Yeah, actually if you think about the US like for example if
you just think about one state like if there are, like, rocky
mountains, of course your machine is more likely to fail than
when it is working in an urban area. If you think
about it it’s really simple and it shows in the research also,
very cool.
Kirill: That’s cool. And so were these machines constantly stationed
in those separate regions or were these machines travelling
across different regions?
Gabor: Between two failures or services, they were kind of in the
same type of location.
Kirill: So it was just generally the case that the machines in this
area are more likely to fail than machines in this area so we
need to service them more often.
Gabor: Yeah, that’s right.
Kirill: Very interesting. And so nobody knew that before you did
that analysis?
Gabor: Well, not in that solution.
Kirill: Not in an analytical way. That’s really cool. Awesome. It’s
always fun, I find that, to relate back to the reality of things
or the business knowledge. That you find something in the
data and then you look at the locations and you’re like, oh
that does make sense, it is more rocky mountains here and
it is an urban area here. It’s good when the real life confirms
what you see in the data, isn’t it?
Gabor: Yeah, that’s true.
Kirill: Awesome, that’s a big win. What is your one most favourite
thing about being a data scientist, except for cleaning data of
course?
Gabor: Yeah, that’s the best part. Well, to learn new approaches, to
discuss your results with other professionals. The
visualization part is one of my favourites. Currently I’m using
R, the ggplot2 package, for visualization, but I’ve also used
Tableau and Power BI where you can just import any kind of
data, you can just create nice plots. Another favourite thing
is that there is always something new to do, to research,
to try, to implement.
Kirill: Yeah, that’s really cool. I like that as well. It’s constantly
growing, there’s always things that you’ve got to be up to date
with to see what’s happening and you can also contribute to
the new areas of data science, so that’s really cool.
From all the various experience you’ve had and the
types of work even, like you’ve worked in the office,
you’re now working remotely and so on, where do you think
the field of data science is going and what should our
listeners look into to prepare for the future?
Gabor: Everything. Actually, this is a good one. I’ve recently been to
an AI summit in Vienna and there was this inspiring talk by
Sepp Hochreiter if you know him. He’s a professor at
Johannes Kepler University.
Kirill: That’s so cool that you got to hear him. He created the
LSTMs, long short-term memory for recurrent neural
networks in deep learning.
Gabor: Yeah. He asked this question, who of you are still going to
human doctors? And everyone raised their hands. Then he
asked, who is going to AIs, then of course no one put their
hands up, and he just concluded that, “Well, this will change;
Don’t trust human doctors for diagnostics because machines
are better.” Surely, he was just making a joke but if you think
about deep learning and image processing, and of course you
have this course on Udemy. You know, machines can
classify skin cancer better, for example, because they can learn
from millions and millions of images, whereas doctors might
have just thousands of patients during their practice. So, I
think this is really interesting to see how far machine
learning and AI and data science will go in the near future.
Kirill: That’s so cool that you bring that up because I totally agree.
There’s an app, you can actually download it. It’s called Skin
Vision I think. There’s another one, M-I-I Skin, Miiskin. But
Skin Vision I think I’ve heard of before, and basically …
Anyway, there’s an app you can get on App Store, I’m not
sure about the name but you can get it on App Store, for
iPhone, maybe for Android, and it does exactly that. You take
a photo of something on your skin you’re worried about, and
based on thousands and millions of images that its
algorithm, its deep learning algorithm has been through, it
can help understand. Whatever algorithm’s in the
background, can help understand if that’s skin cancer or not
and they actually tested it against doctors and it performed
as well as several professional skin doctors in the world and
so, yeah, exactly what you’re saying. We’re getting into an age
where it’s going to be AI mostly.
Gabor: Yeah, I totally agree.
Kirill: Awesome. Thank you so much for coming on this show and
sharing your story and insights. If our listeners would like to
get in touch or follow you or maybe ask questions or see how
your career goes further, where is the best place to get in
touch with you?
Gabor: I’m on LinkedIn, so just feel free to send me a message or
request if you have some questions or you want to learn
more, or I can learn more from you, then definitely send me
a request.
Kirill: Okay, awesome. We’ll share your LinkedIn on the show
notes. I have one final question for you. What is your one
favourite book that you’d like to share with our listeners
today?
Gabor: It’s really not easy to say just one book. It’s hard but since
I’m still really in love with text mining and I talked a lot about
it, and how much I really love it and its applications, I will
recommend the Introduction to Information Retrieval by
Manning. It has topics on how to build web search engines,
how they work, also areas on text classification and
clustering, indexing, ranking, and so on. I can only
recommend it for those who are interested in text analysis,
it’s really a great book. Next to it you can just create your
own stuff on R, Python, anything really.
Kirill: Okay. That’s awesome. Well, there you go, Introduction to
Information Retrieval by Manning. And if there’s a book you’d
like to recommend to people who are not interested in text
analytics, does anything come to mind?
Gabor: Yeah. That was the first book that I got a hold of when I was
studying data mining during the university, it’s called
Introduction to Data Mining. I will send you the link.
Kirill: Okay, sounds good. So, our second book is Introduction to
Data Mining.
Well, there we go. Once again, thank you so much, Gabor,
for coming on this show and sharing all the insights. I really
appreciate you taking time and good luck with the remote
work and can’t wait to see how your career goes from here.
Gabor: Thank you very much.
Kirill: So, there you have it. That is Gabor Solymosi and his story
and how he’s moved through the space of data science. Hope
you find that inspiring and it’s hopefully going to give you
some ideas of how you can structure your path better.
Personally, what I found most inspiring was of course how
Gabor, since our catch up in Europe, has moved to a remote
data science role, he’s working from home now. That is very
in line with his passion of being free and being able to do
what he loves and at the same time being able to do it from
wherever he wants and on his terms. He is a very talented
data scientist and I’m sure that by being able to control his
own time as he desires, he is bringing even more value to the
organisation that he’s working for and I’m super excited that
they have created this environment for him. I can’t wait to
hear how his story will progress, I’m very excited to hear how
it will progress over the next couple of months or next year
or so.
Make sure to connect with Gabor on LinkedIn and follow along his career
path and see where it takes him. You can find the URL to
his LinkedIn in the show notes at
www.superdatascience.com/107. There you will also find
the transcript for this episode and any other materials that
we mentioned during the show.
On that note, thank you so much for being here today. Can’t
wait to see you back here next time, and until then, happy
analysing.
[Background music plays]