
Show Notes: http://www.superdatascience.com/107

SDS PODCAST

EPISODE 107

WITH

GABOR SOLYMOSI


Kirill: This is episode number 107 with data scientist at Utopus

Insights, Gabor Solymosi.

Welcome to the SuperDataScience podcast, my name is Kirill

Eremenko, data science coach and lifestyle entrepreneur, and

each week we bring inspiring people and ideas to help you

build your successful career in data science. Thanks for being

here today, and now let’s make the complex simple.

[Background music plays]

Kirill: Welcome everybody to the SuperDataScience podcast, super

pumped to have you on board and today I’ve got a special

guest, a friend whom Hadelin and I met during our European

road trip, Gabor.

Gabor is from Budapest, Hungary, and that was I think our

third stop during our road trip. It was very exciting to meet

everybody there and Gabor’s story especially resonated with

me because of his dreams and passions and how he works to

accomplish them, how he works towards them. And I’m very

excited to hear that since the road trip which was a couple of

months ago, Gabor has made progress in his career, he’s

got a new job and he’s actually working towards his goals as

you’ll see from this podcast.

Gabor is a very interesting person, very passionate about data

science. We’ll talk about three of the roles that he’s had to

date in the space of data science, we’ll talk about things like


text analytics, survival analysis, and jumping into an industry

completely foreign to him, how you’re able to switch from one

industry to another, being a data scientist and transferring

those data science skills and what is the experience of that,

what he is going through as he’s moving to something

completely different. As you’ll hear from the podcast it’s very,

very, exciting, the industry that he’s just jumped into. He’s

working with solar and wind turbine energy. Who would have

thought they also need data scientists there?

So, quite a lot of interesting things we talked about here, but

probably the main thing I’d like you to focus on is the path,

the way that Gabor has intentionally chosen the roles in his

data science career, and how he’s working towards his

dreams. I can’t wait for you to hear his story, and let’s get

started. Without further ado, I bring to you Gabor Solymosi

who is a data scientist at Utopus Insights.


Kirill: Welcome everybody to the SuperDataScience podcast, today

I’ve got a very exciting guest, a friend of mine from Budapest,

Hungary, Gabor Solymosi. How are you doing, Gabor?

Gabor: Thank you very much, Kirill. I’m really excited to be here

actually.


Kirill: Awesome. Cool to hear you. Can you remind us how we met,

where did we meet? So that the listeners can get a bit

acquainted with our story.

Gabor: I’d known you before because I took several of your classes on

Udemy, but then once I got an email that Kirill is coming to

Hungary, Budapest, I took the opportunity and we met there.

It was quite a nice dinner and we went on a few bar trips, let’s

say. It was quite interesting and we had a lot of good talks

that time.

Kirill: Yeah, exactly, it was fun. It was during the road trip, a lot of

you listening to this podcast might know that Hadelin and I

did a road trip this summer through Europe. One of our stops

was Budapest, Hungary and so we met quite a few of our

students there.

It was interesting talking because you or someone else was

saying that you were very surprised that we came to

Budapest. We started with Italy then we went to Munich and

then the next email that came out was, we’re going to

Budapest. Were you a bit surprised at that or were you

expecting it?

Gabor: Yeah, actually I didn’t expect it. I was checking the mails that

you were doing this Europe trip and I didn’t think that you

were going to come to Budapest. I thought okay, maybe

Prague or some other bigger cities. But I think you had a good

time here also.


Kirill: Yeah it was good and thanks a lot for showing us around, a

very interesting city. If somebody hasn’t been to Budapest, we

liked it quite a bit, Hadelin really fell in love with the city. It’s

got this big massive river. What’s the river called again?

Gabor: It’s the Danube.

Kirill: The Danube, the river, it’s a big river. It goes through a lot of

countries in Europe, but in Budapest it’s really wide and

you’ve got two parts to the city, you’ve got Buda and Pest. I

think the story goes that they were two separate cities that

were growing on both sides of the river and at some point,

they just decided to become one city, is that right?

Gabor: Yeah, it’s kind of like the short summary of how we got

together.

Kirill: Then Gabor and some other students showed us around the

city. It’s got quite a lot of monuments and we even saw the

statue to Gabriel and that’s where you told me that your name

Gabor is a derivative of Gabriel, is that right?

Gabor: Yeah, that’s correct.

Kirill: That was very interesting to learn, I never knew that before.

Anyway, so we’re here to talk about data science and your

journey into the space of data science, so tell us a bit about


what you do. You told me just before the podcast, you got a

new job, congratulations.

Gabor: Thank you very much. I have recently changed my job from

one company to another. I’m working as a contractor data

scientist for an exciting new energy analytics company called

Utopus Insights. As I told you, it’s actually a spinoff from IBM

Research and it’s headquartered in New York but I work from

Budapest, Hungary. It’s kind of a remote job.

Kirill: That’s so cool. Let’s just realign a bit. Where did you work

before Utopus Insights?

Gabor: I was also working as a data scientist at XAPT. I will go into

details with that also because it was really interesting. So,

those are my first two data science jobs. Before that, I was a data

analyst but it was something a bit different. Now I’m really

happy to be here because it’s really cool. Basically, I’m

working from home most of the time, which has its benefits of

course, and disadvantages as well since I have a lot of time

dealing with things around the house or waking up a little bit

later or doing work outs every morning.

Kirill: Getting distracted.

Gabor: Yeah, that’s true but on the other hand, of course, it can be a

bit boring sometimes. I have regular Skype meetings with the

others in the States and here we have a team in Budapest

with whom I regularly meet. Actually, it’s quite nice.


Kirill: You said it’s a contract. When you were working with XAPT

was it also a contract or was it a full-time job?

Gabor: No, that was a full-time job. Actually, it’s also like kind of full-

time but I’m working through a major company.

Kirill: You traded in a full-time secure job for a contract, is that

right?

Gabor: Yeah. It’s like that. But of course, it’s really interesting and

it’s really exciting for me to work with this now because I really

wanted to do something with the energy industry. How to help

the future, doing something with renewables and these

things. It’s really interesting.

Kirill: I can totally imagine. But the whole concept is very interesting

because a lot of people wouldn’t do that. They would think,

this is a full-time job versus a contract, a contract can expire,

versus a full-time job, I’m very secure in what I’m doing. Was

it a hard decision to make, to give up that security of your job,

of your income, and to go for something more exciting but

something that’s a contract, that can end and might not be

renewed?

Gabor: When I was thinking about it, I didn’t want to change jobs at

the time, but it was quite an opportunity for me because I

have a friend who said that they have an opening in the

Budapest office and I really wanted to do something with this


renewable energy data science stuff. And for this, you know,

yeah, I kind of traded my secure things for a contract but it

was worth it, I think. Of course, it’s a contract but I wouldn’t

change, it’s not for more money and it’s really exciting.

Kirill: Yeah, I know. That’s really cool and very inspiring to hear as

well because, after our chat in Budapest … Probably I should

mention this to the listeners. This was one of my most

inspiring conversations that I had on the road trip because

when we were talking, you said that, look, there’s certain

dreams that I have and goals, and ideally, I’d love to live … Do

you remember that conversation about Spain?

Gabor: Of course, yeah.

Kirill: What did you say? Tell us about your dream. What is your

dream in relation to Spain?

Gabor: I lived in Spain because I did my Erasmus semester in

Barcelona and I really loved the spirit of the city and it’s just

extremely cool and I’ve always wanted to go back there since

I was there for this one semester. My dream was to just find

a job there with this hotness and everything. Here, I know

everything in Hungary, in Budapest, and it’s just not that

exciting. I wanted something more, of course I’ve always

wanted more. With this one now I think it’s kind of great.

Kirill: Yeah. Do you like the Spanish language?


Gabor: Yeah, of course. I actually learned Spanish so I know a couple

of things. I’m not perfect but I know the basics and I really

like it. I learned Catalan also because of Barcelona.

Kirill: That’s really cool. All right, what I was just going to say is that

Gabor’s dream is to live in Spain and work in Spain and so

on, and one of the things that we discussed during our catch

up was that, you remember you said you were a bit

disappointed that unfortunately the economy in Spain isn’t

the best right now and it might be hard to find a job and so

on. And I mentioned that you don’t really have to find a job in

Spain, you can live in Spain but you can work as a freelancer

through Upwork or through other websites, and it’s very

exciting to hear that now you have a remote job. You just got

a remote job where you are working from home and in my

view, it’s like a step towards that goal and it’s very inspiring

to hear that you are on that journey already.

Gabor: Thank you very much. Actually, it’s really exciting and I also

feel like it’s kind of an improvement since we last talked.

Kirill: Awesome. So, you just moved from XAPT to Utopus Insights,

tell us a bit about the work that you do. In what space of data

science are you at the moment?

Gabor: I’m kind of a data scientist/analytics engineer. I’m involved in

multiple projects that focus on forecasting the performance of

renewable energy farms, like solar farms, wind farms,

turbines and so on. That’s what I currently do. It involves a

lot of statistical learning methods and a lot of mathematics


also and a lot of engineering. I actually don’t have an electrical

engineering industry background but with these people here,

they help and I bring the data science knowledge also, so it’s

kind of cool. I really like it.

Kirill: Okay. That’s pretty awesome. What does an analytics

engineer do? I’ve never heard of that profession before.

Gabor: It’s kind of data scientist stuff also. It’s just about building the

analytics platform, like in the databases and how to extract

the data, how to put it into the analytics platforms and these

things. Behind also I know the science stuff so that’s what it

is. It’s just how they call us.

Kirill: Okay. It’s a mix of a data scientist, a database architect, that

type? Like you do …

Gabor: Yeah. Kind of that. We’re working with a lot of software

developers who actually do this backend stuff, the deep

backend. But of course, I have to be involved in these things.

Kirill: Interesting. So, you work with wind turbines, what other

forms of energy? Is it solar?

Gabor: Yeah, it’s solar and wind.


Kirill: Solar and wind. Out of curiosity, which one is the most

efficient right now, out of the ones that you work with, not the

world standards or the leading world ones. The ones that you

work with, what do you find is more efficient, solar or wind?

Gabor: Actually, I’m not quite into that one yet so I don’t have too

much insight on which one is better, but probably in a few

months I could give some insights on this also.

Kirill: Okay, gotcha. All right, cool. When you say you do analytics

for solar and wind, what exactly do you do? Do you calculate

how much is consumed or do you calculate how much, the

maintenance requirements, what part of that analytics are

you involved in?

Gabor: Now, I’m actually involved in validating machine learning

models like forecasts and choosing the right metrics for

evaluation. Communicating with the other analysts and

engineers, software developers on what and how to improve,

this kind of stuff. Of course, it involves a lot of research.
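As an illustration of what choosing an evaluation metric for a power forecast can involve, here is a minimal Python sketch. The metrics shown (mean absolute error, root mean squared error, and MAE normalized by installed capacity), the function names, and the numbers are all assumptions made for illustration; the conversation does not say which metrics Gabor's team actually uses.

```python
import math

def mae(actual, forecast):
    """Mean absolute error: average size of the miss, in the same unit as the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error: penalizes large misses more heavily than MAE."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def normalized_mae(actual, forecast, capacity):
    """MAE as a fraction of installed capacity, so farms of different size compare."""
    return mae(actual, forecast) / capacity

# Made-up hourly power values for a hypothetical farm, in megawatts.
actual_mw = [2.0, 3.5, 1.0, 0.0]
forecast_mw = [2.5, 3.0, 1.0, 1.0]
capacity_mw = 5.0  # assumed installed capacity

print(mae(actual_mw, forecast_mw))                          # 0.5
print(round(rmse(actual_mw, forecast_mw), 4))               # 0.6124
print(normalized_mae(actual_mw, forecast_mw, capacity_mw))  # 0.1
```

RMSE comes out above MAE here because one hour is missed by a full megawatt; which behaviour is preferable depends on how costly large misses are.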

Kirill: What are you forecasting?

Gabor: We are forecasting the performance, the power of the wind

turbines, like wind and solar panels, how much power they

give.

Kirill: So how much energy we’ll have in the future?


Gabor: Yeah.

Kirill: That’s very interesting because like we all use energy, we all

use electricity, and we all hear about solar and wind and so

on, but I’ve never actually spoken to someone in this space.

It’s good to have an example that even in these industries, you

still need data science, you still need data scientists.

I was thinking originally maybe there’s very historical types of

roles and types of calculation like scientists or engineers that

are performing these estimates and forecasts but

nevertheless, you are a data scientist who’s working in this

space. And this is something new for you, right? When you

were working at XAPT, were you doing the same thing or was

your role related to something different?

Gabor: Well, it was a bit different but there I was working with

predictive algorithms, predictive maintenance analytics,

which is kind of involved because here we are also planning

to do something like that. As you mentioned, you thought that

it was scientists and these kinds of people who are doing these

forecasts and these things, well we have a meteorologist on

the team also, who is doing the weather forecast. And we have

a lot of electrical engineers and they have a vast background

of science, of the field.

Kirill: Okay. In XAPT, were you working with energy as well or

something else?


Gabor: No. Actually, there I was working on a project which was

called predictive maintenance for heavy machines. It was kind

of interesting, we were doing survival analysis there,

predictive algorithms through R server. I was creating like

these web APIs with R which was really cool, I really liked it.

Kirill: All right. We’ll get to that in a second. I just wanted to, again,

stress for those listening that before … When did you start at

Utopus, was this a few months ago?

Gabor: Yeah, it was a few months ago.

Kirill: Okay, so literally a few months ago, Gabor … How much

knowledge did you have about solar and wind energy and

their consumption and stuff like that? Were you an expert in

that field?

Gabor: Not too much. You know, if you’re really interested in

something, just do the research and that’s what I did before

applying for these things.

Kirill: And that’s why it’s so exciting because, like two months ago

or so you had no knowledge of that industry, or very little

knowledge about what solar turbines, how they work, what

their energy flow is, efficiency and so on and the same thing

for … Sorry not solar turbines, solar farms and solar panels,

and then wind turbines, and the same thing. But all you had

was like your data science skills, your machine learning skills


and so on, and you brought that, and now two months later,

you’re in a completely new field, something very interesting. I

think it’s a very inspiring example for those listening that if

you’re interested in something, even as complex as solar

energy, you can just go and become a data scientist there. If

you’re interested in wind turbines, you can go and become a

data scientist there, regardless of your background.

What I’m getting to, is that somebody might think that you

have to be an expert in solar to even be considered for a role

in solar. No, you don’t. Like Gabor here is showing by

example, you just have to be a data scientist or like be

confident in your skills, do some research, and then go

there. I think it’s a good testament as well to the

transferability of data science skills that you can go from one

industry to another very quickly. Like in your case, from

heavy machinery which doesn’t have that much to do with

solar in the first place, and you can just move to solar energy

or wind turbines or whatever. So, basically, guys, dream big

and wherever you want to work, whatever is your passion, you

will be able to get in there quite quickly.

Gabor: Yeah, that’s true. You summed it up really good. Actually, the

funny thing that I will go into in a bit is that before doing data

science for heavy machinery, I was doing text analytics.

Kirill: Text analytics. That’s awesome. There you go, that’s a jump.

And that’s when you were a data analyst? What was the

company called there?


Gabor: Yeah, it was Sykes.

Kirill: Sykes. Before XAPT you were at Sykes and it’s text analytics,

that’s so cool. Such a big change from text analytics to

working with heavy machinery to now working in the space of solar

and stuff like that. You touched on a very interesting topic, I

think we should expand on that more because I haven’t heard

anybody on the podcast talk about it yet. Survival analysis.

I’ve heard a little bit about it, I’ve read a bit about survival

analysis. Could you give us an overview, what it’s all about

and how does it work?

Gabor: Yeah. Actually, I wanted to talk about it of course because it’s

really interesting. Did you know that it’s actually one of the

oldest statistical disciplines? It has roots in demography and

actuarial science like economics, and it dates back to the 17th

century. I did a lot of research on it because it’s so interesting.

In the beginning, it was most importantly used in

demographical analysis like vital statistics that deals with

statistics on birth, deaths, marriages, divorces and these

kinds of things. Since then of course, it became widely used

in other fields as well, like economics and failure analysis in

mechanical systems, like what I did for heavy machines. It’s

actually about analysing data where your outcome variable is

the time until an event. For example, it can be death or

marriage, or failure or something.

Kirill: Sorry to interrupt. So, in marriage, survival is how long you

can survive until you get married?

[Laughter]


Gabor: Yeah.

Kirill: That’s so funny. Puts marriage in a bad light. But okay,

gotcha. It’s just a term. I guess it comes from where it

originated. It originated with like how long people live, like

before they get sick, or before they die, or something like that.

Gabor: Yeah, that’s it. If you think about it, that the outcome is kind

of like a continuous variable, but it’s not continuous because

it’s time. It’s actually a generalized form of a high dimensional

regression analysis and it’s really interesting. It’s really cool

and this is the one thing, a good example that we were talking

about a few minutes before. You can just apply it on anything.

Kirill: When you say the outcome is not continuous, what do you

mean by that? Like it’s time, right, it can be …

Gabor: Yeah. It’s kind of like continuous. For example, if you want I

can talk a bit more about it of course.

Kirill: However you want to structure this. You’re the expert, just

tell us about survival analysis. What do we need to know?

What’s the most important fun stuff?

Gabor: Okay. Actually, what I did it’s also like this time that you’re

measuring or time to event, or the survival time, it can be


measured in whatever you want, so in days, weeks, years. It’s

a continuous variable, let’s say.

For example, if the event of interest is like a failure, then the

survival time can be the time in days or hours or even years

until for example a machine develops a failure, let’s say. It has

a lot of interesting terms also like censored and uncensored

observations and like for example there are two kinds of

subsets of the data, what you can deal with, like the censored

and the uncensored one. In some of them there is, for

example, if the event hasn’t happened, you don’t have any

observation of the event with that kind of machine and then

it becomes hard to define the survival time at the end.
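The censored and uncensored observations described above can be sketched in Python as follows. This is a minimal illustration, not the setup used at XAPT: the `Observation` class, the machine identifiers, and the durations are hypothetical. Each record carries a duration plus a flag saying whether the failure was actually seen.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    machine_id: str       # hypothetical identifier
    duration_days: float  # time observed, up to the failure or up to the end of the study
    failed: bool          # True: failure observed (uncensored); False: censored

# Made-up data: the study ends at day 200, and machines B and D are still running then.
observations = [
    Observation("A", 120.0, True),
    Observation("B", 200.0, False),
    Observation("C", 75.0, True),
    Observation("D", 200.0, False),
]

uncensored = [o for o in observations if o.failed]
censored = [o for o in observations if not o.failed]

# Averaging only the observed failures ignores the machines that outlived the
# study, so this number understates the typical lifetime.
naive_mean = sum(o.duration_days for o in uncensored) / len(uncensored)
print(len(uncensored), len(censored), naive_mean)  # 2 2 97.5
```

For a censored machine the only thing known is that its survival time is at least the recorded duration, which is why survival methods keep the status flag next to the time instead of dropping those rows.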

Kirill: Okay, and how do you go about it then?

Gabor: Yeah. Actually, there I incorporated some averages and other

statistics where you don’t have the exact time. I can’t talk too

much about these things.

Kirill: Okay. But let’s say in real life, would you use survival analysis

if you’re testing some sort of medicine? You have a population

of people and they’re … Or let’s say not even people. You have

like this group of mice, you want to see if this medicine helps

them live longer. Is that an example when you would use

survival analysis?

Gabor: Yes. Actually, if you have a good number of observations, of

course. In biostatistics, it’s really commonly used.


Kirill: All right. And so, what makes survival analysis stand out? Is

it just the fact that we are counting backwards, we’re looking

at how much time until these mice start unfortunately, dying,

or is it something else? Is there a certain reason why survival

analysis is so interesting, it has its own kind of domain?

Gabor: Unlike ordinary regression models, the dependent

variable in survival analysis is composed of two parts. One

is the time to the event of interest and the other is the event

status, which records whether the event of interest occurred or not.

From this you can define the censored and uncensored

observations of course. For example, here you can estimate

two functions that are dependent on time, the survival

function and the hazard function. These two functions are the

key concepts in survival analysis describing the distribution

of event times. For example, the survival function gives, for

every time, the probability of surviving, that is, of not

experiencing the event up to that time. It’s

starting at 1, it’s a positive valued monotone decreasing

function. So, when you’re going through time, of course, you

will get the score at every timespan let’s say, and as you’re

going forward in time, probably the score that you will survive

will decrease, that’s why it’s starting at 1 and it’s a positive

valued monotone decreasing function.

On the other hand, the hazard function gives the current

potential that the event will occur per time unit, given that

the individual has survived up to that specific time. It’s part

of the survival function also so it can change over time, for

example it’s increasing as components age, so it’s the


difference. It’s actually kind of the opposite, so the survival

function is going decreasingly and the hazard function is

going upwards.
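One common way to estimate a survival function from time and event-status pairs is the Kaplan-Meier estimator. It is not named in the conversation, so treat the following as an assumed, minimal Python sketch with made-up data, not a description of what was built at XAPT.

```python
def kaplan_meier(durations, events):
    """Return [(time, S(time))] at time 0 and at each distinct observed event time.

    events[i] is True if the failure was observed at durations[i],
    and False if that observation is censored.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = [(0.0, 1.0)]  # the survival function starts at 1
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)  # still under observation at t
        failures = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve

# Made-up durations in days; False marks a censored machine.
durations = [5, 8, 8, 12, 15, 20]
events = [True, True, False, True, False, True]

curve = kaplan_meier(durations, events)
for time, s in curve:
    print(time, round(s, 4))
```

The printed curve starts at 1 and never increases, matching the description above. Censored machines never trigger a drop themselves, but they still count in the at-risk group until their last observed day.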

Kirill: They’re not like, one is not the complement of the other, it’s

just that it’s normal that the older a person is or a machine

is, the more likely that something will go wrong.

Gabor: Yeah, it just gives you a hazard score, but it’s not 1 minus

the other function.

Kirill: Gotcha. So, just to recap, so you got the survival function

which gives you a score. Give us an example of the machine,

right, how would you apply that score in the case of heavy

machinery?

Gabor: For example, if you have an engine and you start the engine,

and if it’s a brand-new engine, then of course at time zero your

survival score will be 1 because you just started the engine

and you probably think that it’s going to survive.

Kirill: Yeah, 100%.

Gabor: As you’re going through time, you’re going forward in time,

this score will just decrease and of course it could decrease

based on the features that you’re using, based on the

correlation between those features and how they correlate


with the output, of course with the time and the status and

everything that’s there.

Kirill: And so, for example if your survival function gives you like a

value of 0.6, 30 days after you started using the engine, that

means there’s a 60% chance that it survives, is that right?

Gabor: Yeah. Exactly.

Kirill: Okay. That’s like for one engine, you can think of it as

probability of survival. But in terms of like let’s say if you have

1,000 flowers, and then if on day 30 they have a 60% score or

0.6 score from the survival function, you can say that out of

those 1,000 flowers, only 600 will survive up to day 30.

Gabor: Yeah, you can put it that way also.

Kirill: Okay, cool. And then with the hazard function, how does it

work?

Gabor: Actually with that you will get a hazard score which, while

that score is including the survival function of course, it’s an

increasing function not like the survival function. So, then

you will get a hazard score. For example, it’s almost like

minus the … 1-minus the survival function, but it just

incorporates. The survival function incorporates the hazard

function inside of it.


Kirill: Okay makes sense. Like at the start you have an engine, it’s

brand new, everything is okay, so the hazard will be like very

low. But then if you go forward and go further and like 100

days later you have that same engine, your hazard score will

be higher meaning that there’s a higher chance that it will

break down.

Gabor: Yes, true. And of course, these scores are assigned based on

the historical data that you are using, and the features …

Kirill: Okay, gotcha. I see now. So, the hazard function will take into

account that let’s say, the engine won’t survive for another,

like, one day or for another two days, is that right?

Gabor: Yes, it’s true.

Kirill: So it depends on the time that you want to evaluate. It’s

already lived. It’s survived like 100 days, now you want to see

will the engine survive another five days, then you’ll have one

hazard score. But if you want to see will the engine survive

another 50 days, then you’ll have a worse hazard score,

because it’s less likely to survive longer.

Gabor: Yeah.
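To make the relationship between the two functions concrete: in the standard definitions the hazard is a rate at one instant, and the survival function is the exponential of minus the accumulated hazard, so neither is 1 minus the other. The question of whether an engine that has already run 100 days lasts another 5 or 50 days is then answered by the ratio S(100 + extra) / S(100). The Python sketch below assumes a Weibull lifetime model with invented parameters, chosen only because it gives a hazard that rises with age; in general a hazard does not have to be increasing.

```python
import math

SHAPE = 2.0    # assumed Weibull shape; above 1 means the hazard rises with age
SCALE = 200.0  # assumed characteristic life in days

def hazard(t):
    """Instantaneous failure rate at age t (per day), given survival up to t."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)

def survival(t):
    """Probability of surviving beyond age t; equals exp(-cumulative hazard)."""
    return math.exp(-((t / SCALE) ** SHAPE))

def conditional_survival(t, extra):
    """Probability of lasting `extra` more days, given survival up to age t."""
    return survival(t + extra) / survival(t)

p5 = conditional_survival(100, 5)
p50 = conditional_survival(100, 50)

print(survival(0))                # 1.0
print(hazard(10), hazard(100))    # the rate is higher for the older engine
print(round(p5, 4), round(p50, 4))
```

Under these assumptions the chance of lasting another 50 days is lower than the chance of lasting another 5, which is the intuition in the exchange above, expressed through the survival function.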

Kirill: Okay, that’s pretty cool. There is a whole mathematical

apparatus behind this survival theory, I think that’s why I

found it very interesting in the first place. Very well defined


and it can be applied to many different problems in life, like

you say, machinery and other areas as well.

Gabor: Yeah, that’s true. I’m glad you like it also. It’s really interesting

actually and fascinating how it can be applied for anything.

Kirill: Awesome. Well, guys, if you found this quick overview of

survival analysis interesting, have a look into it, I think it’s a

pretty cool area of data science which is good to at least know

about.

Okay, so you said you did survival analysis at your previous

job, now you’re working with solar and wind, but before that

you mentioned text analytics. What were you doing in the

space of text analytics?

Gabor: Yeah. Well actually I got that job because of my thesis work,

because during the university, my thesis was based on social

media analytics and it counted as one year of laboratory

work. Actually, there I broadened my knowledge in natural

language processing like, text mining, classifying the

sentiment and emotions of reviews from social networking

sites like Yelp, Foursquare and Twitter also. I really enjoyed

doing it, I did some parts of my work in Python, SPSS, and

RapidMiner, but the heavy part, the computation and running

the algorithms was done in R. I don’t know, I just find it quite

handy doing this in R.


I got that job because of this project. I started working as a

social media data analyst, where I had to rely on my

knowledge of text analysis, of keyword-based information

extraction and so on, and I was analysing the sentiments of

posts, comments, messages, forum questions for tech

companies. If they were positive or negative mentions, or if

they were about a specific product like hardware or software

related issue, etc. Well it was quite interesting. I still believe

that businesses really can make a difference and improve

themselves by gaining knowledge on their customers from

social media. Because everybody’s posting and texting a lot of

things and if you analyse it well, you can really improve your

business.

Of course, you know what, the hardest part was like, it was

getting harder and harder when we extended the regions, we

were not just analysing the English content but when you

have European languages like Hungarian, Polish, Greek, and

so on, that’s the heavy part. Actually, there I’ve been looking

into some of the best sentiment analytics tools but I haven’t

found one that has great accuracy. So, in this case you

will need the help of manual categorization by language

professionals or someone who can do it for you. Actually, now

I think back that I really liked that job, I really liked text

analytics because maybe for all data scientists it’s one of the

most common things to like, analysing texts. I really liked it.

There, I became a senior data analyst very quickly as they

actually saw that I know what I’m doing, and they trusted my

insights, and during those two years of work, I kind of helped

set up a team of like multilingual international data

analysts, like 5-10 people. It was really interesting.


Kirill: That’s so cool. What tool did you use for that, was it Python

or something else?

Gabor: First in some parts I was using R, some parts I was using

Visual Basic and Excel but the most part of it was done by …

Salesforce has quite a good marketing tool for this

sentiment analysis and kind of like web scraping, it’s called

Radian6. I’m not sure if they still call it like that, at that time

it was Radian6 by Salesforce. It’s a very powerful tool that you

can just collect a lot of things from blogs, technical forums,

you can just give the URLs and so on and you can just collect

any kind of information by … Like, you know it also relies on

this Boolean keyword extraction where you can define what

keywords you want to use with “and”s or “or”s. It was really

cool. And you can just download a lot of things, it can be

connected even to your Facebook business pages or Twitter

business pages, with Google+, everything, and you can even

see the inbox messages, you can just analyse the inbox

messages also. It’s really cool. We were doing like a customer

service job, I wasn’t involved in answering the customers but

I was the one analysing and gathering the insights for our

clients.
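The Boolean keyword matching described above can be sketched in a few lines of Python. This is not the Radian6 interface; the `matches` function, its `all_of` and `any_of` parameters, and the sample posts are hypothetical and only show the idea of combining keywords with "and" and "or".

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, all_of=(), any_of=()):
    """True if the text has every term in all_of AND, when any_of is given,
    at least one term from any_of."""
    tokens = tokenize(text)
    has_all = all(term in tokens for term in all_of)
    has_any = not any_of or any(term in tokens for term in any_of)
    return has_all and has_any

# Made-up posts for illustration.
posts = [
    "The iPhone camera is amazing",
    "My iPhone battery dies too fast",
    "Android camera comparison",
]

# Query: iphone AND (camera OR hardware)
hits = [p for p in posts if matches(p, all_of=["iphone"], any_of=["camera", "hardware"])]
print(hits)  # ['The iPhone camera is amazing']
```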

Kirill: All right. Can you walk us through this? Once you let’s say

use this tool that you mentioned or some other tool and you

connect it, what happens? It goes to the webpage, it finds a

comment, it downloads it into a file, or into a database and

then what? It restructures it, because all comments have

different structure? Can you walk us through the process

from start to finish, please?


Gabor: You can define what kind of platforms you want to make your

search on and then you can define the keywords that you

want to look for, for example if you’re looking for Apple, you

can just put Apple and plus you can put iPhone and plus if

you’re analysing the different models, you can put like, 4, 5,

6, 7, 8 and so on, and of course then you can for example add

like the keyword for camera or hardware or something. Then,

this tool will find all the posts, all the content on those sites

that you were including in the beginning with these keywords

and then you can download them via, like, Excel, CSV files,

even XML, anything you want. You can probably even connect

it to a database and just put all the data there. It actually kind

of structures your data, of course the textual data will not be

structured in the beginning but it can also, before

downloading the data, you can just say that I want to see the

positive and negative mentions also. And it has these prebuilt

algorithms for finding the positive and the negative mentions

based on, probably it’s also based on something like lexicon

or something that’s behind it but it doesn’t really work with

other languages, you can tweak it manually to work with it

but it takes a lot of time. But with English it’s working quite

well. Of course, with these things, you will always have the

problem of sarcasm in everything, because you might find it’s

negative but then it might be positive and your ratio will not

be the one that you were looking for.
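
Editor's note: Gabor guesses that the tool's positive/negative tagging is lexicon based. A minimal sketch of that idea follows, assuming a tiny hand-written lexicon (hypothetical, not the tool's actual word list). The last sample sentence shows the sarcasm problem he mentions.

```python
import re

# Tiny hypothetical lexicon; a real one would hold thousands of scored words.
LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "terrible": -1, "broken": -1}

def sentiment(text):
    """Sum the lexicon scores of the tokens and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up mentions, purely for illustration.
mentions = [
    "I love the new camera",
    "The screen is broken and the battery is terrible",
    "Oh great, it crashed again",  # sarcastic: a word count cannot tell
]
labels = [sentiment(m) for m in mentions]
print(labels)
```

The sarcastic third mention is labelled positive because "great" is the only lexicon word in it, which is exactly how the positive/negative ratio ends up skewed.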

Kirill: Okay, gotcha. So, how do you deal with the fact that some

comments might be very long, some comments might be very

short data. Does it matter or not at all?

Gabor: It doesn’t really matter. I was doing a lot of text mining and

analytics with textual data and one of the easiest things to do


is like build a term document matrix where it contains the

frequency of the words, that’s in one of the documents. For

example, one document can be just one line of text or one

sentence. You can just structure it really well and it doesn’t

matter how long it is. Of course, it will probably take more

time to do it but it doesn’t really matter.
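
Editor's note: a term-document matrix of the kind Gabor describes can be built with the Python standard library alone. The sketch below assumes simple lowercase tokenization and uses two made-up documents of different lengths; the function name is hypothetical.

```python
import re
from collections import Counter

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the count of
    vocabulary[i] in documents[j]. Rows are terms, columns are documents."""
    counts = [Counter(re.findall(r"[a-z0-9']+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# One short and one longer made-up document.
docs = [
    "Great camera",
    "The camera is great but the battery is not great at all",
]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(f"{term:<8} {row}")
```

Each document becomes one column regardless of its length, which is why short and long comments can sit in the same structure.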

Kirill: Okay, gotcha. How long would you say it would take

somebody who has never used text analytics before to get into

this field and be able to perform their first text analytics?

Gabor: Well, I would probably say that not too much because really,

if you’re interested in it just go put like, text mining in Python

or R, there are a lot of packages that you can just download.

It’s really easy to use. Once when I started getting into it

deeper during the university, I was using R because it had a

text mining package, and I was using Python also for web

scraping, and actually once I created, through a web API of

Twitter, I was able to easily download a lot of posts, like an

automated job. It downloaded my posts during the night and

then I was analysing them during the day. It’s really cool, it’s

really easy. If you guys out there are really interested, just

take a look at it.

Kirill: Awesome. Thanks a lot for sharing those insights into text

analytics. It sounds like a very exciting space for people to get

into. You are right, a lot of companies will need to do more

and more of that because there is so much unstructured data

floating around and now companies are just getting good at

using their structured data, the ones that are using it, and


the next frontier for competitiveness is unstructured data,

and part of that is text analytics.

All right, you’ve done quite a lot of different stuff. You’ve done

text analytics, you’ve done survival analysis, building some

forecasting predictive models. What would you say is the most

exciting for you and also what are you looking forward to

learning the most? What is the next type of data science that

you can’t wait to get your hands on?

Gabor: Well, probably as I’m working in the energy industry now, I

will go deeper into energy forecasting and how we can use all

the other features from weather insights like for example we

can use even the precipitation to forecast wind power, and these

kinds of things. I would like to go into details with, like, what

else to use in these algorithms.

Kirill: Okay. That’s fair enough and sounds like a big area of data

science that I didn’t even know existed until today. What I

wanted to ask you is you said that you were doing a thesis on

text analytics, that was part of your research and that’s how

you got your job. What exactly did you study at university?

Gabor: Well, that’s a funny job actually. It might take a bit longer to

tell you but actually I graduated as a Business Informatics

Engineer but it has kind of a longer story how I got there.

Kirill: All right, tell us.


Gabor: My professional career started with an internship at banking,

where I worked under the supervision of seniors like business

consultants dealing with IT demand management for

business departments and I got to handle smaller and larger

Excel sheets for administrative purposes. Seriously, I

remember my first day at work and my boss gave me an Excel

sheet containing the project backlog and I was just so afraid

to touch it.

[laughter]

It was during my bachelor’s but I was like, wow I have to do

something with this. But you know, it was that job that

pushed me into analytics. At that time, I had a conversation

with my boss about staying there full-time after my bachelor’s

and she said that, “Yeah, I would like to hire someone with a

master’s,” but at that time I didn’t think I would want to have

one. And the funny part here is that a few days later, I actually

had a dream where I was talking to my supervisor at the

university, I was handing in some papers to him when he

looked at me and he said, “Hey Gabor, I think you made a

mistake here because these papers are for bachelor’s and you

need the application for the master’s.” I woke up, I

immediately checked the application deadline for master’s

and it was a week from that day and it was my birthday.

I was like, okay, that’s a sign. And I got into one of the best

universities in Hungary for master’s and my specialization

here was like Business Intelligence and Analytics, and it was

mainly about data mining, customer analytics, going deeper

into machine learning algorithms. I also remember that I was


even building and calculating decision trees and neural

networks on paper for smaller data sets.

Kirill: That’s so cool. You got into data science because of a

dream, that’s so awesome.

Gabor: Yeah, kind of. If nothing else, that was a sign for sure.

Kirill: Why did you choose data science for master’s even though

you studied something else as a bachelor’s?

Gabor: I was studying bachelor’s as Business Information Systems

and there I got to study some decision science, business

intelligence and then it sounded really cool to study business

intelligence and that’s what the advertisement was saying for

this specialization because it was in English and it said,

“Business Intelligence and Analytics” and I really wanted to

do some analytics jobs also in the future. And it happened

and I’m really glad that I did it because I’m very happy now.

Kirill: That’s cool. What I also wanted to ask you is, you said your

thesis helped you get your job. How did you use your project,

your research at uni, how did you use that to show to

employers? Did they accidentally find it themselves, or were

you proactively sending it out to people? How did that

happen?


Gabor: No. I had a friend in university and her aunt or someone

worked at the company, at Sykes, and she was saying that it

would be nice if you could go to an interview there because

they’re looking for someone with what we are doing at the

university and my thesis was about it. Then I went there and

I was talking about my project and at that time in Hungary,

they hadn’t seen anyone else before who was doing exactly

what they needed.

Kirill: So it was very easy for you to get it then, if you were the only

one in the whole country.

Gabor: Probably not the only one in the country but who they were

checking.

Kirill: Who they had seen before. Okay, that’s a very interesting

story. All right, so I’ve got a few rapid-fire questions for you

which I would like to pose, are you ready for these?

Gabor: Yeah, sure.

Kirill: Okay. What’s the biggest challenge you’ve ever had as a data

scientist?

Gabor: It’s a hard one. Actually, we haven’t talked too much about

feature engineering and data wrangling until now. But they

are kind of one of the most important and most stressful

parts of being a data scientist, when you have to transform


and map the data from one format to another or to create a

long data structure from wide data or vice versa. Or when

you have to gather or spread the data based on some specific

key value pairs. To do it well it can be really stressful and

time consuming sometimes. That’s the biggest challenge that

I have.

I think the biggest challenge, generally, being a data scientist

is preparing the data in the right format to be used as an

input for the machine learning algorithms, because this is

what they don’t teach too deeply in the universities. It’s just

easy to use some premier tidy data that they give you in the

university but in real-life scenarios, it’s really not the case.

You have to be careful how to make that input data that you

will feed your algorithms with.
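
Editor's note: the wide-to-long reshaping Gabor mentions ("gather") and its inverse ("spread") can be sketched in plain Python. The helper names, column names and numbers below are made up for illustration.

```python
def gather(rows, id_key, value_columns, key_name="key", value_name="value"):
    """Wide to long: emit one record per (row, value column) pair."""
    return [
        {id_key: row[id_key], key_name: col, value_name: row[col]}
        for row in rows
        for col in value_columns
    ]

def spread(records, id_key, key_name="key", value_name="value"):
    """Long to wide: collect the key/value pairs of each id into one row."""
    wide = {}
    for rec in records:
        row = wide.setdefault(rec[id_key], {id_key: rec[id_key]})
        row[rec[key_name]] = rec[value_name]
    return list(wide.values())

# Made-up wide table: one row per machine, one column per month.
wide_rows = [
    {"machine": "A", "jan": 3, "feb": 5},
    {"machine": "B", "jan": 0, "feb": 2},
]
long_rows = gather(wide_rows, "machine", ["jan", "feb"],
                   key_name="month", value_name="failures")
print(long_rows[0])

# Spreading the long records again recovers the original wide table.
round_trip = spread(long_rows, "machine", key_name="month", value_name="failures")
print(round_trip == wide_rows)
```

In practice a data frame library would do this in one call; the point here is only what the two operations do to the key-value pairs.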

Kirill: Yeah. What I also find is it’s really hard to find courses or

even books that can prepare you for that because it’s hard to

come up with a data set that’s dirty intentionally, right? You

commonly find them, you come across them during your

work and you get caught by surprise, but when you want to

find one, or when you want to create one yourself, it’s not an

easy task for exercise purposes. I think that’s why a lot of

courses miss out on that side of things.

Gabor: Yeah, I think so.

Kirill: Okay. Thanks for your answer to that question and now let’s

move on to the next one. What is a recent win you can share

with us that you’ve had in your role? Something that you can


disclose, I know that there’s certain things that you cannot

talk about, classified things maybe, but let’s talk about

something that you can share, something you’re proud of

that you’ve achieved or accomplished.

Gabor: Back in my previous job when I was doing survival analysis,

I actually got a hold of some historical GPS data of the

machines, like coming from sensors and IoT devices, and I

analysed if there is any correlation between the machine’s

location and the failures or survival time, and it showed that

actually there was. The survival time in one region differed

from the survival time in another.
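
Editor's note: one common way to compare survival time between regions is a Kaplan-Meier curve per region. The transcript does not say which estimator Gabor used, so Kaplan-Meier is an assumption here, and the durations below are invented toy values, not his data.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with a failure.

    durations: time until failure, or until the machine left the study.
    observed:  True if a failure was seen, False if the record is censored.
    """
    failures = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Made-up months-to-failure per region, purely for illustration.
regions = {
    "mountain": ([2, 3, 3, 5, 8], [True, True, True, True, False]),
    "urban": ([4, 7, 9, 12, 12], [True, False, True, True, False]),
}
for name, (durations, observed) in regions.items():
    print(name, [(t, round(s, 2)) for t, s in kaplan_meier(durations, observed)])
```

With these toy numbers the mountain curve drops faster than the urban one, which is the kind of regional difference described above.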

Kirill: Wow, that’s so interesting.

Gabor: Yeah, and I don’t need to say that I got so happy because I

got to analyse more and of course I had some discussions

with the business side, the industry side, and I started

extracting and gathering terrain data, like land cover from

raster files and also data from weather stations to get more

features to use in the survival analysis. At the end, I got

better results with these new features, so I included them in

the solutions. I think it was quite a win.

Kirill: That’s really cool. Did you ever get to the reason behind why

in a certain region the survival was lower?

Gabor: Yeah, actually if you think about the US like for example if

you just think about one state like if there are, like, rocky


mountains, of course your machine will fail more likely than

when your machine is working in an urban area. If you think

about it it’s really simple and it shows in the research also,

very cool.

Kirill: That’s cool. And so were these machines constantly stationed

in those separate regions or were these machines travelling

across different regions?

Gabor: During two failures or services, or between those services,

they were kind of in the same type of location.

Kirill: So it was just generally the case that the machines in this

area are more likely to fail than machines in this area so we

need to service them more often.

Gabor: Yeah, that’s right.

Kirill: Very interesting. And so nobody knew that before you did

that analysis?

Gabor: Well, not in that solution.

Kirill: Not in an analytical way. That’s really cool. Awesome. It’s

always fun, I find that, to relate back to the reality of things

or the business knowledge. That you find something in the

data and then you look at the locations and you’re like, oh


that does make sense, it is more rocky mountains here and

it is an urban area here. It’s good when the real life confirms

what you see in the data, isn’t it?

Gabor: Yeah, that’s true.

Kirill: Awesome, that’s a big win. What is your one most favourite

thing about being a data scientist, except for cleaning data of

course?

Gabor: Yeah, that’s the best part. Well, to learn new approaches, to

discuss your results with other professionals. The

visualization part is one of my favourites. Currently I’m using

R, the ggplot2 package for visualization, but I’ve also used

Tableau and Power BI where you can just import any kind of

data, you can just create nice plots. Another favourite thing

is that there is always something new to do, to make research

about, to try them, to implement them.

Kirill: Yeah, that’s really cool. I like that as well. It’s constantly

growing, there’s always things that you’ve got to be up to date

with to see what’s happening and you can also contribute to

the new areas of data science, so that’s really cool.

From all the various experience you’ve had and the

types of work even, like you’ve had to work in the office,

you’re now working remotely and so on, where do you think

the field of data science is going and what should our

listeners look into to prepare for the future?


Gabor: Everything. Actually, this is a good one. I’ve recently been to

an AI summit in Vienna and there was this inspiring talk of

Sepp Hochreiter if you know him. He’s a professor at

Johannes Kepler University.

Kirill: That’s so cool that you got to hear him. He created the

LSTMs, long short-term memory for recurrent neural

networks in deep learning.

Gabor: Yeah. He asked this question, who of you are still going to

human doctors? And everyone raised their hands. Then he

asked, who is going to AIs, then of course no one put their

hands up, and he just concluded that, “Well, this will change;

Don’t trust human doctors for diagnostics because machines

are better.” Surely, he was just making a joke but if you think

about deep learning and image processing, and of course you

have this course on Udemy. You know machines can just

better classify for example skin cancer, because it can learn

from millions and millions of images, whereas doctors might

have just thousands of patients during their practice. So, I

think this is really interesting to see how far machine

learning and AI and data science will go in the near future.

Kirill: That’s so cool that you bring that up because I totally agree.

There’s an app, you can actually download it. It’s called Skin

Vision I think. There’s another one, M-I-I Skin, Miiskin. But

Skin Vision I think I’ve heard of before, and basically …

Anyway, there’s an app you can get on App Store, I’m not

sure about the name but you can get it on App Store, for


iPhone, maybe for Android, and it does exactly that. You take

a photo of something on your skin you’re worried about, and

based on thousands and millions of images that its

algorithm, its deep learning algorithm has been through, it

can help understand. Whatever algorithm’s in the

background, can help understand if that’s skin cancer or not

and they actually tested it against doctors and it performed

as well as several professional skin doctors in the world and

so, yeah, exactly what you’re saying. We’re getting into an age

where it’s going to be AI mostly.

Gabor: Yeah, I totally agree.

Kirill: Awesome. Thank you so much for coming on this show and

sharing your story and insights. If our listeners would like to

get in touch or follow you or maybe ask questions or see how

your career goes further, where is the best place to get in

touch with you?

Gabor: I’m on LinkedIn, so just feel free to send me a message or

request if you have some questions or you want to learn

more, or I can learn more from you, then definitely send me

a request.

Kirill: Okay, awesome. We’ll share your LinkedIn on the show

notes. I have one final question for you. What is your one

favourite book that you’d like to share with our listeners

today?


Gabor: It’s really not easy to say just one book. It’s hard but since

I’m still really in love with text mining and I talked a lot about

it, and how much I really love it and its applications, I will

recommend the Introduction to Information Retrieval by

Manning. It has topics on how to build web search engines,

how they work, also areas on text classification and

clustering, indexing, ranking, and so on. I can only

recommend it for those who are interested in text analysis,

it’s really a great book. Next to it you can just create your

own stuff on R, Python, anything really.

Kirill: Okay. That’s awesome. Well, there you go, Introduction to

Information Retrieval by Manning. And if there’s a book you’d

like to recommend to people who are not interested in text

analytics, does anything come to mind?

Gabor: Yeah. That was the first book that I got a hold of when I was

studying data mining during the university, it’s called

Introduction to Data Mining. I will send you the link.

Kirill: Okay, sounds good. So, our second book is Introduction to

Data Mining.

Well, there we go. Once again, thank you so much, Gabor,

for coming on this show and sharing all the insights. I really

appreciate you taking time and good luck with the remote

work and can’t wait to see how your career goes from here.

Gabor: Thank you very much.


Kirill: So, there you have it. That is Gabor Solymosi and his story

and how he’s moved through the space of data science. Hope

you find that inspiring and it’s hopefully going to give you

some ideas of how you can structure your path better.

Personally, what I found most inspiring was of course how

Gabor, since our catch up in Europe, has moved to a remote

data science role, he’s working from home now. That is very

in line with his passion of being free and being able to do

what he loves and at the same time being able to do it from

wherever he wants and on his terms. He is a very talented

data scientist and I’m sure that by being able to control his

own time as he desires, he is bringing even more value to the

organisation that he’s working for and I’m super excited that

they have created this environment for him. I can’t wait to

hear how his story will progress, I’m very excited to hear how

it will progress over the next couple of months or next year

or so.

Make sure to find Gabor on LinkedIn and follow along his career

path and see where it takes him. You can find the URL to

his LinkedIn at the show notes at

www.superdatascience.com/107. There you will also find

the transcript for this episode and any other materials that

we mentioned during the show.

On that note, thank you so much for being here today. Can’t

wait to see you back here next time, and until then, happy

analysing.



