LØRN case C0052 -

Ellie Dobson

Chief Data Scientist


Use of data

In this episode of #LØRN, Silvija talks to Ellie Dobson, VP of Data Science at Arundo, about the Higgs boson, how to get the most out of your data, and how to train algorithms to retrieve data. She talks about what you can get out of your data and explains how, with enough data, you can get answers to the questions you care about. She explains what makes a good data scientist, describing such a person as one who starts with a question and not with the data. She leads the data science team at Arundo and has worked in quite a few different industries before ending up in this role. She spent several years working as a particle physicist at CERN on the LHC experiment. Ellie then spent the next few years working for a couple of tech companies in a data scientist role, effectively building predictive models in industries as far removed as fashion analytics and Formula One racing.

24 min


Welcome to Lørn.Tech – a learning initiative about technology and society, with Silvija Seres, Sunniva Rose and friends.

SS: Hello, and welcome to this edition of Lørn.Tech. I'm Silvija Seres, and with me today is Ellie Dobson, VP of Data Science from Arundo. Ellie, welcome.

ED: Thank you.

SS: Ellie, I love talking to you for many reasons. Among other things, we get to share some past data points about Oxford and PhDs and theoretical subjects. You have your PhD from Oxford in particle physics, and you've spent quite a lot of time researching particle physics, amongst other things, with a Marie Curie scholarship at CERN. Looking for some very rare particles?

ED: That's correct. Looking for the Higgs boson.

SS: What does a lady like you do in a start-up doing big data science?

ED: So I'm head of the data science group at Arundo. What I discovered when I decided to leave academia and move to industry is that a particle physics detector looks very similar to a Formula One racing car, which looks very similar to an oil rig. If you're a data scientist, all of those things are complex objects streaming lots of data, which may be useful for various reasons. It is a data scientist's job to be able to extract value from that data.

SS: How does one do that?

ED: Well, it can be a very long process. I think the first thing to do is to find out the questions that you want to ask of the data. Rather than starting from the data and saying "what can this data tell us?", you start differently by saying "what is it we want to know about this piece of equipment or this system?". What is it we want to know about our customer base, for example. Then you ask how, or if, we can ask that question of the data, and start that way around. Once you've formulated the question you want to ask of the data, which will allow you to generate value from it, you can start breaking down the problem into a set of technical algorithms which need to be constructed and which can prove or disprove your hypothesis, meaning answer or not answer the question you're seeking to answer.

SS: I think there’s an element here that many people are not really aware of. We think of data as the new oil, but we forget how incredibly expensive that data is to get and to handle. Data acquisition cost versus data relevance is really the thing you have to decide before you start modelling things, right?

ED: It's kind of a balance. At Arundo I work with a variety of customers, and what we help our customers to understand is how they can extract value from the data and what they need to do. Of course, we go in and help them do it. One simple way I find to think about it is to draw a set of two axes. On one you have feasibility and on the other you have value. So you're asking yourself, firstly, what is possible to do with the data, and secondly, on a completely different axis, what value answering the question will bring you. Take the series of questions you want to ask of the data, plot them on the two axes and see what the value and the feasibility are. It will often really help to map out what your data science endeavour will actually look like. You mentioned data acquisition and data cost. It's absolutely true that, at least in the business where I currently work, which is asset-heavy industries, one of the biggest drivers of the success of a model is very often not how clever the algorithm is but how rich the dataset is, which drives how much further you can get. That consideration would be one of the things which goes into your feasibility considerations. Let's say you have a very small dataset that doesn't have a lot of information in it; then that question would score low on that axis and wouldn't rate very high on feasibility. But your dataset might be rich in other ways that you haven't considered. Maybe you have the potential to increase the value of the dataset that you have. If costs are too high to access additional data, maybe you can get more creative about what you consider to be your data: combining a proprietary dataset that you may have with a public source of data. A lot of people tap into APIs for weather systems, for example, and I've worked with transport APIs. You can buy Twitter data, whatever floats your boat. Thinking about what ways you can expand your dataset on top of what you already have can help.
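The value-versus-feasibility mapping described above can be sketched in a few lines of code. This is only an illustration; the candidate questions and their scores are hypothetical examples, and the ranking rule (multiplying the two axes) is one simple choice among many.

```python
# Illustrative sketch of plotting candidate questions on a value axis and a
# feasibility axis, then ranking them. All questions and scores are made up.

candidates = [
    {"question": "Predict pump failure 24h ahead", "value": 9, "feasibility": 3},
    {"question": "Flag abnormal sensor drift",     "value": 5, "feasibility": 8},
    {"question": "Optimise maintenance schedule",  "value": 7, "feasibility": 6},
]

# Rank by the product of the two axes: a question scores well only when it
# is both valuable to answer and feasible with the data at hand.
ranked = sorted(candidates, key=lambda c: c["value"] * c["feasibility"],
                reverse=True)

for c in ranked:
    print(f'{c["question"]}: value={c["value"]}, feasibility={c["feasibility"]}')
```

Note how the highest-value question ("predict pump failure") ranks last here: a sparse dataset drags its feasibility down, exactly the trade-off described in the conversation.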

SS: So flexibility as well, or expandability. Can you help us visualize the transition from particle physics to data scientist? How did that happen?

ED: Well, in a way I kind of fell into it. I was working on the Higgs discovery team and I was at CERN for many years. During my time there I spent a lot of time learning how to look for a needle in a haystack of data.

SS: Why?

ED: Because the amount of data generated by these accelerators is colossal, even by today's standards, and it was already enormous 10 years ago when I was doing a lot of my work there.

SS: So accelerators move particles about and bounce them. Why? What is the problem you were trying to solve?

ED: What we were trying to do was look for a particle which is created very rarely, called the Higgs boson.

SS: Which at the time, we couldn’t prove existed?

ED: Yes. Nobody had ever seen it, but the existence of the Higgs boson is a little bit like the smoking gun of a mechanism called the Higgs mechanism, which answers a lot of unanswered questions.

SS: About? Big bang and stuff like that?

ED: About the way particles fundamentally interact with each other. It answers the question of why we have mass in the weight sense. We know we have mass, but without the Higgs there is no way to explain how it is generated.

SS: Is the Higgs very small or very rare?

ED: It's very rare. By particle-physics standards it's not very small, but by human standards it is extremely small. It is not generated very often, so you need a lot of energy and collisions to generate it. A particle accelerator is a little bit like two cars crashing on the road. Normally after this happens you get a bunch of bits of car scattered all over the road. It's the same with a particle accelerator: if you collide two protons into each other, afterwards you get a lot of bits of proton all over your detector. However, one time in maybe many millions, something rare happens: let's say a bus comes out of the collision, a new fundamental object. That's kind of what Higgs generation is like. The Higgs comes flying out of the collision, and it is a very heavy object, so when it crashes into your detector you get a lot of noise. The difficulty is how to distinguish a Higgs from stuff that looks like a Higgs but isn't actually a Higgs. There are a lot of similarities between looking for the Higgs and what my main speciality is, and our main speciality at Arundo, which is predictive maintenance. When you get a failure in a complex piece of equipment, that could be an oil rig, a Formula One car, a ship. Failures don't happen very often, hopefully.

SS: How can you notice early enough?

ED: What is the signature of a failure before it occurs, and what does it look like? That is a very challenging problem because you don't get a lot of failures. So how do you tease out this signal, a pre-failure in industrial predictive maintenance problems, with enough certainty that you can actually act on it? What you really don't want, in industrial terms and also with the Higgs boson, is false positives.
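The false-positive trade-off described here can be made concrete with a toy example. This is a hedged sketch, not Arundo's method: the per-reading anomaly scores, the ground-truth labels, and the alert threshold are all made-up numbers, standing in for whatever scoring model a real system would use.

```python
# Toy illustration of the false-positive trade-off in rare-failure detection:
# a threshold on an anomaly score decides whether an alert is raised.
# All scores, labels, and the threshold are hypothetical.

scores = [0.1, 0.2, 0.15, 0.95, 0.3, 0.85, 0.2]   # per-reading anomaly scores
truth  = [0,   0,   0,    1,    0,   0,    0]      # 1 = genuine pre-failure

threshold = 0.8
alerts = [s > threshold for s in scores]

# Count alerts on healthy readings (false positives) and genuine
# pre-failures that produced no alert (missed failures).
false_positives = sum(1 for a, t in zip(alerts, truth) if a and not t)
missed_failures = sum(1 for a, t in zip(alerts, truth) if t and not a)
print(false_positives, missed_failures)
```

Raising the threshold reduces false positives but risks missing the one real failure; with failures this rare, every false alert erodes the operators' trust in the system.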

SS: In a way, what you’re trying to say is that finding patterns in big data about stuff that usually happens is fine, but finding patterns that are far more hidden requires a lot of statistical, scientific, subject matter intuition, to figure out how to rig the system so you would recognize that.

ED: Yes. I have a fairly strong opinion that it is a fool's game to say, "machine learning, go and do it". Your algorithm's intelligence is limited by the intelligence of the people who built it, and that isn't just the data scientist; it's all the people the data scientist hopefully consults with to build the algorithm. Thinking of mistakes I've made in my past career, this is one of the pitfalls a lot of people fall into: trusting too much in the algorithm. Just getting a lot of data, throwing it into an algorithm and assuming the algorithm will do the right thing. The algorithm will not do the right thing unless it's told what the right thing looks like. In our business at Arundo we believe it's absolutely crucial to have as much subject matter expertise involved in building the algorithms as possible, so we work very closely with the operators of the rigs, with the people who know the equipment. If you ask these people "how do you currently spot a failure", they will say "we listen to the equipment". They've been working with that equipment for 20 years and they know it better than anybody else. Like I said, it's a fool's game to try to replace, or assume that you could somehow replace, subject matter expertise with an algorithm. To me the best solution is to get the best of both worlds: thinking about how you can blend human experience and expertise with an algorithm that can help humans detect things they may not otherwise detect in a very complex environment, when you have thousands of sensors that can't all be looked at at the same time. That's the nut that a lot of people are trying to crack.

SS: I think it is really interesting, because if you start following uses of artificial intelligence in, let's say, the legal space, there are some really cool products that help you, for example, anonymize or block out pieces of information you are required to, in order to forward the information to the necessary instance. If you automate that altogether, then the system will do wrong things, false positives and so on. The way these things work best is if they highlight the potential things that need attention, and then a human can help to figure out the uncertain cases. The same thing happens in, let's say, image analysis of breast cancer. It's really this hybrid model where the human somehow handles the stuff that's rare, and the machines help with the stuff that is common.

ED: Yes, so that's what I would call semi-supervised learning. A lot of what we're doing when we look at these rare failures is to sit down with the subject matter experts, because failures are rare and you can physically go through each one and ask "what is this? What happened here?". That's the way we're currently working to blend the machine and the human, in a way. We work a lot with applications to do that, so you can build applications that are kind of the interface between the algorithm and the person, whoever that may be, so they can go in and manually label failures. Then the algorithm will give predictions back according to what their labels are. Let's say the labels are a little bit wrong; maybe a particular sort of failure happens the next day that's completely different. This gets labelled and goes into the system, the algorithm updates and shows new predictions. On Coursera or something like that, I think you can often get the impression that there is one stage where you train the algorithm and a next where you put the algorithm to work making predictions, and that's the end of it. It's sometimes sold as a two-step process, and it is absolutely not; it is a process that goes round and round. The more places where you can get a physical person's input into the algorithm training, certainly in our business and many others as well, the better.
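The label-train-predict-relabel loop described above can be sketched minimally. This is an assumption-laden toy, not the actual Arundo pipeline: the 1-D "sensor readings" and the nearest-centroid "model" are hypothetical stand-ins for a real feature pipeline and learning algorithm.

```python
# Minimal sketch of the human-in-the-loop cycle: an expert labels events,
# a model is trained, predictions come back, new events get labelled, and
# the model is retrained. The data and the nearest-centroid model are toys.

def train(labelled):
    """Compute one centroid per label from (reading, label) pairs."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Assign the label of the nearest centroid."""
    return min(model, key=lambda y: abs(model[y] - x))

# Round 1: the subject matter expert labels a handful of historical events.
labelled = [(0.2, "normal"), (0.3, "normal"), (0.9, "failure")]
model = train(labelled)
print(predict(model, 0.8))   # prints "failure"

# Round 2: a new, different failure mode appears the next day; the expert
# labels it and the model is retrained. The loop goes round and round --
# it is never the one-shot "train, then deploy" process it is sold as.
labelled.append((0.55, "failure"))
model = train(labelled)
```

The point of the sketch is the shape of the process, not the model: every pass through the loop folds fresh expert labels back into training.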

SS: I need you to tell us about your examples, so we can imagine this particle accelerator, CERN, lots of white coats, finding the Higgs and the Nobel Prize, etc. The other stuff, like finding and predicting failure in a race car, or predicting fashion trends: could you quickly tell us how you did that?

ED: Yes, I can certainly tell you how we tried to do it. Predicting failure in a race car is very similar to predicting failure in something like an oil rig. That is actually what I was doing prior to Arundo, working in Formula One. A race car, to a data scientist, is a complex beast; there are about 2,500 sensors on a race car, at least when I was working in that field.

SS: Do you need to read them quickly?

ED: They are clocking fairly fast, not incredibly fast, but several times each second they will be making a reading of the various states, the pressures, the temperatures and what have you, all over the race car. There are a few things you can do with a race car. You can increase performance, which would normally be what we would describe as a regression problem, getting the car to the end a little bit faster, and there is also getting the car to the end at all.

SS: Because they do blow up sometimes?

ED: Yes, if you watch Formula One you will sometimes see smoke coming from the distance. That's also an interesting problem in data science: can you see those failures coming in real time, and is it possible for the racetrack crew to take preventative action, perhaps pull the car in for a pit stop earlier, or take some sort of action to prevent that failure? So it is very similar to predictive maintenance problems in heavy industry.

SS: Were you able to improve this? I know these are difficult problems, some are NP-complete. Did you move the needle?

ED: The difficulty with these kinds of problems, something we also run into in predictive maintenance, is quantifying the improvement an algorithm has made. The thing about predictive maintenance which is both a curse and a blessing, a curse to the data scientist and a blessing to the people who own the asset, is that failures happen very rarely. You're looking at crucial assets, which are what we want to work on because they are the important ones, and you get failures maybe once a year.

SS: You're looking for these black swans, really.

ED: I mean, time will tell. There are other sorts of problems which can also be worked on. This goes back to value versus feasibility; it comes back to "is the value in being able to prevent one catastrophic failure in the next ten years?". Depending on the customer, you could save huge amounts of money. Or do you want to continuously improve production by a fraction of a percent? That's still a lot of money. If an enterprise wants to work out its data science endeavour, it's important to recognise which of these things you are looking for. Are you looking at preventing one failure over the next ten years, or are you looking at consistently upping production, increasing efficiency, where you improve by a small percentage which is more easily quantifiable in the immediate term? In the case of Formula One, when I was working on it, it was the first category.

SS: What about the fashion industry?

ED: I sometimes give a talk at meetups and so on called "from Formula One to fashion analytics", because when I was working in Formula One I also happened to be working on fashion analytics at the same time. If you can find two worlds more different, I'd like to hear it. That was about looking at historical purchasing patterns across a variety of different retailers and trying to figure out what future patterns might be.

SS: What's hot for the next season?

ED: Well, maybe what is hot and what is not. In that case the dataset was "what's going out of stock, what is not selling, what is being marked down", looking at this across a variety of retailers, and being able to share the information between retailers, because the information was being scraped from the websites of the different retailers. This comes back a little to the point I made earlier about being creative with your dataset. One retailer may have their own proprietary dataset about their own sales; another might have a completely different dataset. What a third party can do is scrape the information off their websites, and suddenly the third party has some interesting information, because they can compare information across different retailers. That's what I would call a creative solution for expanding a dataset and being able to get more insight from it. It is also an interesting problem in asset-heavy industries when you look at failures: like I said, there are not a lot of them, a real pain for a data scientist. Imagine a world where you could combine failures from one set of asset owners with those from another set of asset owners; combining that information would build you a better model.
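The "creative dataset" idea, joining a retailer's proprietary sales records with stock signals scraped from other retailers' websites, can be sketched as a simple join. All the product names, numbers, and flags below are invented for illustration; a real system would of course work with far richer records.

```python
# Hedged sketch of expanding a dataset by joining a proprietary source with
# a scraped public one. All records here are made-up examples.

own_sales = {          # proprietary: our own units sold per product
    "red_dress": 120,
    "blue_coat": 40,
}

scraped_stock = {      # public: stock status scraped from other retailers
    "red_dress": "sold_out_elsewhere",
    "blue_coat": "in_stock_elsewhere",
    "green_hat": "sold_out_elsewhere",
}

# Joining the two sources gives a richer view than either alone: an item
# selling well for us AND sold out elsewhere is a candidate trend.
combined = {
    product: {"units_sold": own_sales.get(product, 0), "elsewhere": flag}
    for product, flag in scraped_stock.items()
}

trending = [p for p, r in combined.items()
            if r["units_sold"] > 100 and r["elsewhere"] == "sold_out_elsewhere"]
print(trending)   # prints ['red_dress']
```

The same join pattern is what the hypothetical cross-owner failure dataset at the end of the answer would need: each party's records alone are sparse, but combined they support a better model.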

SS: A cross-connection of the different silos.

ED: One consideration to make in an asset-heavy industry is whether there is some value in combining datasets across different entities.

SS: Ellie, we need to round up. I usually ask people if there are any controversies in this area, from a social perspective. I also want to hear what your biggest frustration is when it comes to big data science.

ED: My biggest frustration is when there is not enough data to support what you're trying to do. I think sometimes we're viewed as magicians: "wave your sparkly data science wand and sprinkle magic fairy dust". We're not magicians. We are analysts, and the hard truth is that what you can do as a data scientist is limited by what data is available to you. Like I said, there are ways to be creative about what you call data: maybe you can make a simulation, maybe you can get some subject matter experts to label your dataset like we discussed, maybe you can combine your data with public data sources. The biggest frustration is when there isn't a dataset to work from, because if there isn't, there is very little we can actually do. When I'm in that situation it is frustrating.

SS: Brilliant. Ellie, if people are to read one thing or want to start learning, where would you send them? The non-data people.

ED: There are a variety of MOOCs out there, massive open online courses which you can access and listen to for free. Some of them are very technical, some are not, so I would highly recommend the one with Andrew Ng. That one is for somebody who has just a small amount of technical knowledge, kind of high-school level. It is a nice place to start, and he gives a pretty good overview of the different algorithms that are out there without diving too much into the technical nitty-gritty.

SS: Ellie Dobson from Arundo, thank you so much for coming here and sharing your knowledge about big data with us. I will add one sentence, and that is that you brought your little Mathilda with you, five months old. Officially the youngest Lørner ever, that is her title. Thank you so much.

ED: Thank you very much.

SS: Thank you for listening.

You have been listening to a podcast from Lørn.Tech – a learning initiative about technology and society. Follow us on social media and on our website Lørn.tech.

Two lines on your “claim to fame”

I was part of the Higgs discovery team.

Who are you and how did you become interested in AI?

My name is Ellie Dobson and I became interested in AI due to my background in particle physics, where we used machine learning, alongside many other tools, to discover the Higgs boson.

What is your role at work?

I head up the data science team at Arundo Analytics, a company specializing in data-driven analytics and software for asset-heavy industries.

What are the most important concepts in AI?

It depends on whom you ask. To me, it is about finding the best ways to use AI in practice, to drive business value.

Why is this exciting?

AI has been around for a long time, as my dad – whose Ph.D. was in AI 50 years ago – likes to remind me. However, it is only recently that we have begun to acquire enough data to feed the AI algorithms sufficiently so they make useful predictions in a business setting.

What do you think are the most interesting controversies?

I think the key issues are the extent to which AI will replace humans in decision-making and what jobs those people will then be doing in the future as a result.

What is your own favorite example of AI?

A good example of AI is how sensor data and maintenance logs can be used to predict failures on complex industrial equipment.

Can you name any other good examples of big data, nationally or internationally?

I’m also intrigued by how sensor data and race information can be used to improve the performance of Formula One cars.

How do you usually explain how it works, in simple terms?

AI is about using historical data to teach an algorithm to link cause and effect from previous examples. That algorithm can then go on to make predictions based on what it has learned.

Is there anything unique about what we do in AI here in Norway?

We have access to many interesting datasets in the industrial space in Norway. Coupled with its booming tech sector, this makes Norway an exciting place for industrial AI professionals.

Do you have a favorite big data quote?

A bad data scientist will tell you to start with the data. A good one will tell you to start with a question.

Ellie Dobson
Chief Data Scientist
CASE ID: C0052
DATE : 181012
DURATION : 24 min
Big Data, Higgs boson, Algorithms
"A bad data scientist will tell you to start with the data. A good one will tell you to start with a question."