On the surface, Matchmade is all about Influencer Marketing, campaigns, and measurement, but behind the scenes, powering all that is an impressive amount of data and technology. In this blog series, we peek behind the scenes and interview the fantastic folks from our engineering team. Hope you enjoy! Got some questions you’d like to ask? Drop us a line at email@example.com
This time we chatted with Paula, The Scientist behind some of our most used features and under-the-hood analytics. Let’s hear what’s up…
What do you do here at Matchmade?
I’m the Lead Data Scientist in our engineering team. Titles aside, I spend most of my time analyzing data and building statistical models to enrich our datasets with more insights, which in turn help our customers make better decisions. Sciencing (is that a word?) doesn’t happen in a silo, so I also write code to productionize the models we’ve developed, and educate the rest of the company on how best to interpret the results.
“We’re also building practical solutions and solving problems, not just doing science for its own sake”
For a data scientist, I think the two things that set our industry apart are scale and a quick feedback loop. On the one hand, we’re sitting on top of around 500M (and counting!) YouTube videos waiting to be analyzed, sliced, and diced. On the other hand, we’re also building practical solutions and solving problems, not just doing science for its own sake. We can come up with a hypothesis, test it, and ship a solution that helps our customers do their work better, all within one week.
What are you currently working on?
I’m working on a model to detect whether a YouTube video is sponsored or not. Based on this, we can, for example, analyze what kinds of sponsorships work well on which kinds of channels, and provide better recommendations for our advertisers. Perhaps surprisingly, this also benefits the channels and their audiences. Relevant sponsorships make content creation possible while also being interesting for the people watching the videos.
On the technical level, the predictions are based on the video descriptions, and for text processing we use Facebook’s FastText. It works really well for our use case for several reasons. First of all, FastText has excellent support for language-agnostic models: the video descriptions come in many languages, and the usage of words can also be very… creative. This is far from the Wikipedia text that your default NLP model is trained on. Secondly, as the name says, FastText is fast. We need to process some thousands of videos per minute, so that sets some constraints for us.
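To give a feel for why subword features cope with creative spelling, here is a minimal sketch of character n-grams with `<` and `>` word-boundary markers, the kind of subword unit FastText can use. It is not Matchmade's pipeline and does not call the FastText library; the function names `char_ngrams` and `ngram_overlap`, the n-gram lengths, and the example words are all assumptions made for illustration.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Return the character n-grams of a word padded with < and > markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the padded word.
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def ngram_overlap(a, b, min_n=3, max_n=5):
    """Jaccard similarity between the n-gram sets of two words."""
    grams_a = set(char_ngrams(a, min_n, max_n))
    grams_b = set(char_ngrams(b, min_n, max_n))
    union = grams_a | grams_b
    if not union:
        return 0.0  # both words too short to produce any n-gram
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # A stretched-out spelling still shares most of its n-grams
    # with the standard form; an unrelated word shares none.
    print(ngram_overlap("sponsored", "sponsoreddd"))
    print(ngram_overlap("sponsored", "giveaway"))
```

Because a misspelled or stretched word shares many n-grams with its standard form, a model built on these units can still recognize it, without needing a per-language vocabulary.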
What’s the coolest thing you’ve built at Matchmade?
I think my biggest epic was building the estimator of YouTube channel views and installs for our campaigns. The estimates rely heavily on statistics and probabilities, which is always a fascinating world to dive into, as there’s always something more to learn.
Initially, we did the estimation simply based on channel averages, but we’ve come far since those early days. The current version is still built on a similar idea but takes it further. It uses Bayesian bootstrapping to estimate the distribution of the mean views. This has proven to be an excellent way to factor out outliers, viral one-hit wonders, etc. Of course, the work is never done, and next we’re planning to incorporate the data from the sponsored-content detection into these predictions.
Oh, and this was built in R and is still running in production. Nowadays, most of our science code is Python, but R will always have a special place in my heart.
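As a rough illustration of the Bayesian bootstrap idea mentioned above, the sketch below draws Dirichlet(1, …, 1) weights over a channel's past view counts and records the weighted mean for each draw. It is written in Python rather than R, uses only the standard library, and is not the production estimator: the function names, the made-up view counts, and the choice to summarize the result with 5% and 95% quantiles are all assumptions.

```python
import random


def bayesian_bootstrap_means(values, n_draws=2000, seed=0):
    """Return sorted posterior draws of the mean of `values`."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Normalized Exponential(1) draws are Dirichlet(1, ..., 1) weights.
        raw = [rng.expovariate(1.0) for _ in values]
        total = sum(raw)
        draws.append(sum(w * v for w, v in zip(raw, values)) / total)
    return sorted(draws)


def credible_interval(sorted_draws, lower=0.05, upper=0.95):
    """Pick the lower and upper quantiles from sorted draws."""
    n = len(sorted_draws)
    return sorted_draws[int(lower * n)], sorted_draws[min(int(upper * n), n - 1)]


if __name__ == "__main__":
    # Hypothetical channel: five ordinary videos and one viral hit.
    views = [1200, 900, 1500, 1100, 250000, 1000]
    draws = bayesian_bootstrap_means(views)
    print("plain average:", sum(views) / len(views))
    print("median of posterior means:", draws[len(draws) // 2])
    print("5%-95% interval:", credible_interval(draws))
```

The point of working with the whole distribution is that a single viral video shows up as a long upper tail instead of silently inflating one number, so an estimate can be based on a quantile or an interval rather than on the plain average.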
What technologies do you use?
As mentioned, we mainly write our data-related code in Python with the usual data science stack. Lately, I’ve also been learning more about Docker and Kubernetes, which we use to deploy our services. It’s pretty empowering to have a good understanding of not only the data science layer, but also the DevOps tools we use to productionize and automate the models we build.
I think the product we’re building simply makes sense. Impact-wise, it’s also cool how we make it possible for independent content creators to pay rent while continuing to delight their audiences. To be honest, marketing and social media aren’t really something I’m passionate about, but there’s so much more to Matchmade than that. I don’t think it’s really a marketing tool, but more of a marketplace.
Also, the influencer marketing industry is currently a bit of a wild west, with tons of inequality, such as influencers not getting paid fairly, fraudulent influencers, and so much more. I’m excited about the possibility of putting the data to use, driving the scammy actors out of business, and making influencer marketing fair and transparent.
In the earlier post, Alex already raved about our colleagues, so I’ll just +1 that!
Favorite blog post, technical or not?
Rasmus Bååth’s blog about Bayesian models and statistics. It’s both informative and funny, at least in the science/engineering way.
How about your favorite YouTube channel?
CrazyRussianHacker is, well, crazy and also funny.
Be sure to also check out his other channel, Kul Farm.
What’s something cool you’ve learned recently?
Have you heard of Chernoff faces? In short, the idea is to display multivariate data in the form of a face. It taps into people’s ability to easily recognize faces and notice small changes without difficulty. Maybe we could visualize differences in YouTube channels so that the viewership maps to the eyes, and the number of installs to the hair. But what would, for example, the nose represent? We’ll see…
Also, there’s a similar concept but for fish!
Paula is looking for new colleagues. If you, or someone you know, enjoy working with vast amounts of data and aren’t afraid of getting your hands dirty with coding and quick experimentation, we should talk. We’re hiring for all kinds of roles, ranging from data scientists to DevOps engineers and full-stack engineers. Let’s chat? Drop us a line at firstname.lastname@example.org