On the surface, Matchmade is all about creator marketing, campaigns, and measurement, but behind the scenes, powering all that is an impressive amount of data and technology. In this blog series, we peek behind the scenes and interview the fantastic folks from our engineering team. Hope you enjoy it! Got some questions you’d like to ask? Drop us a line at firstname.lastname@example.org
This time we sat down with Niko, our senior data scientist and the wielder of multiple hats. Without further ado, meet Niko!
What do you do here at Matchmade?
I think my business card claims that I’m a Senior Data Scientist, but in practice I work on different things across our whole stack. On one hand I’m maintaining our (data) infrastructure, and on the other hand I’m working directly on all kinds of new features. I do data science as well! But I don’t think data or science should be done in a silo, and it’s important that it all ties back to the product.
So, let’s say I’m a full stack engineer, with interest in data and science!
What are you currently working on?
For the past few months, I’ve been expanding our support for a new social media platform: Instagram. It’s crucial for the business that we cover all of the major social media platforms and aren’t dependent on just one. It’s been an exciting engineering exercise as well.
As we’ve been adding support for Instagram, we’ve revisited many of the earlier assumptions we made when building our platform with just YouTube (and Twitch) in mind. How can we make sure that, after adding Instagram now, we’re in a better position to add more platforms in the future?
One of the things I worked on was a new microservice that abstracts away authenticated API calls to all kinds of social media platforms. On our platform, content creators grant us access to their YouTube accounts, and now their Instagram profiles. This allows us to fetch more detailed analytics and to better recommend advertisers to work with. The new microservice makes it easy to add new platforms in the future and keeps the code and data nicely isolated.
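As a rough illustration of that kind of abstraction, here is a minimal Python sketch. It is not Matchmade’s actual service: the names `Credentials`, `PlatformClient`, and `PlatformGateway` are hypothetical, and the clients return canned data instead of calling any real API. The point is only that callers talk to one gateway while each platform lives in its own isolated class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    """Access granted by a creator (hypothetical fields)."""
    creator_id: str
    access_token: str


class PlatformClient(ABC):
    """One implementation per social media platform."""

    @abstractmethod
    def fetch_analytics(self, credentials: Credentials) -> dict:
        ...


class FakeYouTubeClient(PlatformClient):
    def fetch_analytics(self, credentials: Credentials) -> dict:
        # A real client would call the platform API with the access token.
        return {"platform": "youtube", "creator": credentials.creator_id}


class FakeInstagramClient(PlatformClient):
    def fetch_analytics(self, credentials: Credentials) -> dict:
        return {"platform": "instagram", "creator": credentials.creator_id}


class PlatformGateway:
    """Single entry point; callers never touch platform-specific code."""

    def __init__(self) -> None:
        self._clients: dict[str, PlatformClient] = {}

    def register(self, name: str, client: PlatformClient) -> None:
        self._clients[name] = client

    def fetch_analytics(self, name: str, credentials: Credentials) -> dict:
        try:
            client = self._clients[name]
        except KeyError:
            raise ValueError(f"unsupported platform: {name}") from None
        return client.fetch_analytics(credentials)


gateway = PlatformGateway()
gateway.register("youtube", FakeYouTubeClient())
gateway.register("instagram", FakeInstagramClient())
creds = Credentials(creator_id="creator-1", access_token="dummy-token")
result = gateway.fetch_analytics("instagram", creds)
```

In this sketch, supporting another platform means writing one more `PlatformClient` subclass and registering it; nothing that calls the gateway has to change.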
What’s the coolest thing you’ve built at Matchmade?
It’s got to be the data pipeline that handles updating, indexing, and storing our social media data. For example, we check and process statistics of 1.5 – 2.5 million YouTube videos per hour. The whole pipeline is split into a few smaller microservices.
“We check and process statistics of 1.5 – 2.5 million YouTube videos per hour”
First, there’s a service that periodically fetches and syncs data from the social media APIs and pushes the results to a Kafka stream. It keeps track of hundreds of millions of YouTube videos and channels and schedules periodic updates. All the scheduling happens within the boundaries of this service, so none of the other services need to worry about whether we’re refreshing stats for a given channel often enough.
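One common way to keep such scheduling self-contained is a priority queue ordered by the next due time. The sketch below is an assumption about how this could look, not a description of the real service: `RefreshScheduler`, its method names, and the fixed per-channel interval are all hypothetical, and a production system tracking hundreds of millions of items would need persistent storage rather than an in-memory heap.

```python
import heapq


class RefreshScheduler:
    """Tracks when each channel is next due for a stats refresh."""

    def __init__(self) -> None:
        # Heap entries are (due_time, channel_id, interval_seconds).
        self._heap: list[tuple[float, str, float]] = []

    def add(self, channel_id: str, interval: float, now: float = 0.0) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._heap, (now + interval, channel_id, interval))

    def pop_due(self, now: float) -> list[str]:
        """Return channels due at `now` and reschedule each of them."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, channel_id, interval = heapq.heappop(self._heap)
            due.append(channel_id)
            # The next refresh is one interval after this one.
            heapq.heappush(self._heap, (now + interval, channel_id, interval))
        return due


scheduler = RefreshScheduler()
scheduler.add("channel-a", interval=10)  # refreshed often
scheduler.add("channel-b", interval=60)  # refreshed rarely
first = scheduler.pop_due(now=5)    # nothing is due yet
second = scheduler.pop_due(now=10)  # channel-a is due
third = scheduler.pop_due(now=60)   # both are due
```

Whatever `pop_due` returns would then be fetched and published to the stream, so downstream consumers only ever see fresh results and never the schedule itself.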
On the other end are a bunch of different consumers, processing and analyzing the updated data. The volume of the incoming data is something we need to keep in mind with all the analysis we do, which adds its own challenges. If running an NLP model, or even just the database upserts, takes too long, we’ll fall behind. Milliseconds matter.
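To put the volume in perspective, the arithmetic below converts the quoted 1.5 – 2.5 million videos per hour into a per-second rate and an average time budget per video. The `parallel_consumers` parameter is a hypothetical knob added for illustration; the interview does not say how many consumers run in parallel.

```python
SECONDS_PER_HOUR = 3600


def per_second(videos_per_hour: float) -> float:
    """Average number of videos arriving each second."""
    return videos_per_hour / SECONDS_PER_HOUR


def budget_ms(videos_per_hour: float, parallel_consumers: int = 1) -> float:
    """Average time each consumer may spend per video without falling behind."""
    return parallel_consumers * SECONDS_PER_HOUR * 1000 / videos_per_hour


low, high = 1_500_000, 2_500_000
rate_low = per_second(low)      # roughly 417 videos per second
rate_high = per_second(high)    # roughly 694 videos per second
budget_low = budget_ms(low)     # 2.4 ms per video for a single consumer
budget_high = budget_ms(high)   # 1.44 ms per video for a single consumer
```

At the upper end a single sequential consumer would have under one and a half milliseconds per video, which is why slow models or slow upserts quickly translate into lag.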
Kafka is something I hadn’t used before, but I’ve since fallen in love with it. It’s amazingly convenient to be able to add a new consumer or analysis tool and just replay previous data to get things up and running. Replaying is just as handy when (not if!) we introduce new bugs. Overall, the system we’ve built is very resilient to all kinds of hiccups and issues.
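The replay idea can be shown with a toy append-only log in which every consumer keeps its own read offset. This is a simplified stand-in written for illustration, not the Kafka client API: `InMemoryLog` and its methods are hypothetical, and real Kafka adds partitions, consumer groups, and retention limits on top of the same basic model.

```python
class InMemoryLog:
    """Append-only log; a stand-in for a single-partition topic."""

    def __init__(self) -> None:
        self._records: list[dict] = []
        self._offsets: dict[str, int] = {}  # consumer name -> next offset

    def append(self, record: dict) -> None:
        self._records.append(record)

    def poll(self, consumer: str) -> list[dict]:
        """Return unread records for `consumer` and advance its offset."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._records)
        return self._records[start:]

    def seek_to_beginning(self, consumer: str) -> None:
        """Rewind a consumer, for example after fixing a bug in it."""
        self._offsets[consumer] = 0


log = InMemoryLog()
for video_id in ("v1", "v2", "v3"):
    log.append({"video_id": video_id})

indexer_batch = log.poll("indexer")   # existing consumer reads all three
indexer_again = log.poll("indexer")   # nothing new since the last poll
new_tool_batch = log.poll("new-nlp")  # a brand-new consumer starts from zero

log.seek_to_beginning("indexer")      # replay after a bug fix
replayed = log.poll("indexer")
```

Because records stay in the log after being read, adding a consumer or rewinding an existing one never requires asking the producer to send anything again.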
What technologies do you use?
I work pretty much throughout our whole stack, so there are quite a few different technologies. It depends on the day!
On some days I tweak and improve our infrastructure which runs on Kubernetes and a combination of AWS services and bare metal on Hetzner. For smaller services AWS is convenient, but for data warehousing the costs on RDS can easily spiral out of control.
When I’m working on features, it’s usually a combination of Python or TypeScript, and PostgreSQL. TypeScript is great for most things, but when doing more science-y stuff, nothing beats Python, its ecosystem, and Jupyter notebooks.
I might be a weird engineer, as I genuinely find online marketing fascinating. My educational background is in applied mathematics, and many of the things from there are also relevant here. Also, given that we’re the good guys, we can actually make the world a better place by replacing the shitty ads with relevant and interesting content.
“Since the early days, we’ve had team members in both Helsinki and Berlin, so there have never been any processes relying on people being located at the same office”
One thing I really appreciate is the culture of asynchronous work and remote-friendliness in general. Of course nowadays remote work has become the norm, but we’ve been doing this for years, and it really shows.
Since the early days, we’ve had team members in both Helsinki and Berlin, so there have never been any processes relying on people being located at the same office. Instead, planning is mostly handled in Twist, day-to-day discussions in Slack, and whenever there’s a meeting, whether it’s a team daily standup or a company all-hands, calling in is the primary option.
Favorite blog post, technical or not?
Cliqz’s post Indexing Billions of Text Vectors encapsulates all of my favorite topics.
On one hand, it covers semantic similarity with text vectors and approximate nearest neighbour algorithms. These are both interesting subjects, and also relevant for us at Matchmade, for indexing and searching for social media channels.
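For readers unfamiliar with the idea, the sketch below shows the exact, brute-force version of the search that approximate nearest neighbour algorithms speed up: compare a query vector against every stored vector using cosine similarity and keep the best match. The channel names and three-dimensional vectors are made up for illustration; real text vectors have hundreds of dimensions, and this linear scan is precisely what stops scaling at billions of items.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], vectors: dict[str, list[float]]) -> str:
    """Exact search: scan every vector and return the most similar key."""
    return max(vectors, key=lambda key: cosine_similarity(query, vectors[key]))


# Toy vectors, invented for this example.
channels = {
    "gaming": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "esports": [0.8, 0.3, 0.1],
}
best = nearest([1.0, 0.0, 0.0], channels)
```

Approximate methods trade a little accuracy for speed by avoiding the full scan, typically by organizing the vectors into an index ahead of time.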
On the other hand, the post also covers the practical engineering concerns, which are often overlooked when discussing the scientific side of things. If you can’t deploy, run, and scale the models in production, they’re not going to be very useful. As I said, neither data nor science can be done in silos.
Something else I should ask?
We haven’t yet covered my favorite standard! Or, it’s not really (yet!) a standard but an RFC. Did you know that whenever you schedule a recurring meeting in your calendar, it’s not some ad-hoc, proprietary code? Instead, there’s actually a somewhat-standardized format called RRule, the recurrence rule from the iCalendar specification (RFC 5545), that describes recurrence. It’s way more flexible than what e.g. crontab can do. It’s so cool!
The next time you’re writing something that deals with stuff repeating, I recommend having a look at the RRule library of your favorite language. It helps to avoid all the different calendaring headaches, such as dealing with time zones and DST.
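As a small example of what such a library looks like in Python, the sketch below uses `rrulestr` from the third-party `python-dateutil` package, which is assumed to be installed. The rule itself (every second week on Monday and Thursday, five occurrences) and the start date are arbitrary choices for illustration.

```python
from datetime import datetime

from dateutil.rrule import rrulestr

# Every second week on Monday and Thursday, five occurrences in total.
rule = rrulestr(
    "RRULE:FREQ=WEEKLY;INTERVAL=2;BYDAY=MO,TH;COUNT=5",
    dtstart=datetime(2021, 3, 1, 9, 0),  # a Monday at 09:00
)

occurrences = list(rule)
for occurrence in occurrences:
    print(occurrence.strftime("%a %Y-%m-%d %H:%M"))
```

A rule like "every second week on two given weekdays" is awkward to express in crontab, while in RRule it is a single line of text that calendar applications can also read.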
Niko is on the lookout for new colleagues! If you, or someone you know, enjoys working with vast amounts of data and isn’t afraid of getting their hands dirty with coding and quick experimentation, we should talk. We’re hiring for all kinds of roles, ranging from data scientists to DevOps engineers and full-stack engineers. Let’s chat? Drop us a line at email@example.com