How to build event-based models

Building reactive systems is not only important in software engineering. In Machine Learning, building dynamic models that react to events is just as important.

In my life, there have been a few “aha” moments that deeply affected how I think about certain problems. One such moment happened during Kaggle Days in Warsaw in 2018, where Qingchen Wang presented the paper “Data-Driven Consumer Debt Collection via Machine Learning and Approximate Dynamic Programming” (link to the paper). The title, while a little intimidating, hides a wonderful idea about how to create models that deal with event data. Perhaps even more important is the idea of how your company's decisions affect the customer. The rest of this blog post is not a 100% summary of the paper but rather a paraphrase of how I understand it.

Static models

Imagine a standard problem: churn. Most tutorials, along with their datasets, present a static view of the customer: age, gender, registration date, the amount spent, etc. Your boss asks you to create a model for the next marketing campaign. You happily return after 2 weeks with the results:

“People younger than 25 years old with less than $100 spent in the last 30 days are the most likely to leave the company”

So you prepare the campaign and the call center contacts the customers. Job done… not.

You forgot about something: how did the campaign itself affect the customers? The only point of running such a campaign is to make the customer change his mind about your company. Some people would churn even if you called them 1000 times. By not taking the call itself into account, you called people who are impossible to persuade and skipped people who could have been talked out of leaving the company.

Dynamic models

The rest is simple. You build a model as you normally would, and you add the number of calls to the customer in the last 30 days as a feature. To select customers for the next campaign, you pick the people who are sensitive to the “+1” call. It is that simple. Once the model is trained on historical events, you can simply ask it what happens if you call the customer one more time. If the customer is insensitive to a call, then I would argue there is no point in calling him, is there? To create those “what if” scenarios, the event log becomes the central point of your next project.

Important
This approach is slightly different from testing the marginal effects of variables during prediction. You need to be explicit about your intentions during data preparation. To prepare the training data, you should also include the state of the customer at the time just before he was called.
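To make the “+1” query concrete, here is a minimal sketch in plain Scala. It assumes the per-customer features (including the call count over the last 30 days) have already been computed from the event log as of some cutoff time, and that the trained churn model is available as an ordinary scoring function. All names here are illustrative and not tied to any particular library.

// A customer snapshot: features computed from the event log as of a cutoff time.
case class CustomerState(customerId: String, features: Map[String, Double])

// `score` stands in for any trained churn model (e.g. predicted probability of leaving).
def callSensitivity(score: Map[String, Double] => Double,
                    state: CustomerState): Double = {
  val calls    = state.features.getOrElse("calls_last_30d", 0.0)
  val baseline = score(state.features)
  val whatIf   = score(state.features + ("calls_last_30d" -> (calls + 1)))
  // Positive values: the predicted churn probability drops after one more call.
  baseline - whatIf
}

// Select the customers most worth contacting in the next campaign.
def selectForCampaign(score: Map[String, Double] => Double,
                      customers: Seq[CustomerState],
                      budget: Int): Seq[CustomerState] =
  customers.sortBy(c => -callSensitivity(score, c)).take(budget)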

Event log to the rescue

If your company does not collect logs of customer-centric events, that is really bad news. Every good data warehouse should contain at least:

  1. Customer Id
  2. Date
  3. Event type
  4. Event attributes

Also make sure to include outgoing events like call center contacts, emails, SMS, offers, etc. Each of these event types is a tool that should affect the customer.

Of course, there are multiple options for how to structure the data itself. At least for building Machine Learning models, the structure can be kept in a single table with a JSON column holding the nested attributes.
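As a minimal illustration (the column names and types are just an example, not a prescribed schema), a single-table event log row could look like this:

import java.time.Instant

// One row per event; attributes are kept as nested JSON so every event type
// fits into the same table.
case class EventLogRow(
  customerId: String,
  date: Instant,
  eventType: String, // e.g. "transaction", "call", "email", "offer"
  attributes: String // nested JSON with the event-specific fields
)

val outgoingCall = EventLogRow(
  customerId = "C-1042",
  date       = Instant.parse("2019-03-01T10:15:00Z"),
  eventType  = "call",
  attributes = """{"direction": "outgoing", "durationSec": 182}"""
)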

Examples of projects that can leverage event-driven modeling

Churn models

We already mentioned the possibility of using this approach for detecting churn. It is one of the most obvious use cases. Funnily enough, if you want to test these ideas on freely available datasets, there are almost none. I believe this is because, even though we live in the era of big data, not enough information is kept as event logs.

Minimum viable dataset: customer transactions, calls to customers

“+1” problem formulation: find customers who are sensitive to the call, in the sense that they increase their number of transactions after being called

Predictive maintenance

This is a hot topic at the moment. Machines break if you don’t maintain them. If you can predict which machines are likely to fail soon, fixing them before the actual failure is much less expensive than repairing them afterwards.

Minimum viable dataset: sensor readings from the machines, failures, maintenance tasks

“+1” problem formulation: find machines for which you can prolong the predicted time to the next failure after the maintenance task

Debt collection

This is the topic of the paper mentioned at the beginning. During the project, the authors showed that it is possible to call customers less frequently and that the time of each call can be chosen optimally for each customer. This resulted in a roughly 40% improvement to the process.

Minimum viable dataset: information about debt payments, calls to the customers

“+1” problem formulation: call the customers who are sensitive to the call and who, after the call, are more likely to pay an overdue debt

I could go on like this forever, and I believe it is a good exercise for any data scientist to think about the problem at hand and how to make it more dynamic.

Libraries and tools

You have a dataset and an idea of what model to create. What tools can you use?

There are 2 options:

  1. You can write your own library. While it is not an easy task, it is sometimes needed. The case you are solving can be very specific, and sometimes your own code is the best option.
  2. You can use one of the existing options, which I describe below. All of them try to solve a similar problem.

There are some choices when it comes to libraries that can handle events.

What is especially interesting is that most of the libraries are written in Scala. Some of them are written in Scala but expose a Python interface. Only 2 of them are pure Python. Scala is a popular choice here because some of the tools rely on Spark/Flink, which rules out most other programming languages. It is one of the reasons we are betting on Scala for writing Big Data systems.

Here are the library choices (full disclaimer: the first one is our library).

This library, EventAI, is our child. We offer it to our clients and help them implement it on premises. It hasn’t been open-sourced yet. Of course, we are not entirely objective about it, but we think it offers an easy way to define new features and also supports event augmentation methods out of the box. It is the only library that does this at the moment. Currently, we have 2 such strategies (both sketched below):

  1. event bootstrapping – it works by sampling the events with replacement; this adds robustness to the model, just as it does for Random Forests
  2. event injection – it adds an artificial event to a customer’s history to create “what if” scenarios that can help you decide which tools are good for your business

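Here is a plain-Scala sketch of those two ideas. It illustrates the concepts only; it is not EventAI’s actual implementation and the names are made up.

import scala.util.Random

case class Event(customerId: String, timestamp: Long, eventType: String)

// Event bootstrapping: resample a customer's history with replacement,
// producing several perturbed copies of the event list.
def bootstrap(events: Vector[Event], rounds: Int, rng: Random): Seq[Vector[Event]] =
  if (events.isEmpty) Seq.empty
  else (1 to rounds).map { _ =>
    Vector.fill(events.size)(events(rng.nextInt(events.size))).sortBy(_.timestamp)
  }

// Event injection: append an artificial event (for example the "+1" call) so the
// same feature pipeline can score the "what if" scenario.
def injectCall(customerId: String, events: Vector[Event], at: Long): Vector[Event] =
  (events :+ Event(customerId, at, "call")).sortBy(_.timestamp)
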
Here is an example of how to define features for a predictive maintenance task:

val features: List[Feature[_]] = List(
  EntityIdExtractor(),
  TimestampExtractor(),
  // Number of telemetry events in the last 3 hours.
  FeatureDef("event_count_3h", "last 3 hours", BoolExpr("eventType == 'telemetry'").count),
  // Label: the first failure observed within the next day.
  FeatureDef("next_failure",
             "next 1 days",
             Filter("eventType == 'failure'") andThen StrExpr("failure").first,
             columnType = ColumnType.Label),
  // Machine age taken from the latest 'machine' event.
  FeatureDef("last_age",
             "alltime",
             Filter("eventType == 'machine'") andThen NumExpr("age").last),
  // Days since the last maintenance task, computed per component.
  FeatureDef("days_since_maint",
             "alltime",
             Filter("eventType == 'maint'") andThen GroupBy("comp") andThen DaysSinceLast.apply)
) ++ List("volt", "rotate", "pressure", "vibration").flatMap(
  field =>
    List(3, 24).flatMap(
      hour =>
        // Mean and standard deviation of each sensor over 3-hour and 24-hour windows.
        List(
          FeatureDef(s"${field}_${hour}h_avg",
                     s"last $hour hours",
                     Filter("eventType == 'telemetry'") andThen NumExpr(field).avg),
          FeatureDef(s"${field}_${hour}h_stddev",
                     s"last $hour hours",
                     Filter("eventType == 'telemetry'") andThen NumExpr(field).stdDev)
        )
    )
)

This Python library, featuretools, is probably the best known. It offers an elegant solution if you work with relational tables. The main selling point is Deep Feature Synthesis (DFS), which can generate features automatically. Setting it up takes a little bit of time at the beginning.

This Airbnb framework offers a nice declarative language in which you can define simple aggregations (Airbnb argues that simple is better than complex). Together with some additional key-value stores, they claim to achieve O(1) complexity even when aggregating over long periods of events. The library is not yet open source. Judging by the presentation content (the authors refer to category theory a lot), I can only guess that the library is written in Scala.
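The O(1) claim is easiest to understand as incremental aggregation: instead of re-scanning a long history at prediction time, you keep running aggregates per key in a store and update them as events arrive. Here is a generic sketch of that idea (my own illustration, not Airbnb’s actual implementation):

import scala.collection.mutable

// Running aggregates that can be updated in constant time per event.
final case class RunningAgg(count: Long = 0L, sum: Double = 0.0) {
  def add(value: Double): RunningAgg = RunningAgg(count + 1, sum + value)
  def avg: Double = if (count == 0) 0.0 else sum / count
}

class AggStore {
  private val store = mutable.Map.empty[String, RunningAgg]

  // O(1) per incoming event, regardless of how long the history already is.
  def update(key: String, value: Double): Unit =
    store.update(key, store.getOrElse(key, RunningAgg()).add(value))

  // O(1) read at prediction time.
  def read(key: String): RunningAgg = store.getOrElse(key, RunningAgg())
}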

This cool-named project from Salesforce, TransmogrifAI, handles not only event data but the entire machine learning pipeline. The most impressive thing about this library is the advanced feature-type hierarchy that goes beyond the typical numerical/text/categorical separation. For categorical data, they have plenty of transformation options: addresses, zip codes, URLs, etc. Also worth noting is conditional aggregation, which is similar to event injection in EventAI (here is more explanation).

Not to be confused with Flink, the streaming framework for big data systems: this library, Flint from the Two Sigma team, is probably the most specific one. Its goal is to help with high-frequency trading, and it is strictly a time-series library; it doesn’t handle other types of information. The Two Sigma team did a great job implementing a custom RDD collection prepared to handle this type of data.

The library itself is not documented very well – at least the Scala part. The Python wrapper has a lot more usage examples. It is also really hard to find examples of other people using the library.

This is a plugin for Dataiku, and it is the best option if you are already using that platform. Beyond that, it has a unique functionality that translates feature definitions into SQL queries, which means it can be easily integrated into existing workflows.

If you need an open-source solution, the final answer would be: use featuretools in Python and TransmogrifAI in Scala. Both libraries have a similar maturity. TransmogrifAI has slightly worse documentation, but it is nonetheless quite powerful.

Summary

I hope this blog post has affected how you think about similar problems. The most important thing is to understand that the world is not static: what you see and how the clients behave is just the sum of their previous experiences. The idea is quite simple but powerful, and it changes the methodology behind several types of tasks.
