Custom Algorithms and Advanced Processes with Dataiku
Today no one can deny the power of data and the benefits that understanding and analyzing it brings. Among other things, using data effectively lets us predict future behaviour or reduce the impact of a specific event. For this reason, Artificial Intelligence is increasingly common across a range of business contexts. Although each business is different, the theory behind machine learning and deep learning techniques continues to advance, and this opens up a world of possibilities for adapting them to the needs of each client.
Dataiku’s Machine Learning tool
Dataiku is a project that began in 2013 and has been named a Leader in the Gartner Magic Quadrant for Data Science and Machine Learning Platforms for two consecutive years, including the 2021 edition. It offers some interesting facilities for applying Machine Learning (ML) and Deep Learning (DL) algorithms to our projects through the ML interface tool.
This tool includes ready-to-train algorithms from scikit-learn, TensorFlow and XGBoost that also allow a degree of customisation. You can modify the model parameters, the hyperparameters and the evaluation metric, and after training a model the tool automatically offers a partial dependence analysis and a subpopulation analysis.
The ML tool is a recommended way to apply the algorithms pre-installed on the platform quickly and easily, but it is often necessary to incorporate more complex algorithms, either ones that are not installed or ones that have been designed specifically for the needs of the project. This is where custom algorithms come in.
Custom algorithms
Dataiku has been growing and improving in a variety of ways and is now considered a valuable data science tool due to its friendly interface and ease of customisation. Among the many available options for personalizing the data flow, in this article, we will focus on the Machine Learning aspect.
There are two ways to implement custom algorithms in Dataiku projects: via notebooks (Jupyter notebooks and R Markdown) or by adding an algorithm script to the ML tool.
While using notebooks directly can be quicker, integrating algorithms within the Machine Learning tool lets us use the visual ML tool capabilities like feature preprocessing and model interpretability.
Focusing on the second option, algorithms can be imported from the project library, the global instance library, a library installed in a code environment or a plugin. The algorithm must be scikit-learn compatible, meaning that it must include a fit and a predict method (classifiers also need a classes_ attribute and, to return probabilities, a predict_proba method). This level of customisation is part of what gives Dataiku the ability to cover such a broad spectrum of use cases.
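As a rough illustration of the interface described above, the following sketch defines a toy nearest-centroid classifier that exposes fit, predict, predict_proba and a classes_ attribute. The class name, the temperature parameter and the nearest-centroid logic are assumptions made purely for this example; they are not part of Dataiku or scikit-learn. To stay dependency-free it works on plain Python sequences, whereas in a real project you would more likely subclass scikit-learn's BaseEstimator and ClassifierMixin and return NumPy arrays. The get_params and set_params methods are included because scikit-learn utilities that clone estimators rely on them.

```python
import math
from collections import defaultdict


class NearestCentroidClassifier:
    """Toy classifier with a scikit-learn style interface (illustrative only)."""

    def __init__(self, temperature=1.0):
        # Hypothetical hyperparameter: controls how sharp the probabilities are.
        self.temperature = temperature

    def get_params(self, deep=True):
        return {"temperature": self.temperature}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Accumulate per-class feature sums and counts to compute centroids.
        sums = {}
        counts = defaultdict(int)
        for row, label in zip(X, y):
            values = [float(v) for v in row]
            if label not in sums:
                sums[label] = [0.0] * len(values)
            sums[label] = [s + v for s, v in zip(sums[label], values)]
            counts[label] += 1
        self.classes_ = sorted(sums)
        self.centroids_ = [
            [s / counts[c] for s in sums[c]] for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        # Softmax over negative distances: closer centroids get higher probability.
        probabilities = []
        for row in X:
            distances = [math.dist(row, c) for c in self.centroids_]
            nearest = min(distances)
            scores = [
                math.exp(-(d - nearest) / self.temperature) for d in distances
            ]
            total = sum(scores)
            probabilities.append([s / total for s in scores])
        return probabilities

    def predict(self, X):
        # Pick the class with the highest probability for each row.
        return [
            self.classes_[max(range(len(p)), key=p.__getitem__)]
            for p in self.predict_proba(X)
        ]
```

A class like this would live in a library module, and the model code window would only need to import it and create an instance.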
Advanced processes
Another advantage of adding these algorithms via the Machine Learning tool is the possibility of using advanced methods to manage imbalanced data and calibrate the model’s probabilities.
Weighting Strategy:
In most cases, customer data is imbalanced, which biases the predictions of our model towards the most common classes. To avoid that, the ML tool has four built-in methods:
No weighting: Each row of the dataset is weighted equally, so we are assuming that the data is not imbalanced.
Class weights: Each row's weight is inversely proportional to the cardinality of its target class.
Sample weights: A column of the dataset containing positive values defines the row weights.
Class and sample weights: In this method, the product of class weight and sample weight is used as the row weight.
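The four strategies above can be sketched in a few lines of Python. This is only an illustration of the idea, not Dataiku's implementation: the function names are hypothetical, and the normalisation n_rows / (n_classes * class_count) is an assumption borrowed from scikit-learn's "balanced" heuristic. Any formula that is inversely proportional to the class cardinality would fit the description.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class inversely to its frequency (assumed normalisation)."""
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    return {c: n_rows / (n_classes * count) for c, count in counts.items()}


def row_weights(labels, sample_weights=None, use_class_weights=True):
    """Return one weight per row for the chosen weighting strategy."""
    if sample_weights is None:
        sample_weights = [1.0] * len(labels)
    if any(w <= 0 for w in sample_weights):
        raise ValueError("sample weights must be positive")
    per_class = class_weights(labels) if use_class_weights else {}
    # Row weight = class weight (or 1.0) times sample weight (or 1.0).
    return [per_class.get(y, 1.0) * w for y, w in zip(labels, sample_weights)]


labels = ["a", "a", "a", "b"]
column = [1.0, 1.0, 2.0, 0.5]  # hypothetical positive weight column

no_weighting = row_weights(labels, use_class_weights=False)
class_only = row_weights(labels)
sample_only = row_weights(labels, column, use_class_weights=False)
class_and_sample = row_weights(labels, column)
```

With these labels the minority class "b" receives a class weight three times larger than "a", which compensates for it appearing three times less often.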
Probability calibration:
In other cases, depending on the specific problem, we need probabilistic classifiers that return not only the most likely class label but also the probability of that class. In this scenario, probability calibration helps to obtain a more realistic likelihood of an event by adjusting the predicted probabilities to the actual class frequencies.
The two methods offered by Dataiku are isotonic regression and Platt scaling. Platt scaling is simpler and most effective when the distortion in the predicted probabilities is sigmoid-shaped. Isotonic regression is more complex and requires more data; it can correct any monotonic distortion, but it is more prone to overfitting. In the end, the correct choice of calibrator depends on the data.
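To make the difference between the two calibrators concrete, here is a minimal pure-Python sketch of both. It is not how Dataiku implements them: the function names are hypothetical, the Platt fit uses plain gradient descent on the log loss without the target smoothing of Platt's original method, and the isotonic fit is a basic pool-adjacent-violators pass that does not treat tied scores specially.

```python
import math


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def predict_platt(a, b, score):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def fit_isotonic(scores, labels):
    """Pool adjacent violators: returns (upper_score, value) steps, non-decreasing."""
    blocks = []  # each block holds [sum_of_labels, count, largest_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def predict_isotonic(model, score):
    # Return the value of the first step whose upper bound covers the score.
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


# Hypothetical uncalibrated scores and true binary labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
steps = fit_isotonic(scores, labels)
```

Platt scaling learns only two parameters, which is why it needs little data but can only correct a sigmoid-shaped distortion. The isotonic fit learns one value per step, so it is more flexible and, on a small sample like this one, more likely to overfit.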
Best practices
When dealing with machine learning models in Dataiku, there are a few guidelines that we recommend you keep in mind for optimal results.
It is highly recommended to create a specific code environment for each project, so that dependencies and library versions are managed per project.
It is also important, as a good preprocessing step, to understand the data before applying the model in order to choose the solution that best fits the problem. Custom Python code can be added for that process too.
It’s advisable to define Python classes and functions in the project’s Python library and instantiate them in the code windows.
It’s also advisable to define custom algorithms in the project’s Python library or the global Python library rather than in the model code windows.
Further information
While we’ve certainly touched on the highlights of using Dataiku to facilitate advanced machine learning and deep learning integration, there’s much more detail to discover which falls outside the scope of this article. If you have any further questions or would like more information on using Dataiku effectively in your organization, please contact us to schedule a convenient time.