Unlike the big tech giants, most companies are only beginning to dip their toes into data science. This article is a non-technical, conceptual introduction for business readers - domain experts, project managers and decision makers. It is meant to give a satellite view of the challenges and opportunities of data science within the business data landscape.
Key takeaways include:
Even an excellent Machine Learning algorithm will not succeed as long as the steps that precede it, namely data preparation and Feature Engineering, are insufficiently executed.
The demand for advanced business analytics has grown rapidly as companies can access more data than ever before. With the help of Machine Learning applications, data scientists can leverage the full potential of the data at hand and address current business challenges more efficiently. The resulting predictions - say, a robust estimate of next week’s production needs or a strong hint about whether a customer is going to churn - allow more flexible actions and a faster evaluation of business cases. This way, companies can increase their competitiveness while keeping the effort at a reasonable level. To many decision makers, deriving actionable predictions as if by magic sounds like a fairytale. It is all the more surprising, then, that an actual implementation is not as far off as it might seem. This article is meant to encourage you to learn more about the technical solution that turns this fairytale into reality.
Now, if a company has hired someone with a statistical background to handle the data, the right people are already in place. The crucial question that remains is whether they have the right tool at hand to harness the data's full potential. An excellent tool should allow an expert data scientist to do her job more efficiently and to make the most of the given data. At the same time, it should empower a user with less scientific or technical acumen to address the challenge effectively and to quickly reach a level that generates highly valuable insights. Yet in what way can such a tool actually benefit the work of data scientists and, hence, business decisions?
As the team behind getML, we are pleased to take you on a short journey deeper into this topic. To shed light on some common misconceptions, we’ll take a closer look at the entire cycle of a data science project.
- Placeholder data science process -
#Even an excellent Machine Learning algorithm will not succeed in making the best possible predictions as long as the previous steps are insufficiently executed.
Probably the most prominent buzzword to hit the business analytics market recently is “Machine Learning”. Although it is widely regarded as the technology of the 21st century, most people have only a rather vague understanding of it and tend to use it as an umbrella term for all kinds of data-related fields. In simple terms: within a data science project, Machine Learning models are used to forecast future events using historical data. Luckily, this is a very rigid process in which the parameters are mathematically well defined. Models yield robust results once high-quality input is provided, which makes it easy to automate most of the tasks involved. Consequently, numerous new ventures and tools have come onto the market in recent years. With the current state of the art, several aspects of Machine Learning modelling can be automated with the same accuracy as the manual work done by an expert data scientist. Since exploiting data has become a must-have for improving competitiveness, companies are willing to spend significant amounts of their budget on automating even rather small components of the modelling task. For the next few years, analysts forecast a continuing increase in spending on data science tools that support Machine Learning.
However, data science is more than Machine Learning modelling. It is a mistake to underestimate the importance of the data preparation involved in data science projects. The accuracy of a prediction depends heavily on the quality of the features crafted to feed the Machine Learning model in the first place. Even an excellent algorithm will not succeed in making the best possible predictions as long as the previous steps are insufficiently executed. In the worst case, a shortcoming in data preparation can jeopardize the prediction accuracy, and the resulting decisions will fail to meet the business challenge. The significance becomes even more apparent when we consider the allocation of resources in a data science project: while Machine Learning modelling accounts for only about 10% of the total time and effort, data preparation demands up to 80% of the dedicated resources. But why is that the case? To explain this disparity, let us delve deeper into the tasks involved.
In the very first step of each data science project, the data is obtained by querying all available databases. In most cases, real-world data does not come as one table but is scattered across many different data sets. A Machine Learning algorithm, however, can only process a single table at a time. Thus, the input data must first be aggregated into a single spreadsheet-like table. Since all existing tools share this limitation, data scientists have to combine the data by hand, which makes this one of the most tedious and error-prone tasks. One additional step is necessary before the actual Machine Learning takes place: through Feature Engineering, data scientists distil the most crucial information and correlations hidden in the raw data into more conclusive predictive signals - so-called features. This is an essential step to further increase the accuracy of the resulting predictions.
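For readers who like to see the idea in concrete terms, here is a minimal sketch of what "aggregating into a single table" means. It is purely illustrative: the three toy tables, the column names and the `flatten` helper are our own assumptions, not part of any particular tool.

```python
from collections import defaultdict

# Hypothetical toy data: three separate tables that share the key "customer_id".
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 120.0},
]
service_calls = [
    {"customer_id": 1, "topic": "billing"},
]


def flatten(customers, transactions, service_calls):
    """Return one row per customer by summarising the related tables."""
    total_spent = defaultdict(float)
    for t in transactions:
        total_spent[t["customer_id"]] += t["amount"]

    n_calls = defaultdict(int)
    for call in service_calls:
        n_calls[call["customer_id"]] += 1

    # Customers without any related rows fall back to 0.0 / 0.
    return [
        {
            **c,
            "total_spent": total_spent[c["customer_id"]],
            "n_service_calls": n_calls[c["customer_id"]],
        }
        for c in customers
    ]


flat_table = flatten(customers, transactions, service_calls)
for row in flat_table:
    print(row)
```

Every summary column in the flat table is a decision somebody has to make by hand. With dozens of tables and hundreds of candidate columns, this quickly becomes the tedious part described above.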
In contrast to Machine Learning modelling, Feature Engineering also involves non-mathematical aspects that require a profound understanding of the business context and the respective data at hand. Let’s return to the business challenge we considered at the outset to grasp the full complexity of this process. Given the task of predicting whether a customer is going to churn, data scientists are most likely working with data sets such as transaction details, product histories and customer service interactions. They have to turn this data into valuable features to feed the Machine Learning model. It is therefore crucial that a data scientist can distinguish between information that is relevant to the research question and information that is redundant. This assessment requires a precise notion of domain-specific correlations and underlying dynamics. However, the educational background of a data scientist does not necessarily include profound business knowledge, for example of customer behaviour. Consequently, it takes both experience and tedious cycles of trial and error to manually craft features of high quality.
// Paragraph: feature engineering is not well defined
// explain technical terms - glossary?
Sadly, Feature Engineering is all too often underestimated as a crucial aspect of data science. When that happens, projects fail due to two common mistakes. First, the business side takes it for granted that the data scientists completely grasp the domain-specific dynamics and all of their implicit assumptions. Second, the data scientists themselves disregard the Feature Engineering task, often due to a lack of skill and experience, and focus on the Machine Learning modelling only. Either way, rushing through or skipping this step is no real option, since the accuracy of the Machine Learning model's predictions depends heavily on the quality of the crafted features.
A few ventures on the market offer tools that attempt to automate the Feature Engineering task. These solutions are based solely on statistical approaches and do not rely on domain-specific knowledge. Simply put, they take the single spreadsheet that had to be manually prepared by the data scientist beforehand and apply a set of basic mathematical functions. An example: the sum of the category ‘transaction’ within a chosen time window could add up to a feature ‘purchases over the last 30 days’. However, the current solutions generate new features randomly, in a trial-and-error fashion. With this approach they are not capable of reaching the same performance level as hand-crafted features. Fully automating the Feature Engineering task requires a solution that can compete with the semantic understanding of all available data sets that the manual work of an expert data scientist implies. Otherwise, we have to face the fact that the resulting predictions of Machine Learning models will never reach their full potential.
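To make the ‘purchases over the last 30 days’ example tangible, here is a small sketch under our own assumptions: the transaction records, the field names and the `purchases_last_n_days` helper are hypothetical and only illustrate the kind of basic aggregation described above, not how any specific tool works.

```python
from datetime import date, timedelta

# Hypothetical transaction records (illustrative values only).
transactions = [
    {"customer_id": 1, "date": date(2020, 1, 5), "amount": 40.0},
    {"customer_id": 1, "date": date(2020, 2, 20), "amount": 25.0},
    {"customer_id": 1, "date": date(2020, 3, 1), "amount": 10.0},
    {"customer_id": 2, "date": date(2020, 2, 25), "amount": 99.0},
]


def purchases_last_n_days(transactions, customer_id, as_of, days=30):
    """Sum what a customer spent in the `days` days up to and including `as_of`."""
    window_start = as_of - timedelta(days=days)
    return sum(
        t["amount"]
        for t in transactions
        if t["customer_id"] == customer_id and window_start < t["date"] <= as_of
    )


# Feature value for customer 1 as of 10 March 2020: the January purchase
# lies outside the 30-day window, the other two lie inside it.
feature = purchases_last_n_days(transactions, customer_id=1, as_of=date(2020, 3, 10))
print(feature)
```

The function itself is trivial. The hard part is choosing which column, which aggregation and which time window carry a signal for the business question, and that is exactly where random trial and error falls short.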
Having worked in a team of expert data scientists, we dedicated a lot of time and passion to carefully crafting features that tap the full potential of Machine Learning predictions. Over time, we became aware that the everyday workload of manual feature extraction not only wore on our nerves and energy but was also a cause of frustration for us and our colleagues. We realised the time had come to leave the beaten track and approach the automation of Feature Engineering from a new perspective. Drawing on a profound understanding of the technical challenges involved, we built from scratch the tool we had always wanted.
With getML we developed a software tool that outperforms the results of an expert data scientist in a fraction of the time. How is that possible? Our solution unites two essential capabilities that existing tools on the market fail to provide. First, it automatically generates sophisticated features that allow for a higher prediction accuracy than current state-of-the-art solutions. Instead of following the random trial-and-error fashion of existing tools, getML uses an advanced, systematic approach. Second, it is capable of working with multiple data sets simultaneously and eliminates the tedious and error-prone task of manually aggregating all data into a single spreadsheet.
The latter is especially useful when working in an enterprise context. Here, data scientists typically depend on data from several sources, such as data sets from different departments across the company, records acquired from third parties or public data. In this case, the relevant information is scattered across multiple spreadsheets that hold a shared key (e.g., an ID for each customer, or timestamps). The benefit of this key is that it allows the data to be linked across all records. An example: the name and address of a customer within one spreadsheet will be related to her bank details kept in a different file and, furthermore, to the orders she placed that are listed in a third record. Most commonly, enterprise data is represented in such a relational model for this very practical reason - whether it is stored in online databases or locally in plain CSV files.
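The customer example can be sketched in a few lines. Everything here is a made-up illustration: the records, the key `customer_id` and the `customer_view` helper are assumptions chosen to show how a shared key ties separate files together.

```python
# Hypothetical records kept in three separate "files", linked by a customer ID.
customers = {101: {"name": "Alice Example", "address": "1 Main Street"}}
bank_details = {101: {"iban": "XX00 0000 0000"}}
orders = [
    {"order_id": "A-1", "customer_id": 101, "item": "laptop"},
    {"order_id": "A-2", "customer_id": 101, "item": "monitor"},
    {"order_id": "B-7", "customer_id": 202, "item": "desk"},
]


def customer_view(customer_id):
    """Collect everything known about one customer by following the shared key."""
    return {
        "customer": customers.get(customer_id),  # None if the ID is unknown
        "bank": bank_details.get(customer_id),
        "orders": [o for o in orders if o["customer_id"] == customer_id],
    }


view = customer_view(101)
print(view["customer"]["name"], "placed", len(view["orders"]), "orders")
```

Note that one customer can have many orders. It is this one-to-many structure that prevents relational data from fitting into a single spreadsheet without some form of aggregation.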
While the amount of data available to companies has continued to increase, existing tools on the market have been unable to deal with data from multiple sources. This mismatch created a bottleneck in data preparation within data science projects and caused the deployment rate of Machine Learning models to stagnate. As a result, even larger companies with existing data science resources need several weeks or months to test and deploy a single project. They are therefore far less efficient than they could be and lose potential returns on their investments.
Due to the immense potential of available data, a growing number of companies want to take advantage of Machine Learning and AI technologies to support their business and to improve operations and processes. At the same time, the high demand for data science talent makes it harder to find sufficiently experienced experts and leads to a growing talent gap. This, together with the time and cost involved in manual Feature Engineering, adds up to a substantial barrier. Consequently, many companies, especially SMEs, are unable to pursue data-driven optimization and to profit from the progress of technology.
We took on this challenge and developed a tool that goes beyond the state of the art and makes data science equally accessible to large companies and SMEs. With getML we take the automation of Feature Engineering to a new level. Our solution competes with the experience and acumen of expert data scientists and even outperforms the accuracy of Machine Learning models based on manual feature extraction. As a result, data science projects are less constrained by tedious and error-prone manual tasks, and dedicated resources can be used more efficiently. At the same time, it is the first tool to derive information directly from all available records. This ability makes it possible to quickly and easily leverage external data sources. The unique combination of efficiency, accuracy and usability makes getML the most ingenious solution to automate Feature Engineering and Machine Learning for data in the enterprise context.
What once took months to complete now takes only days, significantly accelerating time to value for Machine Learning predictions. When deployed, getML can enable large companies with existing data science facilities to more than double their productivity. The design of our solution allows easy integration into other systems and thus makes it possible to leverage existing investments. SMEs, on the other hand, can finally run their own data science projects even with rather limited resources. For the first time, we see a tool that empowers both large companies and SMEs to take advantage of Machine Learning at the highest level.
Of course, this is not the end of the journey. Instead, it is the beginning of a new chapter in which a growing number of users leverage the full potential of data science. In the future we strive to develop more exceptional tools to allow data scientists and decision makers alike to accomplish their jobs more efficiently.
Thank you for sticking with us to the end of our review of the data science value chain and the shortcomings of existing Machine Learning tools. If our enthusiasm has caught on, feel free to have a look at our project page or visit the ....... for more insights on how we are going to change data science within the business landscape.
Any questions left unanswered? We love to talk about the functionalities and possibilities of automated Feature Engineering and Machine Learning. Just leave your contact details below to get in touch with us.
The year is 2020. The world is entirely dominated by Deep Learning applications. Well, not entirely... This post highlights the blank spot on the map.
Automated feature engineering for relational business data? Sounds great, but you don't really know what relational data is? This post is for you!