AutoML: The Future of Machine Learning
By: Ankush Gupta
This article from our research challenge appeared in insideBIGDATA.
With the advancement of science and technology, automation has become pervasive in every field. Enterprises are now using machines instead of people for decision-making, thanks to the models created by data scientists. This inevitably raises the question of whether the tasks performed by a data scientist can themselves be automated. As a result, automated machine learning is becoming a hot topic of discussion in the data science world.
Before diving deep into AutoML, let’s understand what AI and ML are. AI is a technology that enables a machine to simulate human behaviour, and ML is a subset of AI that allows machines to learn automatically from past data without being explicitly programmed. AI aims to develop smart computer systems that, like humans, can solve complex problems.
In today’s world, machine learning is one of the most popular technologies and is used in practically every field imaginable. But what about people who are not very familiar with ML? That’s where AutoML, or automated machine learning, comes in!
The goal of the article is to address the following questions: (i) what ML functionalities are provided by AutoML tools; (ii) what insights and conclusions can be drawn from the study of research papers in the field of AutoML; (iii) what kinds of AutoML solution providers are available at present; and (iv) how data scientists and AutoML are going to have a future together.
Let’s use a small chat between two data scientists to understand the importance of AutoML in the field of machine learning.
Use cases for AutoML
Companies automate their machine learning processes for a variety of purposes. In most of these use cases, companies have already implemented ML and want to improve their performance. Mostly, companies want to have automated insights for better data-driven decisions and predictions. The typical automated processes observed from the case studies are:
- Fraud Detection
- AML Detection
- Sales Management
- Marketing Management
Table 1 – Use Cases of AutoML
Interpreting Google Trends Analysis:
When analysing the Search Volume Index in Google Trends, the graph that appears does NOT represent the actual search volume numbers, but rather an index ranging from 0-100. The numbers represent the search interest relative to the highest point on the chart for the selected region and time. A value of 100 is the peak popularity of the term, whilst a value of 50 means that the term is half as popular. Scores of 0 mean that a sufficient amount of data was not available for the selected term.
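The relative index described above can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual method: the real index is also normalised by total search volume for the region and period, and the function name `to_trends_index` and the weekly counts are hypothetical.

```python
def to_trends_index(raw_counts):
    """Scale raw counts to a 0-100 index relative to the peak value."""
    peak = max(raw_counts)
    if peak == 0:
        # No searches at all: every point scores 0.
        return [0 for _ in raw_counts]
    return [round(100 * count / peak) for count in raw_counts]


# Hypothetical weekly search counts (illustrative numbers only).
weekly_counts = [120, 300, 600, 450, 0]
print(to_trends_index(weekly_counts))  # -> [20, 50, 100, 75, 0]
```

The week with 600 searches becomes the peak (100), and the week with 300 searches scores 50 because it is half as popular.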
Fig. 1 – Google Search Score of the keyword “AutoML” in last 5 years
Here, worldwide searches for the keyword ‘Automated Machine Learning’ over the last 5 years were analysed. The average score rose from 30 in 2017 to 55 in 2019, dipped slightly to 54 in 2020, and recovered to 57 in 2022. This suggests that not just the data science community but the wider world has started exploring this topic in the last 5 years.
Causes that are driving the need for AutoML
- Shortage of experienced technical experts
- Lengthy development process
- Huge expenditure in the current manual process
- Large amount of repetitive work
What role does automated machine learning play?
- AutoML enables companies to use ML solutions without having to invest extra money and time in finding all the professionals required for the end-to-end process, offering a greater return on investment.
- AutoML helps to bridge the gap between data scientists and ML problems.
- AutoML increases productivity and democratises ML tools.
- AutoML helps enterprise users swiftly adopt ML tools or solutions by automating most of the modelling process required to construct and deploy ML models, allowing the company’s data scientists to focus on more complex issues.
Benefits of using AutoML:
- Helps save time: A typical data science problem requires humans to run many models before deciding on a suitable algorithm for the given business problem. AutoML eliminates this manual labour and assists in passing data to the training algorithm and searching for the appropriate model. With AutoML, the results are available in a few minutes instead of hours.
- Reduced errors while using ML Algorithms: AutoML improves models by minimising the likelihood of inaccuracies caused by bias or human mistakes.
Which machine learning processes can be automated?
- Data pre-processing: This process includes improving data quality and converting unstructured, raw data to a structured format with methods like data cleaning, data integration, data transformation, and data reduction.
- Feature engineering: AutoML can automate the task of:
  - Feature Creation: Creating features involves creating new variables which will be most helpful for our model. This can mean adding some features or removing others.
  - Transformations: Feature transformation is simply a function that transforms features from one representation to another. The goal here is to plot and visualise the data. If something is not adding up with the new features, we can reduce the number of features used, speed up training, or increase the accuracy of a given model.
  - Feature Extraction: Feature extraction is the process of extracting features from a data set to identify useful information. Without distorting the original relationships or significant information, this compresses the data into manageable quantities for algorithms to process.
- Algorithm selection & hyperparameter optimization: AutoML tools choose the best algorithm for the given ML problem and the optimal hyperparameters without any human intervention.
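To make the last item concrete, here is a minimal sketch of algorithm selection combined with hyperparameter search, written in plain Python. It is not how any particular AutoML product works: real tools use much larger search spaces and smarter search strategies. The dataset, the two candidate algorithms and all function names are assumptions chosen for illustration.

```python
import math

# Tiny hypothetical dataset: (feature vector, class label).
# The last point is a class-1 sample sitting inside the class-0 cluster (noise).
DATA = [
    ((1.0, 1.2), 0), ((0.8, 0.9), 0), ((1.3, 0.7), 0), ((0.9, 1.5), 0),
    ((3.0, 3.1), 1), ((3.4, 2.8), 1), ((2.9, 3.6), 1), ((3.2, 3.3), 1),
    ((1.1, 1.0), 1),
]


def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def centroid_predict(train, point):
    """Predict the class whose mean feature vector is closest to `point`."""
    centroids = {}
    for label in {lbl for _, lbl in train}:
        rows = [x for x, lbl in train if lbl == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], point))


def loo_accuracy(predict, data):
    """Leave-one-out cross-validation accuracy for a predict function."""
    hits = 0
    for i, (point, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, point) == label
    return hits / len(data)


def search(data):
    """Score every (algorithm, hyperparameter) candidate and return the best."""
    candidates = {"nearest_centroid": centroid_predict}
    for k in (1, 3, 5):
        candidates[f"knn(k={k})"] = lambda tr, p, k=k: knn_predict(tr, p, k)
    scores = {name: loo_accuracy(fn, data) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


best_name, all_scores = search(DATA)
print("best candidate:", best_name)
print("all scores:", all_scores)
```

The point of the sketch is the loop in `search`: every candidate is scored with the same validation procedure and the winner is picked without a human comparing models by hand.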
Fig. 2 – Status of Automation in Data Science Workflow
Challenges of AutoML:
- Conformance to flexible specifications: The main challenge of using AutoML is that it does not conform to all of the user's flexible specifications. These solutions focus mostly on performance, while in the real world performance is only one aspect of an ML project. They rarely account for the storage and computing requirements of the business.
- The 80/20 rule: AutoML automates roughly 80% of data science work, while the remaining 20%, such as understanding the client’s needs and presenting the final model to the stakeholders, will still need human intervention.
- Explainability: Although one gets to see the reason codes and model blueprints of these AutoML solutions, they are sometimes too technical for people from non-data-science backgrounds to understand. As a result, humans are still needed to handle such scenarios.
All these challenges make us believe that, even with AutoML approaches available, we still need data scientists to handle the other complex problems of an automation project.
Study of research papers in the field of AutoML: Bibliometric Analysis
Bibliometric analysis is a scientific computer-assisted review methodology that can identify core research or authors, as well as their relationship, by covering all the publications related to a given topic or field.
Years considered for the research
439 documents related to automated machine learning published from 2001 to 2021 were considered for this analysis.
Publication Output and Growth Trend in the Field of AutoML Research Domain
The number of documents shows an increasing trend, which could be attributed to the growing need for data scientists and to AutoML tools and services becoming more popular and helping companies extract business insights from ML in an effective and scalable manner. In general, the number of publications has increased steadily over the last decade, starting with only 3 papers in 2012 and growing by nearly 98% in 2021 to 187, the highest yearly total. This shows that automated machine learning is a young but exploding field within data science.
Fig. 3 – Number of Publications in the field of AutoML (year-wise)
The Keywords Analysis of Research Hotspots on Automated Machine Learning
Fig. 4 – Co-occurrence analysis word cloud
To explore emerging, widely discussed and potential future topics, we conducted a co-occurrence analysis on keywords using VOSviewer. Keyword co-occurrence can effectively reflect research hotspots, providing auxiliary support for scientific research. Across all 439 automated machine learning publications, 3622 keywords were obtained in total.
Here, the bigger the node and word are, the larger the weight is, meaning that the keyword appears widely across the publications. The distance between two nodes reflects the strength of the relation between the two topics: a shorter distance generally reveals a stronger relation. As can be inferred from the diagram, automated machine learning is a dense keyword compared to the others because it is widely used by authors. Another conclusion that can be drawn from the plot is that AutoML and genetic programming are closely associated. This is because AutoML has been widely used in genetic programming. One example is the automated machine learning-genetic algorithm framework (AutoML-GA), which has been used to solve a variety of research problems such as rapid engine design optimization and computational fluid dynamics.
The larger distance between image analysis and AutoML indicates that they are not strongly connected. This could be because few research papers discuss the application of AutoML in image analysis. An exception is Google Cloud, whose Vision API classifies images into thousands of predefined categories and detects individual objects and faces within images.
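The counting step behind a keyword co-occurrence map can be sketched as follows. This only illustrates the raw counts that drive node size and link strength; it does not reproduce VOSviewer's normalisation or layout, and the keyword lists are hypothetical rather than taken from the 439 publications.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per publication (illustrative only).
papers = [
    ["automl", "genetic programming", "hyperparameter optimization"],
    ["automl", "genetic programming"],
    ["automl", "neural architecture search"],
    ["image analysis", "deep learning"],
]


def cooccurrence(keyword_lists):
    """Count keyword occurrences and how often each keyword pair shares a paper."""
    occurrences = Counter()
    pairs = Counter()
    for keywords in keyword_lists:
        unique = sorted(set(keywords))  # sort so each pair has one canonical order
        occurrences.update(unique)
        pairs.update(combinations(unique, 2))
    return occurrences, pairs


occurrences, pairs = cooccurrence(papers)
print(occurrences.most_common(1))  # keyword that would get the largest node
print(pairs.most_common(1))        # keyword pair with the strongest link
```

In this toy data, "automl" occurs most often, and it shares two papers with "genetic programming" but none with "image analysis", which mirrors the close and distant relations read off the map.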
Which geographies are the research hotspots of AutoML?
Fig. 5 – Geographic Heat Map
As we can infer from the plot shown above, the US and China are prominent research centres in the field of automated machine learning, since they have published a high number of documents.
Different shades of blue in the plot indicate different productivity rates: dark blue = high productivity; grey = no articles. The plot also correlates with the fact that most AutoML vendors have their headquarters in these countries.
Market size forecast
- The global AutoML market generated revenue of $270 million in 2019 and is expected to reach $15 billion by 2030.
- The global AutoML market is expected to advance at a CAGR of 44% during the forecast period (2020–2030).
- Over 65% of the AutoML market is expected to be in North America and Europe by 2030.
- Current adoption: 61% of data and analytics decision-makers whose firms are adopting AI said they had implemented AutoML software or are in the process of implementing it.
- Future adoption: 25% of data and analytics decision-makers whose firms are adopting AI said they are planning to implement AutoML software within the next year.
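As a quick consistency check on the forecast figures above, the compound annual growth rate implied by the two revenue numbers can be computed directly. The sketch assumes growth is measured from the 2019 figure over 11 years; the function name `cagr` is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in millions of US dollars.
revenue_2019 = 270
forecast_2030 = 15_000

# Assumption: growth is measured from the 2019 figure, i.e. over 11 years.
rate = cagr(revenue_2019, forecast_2030, 2030 - 2019)
print(f"Implied CAGR: {rate:.1%}")
```

Under that assumption the implied rate comes out at roughly 44%, in line with the quoted CAGR.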
AutoML Solution Providers:
- Open Source
- Tech Giants
AutoML Software Comparison:
We are focusing on AutoML Solutions namely:
- Google Cloud AutoML
- Microsoft Azure AutoML
Interpreting Google Searches:
Fig. 6 – Google Search Score of different AutoML tools
From Fig. 6, we can see that Dataiku and DataRobot have been trending on Google searches in the last 5 years, as their search scores have increased every year. More users are looking for them online because of their increased capabilities, as shown in Table 2 and Table 3.
This is a software comparison of the AutoML vendors. Here TPOT, MLjar and TransmogrifAI are the open-source AutoML solutions; DataRobot, Dataiku, H2O.ai and Darwin come from startups; and Google Cloud AutoML and Microsoft Azure AutoML come from tech giants.
The capabilities were grouped into broad categories, and each category was then analysed. The table below shows the colour indexing method. A good AutoML software should be able to train custom machine learning models with limited machine learning expertise, as per the business needs. It should offer simple, secure and flexible products with an easy-to-use graphical interface.
Table 2 – AutoML Solutions Capabilities
Table 3 – AutoML Solutions Capabilities and its sub categories
The subcategories of each broad category were also analysed to check whether a particular capability is offered by the vendor. From Table 2 and Table 3 it can be concluded that DataRobot supports the most capabilities, followed by Dataiku.
Data Scientist vs AutoML
AutoML tools have advantages over human data scientists in speed and risk reduction, but the human brain is superior to a machine in other ways. A data scientist brings a level of nuance, intuition and creative problem-solving to the process that AutoML simply cannot match.
Fig. 7 – Data Science Workflow Distribution with Automation
From the analysis it could be inferred that ~43% of a data scientist's work can be fully automated by machines, another 28.57% can be done by humans and machines in collaboration, and the remaining 28.57% will be done solely by humans.
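The percentages above are consistent with a workflow of seven steps split 3 / 2 / 2. That split is an inference from the numbers, not something the article states, so the step counts below are assumptions used only to show the arithmetic.

```python
from fractions import Fraction

# Assumption (not stated in the article): the workflow has seven steps,
# split 3 / 2 / 2 between machines, collaboration and humans.
steps = {"fully automated": 3, "human and machine": 2, "human only": 2}
total = sum(steps.values())

shares = {name: Fraction(count, total) for name, count in steps.items()}
for name, share in shares.items():
    print(f"{name}: {float(share):.2%}")
```

With these counts, 3/7 rounds to 42.86% (the "~43%" above) and 2/7 rounds to 28.57%, and the three shares sum to exactly 100%.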
This is also evident from recent job descriptions, in which companies list AutoML solutions as a preferred qualification for the role of Data Scientist (e.g. Growth Analytics, Polaris). As a result, online educational platforms like Udemy and Coursera have started offering AutoML courses, such as AutoML Bootcamp, Machine Learning on Google Cloud (Vertex AI and AI Platform), and Analyse Datasets and Train ML Models using AutoML, to help new data scientists develop this evolving skill and become a part of the revolution.
The “AutoML vs. Data Scientist” discussion is inherently flawed, and technology leaders are encouraged to dive into the real question: How can businesses fully leverage AutoML AND Data Scientists?
Successful data scientists will embrace AutoML tools the way the construction industry embraces panelization and pre-fabrication tools: as a mechanism to reduce their time spent on repetitive tasks and allow a machine to prepare the materials they need to conduct more-specialised work.