Optimizing the Production of Valuable Metabolites using a Hybrid of Constraint-based Model and Machine Learning Algorithms: A Review

Abstract: The advances in genome sequencing and metabolic engineering have allowed the reengineering of the cellular function of an organism. Furthermore, the abundance of omics data has considerably increased data collection, thus shifting the perspective of molecular biology. Therefore, researchers have recently used artificial intelligence and machine learning tools to simulate and improve the reconstruction and analysis of metabolic models by identifying meaningful features from large multi-omics datasets. This review paper summarizes research on the hybrid of constraint-based models and machine learning algorithms in optimizing the production of valuable metabolites. Research articles on machine learning and constraint-based modeling published between 2020 and 2023 were collected, synthesized, and analyzed. The articles were obtained from the Web of Science and Scopus databases using the keywords “Machine learning”, “flux balance analysis”, and “metabolic engineering”. At the end of the search, this review contained 13 records. This review paper aims to present current trends and approaches in in silico metabolic engineering and to suggest research directions by highlighting the research gaps. In addition, we discuss the methodology for integrating machine learning and constraint-based modeling approaches.


I. INTRODUCTION
Microorganisms have been used in industrial sectors such as food processing, chemical manufacturing, pharmaceuticals, fermentation, and others. Advances in genome sequencing have resulted in several innovations that allow researchers to gain in-depth knowledge and information about an organism. One of these advancements is metabolic engineering, which reengineers the cellular function of an organism. In the 1990s, metabolic engineering was introduced to describe recombinant DNA technology for optimizing microbial activity [1]. Metabolic engineering aims to optimize the synthesis of desired metabolites by directing the metabolic fluxes toward the desired metabolites. The designs are categorized into two types: (1) targeting metabolic network components, such as gene/reaction knockout/knock-in, and (2) enhancing the metabolic network by altering it using network reconstruction tools or incorporating new non-native pathways into the host.
Over the previous few decades, there have been noticeable breakthroughs, such as incorporating adenosylcobinamide phosphate biosynthesis from Rhodobacter capsulatus into an E. coli strain, which improved vitamin B12 production to 307 µg/g [2]. In another case, yeast was engineered to improve the production of rubusoside and rebaudiosides, leading to 1368.6 mg/L and 132.7 mg/L, respectively [3]. Although metabolic pathway optimization technologies have shown promise, an incomplete understanding of the connection between target cell phenotype and genotype impedes their further development. As a result, conventional trial-and-error methodologies remain prevalent, and strain development remains tedious, costly, and time-consuming. Therefore, constraint-based modeling (CBM) approaches have been used to analyze organisms by providing significant phenotypic knowledge based on genotypic perturbations. CBM approaches, which include Flux Balance Analysis (FBA) and its variants (Minimization of Metabolic Adjustment, MoMA; Regulatory On/Off Minimization, ROOM; and Flux Variability Analysis, FVA), are used to reveal metabolic phenotypes by analyzing the optimality of an organism [4], [5]. However, a significant challenge in CBM is that the desired flux is not limited to a single solution, because biological network redundancy and the complexity of the genome-scale metabolic model (GSMM) permit alternate optimal solutions. Furthermore, due to the intricacy and interdependence of components in the metabolic network, selecting appropriate and optimal reactions/genes for knockout is difficult, laborious, and time-consuming [6]. Hence, previous research has combined CBM with metaheuristic optimization algorithms such as the genetic algorithm (GA), differential search algorithm (DSA), flower pollination algorithm (FA), and others [7], [8], [9], [10].
With the recent advancement of high-throughput technology and the overwhelming amount of omics data, data collection has increased considerably, thus shifting the perspective on molecular biology [11]. Although big data in biology enables data-driven science to comprehend complex biological systems and events, interpreting the data is still complicated. Therefore, machine learning (ML) has been applied to biological omics data for various applications such as prediction, classification, and discovery. Applying ML to such data shows great potential to reveal hidden and detailed information. It has proven successful in diabetes disease prediction, optical character recognition, face identification, and others [12], [13], [14], [15]. ML is a set of algorithms that improve prediction accuracy by learning and analyzing patterns from large experimental datasets. Recently, ML has been applied to increase the accuracy of the genotype-phenotype relationship by analyzing metabolic networks integrated with regulatory or signaling networks. Furthermore, ML requires fewer parameters than other statistical or computational approaches, making it useful for various tasks, including predicting the impact of genetic perturbations, reconstructing phylogenetic trees, and others [16], [17].
This paper aims to review how ML techniques are applied in metabolic engineering, specifically to optimize the production of desired metabolites. The paper is organized as follows: Section II introduces the definition of metabolic engineering. Section III provides a brief overview of constraint-based modeling. Section IV discusses machine learning in metabolic engineering. Applications of machine learning in metabolic engineering are then described in Section V. After that, results and discussion are provided in Section VI. Finally, the conclusion is given in Section VII.

II. METABOLIC ENGINEERING
Each component in a biological system plays a vital role in biological processes and interacts with the other components. Therefore, it is crucial to analyze the system as a whole. The organism's function can be divided into three major biochemical networks: gene regulatory, signal transduction, and metabolic networks. Gene regulatory networks involve a set of genes, proteins, and their regulatory mechanisms that determine the expression of a gene. Signal transduction networks communicate between and within cells by mediating, detecting, amplifying, and integrating various external and internal stimuli to govern and coordinate cellular activities. Meanwhile, the metabolic network is a series of biochemical reactions that transform and modify substrates into different products, with enzymes acting as catalysts. The metabolic network is essential in assessing a cell's biochemical and physiological properties. This research is mainly concerned with metabolic networks.
Advancements in genome sequencing have brought about many developments that give biological researchers more profound knowledge and information about an organism. One of these developments is the establishment of metabolic engineering (ME), which allows researchers to probe in detail the organization of an organism, including its reactions, pathways, metabolites, and genes, and to exploit the organism for strain optimization. Metabolic engineering aims to optimize the metabolism of organisms by exploiting and manipulating their metabolic capabilities through modeling and thus to generate economically and industrially viable organisms using optimization and predictive tools. To achieve this objective, current metabolic engineering approaches need to incorporate automated simulation techniques instead of relying solely on prior in vivo or in vitro investigations.
In order to exploit and manipulate the metabolic capabilities of an organism, the metabolic pathways within the cell need to be modeled. A model is a simplified representation of a system that allows the user to understand, predict, and control the system [18]. An organism can be modeled using a dynamic or a static approach. In ME, the metabolism of the target organism is represented as a mathematical model. Thus, the precise pathways or reactions in the network that need to be manipulated and optimized can be identified. Various computational modeling approaches and algorithms have been developed and applied to aid researchers [19]. Different approaches have been developed depending on the representation, as shown in Fig. 1.
The approaches in metabolic engineering can be divided into two: the dynamic approach and the static approach. They differ in how metabolism is represented: the dynamic approach uses a kinetic model, whereas the static approach uses a stoichiometric model, in which a stoichiometric matrix represents the metabolic network [20]. The two models contain different information and representations. The dynamic approach describes the changes in metabolite concentrations over time, while the static approach does not [21]. Table I defines the difference between the kinetic and the stoichiometric models.
In stoichiometric models, the biochemical reactions in the metabolic network are represented as a set of stoichiometric equations, whereby the contributions of the different metabolites in the metabolic network are denoted as stoichiometric coefficients in the stoichiometric matrix. Consequently, the intracellular metabolic fluxes can be determined at steady state using the mass balance constraints. However, stoichiometric models are often underdetermined and therefore admit many possible non-unique solutions. Thus, the models require additional constraints to narrow the range of possible phenotypic solutions. These constraints may include physicochemical, biological, mass conservation, and thermodynamic constraints. Stoichiometric models have been used to enumerate the fluxes in a metabolic network by employing an objective function. The main application of stoichiometric models is on metabolic networks, specifically in metabolic engineering strategies [7], [8], [22], [23].

III. CONSTRAINT-BASED MODELING
Constraint-based modeling (CBM) is an approach to investigating the optimality of an organism by predicting and describing its metabolic phenotypes [24]. In CBM, constraints are applied to the system, thus creating a feasible flux distribution space. The constraints can be categorized into physicochemical, topo-biological, environmental, and regulatory [25], [26]. These constraints can be expressed as equality or inequality constraints, as shown below, and have been reviewed by [26]. The accumulation of each metabolite in the metabolic network, resulting from its incoming and outgoing fluxes, is described by Eq. (1).
$$\frac{dX}{dt} = S \cdot v \qquad (1)$$
where S is the stoichiometric matrix of size m × n (m is the number of metabolites and n is the number of reactions), X is the m-dimensional concentration vector, and v is the n-dimensional flux vector. At steady state, each metabolite's production rate must equal its consumption rate. Therefore, the above equation is simplified to Eq. (2).
$$S \cdot v = 0 \qquad (2)$$
The imposition of constraints further reduces the number of allowable flux distributions. These constraints take the form of Eq. (3).
$$\alpha_i \le v_i \le \beta_i \qquad (3)$$
where i = 1, ..., n indexes the reactions, and α_i and β_i are the lower and upper limits for reaction i, respectively. The values of α_i and β_i are determined based on the reversibility or irreversibility of the reactions and on measured uptake rates. These constraints may exclude specific phenotypes from the solution space. Fig. 2 illustrates the differences between the unconstrained and constrained solution spaces of feasible steady-state flux distributions.
As shown in Fig. 2, the unconstrained steady-state solution space is underdetermined because the number of reactions typically exceeds the number of metabolites. The steady-state mass balance in Eq. (2) provides a hyperplane that defines the allowable flux distributions. Once the different constraints are considered, the solution space is limited to specific desired phenotypes. Therefore, CBM aims to describe and predict the desired phenotypes of an organism by describing its metabolic network using the stoichiometric framework and a series of constraints. Despite the imposition of constraints and the steady-state assumption, the result is not a single solution. Instead, the solutions generated are limited to the desired phenotypes.
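The steady-state and bound constraints described above can be checked for any candidate flux vector. The following is a minimal sketch, assuming a hypothetical three-reaction toy network (uptake of A, conversion of A to B, export of B) that is not taken from any of the reviewed studies; the function names are illustrative only.

```python
# Toy illustration of the CBM constraints: S·v = 0 and alpha_i <= v_i <= beta_i.
# The network, bounds, and function names are hypothetical examples.

def mass_balance(S, v):
    """Return S·v, the net accumulation rate of each metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_feasible(S, v, lower, upper, tol=1e-9):
    """True if v satisfies the steady-state and bound constraints."""
    balanced = all(abs(r) <= tol for r in mass_balance(S, v))
    bounded = all(lo - tol <= v_i <= hi + tol
                  for v_i, lo, hi in zip(v, lower, upper))
    return balanced and bounded


# Rows: metabolites A, B. Columns: R1 (-> A), R2 (A -> B), R3 (B ->).
S = [[1, -1, 0],
     [0, 1, -1]]
lower = [0.0, 0.0, 0.0]     # all reactions irreversible
upper = [10.0, 10.0, 10.0]  # e.g. a measured maximum uptake rate

print(is_feasible(S, [2.0, 2.0, 2.0], lower, upper))     # balanced, within bounds
print(is_feasible(S, [2.0, 1.0, 1.0], lower, upper))     # A accumulates
print(is_feasible(S, [12.0, 12.0, 12.0], lower, upper))  # exceeds upper bounds
```

Any flux vector that passes both checks lies inside the constrained solution space; the check says nothing about which of those vectors is optimal.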
In order to solve the underdetermined system, the determination of internal fluxes is formulated as an optimization problem [28]. Thus, an objective function is defined, as illustrated in Fig. 3. Generally, an objective function expresses a biological goal that the organism is assumed to achieve. Then, linear optimization is used to find the solution that optimizes the desired objective function. Examples of objective functions include minimizing ATP production, minimizing nutrient uptake, and maximizing growth rate. The most common objective function is the growth rate, since organisms are assumed to maximize their growth as a result of evolutionary pressure [29]. Referring to Eqs. (1) to (3), the objective function for maximizing the growth rate is mathematically represented by Eq. (4).
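In the notation of Eqs. (1) to (3), the growth-maximization problem is commonly written as the linear program below. The weight vector c is an assumed symbol here: it is taken to have a one at the position of the biomass reaction and zeros elsewhere, so that Z equals the growth flux.

```latex
\begin{aligned}
\max_{v} \quad & Z = c^{T} v \\
\text{subject to} \quad & S \cdot v = 0, \\
& \alpha_i \le v_i \le \beta_i, \qquad i = 1, \dots, n
\end{aligned}
```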
Generally, there are four CBM approaches: flux balance analysis (FBA), flux variability analysis (FVA), minimization of metabolic adjustment (MoMA), and regulatory on/off minimization (ROOM). Table II portrays the characteristics of the four CBM approaches and the applications that have been carried out.
As shown in Table II, FBA is a classical CBM method and has become one of the most common approaches researchers use [7], [8], [25], [30], [31]. Although FBA solutions are non-unique because regulatory and kinetic parameters are excluded, FBA handles large metabolic networks well compared to other approaches, for example when predicting steady-state values of biological objectives such as growth rate and production rate. Moreover, despite the incompleteness of metabolic network models, FBA can still determine the organism's steady-state fluxes.
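As an illustration of how FBA reduces to a linear program, the sketch below solves a hypothetical four-reaction toy network with `scipy.optimize.linprog`. The network, the bounds, and the helper name `fba` are assumptions made for this example and do not come from the reviewed studies; genome-scale models are normally handled with dedicated toolboxes.

```python
import numpy as np
from scipy.optimize import linprog


def fba(S, bounds, objective_index):
    """Maximize the flux of one reaction subject to S·v = 0 and flux bounds."""
    n_metabolites, n_reactions = S.shape
    c = np.zeros(n_reactions)
    c[objective_index] = -1.0  # linprog minimizes, so negate to maximize
    return linprog(c, A_eq=S, b_eq=np.zeros(n_metabolites),
                   bounds=bounds, method="highs")


# Rows: metabolites A, B.
# Columns: v1 uptake (-> A), v2 (A -> B), v3 by-product export (A ->),
#          v4 biomass drain (B ->).
S = np.array([[1.0, -1.0, -1.0, 0.0],
              [0.0, 1.0, 0.0, -1.0]])
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]  # uptake capped at 10

result = fba(S, bounds, objective_index=3)
growth = -result.fun
print("optimal growth flux:", growth)
print("flux distribution:", result.x)
```

In this toy case all of the substrate is routed to biomass, so the optimal growth flux equals the uptake limit.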
FVA employs linear programming to identify multiple biologically optimal solutions with the same objective value. These solutions are non-unique due to the metabolic network's ability to achieve the same objective value through different equivalent pathways, often represented by recessive phenotypes. Unlike FBA, which examines the distribution of flux within pathways, FVA focuses on determining the feasible ranges of minimum and maximum fluxes for each reaction. Meanwhile, MoMA employs quadratic programming to minimize the Euclidean distance in flux space between the wild-type and the mutant, while ROOM predicts the post-genetic perturbation steady state of metabolic networks. In contrast to MoMA, ROOM identifies flux distributions that yield high-rate solutions while minimizing flux deviations between wild-type and mutant and preserving the linearity of fluxes based on experimental measurements [10], [32]. Additionally, ROOM can discover shorter alternative pathways for rerouting fluxes after genetic perturbations, employing mixed integer linear programming (MILP) to meet the same constraints as FBA.
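The flux ranges computed by FVA can be sketched with two extra linear programs per reaction. The example below assumes a hypothetical toy network with two parallel routes from A to B, so that several flux distributions reach the same optimum; the helper name `fva` and the small numerical slack are choices made for this illustration only.

```python
import numpy as np
from scipy.optimize import linprog


def fva(S, bounds, objective_index, fraction=1.0):
    """Return the optimal objective and the (min, max) flux of every reaction
    when the objective is held at `fraction` of its optimum."""
    n_metabolites, n_reactions = S.shape
    b_eq = np.zeros(n_metabolites)
    c = np.zeros(n_reactions)
    c[objective_index] = -1.0
    best = linprog(c, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
    z_opt = -best.fun

    # Keep the objective near its optimum: -v_obj <= -fraction * z_opt.
    # The 1e-9 slack only guards against numerical infeasibility.
    A_ub = c.reshape(1, -1)
    b_ub = np.array([-fraction * z_opt + 1e-9])

    ranges = []
    for i in range(n_reactions):
        e = np.zeros(n_reactions)
        e[i] = 1.0
        lo = linprog(e, A_ub=A_ub, b_ub=b_ub, A_eq=S, b_eq=b_eq,
                     bounds=bounds, method="highs")
        hi = linprog(-e, A_ub=A_ub, b_ub=b_ub, A_eq=S, b_eq=b_eq,
                     bounds=bounds, method="highs")
        ranges.append((lo.fun, -hi.fun))
    return z_opt, ranges


# Rows: metabolites A, B.
# Columns: v1 uptake (-> A), v2 (A -> B), v3 alternative route (A -> B),
#          v4 biomass drain (B ->).
S = np.array([[1.0, -1.0, -1.0, 0.0],
              [0.0, 1.0, 1.0, -1.0]])
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]

z_opt, ranges = fva(S, bounds, objective_index=3)
for name, (v_min, v_max) in zip(["v1", "v2", "v3", "v4"], ranges):
    print(name, round(v_min, 6), round(v_max, 6))
```

The two parallel routes each span the full range between zero and the uptake limit, which is the alternate-optimum behavior described above, whereas uptake and biomass are pinned to a single value.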

IV. MACHINE LEARNING IN METABOLIC ENGINEERING
In silico metabolic engineering comprises computer simulations that predict and analyze an organism's metabolic network to improve the organism's cellular activities [8]. The improvement involves manipulating metabolic, signal, or regulatory networks. One approach to investigating the effects of genetic changes on metabolite synthesis is in silico reaction knockout modeling. The organism's behavior can be predicted through constraint-based modeling (CBM) methods by analyzing the effects of phenotypic and genotypic perturbations on the organism.
High-throughput technologies such as gene sequencing, protein purification/quantification, mass spectrometry, and others have enabled a new era of biological information in which the amount of biological data has significantly expanded over time. The various omics biological datasets, ranging from genomic to metabolomic and fluxomic, can provide direct insight into an organism's phenotype. An alternative approach is therefore needed to analyze and process large amounts of information quickly. Given its success in pattern recognition, model prediction, and other tasks [36], [37], [38], [39], [40], machine learning (ML) has been increasingly used in metabolic engineering to replace human metabolic engineers [33], [34], [35].
Machine learning (ML) is used to draw inferences and improve predictions from data without a predefined set of rules. ML has been used extensively in data analysis and typically allows applications to behave intelligently by understanding patterns in big data [1]. ML can be divided into two types according to whether the data are labeled or unlabeled (Fig. 4). For labeled data, supervised learning algorithms learn from labeled training data to help predict the outcomes of unlabeled data. For unlabeled data, unsupervised learning seeks patterns and clusters in the dataset. Examples of supervised learning algorithms include decision trees [41], support vector machines [42], and regression [43], whereas Principal Component Analysis (PCA) [44], [45] and K-means clustering [46] are unsupervised learning algorithms. Another ML type is reinforcement learning, in which the algorithm interacts with an environment and learns to maximize the desired goal using experience, data, and trial-and-error interactions. Reinforcement learning does not need labeled input/output but focuses on balancing exploration and exploitation.
ML has recently played a significant role in biological research [16], [39]. These algorithms focus on model performance by training on highly heterogeneous data. It is undoubtedly an opportunity to integrate ML algorithms with CBM models on various biological datasets such as gene expression, metabolites, phenotypes, and others [4], [47]. The application of ML in metabolic engineering provides several benefits. First, ML can be used in various in silico metabolic engineering stages, from analyzing the metabolic flux data to designing optimal metabolic pathways. Second, the full integration of omics data, including genomic, transcriptomic, proteomic, and metabolomic data, is crucial for predicting the metabolic pathway as it provides valuable insight into biological networks [48]. Furthermore, via gene expression analysis using ML, the key regulators of a metabolic pathway can be identified based on the genetic perturbations on cellular metabolism. Therefore, by merging machine learning with other computational tools in metabolic engineering, researchers may optimize cellular metabolism for enhanced production of biofuels, chemicals, and other essential molecules in a quick, cost-effective, and sustainable way.
As shown in Fig. 5, the reactions and metabolites from the GSMM are extracted and represented in a stoichiometric matrix. These datasets comprise instances (reactions and metabolites involved in the specific pathway). The coefficient in the matrix represents the knockout (coefficient one) and non-knockout reactions (coefficient zero) involved in that pathway. In this way, different combinations of knockout reactions are obtained. The training data are then used to train the chosen ML algorithms and predict the response of the test dataset. The responses include growth rate, production rate of desired metabolites, and different mutants with different combinations of knockout reactions.
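The binary knockout encoding described for Fig. 5 can be sketched as follows. This is a minimal illustration, assuming four hypothetical reaction names and at most two simultaneous knockouts; the responses (growth rate, production rate) would come from a CBM simulation and are deliberately left out rather than invented.

```python
from itertools import combinations


def knockout_matrix(reactions, max_knockouts):
    """Enumerate knockout combinations as binary rows (1 = knocked out)."""
    rows = []
    for k in range(1, max_knockouts + 1):
        for combo in combinations(range(len(reactions)), k):
            row = [0] * len(reactions)
            for j in combo:
                row[j] = 1
            rows.append(row)
    return rows


reactions = ["R1", "R2", "R3", "R4"]  # hypothetical reaction identifiers
X = knockout_matrix(reactions, max_knockouts=2)

# Simple hold-out split; each row of X would be paired with a simulated
# response before training an ML model.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

for row in X:
    knocked = [name for name, flag in zip(reactions, row) if flag]
    print(row, "knockouts:", knocked)
```

The number of rows grows combinatorially with the number of reactions and the knockout depth, which is one reason why exhaustive enumeration is replaced by search or learning on genome-scale models.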
According to [24], [49], ML and CBM can be merged in three ways. In the first approach, ML is applied after CBM has generated fluxomic data, in order to predict growth conditions, cellular productivity, nutrient consumption, gene essentiality, or biomass concentration. The second approach applies a multi-omics data simplification process before the CBM step. The fluxomic data resulting from CBM are then combined with the initial multi-omics data for prediction using a specific ML algorithm. The last approach uses ML on multi-omics data to obtain fluxomic data. This paper deduces that merging CBM with ML can occur in two ways, namely, ML as input to the CBM and CBM as input to the ML.
In the former case, machine learning methods can improve the accuracy and predictive power of metabolic models by predicting and refining them. The metabolic fluxes predicted from omics data using ML algorithms serve as input constraints for the metabolic model. Furthermore, ML can assist in identifying essential features (genes or reactions) for improving the production of a specific metabolite. Given the complexity of the metabolic model, identifying crucial genes or reactions while maintaining the viability of the cell is essential. Meanwhile, in the latter case, CBM can provide features, labels, and model selection to machine learning. Constraint-based approaches have been used to model the GSMM for simulating the phenotypic behavior after genotype perturbations. With the inclusion of machine learning methods, the selected features from CBM can be used to train ML models for predicting the pathway activity, thus optimizing the metabolic model. Fig. 6 illustrates the integration of machine learning and constraint-based modeling approaches.

V. APPLICATION OF MACHINE LEARNING IN METABOLIC ENGINEERING
An unprecedented amount of information is now being used to investigate biological mechanisms at the molecular level. The recent advancement of high-throughput technologies has significantly boosted data collection and fundamentally altered how people view molecular biology [50]. However, predicting bioproduction titers from microbial hosts has been challenging due to complicated interactions between regulatory, signaling, and metabolic networks [50]. There are several ways to carry out experiments concerning metabolic engineering. Machine learning, which has led to significant improvements in recent research and whose use is expected to grow further in the near future, is a critical tool for analyzing, understanding, and exploiting omics data.
A novel approach for predicting the yeast metabolome using machine learning based on quantitative proteomic data of kinase knockouts was presented by [51]. The results showed that the ML algorithm accurately predicts the metabolome under complex genetic modification. However, the study assumes that protein expression levels are proportional to changes in metabolic flux. When post-transcriptional or post-translational modifications occur, the protein expression levels may not be proportional to the changes in metabolic flux. Additionally, the dataset used is relatively small. Thus, expanding the dataset to include a broader range of genetic perturbations and experimental conditions could improve the generalizability of the ML models.
In another study, the integration of knowledge mining, genome-scale modeling, and ML for predicting the bioproduction of Yarrowia lipolytica was proposed [50]. The proposed framework integrates different data, including genomics, metabolomics, and literature, to construct a knowledge-based and optimal GSMM. Then, ML algorithms are applied to predict bioproduction yields based on gene expression data and environmental conditions. The framework outperformed traditional methods. However, the complexity of the GSMM and the lack of comprehensive knowledge may hinder accurate predictions. Thus, further development and validation are crucial to enhance its applicability and reliability.
Furthermore, [52] proposed a pipeline that uses multi-omics data to analyze and characterize key molecular pathways and features essential for yeast growth under different environmental conditions. The pipeline incorporates biological knowledge in the machine learning model to improve predictions. The proposed pipeline outperforms traditional ML methods and gives insight into the underlying biological mechanisms regulating cell growth. However, the pipeline has several limitations that need to be addressed. For instance, the pipeline relies on the quality and completeness of data sources, which may vary and be limited across different organisms.
A machine learning framework to assess the performance of microbial factories, that is, microorganisms that can produce various valuable compounds, was proposed by [1]. Like [50], [52], the researchers proposed the integration of different data, including genomics, transcriptomics, metabolomics, and fermentation data. This integration framework is used to model the relationship between genetic and environmental factors and the production of target compounds. The proposed framework uses feature selection, regression, and classification algorithms to predict yields, identify genetic targets for strain engineering, and optimize the conditions. Although the proposed framework demonstrated promising results, it relies on the availability of data sources. Furthermore, the complexity of metabolic networks and the lack of kinetic, transcriptional, or genomic data may affect the accuracy of prediction and strain engineering.
In addition, Tachibana and colleagues conducted a study on Green Fluorescent Protein (GFP) extracted from engineered Escherichia coli using a Deep Neural Network (DNN) [53]. To bring the GFP intensities into a reasonable range for analysis with the DNN technique, the intensities were scaled down by five orders of magnitude before being assessed by machine learning. All machine learning methods utilized data from the yeast extract for double-validation calculations. The remaining data were divided into learning and test datasets for random cross-validation. The DNNs were built using tanh activations and four hidden layers (200, 100, 50, and 20 units). The average Mean Squared Error (MSE), determined from the rearranged matrices for each variable, was used to measure representative importance in their study. Their research found that the DNN showed high coefficients of determination and low MSE values.
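The network shape described above (four tanh hidden layers of 200, 100, 50, and 20 units) can be sketched as a plain NumPy forward pass. Everything beyond those layer sizes and the five-orders-of-magnitude scaling is an assumption for illustration: the five input features, the random data, the single linear output unit, and the weight initialization are hypothetical, and the network is left untrained.

```python
import numpy as np

HIDDEN_SIZES = [200, 100, 50, 20]  # hidden layer widths stated in the text


def init_params(n_inputs, rng):
    """Random weights for inputs -> hidden layers -> one linear output."""
    sizes = [n_inputs] + HIDDEN_SIZES + [1]
    return [(rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out)),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]


def forward(x, params):
    """Apply tanh on every hidden layer and a linear map on the output."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


rng = np.random.default_rng(0)
X = rng.random((8, 5))              # hypothetical: 8 samples, 5 features
raw_intensity = rng.random(8) * 1e6  # hypothetical raw GFP readings
y = raw_intensity / 1e5              # scaled down by five orders of magnitude

params = init_params(n_inputs=5, rng=rng)
pred = forward(X, params)[:, 0]
print("untrained MSE:", mse(y, pred))
```

Scaling the target keeps it within roughly the same range as the tanh activations, which is a common reason for rescaling before training; fitting the weights would require an optimizer, which this sketch omits.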
Different ML algorithms, including random forest, support vector machine, and neural networks, were evaluated by [54] to assess their accuracy in predicting the phenotypic traits of three organisms: yeast, rice, and wheat. The study also investigates the impact of different feature selection methods and data preprocessing techniques on predictive performance. The authors found that combinations of ML algorithms and feature selection methods can achieve high accuracy in predicting phenotypic traits based on genetic data. In another domain, elastic net logistic regression has been proposed to determine the functional and structural brain alterations in female schizophrenia patients [55]. The study combines functional magnetic resonance imaging and diffusion tensor imaging to identify brain regions associated with the disease. The elastic net logistic regression selects relevant features and builds a predictive model. The study found that the model improves the accuracy of classifying the patients.
The frameworks and pipelines proposed by previous researchers demonstrate that machine learning can achieve high accuracy in predicting phenotypic traits based on genotypic perturbations. Moreover, multi-omics data integration has allowed ML algorithms to improve the accuracy of strain engineering in selecting the optimal genetic perturbations. However, some limitations and challenges need to be addressed. Firstly, transcriptomic and genetic data are available only for specific organisms. Thus, predicting and simulating genetic perturbations for less researched organisms is challenging. In addition, the complexity of metabolic networks, and hence of integrated networks, may hinder predictive capability and strain engineering. Therefore, further development and validation, including biological validation, are needed to enhance ML's interpretability, robustness, and applicability in predicting phenotype changes.
Shimizu and Toya (2021) evaluated cellular performance through 13C-metabolic flux analysis combined with artificial gene deletion [56]. It is essential to understand the physiology of the metabolism in practical bioprocesses to evaluate the efficiency of the desired model. They stated that the quantitative imaging of microbial cells for metabolic engineering is enabled by metabolic flux analysis (MFA). The nonlinear least squares approach is used to compute metabolic fluxes. A mathematical model that includes carbon atom transfers and molecular mass balancing is provided. Based on the solution space, it is possible to calculate the best trajectory for a given growth and output rate. During the growth phase, the specific growth rate is kept at its highest level and is then shifted to the critical value that produces the highest specific production rate.

A. Combination of Unsupervised and Supervised Techniques
Moreover, many articles have reviewed the recent advances in model-assisted metabolic engineering, which aims to design and optimize the metabolic pathways of organisms to improve the production of desired metabolites [39], [57], [58]. These review articles mainly discussed the use of ML to assist in predicting the effects of genetic perturbations for integrated multi-omics data. Previously, metaheuristic optimization algorithms, such as the Genetic Algorithm (GA), Differential Search Algorithm (DSA), flower pollination algorithm, Bee Algorithm, Particle Swarm Optimization (PSO), and others, have been used to improve strain design. The improved production of desired metabolites has demonstrated the success of metaheuristic algorithms. However, with multi-omics data integration, strain design becomes more challenging. Thus, ML approaches are needed to enhance the accuracy of model predictions.

B. Unsupervised Techniques
Unsupervised techniques identify patterns based on predetermined mathematical criteria (such as the number of clusters or variance independence). Large-scale biological datasets have been analyzed using both learning techniques, which have also been combined with FBA. For the unsupervised technique, Sahu et al. developed the "Split Lipids into Measurable Entities reactions" (SLIMEr) approach to model lipids in genome-scale metabolic models of yeast [59]. SLIMEr divides lipid components into acyl chain distributions and lipid classes using a mathematical framework and imposes constraints on both [59]. Subsequently, Sahu and his colleagues also established a framework to examine growth-related mechanisms of several S. cerevisiae strains by combining FBA with Multimodal Artificial Neural Networks [59]. The study aimed to combine mechanistic knowledge with data-driven ML techniques to overcome the "black-box" limitation of the latter when analyzing flux distributions. The framework was evaluated using 1,484 S. cerevisiae strains with single gene knockouts. Growth rates were designated as constraints in pFBA. The study shows that combining Multimodal Artificial Neural Networks with FBA allows models to be trained on the gene expression data of individual strains and used for prediction and for analyzing the flux distributions.

Jalili et al. (2021) identified cancer-specific metabolic signatures using Random Forest classification with PCA and FBA [60]. For each cancer model, flux distributions were computed using FBA. After that, FBA-based characteristics were generated using PCA and Random Forests. PCA captures the variation of flux distributions in the cancer models, which represents the response variables. Random Forests then employed these response variables to categorize crucial fluxes (which showed the impacted sub-cellular systems). Based on their study, the authors found that the pentose phosphate pathway, extracellular transport, mitochondrial transporters, fatty acid production, and other metabolic characteristics are the factors that distinguish normal from abnormal cell metabolism in the cancer models.
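For reference, the FBA and pFBA problems mentioned above can be written compactly. This is the generic textbook formulation rather than the exact setup of the cited studies. The symbols are notation introduced here: S is the stoichiometric matrix, v the flux vector, c the objective weights, and v^lb, v^ub the flux bounds. Fixing the growth flux to a measured rate mu_exp is one plausible reading of "growth rates were designated as constraints in pFBA" and is an assumption.

```latex
\begin{align}
\text{FBA:}\quad  & \max_{v}\; c^{\top} v
  && \text{s.t.}\quad S v = 0,\quad v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}} \\
\text{pFBA:}\quad & \min_{v}\; \sum_{j} \lvert v_j \rvert
  && \text{s.t.}\quad S v = 0,\quad v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}},\quad
     v_{\mathrm{growth}} = \mu_{\mathrm{exp}}
\end{align}
```

Both problems are linear programs once the absolute values in pFBA are handled by splitting each flux into non-negative forward and reverse parts.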
Meanwhile, unsupervised ML mainly creates clusterings or representations of the unlabeled dataset to reduce the dimensional complexity of the data. Principal Component Analysis (PCA) and K-means clustering are examples of unsupervised ML. In ME, unsupervised ML techniques can be implemented to identify the appropriate and inappropriate reactions involved in producing desired metabolites. Moreover, unsupervised clustering techniques have been used to distinguish different cell types, such as healthy and non-healthy, cancer and non-cancer markers, and stressed and non-stressed. Fig. 7 below illustrates the unsupervised methods in ME.
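To make the clustering idea concrete, the following minimal sketch runs plain K-means (Lloyd's algorithm) on a handful of two-reaction flux profiles. The flux values, the interpretation of the two groups as stressed and non-stressed samples, and the function name `kmeans` are all hypothetical and chosen only for illustration.

```python
from math import dist

def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical flux profiles (two reactions per sample), invented for illustration.
flux_profiles = [
    (1.0, 0.2), (1.1, 0.1), (0.9, 0.3),   # e.g. non-stressed samples
    (0.2, 1.0), (0.1, 1.2), (0.3, 0.9),   # e.g. stressed samples
]
labels, centroids = kmeans(flux_profiles, centroids=[flux_profiles[0], flux_profiles[-1]])
print(labels)
```

The initial centroids are fixed here to keep the run deterministic; practical implementations usually draw several random initializations and keep the best result.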
In another study, Barbosa and colleagues investigated the effects of production factors such as sugar, nitrogen level, and fermentation temperature on wine quality in non-Saccharomyces yeasts [61]. The Exploratory Data Analysis (EDA) activity was enhanced by employing unsupervised machine learning on the entire experimental data set. Latent variable techniques, such as Principal Component Analysis, were used to investigate the multivariate structure of the responses. Using agglomerative hierarchical clustering (AHC), 18 responses of natural groups were found. Subsequently, a forward stepwise variable selection method was used to determine the input variables for the regression model. The study successfully found direct patterns between different production factors, signifying positive and negative correlations.
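As a rough illustration of the latent-variable step, the sketch below extracts the first principal component of a small response matrix by power iteration and reports the share of variance it explains. The response values and the function name `first_principal_component` are hypothetical and are not taken from [61]; a real analysis would normally rely on a statistics library.

```python
from math import sqrt

def first_principal_component(rows, n_iter=200):
    """Return (loadings, explained_variance_ratio) of the first PC via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the variables (columns).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # starting vector; assumed not orthogonal to the first PC
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    explained = eigenvalue / sum(cov[i][i] for i in range(p))
    return v, explained

# Hypothetical measurements: rows are experiments, columns are three responses.
responses = [
    [1.0, 1.1, 5.0],
    [2.0, 2.1, 3.0],
    [3.0, 2.9, 4.0],
    [4.0, 4.2, 1.0],
    [5.0, 4.8, 2.0],
]
loadings, explained = first_principal_component(responses)
print([round(x, 2) for x in loadings], round(explained, 2))
```

In this toy data the first two responses load with the same sign and the third with the opposite sign, which is the kind of pattern a loading plot makes visible.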
They stated that the correlation distance was used to identify clusters or groups of functionally related fermentation metabolites [61]. Because highly correlated variables form the clusters, the first principal component of each cluster-specific PCA model was expected to explain most of the overall variability in that cluster. Upon completing PCA, supervised ML was also applied. They used a forward stepwise variable selection method to determine which input variables (experimental factors and their higher-order terms) should be included in the regression model. The stepwise selection technique involved picking and incorporating components one at a time. The method ends when no variable remains whose inclusion in or exclusion from the model would change the model's explanatory power by a statistically significant amount.
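The selection loop described above can be sketched as follows. This is a simplified, forward-only version: it adds one variable at a time to an ordinary least-squares model and stops when the best remaining candidate no longer reduces the residual sum of squares by a chosen fraction of the total. That threshold (`min_gain`) is a stand-in for the statistical significance test used in [61], and the data values, variable names, and function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def forward_stepwise(candidates, y, min_gain=0.05):
    """Greedy forward selection; min_gain replaces a formal significance test."""
    selected = []
    total = current = rss([], y)  # intercept-only model
    while True:
        remaining = [name for name in candidates if name not in selected]
        if not remaining:
            break
        scores = {name: rss([candidates[s] for s in selected + [name]], y)
                  for name in remaining}
        best = min(scores, key=scores.get)
        if (current - scores[best]) / total < min_gain:
            break  # no remaining variable improves the fit enough
        selected.append(best)
        current = scores[best]
    return selected

# Hypothetical experimental factors and response, invented for illustration.
factors = {
    "sugar":       [1, 2, 3, 4, 5, 6, 7, 8],
    "nitrogen":    [2, 1, 2, 1, 2, 1, 2, 1],
    "temperature": [3, 1, 4, 1, 5, 9, 2, 6],
}
response = [-1.1, 3.2, 1.9, 7.1, 4.8, 3.1, 12.2, 9.9]
selected = forward_stepwise(factors, response)
print(selected)
```

A full stepwise procedure would also test whether previously included variables should be dropped and would consider higher-order terms as additional candidate columns.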
For unsupervised ML, they found that clear patterns of linked variables can be seen in the loading plots, such as variables that cluster together or lie in opposite directions, signifying positive and negative correlations, respectively, as in Fig. 8. This exploratory PCA analysis supports the need to investigate the modular structure of the responses in more detail and to identify the natural building blocks of the linked responses. To uncover the natural blocks or clusters of variables, they implemented an agglomerative hierarchical variable-clustering approach (AHC). As a result, the variables are closer to each other, thus reducing the agglomeration distances [61]. For supervised ML, the considerable changes seen in the exploratory analysis section are confirmed by the modeling results utilizing main effects, second-order interactions, and quadratic terms, indicating the critical influence of the parameters on the fermentation process.
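A minimal sketch of variable clustering with a correlation distance is given below. It assumes the distance d = 1 - r (Pearson correlation) and average linkage, and it stops merging once the closest pair of clusters is farther apart than `max_distance`. These choices, the response names, and the values are assumptions made for illustration and may differ from the settings used in [61].

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def cluster_variables(data, max_distance=0.5):
    """Average-linkage agglomerative clustering of variables with d = 1 - r."""
    names = list(data)
    d = {(a, b): 1.0 - pearson(data[a], data[b])
         for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(d[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        dist_ij, i, j = min(pairs)
        if dist_ij > max_distance:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical fermentation responses measured in five experiments.
measurements = {
    "response_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "response_B": [1.1, 2.1, 2.9, 4.2, 4.8],
    "response_C": [5.0, 3.0, 4.0, 1.0, 2.0],
    "response_D": [4.8, 3.3, 4.1, 1.2, 1.9],
}
clusters = cluster_variables(measurements)
print(clusters)
```

With d = 1 - r, negatively correlated responses end up far apart; using 1 - |r| instead would group variables by strength of association regardless of sign.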

VI. RESULT AND DISCUSSION
The first activity of this review is collecting the references. We first searched the Scopus and Web of Science databases with the keywords "Machine learning", "Flux balance analysis", and "Metabolic engineering" to find relevant recent literature. Then, from the results obtained, we filtered for references related to the integration of ML and CBM.
After searching with the keywords "Machine learning", "Flux balance analysis", and "Metabolic engineering", 223 research studies were extracted through automatic search of the Scopus and Web of Science databases. Of these 223 studies, 32 were duplicates or review papers and were therefore eliminated from the list. The remaining 191 studies were examined based on their titles, abstracts, and keywords, and 90 studies were excluded. Finally, the remaining 101 studies were filtered to papers published from 2020 to 2023, leaving 13 studies.
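The screening counts above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the text; the number of studies removed by the publication-year filter is not stated and is derived here from the other figures.

```python
# Counts reported in the text for each screening step.
retrieved = 223
exclusions = [
    ("duplicates and review papers", 32),
    ("title, abstract, and keyword screening", 90),
]

remaining = retrieved
counts = [remaining]
for reason, removed in exclusions:
    remaining -= removed
    counts.append(remaining)
    print(f"after removing {removed} ({reason}): {remaining} studies")

included = 13
# Not stated in the text; implied by 101 candidates and 13 included studies.
excluded_by_year = remaining - included
print(f"excluded by the 2020-2023 publication filter: {excluded_by_year}")
```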
Then, the selected relevant references are synthesized. Table III presents the synthesis results of the 13 relevant studies. In the table, 17 machine learning approaches are integrated into constraint-based modeling, namely, binary classifier, random forest, PCA, SVM, KNN, decision tree, gradient tree boosting, DNN, CNN, t-SNE, ensemble learning, k-means, lasso, multiview neural network, regularized logistic regression, ANN, and reinforcement learning. Of those 13 studies, only two use the kinetic model, whereas most use the stoichiometric model. Since the stoichiometric model does not require intracellular experimental parameters, which are rarely known, stoichiometric models are more favorable for biologists to exploit the detailed capabilities of cell metabolism [70] and outperform kinetic models when the dataset involves large networks [71]. Although kinetic models provide detailed quantitative descriptions of the processes involved in the systems, thus revealing a system's actual dynamic biological behavior, they are limited to small-scale and newly curated metabolic networks [25].
Meanwhile, flux balance analysis (FBA) is the most widely used model assessment method because FBA uses linear programming, which is easier to apply than the quadratic programming used by MoMA and the mixed-integer linear programming used by ROOM. Moreover, the solutions provided by FBA are non-unique because it does not consider regulatory and signaling data, and the existing metabolic networks are still incomplete [23]. Regardless of these imperfections, FBA can determine the steady-state fluxes of organisms and predict the optimal long-term evolved state of the cells. In contrast, MoMA and ROOM predict the immediate initial outcome of genetic manipulations. However, cells will evolve from a minimized flux distribution state to an FBA solution [4]. In other words, genetic manipulations will first lead to the flux distribution predicted by MoMA and ROOM, eventually converging to the solution predicted by FBA.

Result: The proposed approach showed that wild-type FBA solutions contain enough information to predict essentiality without perturbations such as reaction or gene knockouts.
Disadvantage: There is generally no standard machine learning strategy for essentiality prediction.

S2
Dataset: RISK cohort data; gene expression data for all mucosal terminal ileal biopsies.
Result: A framework with the potential to identify pathways of clinical relevance in Crohn's disease and to discover novel diagnostic biomarkers and therapeutic targets.
Disadvantage: There are discrepancies in the generated metabolic models of Crohn's disease in both RISK-derived tissue and enteroids.

Dataset: GEMs of Chinese hamster ovary (CHO) cells.
Result: The proposed hybrid FBA, which involves mechanistic and non-parametric constraints, can efficiently reduce the solution space and improve the prediction results of FBA.
Disadvantage: Requires experimental flux datasets of guaranteed high accuracy.

S4
Dataset: 13C fluxomics.
Result: The proposed approach is reliable for fluxomics and readily applicable to high-throughput metabolic phenotyping.
Disadvantage: Computationally expensive, especially for large-scale metabolic networks.

S5
Dataset: Cancer patient-specific GEMs.
Result: tINIT and GIMME perform well, whereas FBA and pFBA perform poorly in cancer metabolism.
Disadvantage: Computationally expensive, especially for large GEMs.

Dataset: GEMs of Yarrowia lipolytica.
Result: The study succeeded in integrating knowledge mining, feature extraction, GEMs, and ML for predicting chemical titers in Yarrowia lipolytica.
Disadvantage: The model cannot capture biosynthesis bottlenecks; consequently, its predictability for low-performance strains is not optimal.

S7
Dataset: Transcriptomic and genomic datasets; GEMs from patient tumors generated from transcriptomic and genomic datasets.
Result: The proposed approach, namely integrating ML with the generated GEMs, can identify prognostic metabolite biomarkers and predict radiosensitivity for individual patients.
Disadvantage: Requires larger datasets of guaranteed high quality.

Dataset: GEMs of Escherichia coli and transcriptomics data.
Result: The proposed method, namely the combination of transcriptomics, GEMs, and machine learning, can improve the production rate of glycerol.
Disadvantage: Does not involve other parameters that influence metabolic processes, such as enzymes, transcriptional regulation, and signaling.

S9
Dataset: GEMs of Synechococcus sp. PCC 7002; transcriptomics.
Result: The proposed approach, namely using model-generated flux data, has potential for predicting growth rates.
Disadvantage: Depends on important information, such as specific metabolite uptake constraints and nutrient uptake rates, that is difficult to obtain directly.

Result: The proposed approach, namely metabolic allele classifiers (MACs), can predict antimicrobial resistance (AMR) phenotypes with accuracy on par with mechanism-agnostic ML.
Disadvantage: Not suitable for microbial genome-wide association studies.

S12
Dataset: Input data of 121 balances of four enzymes in the upper part of glycolysis.
Result: The ANN algorithm used to select enzyme concentrations for the upper part of glycolysis could select the optimum enzyme concentrations, improve flux by up to 63%, and decrease cost by up to 25%.

Dataset: GEMs of Escherichia coli, k-ecoli457 and Saccharomyces cerevisiae.
Result: The proposed method, namely MARL, could optimize L-tryptophan production in S. cerevisiae and the production of a specific metabolite in k-ecoli457. MARL could also be used to optimize metabolic gene expression levels.
Disadvantage: Its application is still restricted to particular target enzymes.
As for integrating machine learning with constraint-based models, most of the relevant studies employed the first strategy, in which biological insights from CBM are used as input to ML, given the intricacy of biological data and the fact that certain biological phenomena or systems cannot be comprehensively described and examined mechanistically. In the table, 10 studies used the first strategy to integrate ML with CBM. The tasks of ML in those studies are to identify, improve, estimate, and select. In identifying, ML has been applied to identify essential genes [31], biomarkers [65], [62], biological features [64], and key cross-omics features [66]. In improving, ML has been applied to improve the production of glycerol [39], the prediction of phenotypes [69], and the production of L-tryptophan [69]. In the remaining studies, ML was applied to estimate the functional effect of genetic-associated alleles and to select the optimized enzyme concentration for optimal yield. Nevertheless, some studies employ the second strategy, in which ML analyzes multi-omics data for CBM model reconstruction. In this strategy, ML has been useful for reducing, predicting, and reconstructing. In reducing, PCA has been implemented by integrating parametric and non-parametric constraints to reduce the search space and thereby improve the prediction of FBA [29]. Several machine learning approaches have been used in the predicting process to obtain the optimal flux ratio based on solvability and feature screening [63]. Meanwhile, for reconstructing, ensemble learning has been applied to reconstruct GSMMs of Yarrowia lipolytica in order to improve the production of organic acids [50], where the reconstruction of a GSMM involves multiple steps, including annotation, gap filling, and refinement.
Table IV provides the results, datasets used, and disadvantages of the relevant studies from Table III. Based on the synthesis and analysis of the relevant studies, ML has several potential contributions to in silico metabolic engineering. Integrating ML with traditional algorithms, such as flux balance analysis, can improve the production of the desired metabolites and even promises to guide strain optimization based on hybrid models, namely, mechanistic and data-driven models. Moreover, ML has positively influenced prediction results by incorporating experimental data such as fluxomic, transcriptomic, metabolomic, and proteomic data into the process of constraint-based modeling. It has also been shown that ML can construct GSMMs, predict essential genes, reduce the dimensionality of cross-omics features, and study the patterns of omics data. Based on these potentials, ML should be considered in metabolic engineering processes that use CBM.

VII. CONCLUSION
The advancements in biology, bioinformatics, and computational tools have led to the development of efficient software for modifying organisms for industrial use. Furthermore, the successful reconstruction of models of complex biological systems by integrating data from various molecular levels has yielded valuable insights into organisms, offering an accurate picture of cell activities during perturbations. However, this integration can complicate the identification of near-optimal reaction knockouts because of the complexity of biological networks. Therefore, machine learning (ML) and constraint-based modeling (CBM) are employed together to facilitate and enhance prediction accuracy. This review introduced different model structures for representing organisms' systems. Because traditional approaches are costly and irreversible, constraint-based methods have been introduced to optimize the production of valuable metabolites. Although these methods provide near-optimal solutions, integrating diverse omics data holds substantial promise for predicting the future state of computational biology systems. Over the coming decade, there will be a growing need for machine learning methods that can be effectively utilized and tailored for these large datasets. Therefore, machine learning methods were integrated into CBM methods to improve the reconstruction of GSMMs and the prediction accuracy of genetic perturbations.
We also reviewed several algorithms and applications, together with the different strategies and approaches they use in metabolic engineering. As mentioned before, the integration of ML and CBM can happen in two ways. The first way is to apply ML to the integrated biological networks, in which ML identifies the essential and meaningful features using classification techniques (supervised ML). This step minimizes the solution space and reconstructs a reduced integrated network for modeling in CBM. The second way is to analyze the simulation results from CBM (unsupervised ML).
In conclusion, ML is a powerful technique for identifying meaningful features and patterns, which can help reconstruct integrated biological networks that better represent the true nature of a cell. This in turn improves the ability to identify near-optimal reaction knockouts for optimizing the production rate of valuable metabolites and the growth rates of mutants for industrial purposes.

Fig. 5. Overview of the standard workflow of ML in ME.

Fig. 6. Integration of machine learning and constraint-based modeling: (a) ML as input to CBM; (b) CBM as input to ML.

TABLE I. DIFFERENCES BETWEEN THE KINETIC AND STOICHIOMETRIC MODELS

TABLE II. SUMMARY OF CONSTRAINT-BASED MODELING APPROACHES

TABLE III. SUMMARY OF SELECTED RELEVANT STUDIES
Note: S represents the stoichiometric model; K represents the kinetic model; 1 represents CBM as input to ML; 2 represents ML as input to CBM.

TABLE IV. SUMMARY OF MODEL, ADVANTAGE, AND DISADVANTAGES OF MACHINE LEARNING BASED ON THE RELEVANT STUDIES
