
1.1 Introduction

The purpose of Machine Learning algorithms is to learn automatically from data by employing general procedures. Machine Learning (ML) is today ubiquitous due to its success in many daily applications such as face recognition (Hassan and Abdulazeez 2021), speech (Malik et al. 2021) and speaker recognition (Hanifa et al. 2021), credit card fraud detection (Ashtiani and Raahemi 2021; Nayak et al. 2021), spam detection (Akinyelu 2021), and cloud security (Nassif et al. 2021). ML tailors our Google searches and the advertisements we receive (Kim et al. 2001) based on our past actions, along with many other interactions (Google cloud 2023). It even anticipates what we will type or what we will do next. And, of course, ML schemes also rank us, scientists (Beel and Gipp 2009).

The explosion of ML applications came with increased computer power and the ubiquitous presence of computers, cell phones, and other “smart” devices. These put ML in the spotlight and fostered its widespread use in many other areas in which it previously had little presence. The success in extremely useful areas such as speech and face recognition has contributed to this interest (Marr 2019). Today, ML may help you (through web services) to find a job, obtain a loan, find a partner, or obtain insurance, and it also assists, among others, in medical and legal services (Duarte 2018). Of course, ML raises many ethical issues, some of which are described, for example, in Stahl (2021). Nonetheless, the power and success of ML in many areas have had a very important impact on our society and, remarkably, on how many problems are addressed. It is no wonder that the number of ML papers published in almost all fields has sharply increased in the last 10 years, at a rate approximately following Moore’s law (Frank et al. 2020).

Machine Learning is considered a part of Artificial Intelligence (AI) (Michalski et al. 2013). In essence, ML algorithms are general procedures and codes that, with the information from datasets, can give predictions for a wide range of problems (see Fig. 1.1). The main difference from classical programs is that classical programs are developed for specific applications; in Computer Aided Engineering, which is the topic of this chapter, they solve specific differential equations in integral form, as in the development of finite elements. In contrast, ML procedures are designed for much more general use, being applied almost unchanged to apparently unconnected problems such as predicting the evolution of stocks, spam filtering, face recognition, typing prediction, pharmacologic design, or materials selection. ML methods also differ from Expert Systems, because the latter are based on fixed rules or fixed probability structures. ML methods excel when useful information needs to be obtained from massive amounts of data.

Fig. 1.1 Overall Machine Learning (ML) process and the contrast between efficiency and generality of the method. Hyperparameters are user-defined parameters which account for the type of problem, whereas parameters are optimized for best prediction. ML may be used for prediction and for classification. It is also often used as a tool for dimensionality reduction

Of course, generality usually comes with a trade-off in efficiency for a specific problem solution (Fig. 1.1), so the use of ML for the solution of simple problems, or of problems which can be solved by other more specific procedures, is typically inappropriate. Furthermore, ML is used when predictions are needed for problems which have not been, or cannot be, accurately formulated; that is, when the variables and mathematical equations governing the problem are not fully determined (although physics-informed approaches with ML are now also receiving much attention, Raissi et al. 2019). Nonetheless, ML codes and procedures are still mostly used as general “black boxes”, typically employing standard implementations available in free and open-source software repositories. A number of input variables are employed and some specific output is desired, which together comprise the input-to-output process learned and adjusted from known cases or from the structure of the input data. Some of these free codes are Scikit-learn (Pedregosa et al. 2011) (one of the best known), Microsoft Cognitive Toolkit (Xiong et al. 2018), TensorFlow (Dillon et al. 2017) (which is optimal for CUDA-enabled Graphics Processing Unit (GPU) parallel ML), Keras (Gulli and Pal 2017), OpenNN (Build powerful models 2022), and SystemML (Ghoting et al. 2011), just to name a few. Other proprietary software, used by big companies, includes AWS Machine Learning Services from Amazon (Hashemipour and Ali 2020), Cloud Machine Learning Engine from Google (Bisong 2019a), Matlab (Paluszek and Thomas 2016; Kim 2017), Mathematica (Brodie et al. 2020; Rodríguez and Kramer 2019), etc. Moreover, many software environments have libraries for ML and are often used in ML projects, like Python (NumPy, Bisong 2019b, Scikit-learn, Pedregosa et al. 2011, and Tensorly, Kossaifi et al. 2016; see the review in Stančin and Jović 2019), C++, e.g., Kaehler and Bradski (2016), Julia (a recent Just-In-Time (JIT) compiling language created with science and ML in mind, Gao et al. 2020; Innes 2018; Innes et al. 2019), and the R programming environment (Lantz 2019; Bischl et al. 2016; Molnar et al. 2018); see also Raschka and Mirjalili (2019), King (2009), Gao et al. (2020), Bischl et al. (2016). These software offerings also use many earlier published methods for standard computational tasks, such as mathematical libraries (e.g., for curve fitting, the solution of linear and nonlinear equations, or the determination of eigenvalues and eigenvectors or Singular Value Decompositions) and computational procedures for optimization (e.g., steepest descent algorithms). They also use earlier established statistical and regression algorithms, interpolation, clustering, domain slicing (e.g., tessellation algorithms), and function approximations.

ML derives from the conceptually fuzzy (uncertain, non-deterministic) learning approach of AI. AI is devoted to mimicking the way the human learning process works: the human brain, through the establishment of neurological connections based on observations, can perform predictions, albeit mostly only qualitative ones, of new events. The more experience (data) has been gathered, the better the predictions become, through reinforcement of experience and variability of observations. In addition, classification is another task typically performed by the human brain. We classify photos, people, experiences, and so on, according to some common features: we continuously search for features that allow us to group and separate things so that we can relate outcomes to such groups. Abundant data, data structuring, and data selection and simplification are crucial pieces of this type of “fuzzy” learning and, hence, of ML procedures.

Based on these observations, neural network concepts were developed early on by McCulloch and Pitts in 1943 and by Hebb in 1949 (Hebb 2005), who is credited with the well-known sentence “Cells that fire together, wire together”, meaning that the firing of one cell determines the actions of subsequent cells. While Hebb’s forward firing rule is unstable through successive epochs, it was the foundation for Artificial Neural Network (NN) theories. Probably due to the difficulties in implementation and the computational cost of using NNs, their widespread use was delayed until the 1990s. The introduction of improvements in backpropagation and optimization procedures, as well as improvements in data acquisition, information retrieval, and data mining, made the application of NNs to real problems possible. Today, NNs are very flexible and are the basis of many ML techniques and applications. However, this delay also facilitated the appearance and use of other ML-related methods such as expert systems and decision trees, and a myriad of pattern recognition and decision-making approaches.

Today, whenever a complex problem is encountered, especially if there is no sound theory or reliable formulation to solve it, ML is a valuable tool to try. In many cases, the result is successful, and indeed a good understanding of the behavior of the problem and of the variables involved may even be obtained. While the introduction of ML procedures into Computer Aided Engineering (CAE) took longer than in other areas, probably because for many problems the governing equations and effective computational procedures were already known, ML is now also being focused on complex and computationally intensive CAE solutions. In this chapter, we give an overview of some of the procedures and applications of Machine Learning employed in CAE.

1.2 Machine Learning Procedures Employed in CAE

As mentioned, ML is often considered to be a subset of AI (Michalski et al. 2013; Dhanalaxmi 2020; Karthikeyan et al. 2021), although often ML is also recognized as a separate field itself which only has some intersection with AI (Manavalan 2020; Langley 2011; Ongsulee 2017). Deep Learning (DL) is a subset of ML. Although the use of NNs is the most common approach to address CAE problems and ML problems in general, there are many other ML techniques that are being used. We review below the fundamental aspects of these techniques.

1.2.1 Machine Learning Aspects and Classification of Procedures

Our objective in this section is to focus on various fundamental procedures commonly used in ML schemes.

1.2.1.1 Classification, Identification, and Prediction

ML procedures are mainly employed for three tasks: classification, identification (both may broadly be considered classification), and prediction. An example of classification is the labeling of e-mails as spam or not spam (Gaurav et al. 2020; Crawford et al. 2015). Examples of identification are the identification of a type of behavior or material from some stress–strain history or from force signals in machining (Denkena et al. 2019; Penumuru et al. 2020; Bock et al. 2019), the identification of a nanostructure from optical microscopy (Lin et al. 2018), the identification of a person from a set of images (Ahmed et al. 2015; Ding et al. 2015; Sharma et al. 2020), and the identification of a sentence from some fuzzy input. Examples of prediction are the prediction of the behavior of a material under some deformation pattern (Ye et al. 2022; Ibragimova et al. 2021; Huang et al. 2020), the prediction of a sentence from some initial words (Bickel et al. 2005; Sordoni et al. 2015), and the prediction of the trajectory of salient flying objects (Wu et al. 2017; Fu et al. 2020). Of course, there are some AI procedures which may belong to more than one of these categories, such as the identification or prediction of governing equations in physics (Rai and Sahu 2020; Raissi and Karniadakis 2018). Clustering ML procedures are typically used for classification, whereas regression ML procedures are customarily used for prediction.

1.2.1.2 Expected and Unexpected Data Relations

Another relevant distinction is between ML approaches and Data Mining (DM). ML focuses on using known properties of data in classification or prediction, whereas DM focuses on the discovery of new, unknown properties or relations of data. However, ML, along with information systems, is often considered part of DM (Adriaans and Zantinge 1997). The overlap of DM and ML is seen in cases like the discovery of unknown relations or the search for optimum state variables which may, for example, appear in physical equations. Note that ML typically assumes that we know beforehand the existence of relations (e.g., which are the relevant variables and what type of output we expect), whereas the purpose of DM is to investigate the existence of perhaps unexpected relations from raw data.

1.2.1.3 Statistical and Optimization Approaches within ML

Many ML procedures use, or are derived from, statistics, and in particular probability theory (Murphy 2012; Bzdok et al. 2018). Similarly, ML relies heavily on optimization procedures (Le et al. 2011). The main conceptual difference between these theories and ML is the purpose of the developments. In the case of statistics, the purpose is to obtain inferences or characteristics of the population, such as the distribution and the mean (which of course could be used thereafter for predictions); see Fig. 1.2. In the case of ML, the purpose is to predict new outcomes, often without the need to statistically characterize populations, and to incorporate these outcomes in further predictions (Bzdok et al. 2018). ML approaches often base their predictions on models. ML optimizes parameters to obtain the best predictions as quantified by a cost function, and the values of these parameters are optimized also to account for the uncertainty in the data and in the predictions. ML approaches may use statistical distributions, but these are not an objective in themselves and their evaluation is often numerical (ML is interested in predictions). Also, while ML uses optimization procedures to obtain values of parameters, the objective is not to obtain the “optimum” solution that fits the data, but a parsimonious model giving reliable predictions (e.g., avoiding overfitting).

Fig. 1.2 Comparison of classical statistics with machine learning approaches

1.2.1.4 Supervised, Unsupervised, and Reinforced Learning

It is typical to classify ML procedures into supervised, unsupervised, semi-supervised, and reinforced learning (Raschka 2015; Burkov 2019, 2020).

In supervised learning, samples \(\{s_i\equiv \{\textbf{x}_i,y_i \}\}_{\{i=1,\ldots ,n\}}\in S\) with vectors of features \(\textbf{x}_i\) are labeled with a known result or label \(y_i\). The label may be a class, a number, a matrix, or another object. The purpose of the ML approach in this case is (typically) to create a model that relates those known outputs \(y_i\) to the dataset samples through some combination of the \(j=1,\ldots ,N\) features \(x_{j(i)}\equiv x_{ji}\) of each sample i. The \(x_{j(i)}\) are also referred to as data, variables, measurements, or characteristics, depending on the context or field of application. An example of an ML procedure could be to relate the seismic vulnerability of a building (label) to features like construction type, age, size, location, building materials, maintenance, etc. (Rosti et al. 2022; Zhang et al. 2019; Ruggieri et al. 2021). The purpose of ML here is to learn the vulnerability of buildings from the known vulnerabilities of other buildings. The labeling could have been obtained from experts or from past earthquakes. Supervised learning is based on sufficient known data, and we want to determine predictions in the nearby domain. In essence, we can say that “supervised learning is a high-dimensional interpolation problem” (Mallat 2016; Gin et al. 2021). We note that supervised learning may be improved with further data when available, since it is a dynamic learning procedure, mimicking the human brain.
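
As a brief illustration of supervised learning, the sketch below (with entirely hypothetical buildings, feature values, and labels, not taken from the cited studies) trains a scikit-learn classifier on a small labeled dataset and then predicts the vulnerability of an unseen building; the chosen features and the random forest model are arbitrary illustrative choices.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features per building: [age in years, number of stories, steel structure (1/0)]
X = np.array([[80, 4, 0], [10, 2, 1], [55, 6, 0], [5, 1, 1],
              [95, 3, 0], [20, 5, 1], [70, 8, 0], [15, 2, 1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # labels: 1 = vulnerable, 0 = safe (e.g., from experts)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)                          # training: model parameters adjusted to the labeled data

x_new = np.array([[60, 5, 0]])           # an unseen building
print(model.predict(x_new))              # predicted label
print(model.predict_proba(x_new))        # class probabilities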

In unsupervised learning the samples \(s_i\) are unlabeled \((s_i\equiv \{\textbf{x}_i \})\), so the purpose is to label the samples by learning similarities and common characteristics in the features of the samples; it is usually instance-based learning. Typical unsupervised ML approaches are employed in clustering (e.g., classifying the structures by type in our previous example), in dimensionality reduction (detecting which features are less relevant to the output label, for example because all or most samples have them, like doors in buildings), and in outlier detection (e.g., detecting abnormal traffic in the Internet, Salman et al. 2020, 2022; Salloum et al. 2020) for the case when very few samples have that feature. These approaches are similar to data mining.

Semi-supervised learning is conceptually a combination of the previous approaches, but with specific ML procedures. In essence, it is a supervised learning approach in which there are few labeled samples (output known) but many more unlabeled samples (output unknown), sometimes even with incomplete features, i.e. with some missing characteristics, which may be filled in by imputation techniques (Lakshminarayan et al. 1996; Ramoni and Sebastiani 2001; Liu et al. 2012; Rabin and Fishelov 2017). The point here is that by having many more samples, even without assigned labels, we can better determine the statistical distributions of the data and the possible significance of the features in the result, which is an improvement over using only the labeled data whose features have been used to determine the labels. For example, in our seismic vulnerability example, imagine that one feature is that the building has windows. Since almost all buildings have windows, it is unlikely that this feature is relevant in determining the vulnerability (it will give little Information Gain; see below). On the contrary, if \(20\%\) of the buildings have a steel structure, and if the correlation with the (lack of) vulnerability is positive, it is likely that the feature is important in determining the vulnerability.

There is also another type of ML, seldom used in CAE, which is reinforced learning (or reward-based learning). In this case, the computer develops and changes actions to learn a policy depending on the feedback, i.e. rewards which themselves modify the subsequent actions so as to maximize the expected reward. It shares some concepts with supervised learning, but the purpose is an action instead of a prediction. Hence, it is a typical ML approach in control dynamics (Buşoniu et al. 2018; Lewis and Liu 2013) with applications, for example, in the aeronautical industry (Choi and Cha 2019; Swischuk and Allaire 2019; He et al. 2021).

1.2.1.5 Data Cleaning, Ingestion, Augmentation, Curation, Data Evaluation, and Data Standardization

Data is the key to ML procedures, so datasets are usually large and obtained in different ways. The importance of data requires that it is presented to the ML method (and maintained, if applicable) in an optimal format. Reaching that goal requires many processes, which often also involve ML techniques. For example, in a dataset there may be data which are not in a logical range, or with missing entries, and hence they need to be cleaned. ML techniques may be used to determine outliers in datasets, or to assign values (data imputation) according to the other features and labels present in other samples of the dataset. Different dataset formats, such as qualitative entries like “good”, “fair”, or “bad”, and quantitative entries like “1–9”, may need to be converted (encoded) to standardized formats, also using ML algorithms (e.g., assigning “fair” to a numerical value according to samples in the dataset). This is called data ingestion. ML procedures may also need the data distributions to be determined, that is, the data evaluated to learn whether a feature follows a normal distribution or whether there is a consistent bias; the data may also need to be standardized according to min–max values or to the same normal distribution, for example to avoid numerical issues and to give proper weight to different features. In large dynamic databases, much effort is expended on the proper maintenance of the data so it remains useful, using many operations such as data cleaning, organization, and labeling. This is called data curation.
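
The following minimal sketch illustrates some of these ingestion steps with scikit-learn (imputation of missing entries, encoding of qualitative entries, and standardization); the entries and categories are made up for illustration.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Qualitative entries with a missing value; impute with the most frequent category
condition = np.array([["good"], ["fair"], [None], ["bad"], ["good"]], dtype=object)
condition = SimpleImputer(missing_values=None,
                          strategy="most_frequent").fit_transform(condition)

# Encode "bad"/"fair"/"good" as ordered numeric values 0/1/2
condition = OrdinalEncoder(categories=[["bad", "fair", "good"]]).fit_transform(condition)

# A quantitative feature with a missing entry, imputed with the mean of the other samples
height = np.array([[12.0], [8.0], [np.nan], [15.0], [10.0]])
height = SimpleImputer(strategy="mean").fit_transform(height)

# Standardize both features (zero mean, unit variance) before feeding them to an ML method
X = np.hstack([condition, height])
X_std = StandardScaler().fit_transform(X)
print(X_std)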

Another aspect of data treatment is the creation of a training set, a validation set, and a test set from a database (although often “test data” refers to both the validation and the test set, in particular when only one model is considered). The purpose of the training set is to train the ML algorithm: to create the “model”. The purpose of the validation set is to evaluate the models in a way independent from the training set, for example to see which hyperparameters are best suited, or even which ML method is best suited. Examples may be the number of neurons in a neural network or the smoothing hyperparameter in spline fitting; different smoothing parameters yield different models for the same training set, and the validation set helps to select the best values, obtaining the best predictions while avoiding overfitting. Recall that ML is not interested in the minimum error for the training set, but in a reliable predictive model. The test set is used to evaluate the performance of the final model selected from the overall learning process. An accurate prediction of the training set with a poor prediction of the test set is an indicator of overfitting: we have reached an unreliable model. A model with similar accuracy in the training and test sets is a good model. The training set should not be used for assessing the accuracy of the model because the parameters and their values have been selected based on these data and hence overfitting may not be detected. However, if more data is needed for training, there are techniques for data augmentation, typically performing variations, transformations, or combinations of other data (Shorten and Khoshgoftaar 2019). A typical example is to perform transformations of images (rotations, translations, changes in light, etc., Inoue 2018). Data augmentation should be used with care, because there is a risk that the algorithms correlate unexpected features with outputs: samples obtained by augmentation may have repetitive features because, in the end, they are correlated samples. These repetitive features may mislead the algorithms into identifying the feature as a key aspect to correlate with the output (Rice et al. 2020). An example is a random spot in an image that is being used for data augmentation. If the spot is present in many of the generated samples, it may be correlated to the output as an important feature.
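
A simple way to obtain such splits is sketched below with scikit-learn; the 60/20/20 proportions and the random seed are arbitrary choices.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)            # 100 synthetic samples with 3 features
y = np.random.randint(0, 2, 100)      # synthetic binary labels

# First reserve 20% as the final test set, then split the remainder into training and validation
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

# Training: 60%, validation: 20%, test: 20% of the data. Hyperparameters are selected on the
# validation set; the final selected model is assessed once on the test set.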

Fig. 1.3 Using B-splines to fit hyperelastic stress–strain data. Regression may be performed in nominal stress–stretch (\(P-\lambda \)) axes, or in true stress–strain (\(\sigma -E\)) axes; note that the result is different. While the usual test representation in hyperelasticity is in the (\(P-\lambda \)) axes, regression is preferred in \(\sigma -E\) because of the symmetry of tension and compression in logarithmic strains. B-spline fit of experimental data with a overfitting and b proper fit using regularization based on stability conditions. Modified from Latorre and Montáns (2020)

1.2.1.6 Overfitting, Regularization, and Cross-Validation

Overfitting and model complexity are important aspects in ML; see Fig. 1.3. Given that the data has errors and often some stochastic nature, a model which gives zero error on the training data is not necessarily a good model; indeed, this is usually a hint of the opposite: a manifestation of overfitting (Fig. 1.3a). The best models are the less complex (parsimonious) models that follow Occam’s razor. They are as simple as possible but still have great predictive power. Hence, the fewer parameters, the better. However, it is often difficult to simplify ML models to have few “smart” parameters, so model reduction and regularization techniques are often used as a “no-brainer” remedy for overfitting. Typical regularization (“smoothing”) techniques are the Least Absolute Shrinkage and Selection Operator, sparse or L1 regularization (LASSO) (Xu et al. 2008), and L2 regularization, called Ridge (Tikhonov 1963) or noise (Bishop 1995) regularization, or regression. The LASSO scheme “shrinks” the less important features (hence it is also used for feature selection), whereas the L2 scheme gives a more even weight to them. The combination of both is known as elastic net regularization (Zou and Hastie 2005).
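
The sketch below contrasts LASSO, Ridge, and elastic net regularization with scikit-learn on a synthetic linear problem in which only two of ten features are relevant; the regularization weight alpha is a hyperparameter chosen here arbitrarily.

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])      # only two relevant features
y = X @ w_true + 0.1 * rng.normal(size=50)

lasso = Lasso(alpha=0.1).fit(X, y)                    # shrinks irrelevant weights to (near) zero
ridge = Ridge(alpha=0.1).fit(X, y)                    # distributes weight more evenly
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # combination of both penalties

print(np.round(lasso.coef_, 2))
print(np.round(ridge.coef_, 2))
print(np.round(enet.coef_, 2))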

Fig. 1.4 Training and test sets: k-fold generation of training and validation sets from data. Number of data: 9, data for training and model selection: 6, data for final validation test (test set): 3, number of folds for model selection: 3, data in each fold: 2, number of models: 3 (k = 3). Sometimes, the validation set is also considered as the test set. The 10-fold cross-validation is a common choice

Model selection taking into account model fitness, and including a penalization for model complexity, is often performed by employing the Akaike Information Criterion (AIC). Given a collection of models arising from the available data, the AIC allows one to compare these models among themselves, helping to select the best fitted one. In essence, the AIC not only estimates the relative amount of information lost by each model but also takes into account its parsimony. In other words, it deals with the trade-off between overfitting and underfitting by computing

$$\begin{aligned} {\text {AIC}}=2p-2\ln {\mathfrak {L}} \end{aligned}$$
(1.1)

where p is the number of parameters of the model (complexity penalty) and \(\mathfrak {L}\) is the maximum of the likelihood function of the model, the joint probability of the observed data as a function of the p parameters of the model (see next section). Therefore, the chosen model should be the one with the minimum AIC. In essence, the AIC penalizes the number of parameters to choose the best model—and that is the model not only with as few parameters as possible but also with a large probability of reproducing the data using these parameters.
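
As a small illustration, the following sketch compares polynomial fits of different degree through Eq. (1.1); it assumes a least-squares fit with Gaussian errors, for which the maximized log-likelihood can be written in terms of the residual sum of squares, as done below.

import numpy as np

def aic_polyfit(x, y, degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n, p = len(y), degree + 1                            # p = number of model parameters
    rss = np.sum(residuals**2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1.0)   # maximized Gaussian log-likelihood
    return 2 * p - 2 * log_lik                           # Eq. (1.1)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=30)           # underlying linear law plus noise

for d in (1, 3, 8):
    print(d, aic_polyfit(x, y, d))   # the low-order fit is expected to give the lowest AIC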

Dividing the data into two sets, one for training and one for validation, very often produces overfitting, especially for small datasets. To avoid this overfitting, the method of k-fold cross-validation is frequently used. In this process, the data is divided into k datasets. \(k-1\) of them are used to train the model and the remaining one is used for validation. This process is repeated k times, employing each of the possible datasets for validation. The final result is given by the arithmetic mean of the k results (Fig. 1.4). Leave-One-Out Cross-Validation (LOOCV) is the special case in which the number of folds equals the number of samples, so each validation fold has only one element. While LOOCV is expensive in general (Meijer and Goeman 2013), for the linear case it is very efficient because all the errors are obtained simultaneously with a single fit through the so-called hat matrix (Angelov and Stoimenova 2017).
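
A minimal sketch of k-fold cross-validation and LOOCV with scikit-learn follows; the Ridge model, the number of folds, and the scoring metric are illustrative choices.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

model = Ridge(alpha=0.1)
scores_kfold = cross_val_score(model, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=0))
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
print(scores_kfold.mean())        # mean validation score over the k folds
print(-scores_loo.mean())         # LOOCV mean squared error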

1.2.2 Overview of Classical Machine Learning Procedures Used in CAE

The schemes we present in this section are basic ingredients of many ML algorithms.

1.2.2.1 Simple Regression Algorithms

The simplest ML approach is much older than the ML discipline: linear and nonlinear regression. In the former case, the purpose is to compute the weights \(\textbf{w}\) and the offset b of the linear model \(\tilde{y}\equiv f(\textbf{x})=\textbf{w}^T \textbf{x}+b\), where \(\textbf{x}\) is the vector of features. The parameters \(\textbf{w},b\) are obtained through the minimization of the cost function (MSE: Mean Squared Error)

$$\begin{aligned} C(\textbf{x}_i;\{\textbf{w},b\})=\frac{1}{n}\sum _{i=1}^n \mathcal {L}_i:=\frac{1}{n}\sum _{i=1}^n\left[ f(\textbf{x}_i;\{\textbf{w},b\})- y_i \right] ^2 \end{aligned}$$
(1.2)

with respect to them, which in this case is the average of the loss function \(\mathcal {L}_i=(\tilde{y}_i-y_i)^2\), where the \(y_i\) are the known values, \(\tilde{y}_i=f(\textbf{x}_i;\{\textbf{w},b\})\) are the predictions, and the subindex i refers to sample i, so \(\textbf{x}_i\) is the vector of features of that sample. Of course, in linear regression, the parameters are obtained simply by solving the linear system of equations resulting from the quadratic optimization problem. Other regression algorithms are similar, as for example spline, B-spline, or P-spline (penalized B-spline) regressions, used in nonlinear mechanics (Crespo et al. 2017; Latorre and Montáns 2017) or used to efficiently invert functions which do not have an analytical inverse (Benítez and Montáns 2018; Eubank 1999; Eilers and Marx 1996). In all these cases, smoothing techniques are fundamental to avoid overfitting.
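
As an illustration of Eq. (1.2), the following sketch solves the linear least-squares problem directly with NumPy on synthetic data; the true weights are known here only to verify the recovered values.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # features x_i
w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ w_true + b_true + 0.05 * rng.normal(size=100)

Xa = np.hstack([X, np.ones((100, 1))])              # append a column of ones for the offset b
params, *_ = np.linalg.lstsq(Xa, y, rcond=None)     # minimizes the mean squared error, Eq. (1.2)
w, b = params[:-1], params[-1]
print(w, b)                                         # close to w_true, b_true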

While it is natural to state the regression problem as a minimization of the cost function, it may also be formulated in terms of the likelihood function \(\mathfrak {L}\). Given some training data \((y_i,\textbf{x}_i)\) (with labels \(y_i\) for data \(\textbf{x}_i\)), we seek the parameters \(\textbf{w}\) (for simplicity we now include b in the set \(\textbf{w}\)) that minimize the cost function (e.g., the MSE); or, equivalently, we seek the set \(\textbf{w}\) which maximizes the likelihood \(\mathfrak {L}(\textbf{w}|(y,\textbf{x}))=p(y|\textbf{x};\textbf{w})\) that the parameters \(\textbf{w}\) give the probability representation for the training data, which is the same as the probability of finding the data \((y,\textbf{x})\) given the distribution characterized by \(\textbf{w}\). The likelihood is the “probability” with which a distribution (characterized by \(\textbf{w}\)) represents all the given data, whereas the probability is that of finding the data if the distribution is known. Assuming the data to be identically distributed and independent, such that \(p(y_{1},y_2,\ldots , y_n|\textbf{x}_1,\textbf{x}_2,\ldots , \textbf{x}_n;\textbf{w})=p(y_{1}|\textbf{x}_1;\textbf{w})p(y_2|\textbf{x}_2;\textbf{w})\ldots p(y_{n}|\textbf{x}_n;\textbf{w})\), the likelihood is

$$\begin{aligned} \mathfrak {L}(\textbf{w}|(y_i,\textbf{x}_i),i=1,\ldots ,n)=\prod _{i=1}^n p(y_{i}|\textbf{x}_i;\textbf{w}) \end{aligned}$$
(1.3)

or

$$\begin{aligned} \log \mathfrak {L}(\textbf{w}|(y_i,\textbf{x}_i),i=1,\ldots , n)=\sum _{i=1}^n\log p(y_{i}|\textbf{x}_i;\textbf{w}) \end{aligned}$$
(1.4)

Choosing the linear regression \(\tilde{y}=\textbf{w}^T\textbf{x}\) (including b and 1 respectively in \(\textbf{w}\) and \(\textbf{x}\)), and a normal distribution of the prediction, obtained by assuming a zero-centered normal distribution of the error

$$\begin{aligned} p(y|\textbf{x};\textbf{w},\sigma ^2)=\frac{1}{\sqrt{2\pi \sigma ^2}}\exp \left( -\frac{(\textbf{w}^T\textbf{x}-y)^2}{2\sigma ^2}\right) \end{aligned}$$
(1.5)

it is immediate to verify that the maximization of the log-likelihood in Eq. (1.4) is equivalent to minimizing the MSE in Eq. (1.2) (regardless of the value of the variance \(\sigma ^2\)).

A very typical regression used in ML is logistic regression (Kleinbaum et al. 2002; Hosmer Jr et al. 2013), for example to obtain pass/fail (1/0) predictions. In this case, a smooth predictor output \(y\in [0,1]\) can be interpreted as a probability \(p(y=1|\textbf{x}):=p(\textbf{x})\). The Bernoulli distribution (which gives p for \(y_{i}=1\) and \((1-p)\) for \(y_{i}=0\)), or equivalently

$$\begin{aligned} P(y=y_i)\equiv p^{y_i}(1-p)^{(1-y_i)} \end{aligned}$$
(1.6)

describes this case for a given \(\textbf{x}_i\), as is immediate to check. The linear regression is equated to the logit function, which converts the desired probabilistic [0, 1] range into the \((-\infty ,\infty )\) range of the linear model

$$\begin{aligned} \textrm{logit}(p(\textbf{x}))=\log \left( \frac{p(\textbf{x})}{1-p(\textbf{x})}\right) =\textbf{w}^T\textbf{x} \end{aligned}$$
(1.7)

The logit function is the logarithm of the ratio between the odds of \(y=1\) (which are p) and of \(y=0\) (which are \((1-p)\)). The probability \(p(\textbf{x})\) may be solved for from Eq. (1.7) as

$$\begin{aligned} p(\textbf{x})=\frac{1}{1+\exp {(-\textbf{w}^T\textbf{x})}} \end{aligned}$$
(1.8)

which is known as the sigmoid function. Neural Networks frequently use logistic regression with the sigmoid model function where the parameters are obtained through the minimization of the proper cost function, or through the maximization of the likelihood. In this latter case, the likelihood of the probability distribution in Eq. (1.6) is

$$\begin{aligned} \mathfrak {L}(p(\textbf{x})|(y_i,\textbf{x}_i),i=1,\ldots , n)=\prod _{i=1}^n p(\textbf{x}_i)^{y_i}[1-p(\textbf{x}_i)]^{(1-y_i)} \end{aligned}$$
(1.9)

where \(y_i\) are the labels (with value 1 or 0) and \(p(\textbf{x}_i)\) are their (sigmoid-based) probabilistic predicted values given by Eq. (1.8) for the training data, which are a function of the parameters \(\textbf{w}\). The maximization of the log-likelihood of Eq. (1.9) for the model parameters gives the same solution as the minimization of the cross-entropy

$$\begin{aligned} \arg \min _{w} H(\textbf{w}) =-\sum _{i=1}^n \left[ \frac{y_i}{n} \log p(\textbf{x}_i;\textbf{w})+\frac{(1-y_i )}{n}\log (1-p(\textbf{x}_i;\textbf{w}))\right] \end{aligned}$$
(1.10)
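
The following sketch (synthetic data, arbitrary step size and iteration count) minimizes the cross-entropy of Eq. (1.10) by plain gradient descent, using the sigmoid of Eq. (1.8) as the model.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])   # third column of ones absorbs b
w_true = np.array([2.0, -3.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-X @ w_true))
y = (rng.random(n) < p_true).astype(float)                   # Bernoulli labels, Eq. (1.6)

w = np.zeros(3)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))     # sigmoid prediction, Eq. (1.8)
    grad = X.T @ (p - y) / n             # gradient of the cross-entropy, Eq. (1.10)
    w -= 0.5 * grad                      # steepest descent step
print(w)                                 # estimates should be reasonably close to w_true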

Another type of regression often used in ML is Kernel Regression. A Kernel is a positive-definite, typically non-local, symmetric weighting function \(K(\textbf{x}_i,\textbf{x})=K(\textbf{x},\textbf{x}_i)\), centered at the sample point, with unit integral. The idea is similar to the use of shape functions in finite element formulations. For example, the Gaussian Kernel is

$$\begin{aligned} K(\textbf{x}_i,\textbf{x})\equiv K_i(\textbf{x})=\frac{1}{\sqrt{2\pi \sigma ^2}} \exp \left[ {-\frac{1}{2} \left( \frac{|\textbf{x}-\textbf{x}_i|}{\sigma }\right) ^2 }\right] \end{aligned}$$
(1.11)

where \(\sigma \) is the bandwidth or smoothing parameter (deviation), and the weight for sample i is \(w_i(\textbf{x})=K_i(\textbf{x})/\sum _{j=1}^n K_j(\textbf{x})\). The predictor, using the weights from the kernel, is \(f(\textbf{x})=\sum _{i=1}^n w_i(\textbf{x})y_i\) (although kernels may be used also for the labels). The cost function to determine \(\sigma ^{2}\) or other kernel parameters may be

$$\begin{aligned} \int f^2(\textbf{x})d\Omega -2 \frac{1}{n} \sum _{i=1}^n f^{)i(} (\textbf{x}_i) \longrightarrow \text {min} \end{aligned}$$
(1.12)

where the last summation term is the LOOCV, which excludes sample i from the set of predictions (recall that there are n different \(f^{)i(}\) functions). Equation (1.12) focuses in essence on the minimum squared error for the solution. As explained below, kernels are also employed in Support Vector Machines to deal with nonlinearity and in dimensionality reduction of nonlinear problems to reduce the space.
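
A minimal sketch of such a kernel regression (a weighted average of the training labels with normalized Gaussian weights, Eq. (1.11), sometimes called Nadaraya–Watson regression) is given below; the bandwidth and the synthetic data are illustrative.

import numpy as np

def gaussian_kernel_predict(x_query, x_train, y_train, sigma=0.1):
    # kernel values between the query point and every training sample, Eq. (1.11)
    K = np.exp(-0.5 * ((x_query - x_train) / sigma) ** 2)
    w = K / K.sum()                      # normalized weights w_i(x)
    return w @ y_train                   # f(x) = sum_i w_i(x) y_i

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 50))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=50)

print(gaussian_kernel_predict(0.25, x_train, y_train))   # close to sin(pi/2) = 1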

1.2.2.2 Naïve Bayes

Naïve Bayes (NB) schemes are frequently used for classification (spam e-mail filtering, seismic vulnerability, etc.), and may be Multinomial NB or Gaussian NB. In both cases, probability theory is employed.

NB procedures operate as follows. From the training data, the prior probabilities for the different classes are computed, e.g., vulnerable or safe, p(V) and p(S), respectively, in our seismic vulnerability example. Then, for each feature, the probabilities are computed within each class, e.g., the probability that a vulnerable (or safe) structure is made of steel, \(p(\text {steel}|V)\) (or \(p(\text {steel}|S)\)). Finally, given a sample outside the training set, the classification is obtained from the largest probability considering the class and the features present in the sample, e.g., \(p(V)p(\text {steel}|V)p(\ldots |V)\ldots \) or \(p(S)p(\text {steel}|S)p(\ldots |S)\ldots \), and so on. Gaussian NB is applied when the features have continuous (Gaussian) distributions, as for example the height of a building in the seismic vulnerability example. In this case the feature-conditioned probabilities \(p(\cdot |V)\) are obtained from the respective normal distributions. Logarithms of the probabilities are frequently used to avoid underflows.
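
A brief Gaussian NB sketch with scikit-learn follows, using hypothetical continuous features (building height and age) and made-up labels for the seismic vulnerability example.

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[12.0, 80], [6.0, 10], [20.0, 55], [4.0, 5],
              [15.0, 95], [8.0, 20], [25.0, 70], [5.0, 15]])   # [height in m, age in years]
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])                         # 1 = vulnerable, 0 = safe

nb = GaussianNB().fit(X, y)       # estimates priors p(V), p(S) and per-class normal distributions
print(nb.class_prior_)            # prior probabilities of the classes
print(nb.predict([[18.0, 60]]))   # classification of a new building
print(nb.predict_proba([[18.0, 60]]))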

1.2.2.3 Decision Trees (DT)

Decision trees are nonparametric. The simplest and best-known decision tree generator is the Iterative Dichotomiser 3 (ID3) algorithm, a greedy strategy. The scheme is used in the “guess who?” game and is also, in essence, the idea behind the root-finding bisection method and the Cuthill–McKee renumbering algorithm. The objective is, starting from one root node, to select at each step the feature and condition from the data that maximize the Information Gain G (maximize the benefit of the split), resulting in two (or more) subsequent leaf nodes. For example, in a group of people, the first optimal condition (maximizing the benefit of the split) is typically whether the person is male or female, resulting in the male leaf and the female leaf, each with \(50\%\) of the population, and so on. In seismic vulnerability, it could be whether the structure is made of masonry, steel, wood, or Reinforced Concrete (RC). The gain G is the difference between the information entropies before (H(S), parent entropy) and after the split given by the feature at hand, j. Let us denote by \(x_{j(i)}\) the feature j of sample i, by \(\textbf{x}_i\) the array of features of sample i, and by \(x_j\) the different features (we omit the sample index if no confusion is possible). If \(H(S|x_{j})\) are the children entropies after the split by feature \(x_j\), the Gain is

$$\begin{aligned} G(S,x_j) = H(S)-H(S|x_j ) = H(S)-\sum _{s_i=\{\textbf{x}_i,y_i \}\in S_j} p_j H(S_j) \end{aligned}$$
(1.13)

where the \(S_j\) are the subsets of S resulting from the split using feature (or attribute) \(x_j\), \(p_j\) is the subset probability (number of samples \(s_i\) in subset \(S_j\) divided by the number of samples in the complete set S), and

$$\begin{aligned} H(S)=-\sum _{j=1}^l p(y_j)\log p(y_j) \end{aligned}$$
(1.14)

where H(S) is the information entropy of set S for the possible labels \(y_j,j=1,\ldots ,l\), so Eq. (1.13) results in

$$\begin{aligned} G(S, x_j )=-\sum _{i=1}^l p(y_i) \log p(y_i) + \sum _{s_i\in S_j} p_j \sum _{i=1}^l p(y_i|x_j) \log p(y_i|x_j) \end{aligned}$$
(1.15)
Fig. 1.5 Example of determination of the feature with most information gain. If we choose feature \(x_1\) for sorting, we find two subsets, \(S_1=\{s_i\text { such that } x_1=1\}=\{s_2,s_3,s_4\}\) and \(S_2=\{s_i\text { s.t. }x_1=0\}=\{s_1\}\). In \(S_1\) there are two elements (\(s_3,s_4\)) with label A, and one element (\(s_2\)) with label B, so the probabilities are 2/3 for label A and 1/3 for label B. In \(S_2\) the only element (\(s_1\)) has label B, so the probabilities are 0/1 for label A and 1/1 for label B. The entropy of subset \(S_1\) is \(H(S_1)=-\tfrac{2}{3}\log _2\tfrac{2}{3}-\tfrac{1}{3}\log _2\tfrac{1}{3}=0.92\). The entropy of subset \(S_2\) is \(H(S_2)=-0-\tfrac{1}{1}\log _2\tfrac{1}{1}=0\). In a similar form, since there are a total of 4 samples, two with label A and two with label B, the parent entropy is \(H(S)=-\tfrac{2}{4}\log _2\tfrac{2}{4}-\tfrac{2}{4}\log _2\tfrac{2}{4}=1\). Then, the information gain is \(G=H(S)-p(S_1)H(S_1)-p(S_2)H(S_2)=1-\tfrac{3}{4}0.92-\tfrac{1}{4}0=0.31\), where \(p(S_i)\) is the probability of a sample being in subset \(S_i\), i.e. 3/4 for \(S_1\) and 1/4 for \(S_2\). Repeating the computations for the other two features, it is seen that feature \(x_3\) is the one that has the best information gain. Indeed the information gain is \(G=1\) because it fully separates the samples according to the labels

The gain G is computed for each feature \(x_j\) (e.g., windows, structure type, building age, and soil type). The feature that maximizes the Gain is the one selected to generate the next level of leaves. The decision tree building process ends when the entropy reaches zero (the samples are perfectly classified). Figure 1.5 shows a simple example with four samples \(s_i\) in the dataset, each with three features \(x_j\) of two possible values (0 and 1), and one label y of two possible values (A and B). The best of the three features is selected as the one which provides the most information gain. It is seen that feature 1 produces some information gain because, after the split using this feature, the samples are better classified according to the label. Feature 2 gives no gain because it is useless for distinguishing the samples according to the label (each of its values contains 50% of each label), and feature 3 is the best one because it fully classifies the samples according to the label (A is equivalent to \(x_3=1\), and B is equivalent to \(x_3=0\)). As with the Cuthill–McKee renumbering algorithm, there is no proof of reaching the optimum.
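
The computations of Fig. 1.5 can be reproduced with the short sketch below (Eqs. (1.13)–(1.14)); the values assigned to features \(x_2\) and \(x_3\) are assumptions consistent with the properties described in the caption.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))                     # Eq. (1.14)

def information_gain(feature, labels):
    H_parent = entropy(labels)
    H_children = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        H_children += len(subset) / len(labels) * entropy(subset)
    return H_parent - H_children                       # Eq. (1.13)

#               s1   s2   s3   s4
X = np.array([[0,    1,   1,   1],     # feature x1 (as in the caption)
              [0,    1,   0,   1],     # feature x2 (assumed assignment giving no gain)
              [0,    0,   1,   1]])    # feature x3 (assumed assignment fully separating labels)
y = np.array(["B", "B", "A", "A"])     # labels

for j, feature in enumerate(X, start=1):
    print(f"G(S, x{j}) = {information_gain(feature, y):.2f}")
# prints approximately 0.31, 0.00, and 1.00, so feature x3 is chosen for the split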

While DTs are typically used for classification, there are regression trees in which the output is a real number. Other decision tree algorithms are C4.5 (using continuous attributes), the Classification And Regression Tree (CART), and the Multivariate Adaptive Regression Spline (MARS) schemes.

1.2.2.4 Support Vector Machines (SVM), k-Means, and k-Nearest Neighbors (kNN)

The Support Vector Machine (SVM) is a technique which tries to find the optimal hyperplane separating groups of samples for clustering (unsupervised) or classification (supervised). Consider the function \(z(\textbf{x})=\textbf{w}^T \textbf{x}+b\) for classification. The model is

$$\begin{aligned} y\equiv f(\textbf{x})=\text {sign}(z(\textbf{x}))=\text {sign}(\textbf{w}^T \textbf{x}+b) \end{aligned}$$
(1.16)

where the parameters \(\{\textbf{w},b\}\) are obtained by minimizing \(\tfrac{1}{2}|\textbf{w}|^2\) (or equivalently \(|\textbf{w}|^2\) or \(|\textbf{w}|\)) subject to \(y_i(\textbf{w}^T \textbf{x}_i+b)\ge 1 \;\;\forall i\) such that the decision boundary \(f(\textbf{x})=0\) given by the hyperplane has maximum distance to the groups of the samples; see Fig. 1.6. The minimization problem (in primal form) using Lagrange multipliers \(\alpha _i \) is

$$\begin{aligned} \text {find } \mathop {\mathrm {arg\,min}}\limits _{\textbf{w},b} \left[ \tfrac{1}{2}|\textbf{w}|^2 + \sum _{i=1}^n \alpha _i \left( 1-y_i (\textbf{w}^T\textbf{x}_i+b)\right) \right] \text { with } \alpha _i \ge 0 \end{aligned}$$
(1.17)

or in penalty form

$$\begin{aligned} \text {find } \mathop {\mathrm {arg\,min}}\limits _{\textbf{w},b}\left[ \tfrac{1}{2}|\textbf{w}|^2 + C \sum _{i=1}^n \text {max}\left[ 0,1 - y_i (\textbf{w}^T\textbf{x}_i+b)\right] \right] \end{aligned}$$
(1.18)

A measure of certainty for sample i is based on its proximity to the boundary; i.e. \((\textbf{w}^T \textbf{x}_i+b)/|\textbf{w}|\) (the larger the distance to the boundary, the more certain the classification of the sample). Of course, SVMs may be used for multiclass classification, e.g., using the One-to-Rest approach (employing k SVMs to classify k classes) or the One-to-One approach (employing \(\tfrac{1}{2} k(k-1)\) SVMs to classify k classes); see Fig. 1.6.

Taking the derivative of the Lagrangian in square brackets in Eq. (1.17) with respect to \(\textbf{w}\) and b, we get that at the minimum

$$\begin{aligned} \textbf{w}=\sum _{i=1}^n\alpha _iy_i\textbf{x}_i \;\;\text { and }\;\;\sum _{i=1}^n\alpha _iy_i=0 \end{aligned}$$
(1.19)

and substituting it in the primal form given in Eq. (1.17), the minimization problem may be written in its dual form

$$\begin{aligned} \text {find } \mathop {\mathrm {arg\,max}}\limits _{\alpha _i\ge 0}\left[ \sum _{i=1}^n \alpha _i -\frac{1}{2} \sum _{j,k=1}^n \alpha _j \alpha _k y_j y_k (\textbf{x}_j^{T}\textbf{x}_k) \right] \;\;\text { with }\sum _{i=1}^n\alpha _iy_i=0\end{aligned}$$
(1.20)

and with \(b=y_j-\textbf{w}^T\textbf{x}_j\), where \(\textbf{x}_j\) is any active (support) vector, i.e. with \(\alpha _j>0\). Then, \(z=\textbf{w}^T\textbf{x}+b\) becomes \(z=\sum _i\alpha _iy_i\textbf{x}_i^T\textbf{x}+b\). Instead of searching for the weights \(w_i, i=1,\ldots ,N\) (N is the number of features of each sample), we search for the coefficients \(\alpha _i,i=1,\ldots ,n\) (n is the number of samples).

Fig. 1.6 Two-class SVM decision boundary and one-to-one and one-to-rest SVM multiclass classification

Linearly non-separable cases may be addressed through different techniques, such as using positive slack variables \(\xi _i\ge 0\) or kernels. When using slack variables (Soft Margin SVM), for each sample i we write \(y_i (\textbf{w}^T \textbf{x}_i+b)\ge 1-\xi _i\) and we apply an L1 (LASSO-type) regularization by minimizing \(\tfrac{1}{2} |\textbf{w}|^2 + C \sum _i \xi _i\) subject to the constraints \(y_i(\textbf{w}^T \textbf{x}_i + b)\ge 1-\xi _i\) and \(\xi _i\ge 0\), where C is the penalization parameter. In this case, the only change in the dual formulation is the constraint for the Lagrange multipliers: \(C\ge \alpha _i\ge 0\), as can be easily verified.

Fig. 1.7 Use of higher dimensions to obtain linearly separable data. a Data is linearly separable in 1D. b Data is not linearly separable in 1D. c Using two dimensions with mapping \(\mathbf {\phi }=[x,x^2]^T\), data becomes linearly separable in the augmented space

Fig. 1.8 Linearly non-separable samples (left). Linear separation in a transformed higher dimensional space (right)

When using kernels, the kernel trick is typically employed. The idea behind the use of kernels is that if data is linearly non-separable in the features space, it may be separable in a larger space; see, for example, Fig. 1.7. This technique uses the dual form of the SVM optimization problem. Using the dual form

$$\begin{aligned} |\textbf{w}|^2=\textbf{w}^T\textbf{w}=\sum _{i,j=1}^n \alpha _i \alpha _j y_i y_j (\textbf{x}_i^T \textbf{x}_j ) \text { and } \textbf{w}^T \textbf{x}=\sum _{i=1}^n \alpha _i y_i (\textbf{x}_i^T \textbf{x}) \end{aligned}$$
(1.21)

the equations only involve inner products of feature vectors of the type \((\textbf{x}_i^T \textbf{x}_j)\), ideal for using a kernel trick. For example, the case shown in Fig. 1.8 is not linearly separable in the original features space, but using the mapping \(\mathbf {\phi }(\textbf{x}):=\left[ x_1^2,x_2^2,\sqrt{2} x_1 x_2 \right] ^T\) to an augmented space, we find that the samples are linearly separable in this space. Then, for performing the linear separation in the transformed space, we have to compute z in that transformed space (Representer Theorem, Schölkopf et al. 2001)

$$\begin{aligned} z=\sum _{i=1}^n \alpha _i y_i [\mathbf {\phi }(\textbf{x}_i)^T \mathbf {\phi }(\textbf{x})]+b \end{aligned}$$
(1.22)

to substitute the inner products in the original space by inner products in the transformed space. These operations (transformations plus inner products in the high-dimensional space) can be expensive (in complex cases we need to add many dimensions). However, in our example we note that

$$\begin{aligned} K(\textbf{a},\textbf{b}):=\mathbf {\phi }(\textbf{a})^T \mathbf {\phi }(\textbf{b})= \begin{bmatrix} a_1^2&a_2^2&\sqrt{2}a_1 a_2 \end{bmatrix} \begin{bmatrix} b_1^2 \\ b_2^2 \\ \sqrt{2}b_1 b_2 \end{bmatrix} = \left( \begin{bmatrix} a_1& a_2 \end{bmatrix}\begin{bmatrix} b_1\\ b_2 \end{bmatrix}\right) ^2 = (\textbf{a}^T\textbf{b})^2 \end{aligned}$$
(1.23)

so it is not necessary to use the transformed space because the inner product can be equally calculated in both spaces. Indeed, note that, remarkably, we do not even need to know \(\mathbf {\phi }(\textbf{x})\) explicitly, because the kernel \(K(\textbf{a},\textbf{b})= (\textbf{a}^T\textbf{b})^2\) is fully written in the original space and we never need \(\mathbf {\phi }(\textbf{x})\). Then we just solve

$$\begin{aligned} \text {find } \mathop {\mathrm {arg\,max}}\limits _{\alpha _i\ge 0}\left[ \sum _{i=1}^n \alpha _i -\frac{1}{2} \sum _{j,k=1}^n \alpha _j \alpha _k y_j y_k K(\textbf{x}_j,\textbf{x}_k) \right] \text { with }\sum _{i=1}^n\alpha _iy_i=0\end{aligned}$$
(1.24)

Examples of kernel functions are the polynomial \(K(\textbf{a},\textbf{b})=(\textbf{a}^T \textbf{b} + 1)^d\), where d is the degree, and the Gaussian radial basis function (which may be interpreted as a polynomial form with infinite terms)

$$\begin{aligned} K(\textbf{a},\textbf{b})= \exp (-\gamma |\textbf{a}-\textbf{b}|^2)\end{aligned}$$
(1.25)

where \(\gamma =1/(2\sigma ^2 )>0 \).
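
A short sketch of a kernel SVM with scikit-learn is given below for data that is not linearly separable in the original space (points inside and outside a circle, in the spirit of Fig. 1.8); the penalty C and the kernel hyperparameters are arbitrary choices.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)          # label = inside the circle or not

svm_rbf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)      # Gaussian RBF kernel, Eq. (1.25)
svm_poly = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)     # polynomial kernel

print(svm_rbf.score(X, y), svm_poly.score(X, y))   # training accuracy of both kernels
print(svm_rbf.predict([[0.1, 0.1], [0.9, 0.9]]))   # inside -> 1, outside -> 0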

For clustering, another algorithm similar in nature to the SVM is used, called k-means (an unsupervised technique creating k clusters). The idea here is to employ a distance measure in order to determine the optimal centers of the clusters and the optimal decision boundaries between clusters. Another simple approach is the k-Nearest Neighbors (k-NN) scheme, a supervised technique also employed in classification. This technique uses the labels of the k nearest neighbors to predict the label of the target points (e.g., by some weighting method); for the case \(k=1\), the prediction regions are essentially given by the Voronoi diagram of the samples.
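
The following sketch illustrates both techniques with scikit-learn on synthetic two-dimensional data; the number of clusters and of neighbors are illustrative hyperparameters.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[2, 2], scale=0.3, size=(50, 2))
X = np.vstack([group_a, group_b])

# k-means: no labels are used; two clusters and their centers are found from distances
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)

# k-NN: the labels of the k nearest training samples are used to classify a new point
y = np.array([0] * 50 + [1] * 50)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[1.8, 2.1]]))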

1.2.2.5 Dimensionality Reduction Algorithms

When problems have too many features (or data, measurements), dimensionality reduction techniques are employed to reduce the number of attributes and gain insight into the most meaningful ones. These are typically employed not only in pattern recognition and image processing (e.g., identification or compression) but also to determine which features, data, or variables are most relevant for the learning purpose. In essence, the algorithms are similar in nature to determining the principal modes in a dynamic response, because with that information, relevant mechanical properties (mass distribution, stiffness, damping), and the overall response may be obtained. In ML, classical approaches are given by Principal Component Analysis (PCA) based on Pearson’s Correlation Matrix (Abdi and Williams 2010; Bro and Smilde 2014), Singular Value Decomposition (SVD), Proper Orthogonal Decomposition (POD) (Berkooz et al. 1993), Linear (Fisher’s) Discriminant Analysis (LDA) (Balakrishnama and Ganapathiraju 1998; Fisher 1936), Kernel (Nonlinear) Principal Component Analysis (kPCA), Hofmann et al. (2008), Alvarez et al. (2012), Local Linear Embedding (LLE) (Roweis and Saul 2000; Hou et al. 2009), Manifold Learning (used also in constitutive modeling) (Cayton 2005; Bengio et al. 2013; Turaga et al. 2020), Uniform Manifold Approximation and Projection (UMAP) (McInnes et al. 2018), and autoencoders (Bank et al. 2020; Zhuang et al. 2021; Xu and Duraisamy 2020; Bukka et al. 2020; Simpson et al. 2021). Often, these approaches are also used in clustering.

LLE is one of the simplest nonlinear dimension reduction processes. The idea is to identify a global space with smaller dimension that reproduces the proximity of data in the higher dimensional space; it is a k-NN approach. First, we determine the weights \(w_{ij}\), such that \(\sum w_{ij}=1\), which minimize the error

$$\begin{aligned} \text {Error}(\textbf{w})=\sum _{i=1}^n \left( \textbf{x}_i-\sum _{j=1}^k w_{ij} \textbf{x}_j \right) ^2 \end{aligned}$$
(1.26)

in the representation of a point from the local space given by the k-nearest points (k is a user-prescribed hyperparameter), so

$$\begin{aligned} \textbf{w}=\mathop {\mathrm {arg\,min}}\limits \left[ \text {Error}(\textbf{w})\right] \end{aligned}$$
(1.27)

Then, we search for the images \(\textbf{y}_i\) of \(\textbf{x}_i\) in the lower dimensional space, simply by considering that the computed \(w_{ij}\) reflect the geometric properties of the local manifold and are invariant to translations and rotations. Given \(w_{ij}\), we now look for the lower dimensional coordinates \(\textbf{y}_i\) that minimize the cost function

$$\begin{aligned} C(\textbf{y}_i,i=1,\ldots ,n)=\sum _{i=1}^n \left( \textbf{y}_i-\sum _{j=1}^k w_{ij} \textbf{y}_j\right) ^2 \end{aligned}$$
(1.28)
Fig. 1.9 a Principal Component Analysis. The principal components are those with largest variations (largest eigenvalues of the covariance matrix). b Linear Discriminant Analysis to separate clusters. It is seen that feature \(x_1\) is not a good choice to determine if a sample belongs to a given cluster, but there is a features combination (a line) which gives the best discrimination between clusters. That combination maximizes the distance between the means of the clusters while minimizing the dispersion of samples within the clusters

Isometric Mapping (ISOMAP) techniques are similar, but use geodesic k-node-to-k-node distances (computed by Dijkstra’s 1959 or the Floyd–Warshall 1962 algorithms to find the shortest path between nodes) and look for preserving them in the reduced space. Another similar technique is the Laplacian eigenmaps scheme (Belkin and Niyogi 2003), based on the non-singular lowest eigenvectors of the Graph Laplacian \(\textbf{L}=\textbf{d}-\textbf{w}\), where \(d_{ii}=\sum _j w_{ij}\) gives the diagonal degree matrix and \(w_{ij}\) are the edge weights, computed for example using the Gaussian kernel \(w_{ij}=K(\textbf{x}_i,\textbf{x}_j )=\exp (-|\textbf{x}_i-\textbf{x}_j|^2 /(2\sigma ^2 ))\). Within the same k-neighbors family, yet more complex and advanced, are Topological Data Analysis (TDA) techniques. A valuable overview may be found in Chazal and Michel (2021); see also the references therein.

For the case of PCA, it is typical to use the covariance matrix

$$\begin{aligned} S_{jk}=\frac{1}{n} \sum _{i=1}^n (x_{j(i)} -\bar{x}_j)(x_{k(i)}-\bar{x}_k)\end{aligned}$$
(1.29)

where the overbar denotes the mean value of the feature, and \(x_{j(i)}\) is feature j of sample i. The eigenvectors and eigenvalues of the covariance matrix are the principal components (directions/values of maximum significance/relevance), and the number of them selected as sufficient is determined by the variance ratios; see Fig. 1.9a. PCA is a linear unsupervised technique. The typical implementation uses mean-corrected samples, as in kPCA, so in that case \(S_{jk}=\frac{1}{n} \sum _{i=1}^n x_{j(i)} x_{k(i)}\), or in matrix notation \(\textbf{S}=\tfrac{1}{n}\textbf{X}\textbf{X}^T\). kPCA (Schölkopf et al. 1997) is PCA using kernels (such as polynomials, the hyperbolic tangent, or the Radial Basis Function (RBF)) to address nonlinearity by expanding the space. For example, using the RBF, we construct the kernel matrix \(K_{ij}\), whose components are obtained from the samples i, j as \(K_{ij}=\exp (-\gamma |\textbf{x}_i-\textbf{x}_j|^2)\). The kernel matrix is then centered in the transformed space by (note that being centered in the original features space does not mean that the features are also automatically centered in the transformed space, hence the need for this operation)

$$\begin{aligned} \bar{\textbf{K}}=\textbf{K}-\frac{1}{n}\textbf{1}\textbf{K}-\frac{1}{n}\textbf{K}\textbf{1}+\frac{1}{n^2}\textbf{1}\textbf{K}\textbf{1} \end{aligned}$$
(1.30)

where \(\textbf{1}\) is an \(n\times n\) matrix of ones. Then \(\bar{\textbf{K}}(\textbf{x}_i,\textbf{x}_j)=\bar{\mathbf {\phi }}(\textbf{x}_i)^T \bar{\mathbf {\phi }}(\textbf{x}_j)\) with the centered \(\bar{\mathbf {\phi }}(\textbf{x}_i)={\mathbf {\phi }}(\textbf{x}_i)-\frac{1}{n} \sum _{r=1}^n {\mathbf {\phi }}(\textbf{x}_r)\). The larger eigenvalues are the principal components in the transformed space, and the corresponding eigenvectors give the samples already projected onto the principal axes.
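
A brief sketch of PCA and kPCA with scikit-learn follows, using synthetic ring-shaped data for which linear PCA is not very informative; the RBF parameter gamma and the data are illustrative choices.

import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
inner = 0.3 * rng.normal(size=(100, 2))                          # blob at the origin
angle = rng.uniform(0, 2 * np.pi, 100)
outer = np.column_stack([3 * np.cos(angle), 3 * np.sin(angle)]) + 0.1 * rng.normal(size=(100, 2))
X = np.vstack([inner, outer])                                    # ring around a blob

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)            # variance ratios of the (linear) principal components

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
Z = kpca.fit_transform(X)                       # samples projected onto the kernel principal axes
print(Z[:3])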

In contrast to PCA, the purpose of LDA is typically to improve the separability of known classes (it is a supervised technique), and hence to maximize information in this sense: maximizing the distance between the mean values of the classes while, within each class, minimizing the variation. It does so through the eigenvalues of the normalized between-classes scatter matrix \(\textbf{S}_w^{-1} \textbf{S}_b\) (the between-class variances divided by the within-class variances), where

$$\begin{aligned} \textbf{S}_w &=\sum _{i=1}^{n-\text {classes}} \sum _{\text {within-class-} i}^{n_i} (\textbf{x}-\textbf{m}_i) (\textbf{x}-\textbf{m}_i)^T \end{aligned}$$
(1.31)
$$\begin{aligned} \textbf{S}_b &=\sum _{\text {Class}\, i=1}^{n-\text {classes}} n_i (\textbf{m}_i-\bar{\textbf{x}}) (\textbf{m}_i-\bar{\textbf{x}})^T \end{aligned}$$
(1.32)

and \(\bar{\textbf{x}}\) is the overall mean vector of the features \(\textbf{x}\) and \(\textbf{m}_i\) is the mean vector of those within-class i. If \(\bar{\textbf{x}}=\textbf{m}_i\) the class is not separable from the selected features. Frequently used nonlinear extensions of LDA are the Quadratic Discriminant Analysis (QDA) (Tharwat 2016; Ghosh et al. 2021), Flexible Discriminant Analysis (FDA) (Hastie et al. 1994), and Regularized Discriminant Analysis (RDA) (Friedman 1989).
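
A minimal LDA sketch with scikit-learn is given below, on synthetic data in which each individual feature overlaps between the two classes but a feature combination separates them, in the spirit of Fig. 1.9b; the data are made up for illustration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
t0 = rng.normal(size=(100, 1))                                   # spread along the (1, 1) direction
t1 = rng.normal(size=(100, 1))
class_0 = np.hstack([t0, t0]) + 0.1 * rng.normal(size=(100, 2))
class_1 = np.hstack([t1 + 1.0, t1 - 1.0]) + 0.1 * rng.normal(size=(100, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
print(lda.scalings_.ravel())   # discriminant direction, roughly along (1, -1)
print(lda.score(X, y))         # classification accuracy on the training data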

Proper Orthogonal Decompositions (POD) are frequently motivated by PCA and are often used in turbulence and in the reduction of dynamical systems. It is a technique also similar to classical modal decomposition. The idea is to decompose the time-dependent solution as

$$\begin{aligned} \textbf{u}(\textbf{x},t)=\sum _{p=1}^P a_p(t) \mathbf {\varphi }_p(\textbf{x}) \end{aligned}$$
(1.33)

and compute the Proper Orthogonal Modes (POMs) \(\mathbf {\varphi }_p(\textbf{x})\) that maximize the energy representation (L2-norm). In essence, we are looking for the set of “discrete functions” \(\mathbf {\varphi }_p(\textbf{x})\) that best represent \(\textbf{u}(\textbf{x},t)\) with the lowest number of terms P. Since these are computed as discretized functions, several snapshots \(\textbf{u}(\textbf{x},t_i), i=1, \ldots , n\) are collected in the discretized domain, i.e.

$$\begin{aligned} \textbf{U}=\begin{bmatrix}\textbf{u}(\textbf{x},t_1)&\textbf{u}(\textbf{x},t_2)&\ldots & \textbf{u}(\textbf{x},t_n)\end{bmatrix}= \begin{bmatrix} u_{11} & \ldots & u_{1n}\\ \vdots & \ddots & \vdots \\ u_{m1} & \ldots & u_{mn} \end{bmatrix} \end{aligned}$$
(1.34)

Then, the POD vectors are the eigenvectors of the sample covariance matrix. If the snapshots are corrected to have zero mean value, the covariance matrix is

$$\begin{aligned} \textbf{S}=\frac{1}{n} \textbf{U}\textbf{U}^T \end{aligned}$$
(1.35)

The POMs may also be computed using the SVD of \(\textbf{U}\) (the left singular vectors are the eigenvectors of \(\textbf{U}\textbf{U}^T\)) or auto-associative NNs (Autoencoder Neural Networks that replicate the input in the output but using a hidden layer of smaller dimension). To overcome the curse of dimensionality when using too many features (e.g., for parametric analyses), the POD idea is generalized in Proper Generalized Decomposition (PGD), by assuming approximations of the form

$$\begin{aligned} \textbf{u}(x_1,x_2,\ldots ,x_d )=\sum _{i=1}^N \mathbf {\phi }_i^1 (x_1 ) \circ \mathbf {\phi }_i^2 (x_2 )\circ \ldots \circ \mathbf {\phi }_i^d (x_d ) \end{aligned}$$
(1.36)

where \(\mathbf {\phi }_i^j (x_j)\) are the unknown vector functions (usually also discretized and computed iteratively, for example using a greedy algorithm), and “\(\circ \)” stands for the Hadamard or entry-wise product of vectors. Note that, in general, a function is not separable, \(\Phi (x,y)\ne \phi (x)\psi (y)\), but PGD looks for the best \(\phi _i(x)\psi _i(y)\) choices for the given problem such that \(\Phi (x,y)\simeq \sum _i \phi _i (x)\psi _i(y)\) with a sufficiently small number of addends (hence with a reduced complexity). The power of the idea is that for a large number d of features, determining functions of the type \(\Phi (x_1,x_2,\ldots , x_d)\) is virtually impossible, but determining products and additions of scalar functions is feasible.
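Returning to the POD computation described above (Eqs. (1.34)–(1.35)), the following minimal NumPy sketch obtains the POMs from the SVD of the snapshot matrix; the snapshot data, the matrix size, and the retained-energy threshold are placeholders:

```python
import numpy as np

# Snapshot matrix U: m spatial points x n time snapshots (placeholder data)
m, n = 1000, 50
rng = np.random.default_rng(1)
U = rng.normal(size=(m, n))
U -= U.mean(axis=1, keepdims=True)          # zero-mean snapshots

# POD modes = left singular vectors of U (eigenvectors of U U^T, Eq. 1.35)
Phi, sigma, _ = np.linalg.svd(U, full_matrices=False)

# Energy captured by the first P modes and rank-P reconstruction
energy = np.cumsum(sigma**2) / np.sum(sigma**2)
P = int(np.searchsorted(energy, 0.99) + 1)   # modes needed for 99% of the energy
U_approx = Phi[:, :P] @ (Phi[:, :P].T @ U)
```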

The UMAP and t-SNE schemes are based on the concept of a generalized metric or distance between samples. A symmetric and normalized (between 0 and 1) metric is defined as

$$\begin{aligned} d_{ij}(\textbf{x}_i,\textbf{x}_j )=d_i^j (\textbf{x}_i,\textbf{x}_j )+d_j^i (\textbf{x}_i,\textbf{x}_j )-d_i^j(\textbf{x}_i,\textbf{x}_j) d_j^i(\textbf{x}_i,\textbf{x}_j) \end{aligned}$$
(1.37)

where the unidirectional distance function is defined as

$$\begin{aligned} d_i^j(\textbf{x}_i,\textbf{x}_j )=\exp \left( -\frac{\rho _{ij} -\rho _i^1}{\rho _i^k}\right) \end{aligned}$$
(1.38)

where \(\rho _{ij} =|\textbf{x}_i-\textbf{x}_j|\) and \(\rho _i^k=|\textbf{x}_i-\textbf{x}_k |\), with k referring to the k-nearest neighbor (\(\rho _i^1\) refers to the nearest neighbor to i). Here k is an important hyperparameter. Note that \(d_i^j=1\) if i, j are nearest neighbors, and \(d_i^j\rightarrow 0\) for distant samples. We are looking for a new set of lower dimensional features \(\textbf{z}\) to replace \(\textbf{x}\). The same generalized distance \(d_{ij}(\textbf{z}_i,\textbf{z}_j)\) may be applied to the new features. To this end, through optimization techniques such as steepest descent, we minimize a fuzzy set dissimilarity cross-entropy (or entropy difference) such as the Kullback–Leibler (KL) divergence (Hershey and Olsen 2007; Van Erven and Harremos 2014), which measures the difference between the probability distributions \(d_{ij}(\textbf{x}_i,\textbf{x}_j )\) and \(d_{ij}(\textbf{z}_i,\textbf{z}_j )\), and their complementary values \([1-d_{ij}(\textbf{x}_i,\textbf{x}_j )]\) and \([1-d_{ij}(\textbf{z}_i,\textbf{z}_j )]\) (recall that \(d\in (0,1]\), so it is seen as a probability distribution)

$$\begin{aligned} KL(d(\textbf{x}),d(\textbf{z}))=&\sum _{i,j=1}^n \left\{ d_{ij}(\textbf{x}_i,\textbf{x}_j ) \ln \left[ \frac{d_{ij}(\textbf{x}_i,\textbf{x}_j )}{d_{ij}(\textbf{z}_i,\textbf{z}_j ) }\right] \right. \nonumber \\ &+\left. \left[ 1-d_{ij}(\textbf{x}_i,\textbf{x}_j)\right] \ln \left[ \frac{1-d_{ij}(\textbf{x}_i,\textbf{x}_j )}{1-d_{ij}(\textbf{z}_i,\textbf{z}_j ) }\right] \right\} \end{aligned}$$
(1.39)

Note that the KL scheme is not symmetric with respect to the distributions. If distances in both spaces are equal for all the samples, KL \(=0\). In general, a lower dimensional space gives KL \(\ne 0\), but with the dimension of \(\textbf{z}\) fixed, the features (or combinations of features) that give a minimum KL considering all n samples represent the optimal selection.
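As an illustration, the following minimal sketch applies t-SNE as implemented in Scikit-learn, which minimizes a KL-type divergence between neighbor distributions in the original and embedded spaces; the dataset is a placeholder and the perplexity plays the role of the neighborhood-size hyperparameter k discussed above:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# High-dimensional samples (the digits images are used only as a stand-in dataset)
X, y = load_digits(return_X_y=True)

# Embed into 2 dimensions; the perplexity acts as the effective number of neighbors
Z = TSNE(n_components=2, perplexity=30, init="pca",
         random_state=0).fit_transform(X)
print(Z.shape)   # (1797, 2)
```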

Autoencoders are a type of neural network, discussed below, and can be interpreted as a nonlinear generalization of PCA. Indeed, an autoencoder with linear activation functions is equivalent to a SVD.

1.2.2.6 Genetic Algorithms

Genetic Algorithms (Mitchell 1998) in ML (or, more generally, evolutionary algorithms) are in essence very similar to those employed in optimization (Grefenstette 1993; De Jong 1988). They are metaheuristic algorithms which mimic the steps of natural evolution: (1) an initial population, (2) a fitness function, (3) a (nature-like) selection according to fitness, (4) crossover (gene combination), and (5) mutation (random alteration). After running many generations, convergence toward a highly fit population (a “superspecies”) is expected. Feature selection and database reduction are typical applications (Vafaie and De Jong 1992); a minimal sketch of the five steps is given below. The variety of implementations is large and they depend on the specific problem addressed (e.g., polymer design, Kim et al. 2021, and materials modeling, Paszkowicz 2009), but the essence and ingredients are similar.
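The following NumPy sketch applies the five steps to a toy feature-selection problem; the fitness function, the data, and all numerical values (population size, mutation rate, etc.) are placeholders chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, pop_size, n_gen = 10, 20, 30

# Toy regression data: only features 0 and 3 are informative (an assumption
# made purely for this illustration)
X = rng.normal(size=(100, n_features))
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=100)

def fitness(mask):
    # Placeholder fitness rewarding informative but small feature subsets;
    # in practice this could be, e.g., a cross-validated model score
    if mask.sum() == 0:
        return -np.inf
    score = np.corrcoef(X[:, mask.astype(bool)].mean(axis=1), y)[0, 1] ** 2
    return score - 0.01 * mask.sum()

pop = rng.integers(0, 2, size=(pop_size, n_features))          # (1) initial population
for _ in range(n_gen):
    fit = np.array([fitness(ind) for ind in pop])               # (2) fitness
    parents = pop[np.argsort(fit)[-pop_size // 2:]]              # (3) selection
    cuts = rng.integers(1, n_features, size=pop_size)            # (4) crossover
    children = np.array([np.concatenate((parents[rng.integers(len(parents))][:c],
                                          parents[rng.integers(len(parents))][c:]))
                         for c in cuts])
    mutate = rng.random(children.shape) < 0.05                   # (5) mutation
    pop = np.where(mutate, 1 - children, children)

best = pop[np.argmax([fitness(ind) for ind in pop])]             # fittest individual
```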

Fig. 1.10
figure 10

Rosenblatt’s perceptron and Adaline (Adaptive Linear Neuron) model

1.2.2.7 Rosenblatt’s Perceptron, Adaline (Adaptive Linear Neuron) Model, and Feed Forward Neural Networks (FFNN)

Currently, the majority of ML algorithms employed in practice are some type or variation of Neural Networks. Deep Learning (DL) refers to NNs with many layers. While the NN theory was proposed some decades ago, efficient implementations facilitating the solution of real-world problems were established only in the late 1980s and early 1990s. NNs are based on the ideas from McCulloch and Pitts (1943), describing a simple model for the working of neurons, and on Rosenblatt’s perceptron (Rosenblatt 1958); see Fig. 1.10. The Adaline model (Widrow and Hoff 1962) (Fig. 1.10) introduces the activation function to drive the learning process from the different samples, instead of using the dichotomic outputs of the samples. This activation function is today one of the keystones of NNs. The logistic sigmoid function is frequently used, and there are other alternatives such as the ReLU (Rectified Linear Unit; the Macaulay ramp function) or the hyperbolic tangent

$$\begin{aligned} \tanh (z)=\frac{\exp (z)-\exp (-z)}{\exp (z)+\exp (-z)} \text {, with } z=\textbf{w}^T \textbf{x}+ w_0 \end{aligned}$$
(1.40)

NNs are made from many such artificial neurons, typically arranged in several layers, with each layer \(l=1,\ldots ,L\) containing many neurons. The output from the network is defined as a composition of functions

$$\begin{aligned} \textbf{y}=\textbf{f}^L(\textbf{f}^{L-1} (\textbf{f}^{L-2}(\ldots \textbf{f}^2 (\textbf{f}^1 (\textbf{z}^1 ))))) \end{aligned}$$
(1.41)

where the \(\textbf{f}^l\) are the neuron functions of the layer (often also denoted by \(\mathbf {\sigma }^l\) in the sigmoid case), typically arranged by groups in the form \(\textbf{f}^l(\textbf{W}^l \textbf{x}^l+\textbf{b}_l )\), where \(\textbf{W}^l\) is the matrix of weights, \(\textbf{z}^l:=\textbf{W}^l \textbf{x}^l+\textbf{b}_l\), \(\textbf{x}^l=\textbf{y}^{l-1}=\textbf{f}^{l-1}(\textbf{z}^{l-1})\) are the neuron inputs and output of the previous layer (the features for the first function; \(\textbf{y}^0\equiv \textbf{x}\)), and \(\textbf{b}_l\) is the layer bias vector, which is often incorporated as a weight on a unit bias by writing \(\textbf{z}^l=\textbf{W}^l \textbf{x}^l\), so \(\textbf{x}^l\) has also the index 0, and \(x_0^l=1\); see Fig. 1.11. The output may be also a vector \(\textbf{y}\equiv \textbf{y}^L\). The purpose of the learning process is to learn the optimum values of \(\textbf{W}^l\), \(\textbf{b}_l\). The power of the NNs is that a simple architecture, with simple functions, may be capable of reproducing more complex functions. Indeed, Rosenblatt’s scheme discussed below may give any linear or nonlinear function. Of course, complex problems will require many neurons, layers, and data, but the overall structure of the method is almost unchanged.
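A minimal NumPy sketch of the composition of layer functions in Eq. (1.41) follows; the network size, the (random) weights, and the choice of activation functions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and biases of a small network (illustrative random values):
# 3 input features -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(x):
    # y = f^2(f^1(z^1)) with z^l = W^l x^l + b^l  (Eq. 1.41)
    z1 = W1 @ x + b1
    x2 = sigmoid(z1)          # output of layer 1 = input of layer 2
    z2 = W2 @ x2 + b2
    return np.tanh(z2)        # hyperbolic tangent output layer (Eq. 1.40)

y = forward(np.array([0.5, -1.0, 2.0]))
```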

Fig. 1.11
figure 11

Neural Network with \(L-1\) hidden layers and one (the L) output layer. Notation for weights is \(W_{oi}^l\), where i is the input cell (zero refers to the bias unit), o is the output cell (the order is often reversed in the literature), and \(l=1,\ldots ,L\) are the layers

The Feed Forward Neural Network (FFNN) with many layers, as shown in Fig. 1.11, is trained by optimization algorithms (typically modifications of the steepest descent) using the backpropagation algorithm, which consists in computing the sensitivities using the chain rule from the output layer to the input layer, so that, for each layer, the information on the derivatives of the subsequent layers is known. For example, in Fig. 1.11, assume that the error is computed as \(E=\tfrac{1}{2} \left( \textbf{y}-\textbf{y}^{\text {exp}}\right) ^T \left( \textbf{y}-\textbf{y}^{\text {exp}}\right) \) (logistic errors are more common, but we consider herein this simpler case). Then, if \(\alpha \) is the learning rate (a hyperparameter), the increment between epochs of the parameters is

$$\begin{aligned} \Delta {W}_{oi}^l=-\alpha (\textbf{y}-\textbf{y}^{\text {exp}})^T \frac{\partial \textbf{y}}{\partial W_{oi}^l} \end{aligned}$$
(1.42)

where \(\partial \textbf{y}/ \partial W_{oi}^l\) is computed through the chain rule. Figure 1.12 shows a simple example with two layers and two neurons per layer; superindices refer to layer and subindices to neuron. For example, following the green path, we can compute

$$\begin{aligned} \frac{\partial {y}_2}{\partial W_{21}^2}=\left[ \frac{\partial {y}_2}{\partial {z}_2^2}\right] \left\{ \frac{\partial {z}_2^2}{\partial W_{21}^2}\right\} \end{aligned}$$
(1.43)

where \(\partial {y}_2 / \partial {z}_2^2\) is the derivative of the selected activation function evaluated at the iterative value \({z}_2^2\) and \(\partial {z}_2^2 /\partial W_{21}^2={x}_1^2\) is also the known iterative value. As an example of a deeper layer, consider the red line in Fig. 1.12

$$\begin{aligned} \frac{\partial {y}_1}{\partial W_{21}^1 }= \left[ \frac{\partial {y}_1}{\partial {z}_1^2 }\right] \left[ \frac{\partial {z}_1^2}{\partial {x}_2^2} \frac{\partial {x}_2^2}{\partial {z}_2^1}\right] \left\{ \frac{\partial {z}_2^1}{\partial W_{21}^1}\right\} \end{aligned}$$
(1.44)

where we note that the first square bracket corresponds to the last layer, the second to the previous one, and so on, until the term in curly brackets addressing the specific network variable. The procedures had issues with exploding or vanishing gradients (especially with sigmoid and hyperbolic tangent activations), but several improvements in the algorithms (gradient clipping, regularization, skip connections, etc.) have resulted in efficient algorithms for many hidden layers. The complexity of these improvement techniques, with the substantial number of “tweaks” needed to make them work in practical problems, is one of the reasons why “canned” libraries are employed and recommended (Fig. 1.13). A minimal sketch of the backpropagation update is given below.
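The following NumPy sketch implements the update of Eq. (1.42) for a two-layer sigmoid network, with the chain-rule factors of Eqs. (1.43)–(1.44) computed layer by layer; biases are omitted and the data, the network size, and the learning rate are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                    # input features
y_exp = np.array([0.3, 0.7])              # "experimental" target values
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 2))
alpha = 0.1                               # learning rate (hyperparameter)

for epoch in range(1000):
    # Forward pass
    z1 = W1 @ x;  x2 = sigmoid(z1)
    z2 = W2 @ x2; y = sigmoid(z2)
    # Backward pass for E = 0.5 (y - y_exp)^T (y - y_exp)
    delta2 = (y - y_exp) * y * (1 - y)            # dE/dz^2 (output layer)
    delta1 = (W2.T @ delta2) * x2 * (1 - x2)      # dE/dz^1 via the chain rule
    W2 -= alpha * np.outer(delta2, x2)            # Eq. (1.42) for layer 2
    W1 -= alpha * np.outer(delta1, x)             # Eq. (1.42) for layer 1
```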

Fig. 1.12
figure 12

Computation of the gradient through backpropagation. \(z_o^l\) is defined as \(z_o^l=W_{oi}^l x_i^l\) (which includes the bias) and \(f_o^l(z_o^l)\) is the activation function

Fig. 1.13
figure 13

Neural networks are capable of generating functions to fit data regardless of the dimension of the space and the nonlinearity of the problem. In this example, three neurons of the simplest Rosenblatt perceptron type, consisting of a step activation function, are used to generate a local linear function. This function is obtained by simply changing the bias weights and adding the results of the three neurons with equal weights. Other more complex functions may be obtained with different weights. Furthermore, the firing step function may be replaced by generally better choices such as the ReLU or the sigmoid functions

1.2.2.8 Bayesian Neural Networks (BNN)

A Bayesian Neural Network (BNN) is a NN that uses probability and the Bayes theorem relating conditional probabilities

$$\begin{aligned} p(\textbf{z}|\textbf{x})=\frac{p(\textbf{x}|\textbf{z})p(\textbf{z})}{p(\textbf{x})} \end{aligned}$$
(1.45)

where \(p(\textbf{x}|\textbf{z})=p(\textbf{x}\cap \textbf{z})/p(\textbf{z})\). A typical example is to consider a probabilistic distribution of the weights (so we take \(\textbf{z}=\textbf{w}\)) for a given model, or a probabilistic distribution of the output (so we take \(\textbf{z}=\textbf{y}\)) not conditioned to a specific model. These choices can be applied in uncertainty quantification (Olivier et al. 2021), with metal fatigue being a typical application case (Fernández et al. 2022a; Bezazi et al. 2007). Given the complexity of passing analytical distributions through the NN, sampling is often performed through Monte Carlo approaches. The purpose is to learn the means and standard deviations of the distributions of the weights, assuming they follow a normal distribution \(w_i\sim \mathcal {N}({\mu }_i,{\sigma }_i^2)\). For the case of predicting an output y, considering one weight, the training objective is to maximize the probability of the training data for the best prediction, or to minimize the likelihood of a bad prediction, as

$$\begin{aligned} \mathbf {\mu }^*,\mathbf {\Sigma }^*=\mathop {\mathrm {arg\,min}}\limits _{\mathbf {\mu },\mathbf {\Sigma }}\sum _{\forall \textbf{x}_i,y_{i}} \mathcal {L}(f(\textbf{x}_i;\mathcal {N}(\mathbf {\mu },\mathbf {\Sigma })),y_{i}) + \text {KL}[p(\mathcal {N}(\mathbf {\mu },\mathbf {\Sigma })),p(\mathcal {N}(0,1))] \end{aligned}$$
(1.46)

where \(\text {KL}(p_1,p_2)\) is the Kullback–Leibler divergence regularization for the probabilities \(p_1\) and \(p_2\) explained before, \(\mathcal {L}\) is the loss function and \(f(\textbf{x}_i;\mathcal {N}(\mathbf {\mu },\mathbf {\Sigma }))\) is the function prediction for y from data \(\textbf{x}_i\), assuming a distribution \(\mathcal {N}(\mathbf {\mu },\mathbf {\Sigma })\). With the learned optimal parameters \(\mathbf {\mu }^*,\mathbf {\Sigma }^*\), the prediction for new data \(\textbf{x}\) is

$$\begin{aligned} y=\frac{1}{K} \sum _{k=1}^K f(\textbf{x};\mathcal {N}_k(\mathbf {\mu }^*,\mathbf {\Sigma }^*)) \end{aligned}$$
(1.47)

where the \(\mathcal {N}_k(\mathbf {\mu }^*,\mathbf {\Sigma }^*)\) are the numerical evaluations of the normal distributions for the obtained parameters.
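A minimal sketch of the Monte Carlo prediction of Eq. (1.47) follows, for a single weight and bias; the model form, the learned posterior parameters, and the number of samples are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Learned posterior parameters of a single weight and bias (illustrative values)
mu_w, sigma_w = 1.5, 0.2
mu_b, sigma_b = 0.1, 0.05

def f(x, w, b):
    # Simple placeholder model y = f(x; w, b), used only to illustrate Eq. (1.47)
    return np.tanh(w * x + b)

# Monte Carlo prediction: average over K weight samples drawn from N(mu, sigma^2)
K, x_new = 1000, 0.8
samples = [f(x_new, rng.normal(mu_w, sigma_w), rng.normal(mu_b, sigma_b))
           for _ in range(K)]
y_mean, y_std = np.mean(samples), np.std(samples)   # prediction and its uncertainty
```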

1.2.2.9 Convolutional Neural Networks (CNNs)

Although Convolutional Neural Networks (CNNs) are a type of FFNN, they were formulated with the purpose of classifying images. CNNs combine one or several convolution layers with pooling layers (for feature extraction from images) and with final standard FFNN layers for classification (Fig. 1.14). Pooling is also named subsampling, since averaging or extracting the maximum of a patch are the typical operations. In the convolutional layers, the input data usually has several dimensions and is filtered with a moving patch array (also named kernel, with a specific stride length and edge padding; see Fig. 1.15) to reduce the dimension and/or to extract the main characteristics of, or information from, the image (like looking at a lower resolution version or comparing patterns with a reference). Each filtering pass of a patch over the same record produces a channel, and successive or chained filterings are called layers, Fig. 1.15. The same filter, with lower dimension, may be applied over different sample dimensions (a volume). In essence, the idea is similar to the convolution of functions in signal processing to extract information from the signal; indeed, this is also an application of CNNs. The structure of CNNs has obvious and interesting applications in multiscale modeling in materials science and in constitutive modeling (Yang et al. 2020; Frankel et al. 2022), and thus also in determining material properties (Xie and Grossman 2018; Zheng et al. 2020), behavior prediction (Yang et al. 2020), and obviously in extracting microstructure information from images (Jha et al. 2018).
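A minimal sketch of such an architecture (one convolution layer, one pooling layer, a flattened layer, and a small FFNN head, in the spirit of Fig. 1.14) is given below using the Keras API of TensorFlow; the input size, the number of channels, and all hyperparameters are placeholders:

```python
from tensorflow.keras import layers, models

# Convolution -> pooling -> flatten -> FFNN classification head
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                   # grayscale image input
    layers.Conv2D(8, kernel_size=3, strides=2,         # 8 channels, stride length 2
                  padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),                  # pooling (subsampling)
    layers.Flatten(),                                  # flattened feature layer
    layers.Dense(32, activation="relu"),               # FFNN layer
    layers.Dense(10, activation="softmax"),            # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```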

Fig. 1.14
figure 14

Typical structure of a CNN, including one convolution layer, one pooling layer, a flattened layer of features, and a FFNN

Fig. 1.15
figure 15

Convolutional network layer with depth 1, stride length 2 (the filter patch moves 2 positions at once) and edge padding 1 (the boundary is filled with one row and column of zeroes). Pooling is similar, but usually selects the maximum or average of a moving pad to avoid correlation of features with location

Fig. 1.16
figure 16

Recurrent Neural Network. a Folded representation, b unfolded representation considering three events, and c classification according to the input–output instances considered

1.2.2.10 Recurrent Neural Networks (RNN)

RNNs are used for sequences of events, so they are extensively used in language processing (e.g., in “Siri” or translators from Google), and they are effective in unveiling and predicting sequences of events (e.g., manufacturing) or when history is important (path-dependent events as in plasticity Mozaffar et al. 2019; Li et al. 2019; du Bos et al. 2020). In Fig. 1.16, a simple RNN is shown with \(\,^{t}\textbf{h}\) representing the history variables, such that the equations of the RNN are

$$\begin{aligned} \,^{t+1}\textbf{h} & = \textbf{f}_h^l(\textbf{W}_h \,^t\textbf{h}+ \textbf{W}_x \,^t\textbf{x}+\textbf{b}) \end{aligned}$$
(1.48)
$$\begin{aligned} \,^t\textbf{o} & = \textbf{f}_{o}^l(\textbf{W}_h \,^t\textbf{h}+ \textbf{W}_x \,^t\textbf{x}+\textbf{b}) \end{aligned}$$
(1.49)
$$\begin{aligned} \,^t\textbf{y} & = \textbf{f}_{h}^l(\textbf{W}_h \,^t\textbf{h}+\textbf{b}) \end{aligned}$$
(1.50)

The unfolding of a RNN allows for a better understanding of the process; see Fig. 1.16. Following our previous seismic example, RNNs can be used to predict new earthquakes from the previous history; see, for example, Panakkat and Adeli (2009), Wang et al. (2017). A RNN is similar in nature to a FFNN, and is frequently mixed with FF layers, but it recycles some output at a given time or event for the next time(s) or event(s). RNNs may be classified according to the number of related input–output instances as one-to-one, one-to-many (one input instance to many output instances), many-to-one (e.g., classifying a voice or determining the location of an earthquake), and many-to-many (translation into a foreign language); see Fig. 1.16. A frequent ingredient in RNNs is the use of “gates” (e.g., in Long Short-Term Memory (LSTM) networks, see Fig. 1.17) to decide which data is introduced, output, or forgotten.
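A minimal many-to-one sketch using a Keras LSTM layer follows; the sequence length, the number of features per step, and the layer sizes are placeholders chosen only for illustration:

```python
from tensorflow.keras import layers, models

# Many-to-one RNN: a sequence of 50 steps with 3 features per step
# is mapped to a single output (e.g., a scalar prediction)
model = models.Sequential([
    layers.Input(shape=(50, 3)),
    layers.LSTM(16),          # LSTM cell with forget, input, and output gates
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```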

Fig. 1.17
figure 17

A LSTM RNN, including long and short memory, and forget, input, and output gates. \(\sigma \) is the sigmoid function, colored boxes are typical NN layers, tanh is the hyperbolic tangent, and \(\otimes \), \(\oplus \), and tanh are componentwise operations

1.2.2.11 Generative Adversarial Networks (GAN)

A Generative Adversarial Network (GAN) (Goodfellow et al. 2020) is a type of ML based on game theory (a zero-sum game where one agent’s benefit is the other agent’s loss) with the purpose of learning the probability distribution of the set of training samples (i.e. of solving the generative modeling problem). Although different algorithms have been presented within the GAN paradigm, most are based on NN agents, consisting of a generative NN and a discriminative NN. These NNs have opposite targets. The generative NN tries to fool the discriminative NN, whereas the discriminative NN tries to distinguish original (true) data from generated data presented by the generative NN. With successive training rounds, both NNs learn—the generative NN learns how to fool the other NN, and the discriminative NN how not to be fooled. The type of NN depends on the problem at hand; for example, when distinguishing images, a CNN is typically used. In the falsification of photographs (deepfakes, Yadav and Salmani 2019), for instance, several images of a person are presented and the discriminator has to distinguish whether they are actual pictures or manufactured photos. This technology is used to generate fake videos, and to detect them (Duarte et al. 2019; Yu et al. 2022), and is used in CAE tasks like the reconstruction of turbulent velocity fields (by comparing images) (Deng et al. 2019). GANs are also used in the generation of compliant designs, for example in the aeronautical industry (Shu et al. 2020), and also to solve differential equations (Yang et al. 2020; Randle et al. 2020). A recent overview of GANs may be found in Aggarwal et al. (2021).

1.2.2.12 Ensemble Learning

While NNs may bring accurate predictions through extensive training, obtaining such predictions may not be computationally efficient. Ensemble learning consists of employing many low-accuracy but efficient methods to obtain a better prediction through a sort of averaging (or voting). Following our seismic vulnerability example, it would be like asking several experts to give a fast opinion (for example just showing them a photograph) about the vulnerability of a structure or a site, instead of asking one of them to perform a detailed study of the structure (Giacinto et al. 1997; Tang et al. 2022). The methods used may be, for example, shallow NNs and decision trees; a minimal sketch is given below.
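The following sketch combines a decision tree, a logistic regression, and a shallow NN by soft voting using Scikit-learn; the dataset and the choice of weak learners are placeholders standing in for, e.g., fast vulnerability assessments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Ensemble of cheap, low-accuracy learners combined by (soft) voting
ensemble = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("logreg", LogisticRegression(max_iter=500)),
                ("shallow_nn", MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000))],
    voting="soft")
ensemble.fit(X, y)
print(ensemble.score(X, y))
```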

1.3 Constraining to, and Incorporating Physics in, Data-Driven Methods

ML usually gives no insight into the physics of the problem. The classical procedures are considered “black boxes”, with inherent positive (McCoy et al. 2022) and negative (Gabel et al. 2014) attributes. While these black boxes are useful in applications to solve classical fuzzy problems, where they have been extensively applied in economy, image or speech recognition, pattern recognition, etc., they have inherently several drawbacks regarding their use in mechanical engineering and applied sciences. The first drawback is the large amount of data they require to yield relevant predictions. The second one is the lack of fulfillment of basic physics principles (e.g., the laws of thermodynamics). The third one is the lack of guarantees in the optimality or uniqueness of the prediction, or even guarantees in the reasonableness of the predicted response. The fourth one is the computational cost, if training is included, when compared with classical methods; although, once trained, the use may be much faster than with many classical methods. Probably the most important drawback is the lack of physical insight into the problem, because human learning is complex and needs a detailed understanding of the problem to seek creative solutions to unsolved problems. Indeed, in contrast to “unexplainable” AI, eXplainable Artificial Intelligence (XAI) is now also being advocated (Arrieta et al. 2020).

ML may be a good avenue to obtain engineering solutions, but to yield valuable (and reliable) scientific answers, physics principles need to be incorporated in the overall procedure. To this end, the predictions and learning of the previously overviewed methods, or other more elaborated ones, should be restricted to solution subsets that do fulfill all the basic principles. That is, conservation of energy, of linear momentum, etc. should be fulfilled. When doing so, we use data-driven physics-based machine learning (or modeling) (Ströfer et al. 2018), or “gray-box” modeling (Liu et al. 2021; Asgari et al. 2021; Regazzoni et al. 2020; Rogers et al. 2017). The simplest and probably most used method to impose such principles (an imposition called “whitening” or “bleaching”, Yáñez-Márquez 2020) is the use of penalties and Lagrange multipliers in the cost function (Dener et al. 2020; Borkowski et al. 2022; Rao et al. 2021; Soize and Ghanem 2020), but there are many options and procedures to incorporate physics either in the data or in the learning (Karpatne et al. 2017). The resulting methods and disciplines which mix data science and physical equations are often referred to as Physics Based Data Science (PBDS), Physics-Informed Data Science (PIDS), Physics-Informed Machine Learning (PIML) (Karniadakis et al. 2021; Kashinath et al. 2021), Physics Guided Machine Learning (PGML) (Pawar et al. 2021; Rai and Sahu 2020), or Data-Based Physics-Informed Engineering (DBPIE).
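As a minimal illustration of the penalty approach mentioned above, the following sketch fits a one-parameter model to noisy data by gradient descent while penalizing the violation of an assumed physical law; the model form, the law, and all numerical values are placeholders chosen only for illustration:

```python
import numpy as np

# Fit y = w * x to noisy data while penalizing violation of an assumed
# physical law dy/dx = 2 (so w should tend to 2); both are placeholders.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y_data = 2.0 * x + 0.1 * rng.normal(size=x.size)

w, alpha, penalty = 0.0, 0.01, 10.0
for _ in range(500):
    y_pred = w * x
    data_grad = np.mean(2 * (y_pred - y_data) * x)    # d/dw of the data misfit
    physics_grad = 2 * (w - 2.0)                      # d/dw of the penalty (dy/dx - 2)^2
    w -= alpha * (data_grad + penalty * physics_grad) # penalized cost function update
print(w)   # close to 2, driven by both the data and the physics penalty
```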

In a nutshell, data-based physically informed ML allows for the use of data science methods without most of the shortcomings of physics-uninformed methods. Namely, we do not need much data (Karniadakis et al. 2021), solutions are often meaningful, the results are more interpretable, the methods are much more efficient, and the number of meaningless spurious solutions is substantially smaller. The methods are no longer a sophisticated interpolation but can give predictions outside the domain covered by the training data. In essence, we incorporate the knowledge acquired over the last centuries.

In PBDS, meaningful internal variables play a key role. In classical engineering modeling, as in constitutive modeling, variables are either external (position, velocity, and temperature) or internal (plastic or viscous deformations, damage, and deformation history). The external variables are observable (common to all methods), whereas the internal variables, being non-observable, are usually based on assumptions to describe some internal state. Here, a usual difference with ML methods is that a physical meaning is typically assigned to internal variables in classical methods, but for example when using NNs, internal variables (e.g., those in hidden layers) have typically no physical interpretation. However, the sought solution of the problem relates external variables both through physical principles or laws and through state equations. To link both physical principles and state equations, an inherent physical meaning is therefore best given (or sought) for the internal ML variables (Carleo et al. 2019; Vassallo et al. 2021). Physical principles are theoretical, of general validity, and unquestioned for the problem at hand (e.g., mass, energy, momentum conservation, and Clausius-Duhem inequality), whereas state equations are a result of assumptions and observations at the considered scales, leading to equations tied to some conditions, assumptions, and simplifications of sometimes questionable generality and of more phenomenological nature.

In essence, the possible ML solutions obtained from state equations must be restricted to those that fulfill the basic physical principles, constituting the physically viable solution manifold, and that is often facilitated by the proper selection of the structure of the ML method and the involved internal variables. These physical constraints may be incorporated in ML procedures in different ways, depending on the analysis and the ML method used, as we briefly discuss below (see also an example in Ayensa Jiménez 2022).

1.3.1 Incorporating Physics in, and Learning Physics From, the Dataset

An objective may be to discover a hidden physical structure in data or physical relations in data (Chinesta et al. 2020). One purpose may be to reduce the dimension of the problem by discovering relations in data that lead to the reduction of complexity (Alizadeh et al. 2020; Aletti et al. 2015). This is similar to calculating dynamical modes of displacements (Bathe and Wilson 1973; Bathe 2006) or to discover the invariants when relating strain components in hyperelasticity (Weiss et al. 1996; Bonet and Wood 1997). Another objective may be to generate surrogate models (Bird et al. 2021; Straus and Skogestad 2017; Jansson et al. 2003; Liu et al. 2021) to discover which variables have little relevance to the physical phenomenon, or quantifying uncertainty in data (Chan and Elsheikh 2018; Trinchero et al. 2018; Abdar et al. 2021; Zhu et al. 2019). Learning physics from data is in essence a data mining approach (Bock et al. 2019; Kamath 2001; Fischer et al. 2006). Of course, this approach is always followed in classical analysis when establishing analytical models, for example when neglecting time effects for quasi-stationary problems, or when reducing the dimension of 3D problems to plane stress or plane strain conditions. However, ML seeks an unbiased automatic approach to the solution of a problem.

1.3.2 Incorporating Physics in the Design of a ML Method

A natural possibility to incorporate physics in the design of the ML method is to impose some equations, in some general form, onto the method, and the purpose is to learn some of the freedom allowed by the equations (Tartakovsky et al. 2018). That is the case when learning material parameters (typical in Materials Science informatics, Agrawal and Choudhary 2016; Vivanco-Benavides et al. 2022; Stoll and Benner 2021), selecting specific functions from possibilities (e.g., selecting hardening models or hyperelastic models from a library of functions, Flaschel et al. 2021, 2022), or learning corrections of models (e.g., deviations of the “model” from reality).

Physics in the design of the ML procedure may also be incorporated by imposing some specific meaning to the hidden variables (introducing physically meaningful internal variables as the backstress in plasticity) or the structure (as the specific existence and dependencies of variables in the yield function) (Ling et al. 2016). Doing so, the resulting learned relations may be better interpreted and will be in compliance with our previous knowledge (Abueidda et al. 2021; Miyazawa et al. 2019; Zhan and Li 2021).

A large amount of ML methods in CAE are devoted to learning constitutive (or state) equations (Leygue et al. 2018), with known conservation principles and kinematic relations (equilibrium and compatibility), as well as the boundary conditions (González et al. 2019b; He et al. 2021). In essence, we can think of a “physical manifold” and a “constitutive manifold”, and we seek the intersection of both for some given actions or boundary and initial conditions (Ibañez et al. 2018; He et al. 2021; Ibañez et al. 2017; Nguyen and Keip 2018; Leygue et al. 2018). Autoencoders are a good tool to reduce complexity and filter noise (He et al. 2021). Other methods are devoted to inferring the boundary conditions or material constitutive inhomogeneities (e.g., local damage) assuming that the general form of the constitutive relations is known (this is a ML approach to the classical inverse problem of damage/defect detection).

Fig. 1.18
figure 18

Data-based constitutive modeling. Left: purely data-driven technique, where no constitutive manifold is directly employed. Instead, the closest known data point is located and is used to compute the solution. Center, right: constitutive (e.g., stress–strain) data points are used to compute a constitutive manifold (which may include uncertainty quantification), which is then employed to compute the solution in a classical manner

Regarding the determination of the constitutive equations, the procedure may be purely data-driven (without the explicit representation of a constitutive manifold or constitutive relations, i.e. “model-free” Kirchdoerfer and Ortiz 2016, 2017; Eggersmann et al. 2021a; Conti et al. 2020; Eggersmann et al. 2021b) or manifold-based, in which case a constitutive manifold is established as a data-based constitutive equation. In the model-free family, we assume that a large amount of data is known, so a material data “point” is always close to the physical manifold (see Fig. 1.18 left). Then, while these techniques may be considered within the ML family, they are more data-driven deterministic techniques (raw data is employed directly, no constitutive equation is “learned”). In the manifold-based family (Fig. 1.18, center and right), the manifold may be explicit (e.g., spline-based, Sussman and Bathe 2009; Crespo and Montáns 2019; Latorre and Montáns 2017; Crespo et al. 2017; Coelho et al. 2017) or implicit (discrete or local, e.g., Lopez et al. 2018; Ibañez et al. 2020; Meng et al. 2018; Ibañez et al. 2017). This is a family of methods for which the objective is to learn the state equations from known (experimental or analytical) data points, probably subject to some physics requirements (as integrability). Within this approach, once the manifold is established, the computation of the prediction follows a scheme very similar to the use of classical methods (Crespo et al. 2017).

Remarkably, in some Manifold Learning approaches, physical requirements (which may include, or not, physical internal variables, Amores et al. 2020) may result in a substantial reduction of the experimental data needed (Latorre et al. 2017; Amores et al. 2021) and of the overall computational effort, resulting also in an increased interpretability of the solution. An important class of problems where ML in general, and Manifold Learning approaches in particular, are often applied with important success, is the generation of surrogate models for multiscale problems (Peng et al. 2021; Yan et al. 2020; White et al. 2019; El Said and Hallett 2018; Alber et al. 2019; Brunton and Kutz 2019). The solutions of nonlinear multiscale problems, in particular those which use Finite Element based computational homogenization (FE squared) (FE2) techniques, are still very expensive, because at each integration point, a FE problem representing the Representative Volume Element (RVE) must be considered (Fish et al. 2021; Arbabi et al. 2020; Fuhg et al. 2021). Then, surrogate models which represent the equivalent behavior at the continuum level are extremely useful (Fig. 1.19). These surrogate models may be obtained using different techniques. The use of Neural Networks is one option (Wang and Sun 2018; Wang et al. 2020). Then the dataset for the training is obtained from repeated off-line simulations with different loading and boundary conditions at different deformation levels and with different loading histories (Logarzo et al. 2021). Another option is to use surrogate models based on the equivalence of physical quantities as stored and dissipated energies (Crespo et al. 2020; Miñano and Montáns 2018). Reduced Order Methods are also important, especially in nonlinear path-dependent procedures to determine the main internal variables or simplest representation giving sufficient accuracy (Singh et al. 2017; Rocha et al. 2020). An important aspect in surrogate modeling is the possibility of inversion of the map (Haghighat et al. 2021; Raissi et al. 2019; Haghighat and Juanes 2021), which is crucial when prediction is not the main purpose of the machine learning procedure but the main objective is to learn about the material or its spatial distribution. The use of autoencoders can be effective if decompression is fundamental in the process (Kim et al. 2021; Bastek et al. 2022; Xu et al. 2022; Jung et al. 2020).

Fig. 1.19
figure 19

Surrogate modeling representing the micromechanics. Examples are numerical manifolds or Neural Networks

1.3.3 Data Assimilation and Correction Methods

The use of ML models, as when using any model (including classical analytical models; see, for example, Bathe 2006), may result in a significant error in the prediction of the actual physical response. This error may be produced either by insufficient data (or insufficient quality of the data because of noise or lack of completeness), or by inaccuracy of the model (e.g., due to too few layers in a NN or erroneous or oversimplifying assumptions) (Haik et al. 2021). The problems are then how to incorporate new data (labeled or unlabeled) into the model (Buizza et al. 2022), how to enrich the model to improve the predictions (Singh et al. 2017), and how to augment physical models with machine-learned bias (Volpiani et al. 2021) (hybrid models). These problems are typically encountered in dynamics (Muthali et al. 2021), and the solutions are often similar to those employed in control theory (Rubio et al. 2021), such as the use of Kalman methods (Zhang et al. 2020). Machine learning techniques may be used for self-learning complex physical phenomena such as the sloshing of fluids (Moya et al. 2020). In essence, the proposal here is to assume that there is a model-predicted response \(\textbf{y}^{\text {model}}\) and a true (say “experimental”) response \(\textbf{y}^{\text {exp}}\) (Moya et al. 2022). The difference is the error to be corrected, namely \(\textbf{y}^{\text {corr}} =\textbf{y}^{\text {exp}} -\textbf{y}^{\text {model}}\). This error is corrected in further predictions by assuming that there is an indeterminacy either in the input data (statistical error) or in the model (some unknown variables that are not being considered). Note that the statistical error case is conceptually similar to the quantification of uncertainty. In case the model needs corrections, some formalism may be employed to introduce physics corrections to learned models, for example, correcting dissipative behavior in assumed (hyper)elastic behavior (or vice versa). In case there are some indeterminacies in the model, we can assume that the model is of the form

$$\begin{aligned} \textbf{y}=f(\textbf{x};\textbf{w},\mathbf {\omega }) \end{aligned}$$
(1.51)

where the \(\textbf{w}\) are the parameters determined previously (e.g., during the usual model learning process and now fixed) and \(\mathbf {\omega }\) are the parameters correcting the model by minimizing the error. This model correction process using new data is called data assimilation. In Dynamic Data-Driven Application Systems (DDDAS), the concepts of Digital Twins and Hybrid Twins are employed. A Digital Twin (Glaessgen and Stargel 2012) is basically a virtual (sometimes comprehensive) model which is used as a replication of a system in real life. For example, a Formula-1 simulator, Mayani et al. 2018, or a spacecraft simulator, Ye et al. 2020; Wang 2020 may be considered a Digital Twin (Luo et al. 2020). A Digital Twin serves as a platform to try new solutions when it is difficult or expensive to try them in the actual physical system. Digital Twins are increasingly used in industry in many fields (Bhatti et al. 2021; Garg and Panigrahi 2021; Burov and Burova 2020). This virtual platform may contain classical analytical models, data-driven models, or a combination of both (which is currently very usual in complex systems). The concept of Hybrid Twin (Chinesta et al. 2020) (or self-learning digital twin, Moya et al. 2020) is a step forward, which mixes the virtual/digital twin model with model order reductions and parametrized solutions. The purpose is to have a twin in real time, which may be used to predict the behavior of a system in advance and correct the system (Moya et al. 2022) or take any other measure; that is, in essence to control a complex physical system. The dynamic equation of the Hybrid Twin is

$$\begin{aligned} \dot{\textbf{X}}(t;\mathbf {\mu })=\textbf{A}(\textbf{X},t;\mathbf {\mu })+\textbf{B}(\textbf{X},t)+\textbf{C}(t)+\textbf{R}(t) \end{aligned}$$
(1.52)

where the \(\mathbf {\mu }\) are the model parameters, \(\textbf{A}(\textbf{X},t;\mathbf {\mu })\) is the (possibly analytical) model contribution given those parameters (a linear model would be \(\textbf{A}(\mathbf {\mu })\textbf{X}\)) (Sancarlos et al. 2021), \(\textbf{B}(\textbf{X},t)\) is a data-based correction to the model (a continuous update from measurements), \(\textbf{C}(t)\) are the external actions, and \(\textbf{R}(t)\) is the (unbiased and unpredictable) noise. We use the word “hybrid” (Champaney et al. 2022) because analytical and data-based approaches are employed. Hybrid Twins have been applied in various fields, for example in simulating acoustic resonators (Martín et al. 2020).

1.3.4 ML Methods Designed to Learn Physics

A different objective from incorporating physics in the ML method is to use a ML method to learn physics. One example would be to learn constitutive equations without prior (or with minimal) assumptions—a case that is similar to those discussed above but, for example, without neglecting a priori the influence of some terms or without assuming the nature of the constitutive equation (for example, not assuming elasticity, plasticity, or other). Another example is to learn new physical or fundamental evolution equations in nature. A successful (and quite simple) case is the Sparse Identification of Physical Systems, in particular the Sparse Identification of Nonlinear Dynamics (SINDy) (Brunton et al. 2016; Rudy et al. 2017). In this approach, the nonlinear problem

$$\begin{aligned} \dot{\textbf{X}}=\textbf{A}(\textbf{X}) \end{aligned}$$
(1.53)

is re-written as

$$\begin{aligned} \dot{\textbf{X}}= \mathbf {\Theta }(\textbf{X})\mathbf {\Xi } \end{aligned}$$
(1.54)

where \(\mathbf {\Xi }\) is a sparse matrix of dynamical coefficients and \(\mathbf {\Theta }(\textbf{X})\) contains a library of functions evaluated at \(\textbf{X}\). In the Lorenz System shown in Fig. 1.20 (Brunton et al. 2016), \(\mathbf {\Theta }(\textbf{X})\) involves a set of nonlinear polynomial combinations of the components of \(\textbf{X}\). The purpose here is to obtain the possibly simplest yet accurate description (the parsimonious model) in terms of the expansion functions, and this is performed by the technique of sparse regression, which promotes sparsity in underdetermined least squares regression by replacing the norm-2 Tikhonov regularization by a norm-1 penalization (Tibshirani 1996), although in Brunton et al. (2016) the authors used a slightly different technique. The optimal penalty may be obtained by minimizing a cross-validation error (i.e. the solution which is accurate but avoids overfitting). The method has been applied to a variety of physics problems to determine their differential equations (Rudy et al. 2017). Similar approaches are Physics-Informed Spline Learning (PiSL) (Sun et al. 2021), which represents an improvement for data representation allowing for explicit derivatives and uses alternating direction optimization with adaptive Sequential Threshold Ridge regression (STRidge) (Rudy et al. 2017) for promoting sparsity, and also more classical genetic and symbolic regression procedures (Searson 2009; Schmidt and Lipson 2009, 2010). An overview of these techniques and others may be found in Brunton and Kutz (2022); see also Zhang and Liu (2021) for a progressive approach for considering uncertainties.
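A minimal sketch of the sparse identification idea is given below, using a sequentially thresholded least-squares regression (a simple sparsity-promoting variant in the spirit of Brunton et al. 2016) on a synthetic one-state system; the candidate library, the threshold, and the data are placeholders:

```python
import numpy as np

def sindy(X, Xdot, threshold=0.1, n_iter=10):
    # Library Theta(X) of candidate functions (here: constant, linear, quadratic)
    Theta = np.column_stack([np.ones(len(X)), X, X**2])
    # Sequentially thresholded least squares promotes sparsity in Xi (Eq. 1.54)
    Xi = np.linalg.lstsq(Theta, Xdot, rcond=None)[0]
    for _ in range(n_iter):
        Xi[np.abs(Xi) < threshold] = 0.0
        for k in range(Xdot.shape[1]):
            big = np.abs(Xi[:, k]) >= threshold
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], Xdot[:, k],
                                             rcond=None)[0]
    return Xi   # sparse coefficient matrix in Xdot ~ Theta(X) Xi

# Synthetic example: xdot = -2 x (one state, placeholder data)
t = np.linspace(0.0, 5.0, 200)
x = np.exp(-2.0 * t).reshape(-1, 1)
xdot = np.gradient(x[:, 0], t).reshape(-1, 1)
print(sindy(x, xdot, threshold=0.05))   # approximately [0, -2, 0]
```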

Fig. 1.20
figure 20

Reproduced from Brunton et al. (2016)

Sparse Identification of Nonlinear Dynamics. Case of Lorenz System.

These approaches, such as those of the SINDy type, can trivially address the correction of an imperfect model (i.e. the Hybrid Twin). It simply suffices to consider a correction of Eq. (1.53)

$$\begin{aligned} \dot{\textbf{X}}-\textbf{A}(\textbf{X})=\textbf{B}(\textbf{X}) \end{aligned}$$
(1.55)

where \(\textbf{B}(\textbf{X})\) is the measured discrepancy to be corrected between the results obtained from the inexact model and the experimental results. As performed in mathematics and physics, the key for simplification and possible linearization of a complex problem consists of finding a proper (possibly reduced) space of (possibly transformed) input variables to re-write the problem. As mentioned, NNs, in particular autoencoders, can be used to find the space, to which, thereafter, a SINDy approach may be applied to create a Digital or Hybrid Twin (Champion et al. 2019). These mixed NN approaches have also been employed in multiscale physics transferring learning through scales by increasingly deep and wide NNs (Liu et al. 2020), also employing CNNs (Liu et al. 2022). Of course, Dynamic Mode Decomposition (DMD) (Schmid 2010; Tu 2013; Schmid 2011; Jovanović et al. 2014; Demo et al. 2018), a procedure to determine coupled spatio-temporal modes for nonlinear problems based on Koopman (composition operator) theory (Williams et al. 2015), is also used for incorporating data into physical systems, or determining the physical system equations themselves. The idea is to obtain two sets (“snapshots”) of spatial measurements separated by a given \(\Delta t\), namely \(\,^t\textbf{X}\) and \(\,^{t+\Delta t}\textbf{X}\). Then, the eigenvectors of \(\textbf{A}=\,^{t+\Delta t}\textbf{X}\,^{t}\textbf{X}^{\sim 1}\), where \(\,^t\textbf{X}^{\sim 1}\) is the pseudoinverse, are the best regressors to the linear model, that is, the least-squares best fit of the nonlinear model, compatible with the snapshots. In practice, the \(\textbf{A}\) matrix is usually not computed because working with the SVD of \(\textbf{X}\) is more efficient (Proctor et al. 2016).
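A minimal sketch of this (exact) DMD computation via the truncated SVD of the snapshot matrix follows; the snapshot data and the truncation rank are placeholders:

```python
import numpy as np

def dmd_modes(X0, X1, r=10):
    # Exact DMD: regression of X1 ~ A X0 using the truncated SVD of X0
    U, s, Vh = np.linalg.svd(X0, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]
    # Reduced operator A_tilde = U^T A U, avoiding the full (large) A matrix
    A_tilde = U.T @ X1 @ Vh.T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X1 @ Vh.T @ np.diag(1.0 / s) @ W   # (exact) DMD modes
    return eigvals, modes

# Snapshot pairs separated by a given time step (placeholder data):
# columns of the data matrix are successive states in time
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 60))
X0, X1 = data[:, :-1], data[:, 1:]
eigvals, modes = dmd_modes(X0, X1, r=5)
```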

Other techniques to discover physical relations (or nonlinear differential equations), as well as to simultaneously obtain physical parameters and fields, are Physics-Informed Neural Networks (PINNs) (Raissi and Karniadakis 2018; Raissi et al. 2019; Pang et al. 2019; Yang et al. 2021). For example, the viscosity, the density, and the pressure, together with the velocity field in time, may be obtained assuming the Navier–Stokes equations as background and employing a NN as the learning engine to match snapshots. Moreover, these methods may be combined with time integrators for obtaining the nonlinear parameters of any differential equation, including higher derivatives, just from discretized experimental snapshots (Meng et al. 2020; Zhang et al. 2020). Other applications include inverse problems in discretized conservative settings (Jagtap et al. 2020).

1.3.4.1 Deep Operator Networks

While it is very well known that the so-called universal approximation theorem guarantees that a neural network can approximate any continuous function, it is also possible to approximate continuous operators by means of neural networks (Chen and Chen 1995). Based on this fact, Lu and coworkers have proposed the Deep Operator Networks (DeepONets) (Lu et al. 2021).

A DeepONet typically consists of two different networks working together: one to encode the input function at a number of measurement locations (the so-called branch net) and a second one (the trunk net) to encode the locations for the output functions. Assume that we seek to characterize an operator \(F:X\rightarrow Y\), with X, Y two topological spaces. For any function \(x\in X\), this operator produces \(G=F(x)\), the output function. For any point y in the domain of F(x), G(y) is a real number. A DeepONet thus learns from pairs (x, y) to produce the operator. However, for an efficient training, the input function x is sampled at discrete spatial locations.

In some examples, DeepONets showed very small generalization errors and even exponential error convergence with respect to the training dataset size. This is however not yet fully understood. DeepONets have been applied, for example, to predict crack paths in brittle materials (Goswami et al. 2022), instabilities in boundary layers (Di Leoni et al. 2021), and the response of dynamical systems subjected to stochastic loadings (Garg et al. 2022).

Recently, DeepONets have been generalized by parameterizing the integral kernel in Fourier space, giving rise to the so-called Fourier Neural Operators (Li et al. 2020). These networks have also gained a high popularity, and have been applied to weather forecasting, for instance (Pathak et al. 2022).

1.3.4.2 Neural Networks Preserving the Physical Structure of the Problem

Within the realm of PIML approaches, a new family of methods has recently been proposed. The distinctive characteristic is that these new techniques view the supervised learning process as a dynamical system of the form

$$\begin{aligned} \dot{\textbf{z}} = \textbf{f}(\textbf{z},t),\text { with }\textbf{z}(0)=\textbf{z}_0 \end{aligned}$$
(1.56)

with \(\textbf{z}\) being the set of variables governing the problem. The supervised learning problem will thus be to establish \(\textbf{f}\) in such a way as to reach an accurate description of the evolution of the variables. By formulating the problem in this way, the analyst can use the knowledge already available, and established over centuries, on dynamical systems. For instance, adopting a Hamiltonian perspective on the dynamics and enforcing \(\textbf{f}\) to be of the form

$$\begin{aligned} \dot{\textbf{z}}=\textbf{L}\nabla {H} \end{aligned}$$
(1.57)

where \(\textbf{L}\) is the classical (skew-symmetric) symplectic matrix. This structure ensures that the learnt dynamics conserve energy, because the evolution is derived from the Hamiltonian H. Many recent references have exploited this approach, either in Hamiltonian or in Lagrangian frameworks (Greydanus et al. 2019; Mattheakis et al. 2022; Cranmer et al. 2020). If the system of interest is dissipative—which is, by far, most frequently the case—a second potential must be added to the formulation as

$$\begin{aligned} \dot{\textbf{z}}=\textbf{L}\nabla {H} + \textbf{M}\nabla {S} \end{aligned}$$
(1.58)

where S represents the so-called Massieu potential. To ensure the fulfillment of the first and second principles of thermodynamics, additional restrictions (the so-called degeneracy conditions) must be imposed, i.e.

$$\begin{aligned} \textbf{L}\nabla {S} = \textbf{M}\nabla {H} = \textbf{0} \end{aligned}$$
(1.59)

These equations essentially state that entropy has nothing to do with energy conservation and, in turn, energy potentials have nothing to do with dissipation. The resulting NN formulations produce predictions that comply with the laws of thermodynamics (Hernández et al. 2021, 2022).

1.4 Applications of Machine Learning in Computer Aided Engineering

In this section we describe some applications of machine learning in CAE. The main purpose is to briefly focus on a variety of topics and ML approaches employed in several fields, but not to give a comprehensive review. Hence, given the vast literature already available, developed in the last few years, many important works have likely been omitted. However, even though the field of applications is very broad, the main ideas fundamental to the techniques are given in the previous sections.

1.4.1 Constitutive Modeling and Multiscale Applications

The main field of application of machine learning techniques in CAE is ML constitutive modeling, both at the continuum scale and for easing multiscale computations. As previously mentioned, applicable procedures are model-free approaches, data-driven manifold learning, data-driven model selection and calibration, and surrogate modeling. Another interesting application of ML, in particular NNs, is to improve results from coarse FE models without resorting to expensive fine computations, e.g., “zooming” (Yamaguchi and Okuda 2021). There are several reviews of applications of ML in constitutive modeling (especially using NNs), in continuum mechanics (Bock et al. 2019), for soils (Zhang et al. 2021), composites (Zhang and Friedrich 2003; Liu et al. 2021; El Kadi 2006), and materials science (Hkdh 1999). An earlier review of NN applications in computational mechanics in general can also be found in Yagawa and Okuda (1996). Below we briefly review some applications.

1.4.1.1 Linear and Nonlinear Elasticity

One of the simplest modeling problems and, hence, one of the most explored ones is the case of elasticity. The linear elastic problem, addressed from a model-free data-driven method is analyzed in Kirchdoerfer and Ortiz (2016), Conti et al. (2018), and even earlier in Wang et al. (2011) for cloths in the animation and design industries. Data-driven nonlinear elasticity is also analyzed in several works (Conti et al. 2020; Stainier et al. 2019; Nguyen and Keip 2018), and applied to soft tissues (González et al. 2020) and foams (Frankel et al. 2022).

In particular, data-driven specific solvers are needed if model-free methods are employed, and some effort is directed to developing such solvers and data structuring methods for the task (Eggersmann et al. 2021a, b; Platzer et al. 2021). Kernel regression is also employed (Kanno 2018).

Another common methodology is the use of data-driven constitutive manifolds (Ibañez et al. 2017), where identification and reduction of the constitutive manifolds allow for a much more efficient approach. NNs are as well used in finite deformation elasticity (Nguyen-Thanh et al. 2020; Wang et al. 2022).

Remarkably, nonlinear elasticity is one of the cases where physics-informed methods are important, because true elasticity means integrable path-independent constitutive behavior, i.e. hyperelasticity. Classical ML methods are not integrable (hence not truly elastic). To fulfill such requirement, specific methods are needed (González et al. 2019b; Chinesta et al. 2020; Hernandez et al. 2021). One of the possibilities is to posit the state variables and a reduced expression of the hyperelastic stored energy (which may be termed as “interpretable” ML models Flaschel et al. 2021). Then, this energy may be modeled, for example, by splines or B-splines. This approach, based on the Valanis–Landel assumption, was pioneered by Sussman and Bathe for isotropic polymers (Sussman and Bathe 2009) and extended later for anisotropic materials (Latorre and Montáns 2013) like soft biological tissues (fascia, Latorre et al. 2017, skin Romero et al. 2017, heart Latorre and Montáns 2017, muscle Latorre et al. 2018, Moreno et al. 2020), compressible materials (Crespo et al. 2017), auxetic foams (Crespo and Montans 2018; Crespo et al. 2020), and composites (Amores et al. 2021). Polynomials in terms of invariants are also employed, with the coefficients determined by sparse regression (Flaschel et al. 2021). Another approach is to select models from a database, and possibly correct them (González et al. 2019a; Erchiqui and Kandil 2006), or select specific function models for the hyperelastic stored energy using machine learning methods (e.g., NNs) (Flaschel et al. 2021; Vlassis et al. 2020; Nguyen-Thanh et al. 2020). In particular, polyconvexity (to guarantee stability and global minimizers for the elastic boundary-value problem) may also be imposed in NN models (Klein et al. 2022). Anisotropy in hyperelasticity may be learned from data with NNs (Fuhg et al. 2022a).

In material datasets, noise and outliers may be a relevant issue, both regarding accuracy and their promotion of overfitting. Clustering has been employed in model-free methods to assign a different relevance depending on the distance to the solution and using an estimation based on maximum entropy (Kirchdoerfer and Ortiz 2017). For spline-based constitutive modeling, experimental data reduction using stability-based penalizations allows for the use of noisy datasets and outliers avoiding overfitting (Latorre and Montáns 2020).

1.4.1.2 Plasticity, Viscoelasticity, and Damage

ML modeling of nonconservative effects is still in quite an incipient state because path-dependency requires the modeling of latent internal variables and the knowledge of the previous deformation path (González et al. 2021). However, some early works using NNs are available (Panagiotopoulos and Waszczyszyn 1999). The amount of needed data is much larger because the possible deformation paths are infinite, but there are already a relevant number of works dealing with inelasticity. In the case of damage, spline-based What-You-Prescribe is What-You-Get (WYPiWYG) large-strain modeling is available both for isotropic (Miñano and Montáns 2015) and anisotropic materials (Miñano and Montáns 2018). Crack growth in the aircraft industry has also been determined with RNNs (Nascimento and Viana 2020). Of course, ML has been for a long time applied to model fatigue (Lee et al. 2005).

Plasticity is probably the most studied case of the nonconservative behaviors (Waszczyszyn and Ziemiański 2001). For the case of data-driven (model-free) “extended experimental constitutive manifolds” including internal variables, the LArge Time INcrement (LATIN) method (solving by separating the constitutive and compatibility/equilibrium sets and looking for the intersection) has been successfully used (Ladevèze et al. 2019); see also Ibañez et al. (2018).

Data-driven model-free techniques in plasticity and viscoelasticity have been developed using more general history variables (like the history of stresses or strains as typically pursued for hereditary models) (Eggersmann et al. 2019; Ciftci and Hackl 2022). FFNNs with PODs have been employed to fit several plasticity stress–strain behaviors. NNs are also used to replace the stress integration approaches in FE analysis of elastoplastic models (Jang et al. 2021). In general, RNNs (Mozaffar et al. 2019; Borkowski et al. 2022) and CNNs (Abueidda et al. 2021) are a good resort for predicting plastic paths, and sophisticated LSTM and Gated Recurrent Unit (GRU) schemes have been reported to give excellent predictions even for complex paths (Wang et al. 2020).

In materials science, ML is employed to predict the cyclic stress–strain behavior depending on the microstructure of the material obtained from electron backscatter diffraction (EBSD) analysis. The shape of the yield function can also be determined by employing sparse regression from a strain map and the cell load in a non-homogeneous test (like considering a plate with holes) (Flaschel et al. 2022). A mixture of analytical formulas and FFNN machine learning has been employed to replace the temperature- and rate-dependent term of the Johnson–Cook model (Li et al. 2019). In plasticity, physics-based modeling is incorporated by assuming the existence of a stored energy, a plasticity yield function, and a plastic flow rule. These may be obtained by NNs learned from numerical experiments on polycrystal databases, resulting in a more robust ML approach than using the classical black-box ML scheme (Vlassis and Sun 2021). Support Vector Regression (SVR), Gaussian Process Regression (GPR), and NNs have been used to determine data-driven yield functions with the convexity constraints required by the theory (Fuhg et al. 2022b). Automatic hyperparameter (self-)learning has been addressed for NN modeling of elastoplasticity in Fuchs et al. (2021).

1.4.1.3 Fracture

Fracture phenomena may also be modeled using NNs (Theocaris and Panagiotopoulos 1993; Seibi and Al-Alawi 1997) and data-driven model-free techniques (Carrara et al. 2020). Data-driven model extraction from experimental data and knowledge transfer (Goswami et al. 2020) have been applied to obtain predictions in 3D models from 2D cases (Liu et al. 2021). Data-driven approaches are used to enhance fracture paths in simulations of random composites and in model reduction to avoid high fidelity phase-field computations (Guilleminot and Dolbow 2020). SVMs and variants have been used for predicting fracture properties, e.g., Yuvaraj et al. (2013), Kulkrni et al. (2011), and so have been other methods like BNN, Genetic Algorithm (GA), and hybrid systems; see, for example, Nasiri et al. (2017), Hoshyar et al. (2020).

1.4.1.4 Multiscale and Composites Modeling

The modeling of complex materials is one of the fields where machine learning may bring about significant advances in CAE (Peng et al. 2021), in particular when nonlinear behavior is modeled (Jackson et al. 2019). This is particularly the case when the macroscopic behavior or the physical properties depend in a complex manner on a specific microstructure (Fish et al. 2021) or on physics equations and phenomena only seen at a micro- or smaller scale, as atomistic (Caccin et al. 2015; Kontolati et al. 2021; Wood et al. 2019), molecular (Xiao et al. 2020), or cellular (Verkhivker et al. 2020).

ML allows for the simpler implementation of first-principles in multiscale simulations (Hong et al. 2021), describing physical macroscopic properties, like also in chaotic dynamical systems for which the highly nonlinear behavior depends on complex interactions at smaller scales (e.g., weather and climate predictions) (Chattopadhyay et al. 2020). Generating surrogate models to reproduce the observed macroscopic effects due to complex phenomena at the microscale (Wirtz et al. 2015) is often only possible through ML and Model Order Reduction (MOR) (Wang et al. 2020; Yvonnet and He 2007). Even in the simplest cases, ML may substantially speed up the expensive computational costs of classical nonlinear FE2 homogenization techniques (Feng et al. 2022; Wu et al. 2020), allowing for real-time simulations (Rocha et al. 2021). The nonlinear multiscale case is complex because an infinite number of simulations would be needed for a complete general database. However, a reduced dataset may be used to develop a numerical constitutive manifold with sufficient accuracy, e.g., using Numerically EXplicit Potentials (NEXP) (Yvonnet et al. 2013). Material designs are often obtained from inverse analyses facilitated by parametric ML surrogate models (Jackson et al. 2019; Haghighat et al. 2021). In particular, ML may be employed to determine the phase distributions in heterogeneous materials (Valdés-Alonzo et al. 2022).
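A minimal sketch of a surrogate for the expensive micro-scale solve is given below: a Gaussian Process regressor is trained to map macroscopic strain states to homogenized stresses. The data here are generated from a placeholder function; in an actual FE2 setting they would come from RVE simulations, and the kernel and sampling choices are only assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical training data: macroscopic strain states and the homogenized
# stresses an RVE simulation would return (replaced here by a toy nonlinear law).
rng = np.random.default_rng(1)
macro_strain = rng.uniform(-0.02, 0.02, size=(60, 3))          # (e11, e22, e12)
macro_stress = 1e3 * macro_strain + 5e4 * macro_strain**3       # placeholder response

# The GP acts as a cheap surrogate of the micro-scale solve in a multiscale loop.
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.01), normalize_y=True)
surrogate.fit(macro_strain, macro_stress)

new_strain = np.array([[0.01, -0.005, 0.002]])
predicted_stress = surrogate.predict(new_strain)                 # used in place of the RVE model
```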

The modeling of classical fiber-based and complex composite heterogeneous materials often requires multiscale approaches (Pathan et al. 2019; Hadden et al. 2015; Kanouté et al. 2009) because modeling of interactions at the continuum level requires inaccurate assumptions. CNNs are ideal for dealing with the relation of an unstructured RVE with continuum equivalent properties. In particular, ML may be used for dealing with stochastic distributions of constituents (Liu et al. 2022). Modeling of complex properties such as composite phase changes for thermal management in Li-ion batteries may be performed with CNNs (Kolodziejczyk et al. 2021). Indeed, CNNs can also be used for performing an inverse analysis (Sorini et al. 2021). In general, many complex properties and effects observed macroscopically, but through effects mainly attributed to the microscale, are often addressed with different ML techniques, including CNNs, e.g., Field et al. (2021), Nayak et al. (2022), and Koumoulos et al. (2019).

1.4.1.5 Metamaterials Modeling

Metamaterials are architected materials with inner custom-made structure. With the current development of 3D printing, metamaterial modeling and design is becoming an important field (Kadic et al. 2019; Bertoldi et al. 2017; Zadpoor 2016; Barchiesi et al. 2019) because a material with unique salient properties may be designed ad libitum allowing for a wide range of applications (Surjadi et al. 2019). Their design has evolved from the classical optimization-based approach (Sigmund 2009). ML methods for the design of metamaterials are often used with two objectives. The first objective is to generate simple surrogate models to accelerate simulations avoiding FE modeling to the very fine scale describing the structure, especially when nonlinearities are important. The second objective is to perform analyses using a metamaterial topology parametrization which allows for an effective metamaterial design from macroscopic desired properties. Examples of ML approaches for metamaterials pursuing these two objectives can be found in, e.g., Wu et al. (2020), Fernández et al. (2022b), Zheng et al. (2020), and Wilt et al. (2020).

1.4.2 Fluid Mechanics Applications

Fluid phenomena and related modeling approaches are very rich, spanning from the breakup of liquid droplets under different conditions (Krzeczkowski 1980; Roisman et al. 2018; Liu et al. 2018) to smoke from fires in tunnels (Gannouni and Maad 2016; Wu et al. 2021), emissions from engines (Khurana et al. 2021; Baklacioglu et al. 2019), flow and wake effects in wind turbines (Clifton et al. 2013; Ti et al. 2020), and free surface flow dynamics (Becker and Teschner 2007; Scardovelli and Zaleski 1999). The difficulty in obtaining accurate and efficient solutions, especially when effects at multiple scales are important, has fostered the introduction of ML techniques. We briefly review some representative works.

1.4.2.1 Turbulence Flow Modeling

The modeling of turbulence is an important aspect in the solution of the Navier–Stokes equations of fluid flows. Here ML techniques can be of value.

The ML procedures in turbulence often build on the Reynolds Averaging decomposition, \(\textbf{u}(\textbf{x},t)=\bar{\textbf{u}}(\textbf{x})+\tilde{\textbf{u}}(\textbf{x},t)\) which splits the flow \(\textbf{u}(\textbf{x},t)\) into an average \(\bar{\textbf{u}}(\textbf{x})\) time-independent component and a fluctuating component \(\tilde{\textbf{u}}(\textbf{x},t)\) with zero average, obtaining the incompressibility conditions \(\nabla \cdot \bar{\textbf{u}}=0\) and \(\nabla \cdot \tilde{\textbf{u}}=0\). Then, the Navier–Stokes equations are written in terms of the Reynolds stresses \(\rho \overline{\tilde{\textbf{u}}\otimes \tilde{\textbf{u}}}\)

$$\begin{aligned} \nabla \bar{\textbf{u}}\cdot \bar{\textbf{u}}=\frac{1}{\rho }\nabla (-p\textbf{I}+2\mu \nabla ^s\bar{\textbf{u}}-\rho \overline{\tilde{\textbf{u}}\otimes \tilde{\textbf{u}}}) \end{aligned}$$
(1.60)

for which a turbulence closure model is assumed, e.g., eddy viscosity model or the more involved \(k-\varepsilon \) (Gerolymos and Vallet 1996) or Spalart–Allmaras models (Spalart and Allmaras 1992). In Eq. (1.60), \(\nabla ^s\bar{\textbf{u}}\) is the average deviatoric strain-rate tensor. The framework in Eq. (1.60) gives the two commonly used models: the Reynolds-Averaged Navier–Stokes (RANS) model, best for steady flows (Speziale 1998; Kalitzin et al. 2005), and the Large Eddy Simulations (LES) model, using a subgrid-scale model, thus much more expensive computationally, but best used to predict flow separation and fine turbulence details. RANS closure models have been explored using ML. For example, the work reported (Zhao et al. 2020) trains a turbulence model for wake mixing using a CFD-driven Gene Expression Programming (an evolutionary algorithm). Physics-informed ML may also be used for augmenting turbulence models, in particular to overcome the difficulties of ill-conditioning of the RANS equations with typical Reynolds stress closures, focusing on improving mean flow predictions (Wu et al. 2018). Results of using ML to improve accuracy of closure models are, for example, given in Wackers et al. (2020), Wang et al. (2017). One of the important problems in modeling turbulence and accelerating full field simulations is to upscale the finer details, e.g., vorticity from the small to the larger scale, using a lower resolution (grid) analysis. These upscaling procedures may be performed by inserting NN corrections which learn the scale evolution relations, greatly accelerating the computations by allowing lower resolution (Kochkov et al. 2021).
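The following minimal sketch conveys the general idea of a data-driven closure correction: a small neural network is fit to map local mean-flow features to an unresolved Reynolds-stress quantity, which a RANS solver would then evaluate at every cell. The features, the target, and the architecture are synthetic placeholders, not those of the works cited above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative mean-flow features at each grid point (e.g., normalized strain-rate
# and rotation-rate invariants) and a synthetic Reynolds-stress anisotropy target.
rng = np.random.default_rng(2)
mean_flow_features = rng.normal(size=(2000, 4))
anisotropy_target = np.tanh(mean_flow_features @ np.array([0.5, -0.3, 0.2, 0.1]))

# The network plays the role of a data-driven closure: given local mean-flow
# features it returns the unresolved (Reynolds-stress) contribution.
closure_model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
closure_model.fit(mean_flow_features, anisotropy_target)

# Inside a RANS solver loop, the model would be evaluated at every cell:
corrected_term = closure_model.predict(mean_flow_features[:5])
```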

1.4.2.2 Shock Dynamics

More accurate and faster shock-capturing by NN has been pursued in Stevens and Colonius (2020), where ML has been applied to improve finite volume methods to address discontinuous solutions of PDEs. In particular, Weighted Essentially Non-Oscillatory Neural Network (WENO-NN) approaches establish the smoothness of the solution to avoid spurious oscillations, still capturing accurately the shock, where the ML procedure facilitates the computation of the optimal nonlinear coefficients of each cell average.

1.4.2.3 Reduced Models for Accelerating Simulations

An important application of ML in fluid dynamics and aerodynamics is the development of reduced order models. In essence, these models capture the main dominant coarse flow structures, with the effect of the fine structures included, and provide a faster, simpler model for analysis, i.e., a surrogate model similar in concept to those used in multiscale analysis. As mentioned previously, there are many techniques used for this task, such as DMD (Schmid et al. 2011; Hemati et al. 2014) or the more general POD (Berkooz et al. 1993; Aubry 1991; Rowley 2005), PGD (Dumon et al. 2011; Chinesta et al. 2011), PCA (Audouze et al. 2009), and SVD (Lorente et al. 2008; Braconnier et al. 2011). Autoencoders employing different NN types (Kramer 1991; Murata et al. 2020; Xu and Duraisamy 2020; Maulik et al. 2021) and other nonlinear extensions of the previous techniques are widely used for dealing with the nonlinear cases typical in fluid dynamics (Gonzalez and Balajewicz 2018). These techniques also frequently include physics information to guarantee consistency (Erichson et al. 2019).
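As a concrete example of the linear case, the sketch below builds a POD basis from a snapshot matrix via the SVD and truncates it by an energy criterion; the toy one-dimensional field and the 99.9% threshold are illustrative choices.

```python
import numpy as np

# Snapshot matrix: each column is the flow field (here a toy 1D field) at one
# time instant or parameter value.
x = np.linspace(0.0, 1.0, 200)
snapshots = np.column_stack([np.sin(2 * np.pi * k * x) * np.exp(-0.1 * k)
                             for k in range(1, 41)])            # 200 dofs x 40 snapshots

# POD modes are the left singular vectors; singular values rank their energy content.
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 0.999)) + 1                      # retain 99.9% of the energy
pod_basis = U[:, :r]

# Reduced coordinates of any snapshot: projection onto the retained modes.
reduced_coords = pod_basis.T @ snapshots[:, 0]
```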

1.4.3 Structural Mechanics Applications

ML has been used for some time already in structural mechanics, with probably the most applications in Structural Health Monitoring (SHM) (Farrar and Worden 2012). ML is applied for the primal identification of structural systems (SSI) (Sirca and Adeli 2012; Amezquita-Sancheza et al. 2020), in particular of complex or historical structures, to assess their general and seismic vulnerability (Ruggieri et al. 2021; Xie et al. 2020) and to facilitate subsequent health monitoring (Mishra 2021). Feature extraction and model reduction are fundamental in these approaches (Rosafalco et al. 2021). Other areas where ML is employed are the control of structures (e.g., active Tuned Mass Dampers, Yucel et al. 2019; Colherinhas et al. 2019; Etedali and Mollayi 2018) under wind, seismic, or crowd actions, and structural design (Herrada et al. 2017; Sun et al. 2021; Hong et al. 2020; Yuan et al. 2020). We also comment in this section on the development of novel ML approaches based on ideas used in structural and finite element analyses.

1.4.3.1 Structural System Identification and Health Monitoring

Structural System Identification (SSI) is key in analyzing the vulnerability of historical structures in seismic zones (e.g., Italy and Spain) (Domaneschi et al. 2021). It is also a problem in the assessment of modern structures, since the modeling assumptions may not have been sufficiently accurate (Torky and Ohno 2021). Many classical approaches based on optimization methods are frequently ill-conditioned or present many possible solutions, some of which should be discarded automatically. Hence, ML is an excellent approach to address SSI, and different algorithms have been employed. For example, SVM (Gui et al. 2017), and in particular Weighted Least Squares Support Vector Machines (LS-SVM), have been employed to determine the structural parameters and then identify degradation due to damage through the dynamic response (Tang et al. 2006; Zhang et al. 2007). K-Means and KNNs are also frequently used in SHM. For example, in Sarmadi and Karamodin (2020) anomaly detection is performed using the (squared) “distance” \((\textbf{x}-\bar{\textbf{x}})^{T} \textbf{S}_k^{-1} (\textbf{x}-\bar{\textbf{x}})\) to detect the k-NN in a multivariate one-class k-NN approach. The authors applied the approach to wood and steel bridges and compared the results with those of other ML techniques, reaching the smallest misclassification rate. Bridge structures have also been studied using Genetic Algorithms in an unsupervised approach to detect damage (Silva et al. 2016). Health monitoring of bridges was the focus of rather early research (e.g., the simple case analyzed in Liu and Sun 1997 through NNs). Traffic loads (Lee et al. 2002) and ambient vibrations (Avci et al. 2021) are typical actions requiring the study of the evolution of the mechanical properties. The application of NNs is typical in detecting changes in the properties and in suggesting possible explanations for the origin of those changes (Ko and Ni 2005). Basically, all types of bridges have been studied using ML techniques, namely steel girder bridges (Nick et al. 2021), reinforced concrete T-bridges (Hasançebi and Dumlupınar 2013), cable-stayed (Arangio and Bontempi 2015) and long suspension bridges (Ni et al. 2020), truss bridges (Mehrjoo et al. 2008), and arch bridges (Jayasundara et al. 2020). Different types of NN are used (e.g., Bayesian, Arangio and Beck 2012; Li et al. 2020; Ni et al. 2001, Convolutional, Nguyen et al. 2020; Quqa et al. 2022, Recurrent, Miao et al. 2023; Miao and Yokota 2022), and other techniques such as SVM are also frequently employed; see, for example, Alamdari et al. (2017), Yu et al. (2021).
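A simplified sketch of this type of anomaly detection is given below: the squared Mahalanobis-type distance of test features to the baseline (healthy) data is used as a novelty score and compared against a quantile-based threshold. For brevity, the distance is taken to the training mean rather than to the k nearest neighbors, so this only illustrates the idea behind the cited approach; the data and threshold are synthetic.

```python
import numpy as np

def mahalanobis_score(features_train, features_test):
    """Squared Mahalanobis distance of each test sample to the training mean,
    used as a novelty (damage) indicator; a simplified sketch of the idea."""
    mean = features_train.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(features_train, rowvar=False))
    diff = features_test - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Toy usage: features from the healthy state define the baseline; test features
# with a shifted mean produce larger scores and are flagged as anomalous.
rng = np.random.default_rng(3)
healthy = rng.normal(size=(500, 6))
possibly_damaged = rng.normal(loc=0.8, size=(20, 6))
threshold = np.quantile(mahalanobis_score(healthy, healthy), 0.99)
flags = mahalanobis_score(healthy, possibly_damaged) > threshold
print(flags)
```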

Apart from bridges and multi-story buildings (González and Zapico 2008; Wang et al. 2020), there are many other types of structures for which SSI and SHM are performed employing ML. Important structures are dams, where deterioration and failure may cause massive destruction; hence visual inspection and monitoring of displacement cycles are typical actions in SHM of dams. The observations feed ML algorithms to assess the health of the structure. The estimation of the structural response from collected data is studied for example in Li et al. (2021b), where a CNN is used to extract features and a bidirectional gated RNN is employed to perform transfer learning from long-term dependencies. Similar works addressing SHM of dams are given in Yuan et al. (2022), Sevieri and De Falco (2020). A review may be found in Salazar et al. (2017).

Of course, different outputs may be pursued, and the appropriate ML technique depends on both the available data and the desired output. For example, NNs have been used in Kao and Loh (2013), Ranković et al. (2012), Chen et al. (2018), and He et al. (2022) to monitor radial and lateral displacements in arch dams. Several ML techniques such as Random Forest (RF), Boosted Regression Trees (BRT), NN, SVM, and MARS are compared in Salazar et al. (2015) in the prediction of dam displacements and of dam leakage. The researchers found that BRT outperforms the most common data-driven technique employed for this problem, namely the Hydrostatic-Seasonal-Time (HST) method, which accounts for the irreversible evolution of the dam response due to the reversible hydrostatic and thermal loads; see also Salazar et al. (2016). Gravity dams are a different type of structure from arch dams. Their reliability under flooding, earthquakes, and aging has also been addressed using ML methods in Hariri-Ardebili and Pourkamali-Anaraki (2018), where kNN, SVM, and NB have been used in the binary classification of structural results, and a failure surface is computed as a function of the dimensions of the dam. Related to dam infrastructure planning, flooding susceptibility predictions due to rainfall using NB and Naïve Bayes Trees (NBT) are compared in Khosravi et al. (2019) with three classical methods in the field (see the review of Multicriteria Decision Making (MCDM) in de Brito and Evers 2016). In tunnel design and monitoring, the soil is also an integral part of the structure and is difficult to characterize. The understanding of its behavior often depends on qualitative observations; it is therefore another field where machine learning techniques will have an important impact in the future (Jafari 2020).

Aerogenerators or Wind Turbines (WT) are also important types of structures considered in SHM; see the review in Ciang et al. (2008). Here, two main components are typically analyzed: the blades and the gearbox (Wang et al. 2016). SVM is a frequently used ML technique, and acoustic noise is a relevant source of data for blade monitoring (Regan et al. 2016). Deep NNs are also frequently employed when multiple sources of data are available; in particular, CNNs are used to deal with images from drones (Shihavuddin et al. 2019; Guo et al. 2021). Images are valuable not only in detecting overall damage (e.g., determining a damage index value), but also in determining the location of the damage. This gives an alternative to the placement of networks of strain sensors (Laflamme et al. 2016). Other WT functional problems, such as dirt and mud detection on blades to improve maintenance, can be addressed employing different ML methods; e.g., in Jiménez et al. (2020) k-Nearest Neighbors (k-NN), SVM, LDA, PCA, DT, and an ensemble subspace discriminant method are employed. Other factors, like the presence of ice in cold climates, are also important. In Jiménez et al. (2019), a ML approach is applied to pattern recognition on guided ultrasonic waves to detect and classify ice thickness. In this work, different ML techniques are employed for feature extraction (data reduction into meaningful features), both linear (autoregressive ML models and PCA) and nonlinear (nonlinear-AR eXogenous and nonlinear PCA), and then feature selection is performed to avoid overfitting. A wide range of supervised classifiers of different families (DT, LDA, QDA, several types of SVM, kNN, and ensembles) were employed and compared, both in terms of accuracy and efficiency.

Applications of ML can also be found in data preparation, including imputation techniques to fill in missing sensor data (Li et al. 2021a, b). Structural systems, damage, and structural responses are assessed employing different variables. Typical variables are the displacements (building drift), which allow for the determination of material and structural geometric properties, for example in reinforced concrete (RC) columns. This can be achieved through locally weighted LS-SVM (Luo and Paal 2019). Bearing capacities and failure modes of structural components (columns, beams, shear walls) can also be predicted using ML techniques, in particular when the classical methods are complex and lack accuracy. For example, in Mangalathu et al. (2020) several ML methods such as Naïve Bayes, kNN, decision trees, and random forests, combined with several weighted Boost techniques (similar to ensemble learning under the assumption that many weak learners make a strong learner) such as AdaBoost (Adaptive Boosting, meaning that new weak learners adapt from the misclassifications of previous ones), are compared to predict the failure modes (flexural, diagonal tension or compression, sliding shear) of RC shear walls in seismic events.
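The sketch below shows a boosted classifier of this kind using scikit-learn's AdaBoostClassifier on hypothetical shear-wall features and synthetic three-class failure-mode labels; neither the features nor the labels correspond to the dataset of Mangalathu et al. (2020).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features of RC shear walls (e.g., aspect ratio, reinforcement
# ratio, axial load ratio, concrete strength) and synthetic failure-mode labels
# 0/1/2 standing for flexure, diagonal tension/compression, and sliding shear.
rng = np.random.default_rng(4)
X = rng.uniform(size=(600, 4))
y = (2.0 * X[:, 0] + X[:, 1] > 1.5).astype(int) + (X[:, 2] > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = AdaBoostClassifier(n_estimators=200, random_state=0)   # boosted shallow trees
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```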

Identification of smart structures with nonlinearities, like buildings with magnetorheological dampers, has been performed through a combination of NN, PCA, and fuzzy logic (Mohammadzadeh et al. 2015).

In SHM, the integration of data from different types or families of sensors (data fusion) is an important topic. Data fusion (Hall and Llinas 2001) brings not only challenges in SHM but also the possibility of a more accurate, integral prediction of the health of the structure (Wu and Jahanshahi 2020). For example, in Vitola et al. (2017) a data fusion system based on kNN classification was used in SHM. SHM is most frequently performed through the analysis of the dynamic response of the structure, comparing vibrational modes using the Modal Assurance Criterion (MAC) (Ho et al. 2021). However, in the more challenging SSI, many additional features are employed, such as typology, age, and images. In SHM, damage detection is also pursued through the analysis of images. Visual inspection is a long-used method for crack detection in concrete or steel structures, or for determining unusual displacements and deformations of the overall structure from global structural images. Automatic processing and damage detection from images obtained from stationary cameras or an Unmanned Aerial Vehicle (UAV) (Sankarasrinivasan et al. 2015; Reagan et al. 2018) is currently being performed using ML techniques. A recent review of these image-based techniques can be found in Dong and Catbas (2021). Another recent review of ML applications in SHM of civil structures can be found in Flah et al. (2021).

One of the lessons learnt from the available results is that, to improve predictions and robustness, progress is needed in physics-based ML approaches for SHM. For instance, an improvement may be the use of concrete damage models together with environmental data, typology, images, etc. to detect damage which may have little impact on sensor readings (Kralovec and Schagerl 2020), but which may result in significant losses. This issue is also of special relevance in the aircraft industry (Ahmed et al. 2021).

1.4.3.2 Structural Design and Topology Optimization

The design of components and structures is based on creativity and experience (Málaga-Chuquitaype 2022), so it is also an optimal field for the use of ML procedures, e.g., Adeli and Yeh (1990). ML in the general design of industrial components is briefly addressed below.

Given the creative nature of structural design, evolutionary algorithms are good choices. For example in Freischlad and Schnellenbach-Held (2005), linguistic modeling is applied to conceptual design, investigating evolutionary design and optimization of high-rise concrete buildings for lateral load bearing. The process of the design of a structure using ML from concept to actual structural detailing is discussed in Chang and Cheng (2020). Different structural systems are conceptually designed with the aid of ML techniques, including shear walls to sustain earthquakes (e.g., using GAN, in Lu et al. 2022; Zhao et al. 2022), shell structures (Tam et al. 2020; Zheng et al. 2020), and even the architectural volume (Chang et al. 2021). A study and proposal of different ML techniques in architectural design can be found in Tamke et al. (2018).

Of course, one of the main disciplines in structural design is Topology Optimization (TO), and ML approaches (a combination coined “learning topology” in Moroni and Pascali 2021) can be used to develop more robust schemes (Chi et al. 2021; Muñoz et al. 2022) through the tuning of numerical parameters (Lynch et al. 2019). For example, in Muñoz et al. (2022), manifold learning approaches such as locally linear embedding (LLE) techniques are employed to extract geometrical modes defined by the material distribution given by the TO algorithm, facilitating the creation of new geometries. TO of nonlinear structures is also performed using deep NNs (Abueidda et al. 2020). In order to obtain optimum thermal structures, GANs have been used to develop non-iterative structural TO (Li et al. 2019). Using ML to develop a non-iterative TO approach has also been addressed in Yu et al. (2019). A recent review of ML techniques in TO can be found in Mukherjee et al. (2021).
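A hedged sketch of this manifold-learning idea is shown below: locally linear embedding is applied to a synthetic family of flattened material-density fields to extract a two-dimensional set of geometrical modes, which could then be sampled to propose new geometries. The design family and the LLE settings are illustrative only.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic family of "designs": each sample is a flattened material-density
# field whose layout varies smoothly with two hidden parameters, mimicking the
# output of topology-optimization runs for different load cases.
rng = np.random.default_rng(5)
t1, t2 = rng.uniform(0, 1, (2, 300))
grid = np.linspace(0, 1, 64)
designs = np.array([np.clip(np.sin(2 * np.pi * (grid - a)) + b, 0, 1)
                    for a, b in zip(t1, t2)])                   # (300 designs, 64 densities)

# LLE extracts a low-dimensional parametrization ("geometrical modes") of the
# design family, which can then be explored to create new candidate geometries.
embedding = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
design_modes = embedding.fit_transform(designs)                 # (300, 2) reduced coordinates
```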

1.4.4 Machine Learning Approaches Motivated in Structural Mechanics and by Finite Element Concepts

While ML has contributed to CAE and structural design, new ML approaches have also been developed based on concepts that are traditional in structural analysis and finite element solutions. For example, one of the ideas is the concept of substructuring, employed in static condensation, Guyan reduction, and Craig–Bampton schemes (Bathe 2006). In Jokar and Semperlotti (2021) a Finite Element Network Analysis (FENA) is proposed. The method substitutes the classical finite elements by a library of “elements” consisting of Bidirectional Recurrent Neural Networks (BRNN). The BRNNs of the elements are trained individually, and the training can be computationally costly. These trained BRNNs are then concatenated, and the composite system needs no further training. The solution is fast, not counting the training, since in contrast to FE solutions, no system of equations is solved. The method has only been applied to the analysis of an elastic bar, so the generalization of the idea to the solution of more complex problems is still an open research task.

The partition of unity used in finite element and meshless methods has been employed to develop a Finite Element Machine (FEMa) for fast supervised learning (Pereira et al. 2020). The idea is that each training sample is the center of a Shepard function, and the training set is treated as a probabilistic manifold. The advantage is that, as in the case of spline-based approaches, the technique has no parameters. Compared to several methods (BPNN, Naïve Bayes, SVM using both RBF and sigmoid kernels, RF, DT, etc.), the FEMa method was competitive on the eighteen benchmark datasets typically employed in the literature when analyzing supervised methods.
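The following minimal sketch conveys the partition-of-unity idea behind this family of methods: each training sample carries a Shepard (inverse-distance) weight, the weights are normalized to form a partition of unity, and class scores are weighted votes. The power exponent and toy data are assumptions, and the details differ from the actual FEMa formulation of Pereira et al. (2020).

```python
import numpy as np

def shepard_predict(X_train, y_train, X_test, power=4, eps=1e-12):
    """Classify test points by Shepard (inverse-distance) weighting of the
    training samples; the weights form a partition of unity over the training set."""
    classes = np.unique(y_train)
    predictions = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1) + eps
        w = 1.0 / d**power
        w /= w.sum()                                   # partition of unity
        scores = [w[y_train == c].sum() for c in classes]
        predictions.append(classes[int(np.argmax(scores))])
    return np.array(predictions)

# Toy usage with two Gaussian blobs; no iterative training step is required.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(shepard_predict(X, y, np.array([[0.2, 0.1], [2.8, 3.1]])))   # -> [0 1]
```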

Another interesting approach is the substitution of some procedures of finite element methods with machine learning approaches. Candidates are material and general element libraries, creating surrogate material models (discussed above) or surrogate elements, or patches of elements. This approach follows the substructuring or multiscale computational homogenization (FE2) idea, but in this case using ML procedures instead of an RVE finite element mesh. In Capuano and Rimoli (2019), several possibilities are addressed and applied to nonlinear truss structures and a (nonlinear) hyperelastic perforated plane strain structure. A similar approach is used in Yan et al. (2022) for composite shells employing physics-based NNs. In Jung et al. (2020), finite element matrices passing the patch test are generated from data using a neural network that accounts for some physical constraints, such as vanishing strain energy under rigid body motions.

1.4.5 Multiphysics Problems

Despite the already mentioned advances of scientific machine learning in several fields, much less has been achieved for multiphysics problems. This is undoubtedly due to the youth of the discipline, but there are a number of efforts that deserve mention. For instance, in Alexiadis (2019) a system is developed with the aim of replicating human physiology. In Alizadeh et al. (2021), a similar approach is developed for nanofluid flow, while Ren et al. (2020) study hydrogen production.

In the field of multiphysics problems, there exists a particularly appealing approach to machine learning, namely that of port-Hamiltonian formalisms (Van Der Schaft et al. 2014). Port-Hamiltonian systems are essentially open systems that obey a Hamiltonian description of their physics (and thus, are conservative, or reversible). Their interaction with the environment is made through a forcing term. If we call \(\textbf{z}\) the set of variables governing the problem (\(\textbf{z}=(\textbf{p},\textbf{q})\), e.g., position and momentum, for a canonical Hamiltonian system), its evolution in time will be given by

$$\begin{aligned} \dot{\textbf{z}}=\textbf{J}\nabla {H}+\textbf{F} \end{aligned}$$
(1.61)

where \(\textbf{J}\) is the classical (skew-symmetric) symplectic matrix, H is the Hamiltonian (total energy of the system), and \(\textbf{F}\) is the forcing term, which links the port-Hamiltonian system to other subsystems. This paves the way for an efficient coupling of different systems, possibly governed by different physics. Enforcing this port-Hamiltonian structure during the learning process, as an inductive bias, ensures the fulfillment of the conservation of energy in the total system, while allowing for a proper introduction of dissipative terms in the formulation. This is indeed the approach followed in Desai et al. (2021); see also Massaroli et al. (2019), Eidnes et al. (2022), Mattheakis et al. (2022), Sprangers et al. (2014), and Morandin et al. (2022). A recent review on the progress of these techniques can be found in Cherifi (2020).
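As a minimal illustration of imposing the structure of Eq. (1.61) during learning, the sketch below assumes a quadratic Hamiltonian \(H(\textbf{z})=\frac{1}{2}\textbf{z}^{T}\textbf{A}\textbf{z}\) (so that \(\nabla H=\textbf{A}\textbf{z}\)) and fits the symmetric matrix \(\textbf{A}\) by least squares from observed states, time derivatives, and forcing terms. This quadratic ansatz and the synthetic data are simplifying assumptions; the works cited above typically parametrize the Hamiltonian with a neural network instead.

```python
import numpy as np

# Canonical symplectic matrix for a single-dof system z = (q, p).
J = np.array([[0.0, 1.0], [-1.0, 0.0]])

# Synthetic data from a forced harmonic oscillator: H(z) = 0.5*(q^2 + p^2),
# so z_dot = J @ grad(H) + F with grad(H) = z (each row is one sample).
rng = np.random.default_rng(7)
z = rng.normal(size=(400, 2))                 # sampled states
F = rng.normal(scale=0.1, size=(400, 2))      # known forcing at those states
z_dot = z @ J.T + F                           # "measured" time derivatives

# Assume H(z) = 0.5 * z^T A z with A symmetric, so grad(H) = A z, and fit A by
# least squares from  z_dot - F = z A J^T ;  for this canonical J, (J^T)^{-1} = J.
rhs = (z_dot - F) @ J                         # equals z A for every sample
A, *_ = np.linalg.lstsq(z, rhs, rcond=None)
A = 0.5 * (A + A.T)                           # enforce the symmetry of the Hamiltonian

print(A)  # close to the identity, recovering H = 0.5*(q^2 + p^2)
```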

1.4.6 Machine Learning in Manufacturing and Design

ML techniques have been applied to classical manufacturing since their early conception, and are now important in Additive Manufacturing (AM). Furthermore, ML is currently being applied to the complete product chain, from conceptual design to the manufacturing process. Below, we review ML applications in classical and additive manufacturing, and in automated design.

1.4.6.1 Classical Manufacturing

In manufacturing, plasticity plays a fundamental role. Machine learning approaches to solve inelastic problems have already been addressed in Sect. 1.4.1.2 above. Research activities prior to 1965 are compiled in an interesting review by Monostori et al. (1996). More recently, another review compiled the works in different research endeavors within the field of manufacturing (Pham and Afify 2005).

Of course, ML is a natural ally of the Industry 4.0 paradigm (the fourth industrial revolution), in which sensors are ubiquitous and data streams provide the systems with valuable information. This synergistic alliance is explored in Raj et al. (2021). In Sharp et al. (2018), valuable research is reported in which Natural Language Processing (NLP) was applied to documentation from 2005 to 2017 in the field of smart manufacturing. The survey analyzes aspects such as decision support (prior to the moment a piece is manufactured), plant and operations health management (for the manufacturing process itself), data management, made necessary by the vast amount of information produced by Internet of Things (IoT) devices installed in modern plants, and lifecycle management. The survey concludes that ML-based techniques were present in the literature (at the moment of publication, 2018) for product life cycle management. While many of these ML techniques are inherently designed to perform prognosis (i.e., to predict several aspects related to manufacturing), in Ademujimi et al. (2017) a review is given of literature that employs ML to perform diagnosis of manufacturing processes.

1.4.6.2 Additive Manufacturing

Due to its inherent technological complexity and our still limited comprehension of many of the physical processes taking place, additive manufacturing (AM) has been an active field of research in machine learning. The interested reader can consult different reviews of the state of the art (Razvi et al. 2019; Meng et al. 2020; Jin et al. 2020; Wang et al. 2020). One of the fields where ML will be very important, and that is tied to topology optimization, is 3D printing. AM, in particular 3D printing, represents a revolution in component design and manufacturing because it allows for infinite possibilities and largely reduced manufacturing difficulties. Moreover, these technologies are reaching resolutions at the microscale, so a component may be designed and manufactured with differently designed structures at the mesoscale (establishing metamaterials), obtaining unprecedented material properties at the continuum scale thus widening the design space (Barchiesi et al. 2019; Zadpoor 2016).

There are many different AM procedures, like Fused Deposition Modeling (FDM), Selective Laser Melting (SLM), Direct Energy Deposition (DED), Electron Beam Melting (EBM), Binder Jetting, etc. While additive manufacturing offers huge possibilities, it also results in new associated challenges in multiple aspects, from the detection of porosity (important in the characterization of the printed material) to the recognition of defects (melting, microstructural, and geometrical), to the characterization of the complex anisotropic behavior, which depends on multiple parameters of the manufacturing process (e.g., laser power in Selective Laser Melting, direction of printing, powder and printing conditions). Both the design using AM and the error correction or compensation (Omairi and Ismail 2021) are typical objectives in the application of ML to AM. Different ML techniques are employed, with SVM being one of the most used schemes. For example, SVM is employed for identifying defective parts from images in FDM (Delli and Chang 2018), for detecting geometrical defects in SLM-made components (zur Jacobsmühlen et al. 2015; Gobert et al. 2018), for building process maps relating variables to desired properties (e.g., low porosity) (Aoyagi et al. 2019), and for predicting surface roughness in terms of process features (Wu et al. 2018). NNs are often used for optimizing the AM process by predicting properties as a function of printing variables. For example, NNs have been used for predicting and optimizing melt pool geometry in DED (Caiazzo and Caggiano 2020), to build process maps and optimize efficiency and surface roughness in SLM (Zhang et al. 2017), to minimize wasted support material (optimize supports in a piece) in FDM (Jiang et al. 2019), to predict and optimize the resulting mechanical properties of the printed material (Lewandowski and Seifi 2016) like strength (e.g., using CNN from thermal histories in Xie et al. 2021 or FFNN in Bayraktar et al. 2017), bending stiffness in AM composites (Nawafleh and AL-Oqla 2022), and stress–strain curves of binary composites using a combination of CNN and PCA (Yang et al. 2020).

NNs have also been used to create surrogate models with the purpose of mimicking the acoustic properties of AM replicas of a Stradivarius violin (Tian et al. 2021). Reviews of techniques and different applications of machine learning in additive manufacturing may be found in Wang et al. (2020), DebRoy et al. (2021), Meng et al. (2020), Qin et al. (2022), Xames et al. (2023), and Hashemi et al. (2022). The review in Guo et al. (2022) addresses in some detail physics-based proposals.

1.4.6.3 Automated CAD and Generative Design

A fundamental step in the design of an industrial component or an architected structure is the conceptual development of the novel component or structure (the most creative part), and more often, the customization of a component from a given family to meet the specific requirements of the component in the system to which it will be added. The novel product is in essence a variation or evolution of previous concepts (first case) or previous components (second case). ML may help in both cases. The challenge of understanding the “rules” of creativity to foster it has paved the way for interesting contributions of ML in this field (Ganin et al. 2021).

In the first case, ML helps in the generative design of a novel component or structure by creating variations supported by attributes, based in essence on the combination and evolution of previous conceptual designs (Gero 1996; Khan and Awan 2018); see the review of ML contributions in Duffy (1997); see also Tzonis and White (2012), especially for conceptual design. An example would be to create a new design of a car. Some conditions are given by the segment to which it will belong, but some other possibilities are open and can be generated from possible variations that may please or attract consumers. For example, Generative Adversarial Networks (GAN) (Goodfellow et al. 2020) are used to explore aerodynamic shapes (Chen et al. 2019). ML is also used for the association of concepts and combinatorial creativity, with the aim of reusing creativity to create new concepts and designs (Chen 2020). Further, ML is employed in the evaluation of design concepts from many candidates based on human preferences expressed in previous concepts (Camburn et al. 2020). There are also ML works that aid in the development of detailed and consistent CAD drawings from hand sketches (Seff et al. 2021), i.e., interpreting and detailing a CAD drawing from a hand sketch.

Considering the second case, the customization of designs is natural to ML approaches. The idea here is to perform automatic variations of previous conceptual designs, or of designs obtained from mathematical optimization. A good example of this approach using deep NNs is given in Yoo et al. (2021) for proposing designs of a wheel: starting from given shapes, variations and simplifications are generated using autoencoders such that they comply with mechanical requirements (strength, eigenfrequencies, etc., evaluated through surrogate models as a function of geometric parameters). Based on this work, an interesting discussion on aesthetics versus performance (aspects to include in ML models) is given in Shin et al. (2021). The combination of topology optimization and generative design can be found in many endeavors (Oh et al. 2019; Barbieri and Muzzupappa 2022).

Moreover, in the design process, there are many aspects that can be automated. A typical aspect is the search for components with similar layout such that detailed drawings, solid models (Chu and Hsu 2006), and manufacturing processes (Li et al. 2016) of new designs may be inferred from previous similar designs (Zehtaban et al. 2016). Indeed, many works focus on procedures to reuse parts of CAD schemes for electronic circuits (Boning et al. 2019) or to develop microfluidic devices (Lore et al. 2015; Tsur 2020).

1.5 Conclusions

With the current access to large amounts of data and the ubiquitous presence of real-time sensors in our life, such as those in cell phones, and also with the increased computational power, Machine Learning (ML) has resulted in a change of paradigm in how many problems are addressed. When using ML, the approach to many engineering problems is no longer a matter of understanding the governing equations, nor even a matter of fully understanding the problem being addressed, but of having sufficient data so that relations between features and desired outputs can be established; and not even in a deterministic way, but in an implicit probabilistic way.

ML has been succeeding for more than a decade in solving complex problems such as face recognition or the prediction of stock evolutions, for which there was no successful deterministic method, nor even a sound understanding of the actual significance of the main variables affecting the result. Computer Aided Engineering (CAE), with the Finite Element Method standing out, has also had extraordinary success in accurately solving complex engineering problems, but a detailed understanding of the governing equations and their discretization is needed. This success delayed the introduction of ML techniques in fields classically dominated by CAE, but during the last years increasing emphasis has been placed on ML methods. In particular, ML is used to solve some of the issues still remaining when addressing problems through classical techniques. Examples of these issues are the still limited generality of classical CAE methods (although the success of the FEM is due to its good generalization possibilities), the search for practical solutions when there is no complete understanding of the problem, and computational efficiency in high-dimensional problems like multiscale and nonlinear inverse problems. While we are still seeing the start of a new era, a large variety of problems in CAE has already been addressed using different ML techniques.

Lessons have also been learned in the last few years. One important lesson is that in engineering solutions, robustness and reliability of the solution are important (Bathe 2006), and data may not be sufficient to guarantee that robustness. Then, ML methods that incorporate physical laws and use the vast analytical knowledge acquired in the last centuries may result not only in more robust methods but also in more efficient schemes. In this chapter, we briefly reviewed ML techniques in CAE and some representative applications. We focused on conveying some of the excitement that is now developing in the research and use of ML techniques by short descriptions of methods and many references to applications of those techniques.