Jerome H. Friedman, Department of Statistics, Stanford University, Stanford, CA 94305, USA (jhf@stanford.edu)
Abstract
The output of a machine learning algorithm can usually be represented by one or more multivariate functions of its input variables. Knowing the global properties of such functions can help in understanding the system that produced the data as well as interpreting and explaining corresponding model predictions. A method is presented for representing a general multivariate function as a tree of simpler functions. This tree exposes the global internal structure of the function by uncovering and describing the combined joint influences of subsets of its input variables. Given the inputs and corresponding function values, a function tree is constructed that can be used to rapidly identify and compute all of the function's main and interaction effects up to high order. Interaction effects involving up to four variables are graphically visualized.
1 Introduction
A fundamental exercise in machine learning is the approximation of a function of several to many variables given values of the function, often contaminated with noise, at observed joint values of the input variables. The result can then be used to estimate unknown function values given corresponding inputs. The goal is to accurately estimate the underlying (non-noisy) outcome values, since the noise is by definition unpredictable. To the extent that this is successful, the estimated function may, in addition, be used to try to understand underlying phenomena giving rise to the data. Even when prediction accuracy is the dominant concern, being able to comprehend the way in which the input variables jointly combine to produce predictions can lead to important sanity checks on the validity of the function estimate. Besides accuracy, the success of this latter exercise requires that the structure of the function estimate be represented in a comprehensible form.
It is well known, and often commented, that the most accurate function approximation methods to date tend not to provide comprehensible results. The function estimate is encoded in an obtuse form that obscures potentially recognizable relationships among the inputs that give rise to various function output values. This is especially the case when the function is not inherently low dimensional. That is, medium to large subsets of the input variables act together to influence the function in the sense that their combined contribution cannot be represented by a combination of smaller subsets of those variables (interaction effects). This has led to incomplete interpretations based on low dimensional approximations such as additive modeling with no interactions, or models restricted to at most two-variable interactions.
2 Interaction effects
For input variables $\mathbf{x} = (x_1, \ldots, x_p)$, a function $F(\mathbf{x})$ is said to exhibit an interaction between two of them, $x_j$ and $x_k$, if the difference in the value of $F(\mathbf{x})$ as a result of changing the value of one of them depends on the value of the other. This means that in order to understand the effect of these two variables on $F(\mathbf{x})$ they must be considered together and cannot be studied separately. An interaction effect between variables $x_j$ and $x_k$ implies that the second derivative of $F(\mathbf{x})$ jointly with respect to $x_j$ and $x_k$ is not zero for at least some values of $\mathbf{x}$. That is,
$$E_{\mathbf{x}}\left[\frac{\partial^2 F(\mathbf{x})}{\partial x_j\,\partial x_k}\right]^2 > 0 \qquad (1)$$
with an analogous expression involving finite differences for categorical variables. If there is no interaction between these variables, the function $F(\mathbf{x})$ can be expressed as a sum of two functions, one that does not depend on $x_j$ and the other that is independent of $x_k$:
$$F(\mathbf{x}) = f_{\backslash j}(\mathbf{x}_{\backslash j}) + f_{\backslash k}(\mathbf{x}_{\backslash k}) \qquad (2)$$
Here $\mathbf{x}_{\backslash j}$ and $\mathbf{x}_{\backslash k}$ respectively represent all variables except $x_j$ and $x_k$. If a given variable $x_j$ interacts with no other variable, then the function can be expressed as
$$F(\mathbf{x}) = f_j(x_j) + f_{\backslash j}(\mathbf{x}_{\backslash j}) \qquad (3)$$
where the first term on the right is a function only of $x_j$ and the second is independent of $x_j$. In this case $F(\mathbf{x})$ is said to be "additive" in variable $x_j$, and the univariate function $f_j(x_j)$ can be examined to study the effect of $x_j$ on the function independently from the effects of the other variables.
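The mixed-derivative criterion (1) can be checked numerically. The following minimal Python sketch (an illustration added here, not part of the procedure described below) estimates the mixed second difference of a generic target F at a point; the example target F and the step size h are assumptions:

```python
import numpy as np

def mixed_difference(F, x, j, k, h=1e-3):
    """Finite-difference estimate of the mixed second derivative
    d^2 F / dx_j dx_k at the point x (Eq. 1); a nonzero value at
    some x signals an interaction between x_j and x_k."""
    e_j = np.zeros_like(x); e_j[j] = h
    e_k = np.zeros_like(x); e_k[k] = h
    return (F(x + e_j + e_k) - F(x + e_j) - F(x + e_k) + F(x)) / h**2

# Example: F(x) = x0*x1 + x2 interacts only in (x0, x1).
F = lambda x: x[0] * x[1] + x[2]
x = np.array([0.5, -1.0, 2.0])
print(mixed_difference(F, x, 0, 1))  # ~1.0: interaction present
print(mixed_difference(F, x, 0, 2))  # ~0.0: no interaction
```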
Higher order interactions are analogously defined. A function $F(\mathbf{x})$ is said to have an $n$-variable interaction effect involving variables $x_{j_1}, \ldots, x_{j_n}$ if
$$E_{\mathbf{x}}\left[\frac{\partial^n F(\mathbf{x})}{\partial x_{j_1} \cdots \partial x_{j_n}}\right]^2 > 0 \qquad (4)$$
again with an analogous expression involving finite differences for categorical variables. The existence of such an $n$-variable interaction implies that the effect of the corresponding variables $x_{j_1}, \ldots, x_{j_n}$ on the function $F(\mathbf{x})$ cannot be decomposed into a sum of functions each involving subsets of those variables. If there is no such interaction, the contribution of variables $x_{j_1}, \ldots, x_{j_n}$ to the variation of $F(\mathbf{x})$ can be decomposed into a sum of functions each not depending upon one of these respective variables:
$$F(\mathbf{x}) = \sum_{i=1}^{n} f_i(\mathbf{x}_{\backslash j_i}) \qquad (5)$$
If none of the variables in a subset $\mathbf{x}_s$ interact with any of the variables in its complement set $\mathbf{x}_{\backslash s}$, then the function is additive in the variable subset $\mathbf{x}_s$:
$$F(\mathbf{x}) = f_s(\mathbf{x}_s) + f_{\backslash s}(\mathbf{x}_{\backslash s}) \qquad (6)$$
and one can study the effect of $\mathbf{x}_s$ on the function separately from that of the other variables.
The notion of interaction effects can be useful for understanding a function $F(\mathbf{x})$ if it is dominated by those of low order involving variable subsets with small cardinality ($n \ll p$). In this case a target function of many variables can be understood in terms of a collection of lower dimensional functions, each depending on a relatively small, different subset of the variables. It is generally easier to comprehend the lower dimensional structure associated with fewer variables. Also, a common regularization method is to more heavily penalize or forbid solutions involving higher order interactions. Although often useful, this can result in lower accuracy and/or misleading interpretations when $F(\mathbf{x})$ involves substantial interaction effects of higher order than that assumed. As seen below, the existence of such higher order interactions can mislead interpretation by causing the functional form of those of lower order to depend on values of other unknown variables.
3 Function trees
A function tree is a way of representing a multivariate function so as to uncover and estimate its interaction effects of various orders. The input function is defined by data $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$, with $\mathbf{x}_i$ representing multivariate evaluation points and $y_i$ the corresponding function values, perhaps contaminated with noise. The output of the procedure is a set of univariate functions, each one depending on a single selected input variable. These functions are arranged as the non-root nodes of a tree. A constant function is associated with the root. The tree provides a prescription for combining these univariate functions to form the multivariate function approximation $\hat F(\mathbf{x})$.
Figure 1 shows one such function tree derived from simple simulated data for illustration. Figure 2 shows the nine non-constant functions corresponding to each of the respective non-root nodes of the tree. In addition to its associated univariate function (Fig. 2), each tree node $m$ represents a basis function $B_m(\mathbf{x})$. The sum of these basis functions forms the final multivariate approximation
$$\hat F(\mathbf{x}) = \sum_{m=0}^{M} B_m(\mathbf{x}) \qquad (7)$$
Each node's basis function is the product of its associated univariate function and those of the nodes on the path from it to the root:
$$B_m(\mathbf{x}) = \prod_{k \in P(m)} f_k(x_{j(k)}) \qquad (8)$$
Here $P(m)$ represents those nodes on the path from node $m$ to the root, and $j(k)$ labels the input variable associated with each such node $k$. The darkness of each node's shading in Fig. 1 corresponds to the estimated relative influence of its basis function on the final function estimate, as measured by its standard deviation over the training data.
The interaction level of each basis function (8) is the number of univariate functions of different variables along the path from its corresponding node to the root of the tree. Products of univariate functions involving different variables satisfy (4). Note that different univariate functions of the same variable may appear multiple times along the same path. While potentially improving the model fit, they do not increase interaction level. With this approach, interactions among a specified subset $\mathbf{x}_s$ of the variables are modeled by sums of products of univariate functions of those variables:
$$f_s(\mathbf{x}_s) = \sum_{m}\; \prod_{k \in P(m)} f_k(x_{j(k)}), \quad x_{j(k)} \in \mathbf{x}_s \qquad (9)$$
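To make the tree representation (7)–(9) concrete, the following sketch implements a minimal node structure in Python; the class and function names are illustrative, not taken from the paper:

```python
import numpy as np

class Node:
    """A non-root node of a function tree: a univariate function `f`
    of input variable `var`, linked to its parent (None at the root)."""
    def __init__(self, var, f, parent=None):
        self.var, self.f, self.parent = var, f, parent

    def basis(self, X):
        # Eq. (8): product of this node's univariate function and those
        # of all nodes on the path back to the root.
        b = np.ones(len(X))
        node = self
        while node is not None:
            b = b * node.f(X[:, node.var])
            node = node.parent
        return b

def tree_predict(root_const, nodes, X):
    # Eq. (7): root constant plus the sum of all node basis functions.
    return root_const + sum(n.basis(X) for n in nodes)

# Hypothetical 3-node tree encoding sin(x0) + cos(x0) + cos(x0)*tanh(x1);
# the last basis is a two-variable interaction term (level 2).
n1, n2 = Node(0, np.sin), Node(0, np.cos)
n3 = Node(1, np.tanh, parent=n2)
print(tree_predict(0.0, [n1, n2, n3], np.random.rand(5, 2)))
```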
The function tree model shown in Figs. 1 & 2 was obtained by applying the procedure described below to a simulated data example. There are $N$ observations with outcome variables generated as $y = F(\mathbf{x}) + \varepsilon$, with target function
(10)
The noise $\varepsilon$ is generated to produce a specified signal/noise ratio. This target function has an additive dependence on the third variable, a separate two-variable interaction between each of two pairs of variables, and a trilinear three-variable interaction among three of the variables.
The first node of the tree is seen to incorporate the pure additive effect of the third variable. Nodes 2–5 model the three-variable interaction. One two-variable interaction is modeled by nodes 6 and 7, and the other by nodes 8 and 9. The nine univariate functions (Fig. 2), combined as specified by the tree (Fig. 1), produce a multivariate function that explains 97% of the variance of the target (10).
3.1 Tree construction
Function trees are constructed in a standard forward stepwise, best first manner. Initially the tree consists of a single root node representing a constant function. At the $M$th step there are $M+1$ basis functions (7) (8) in the current model $\hat F_M(\mathbf{x})$. The $(M+1)$st is taken to be
$$B_{M+1}(\mathbf{x}) = B_{l^*}(\mathbf{x})\, f^*(x_{j^*}) \qquad (11)$$
where $l^*$, $j^*$, and $f^*$ are the solution to
$$(l^*, j^*, f^*) = \operatorname*{arg\,min}_{l,\, j,\, f}\; \hat E\left[\left(y - \hat F_M(\mathbf{x}) - B_l(\mathbf{x})\, f(x_j)\right)^2\right] \qquad (12)$$
Here $B_l(\mathbf{x})$ is one of the basis functions in the current model, $f(x_j)$ is a function of one of the predictor variables, and the empirical expected value $\hat E$ is over the data distribution. The model is then updated
$$\hat F_{M+1}(\mathbf{x}) = \hat F_M(\mathbf{x}) + B_{M+1}(\mathbf{x}) \qquad (13)$$
for the next iteration. In terms of tree construction, the update consists of adding a daughter node to current node $l^*$ with associated function $f^*(x_{j^*})$. Iterations continue until the goodness-of-fit stops improving as measured on an independent test data sample.
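A minimal sketch of one such forward step, assuming a crude quantile-bin-average stand-in for the univariate smoother of Section 3.3 (the actual procedure uses more refined smoothers, and these function names are illustrative):

```python
import numpy as np

def fit_node_function(x, r, b, n_bins=10):
    """Per-bin least-squares fit of f so that b(x)*f(x_j) ~ r: the
    weighted conditional expectation of Eq. (16), approximated by
    f = E[b*r | x_j] / E[b^2 | x_j] within quantile bins."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    num = np.bincount(idx, weights=b * r, minlength=n_bins)
    den = np.bincount(idx, weights=b * b, minlength=n_bins) + 1e-12
    vals = num / den
    return lambda z: vals[np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)]

def forward_step(X, y, yhat, bases):
    """One best-first step (Eqs. 11-12): over all pairs (existing basis
    B_l, variable x_j), fit a candidate univariate function to the
    residuals and keep the pair giving the smallest squared error."""
    r = y - yhat
    best_loss, best = np.inf, None
    for l, B in enumerate(bases):
        bx = B(X)
        for j in range(X.shape[1]):
            f = fit_node_function(X[:, j], r, bx)
            loss = np.mean((r - bx * f(X[:, j])) ** 2)
            if loss < best_loss:
                best_loss, best = loss, (l, j, f)
    return best

# Usage: start from the constant root basis and take one step.
X = np.random.rand(200, 3); y = X[:, 0] * X[:, 1]
bases = [lambda X: np.ones(len(X))]
l, j, f = forward_step(X, y, np.full(len(y), y.mean()), bases)
```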
3.2 Backfitting
As with all tree based methods, smaller trees (fewer nodes) tend to be more easily understood. Here tree size can be reduced and accuracy improved by increased function optimization (backfitting) at each step (Friedman and Stuetzle 1981). This is implemented here by updating the functions associated with all current tree nodes in the presence of the newly added one, and those previously updated. In particular, after adding the new $(M+1)$st node, all node functions are re-estimated one at a time in order starting from the first
(14)
(15)
Here $k$ labels a node of the tree and $f_k$ its associated function of variable $x_{j(k)}$. In terms of tree construction, each such step involves replacing each ($k$th) node's current function with (14). No changes are made to tree topology (Fig. 1) or the input variables associated with each node. Only the functions (Fig. 2) are updated. Backfitting passes can be repeated until the model becomes stable. This usually happens after one or two passes. Note that in the presence of backfitting, function tree models are not nested. All of the basis functions of the $(M+1)$-node model are different from those of its $M$-node predecessor.
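For the additive special case (a tree of depth one) the backfitting pass reduces to the classical algorithm, sketched below as a continuation of the previous sketch (it reuses fit_node_function); the full procedure re-fits node functions inside products rather than sums:

```python
def backfit(X, y, fs, vars_, n_passes=2):
    """Backfitting in the additive special case: re-fit each node
    function, in order, against the residual of all the others."""
    w = np.ones(len(y))  # unit weights: the root basis is constant
    for _ in range(n_passes):
        for k in range(len(fs)):
            partial = y - sum(fs[m](X[:, vars_[m]])
                              for m in range(len(fs)) if m != k)
            fs[k] = fit_node_function(X[:, vars_[k]], partial, w)
    return fs
```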
3.3 Univariate function estimation
The central operation of the above procedure is repeated estimation of univariate functions of the individual predictor variables. In both (12) and (14) the solutions take the form of weighted conditional expectations
$$\hat f(x_j) = \frac{\hat E\left[B(\mathbf{x})\, r \mid x_j\right]}{\hat E\left[B^2(\mathbf{x}) \mid x_j\right]} \qquad (16)$$
where $B(\mathbf{x})$ represents one of the current basis functions and $r$ are the current residuals. The function tree procedure as described above is agnostic concerning the methods used for this estimation. However, actual performance in terms of execution speed and prediction accuracy can be highly dependent on such choices.
A major consideration in the choice of an estimation method for a particular variable $x_j$ is the nature of that variable in terms of the values it realizes, and the connections between those values. A categorical variable (factor) realizes discrete values with no order relation. The only method for evaluating (16) for such variables is to take the weighted mean at each discrete value. This is also a viable strategy if the values are orderable but realize only a small set of distinct values.
For numeric variables that realize many distinct orderable values one can take advantage of the presumed smoothness of the solution function to improve accuracy by borrowing strength from similarly valued observations. There are many such methods for "smoothing" data (see Irizarry 2019). For the examples presented below, near neighbor local averaging and local linear fitting were employed. Near neighbor local averaging estimates are equivariant under monotone transformations of the $x$-values, depending only on their relative ranks. This can provide higher accuracy in the presence of irregular or clumped $x$-values, resistance to outliers, and more cautious (constant) extrapolation at the edges. For more evenly distributed data with many distinct $x$-values, in the absence of outliers, local linear fitting can yield smoother, more accurate results.
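A minimal sketch of the near-neighbor local averaging option (k, the number of neighbors, is an assumed tuning parameter):

```python
import numpy as np

def knn_smooth(x, y, x0, k=20):
    """Near-neighbor local averaging: estimate E[y | x = t] as the
    mean of y over k consecutive observations in rank order around t,
    so the estimate depends on the x-values only through their ranks
    and extrapolates as a constant beyond the data range."""
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    x0 = np.atleast_1d(x0)
    out = np.empty(len(x0))
    for i, t in enumerate(x0):
        pos = np.searchsorted(xs, t)             # rank position of t
        lo = max(0, min(pos - k // 2, len(xs) - k))
        out[i] = ys[lo:lo + k].mean()            # k rank-neighbors
    return out
```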
4 Interpretation
The primary goal of the function tree representation is to enhance interpretation by exposing a function's separate interaction effects of various orders. These can then be studied using graphical and other methods to gain insight into the nature of the relationship between the predictor variables and the target function.
4.1 Partial dependence functions
One way to estimate the contribution of subsets of predictor variables to a function is through partial dependence functions (Friedman 2001). Let $\mathbf{x}_s$ represent a subset of the predictor variables and $\mathbf{x}_{\backslash s}$ the complement subset. Then the partial dependence of a function $F(\mathbf{x})$ on $\mathbf{x}_s$ is defined as
$$PD_s(\mathbf{x}_s) = E_{\mathbf{x}_{\backslash s}}\left[F(\mathbf{x}_s, \mathbf{x}_{\backslash s})\right] \qquad (17)$$
Conditioned on joint values of the variables in $\mathbf{x}_s$, the value of the function is averaged over the joint values of the complement variables $\mathbf{x}_{\backslash s}$. Since the joint locations of partial dependence functions are generally not identifiable, they are each centered to have zero mean value over the data distribution of $\mathbf{x}_s$.
If for a function $F(\mathbf{x})$ the variables in the specified subset $\mathbf{x}_s$ do not participate in interactions with variables in $\mathbf{x}_{\backslash s}$, then the additive dependence (6) of $F(\mathbf{x})$ on $\mathbf{x}_s$ is well defined and given by $f_s(\mathbf{x}_s)$. If this is not the case, then the dependence of the target function on the subset $\mathbf{x}_s$ is not well defined in the sense that its functional form changes with changing values of the variables in $\mathbf{x}_{\backslash s}$. In this case one can define a "nominal" dependence by averaging over some distribution of the predictor variables $\mathbf{x}_{\backslash s}$. This confounds the properties of the actual target function with those of the data distribution in the resulting estimate of the dependence on $\mathbf{x}_s$. Two popular choices for that distribution are the training data and the product of its marginals (independence). Partial dependence functions choose a compromise in which the variables in $\mathbf{x}_s$ are taken to be independent of those in $\mathbf{x}_{\backslash s}$. Note that the result differs from that using the training data distribution only if there are substantial correlations or associations between the interacting variables in $\mathbf{x}_s$ and $\mathbf{x}_{\backslash s}$.
Partial dependence functions (17) can be estimated from the data in a straightforward way by evaluating
$$\widehat{PD}_s(\mathbf{x}_s) = \frac{1}{N} \sum_{i=1}^{N} F(\mathbf{x}_s, \mathbf{x}_{\backslash s, i}) \qquad (18)$$
over a representative set of joint values of $\mathbf{x}_s$. This requires $n \cdot N$ target function evaluations, where $n$ is the number of evaluation points and $N$ is the sample size used for averaging.
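A direct implementation of (18), with a generic vectorized target F assumed for illustration; the cost is one pass over the data per evaluation point:

```python
import numpy as np

def partial_dependence(F, X, s, grid):
    """Brute-force partial dependence (Eq. 18): for each joint value z
    of the variables in s, clamp X[:, s] = z and average F over the
    training values of the complement variables."""
    pd = np.empty(len(grid))
    for i, z in enumerate(grid):
        Xz = X.copy()
        Xz[:, s] = z
        pd[i] = F(Xz).mean()
    return pd - pd.mean()        # center (here over the grid)

F = lambda X: X[:, 0] * X[:, 1] + X[:, 2]   # example target
X = np.random.rand(500, 3)
grid = np.linspace(0, 1, 21).reshape(-1, 1)
print(partial_dependence(F, X, [0], grid)[:5])
```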
With the function tree representation, partial dependence functions can be computed much more rapidly. For the purpose of computing a partial dependence function on a variable subset $\mathbf{x}_s$, the function tree model can be expressed as
$$\hat F(\mathbf{x}) = \hat F_{\backslash s}(\mathbf{x}_{\backslash s}) + \sum_k g_k(\mathbf{x}_s)\, h_k(\mathbf{x}_{\backslash s}) \qquad (19)$$
where $\hat F_{\backslash s}(\mathbf{x}_{\backslash s})$ represents all basis functions not involving any variables in $\mathbf{x}_s$, the $g_k$ are functions involving variables only in $\mathbf{x}_s$, and $\mathbf{x}_{\backslash s}$ represents the complement variables. With this representation the partial dependence of $\hat F(\mathbf{x})$ on the variables $\mathbf{x}_s$ is simply
$$\widehat{PD}_s(\mathbf{x}_s) = \sum_k \bar h_k\, g_k(\mathbf{x}_s) \qquad (20)$$
where $\bar h_k$ is the mean of $h_k(\mathbf{x}_{\backslash s})$ over the data. This (20) is a particular linear combination of the functions $g_k(\mathbf{x}_s)$. The number of target function evaluations required to compute (20) is proportional to
(21)
where the proportionality factor involves the fraction of node basis functions (8) with variables in both $\mathbf{x}_s$ and $\mathbf{x}_{\backslash s}$.
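For the Node tree sketched in Section 3 this computation is immediate: each basis product factors into a part over $\mathbf{x}_s$ and a part over the complement, and the complement part collapses to its data mean. A minimal sketch, reusing the Node class from the earlier illustration (grid is an array of evaluation points of which only the columns in s are read):

```python
import numpy as np

def tree_partial_dependence(nodes, X, s, grid):
    """Fast partial dependence (Eqs. 19-20) for the Node tree sketch:
    split each basis product into factors over variables in s
    (evaluated on the grid) and factors over the complement
    (averaged once over the training data X)."""
    s = set(s)
    pd = np.zeros(len(grid))
    for node in nodes:
        g = np.ones(len(grid))    # g_k: factors with variables in s
        h = np.ones(len(X))       # h_k: complement factors
        n = node
        while n is not None:
            if n.var in s:
                g = g * n.f(grid[:, n.var])
            else:
                h = h * n.f(X[:, n.var])
            n = n.parent
        pd += h.mean() * g        # Eq. (20): coefficient is mean of h_k
    return pd - pd.mean()         # center, as in Section 4.1
```

In this sketch each node is visited once per evaluation point and once for the data average, so the cost grows like the tree size times $(n + N)$ rather than like the product $n \cdot N$ of (18).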
4.2 Partial association functions
Partial dependence functions focus on the properties of the target function by averaging over a distribution in which the variables $\mathbf{x}_s$ are independent of those in $\mathbf{x}_{\backslash s}$. This concentrates on the target function by removing the effects of associations between those variable subsets. As a result, the calculation can sometimes emphasize $\mathbf{x}$-values for which the data density is small, leading to potential inaccuracy. This only affects a partial dependence to the extent that there are variables in $\mathbf{x}_{\backslash s}$ with which its variables both interact and are correlated.
The function tree representation provides a method for detecting and measuring the strength of such occurrences. "Partial association" functions are defined as
$$PA_s(\mathbf{x}_s) = \sum_k \hat h_k(\mathbf{x}_s)\, g_k(\mathbf{x}_s) \qquad (22)$$
where
$$\hat h_k(\mathbf{x}_s) = \hat E\left[h_k(\mathbf{x}_{\backslash s}) \mid \mathbf{x}_s\right] \qquad (23)$$
with the functions $g_k$ and $h_k$ defined in (19). As with partial dependence functions (20), this can be viewed as a linear combination of the functions $g_k(\mathbf{x}_s)$, but with varying coefficients $\hat h_k(\mathbf{x}_s)$ that account for the associations between the variables in $\mathbf{x}_s$ and the complement variables $\mathbf{x}_{\backslash s}$. The coefficient functions can be estimated by any univariate smoother. Regression splines with knots at the percentiles are used in the examples below. Note that from (19), partial association functions (22) (23) reduce to partial dependence functions (20) when variables in $\mathbf{x}_s$ do not participate in interactions with variables in $\mathbf{x}_{\backslash s}$, or when they are independent of those in $\mathbf{x}_{\backslash s}$ with which they do interact. Otherwise they can produce different results, caused by the variation in the respective coefficient functions $\hat h_k(\mathbf{x}_s)$ induced by the associations between the corresponding interacting variables in $\mathbf{x}_s$ and $\mathbf{x}_{\backslash s}$.
Partial association functions can be used to assess the influence that associations among the predictor variables have on the corresponding partial dependences. In particular, one can substitute them (22) (23) for partial dependence functions (20) in any analysis. Similar results indicate that such influences are small.
4.3 Interaction detection
Partial dependence functions can be used to detect interaction effects between variables. For example, if a target function $F(\mathbf{x})$ contains no interaction between variables $x_j$ and $x_k$, its partial dependence on those variables is
$$PD_{jk}(x_j, x_k) = PD_j(x_j) + PD_k(x_k) \qquad (24)$$
(Friedman and Popescu 2008). If there is such an interaction, (24) represents the additive component and
$$I_{jk}(x_j, x_k) = PD_{jk}(x_j, x_k) - PD_j(x_j) - PD_k(x_k) \qquad (25)$$
represents the corresponding pure interaction component of the effect of $x_j$ and $x_k$ on $F(\mathbf{x})$ as reflected by its partial dependences.
More generally, the pure interaction effect involving a variable subset $\mathbf{x}_s$ is defined to be
$$I_s(\mathbf{x}_s) = PD_s(\mathbf{x}_s) - \sum_{u \subsetneq s} I_u(\mathbf{x}_u) \qquad (26)$$
where each $I_u(\mathbf{x}_u)$ is recursively defined as
$$I_u(\mathbf{x}_u) = PD_u(\mathbf{x}_u) - \sum_{v \subsetneq u} I_v(\mathbf{x}_v), \qquad I_{\{j\}}(x_j) = PD_j(x_j) \qquad (27)$$
This pure interaction component is its corresponding partial dependence with all lower order interactions among all its variable subsets removed. If there is no $n$-variable interaction effect involving $\mathbf{x}_s$, its pure interaction component $I_s(\mathbf{x}_s)$ is zero. Note that this result depends only on the properties of the function (5) and not on the data distribution. The strength of such an interaction effect can be taken to be the standard deviation of its pure interaction component over the training data, or other specified data distribution, divided by the standard deviation of the corresponding target function
$$H_s = \frac{\operatorname{std}\left[I_s(\mathbf{x}_s)\right]}{\operatorname{std}\left[F(\mathbf{x})\right]} \qquad (28)$$
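For the two-variable case, (25) and (28) can be computed directly from any black-box target by brute force; a minimal sketch (quadratic in the sample size, so for illustration only):

```python
import numpy as np

def pd_on(F, X, s, points):
    """Partial dependence of F on subset s, evaluated at `points`
    (Eq. 18 with the training data as the averaging sample)."""
    out = np.empty(len(points))
    for i in range(len(points)):
        Xz = X.copy()
        Xz[:, s] = points[i, s]
        out[i] = F(Xz).mean()
    return out - out.mean()

def pairwise_interaction_strength(F, X, j, k):
    """Eqs. (25) and (28): the pure interaction is PD_jk minus PD_j
    and PD_k; its strength is its standard deviation over the data
    relative to that of F."""
    pure = (pd_on(F, X, [j, k], X) - pd_on(F, X, [j], X)
            - pd_on(F, X, [k], X))
    return pure.std() / F(X).std()

F = lambda X: X[:, 0] * X[:, 1] + X[:, 2]
X = np.random.rand(300, 3)
print(pairwise_interaction_strength(F, X, 0, 1))  # substantial
print(pairwise_interaction_strength(F, X, 0, 2))  # near zero
```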
If there exists a higher order interaction effect (28) jointly involving the variables in subset $\mathbf{x}_s$ along with other variables, then the form of the interaction function $I_s(\mathbf{x}_s)$ of the subset depends on the joint values of those complement variables. The resulting unconditioned interaction effect is then an average over the joint distribution of the complement variables. If there are no such higher order interactions involving $\mathbf{x}_s$ jointly with other variables, its pure interaction effect does not depend on the data distribution and reflects only properties of the target function $F(\mathbf{x})$.
The measure (28) indicates the strength of an $n$-variable interaction effect involving the corresponding variable subset $\mathbf{x}_s$ over the data distribution. One can search for substantial interaction effects by evaluating (26)–(28) over all variable subsets up to some maximum size. This approach can in principle be applied to a target function represented in any form. All that is required to compute partial dependence functions is the value of the function at various specified values of $\mathbf{x}$.
Figure 3 shows the strengths (28) of all interaction effects uncovered in the synthetic data example of Section 3 using this approach. The presence of more lower order interaction effects shown here than explicitly represented by the function tree (Fig. 1) is due to lower level marginals of higher order basis functions (9) being assigned to their proper interaction level.
5 Computation
For a total of $p$ variables, the number of distinct subsets involving $n$ variables is $\binom{p}{n}$. For each such subset the number of partial dependence functions required to extract its pure interaction effect (26) (27) grows exponentially in its size $n$. Thus for larger values of $p$ and $n$ the computation required by the standard method (18) to evaluate all of the necessary partial dependence functions is generally too severe to allow a search for interaction effects by complete enumeration over all variable subsets. However, for functions represented by a function tree, one can employ (20) to dramatically reduce the computation (21) required to evaluate each partial dependence function. This in turn allows the search for interactions to be performed over many more and larger variable subsets representing higher order interaction effects.
For small to moderate numbers of variables $p$, complete enumeration using (20) is generally feasible. However, for larger problems the rapid growth with respect to both $p$ and $n$ soon places the required computation out of reach. Parallel computation and smart strategies for reusing previously computed partial dependence functions can somewhat ease this burden. However, the properties of partial dependence functions allow a simple input variable screening approach that can often considerably reduce the size of the search.
If for a (centered) function $F(\mathbf{x})$ a variable $x_j$ participates in no interactions with any other variables, then it can be expressed as
$$F(\mathbf{x}) = PD_j(x_j) + PD_{\backslash j}(\mathbf{x}_{\backslash j}) \qquad (29)$$
where $PD_j(x_j)$ is its partial dependence on $x_j$ and $PD_{\backslash j}(\mathbf{x}_{\backslash j})$ is its partial dependence on all other variables (Friedman and Popescu 2008). One can define an overall interaction strength for each predictor variable $x_j$ as
$$H_j = \frac{\operatorname{std}\left[F(\mathbf{x}) - PD_j(x_j) - PD_{\backslash j}(\mathbf{x}_{\backslash j})\right]}{\operatorname{std}\left[F(\mathbf{x})\right]} \qquad (30)$$
Variables with small values of $H_j$ can be removed from the search for interaction effects. This often excludes many variables, thereby substantially reducing computation.
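A sketch of this screen, continuing the example above and reusing pd_on (again brute force, for illustration only):

```python
def variable_interaction_strength(F, X, j):
    """Eqs. (29)-(30): if x_j interacts with nothing, the centered F
    decomposes as PD_j plus the PD on all other variables, and the
    residual below vanishes."""
    others = [c for c in range(X.shape[1]) if c != j]
    Fc = F(X) - F(X).mean()
    resid = Fc - pd_on(F, X, [j], X) - pd_on(F, X, others, X)
    return resid.std() / Fc.std()

print([round(variable_interaction_strength(F, X, j), 2) for j in range(3)])
# x_2 screens out; x_0 and x_1 remain candidates for the search
```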
Computation can be further reduced by taking advantage of the function tree representation. The function tree strength of the contribution of input variable $x_j$ to interaction level $i$ is taken to be
$$S_{ji} = \sum_{m=1}^{M} I(o_m = i)\; I(x_j \in B_m)\; \operatorname{Var}\left[B_m(\mathbf{x})\right] \qquad (31)$$
Here $I(\cdot)$ is an indicator of the truth of its argument, $m$ labels a node of the $M$-node function tree, $B_m(\mathbf{x})$ is its corresponding basis function, $o_m$ its interaction order, and the variance is over the data distribution. This can be used as a filter to exclude variables with small values of (31) from the search for $i$-variable interaction effects. Since node basis functions $B_m(\mathbf{x})$ are not pure interaction effect functions (26) (27), the search for $i$-variable interaction effects should include the union of all relevant variables with substantial contributions (31) at that interaction level or higher.
In many applications the number of variables involved in higher order interactions is less than that involved in main or lower order interaction effects. Since computation increases very rapidly with the number of variables examined, further variable reduction at the interaction level (31) can substantially reduce computation, even for minor variable reductions.
For the simulation example in Section 6.2 (below), exhaustive search over all variables at each of four interaction levels requires computing a very large number of partial dependence functions with corresponding target function evaluations using (18). This would take on the order of days on a Dell XPS 13 laptop. Approximating the search by only including the ten variables estimated to be the most important reduces the necessary number of partial dependence functions and target function evaluations; this would take approximately seven hours.
Partial dependence screening using (30) identifies six interacting variables, further reducing the number of partial dependence functions computed and the corresponding target function evaluations using (18). This reduces the computing time to the order of minutes. Building the function tree for this example required on the order of seconds. Corresponding interaction level screening using (31) eliminated all four-variable interactions. Six variables remained for both the two-variable and three-variable interaction searches, and ten variables for main effects. Employing (20) for the required partial dependence functions involved target function evaluations requiring on the order of seconds.
5.1 Interaction investigation
Once discovered, interactions as well as main effects can be visualized using partial dependence plots. Figure 4 shows a heat map representation of the pure interaction effect (26) (27) between one of the interacting pairs of variables for the function tree model (Figs. 1 & 2). Colors black, blue, purple, violet, red, orange, and yellow, seen in the lower left legend, represent a continuum of increasing function values from lowest to highest.
One sees that joint high or joint low values of these two variables produce relatively large increases in function value (red, yellow), whereas opposite values, (low, high) or (high, low), give rise to large decreases (blue, black). Since Fig. 3 shows no higher order interactions involving these variables, this function represents the corresponding interaction effect associated with the target function estimate and does not reflect the nature of the data distribution. One can verify this behavior by examining the true target function (10).
Two of the variables are seen in Fig. 3 to have a relatively weak interaction. However, they are also seen to participate in a substantial three-variable interaction with a third variable, so that their two-variable interaction function (26) (27) may not provide a complete description of their joint effect on the target estimate.
Figure 5 shows a plot of the average (unconditional) pure interaction function of these two variables (upper left), and corresponding interaction functions conditioned at three values of the third variable, all plotted on the same vertical scale. Here a black color represents the most negative values, yellow the most positive, and violet the smallest absolute values. The upper left frame shows that the variation of the unconditional interaction function is quite small (violet), as indicated in Fig. 3. The lower left frame shows the same for the first conditioning value. The two right frames, however, show very strong interaction effects at the other two values. There is almost no interaction between the two variables at the first conditioning value; at the other two there are strong but opposite interaction effects. This leads to an overall weak two-variable interaction when averaged over the distribution of the third variable (upper left). Seeing only this weak two-variable interaction effect without knowledge of the corresponding three-variable interaction would leave the impression that these two variables influence the target in an additive, unrelated manner. As seen in the right two frames of Fig. 5, this is clearly not the case. Thus, in the presence of higher order interactions, lower order representations can be misleading.
6 Illustrations
The function tree methodology described above was applied to a number of popular public data sets to investigate the nature of the relationship between the joint values of their predictor variables and outcome. In many cases the corresponding function tree uncovered simple relationships involving additive effects or at most a few two-variable interactions. Others were seen to involve more complex structure. Examples of both are illustrated here.
The goal of the function tree approach is interpretational. As such, its form (7) (8) is somewhat more restrictive than other more flexible methods focused purely on prediction accuracy, such as XGBoost (Chen and Guestrin 2016) or Random Forests (Breiman 2001). However, if on any particular problem its accuracy is substantially less than these other methods, the quality of the corresponding interpretation may be questionable. For each of the examples presented in this section, root-mean-squared prediction error
$$RMSE = \sqrt{\hat E\left[\left(y - \hat F(\mathbf{x})\right)^2\right]} \qquad (32)$$
is reported for default versions of the three methods: no tuning for Random Forests and only model size tuning for function trees and XGBoost. In all examples presented here the estimated interaction effects are based on partial dependence functions (Section 4.1). Corresponding summaries using partial association functions (Section 4.2) produced nearly identical results in all cases.
6.1 Capital bikeshare data
This data set, taken from the UCI Machine Learning Repository, consists of 17379 hourly observations of counts of bicycle rentals in the Capital bikeshare system in Washington, D.C. between 2011 and 2012. The outcome variable is the rental count. The sixteen predictor variables are both categorical and numeric, consisting of corresponding time, weather, and seasonal information.
Figure 6 displays the tree for the function tree model and Fig. 7 shows the corresponding node functions. Its interaction effect summary is shown in Fig. 8. Figure 8 indicates the existence of substantial two-variable and three-variable interactions. An important question is the extent to which these effects are properties of the target function and not just properties of a particular training sample. A way to address this is through a bootstrap analysis (Efron 1979). Figure 9 shows box plots of the distributions over fifty bootstrap replications of test set root-mean-squared error (32) of function tree models built under three constraints on maximum interaction order: unconstrained (left), no three-variable interactions allowed (center), and no interactions allowed (right). As seen, both two-variable and three-variable interactions are significant.
The largest estimated two-variable interaction effect is between variables hour-of-day (hr) and the binary indicator for working/non-working days (wrk). Figure 10 shows the partial dependence of DC bike rentals on time-of-day conditioned on working days (top) and non-working days (bottom). They are seen to be quite different, indicating a strong interaction effect between these two variables. In particular, the sharp reduction in rentals beginning at 8am and continuing until 5pm on working days is seen not to exist on non-working days.
The upper left frame of Fig. 11 shows the partial dependence of bike rentals jointly on adjusted temperature (atemp) and time-of-day (hr), along with its pure interaction component (top right). Figure 8 indicates a three-variable interaction between variables hour (hr), working day indicator (wrk) and adjusted temperature (atemp). Thus, the nature of the interaction effect between hr and atemp may depend on the value of wrk. The bottom two frames of Fig. 11 show this interaction effect separately conditioned on the two values of wrk. One sees substantial difference in the nature of this interaction for the different values of wrk, reflecting the presence of the three-variable interaction.
Figure 8 indicates that the strongest three-variable interaction effect involves the working day indicator (wrk), day of the week (wekd) and hour of the day (hr). This is illustrated in Fig. 12, where the interaction function of (wekd, wrk) is displayed conditioned on hr = midnight, 6am, noon, and 6pm. Note that there are no observations for which Saturday and Sunday are labeled as working days. The absence of a three-variable interaction would imply that the (wekd, wrk) interaction function is the same for all values of the variable hr. As seen in Fig. 12 this is not the case, especially for non-working weekdays at noon (lower left).
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.2 Simulated data
This example was presented in Hu et al. (2023). They generated data using the target function
(33)
This is a function of ten variables involving numerous interactions of up to three variables. The response was simulated as
$$y = F(\mathbf{x}) + \varepsilon \qquad (34)$$
There were twenty predictor variables. The first ten were simulated from a multivariate Gaussian distribution with equal correlation between all pairs. Ten additional variables were included that were independent of the first ten (irrelevant variables). They were also simulated from a multivariate Gaussian distribution with equal correlation between all pairs. All predictors were truncated to lie within a fixed interval.
Figure 13 shows the largest estimated main and interaction effect strengths up through four variables. One sees that all of the main and interaction effects represented in (33) are shown with substantial strengths. Those not represented in (33), including all four-variable interactions, either do not appear or do so with very low estimated strength.
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.3 Pumadyn data
The other examples were chosen to represent data with somewhat complex target function structure. Here we illustrate on a data set where the function tree uncovers very simple structure. The data (Corke 1996) are produced from a realistic simulation of the dynamics of a Puma 560 robot arm. The task is to predict the angular acceleration of one of the robot arm's links. There are 8192 observations involving 32 input variables that include angular positions, velocities and torques of the robot arm.
Figure 14 shows the entire resulting function tree model. The upper left frame shows the tree and the other frames show the node functions. Figure 15 shows the (only) interaction effect, between tau4 and theta5. The upper left frame shows the partial dependence function; the upper right frame shows its additive component and the lower left frame its pure interaction effect. The lower right frame represents the partial dependence as a perspective mesh plot. Of the 32 input variables, the function tree model selected only two of them to explain 97% of the variance of the target.
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.4 SGEMM GPU kernel performance
This data set from the UCI Machine Learning Repository measures the running time required for a matrix-matrix product using a parameterizable SGEMM GPU kernel. There are 241600 observations with 14 predictor variables:
[Table: SGEMM GPU kernel performance predictor variables]
As suggested by the authors, the outcome was taken to be the logarithm of the running time. All predictor variables realize at most four distinct values and were therefore treated as being categorical. Figure 16 shows the estimated interaction effect profile for this data. Among the larger effects are four main effects, seven two-variable, four three-variable, and a large four-variable interaction effect.
Figure 17 shows box plots of the distributions over ten bootstrap replications of test set root-mean-squared error (32) of function tree models built under three interaction constraints: unconstrained (left), the four-variable interaction among variables (mwg, nwg, mdimc, ndimc) prohibited (center), and all four-variable interactions prohibited (right). Note that disallowing the four-variable interaction among (mwg, nwg, mdimc, ndimc) does not preclude any of these variables from participating in lower order interactions, or in four-variable interactions with other variables. One sees that incorporating the (mwg, nwg, mdimc, ndimc) four-variable interaction is very important to model accuracy, as are other four-variable interactions to a lesser extent.
Since the (mwg, nwg, mdimc, ndimc) interaction effect is inherently four dimensional, it cannot be completely represented by a two dimensional display. Unlike three-variable interactions, it cannot be represented by a series of two dimensional plots conditioned on the value of a third variable. However, it is possible to gain some insight about this four-variable interaction effect through graphical methods.
Figure 18 depicts the difference between interaction effects computed from two models
$$\Delta(\mathbf{x}) = I^{(1)}(\mathbf{x}) - I^{(2)}(\mathbf{x}) \qquad (35)$$
The first is the full model with all interactions allowed (Fig. 17 left). The second model was built under the constraint that no four-variable interaction effect among variables (mwg, nwg, mdimc, ndimc) was allowed (Fig. 17 center). Thus, any structure detected in the difference (35) is due to the presence of this four-variable interaction.
Figure 18 shows nine bivariate interaction effect differences (35) between these two models, conditioned on joint values of (mwg, nwg). For the top two rows of mdimc values, the model differences are seen to be small in absolute value (violet) for all joint values of the other variables, indicating that the four-variable interaction has little to no effect for those collective values. For the last row (largest mdimc) one sees that the absolute differences are larger (red, yellow, blue), especially for the largest value of ndimc. This four-variable interaction effect is seen to be mainly the result of a very large increase in (log) running time for the GPU matrix product when simultaneously mdimc and ndimc realize their largest values while at the same time mwg and nwg are at their smallest values.
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.5 Global surrogate
Function trees can be used as interpretable surrogate models to investigate the predictions of black box models. One simply uses the black box predictions along with the corresponding predictor variables as input. To the extent that the function tree fit is accurate, all of its interpretational tools can then be applied to study the nature of the black box model. This is illustrated using the simulated data of Hu et al. (2023), shown in Section 6.2, where the true underlying structure (33) is known.
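The two-stage pattern is simple. The sketch below uses the real xgboost Python package for the black box, while fit_function_tree is a hypothetical stand-in for the (non-public) function tree procedure; any interpretable fitter, e.g. the forward-stepwise sketch of Section 3.1, could be substituted:

```python
import numpy as np
from xgboost import XGBRegressor

# Stage 1: fit the black box to the raw data (toy target for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 10))
y = X[:, 0] * X[:, 1] + np.sin(np.pi * X[:, 2]) + rng.normal(0, 0.1, 5000)
black_box = XGBRegressor(n_estimators=300, max_depth=4).fit(X, y)

# Stage 2: train the surrogate on the black box's *predictions*, not on y,
# so it approximates the model rather than the data.
y_model = black_box.predict(X)
# surrogate = fit_function_tree(X, y_model)   # hypothetical stand-in
```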
First, XGBoost is applied in regression mode to model the data (34), producing an estimate of the target function. The resulting test data root-mean-squared error
$$RMSE_{test} = \sqrt{\frac{1}{N_{test}} \sum_{i \in test} \left(y_i - \hat F(\mathbf{x}_i)\right)^2} \qquad (36)$$
was recorded. The output of the XGBoost model was then fit with a function tree; the error (36) for this fit indicated that the function tree accurately reflects the XGBoost model. Figure 19 displays the resulting largest main and interaction effects detected by the function tree in the XGBoost model. This can be compared to the function tree based on the original data shown in Fig. 13, as well as to the structure of the true target function (33). One sees that the XGBoost regression model here closely captures the true target function structure as reflected in its function tree representation.
Next, a Random Forest is applied to the same data, and the output of this random forest model was fit by a function tree with a corresponding error (36). Figure 20 shows the largest main and interaction effects detected in the Random Forest model. While capturing the nature of the target function reasonably well, the random forest appears to somewhat underestimate the interaction strengths and to identify several spurious three-variable interactions.
Finally, we apply the surrogate approach in a classification setting. The outcome variable is binary, with log-odds function given by (33). Logistic XGBoost was applied to this data, producing a log-odds estimate. The resulting root-mean-squared error (36) is in this case much larger, owing to the loss of information associated with providing the learning algorithm with only the sign of the outcome rather than its actual numerical value. This logistic XGBoost model was then fit with a function tree; although the resulting error is larger than in the regression case, this still represents a fairly close fit. The result is shown in Fig. 21. As would be expected, this lower accuracy logistic XGBoost model less perfectly captures the true target function (33). It uncovers all of the true main and interaction effects. It also indicates a number of spurious main and two-variable interactions at somewhat lower strength. The strength of the three-variable interaction is underestimated and there are several spurious three-variable interactions indicated at comparable strength. In spite of its lower accuracy, the logistic XGBoost model is still seen in Fig. 21 to reflect much of the intrinsic structure of the target log-odds (33).
7 Previous work
The origins of the function tree approach lie with the additive modeling procedure of Hastie and Tibshirani (1990). They used nonparametric univariate smoothers and backfitting to produce models with no interactions. Function trees can be viewed as generalizing that method to discover and include unspecified interaction effects.
The closest predecessor to function trees is MARS (Friedman 1991). It models a general multivariate function as a sum of products of simple basic univariate functions of the predictor variables and can in principle model interaction effects to high order. The basic univariate function used with MARS is a dReLU (double ReLU)
$$f(x) = a\,(t - x)_+ + b\,(x - t)_+ \qquad (37)$$
characterized by two slope parameters $a$, $b$ and knot location $t$. The MARS forward stepwise modeling strategy is similar to that used with the function tree approach.
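In Python the dReLU (37), under the parameterization assumed above, is one line:

```python
import numpy as np

def drelu(x, a, b, t):
    """dReLU basic function (37): slope a to the left of the knot t,
    slope b to the right (assumed parameterization)."""
    return a * np.maximum(t - x, 0.0) + b * np.maximum(x - t, 0.0)
```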
The principal difference between MARS and function trees involves their basic building blocks. MARS employs the simple dReLU (37) at every step. Function trees employ more general functions of the predictor variables estimated by user selected smoothers. This allows customizing the estimation method to the nature of each individual predictor variable distribution. It also produces more parsimonious models (smaller trees), since many dReLUs (37) can be required to approximate even simple curves.
A second difference with MARS is that function trees allow different functions of the same variable to appear multiple times in the products of a single basis function (8), as seen in Fig. 6. This provides more flexibility by incorporating stepwise multiplicative as well as additive modeling. Besides increased accuracy, this further reduces model size. In addition to being more interpretable, smaller models allow faster computation of their partial dependence functions.
The use of partial dependence functions to uncover interaction effects was proposed by Friedman and Popescu (2008). The innovation here is their rapid identification and computation afforded by the function tree representation. This allows a comprehensive search for all interaction effects up to high orders. Other techniques that have been proposed for interaction detection are mostly based on the functional ANOVA decomposition (see Roosen 1995, Hooker 2004 & 2007, Kim et al. 2009, Lengerich et al. 2020, Hu et al. 2023, and Walters et al. 2023). So far, computational considerations have limited these approaches to two-variable interactions. Lou et al. (2013) use a bivariate partitioning method to screen for two-variable interactions. Tang et al. (2016) combine the functional ANOVA technique with the polynomial dimensional decomposition to reduce computation with independent variables.
8 Discussion
While sometimes competitive in accuracy with other more flexible methods such as XGBoost and Random Forests, the focus of the function tree approach is on interpretation. The goal is to provide a representation of the target function that exposes its interaction effects and provides a framework for their rapid calculation, especially those involving more than two variables. Almost all research into interaction detection to date has been limited to that involving just two variables. In fact, in many settings the unqualified term "interaction effect" is meant to refer to two variables only.
Focusing only on two-variable interactions is natural because it reduces the size of the search, and once interactions are detected their functional forms are easily examined by traditional graphical methods such as heat maps, contour plots, or perspective mesh plots. Higher order interactions involve more variables and their higher dimensional structures are not as easily represented. As illustrated in Figs. 5, 11 and 12, three-variable interactions can be visualized by viewing a series of bivariate interaction plots conditioned on selected values of another variable. As seen in Fig. 18, interactions involving more variables can be investigated by contrasting models with and without those interactions included.
In addition to their direct interpretational value, knowledge of the existence and strengths of higher order interactions can be important because they place interpretational limits on the nature of those of lower order involving the same variables. As noted above, the functional form of an interaction effect involving a subset of variables depends on the value of another variable if there exists a higher order interaction involving both that subset and the other variable. If there are no such substantial higher order interactions, the functional form of the interaction is well defined and represents the isolated joint contribution of those variables to the target function. If such higher order interactions do exist, then the form of the lower order interaction is not well defined, as it depends on the values of the other interacting variables, and its corresponding functional form becomes an average over their joint distribution. As seen for example in Fig. 5, this can lead to highly misleading interpretations. Tan et al. (2023) discuss problems interpreting lower order effects in the presence of higher order interactions.
In applications involving training data with very large absolute correlations among subsets of predictor variables, main and interaction effects of various levels involving those variables tend not to be separable. This can cause substantial spurious interaction effects to be reported. These can be detected by comparing the lack-of-fit (32) for models constructed with and without the questionable effects included, as illustrated in Figs. 9 and 17.
A popular interpretational tool used to investigate predictive models is a measure of the impact or importance of the respective individual input variables on model predictions. There are a wide variety of definitions of variable importance, each providing different information. See Molnar (2023) for a comprehensive survey. In the presence of interactions the contribution of a given predictor variable depends on the values of the other variables with which it interacts. Interaction effect summaries (e.g. Figs. 8, 13 and 16) are thus more comprehensive than corresponding variable importance summaries. Variable importances can be derived from interaction based functional decompositions. For example, Gevaert and Saeys (2022) and Walters et al. (2023) use them to derive Shapley (1953) values.
In the examples presented here, partial dependence functions were used to both detect and examine interaction effects. Computational considerations largely dictate their use for the former. However, once uncovered, identified main and interaction effects can be examined by any appropriate method. For example, accumulated local effects (ALE) functions (Apley and Zhu 2020) can be employed for this purpose. Also, partial dependence based search results can be used to guide methods for constructing functional ANOVA decompositions.
References
[1] Apley, D. and Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. J. Roy. Statist. Soc. Series B 82(4), 1059–1086.
[2] Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
[3] Chen, T. and Guestrin, C. (2016). XGBoost: a scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 785–794.
[4] Corke, P. I. (1996). A robotics toolbox for MATLAB. IEEE Robotics and Automation Magazine 3, 24–32.
[5] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1–26.
[6] Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76, 817–823.
[7] Friedman, J. (1991). Multivariate adaptive regression splines. Ann. Statist. 19, 1–67.
[8] Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Ann. Statist. 29, 1189–1232.
[9] Friedman, J. and Popescu, B. (2008). Predictive learning via rule ensembles. Ann. Appl. Statist. 2, 916–954.
[10] Gevaert, A. and Saeys, Y. (2022). PDD-SHAP: fast approximations for Shapley values using functional decomposition. arXiv:2208.12595 [cs.LG].
[11] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. CRC Press.
[12] Hooker, G. (2004). Discovering additive structure in black box functions. Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
[13] Hooker, G. (2007). Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comp. and Graph. Statist. 16, 709–732.
[14] Hu, L., Nair, V. J., Sudjianto, A., Zhang, A. and Chen, J. (2023). Interpretable machine learning based on functional ANOVA framework: algorithms and comparisons. arXiv:2305.15670 [stat.ML].
[15] Irizarry, R. A. (2019). Introduction to Data Science. Chapman & Hall.
[16] Kim, Y., Kim, J., Lee, S. and Kwon, S. (2009). Boosting on the functional ANOVA decomposition. Statistics and Its Interface 2, 361–368.
[17] Lengerich, B., Tan, S., Chang, C., Hooker, G. and Caruana, R. (2020). Purifying interaction effects with the functional ANOVA: an efficient algorithm for recovering identifiable additive models. Proc. 23rd Int. Conf. on Artificial Intelligence and Statistics (AISTATS).
[18] Lou, Y., Caruana, R., Gehrke, J. and Hooker, G. (2013). Accurate intelligible models with pairwise interactions. Proc. 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
[19] Molnar, C. (2023). Interpretable Machine Learning. Lean Publishing.
[20] Roosen, C. (1995). Visualization and exploration of high-dimensional functions using the functional ANOVA decomposition. PhD thesis, Dept. of Statistics, Stanford University.
[21] Shapley, L. (1953). A value for n-person games. Contributions to the Theory of Games 2(28), 307–317.
[22] Tan, S., Hooker, G., Koch, P., Gordo, A. and Caruana, R. (2023). Considerations when learning additive explanations for black-box models. Machine Learning 112, 3333–3359.
[23] Tang, K., Congedo, P. and Abgrall, R. (2016). Adaptive surrogate modeling by ANOVA and sparse polynomial dimensional decomposition for global sensitivity analysis in fluid simulation. J. Comput. Physics 314, 557–589.
[24] Walters, B., Ortega-Martorell, S., Olier, I. and Lisboa, P. J. G. (2023). How to open a black box classifier for tabular data. Algorithms 16, 181.