Function Trees: Transparent Machine Learning (2024)

Jerome H. Friedman, Department of Statistics, Stanford University, Stanford, CA 94305, USA (jhf@stanford.edu)

Abstract

The output of a machine learning algorithm can usually be represented by one or more multivariate functions of its input variables. Knowing the global properties of such functions can help in understanding the system that produced the data as well as interpreting and explaining corresponding model predictions. A method is presented for representing a general multivariate function as a tree of simpler functions. This tree exposes the global internal structure of the function by uncovering and describing the combined joint influences of subsets of its input variables. Given the inputs and corresponding function values, a function tree is constructed that can be used to rapidly identify and compute all of the function's main and interaction effects up to high order. Interaction effects involving up to four variables are graphically visualized.

1 Introduction

A fundamental exercise in machine learning is the approximation of a function of several to many variables given values of the function, often contaminated with noise, at observed joint values of the input variables. The result can then be used to estimate unknown function values given corresponding inputs. The goal is to accurately estimate the underlying (non-noisy) outcome values since the noise is by definition unpredictable. To the extent that this is successful, the estimated function may, in addition, be used to try to understand underlying phenomena giving rise to the data. Even when prediction accuracy is the dominant concern, being able to comprehend the way in which the input variables jointly combine to produce predictions may lead to important sanity checks on the validity of the function estimate. Besides accuracy, the success of this latter exercise requires that the structure of the function estimate be represented in a comprehensible form.

It is well known, and often commented, that the most accurate function approximation methods to date tend not to provide comprehensible results. The function estimate is encoded in an obtuse form that obscures potentially recognizable relationships among the inputs that give rise to various function output values. This is especially the case when the function is not inherently low dimensional; that is, when medium to large subsets of the input variables act together to influence the function $F(\mathbf{x})$ in the sense that their combined contribution cannot be represented by a combination of smaller subsets of those variables (interaction effects). This has led to incomplete interpretations based on low dimensional approximations such as additive modeling with no interactions, or models restricted to at most two-variable interactions.

2 Interaction effects

For input variables $\mathbf{x}=(x_{1},x_{2},\cdots,x_{p})$ a function $F(\mathbf{x})$ is said to exhibit an interaction between two of them, $x_{j}$ and $x_{k}$, if the difference in the value of $F(\mathbf{x})$ as a result of changing the value of one of them, $x_{j}$, depends on the value of the other, $x_{k}$. This means that in order to understand the effect of these two variables on $F(\mathbf{x})$ they must be considered together and cannot be studied separately. An interaction effect between variables $x_{j}$ and $x_{k}$ implies that the second derivative of $F(\mathbf{x})$ jointly with respect to $x_{j}$ and $x_{k}$ is not zero for at least some values of $\mathbf{x}$. That is,

$$E_{\mathbf{x}}\left[\frac{\partial^{2}F(\mathbf{x})}{\partial x_{j}\,\partial x_{k}}\right]^{2}>0\,,\qquad(1)$$

with an analogous expression involving finite differences for categorical variables. If there is no interaction between these variables, the function $F(\mathbf{x})$ can be expressed as a sum of two functions, one that does not depend on $x_{j}$ and the other that is independent of $x_{k}$:

$$F(\mathbf{x})=f_{\backslash j}(\mathbf{x}_{\backslash j})+f_{\backslash k}(\mathbf{x}_{\backslash k})\,.\qquad(2)$$

Here $\mathbf{x}_{\backslash j}$ and $\mathbf{x}_{\backslash k}$ respectively represent all variables except $x_{j}$ and $x_{k}$. If a given variable $x_{j}$ interacts with no other variable, then the function can be expressed as

$$F(\mathbf{x})=f_{j}(x_{j})+f_{\backslash j}(\mathbf{x}_{\backslash j})\qquad(3)$$

where the first term on the right is a function only of $x_{j}$ and the second is independent of $x_{j}$. In this case $F(\mathbf{x})$ is said to be "additive" in variable $x_{j}$, and the univariate function $f_{j}(x_{j})$ can be examined to study the effect of $x_{j}$ on the function $F(\mathbf{x})$ independently from the effects of the other variables.
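As a concrete illustration of criterion (1), one can estimate the mixed partial derivative by finite differences and average its square over sampled points. The sketch below (the helper name and test function are illustrative, not from the paper) flags an interaction between $x_{0}$ and $x_{1}$ but not between $x_{0}$ and $x_{2}$:

```python
import numpy as np

def mixed_diff(F, x, j, k, h=1e-4):
    """Second-order mixed finite difference of F at x w.r.t. coordinates j, k."""
    ej = np.zeros_like(x); ej[j] = h
    ek = np.zeros_like(x); ek[k] = h
    return (F(x + ej + ek) - F(x + ej) - F(x + ek) + F(x)) / h**2

# Illustrative target: an x0*x1 interaction, additive in x2.
F = lambda x: x[0] * x[1] + np.sin(x[2])

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
# Monte Carlo version of (1): mean squared mixed partial over sampled x.
print(np.mean([mixed_diff(F, p, 0, 1)**2 for p in pts]))  # approx. 1 -> interaction
print(np.mean([mixed_diff(F, p, 0, 2)**2 for p in pts]))  # approx. 0 -> additive
```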

Higher order interactions are analogously defined. A function $F(\mathbf{x})$ is said to have an $n$-variable interaction effect involving variables $\{x_{j}\}_{1}^{n}$ if

$$E_{\mathbf{x}}\left[\frac{\partial^{n}F(\mathbf{x})}{\partial x_{1}\,\partial x_{2}\cdots\partial x_{n}}\right]^{2}>0\,,\qquad(4)$$

again with an analogous expression involving finite differences for categorical variables. The existence of such an $n$-variable interaction implies that the effect of the corresponding variables $\{x_{j}\}_{1}^{n}$ on the function $F(\mathbf{x})$ cannot be decomposed into a sum of functions each involving subsets of those variables. If there is no such interaction, the contribution of variables $\{x_{j}\}_{1}^{n}$ to the variation of $F(\mathbf{x})$ can be decomposed into a sum of functions each not depending upon one of these respective variables:

$$F(\mathbf{x})=\sum_{j=1}^{n}f_{\backslash j}(\mathbf{x}_{\backslash j})\,.\qquad(5)$$

If none of the variables in a subset $s=\{x_{j}\}_{1}^{n}$ interact with any of the variables in its complement set $\backslash s=\{x_{j}\}_{n+1}^{p}$, then the function is additive in the variable subset $s$:

$$F(\mathbf{x})=f_{s}(\mathbf{x}_{s})+f_{\backslash s}(\mathbf{x}_{\backslash s})\qquad(6)$$

and one can study the effect of $\{x_{j}\}_{1}^{n}$ on the function $F(\mathbf{x})$ separately from that of the other variables.

The notion of interaction effects can be useful for understanding a function $F(\mathbf{x})$ if it is dominated by those of low order involving variable subsets $s\subset\{x_{j}\}_{1}^{p}$ with small cardinality ($|s|\ll p$). In this case a target function of many variables can be understood in terms of a collection of lower dimensional functions, each depending on a relatively small, different subset of the variables. It is generally easier to comprehend the lower dimensional structure associated with fewer variables. Also, a common regularization method is to more heavily penalize or forbid solutions involving higher order interactions. Although often useful, this can result in lower accuracy and/or misleading interpretations when $F(\mathbf{x})$ involves substantial interaction effects of higher order than that assumed. As seen below, the existence of such higher order interactions can mislead interpretation by causing the functional form of those of lower order to depend on values of other unknown variables.

[Figure 1: Function tree.]
[Figure 2: Node functions.]

3 Function trees

A function tree is a way of representing a multivariate function so as to uncover and estimate its interaction effects of various orders. The input function $F(\mathbf{x})$ is defined by data $\{\mathbf{x}_{i},y_{i}\}_{1}^{N}$ with $\mathbf{x}$ representing multivariate evaluation points and $y$ the corresponding function values, perhaps contaminated with noise. The output of the procedure is a set of univariate functions, each one depending on a single selected input variable. These functions are arranged as the non-root nodes of a tree. A constant function is associated with the root. The tree provides a prescription for combining these univariate functions to form the multivariate function approximation $\hat{F}(\mathbf{x})$.

Figure 1 shows one such function tree derived from simple simulated data for illustration. Figure 2 shows the nine non-constant functions corresponding to each of the respective non-root nodes of the tree. In addition to its associated univariate function (Fig. 2), each tree node $k$ represents a basis function $B_{k}(\mathbf{x})$. The sum of these basis functions forms the final multivariate approximation

$$\hat{F}_{K}(\mathbf{x})=B_{0}+\sum_{k=1}^{K}B_{k}(\mathbf{x})\,.\qquad(7)$$

Each node's basis function is the product of its associated univariate function and those of the nodes on the path from it to the root:

$$B_{k}(\mathbf{x})=\prod_{l\in p(k)}f_{l}(x_{j(l)})\,.\qquad(8)$$

Here $p(k)$ represents those nodes $l$ on the path from node $k$ to the root and $j(l)$ labels the input variable associated with each such node $l$. The darkness of each node's shading in Fig. 1 corresponds to the estimated relative influence of its basis function on the final function estimate as measured by its standard deviation over the training data.
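A minimal sketch of this representation may help fix ideas. The container below is a hypothetical data structure, not the paper's implementation: it stores the root constant $B_{0}$ plus a parent pointer, a variable index $j(k)$, and a univariate function for each node, and evaluates (7) and (8) directly.

```python
import numpy as np

class FunctionTree:
    """Hypothetical function-tree container: node 0 is the root with constant
    B0; each non-root node k stores a parent index, an input variable index
    var[k] = j(k), and a univariate function f[k]."""
    def __init__(self, b0):
        self.b0 = b0
        self.parent = [None]   # parent[k]: index of k's parent (root has none)
        self.var = [None]      # var[k]: input variable j(k) of node k
        self.f = [None]        # f[k]: univariate function of x_{j(k)}

    def add_node(self, parent, var, f):
        self.parent.append(parent)
        self.var.append(var)
        self.f.append(f)
        return len(self.f) - 1

    def basis(self, k, x):
        """B_k(x): product of node functions on the path from k to the root, cf. (8)."""
        b = 1.0
        while k != 0:
            b *= self.f[k](x[self.var[k]])
            k = self.parent[k]
        return b

    def predict(self, x):
        """F_hat(x) = B0 + sum_k B_k(x), cf. (7)."""
        return self.b0 + sum(self.basis(k, x) for k in range(1, len(self.f)))

# Example: F_hat(x) = 1 + x0^2 + sin(x1) + sin(x1)*cos(x2)
t = FunctionTree(b0=1.0)
n1 = t.add_node(0, 0, lambda v: v**2)  # additive basis x0^2
n2 = t.add_node(0, 1, np.sin)          # basis sin(x1)
n3 = t.add_node(n2, 2, np.cos)         # basis sin(x1)*cos(x2): 2-variable interaction
print(t.predict(np.array([0.5, 1.0, 2.0])))
```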

The interaction level of each basis function (8) is the number of univariate functions of different variables along the path from its corresponding node to the root of the tree. Products of univariate functions involving $n$ different variables satisfy (4). Note that different univariate functions of the same variable may appear multiple times along the same path. While potentially improving the model fit, they do not increase the interaction level. With this approach, interactions among a specified subset $s$ of the variables are modeled by sums of products of univariate functions of those variables:

$$J_{s}(\{x_{j}\}_{j\in s})=\sum_{k=1}^{K_{s}}\prod_{j\in s}f_{jk}(x_{j})\,.\qquad(9)$$

The function tree model shown in Figs. 1 & 2 was obtained by applying the procedure described below to a simulated data example. There are $10000$ observations with outcome variables generated as $y=F(\mathbf{x})+\varepsilon$ with $\mathbf{x}\sim N^{8}(0,0.5)$ and

$$F(\mathbf{x})=4\,\sin(\pi x_{1})\cos(\pi x_{2})+7\,x_{3}^{2}+15\,(x_{4}+0.4)(x_{5}-0.6)(x_{6}+0.2)+5\,\sin(\pi(x_{7}+0.1)\,x_{8})\,.\qquad(10)$$

The noise is generated as $\varepsilon\sim N(0,\mathrm{var}(F)/4)$, producing a $2/1$ signal/noise ratio. This target function has an additive dependence on the third variable $x_{3}$, separate two-variable interactions between $(x_{1},x_{2})$ and $(x_{7},x_{8})$ respectively, and a trilinear three-variable interaction among $(x_{4},x_{5},x_{6})$.
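For reference, data of this form can be generated as follows. This is a sketch; reading $N^{8}(0,0.5)$ as eight i.i.d. normals with standard deviation $0.5$ is an assumption about the notation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 10000, 8
# x ~ N^8(0, 0.5), read here as i.i.d. normals with sd 0.5 (an assumption;
# 0.5 could alternatively denote the variance).
X = rng.normal(0.0, 0.5, size=(N, p))

def F(X):
    """Target function (10)."""
    x1, x2, x3, x4, x5, x6, x7, x8 = X.T
    return (4*np.sin(np.pi*x1)*np.cos(np.pi*x2) + 7*x3**2
            + 15*(x4 + 0.4)*(x5 - 0.6)*(x6 + 0.2)
            + 5*np.sin(np.pi*(x7 + 0.1)*x8))

f = F(X)
# eps ~ N(0, var(F)/4): noise sd is half the signal sd, a 2/1 signal/noise ratio.
y = f + rng.normal(0.0, np.sqrt(f.var()/4), size=N)
```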

The first node of the tree is seen to incorporate the pure additive effect of $x_{3}$. Nodes 2-5 model the $(x_{4},x_{5},x_{6})$ three-variable interaction. The two-variable interaction between $x_{1}$ and $x_{2}$ is modeled by nodes 6 and 7, and that of $x_{7}$ and $x_{8}$ is modeled by nodes 8 and 9. The nine univariate functions (Fig. 2) combined as specified by the tree (Fig. 1) produce a multivariate function that explains 97% of the variance of the target (10).

3.1 Tree construction

Function trees are constructed in a standard forward stepwise, best-first manner. Initially the tree consists of a single root node representing a constant function $B_{0}$. At the $K$th step there are $K$ basis functions (7) (8) in the current model $\hat{F}_{K}(\mathbf{x})$. The $(K+1)$st is taken to be

$$B_{K+1}(\mathbf{x})=B_{k^{\ast}}(\mathbf{x})\,f^{\ast}(x_{j^{\ast}})\qquad(11)$$

where $j^{\ast}$, $k^{\ast}$, and $f^{\ast}$ are the solution to

$$(j^{\ast},k^{\ast},f^{\ast})=\arg\min_{(j,k,f)}\hat{E}\left(y-\hat{F}_{K}(\mathbf{x})-B_{k}(\mathbf{x})\,f(x_{j})\right)^{2}\,.\qquad(12)$$

Here $B_{k}(\mathbf{x})$ is one of the basis functions in the current model, $f(x_{j})$ is a function of one of the predictor variables $x_{j}$, and the empirical expected value is over the data distribution. The model is then updated

$$\hat{F}_{K+1}(\mathbf{x})=\hat{F}_{K}(\mathbf{x})+B_{k^{\ast}}(\mathbf{x})\,f^{\ast}(x_{j^{\ast}})\qquad(13)$$

for the next iteration. In terms of tree construction the update consists of adding a daughter node to current node $k^{\ast}$ with associated function $f^{\ast}(x_{j^{\ast}})$. Iterations continue until the goodness-of-fit stops improving, as measured on an independent test data sample.
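A naive rendering of one forward step (11)-(12) is sketched below. The binned weighted-mean smoother is a crude stand-in for the univariate estimators of Section 3.3, and all names (`smooth_weighted`, `forward_step`) are illustrative, not the paper's implementation:

```python
import numpy as np

def smooth_weighted(x, target, w2, nbins=20):
    """Crude univariate smoother: weighted (w2) bin means of target vs. x,
    returned as a step function.  Stands in for the estimators of Sec. 3.3."""
    edges = np.quantile(x, np.linspace(0, 1, nbins + 1))
    def bin_of(v):
        return np.clip(np.searchsorted(edges, v, side="right") - 1, 0, nbins - 1)
    num = np.bincount(bin_of(x), weights=w2 * target, minlength=nbins)
    den = np.bincount(bin_of(x), weights=w2, minlength=nbins) + 1e-12
    vals = num / den
    return lambda v: vals[bin_of(v)]

def forward_step(X, r, basis_values):
    """One step of (12): over all (node k, variable j) pairs, fit f by the
    weighted conditional expectation (16) with w = B_k(x), and return the
    (j*, k*, f*) giving the smallest residual sum of squares."""
    best, best_rss = None, np.inf
    for k, w in enumerate(basis_values):        # w[i] = B_k(x_i)
        safe_w = np.where(w == 0, 1e-12, w)
        for j in range(X.shape[1]):
            f = smooth_weighted(X[:, j], r / safe_w, w**2)
            rss = np.mean((r - w * f(X[:, j]))**2)
            if rss < best_rss:
                best, best_rss = (j, k, f), rss
    return best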

3.2 Backfitting

As with all tree based methods, smaller trees (fewer nodes) tend to be more easily understood. Here tree size can be reduced and accuracy improved by increased function optimization (backfitting) at each step (Friedman and Stuetzle 1981). This is implemented here by updating the functions associated with all current tree nodes in the presence of the newly added one, and those previously updated. In particular, after adding the new $(K+1)$st node, all node functions are re-estimated, $f_{k}(x_{j(k)})\leftarrow f_{k}^{\ast}(x_{j(k)})$, one at a time in order starting from the first:

fkβˆ—β’(xj⁒(k))=arg⁑minf~⁑E^𝐱⁒[yβˆ’F^K+1⁒(𝐱)βˆ’Bk⁒(𝐱)⁒(f~⁒(xj⁒(k))/fk⁒(xj⁒(k))βˆ’1)]2⁒,superscriptsubscriptπ‘“π‘˜βˆ—subscriptπ‘₯π‘—π‘˜subscript~𝑓subscript^𝐸𝐱superscriptdelimited-[]𝑦subscript^𝐹𝐾1𝐱subscriptπ΅π‘˜π±~𝑓subscriptπ‘₯π‘—π‘˜subscriptπ‘“π‘˜subscriptπ‘₯π‘—π‘˜12,f_{k}^{\ast}(x_{j(k)})=\arg\min_{\tilde{f}}\,\hat{E}_{\mathbf{x}}\left[y-\hat{%F}_{K+1}(\mathbf{x})-B_{k}(\mathbf{x})\left(\tilde{f}(x_{j(k)})/\,f_{k}(x_{j(k%)})-1\right)\right]^{2}\text{,}italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT βˆ— end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) = roman_arg roman_min start_POSTSUBSCRIPT over~ start_ARG italic_f end_ARG end_POSTSUBSCRIPT over^ start_ARG italic_E end_ARG start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT [ italic_y - over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_K + 1 end_POSTSUBSCRIPT ( bold_x ) - italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_x ) ( over~ start_ARG italic_f end_ARG ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) / italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) - 1 ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,(14)
F^K+1⁒(𝐱)←F^K+1⁒(𝐱)+Bk⁒(𝐱)⁒(fβˆ—β’(xj⁒(k))/fk⁒(xj⁒(k))βˆ’1)⁒,⁒k=1,2,β‹―,(K+1)⁒.formulae-sequence←subscript^𝐹𝐾1𝐱subscript^𝐹𝐾1𝐱subscriptπ΅π‘˜π±superscriptπ‘“βˆ—subscriptπ‘₯π‘—π‘˜subscriptπ‘“π‘˜subscriptπ‘₯π‘—π‘˜1,π‘˜12⋯𝐾1.\hat{F}_{K+1}(\mathbf{x})\leftarrow\hat{F}_{K+1}(\mathbf{x})+B_{k}(\mathbf{x})%\left(f^{\ast}(x_{j(k)})/f_{k}(x_{j(k)})-1\right)\text{,}k=1,2,\cdot\cdot\cdot,(K+1)\text{.}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_K + 1 end_POSTSUBSCRIPT ( bold_x ) ← over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_K + 1 end_POSTSUBSCRIPT ( bold_x ) + italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_x ) ( italic_f start_POSTSUPERSCRIPT βˆ— end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) / italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) - 1 ) , italic_k = 1 , 2 , β‹― , ( italic_K + 1 ) .(15)

Here kπ‘˜kitalic_k labels a node of the tree and fksubscriptπ‘“π‘˜f_{k}italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT its associated function ofvariable xj⁒(k)subscriptπ‘₯π‘—π‘˜x_{j(k)}italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT. In terms of tree construction, each such step involvesreplacing each (kπ‘˜kitalic_kth) node’s current function fk⁒(xj⁒(k))subscriptπ‘“π‘˜subscriptπ‘₯π‘—π‘˜f_{k}(x_{j(k)})italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) withfkβˆ—β’(xj⁒(k))superscriptsubscriptπ‘“π‘˜βˆ—subscriptπ‘₯π‘—π‘˜f_{k}^{\ast}(x_{j(k)})italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT βˆ— end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j ( italic_k ) end_POSTSUBSCRIPT ) (14). No changes are made to tree topology(Fig. 1) or the input variables associated with each node. Only the functions(Fig. 2) are updated. Backfitting passes can be repeated until the modelbecomes stable. This usually happens after one or two passes. Note that in thepresence of backfitting function tree models are not nested. All of the basisfunctions of the K+1𝐾1K+1italic_K + 1 node model are different from those of its K𝐾Kitalic_K node predecessor.

3.3 Univariate function estimation

The central operation of the above procedure is repeated estimation of univariate functions of the individual predictor variables. In both (12) and (14) the solutions take the form of weighted conditional expectations

$$\hat{f}(x_{j})=\hat{E}_{w^{2}}\left[\frac{r}{w}\,\Big|\,x_{j}\right]\,,\qquad(16)$$

where $w$ represents one of the current basis functions, $w=B_{k}(\mathbf{x})$, and $r=y-\hat{F}(\mathbf{x})$ are the current residuals. The function tree procedure as described above is agnostic concerning the methods used for this estimation. However, actual performance in terms of execution speed and prediction accuracy can be highly dependent on such choices.

A major consideration in the choice of an estimation method for a particular variable $x_{j}$ is the nature of that variable in terms of the values it realizes, and the connections between those values. A categorical variable (factor) realizes discrete values with no order relation. The only method for evaluating (16) for such variables is to take the weighted ($w^{2}$) mean of $r/w$ at each discrete $x$ value. This is also a viable strategy if the $x_{j}$ values are orderable but realize only a small set of distinct values.
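For a categorical variable this evaluation is a grouped weighted mean; a minimal sketch (names illustrative):

```python
import numpy as np

def categorical_fit(x_cat, r, w):
    """Evaluate (16) for a categorical variable: the weighted (w^2) mean of
    r/w within each discrete level of x_cat, returned as a lookup table."""
    w2 = w**2
    ratio = r / np.where(w == 0, 1e-12, w)
    return {lev: np.average(ratio[x_cat == lev], weights=w2[x_cat == lev])
            for lev in np.unique(x_cat)}
```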

For numeric variables that realize many distinct orderable values one can take advantage of the presumed smoothness of the solution function to improve accuracy by borrowing strength from similarly valued observations. There are many such methods for "smoothing" data (see Irizarry 2019). For the examples presented below, near neighbor local averaging and local linear fitting were employed. Near neighbor local averaging estimates are equivariant under monotone transformations of the $x$-values, depending only on their relative ranks. This can provide higher accuracy in the presence of irregular or clumped $x$-values, immunity to $x$ outliers, resistance to $y$ outliers, and more cautious (constant) extrapolation at the edges. For more evenly distributed data with many distinct $x$-values, in the absence of $x$ and $y$ outliers, local linear fitting can yield smoother, more accurate results.

4 Interpretation

The primary goal of the function tree representation is to enhance interpretation by exposing a function's separate interaction effects of various orders. These can then be studied using graphical and other methods to gain insight into the nature of the relationship between the predictor variables $\mathbf{x}=(x_{1},x_{2},\cdots,x_{p})$ and the target function $F(\mathbf{x})$.

4.1 Partial dependence functions

One way to estimate the contribution of subsets of predictor variables to a function $F(\mathbf{x})$ is through partial dependence functions (Friedman 2001). Let $\mathbf{z}$ represent a subset of the predictor variables, $\mathbf{z}\subset\mathbf{x}$, and $\mathbf{\tilde{z}}$ the complement subset, $\mathbf{z}\cup\mathbf{\tilde{z}}=\mathbf{x}$. Then the partial dependence of a function $F(\mathbf{x})$ on $\mathbf{z}$ is defined as

$$PD(\mathbf{z})=E_{\mathbf{\tilde{z}}}\left[F(\mathbf{x})\right]\,.\qquad(17)$$

Conditioned on joint values of the variables in $\mathbf{z}$, the value of the function $F(\mathbf{x})$ is averaged over the joint values of the complement variables $\mathbf{\tilde{z}}$. Since the joint locations of partial dependence functions are generally not identifiable, they are each centered to have zero mean value over the data distribution of $\mathbf{x}$.

If for a function $F(\mathbf{x})$ the variables in the specified subset $\mathbf{z}$ do not participate in interactions with variables in $\mathbf{\tilde{z}}$, then the additive dependence (6) of $F(\mathbf{x})$ on $\mathbf{z}$ is well defined and given by $PD(\mathbf{z})$. If this is not the case, then the dependence of the target function $F(\mathbf{x})$ on the subset $\mathbf{z}$ is not well defined, in the sense that its functional form changes with changing values of the variables in $\mathbf{\tilde{z}}$. In this case one can define a "nominal" dependence by averaging over some distribution $p(\mathbf{x})$ of the predictor variables $\mathbf{x}$. This confounds the properties of the actual target function $F(\mathbf{x})$ with those of the data distribution $p(\mathbf{x})$ in the resulting estimate of the dependence on $\mathbf{z}$. Two popular choices for $p(\mathbf{x})$ are the training data and the product of its marginals (independence). Partial dependence functions choose a compromise in which the variables in $\mathbf{z}$ are taken to be independent of those in $\mathbf{\tilde{z}}$. Note that this result differs from that using the training data distribution only if there are substantial correlations or associations between the interacting variables in $\mathbf{z}$ and $\mathbf{\tilde{z}}$.

Partial dependence functions (17) can be estimated from the data in a straightforward way by evaluating

$$\widehat{PD}(\mathbf{z})=\frac{1}{N}\sum_{i=1}^{N}F(\mathbf{z},\mathbf{\tilde{z}}_{i})\qquad(18)$$

over a representative set of joint values of $\mathbf{z}$. This requires $N\cdot N_{\mathbf{z}}$ target function evaluations, where $N_{\mathbf{z}}$ is the number of evaluation points and $N$ is the sample size used for averaging.
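The brute-force estimator (18) can be written directly; the sketch below (names illustrative) pins the $\mathbf{z}$ columns at each grid point, averages over the observed complements, and centers the result as described above:

```python
import numpy as np

def partial_dependence(F, X, z_cols, z_grid):
    """Brute-force estimate (18): pin the z variables at each grid point,
    average F over the N observed complement rows, and center the result.
    Cost: N * N_z evaluations of F."""
    pd = []
    for z in z_grid:              # N_z joint evaluation points for z
        Xz = X.copy()
        Xz[:, z_cols] = z         # observed rows supply the complement values
        pd.append(F(Xz).mean())
    pd = np.array(pd)
    return pd - pd.mean()         # joint locations are not identifiable
```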

With the function tree representation, partial dependence functions can be computed much more rapidly. For the purpose of computing a partial dependence function on a variable subset $\mathbf{z}$, the function tree model can be expressed as

$$\hat{F}(\mathbf{x})=A+\sum_{k=1}^{K}f_{k}(\mathbf{z})\cdot g_{k}(\mathbf{\tilde{z}})\qquad(19)$$

where $A$ represents all basis functions not involving any variables in $\mathbf{z}$, $f_{k}(\mathbf{z})$ are functions involving variables only in $\mathbf{z}$, and $\mathbf{\tilde{z}}$ represents the complement variables. With this representation the partial dependence of $\hat{F}(\mathbf{x})$ on the variables $\mathbf{z}$ is simply

$$\widehat{PD}(\mathbf{z})=\sum_{k=1}^{K}\bar{g}_{k}\cdot f_{k}(\mathbf{z})\qquad(20)$$

where $\bar{g}_{k}=\hat{E}_{\mathbf{x}}[g_{k}(\mathbf{\tilde{z}})]$ is the mean of $g_{k}(\mathbf{\tilde{z}})$ over the data. This (20) is a particular linear combination of the functions $\{f_{k}(\mathbf{z})\}_{1}^{K}$. The number of target function $\hat{F}(\mathbf{x})$ evaluations required to compute (20) is proportional to

$$C(N,N_{\mathbf{z}})=N_{\mathbf{z}}+\alpha(\mathbf{z},\mathbf{\tilde{z}})\cdot N\qquad(21)$$

where $\alpha(\mathbf{z},\mathbf{\tilde{z}})$ is the fraction of node basis functions (8) involving variables in both $\mathbf{z}$ and $\mathbf{\tilde{z}}$.
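Given a model already arranged in form (19), evaluating (20) is a short sum. In the sketch below, the decomposition into `f_list` and the precomputed means `g_bar` is assumed to have been extracted from the tree (these names are illustrative):

```python
import numpy as np

def tree_partial_dependence(f_list, g_bar, z_grid):
    """Fast partial dependence (20) for a model arranged in form (19):
    f_list[k] maps a z point to f_k(z); g_bar[k] is the data mean of
    g_k(z~), computed once.  Cost follows (21) rather than N * N_z."""
    pd = np.array([sum(gb * f(z) for f, gb in zip(f_list, g_bar))
                   for z in z_grid])
    return pd - pd.mean()
```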

4.2 Partial association functions

Partial dependence functions focus on the properties of the function $F(\mathbf{x})$ by averaging over a distribution in which the $\mathbf{z}$ variables are independent of those in $\mathbf{\tilde{z}}$. This concentrates on the target function by removing the effects of associations between those variable subsets. As a result the calculation can sometimes emphasize $\mathbf{x}$-values for which the data density $p(\mathbf{x})$ is small, leading to potential inaccuracy. This only affects a partial dependence to the extent that there are variables in $\mathbf{\tilde{z}}$ with which its variables $\mathbf{z}$ both interact and are correlated.

The function tree representation provides a method for detecting and measuring the strength of such occurrences. "Partial association" functions are defined as

$$\widehat{PA}(\mathbf{z})=\sum_{k=1}^{K}f_{k}(\mathbf{z})\cdot h_{k}(\mathbf{z})\qquad(22)$$

where

$$h_{k}(\mathbf{z})=E_{\mathbf{x}}\left[g_{k}(\mathbf{\tilde{z}})\,|\,f_{k}(\mathbf{z})\right]\qquad(23)$$

with the functions $f_{k}(\mathbf{z})$ and $g_{k}(\mathbf{\tilde{z}})$ defined in (19). As with partial dependence functions (20), this can be viewed as a linear combination of the functions $\{f_{k}(\mathbf{z})\}_{1}^{K}$, but with varying coefficients $\{h_{k}(\mathbf{z})\}_{1}^{K}$ that account for the associations between the variables in $\mathbf{z}$ and the complement variables $\mathbf{\tilde{z}}$. The coefficient functions $h_{k}(\mathbf{z})$ can be estimated by any univariate smoother. Regression splines with knots at the $20$ percentiles are used in the examples below. Note that from (19), partial association functions (22) (23) reduce to partial dependence functions (20) when variables in $\mathbf{z}$ do not participate in interactions with variables in $\mathbf{\tilde{z}}$ ($g_{k}(\mathbf{\tilde{z}})=1$), or when they are independent of those in $\mathbf{\tilde{z}}$ with which they do interact ($h_{k}(\mathbf{z})=\bar{g}_{k}$). Otherwise they can produce different results, caused by the variation in the respective coefficient functions $h_{k}(\mathbf{z})$ induced by the associations between the corresponding interacting variables in $\mathbf{z}$ and $\mathbf{\tilde{z}}$.

Partial association functions can be used to assess the influence that associations among the predictor variables have on the corresponding partial dependences. In particular, one can substitute them (22) (23) for partial dependence functions (20) in any analysis. Similar results indicate that such influences are small.
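A sketch of (22)-(23) given per-node samples of $f_{k}$ and $g_{k}$ over the training data; `smoother` stands for any univariate regression smoother returning a callable (the paper uses regression splines with knots at the 20 percentiles), and all names are illustrative:

```python
import numpy as np

def partial_association(f_vals, g_vals, smoother):
    """Partial association (22)-(23) evaluated at the training points.
    f_vals[k], g_vals[k]: arrays of f_k(z_i) and g_k(z~_i) over the data;
    `smoother(x, y)` fits h_k = E[g_k(z~) | f_k(z)], returning a callable."""
    pa = np.zeros_like(f_vals[0], dtype=float)
    for fv, gv in zip(f_vals, g_vals):
        h = smoother(fv, gv)   # coefficient function h_k of (23)
        pa += fv * h(fv)       # k-th term of (22)
    return pa - pa.mean()
```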

4.3 Interaction detection

Partial dependence functions can be used to detect interaction effects between variables. For example, if a target function $F(\mathbf{x})$ contains no interaction between variables $x_{j}$ and $x_{k}$, its partial dependence on those variables is

$$PD(x_{j},x_{k})=PD(x_{j})+PD(x_{k})\qquad(24)$$

(Friedman and Popescu 2008). If there is such an interaction, $PD(x_{j})+PD(x_{k})$ represents the additive component and

$$I(x_{j},x_{k})=PD(x_{j},x_{k})-PD(x_{j})-PD(x_{k})\qquad(25)$$

represents the corresponding pure interaction component of the effect of $x_{j}$ and $x_{k}$ on $F(\mathbf{x})$ as reflected by its partial dependences.

More generally, the pure interaction effect involving variables $s=\{x_{j}\}_{1}^{n}$ is defined to be

$$I(s)=PD(s)-\sum_{u\subset s}I(u)\qquad(26)$$

where $I(u)$ is recursively defined as

$$I(u)=PD(u)-\sum_{v\subset u}I(v)\,.\qquad(27)$$

This pure interaction component $I(\{x_{j}\}_{1}^{n})$ is its corresponding partial dependence with all lower order interactions among all its variable subsets removed. If there is no $n$-variable interaction effect involving $\{x_{j}\}_{1}^{n}$, its pure interaction component $I(\{x_{j}\}_{1}^{n})=0$. Note that this result depends only on the properties of the function $F(\mathbf{x})$ (5) and not on the data distribution $p(\mathbf{x})$. The strength of such an interaction effect can be taken to be the standard deviation of its pure interaction component over the training data, or other specified data distribution, divided by the standard deviation of the corresponding target function:

$$S(\{j\}_{1}^{n})=\sqrt{\mathrm{var}_{\mathbf{x}}[I(\{x_{j}\}_{1}^{n})]\,/\,\mathrm{var}_{\mathbf{x}}(F(\mathbf{x}))}\,.\qquad(28)$$

If there exists a higher order interaction effect jointly involving the variables in subset $\{x_{j}\}_{1}^{m}$ along with other variables $\{x_{j}\}_{m+1}^{n}$, $S(\{j\}_{1}^{n})>0$ (28), then the form of the interaction function of the subset, $I(\{x_{j}\}_{1}^{m}\,|\,\{x_{j}\}_{m+1}^{n})$, depends on the joint values of the complement variables $\{x_{j}\}_{m+1}^{n}$. The resulting unconditioned interaction effect $I(\{x_{j}\}_{1}^{m})$ is then an average over the joint distribution $p(\{x_{j}\}_{m+1}^{n})$ of the complement variables. If there are no such higher order interactions involving $\{x_{j}\}_{1}^{m}$ jointly with other variables, its pure interaction effect $I(\{x_{j}\}_{1}^{m})$ does not depend on the data distribution $p(\mathbf{x})$ and reflects only properties of the target function $F(\mathbf{x})$.

The measure $S(s)$ (28) indicates the strength of an $n$-variable interaction effect involving the corresponding variable subset $s=\{x_{j}\}_{1}^{n}$ over the distribution $p(\mathbf{x})$. One can search for substantial interaction effects by evaluating (26-28) over all variable subsets $s$ up to some maximum size $|s|\leq M$. This approach can in principle be applied to a target function $F(\mathbf{x})$ represented in any form. All that is required to compute partial dependence functions is the value of the function at various specified values of $\mathbf{x}$.
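The recursion (26)-(27) is direct to code. The sketch below takes `PD` as a callable from a frozenset of variable indices to a centered partial dependence value at some fixed evaluation point (how that value is computed is orthogonal here); the number of subsets visited grows exponentially with $|s|$, which motivates the screening of Section 5:

```python
from itertools import combinations

def pure_interaction(PD, s, memo=None):
    """Pure interaction component (26)-(27): the partial dependence on the
    subset s minus the pure effects of every proper subset.  PD maps a
    frozenset of variable indices to a (centered) partial dependence value."""
    s = frozenset(s)
    memo = {} if memo is None else memo
    if s in memo:
        return memo[s]
    total = PD(s)
    for size in range(1, len(s)):
        for u in combinations(sorted(s), size):
            total -= pure_interaction(PD, u, memo)
    memo[s] = total
    return total
```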

[Figure 3: Interaction effect strengths (28) for the synthetic data example.]

Figure 3 shows the strengths (28) of all interaction effects uncovered in the synthetic data example of Section 3 using this approach. The presence of more lower order interaction effects shown here than explicitly represented by the function tree (Fig. 1) is due to lower level marginals of higher order basis functions (9) being assigned to their proper interaction level.

5 Computation

For a total of $p$ variables, the number of distinct subsets involving $n<p$ variables is $\binom{p}{n}$. For each such subset, the number of partial dependence functions required to extract its pure interaction effect (26) (27) grows exponentially in its size $n$. Thus for larger values of $p$ and $n$, the computation required to evaluate all of the necessary partial dependence functions using the standard method (18) is generally too severe to allow a search for interaction effects by complete enumeration over all variable subsets. However, for functions represented by a function tree, one can employ (20) to dramatically reduce the computation (21) required to evaluate each partial dependence function. This in turn allows the search for interactions to be performed over many more and larger variable subsets representing higher order interaction effects.

For small to moderate numbers of variables, $p\lesssim 20$, complete enumeration using (20) is generally feasible for $n\lesssim 4$. However, for larger problems the rapid growth with respect to both $p$ and $n$ soon places the required computation out of reach. Parallel computation and smart strategies for reusing previously computed partial dependence functions can somewhat ease this burden. However, the properties of partial dependence functions allow a simple input variable screening approach that can often considerably reduce the size of the search.

If for a (centered) function $F(\mathbf{x})$ a variable $x_j$ participates in no interactions with any other variables, then it can be expressed as

$F(\mathbf{x}) = PD(x_j) + PD(\mathbf{x}_{\backslash j})$   (29)

where $PD(x_j)$ is its partial dependence on $x_j$ and $PD(\mathbf{x}_{\backslash j})$ is its partial dependence on all other variables (Friedman and Popescu 2008). One can define an overall interaction strength for each predictor variable as

$H_j = \sqrt{\hat{E}_{\mathbf{x}}\left(F(\mathbf{x}) - PD(x_j) - PD(\mathbf{x}_{\backslash j})\right)^2}$.   (30)

Variables $x_j$ with small values of $H_j$ can be removed from the search for interaction effects. This often excludes many variables, thereby substantially reducing computation.
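
A minimal sketch of this screen follows, using brute-force empirical partial dependence estimates in place of the fast tree-based evaluation (20); the callable `F`, the data sample `X`, and the $O(N^2 p)$ averaging are placeholders for illustration.

```python
import numpy as np

def interaction_screen(F, X):
    """Screening statistic H_j of (30) for each variable, built from
    the decomposition (29). F is a callable returning predictions for
    an (N, p) array; X is the data sample. Brute-force O(N^2 p) PD
    estimation; a sketch, not the paper's fast tree-based method."""
    N, p = X.shape
    Fx = F(X) - F(X).mean()                  # centered function values
    H = np.empty(p)
    for j in range(p):
        pd_j, pd_rest = np.empty(N), np.empty(N)
        for i in range(N):
            Xj = X.copy(); Xj[:, j] = X[i, j]   # vary x_j, average rest
            pd_j[i] = F(Xj).mean()
            Xr = np.repeat(X[i:i + 1], N, axis=0)
            Xr[:, j] = X[:, j]                  # vary rest, average x_j
            pd_rest[i] = F(Xr).mean()
        pd_j -= pd_j.mean(); pd_rest -= pd_rest.mean()   # center PDs
        H[j] = np.sqrt(np.mean((Fx - pd_j - pd_rest) ** 2))
    return H
```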

Computation can be further reduced by taking advantage of the function tree representation. The function tree strength of the contribution of input variable $x_j$ to interaction level $k$ is taken to be

$R_{jk} = \sum_{m=1}^{M} I(j \in m)\; I(int(B_m) = k)\, \sqrt{var_{\mathbf{x}}[B_m(\mathbf{x})]}$.   (31)

Here $I(\cdot)$ is an indicator of the truth of its argument, $m$ labels a node of the $M$ node function tree, $B_m(\mathbf{x})$ is its corresponding basis function, $int(B_m)$ its interaction order, and the variance is over the distribution $p(\mathbf{x})$. This can be used as a filter to exclude variables $x_j$ with small values of $R_{jk}$ (31) from the search for $k$-variable interaction effects. Since node basis functions $B_m(\mathbf{x})$ are not pure interaction effect functions (26) (27), the search for $n$-variable interaction effects should include the union of all relevant variables with substantial contributions (31) at that interaction level or higher.
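
The filter (31) is cheap once the tree is built, since it only needs the variance of each node basis function over the data. A sketch, assuming the tree is available as a list of (variable set, basis callable) pairs; that representation is hypothetical.

```python
import numpy as np

def level_strengths(basis_functions, X, p):
    """Variable strengths R_jk of (31). basis_functions is a list of
    (vars_m, B_m) pairs: the set of input variables appearing in node
    m and its basis function as a callable (a hypothetical tree
    representation). Returns R[j, k-1] for interaction level k."""
    max_k = max(len(v) for v, _ in basis_functions)
    R = np.zeros((p, max_k))
    for vars_m, B_m in basis_functions:
        k = len(vars_m)                      # interaction order int(B_m)
        strength = np.sqrt(np.var(B_m(X)))   # sqrt(var_x[B_m(x)])
        for j in vars_m:                     # I(j in m) I(int(B_m) = k)
            R[j, k - 1] += strength
    return R
```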

In many applications the number of variables involved in higher order interactions is less than that involved in main or lower order interaction effects. Since computation increases very rapidly with the number of variables examined, even a minor variable reduction at the interaction level (31) can substantially reduce computation.

For the simulation example in Section 6.2 (below), exhaustive search over all 30 variables at each of four interaction levels requires computing 2027970 partial dependence functions with $2.03 \times 10^{12}$ corresponding target function evaluations using (18) with $N = N_{\mathbf{z}} = 1000$. This would take approximately 36 days on a Dell XPS 13 laptop. Approximating the search by only including the ten variables estimated to be the most important reduces the necessary number of partial dependence functions to 15540, requiring $1.55 \times 10^{10}$ target function evaluations. This would take approximately seven hours.

Partial dependence screening using (30) identifies six interacting variables, reducing the number of partial dependence functions computed to 1114 with $1.11 \times 10^{9}$ target function evaluations using (18). This reduces the computing time to approximately 30 minutes. Building the function tree for this example required 14.5 seconds. Corresponding interaction level screening using (31) eliminated all four-variable interactions. Six variables remained for both the two-variable and three-variable interaction searches and ten variables for main effects. This requires a total of 384 partial dependence functions. Employing (20) with $N = N_{\mathbf{z}} = 1000$ involved $4.14 \times 10^{5}$ target function evaluations requiring 0.5 seconds.

5.1 Interaction investigation

Once discovered, interactions as well as main effects can be visualized using partial dependence plots. Figure 4 shows a heat map representation of the interaction effect between variables $x_7$ and $x_8$, $I(x_7, x_8)$ (26) (27), for the function tree model $\hat{F}(\mathbf{x})$ (Figs. 1 & 2). Colors black, blue, purple, violet, red, orange, and yellow, seen in the lower left legend, represent a continuum of increasing function values from lowest ($-5.06$) to highest ($5.49$).


One sees that joint high or joint low values of these two variables produce relatively large increases in function value (red, yellow), whereas opposite values, (low, high) or (high, low), give rise to large decreases (blue, black). Since Fig. 3 shows no higher order interactions involving these variables, this function of $x_7$ and $x_8$ represents the corresponding interaction effect associated with the target function estimate $\hat{F}(\mathbf{x})$, and does not reflect the nature of the data distribution $p(\mathbf{x})$. One can verify this behavior by examining the true target function (10).

Variables $x_4$ and $x_5$ are seen in Fig. 3 to have a relatively weak interaction. However, they are also seen to participate in a substantial three-variable interaction with variable $x_6$, so that their two-variable interaction function $I(x_4, x_5)$ (26) (27) may not provide a complete description of their joint effect on the target estimate $\hat{F}(\mathbf{x})$.


Figure 5 shows a plot of the average (unconditional) pure interaction function $I(x_4, x_5)$ (upper left), and corresponding interaction functions $I(x_4, x_5 \,|\, x_6)$ conditioned at three values $x_6 \in \{-2, 0, 2\}$, all plotted on the same vertical scale $(-63, 63)$. Here a black color represents the most negative values, yellow the most positive, and violet the smallest absolute values. The upper left frame shows that the variation of the unconditional interaction function $I(x_4, x_5)$ is quite small ($\simeq$ violet) as indicated in Fig. 3. The lower left frame shows the same for $I(x_4, x_5 \,|\, x_6 = 0)$. The two right frames however show very strong interaction effects for $I(x_4, x_5 \,|\, x_6 = -2)$ and $I(x_4, x_5 \,|\, x_6 = 2)$. There is almost no interaction between $x_4$ and $x_5$ when $x_6 = 0$. For $x_6 = \pm 2$, there are strong but opposite interaction effects. This leads to an overall weak two-variable interaction between $x_4$ and $x_5$ when averaged over the distribution of $x_6$ (upper left). Seeing only this weak two-variable interaction effect between $(x_4, x_5)$ without knowledge of the corresponding three-variable $(x_4, x_5, x_6)$ interaction would lead to the impression that these two variables influence the target $F(\mathbf{x})$ in an additive, unrelated manner. As seen in the right two frames of Fig. 5 this is clearly not the case. Thus, in the presence of higher order interactions, lower order representations can be misleading.
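
Plots like those of Fig. 5 can be approximated directly from any fitted model by clamping the conditioning variable and removing additive components on a grid. A sketch assuming equal weighting over the sample; the paper's pure interaction functions (26)-(27) come from a full functional decomposition, so this is an illustration rather than the exact construction.

```python
import numpy as np

def conditional_interaction(F, X, j, k, c, c_val, grid_j, grid_k):
    """Approximate I(x_j, x_k | x_c = c_val) on a grid (cf. Fig. 5).
    Clamps x_c, computes the joint partial dependence of (x_j, x_k),
    then sweeps out the grand mean and both one-variable effects,
    leaving the two-variable interaction component."""
    Xc = X.copy(); Xc[:, c] = c_val
    pd = np.empty((len(grid_j), len(grid_k)))
    for a, xj in enumerate(grid_j):
        for b, xk in enumerate(grid_k):
            Z = Xc.copy(); Z[:, j] = xj; Z[:, k] = xk
            pd[a, b] = F(Z).mean()           # average over other vars
    pd -= pd.mean()                          # remove grand mean
    pd -= pd.mean(axis=1, keepdims=True)     # remove x_j main effect
    pd -= pd.mean(axis=0, keepdims=True)     # remove x_k main effect
    return pd
```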

6 Illustrations

The function tree methodology described above was applied to a number of popular public data sets to investigate the nature of the relationship between the joint values of their predictor variables and the outcome. In many cases the corresponding function tree uncovered simple relationships involving additive effects or at most a few two-variable interactions. Others were seen to involve more complex structure. Examples of both are illustrated here.

The goal of the function tree approach is interpretational. As such, its form (7) (8) is somewhat more restrictive than that of other more flexible methods focused purely on prediction accuracy such as XGBoost (Chen and Guestrin 2016) or Random Forests (Breiman 2001). However, if on any particular problem its accuracy is substantially less than that of these other methods, the quality of the corresponding interpretation may be questionable. For each of the examples presented in this section the root-mean-squared prediction error

$RMSE = \sqrt{\left.\sum_{i=1}^{N}(y_i - \hat{F}(\mathbf{x}_i))^2 \right/ \sum_{i=1}^{N}(y_i - \bar{y})^2}$   (32)

is reported for default versions of the three methods: no tuning for Random Forests and only model size tuning for function trees and XGBoost. In all examples presented here the estimated interaction effects are based on partial dependence functions (Section 4.1). Corresponding summaries using partial association functions (Section 4.2) produced nearly identical results in all cases.
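
Note that (32) is scale-free: it compares the prediction error to that of the constant (mean) predictor, so values near 1 indicate no predictive power. A one-line implementation:

```python
import numpy as np

def relative_rmse(y, y_hat):
    """Normalized root-mean-squared error of (32)."""
    return np.sqrt(np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2))
```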

6.1 Capital bikeshare data

This data set, taken from the UCI Machine Learning Repository, consists of 17379 hourly observations of counts of bicycle rentals in the Capital bikeshare system in Washington, D.C. between 2011 and 2012. The outcome variable $y$ is the rental count. The sixteen predictor variables are both categorical and numeric, consisting of corresponding time, weather and seasonal information.


Figure 6 displays the tree for the function tree model and Fig. 7 shows the corresponding node functions. Its interaction effect summary is shown in Fig. 8, which indicates the existence of substantial two-variable and three-variable interactions. An important question is the extent to which these effects are properties of the target function and not just properties of a particular training sample. A way to address this is through a bootstrap analysis (Efron 1979). Figure 9 shows box plots of the distributions over fifty bootstrap replications of test set root-mean-squared error (32) of function tree models built under three constraints on maximum interaction order: unconstrained (left), no three-variable interactions allowed (center), and no interactions allowed (right). As seen, the existence of both two-variable and three-variable interactions is significant.
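
The bootstrap comparison can be sketched as follows, with `fit` standing in for a hypothetical function tree trainer that accepts a maximum interaction order (None meaning unconstrained); each replication scores on the out-of-bag rows.

```python
import numpy as np

def bootstrap_order_test(X, y, fit, orders=(None, 2, 1), B=50, seed=0):
    """Distributions of test RMSE (32) over B bootstrap replications
    for models with different maximum interaction orders (cf. Fig. 9).
    `fit(X, y, max_order)` is a hypothetical trainer returning a
    predict callable."""
    rng = np.random.default_rng(seed)
    N = len(y)
    scores = {o: [] for o in orders}
    for _ in range(B):
        idx = rng.integers(0, N, N)               # bootstrap sample
        oob = np.setdiff1d(np.arange(N), idx)     # held-out rows
        for o in orders:
            pred = fit(X[idx], y[idx], max_order=o)(X[oob])
            scores[o].append(np.sqrt(
                np.sum((y[oob] - pred) ** 2) /
                np.sum((y[oob] - y[oob].mean()) ** 2)))
    return scores   # box-plot each list for a Fig. 9 style display
```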


The largest estimated two-variable interaction effect is between variables hour-of-day (hr) and the binary indicator for working/non-working days (wrk). Figure 10 shows the partial dependence of DC bike rentals on time-of-day conditioned on working days wrk = 1 (top) and non-working days wrk = 0 (bottom). They are seen to be quite different, indicating a strong interaction effect between these two variables. In particular, the sharp reduction in rentals beginning at 8am and continuing until 5pm on working days is seen not to exist on non-working days.


The upper left frame of Fig. 11 shows the partial dependence of bike rentals jointly on adjusted temperature (atemp) and time-of-day (hr), along with its pure interaction component $I(\text{hr}, \text{atemp})$ (top right). Figure 8 indicates a three-variable interaction between variables hour (hr), working day indicator (wrk) and adjusted temperature (atemp). Thus, the nature of the interaction effect between hr and atemp may depend on the value of wrk. The bottom two frames of Fig. 11 show this interaction effect separately conditioned on the value of working day, $I(\text{hr}, \text{atemp} \,|\, \text{wrk} = 0)$ and $I(\text{hr}, \text{atemp} \,|\, \text{wrk} = 1)$. One sees substantial difference in the nature of this interaction for the different values of wrk, reflecting the presence of the three-variable interaction.


Figure 8 indicates that the strongest three-variable interaction effect involves the working day indicator (wrk), day of the week (wekd) and hour of the day (hr). This is illustrated in Fig. 12. The interaction function $I(\text{wekd}, \text{wrk} \,|\, \text{hr})$ is displayed for hr $\in$ {midnight, 6am, noon, 6pm}. Note that there are no observations for which Saturday (wekd = 7) and Sunday (wekd = 1) are labeled as working days (wrk = 2). The absence of a three-variable interaction would imply that $I(\text{wekd}, \text{wrk} \,|\, \text{hr})$ is the same for all values of the variable hr. As seen in Fig. 12 this is not the case, especially for non-working weekdays at noon (lower left).

The $RMSE$ (32) of the function tree, XGBoost and Random Forest models on this example was 0.34, 0.37 and 0.38, respectively.

6.2 Simulated data

This example was presented in Hu et al. (2023). They generated data using the target function

$g(\mathbf{x}) = \sum_{j=1}^{5} x_j + 0.5 \sum_{j=6}^{8} x_j^2 + \sum_{j=9}^{10} x_j\, I(x_j > 0) + x_1 x_2 + x_1 x_3 + x_2 x_3 + 0.5\, x_1 x_2 x_3 + x_4 x_5 + x_4 x_6 + x_5 x_6 + 0.5\, I(x_4 > 0)\, x_5 x_6$.   (33)

This is a function of ten variables with numerous interactions involving up to three variables. The response was simulated as

$y = g(\mathbf{x}) + \varepsilon$,  $\varepsilon \sim N(0, 0.5^2)$.   (34)

There were 30 predictor variables. The first 20 were simulated from a multivariate Gaussian distribution with mean 0, variance 1 and equal correlation 0.5 between all pairs. Ten additional variables, independent of the first 20 (irrelevant variables), were also simulated from a multivariate Gaussian distribution with mean 0, variance 1 and equal correlation 0.5 between all pairs. All predictors were truncated to be within the interval $[-2.5, 2.5]$. Here the sample size was taken to be $N = 20000$.
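
The generating process (33)-(34) is fully specified, so the data set is easy to reproduce; a sketch (reading "truncated" here as clipping to $[-2.5, 2.5]$):

```python
import numpy as np

def simulate_hu_data(N=20000, seed=0):
    """Simulate the Hu et al. (2023) example: 20 equicorrelated
    relevant predictors, 10 independent equicorrelated irrelevant
    ones, target (33), noise (34)."""
    rng = np.random.default_rng(seed)

    def equicorr(n, p, rho=0.5):
        cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
        return rng.multivariate_normal(np.zeros(p), cov, size=n)

    x = np.hstack([equicorr(N, 20), equicorr(N, 10)])
    x = np.clip(x, -2.5, 2.5)          # "truncated" taken as clipping
    g = (x[:, :5].sum(1)
         + 0.5 * (x[:, 5:8] ** 2).sum(1)
         + (x[:, 8:10] * (x[:, 8:10] > 0)).sum(1)
         + x[:, 0]*x[:, 1] + x[:, 0]*x[:, 2] + x[:, 1]*x[:, 2]
         + 0.5 * x[:, 0]*x[:, 1]*x[:, 2]
         + x[:, 3]*x[:, 4] + x[:, 3]*x[:, 5] + x[:, 4]*x[:, 5]
         + 0.5 * (x[:, 3] > 0) * x[:, 4]*x[:, 5])
    y = g + rng.normal(0.0, 0.5, N)    # epsilon ~ N(0, 0.5^2)
    return x, y, g
```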


Figure 13 shows the largest estimated main and interaction effect strengths up through four variables. One sees that all of the main and interaction effects represented in (33) are shown with substantial strengths. Those not represented in (33), including all four-variable interactions, either do not appear or do so with very low estimated strength.

The $RMSE$ (32) of the function tree, XGBoost and Random Forest models on this example was 0.062, 0.147 and 0.174, respectively.

6.3 Pumadyn data

The other examples were chosen to represent data with somewhat complex target function structure. Here we illustrate on a data set where the function tree uncovers very simple structure. The data (Corke 1996) is produced from a realistic simulation of the dynamics of a Puma 560 robot arm. The task is to predict the angular acceleration of one of the robot arm's links. There are 8192 observations involving 32 input variables that include angular positions, velocities and torques of the robot arm.


Figure 14 shows the entire resulting function tree model. The upper left frame shows the tree and the other frames show the node functions. Figure 15 shows the (only) interaction effect, between tau4 and theta5. The upper left frame shows the partial dependence function $PD(\text{theta5}, \text{tau4})$; the upper right frame shows its additive component and the lower left frame its pure interaction effect $I(\text{theta5}, \text{tau4})$. The lower right frame represents $I(\text{theta5}, \text{tau4})$ as a perspective mesh plot. Of the 32 input variables, the function tree model selected only two of them to explain 97% of the variance of the target.

The $RMSE$ (32) of the function tree, XGBoost and Random Forest models on this example was 0.18, 0.21 and 0.27, respectively.

6.4 SGEMM GPU kernel performance

This data set from the UCI Machine Learning Repository measures the running time required for a matrix-matrix product using a parameterizable SGEMM GPU kernel. There are 241600 observations with 14 predictor variables:

SGEMM GPU kernel performance predictor variables

1-2    mwg, nwg       per-matrix 2D tiling at workgroup level
3      kwg            inner dimension of 2D tiling at workgroup level
4-5    mdimc, ndimc   local workgroup size
6-7    mdima, ndimb   local memory shape
8      kwi            kernel loop unrolling factor
9-10   vwm, vwn       per-matrix vector widths for loading and storing
11-12  strm, strn     enable stride for accessing off-chip memory within a single thread
13-14  sa, sb         per-matrix manual caching of the 2D workgroup tile

As suggested by the authors, the outcome was taken to be the logarithm of the running time. All predictor variables realize at most four distinct values and were therefore treated as being categorical. Figure 16 shows the estimated interaction effect profile for this data. Among the larger effects are four main effects, seven two-variable, four three-variable and one large four-variable interaction effect.


Figure 17 shows box plots of the distributions over ten bootstrap replications of test set root-mean-squared error (32) of function tree models built under three interaction constraints: unconstrained (left), the four-variable interaction among variables (mwg, nwg, mdimc, ndimc) prohibited (center), and all four-variable interactions prohibited (right). Note that disallowing the four-variable interaction among (mwg, nwg, mdimc, ndimc) does not preclude any of these variables from participating in lower order interactions, or in four-variable interactions with other variables. One sees that incorporating the (mwg, nwg, mdimc, ndimc) four-variable interaction is very important to model accuracy, as are other four-variable interactions to a lesser extent.


Since the (mwg, nwg, mdimc, ndimc) interaction effect is inherently four dimensional it cannot be completely represented by a two dimensional display. Unlike three-variable interactions, it cannot be represented by a series of two dimensional plots conditioned on the value of a third variable. However, it is possible to gain some insight about this four-variable interaction effect through graphical methods.

Figure 18 depicts the difference between interaction effects computedfrom two models

$\hat{D}(\mathbf{x}) = \hat{F}(\mathbf{x}) - \hat{G}(\mathbf{x})$.   (35)

The first, $\hat{F}(\mathbf{x})$, is the full model with all interactions allowed (Fig. 17 left). The second model, $\hat{G}(\mathbf{x})$, was built under the constraint that no four-variable interaction effect among variables (mwg, nwg, mdimc, ndimc) was allowed (Fig. 17 center). Thus, any structure detected in $\hat{D}(\mathbf{x})$ is due to the presence of this four-variable interaction.

Figure 18 shows nine bivariate interaction effect differences between these two models (35)

$I_{\hat{D}}(\text{mwg}, \text{nwg} \,|\, \text{mdimc} \in \{8, 16, 32\} \times \text{ndimc} \in \{8, 16, 32\})$

where mwg, nwg $\in \{16, 32, 64, 128\}$. For the top two rows, mdimc $\in \{8, 16\}$, the model differences are seen to be small in absolute value ($\simeq$ violet) for all joint values of the other variables, indicating that the four-variable interaction has little to no effect for those collective values. For the last row (mdimc = 32) one sees that the absolute differences are larger (red, yellow, blue), especially for ndimc = 32. This four-variable interaction effect is seen to be mainly the result of a very large increase in (log) running time for the GPU matrix product when mdimc and ndimc simultaneously realize their largest values while at the same time mwg and nwg are at their smallest values.
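
The contrast underlying Fig. 18 can be computed for any pair of fitted predictors: form the difference model (35) and display its bivariate partial dependence with the remaining interaction variables clamped. A sketch follows; how the no-four-variable-interaction constraint is imposed when fitting $\hat{G}$ is model-specific and not shown.

```python
import numpy as np

def difference_pd(F_hat, G_hat, X, pair, cond, cond_vals, grids):
    """Bivariate partial dependence of D(x) = F(x) - G(x), per (35),
    with conditioning variables clamped (cf. Fig. 18). pair = (j, k)
    are the plotting variables; cond/cond_vals fix the others;
    grids = (values_j, values_k)."""
    j, k = pair
    Xc = X.copy()
    for c, v in zip(cond, cond_vals):
        Xc[:, c] = v                              # clamp conditioners
    out = np.empty((len(grids[0]), len(grids[1])))
    for a, xj in enumerate(grids[0]):
        for b, xk in enumerate(grids[1]):
            Z = Xc.copy(); Z[:, j] = xj; Z[:, k] = xk
            out[a, b] = (F_hat(Z) - G_hat(Z)).mean()
    return out - out.mean()   # centered; flat output => no interaction
```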

The $RMSE$ (32) of the function tree, XGBoost and Random Forest models on this example was 0.12, 0.09 and 0.16, respectively.

6.5 Global surrogate

Function trees can be used as interpretable surrogate models to investigate the predictions of black box models. One simply uses the black box predictions along with the corresponding predictor variables as input. To the extent that the function tree fit is accurate, all of its interpretational tools can then be applied to study the nature of the black box model. This is illustrated using the simulated data of Hu et al. (2023), shown in Section 6.2, where the true underlying structure (33) is known.
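
A minimal sketch of the workflow, with scikit-learn gradient boosting standing in for the black box and `fit_interpretable` a placeholder for any interpretable trainer (such as a function tree); the fidelity check indicates whether the surrogate's interpretations can be trusted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def global_surrogate(X, y, fit_interpretable):
    """Fit a black box to (X, y), then fit an interpretable surrogate
    to the black box predictions and report surrogate fidelity."""
    black_box = GradientBoostingRegressor().fit(X, y)
    y_bb = black_box.predict(X)             # black box outputs as targets
    surrogate = fit_interpretable(X, y_bb)  # returns a predict callable
    fidelity = np.sqrt(np.mean((y_bb - surrogate(X)) ** 2)
                       / np.var(y_bb))      # small => close surrogate fit
    return black_box, surrogate, fidelity
```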


First, XGBoost is applied in regression mode to model the data (34), producing an estimate $\hat{g}(\mathbf{x})$. The resulting test data root-mean-squared error

$RMSE(\hat{g}) = \left\{\hat{E}_{\mathbf{x}}[(g(\mathbf{x}) - \hat{g}(\mathbf{x}))^2] \,/\, Var(g(\mathbf{x}))\right\}^{1/2}$   (36)

was $RMSE(\hat{g}) = 0.14$. The output of the XGBoost model $\hat{g}(\mathbf{x})$ is then fit with a function tree. The $RMSE$ for this fit was 0.13, indicating that the function tree accurately reflects the XGBoost model. Figure 19 displays the resulting largest main and interaction effects detected by the function tree in the XGBoost model. This can be compared to the function tree based on the original data shown in Fig. 13, as well as to the structure of the true target function (33). One sees that the XGBoost regression model here closely captures the true target function structure as reflected in its function tree representation.

Next, a Random Forest is applied to the same data. The resulting test data root-mean-squared error (36) was $RMSE(\hat{g}) = 0.17$. The output of this Random Forest model was fit by a function tree, producing a corresponding $RMSE$ of 0.053. Figure 20 shows the largest main and interaction effects detected in the Random Forest model. While capturing the nature of the target function reasonably well, the Random Forest appears to somewhat underestimate the $(x_4, x_5, x_6)$ interaction strength and to identify several spurious three-variable interactions.


Finally, we apply the surrogate approach in a classification setting. The outcome variable is binary, $y \in \{0, 1\}$, with $\Pr(y = 1 \,|\, \mathbf{x}) = 1/(1 + \exp(-g(\mathbf{x})))$. The log-odds function $g(\mathbf{x})$ is given by (33). Logistic XGBoost was applied to this $(\mathbf{x}, y)$ data, producing a log-odds estimate $\hat{g}(\mathbf{x})$. The resulting root-mean-squared error $RMSE(\hat{g}) = 0.39$ (36) is in this case much larger, owing to the loss of information associated with providing the learning algorithm with only the sign of the outcome rather than its actual numerical value. This logistic XGBoost model was then fit with a function tree with resulting $RMSE$ of 0.19. Although larger than in the regression case, this still represents a fairly close fit ($R^2 = 0.96$). The result is shown in Fig. 21. As would be expected, this lower accuracy logistic XGBoost model less perfectly captures the true target function (33). It uncovers all of the true main and interaction effects. It also indicates a number of spurious main and two-variable interactions at somewhat lower strength. The strength of the interaction among variables $(x_1, x_2, x_3)$ is underestimated and there are several spurious three-variable interactions indicated at comparable strength. In spite of its lower accuracy, the logistic XGBoost model is still seen in Fig. 21 to reflect much of the intrinsic structure of the target log-odds $g(\mathbf{x})$ (33).

7 Previous work

The origins of the function tree approach lie with the additive modeling procedure of Hastie and Tibshirani (1990). They used nonparametric univariate smoothers and backfitting to produce models with no interactions. Function trees can be viewed as generalizing that method to discover and include unspecified interaction effects.

The closest predecessor to function trees is MARS (Friedman 1991). It models a general multivariate function as a sum of products of simple basic univariate functions of the predictor variables and can in principle model interaction effects to high order. The basic univariate function used with MARS is a dReLU (double ReLU)

$d(x) = a \cdot (t - x)_+ + b \cdot (x - t)_+$   (37)

characterized by two slope parameters $(a, b)$ and a knot location $t$. The MARS forward stepwise modeling strategy is similar to that used with the function tree approach.
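
For reference, the dReLU basis (37) is a single pair of piecewise-linear hinges:

```python
import numpy as np

def drelu(x, a, b, t):
    """MARS double-ReLU basis (37): a*(t - x)_+ + b*(x - t)_+ ."""
    return a * np.maximum(t - x, 0.0) + b * np.maximum(x - t, 0.0)
```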

The principal difference between MARS and function trees involves their basic building blocks. MARS employs the simple dReLU (37) at every step. Function trees employ more general functions of the predictor variables $f(x_j)$ estimated by user selected smoothers. This allows customization of the estimation method to the nature of each individual predictor variable distribution. It also produces more parsimonious models (smaller trees), since many dReLUs (37) can be required to approximate even simple whole curves.

A second difference with MARS is that function trees allow different functions of the same variable to appear multiple times in the products of a single basis function (8), as seen in Fig. 6. This provides more flexibility by incorporating stepwise multiplicative as well as additive modeling. Besides increased accuracy, this further reduces model size. In addition to being more interpretable, smaller models allow faster computation of their partial dependence functions.

The use of partial dependence functions to uncover interaction effects was proposed by Friedman and Popescu (2008). The innovation here is their rapid identification and computation afforded by the function tree representation. This allows a comprehensive search for all interaction effects up to high orders. Other techniques that have been proposed for interaction detection are mostly based on the functional ANOVA decomposition (see Roosen 1995, Hooker 2004 & 2007, Kim et al. 2009, Lengerich et al. 2020, Hu et al. 2023, and Walters et al. 2023). So far, computational considerations have limited these approaches to two-variable interactions. Lou et al. (2013) use a bivariate partitioning method to screen for two-variable interactions. Tang et al. (2016) combine the functional ANOVA technique with the polynomial dimensional decomposition to reduce computation with independent variables.

8 Discussion

While sometimes competitive in accuracy with other more flexible methods such as XGBoost and Random Forests, the focus of the function tree approach is on interpretation. The goal is to provide a representation of the target function that exposes its interaction effects and provides a framework for their rapid calculation, especially those involving more than two variables. Almost all research into interaction detection to date has been limited to effects involving just two variables. In fact, in many settings the unqualified term "interaction effect" is meant to refer to two variables only.

Focusing only on two-variable interactions is natural because it reduces the size of the search, and once interactions are detected their functional forms are easily examined by traditional graphical methods such as heat maps, contour plots or perspective mesh plots. Higher order interactions involve more variables and their higher dimensional structures are not as easily represented. As illustrated in Figs. 5, 11 and 12, three-variable interactions can be visualized by viewing a series of bivariate interaction plots conditioned on selected values of another variable. As seen in Fig. 18, interactions involving more variables can be investigated by contrasting models with and without those interactions included.

In addition to their direct interpretational value, knowledge of the existence and strengths of higher order interactions can be important as they place interpretational limits on the nature of those of lower order involving the same variables. As noted above, the functional form of an interaction effect involving variables $\{x_j\}_1^n$ depends on the value of another variable, say $x_k$, if there exists a higher order interaction involving both $\{x_j\}_1^n$ and $x_k$. If there are no such substantial higher order interactions, the functional form of the $\{x_j\}_1^n$ interaction is well defined and represents the isolated joint contribution of those variables to the target function. If such higher order interactions do exist, then the form of the $\{x_j\}_1^n$ interaction is not well defined as it depends on the value of $x_k$, and its corresponding functional form becomes an average over the joint distribution of all such variables. As seen for example in Fig. 5, this can lead to highly misleading interpretations. Tan et al. (2023) discuss problems interpreting lower order effects in the presence of higher order interactions.

In applications involving training data with very large absolute correlations among subsets of predictor variables, main and interaction effects of various levels involving those variables tend not to be separable. This can cause substantial spurious interaction effects to be reported. These can be detected by comparing the lack-of-fit (32) for models constructed with and without the questionable effects included, as illustrated in Figs. 9 and 17.

A popular interpretational tool used to investigate predictive models is a measure of the impact or importance of the respective individual input variables on model predictions. There are a wide variety of definitions of variable importance, each providing different information. See Molnar (2023) for a comprehensive survey. In the presence of interactions the contribution of a given predictor variable depends on the values of the other variables with which it interacts. Interaction effect summaries (e.g. Figs. 8, 13 and 16) are thus more comprehensive than corresponding variable importance summaries. Variable importances can be derived from interaction based functional decompositions. For example, Gevaert and Saeys (2022) and Walters et al. (2023) use them to derive Shapley (1953) values.

In the examples presented here partial dependence functions were used to both detect and examine interaction effects. Computational considerations largely dictate their use for the former. However, once uncovered, identified main and interaction effects can be examined by any appropriate method. For example, accumulated local effects (ALE) functions (Apley and Zhu 2020) can be employed for this purpose. Also, partial dependence based search results can be used to guide methods for constructing functional ANOVA decompositions.

References

  • [1] Apley, D. and Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. J. Roy. Statist. Soc. Series B 82, 1059-1086.
  • [2] Breiman, L. (2001). Random Forests. Machine Learning 45, 5-32.
  • [3] Chen, T. and Guestrin, C. (2016). XGBoost: a scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 785-794.
  • [4] Corke, P. I. (1996). A robotics toolbox for MATLAB. IEEE Robotics and Automation Magazine 3, 24-32.
  • [5] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
  • [6] Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76, 817-823.
  • [7] Friedman, J. (1991). Multivariate adaptive regression splines. Ann. Statist. 19, 1-67.
  • [8] Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Ann. Statist. 29, 1189-1232.
  • [9] Friedman, J. and Popescu, B. (2008). Predictive learning via rule ensembles. Ann. Appl. Statist. 2, 916-954.
  • [10] Gevaert, A. and Saeys, Y. (2022). PDD-SHAP: fast approximations for Shapley values using functional decomposition. arXiv:2208.12595 [cs.LG].
  • [11] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. CRC Press.
  • [12] Hooker, G. (2004). Discovering additive structure in black box functions. Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
  • [13] Hooker, G. (2007). Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comp. and Graph. Statist. 16, 709-732.
  • [14] Hu, L., Nair, V. J., Sudjianto, A., Zhang, A. and Chen, J. (2023). Interpretable machine learning based on functional ANOVA framework: algorithms and comparisons. arXiv:2305.15670 [stat.ML].
  • [15] Irizarry, R. A. (2019). Introduction to Data Science. Chapman & Hall.
  • [16] Kim, Y., Kim, J., Lee, S. and Kwon, S. (2009). Boosting on the functional ANOVA decomposition. Statistics and Its Interface 2, 361-368.
  • [17] Lengerich, B., Tan, S., Chang, C., Hooker, G. and Caruana, R. (2020). Purifying interaction effects with the functional ANOVA: an efficient algorithm for recovering identifiable additive models. Proc. 23rd Int. Conf. on Artificial Intelligence and Statistics (AISTATS).
  • [18] Lou, Y., Caruana, R., Gehrke, J. and Hooker, G. (2013). Accurate intelligible models with pairwise interactions. Proc. 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
  • [19] Molnar, C. (2023). Interpretable Machine Learning. Lean Publishing.
  • [20] Roosen, C. (1995). Visualization and exploration of high-dimensional functions using the functional ANOVA decomposition. PhD thesis, Department of Statistics, Stanford University.
  • [21] Shapley, L. (1953). A value for n-person games. Contributions to the Theory of Games 2(28), 307-317.
  • [22] Tan, S., Hooker, G., Koch, P., Gordo, A. and Caruana, R. (2023). Considerations when learning additive explanations for black-box models. Machine Learning 112, 3333-3359.
  • [23] Tang, K., Congedo, P. and Abgrall, R. (2016). Adaptive surrogate modeling by ANOVA and sparse polynomial dimensional decomposition for global sensitivity analysis in fluid simulation. J. Comput. Physics 314, 557-589.
  • [24] Walters, B., Ortega-Martorell, S., Olier, I. and Lisboa, P. J. G. (2023). How to open a black box classifier for tabular data. Algorithms 16, 181.