Jerome H. Friedman, Department of Statistics, Stanford University, Stanford, CA 94305, USA (jhf@stanford.edu)
Abstract
The output of a machine learning algorithm can usually be represented by one or more multivariate functions of its input variables. Knowing the global properties of such functions can help in understanding the system that produced the data as well as interpreting and explaining corresponding model predictions. A method is presented for representing a general multivariate function as a tree of simpler functions. This tree exposes the global internal structure of the function by uncovering and describing the combined joint influences of subsets of its input variables. Given the inputs and corresponding function values, a function tree is constructed that can be used to rapidly identify and compute all of the function's main and interaction effects up to high order. Interaction effects involving up to four variables are graphically visualized.
1 Introduction
A fundamental exercise in machine learning is the approximation of a function of several to many variables given values of the function, often contaminated with noise, at observed joint values of the input variables. The result can then be used to estimate unknown function values given corresponding inputs. The goal is to accurately estimate the underlying (non-noisy) outcome values, since the noise is by definition unpredictable. To the extent that this is successful, the estimated function may, in addition, be used to try to understand underlying phenomena giving rise to the data. Even when prediction accuracy is the dominant concern, being able to comprehend the way in which the input variables jointly combine to produce predictions can lead to important sanity checks on the validity of the function estimate. Besides accuracy, the success of this latter exercise requires that the structure of the function estimate be represented in a comprehensible form.
It is well known, and often commented, that the most accurate function approximation methods to date tend not to provide comprehensible results. The function estimate is encoded in an obtuse form that obscures potentially recognizable relationships among the inputs that give rise to various function output values. This is especially the case when the function is not inherently low dimensional. That is, medium to large subsets of the input variables act together to influence the function in the sense that their combined contribution cannot be represented by a combination of smaller subsets of those variables (interaction effects). This has led to incomplete interpretations based on low dimensional approximations such as additive modeling with no interactions, or models restricted to at most two-variable interactions.
2 Interaction effects
For input variables $\mathbf{x} = (x_1, \ldots, x_p)$, a function $F(\mathbf{x})$ is said to exhibit an interaction between two of them, $x_j$ and $x_k$, if the difference in the value of $F(\mathbf{x})$ as a result of changing the value of one of them depends on the value of the other. This means that in order to understand the effect of these two variables on $F(\mathbf{x})$ they must be considered together and cannot be studied separately. An interaction effect between variables $x_j$ and $x_k$ implies that the second derivative of $F(\mathbf{x})$ jointly with respect to $x_j$ and $x_k$ is not zero for at least some values of $\mathbf{x}$. That is,
$$E_{\mathbf{x}}\left[\frac{\partial^2 F(\mathbf{x})}{\partial x_j\,\partial x_k}\right]^2 > 0 \qquad (1)$$
with an analogous expression involving finite differences for categorical variables. If there is no interaction between these variables, the function $F(\mathbf{x})$ can be expressed as a sum of two functions, one that does not depend on $x_j$ and the other that is independent of $x_k$:
$$F(\mathbf{x}) = f_{\backslash j}(\mathbf{x}_{\backslash j}) + f_{\backslash k}(\mathbf{x}_{\backslash k}) \qquad (2)$$
Here $\mathbf{x}_{\backslash j}$ and $\mathbf{x}_{\backslash k}$ respectively represent all variables except $x_j$ and $x_k$. If a given variable $x_j$ interacts with no other variable, then the function can be expressed as
$$F(\mathbf{x}) = f_j(x_j) + f_{\backslash j}(\mathbf{x}_{\backslash j}) \qquad (3)$$
where the first term on the right is a function only of $x_j$ and the second is independent of $x_j$. In this case $F(\mathbf{x})$ is said to be "additive" in variable $x_j$, and the univariate function $f_j(x_j)$ can be examined to study the effect of $x_j$ on the function independently from the effects of the other variables.
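The mixed-derivative criterion (1) can be checked numerically. The following minimal Python sketch (an illustration added here, not part of the procedure described below) estimates the mixed second difference of a generic target F at a point; the example target F and the step size h are assumptions:

```python
import numpy as np

def mixed_difference(F, x, j, k, h=1e-3):
    """Finite-difference estimate of the mixed second derivative
    d^2 F / dx_j dx_k at the point x (Eq. 1); a nonzero value at
    some x signals an interaction between x_j and x_k."""
    e_j = np.zeros_like(x); e_j[j] = h
    e_k = np.zeros_like(x); e_k[k] = h
    return (F(x + e_j + e_k) - F(x + e_j) - F(x + e_k) + F(x)) / h**2

# Example: F(x) = x0*x1 + x2 interacts only in (x0, x1).
F = lambda x: x[0] * x[1] + x[2]
x = np.array([0.5, -1.0, 2.0])
print(mixed_difference(F, x, 0, 1))  # ~1.0: interaction present
print(mixed_difference(F, x, 0, 2))  # ~0.0: no interaction
```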
Higher order interactions are analogously defined. A function $F(\mathbf{x})$ is said to have an $n$-variable interaction effect involving variables $x_{j_1}, \ldots, x_{j_n}$ if
$$E_{\mathbf{x}}\left[\frac{\partial^n F(\mathbf{x})}{\partial x_{j_1} \cdots \partial x_{j_n}}\right]^2 > 0 \qquad (4)$$
again with an analogous expression involving finite differences for categorical variables. The existence of such an $n$-variable interaction implies that the effect of the corresponding variables $x_{j_1}, \ldots, x_{j_n}$ on the function $F(\mathbf{x})$ cannot be decomposed into a sum of functions each involving subsets of those variables. If there is no such interaction, the contribution of variables $x_{j_1}, \ldots, x_{j_n}$ to the variation of $F(\mathbf{x})$ can be decomposed into a sum of functions each not depending upon one of these respective variables:
$$F(\mathbf{x}) = \sum_{i=1}^{n} f_i(\mathbf{x}_{\backslash j_i}) \qquad (5)$$
If none of the variables in a subset $\mathbf{x}_s$ interact with any of the variables in its complement set $\mathbf{x}_{\backslash s}$, then the function is additive in the variable subset $\mathbf{x}_s$:
$$F(\mathbf{x}) = f_s(\mathbf{x}_s) + f_{\backslash s}(\mathbf{x}_{\backslash s}) \qquad (6)$$
and one can study the effect of $\mathbf{x}_s$ on the function separately from that of the other variables.
The notion of interaction effects can be useful for understanding a function $F(\mathbf{x})$ if it is dominated by those of low order involving variable subsets with small cardinality ($n \ll p$). In this case a target function of many variables can be understood in terms of a collection of lower dimensional functions, each depending on a relatively small, different subset of the variables. It is generally easier to comprehend the lower dimensional structure associated with fewer variables. Also, a common regularization method is to more heavily penalize or forbid solutions involving higher order interactions. Although often useful, this can result in lower accuracy and/or misleading interpretations when $F(\mathbf{x})$ involves substantial interaction effects of higher order than that assumed. As seen below, the existence of such higher order interactions can mislead interpretation by causing the functional form of those of lower order to depend on values of other unknown variables.
3 Function trees
A function tree is a way of representing a multivariate function so as to uncover and estimate its interaction effects of various orders. The input function is defined by data $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$, with $\mathbf{x}_i$ representing multivariate evaluation points and $y_i$ the corresponding function values, perhaps contaminated with noise. The output of the procedure is a set of univariate functions, each one depending on a single selected input variable. These functions are arranged as the non-root nodes of a tree. A constant function is associated with the root. The tree provides a prescription for combining these univariate functions to form the multivariate function approximation $\hat F(\mathbf{x})$.
Figure 1 shows one such function tree derived from simple simulated data for illustration. Figure 2 shows the nine non-constant functions corresponding to each of the respective non-root nodes of the tree. In addition to its associated univariate function (Fig. 2), each tree node $m$ represents a basis function $B_m(\mathbf{x})$. The sum of these basis functions forms the final multivariate approximation
$$\hat F(\mathbf{x}) = \sum_{m=0}^{M} B_m(\mathbf{x}) \qquad (7)$$
Each node's basis function is the product of its associated univariate function and those of the nodes on the path from it to the root:
$$B_m(\mathbf{x}) = \prod_{k \in P(m)} f_k(x_{j(k)}) \qquad (8)$$
Here $P(m)$ represents those nodes on the path from node $m$ to the root, and $j(k)$ labels the input variable associated with each such node $k$. The darkness of each node's shading in Fig. 1 corresponds to the estimated relative influence of its basis function on the final function estimate, as measured by its standard deviation over the training data.
The interaction level of each basis function (8) is the number of univariate functions of different variables along the path from its corresponding node to the root of the tree. Products of univariate functions involving different variables satisfy (4). Note that different univariate functions of the same variable may appear multiple times along the same path. While potentially improving the model fit, they do not increase interaction level. With this approach, interactions among a specified subset $\mathbf{x}_s$ of the variables are modeled by sums of products of univariate functions of those variables:
$$f_s(\mathbf{x}_s) = \sum_{m}\; \prod_{k \in P(m)} f_k(x_{j(k)}), \quad x_{j(k)} \in \mathbf{x}_s \qquad (9)$$
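To make the tree representation (7)–(9) concrete, the following sketch implements a minimal node structure in Python; the class and function names are illustrative, not taken from the paper:

```python
import numpy as np

class Node:
    """A non-root node of a function tree: a univariate function `f`
    of input variable `var`, linked to its parent (None at the root)."""
    def __init__(self, var, f, parent=None):
        self.var, self.f, self.parent = var, f, parent

    def basis(self, X):
        # Eq. (8): product of this node's univariate function and those
        # of all nodes on the path back to the root.
        b = np.ones(len(X))
        node = self
        while node is not None:
            b = b * node.f(X[:, node.var])
            node = node.parent
        return b

def tree_predict(root_const, nodes, X):
    # Eq. (7): root constant plus the sum of all node basis functions.
    return root_const + sum(n.basis(X) for n in nodes)

# Hypothetical 3-node tree encoding sin(x0) + cos(x0) + cos(x0)*tanh(x1);
# the last basis is a two-variable interaction term (level 2).
n1, n2 = Node(0, np.sin), Node(0, np.cos)
n3 = Node(1, np.tanh, parent=n2)
print(tree_predict(0.0, [n1, n2, n3], np.random.rand(5, 2)))
```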
The function tree model shown in Figs. 1 & 2 was obtained by applying the procedure described below to a simulated data example. There are $N$ observations with outcome variables generated as $y = F(\mathbf{x}) + \varepsilon$, with target function
(10)
The noise $\varepsilon$ is generated to produce a specified signal/noise ratio. This target function has an additive dependence on the third variable, a separate two-variable interaction between each of two pairs of variables, and a trilinear three-variable interaction among three of the variables.
The first node of the tree is seen to incorporate the pure additive effect of the third variable. Nodes 2–5 model the three-variable interaction. One two-variable interaction is modeled by nodes 6 and 7, and the other by nodes 8 and 9. The nine univariate functions (Fig. 2), combined as specified by the tree (Fig. 1), produce a multivariate function that explains 97% of the variance of the target (10).
3.1 Tree construction
Function trees are constructed in a standard forward stepwise, best first manner. Initially the tree consists of a single root node representing a constant function. At the $M$th step there are $M+1$ basis functions (7) (8) in the current model $\hat F_M(\mathbf{x})$. The $(M+1)$st is taken to be
$$B_{M+1}(\mathbf{x}) = B_{l^*}(\mathbf{x})\, f^*(x_{j^*}) \qquad (11)$$
where $l^*$, $j^*$, and $f^*$ are the solution to
$$(l^*, j^*, f^*) = \operatorname*{arg\,min}_{l,\, j,\, f}\; \hat E\left[\left(y - \hat F_M(\mathbf{x}) - B_l(\mathbf{x})\, f(x_j)\right)^2\right] \qquad (12)$$
Here $B_l(\mathbf{x})$ is one of the basis functions in the current model, $f(x_j)$ is a function of one of the predictor variables, and the empirical expected value $\hat E$ is over the data distribution. The model is then updated
$$\hat F_{M+1}(\mathbf{x}) = \hat F_M(\mathbf{x}) + B_{M+1}(\mathbf{x}) \qquad (13)$$
for the next iteration. In terms of tree construction, the update consists of adding a daughter node to current node $l^*$ with associated function $f^*(x_{j^*})$. Iterations continue until the goodness-of-fit stops improving as measured on an independent test data sample.
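A minimal sketch of one such forward step, assuming a crude quantile-bin-average stand-in for the univariate smoother of Section 3.3 (the actual procedure uses more refined smoothers, and these function names are illustrative):

```python
import numpy as np

def fit_node_function(x, r, b, n_bins=10):
    """Per-bin least-squares fit of f so that b(x)*f(x_j) ~ r: the
    weighted conditional expectation of Eq. (16), approximated by
    f = E[b*r | x_j] / E[b^2 | x_j] within quantile bins."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    num = np.bincount(idx, weights=b * r, minlength=n_bins)
    den = np.bincount(idx, weights=b * b, minlength=n_bins) + 1e-12
    vals = num / den
    return lambda z: vals[np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)]

def forward_step(X, y, yhat, bases):
    """One best-first step (Eqs. 11-12): over all pairs (existing basis
    B_l, variable x_j), fit a candidate univariate function to the
    residuals and keep the pair giving the smallest squared error."""
    r = y - yhat
    best_loss, best = np.inf, None
    for l, B in enumerate(bases):
        bx = B(X)
        for j in range(X.shape[1]):
            f = fit_node_function(X[:, j], r, bx)
            loss = np.mean((r - bx * f(X[:, j])) ** 2)
            if loss < best_loss:
                best_loss, best = loss, (l, j, f)
    return best

# Usage: start from the constant root basis and take one step.
X = np.random.rand(200, 3); y = X[:, 0] * X[:, 1]
bases = [lambda X: np.ones(len(X))]
l, j, f = forward_step(X, y, np.full(len(y), y.mean()), bases)
```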
3.2 Backfitting
As with all tree based methods, smaller trees (fewer nodes) tend to be more easily understood. Here tree size can be reduced and accuracy improved by increased function optimization (backfitting) at each step (Friedman and Stuetzle 1981). This is implemented here by updating the functions associated with all current tree nodes in the presence of the newly added one, and those previously updated. In particular, after adding the new $(M+1)$st node, all node functions are re-estimated one at a time in order starting from the first
(14)
(15)
Here $k$ labels a node of the tree and $f_k$ its associated function of variable $x_{j(k)}$. In terms of tree construction, each such step involves replacing each ($k$th) node's current function with (14). No changes are made to tree topology (Fig. 1) or the input variables associated with each node. Only the functions (Fig. 2) are updated. Backfitting passes can be repeated until the model becomes stable. This usually happens after one or two passes. Note that in the presence of backfitting, function tree models are not nested. All of the basis functions of the $(M+1)$-node model are different from those of its $M$-node predecessor.
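For the additive special case (a tree of depth one) the backfitting pass reduces to the classical algorithm, sketched below as a continuation of the previous sketch (it reuses fit_node_function); the full procedure re-fits node functions inside products rather than sums:

```python
def backfit(X, y, fs, vars_, n_passes=2):
    """Backfitting in the additive special case: re-fit each node
    function, in order, against the residual of all the others."""
    w = np.ones(len(y))  # unit weights: the root basis is constant
    for _ in range(n_passes):
        for k in range(len(fs)):
            partial = y - sum(fs[m](X[:, vars_[m]])
                              for m in range(len(fs)) if m != k)
            fs[k] = fit_node_function(X[:, vars_[k]], partial, w)
    return fs
```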
3.3 Univariate function estimation
The central operation of the above procedure is repeated estimation of univariate functions of the individual predictor variables. In both (12) and (14) the solutions take the form of weighted conditional expectations
$$\hat f(x_j) = \frac{\hat E\left[B(\mathbf{x})\, r \mid x_j\right]}{\hat E\left[B^2(\mathbf{x}) \mid x_j\right]} \qquad (16)$$
where $B(\mathbf{x})$ represents one of the current basis functions and $r$ are the current residuals. The function tree procedure as described above is agnostic concerning the methods used for this estimation. However, actual performance in terms of execution speed and prediction accuracy can be highly dependent on such choices.
A major consideration in the choice of an estimation method for a particular variable $x_j$ is the nature of that variable in terms of the values it realizes, and the connections between those values. A categorical variable (factor) realizes discrete values with no order relation. The only method for evaluating (16) for such variables is to take the weighted mean at each discrete value. This is also a viable strategy if the values are orderable but realize only a small set of distinct values.
For numeric variables that realize many distinct orderable values one can take advantage of the presumed smoothness of the solution function to improve accuracy by borrowing strength from similarly valued observations. There are many such methods for "smoothing" data (see Irizarry 2019). For the examples presented below, near neighbor local averaging and local linear fitting were employed. Near neighbor local averaging estimates are equivariant under monotone transformations of the $x$-values, depending only on their relative ranks. This can provide higher accuracy in the presence of irregular or clumped $x$-values, resistance to outliers, and more cautious (constant) extrapolation at the edges. For more evenly distributed data with many distinct $x$-values, in the absence of outliers, local linear fitting can yield smoother, more accurate results.
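A minimal sketch of the near-neighbor local averaging option (k, the number of neighbors, is an assumed tuning parameter):

```python
import numpy as np

def knn_smooth(x, y, x0, k=20):
    """Near-neighbor local averaging: estimate E[y | x = t] as the
    mean of y over k consecutive observations in rank order around t,
    so the estimate depends on the x-values only through their ranks
    and extrapolates as a constant beyond the data range."""
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    x0 = np.atleast_1d(x0)
    out = np.empty(len(x0))
    for i, t in enumerate(x0):
        pos = np.searchsorted(xs, t)             # rank position of t
        lo = max(0, min(pos - k // 2, len(xs) - k))
        out[i] = ys[lo:lo + k].mean()            # k rank-neighbors
    return out
```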
4 Interpretation
The primary goal of the function tree representation is to enhance interpretation by exposing a function's separate interaction effects of various orders. These can then be studied using graphical and other methods to gain insight into the nature of the relationship between the predictor variables and the target function.
4.1 Partial dependence functions
One way to estimate the contribution of subsets of predictor variables to a function is through partial dependence functions (Friedman 2001). Let $\mathbf{x}_s$ represent a subset of the predictor variables and $\mathbf{x}_{\backslash s}$ the complement subset. Then the partial dependence of a function $F(\mathbf{x})$ on $\mathbf{x}_s$ is defined as
$$PD_s(\mathbf{x}_s) = E_{\mathbf{x}_{\backslash s}}\left[F(\mathbf{x}_s, \mathbf{x}_{\backslash s})\right] \qquad (17)$$
Conditioned on joint values of the variables in $\mathbf{x}_s$, the value of the function is averaged over the joint values of the complement variables $\mathbf{x}_{\backslash s}$. Since the joint locations of partial dependence functions are generally not identifiable, they are each centered to have zero mean value over the data distribution of $\mathbf{x}_s$.
If for a function $F(\mathbf{x})$ the variables in the specified subset $\mathbf{x}_s$ do not participate in interactions with variables in $\mathbf{x}_{\backslash s}$, then the additive dependence (6) of $F(\mathbf{x})$ on $\mathbf{x}_s$ is well defined and given by $f_s(\mathbf{x}_s)$. If this is not the case, then the dependence of the target function on the subset $\mathbf{x}_s$ is not well defined in the sense that its functional form changes with changing values of the variables in $\mathbf{x}_{\backslash s}$. In this case one can define a "nominal" dependence by averaging over some distribution of the predictor variables $\mathbf{x}_{\backslash s}$. This confounds the properties of the actual target function with those of the data distribution in the resulting estimate of the dependence on $\mathbf{x}_s$. Two popular choices for that distribution are the training data and the product of its marginals (independence). Partial dependence functions choose a compromise in which the variables in $\mathbf{x}_s$ are taken to be independent of those in $\mathbf{x}_{\backslash s}$. Note that the result differs from that using the training data distribution only if there are substantial correlations or associations between the interacting variables in $\mathbf{x}_s$ and $\mathbf{x}_{\backslash s}$.
Partial dependence functions (17) can be estimated from the data in a straightforward way by evaluating
$$\widehat{PD}_s(\mathbf{x}_s) = \frac{1}{N} \sum_{i=1}^{N} F(\mathbf{x}_s, \mathbf{x}_{\backslash s, i}) \qquad (18)$$
over a representative set of joint values of $\mathbf{x}_s$. This requires $n \cdot N$ target function evaluations, where $n$ is the number of evaluation points and $N$ is the sample size used for averaging.
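A direct implementation of (18), with a generic vectorized target F assumed for illustration; the cost is one pass over the data per evaluation point:

```python
import numpy as np

def partial_dependence(F, X, s, grid):
    """Brute-force partial dependence (Eq. 18): for each joint value z
    of the variables in s, clamp X[:, s] = z and average F over the
    training values of the complement variables."""
    pd = np.empty(len(grid))
    for i, z in enumerate(grid):
        Xz = X.copy()
        Xz[:, s] = z
        pd[i] = F(Xz).mean()
    return pd - pd.mean()        # center (here over the grid)

F = lambda X: X[:, 0] * X[:, 1] + X[:, 2]   # example target
X = np.random.rand(500, 3)
grid = np.linspace(0, 1, 21).reshape(-1, 1)
print(partial_dependence(F, X, [0], grid)[:5])
```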
With the function tree representation, partial dependence functions can be computed much more rapidly. For the purpose of computing a partial dependence function on a variable subset $\mathbf{x}_s$, the function tree model can be expressed as
$$\hat F(\mathbf{x}) = \hat F_{\backslash s}(\mathbf{x}_{\backslash s}) + \sum_k g_k(\mathbf{x}_s)\, h_k(\mathbf{x}_{\backslash s}) \qquad (19)$$
where $\hat F_{\backslash s}(\mathbf{x}_{\backslash s})$ represents all basis functions not involving any variables in $\mathbf{x}_s$, the $g_k$ are functions involving variables only in $\mathbf{x}_s$, and $\mathbf{x}_{\backslash s}$ represents the complement variables. With this representation the partial dependence of $\hat F(\mathbf{x})$ on the variables $\mathbf{x}_s$ is simply
$$\widehat{PD}_s(\mathbf{x}_s) = \sum_k \bar h_k\, g_k(\mathbf{x}_s) \qquad (20)$$
where $\bar h_k$ is the mean of $h_k(\mathbf{x}_{\backslash s})$ over the data. This (20) is a particular linear combination of the functions $g_k(\mathbf{x}_s)$. The number of target function evaluations required to compute (20) is proportional to
(21)
where the proportionality factor involves the fraction of node basis functions (8) with variables in both $\mathbf{x}_s$ and $\mathbf{x}_{\backslash s}$.
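For the Node tree sketched in Section 3 this computation is immediate: each basis product factors into a part over $\mathbf{x}_s$ and a part over the complement, and the complement part collapses to its data mean. A minimal sketch, reusing the Node class from the earlier illustration (grid is an array of evaluation points of which only the columns in s are read):

```python
import numpy as np

def tree_partial_dependence(nodes, X, s, grid):
    """Fast partial dependence (Eqs. 19-20) for the Node tree sketch:
    split each basis product into factors over variables in s
    (evaluated on the grid) and factors over the complement
    (averaged once over the training data X)."""
    s = set(s)
    pd = np.zeros(len(grid))
    for node in nodes:
        g = np.ones(len(grid))    # g_k: factors with variables in s
        h = np.ones(len(X))       # h_k: complement factors
        n = node
        while n is not None:
            if n.var in s:
                g = g * n.f(grid[:, n.var])
            else:
                h = h * n.f(X[:, n.var])
            n = n.parent
        pd += h.mean() * g        # Eq. (20): coefficient is mean of h_k
    return pd - pd.mean()         # center, as in Section 4.1
```

In this sketch each node is visited once per evaluation point and once for the data average, so the cost grows like the tree size times $(n + N)$ rather than like the product $n \cdot N$ of (18).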
4.2 Partial association functions
Partial dependence functions focus on the properties of the target function by averaging over a distribution in which the variables $\mathbf{x}_s$ are independent of those in $\mathbf{x}_{\backslash s}$. This concentrates on the target function by removing the effects of associations between those variable subsets. As a result, the calculation can sometimes emphasize $\mathbf{x}$-values for which the data density is small, leading to potential inaccuracy. This only affects a partial dependence to the extent that there are variables in $\mathbf{x}_{\backslash s}$ with which its variables both interact and are correlated.
The function tree representation provides a method for detecting and measuring the strength of such occurrences. "Partial association" functions are defined as
$$PA_s(\mathbf{x}_s) = \sum_k \hat h_k(\mathbf{x}_s)\, g_k(\mathbf{x}_s) \qquad (22)$$
where
$$\hat h_k(\mathbf{x}_s) = \hat E\left[h_k(\mathbf{x}_{\backslash s}) \mid \mathbf{x}_s\right] \qquad (23)$$
with the functions $g_k$ and $h_k$ defined in (19). As with partial dependence functions (20), this can be viewed as a linear combination of the functions $g_k(\mathbf{x}_s)$, but with varying coefficients $\hat h_k(\mathbf{x}_s)$ that account for the associations between the variables in $\mathbf{x}_s$ and the complement variables $\mathbf{x}_{\backslash s}$. The coefficient functions can be estimated by any univariate smoother. Regression splines with knots at the percentiles are used in the examples below. Note that from (19), partial association functions (22) (23) reduce to partial dependence functions (20) when variables in $\mathbf{x}_s$ do not participate in interactions with variables in $\mathbf{x}_{\backslash s}$, or when they are independent of those in $\mathbf{x}_{\backslash s}$ with which they do interact. Otherwise they can produce different results, caused by the variation in the respective coefficient functions $\hat h_k(\mathbf{x}_s)$ induced by the associations between the corresponding interacting variables in $\mathbf{x}_s$ and $\mathbf{x}_{\backslash s}$.
Partial association functions can be used to assess the influence that associations among the predictor variables have on the corresponding partial dependences. In particular, one can substitute them (22) (23) for partial dependence functions (20) in any analysis. Similar results indicate that such influences are small.
4.3 Interaction detection
Partial dependence functions can be used to detect interaction effects between variables. For example, if a target function $F(\mathbf{x})$ contains no interaction between variables $x_j$ and $x_k$, its partial dependence on those variables is
$$PD_{jk}(x_j, x_k) = PD_j(x_j) + PD_k(x_k) \qquad (24)$$
(Friedman and Popescu 2008). If there is such an interaction, (24) represents the additive component and
$$I_{jk}(x_j, x_k) = PD_{jk}(x_j, x_k) - PD_j(x_j) - PD_k(x_k) \qquad (25)$$
represents the corresponding pure interaction component of the effect of $x_j$ and $x_k$ on $F(\mathbf{x})$ as reflected by its partial dependences.
More generally, the pure interaction effect involving a variable subset $\mathbf{x}_s$ is defined to be
$$I_s(\mathbf{x}_s) = PD_s(\mathbf{x}_s) - \sum_{u \subsetneq s} I_u(\mathbf{x}_u) \qquad (26)$$
where each $I_u(\mathbf{x}_u)$ is recursively defined as
$$I_u(\mathbf{x}_u) = PD_u(\mathbf{x}_u) - \sum_{v \subsetneq u} I_v(\mathbf{x}_v), \qquad I_{\{j\}}(x_j) = PD_j(x_j) \qquad (27)$$
This pure interaction component is its corresponding partial dependence with all lower order interactions among all its variable subsets removed. If there is no $n$-variable interaction effect involving $\mathbf{x}_s$, its pure interaction component $I_s(\mathbf{x}_s)$ is zero. Note that this result depends only on the properties of the function (5) and not on the data distribution. The strength of such an interaction effect can be taken to be the standard deviation of its pure interaction component over the training data, or other specified data distribution, divided by the standard deviation of the corresponding target function
$$H_s = \frac{\operatorname{std}\left[I_s(\mathbf{x}_s)\right]}{\operatorname{std}\left[F(\mathbf{x})\right]} \qquad (28)$$
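For the two-variable case, (25) and (28) can be computed directly from any black-box target by brute force; a minimal sketch (quadratic in the sample size, so for illustration only):

```python
import numpy as np

def pd_on(F, X, s, points):
    """Partial dependence of F on subset s, evaluated at `points`
    (Eq. 18 with the training data as the averaging sample)."""
    out = np.empty(len(points))
    for i in range(len(points)):
        Xz = X.copy()
        Xz[:, s] = points[i, s]
        out[i] = F(Xz).mean()
    return out - out.mean()

def pairwise_interaction_strength(F, X, j, k):
    """Eqs. (25) and (28): the pure interaction is PD_jk minus PD_j
    and PD_k; its strength is its standard deviation over the data
    relative to that of F."""
    pure = (pd_on(F, X, [j, k], X) - pd_on(F, X, [j], X)
            - pd_on(F, X, [k], X))
    return pure.std() / F(X).std()

F = lambda X: X[:, 0] * X[:, 1] + X[:, 2]
X = np.random.rand(300, 3)
print(pairwise_interaction_strength(F, X, 0, 1))  # substantial
print(pairwise_interaction_strength(F, X, 0, 2))  # near zero
```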
If there exists a higher order interaction effect (28) jointly involving the variables in subset $\mathbf{x}_s$ along with other variables, then the form of the interaction function $I_s(\mathbf{x}_s)$ of the subset depends on the joint values of those complement variables. The resulting unconditioned interaction effect is then an average over the joint distribution of the complement variables. If there are no such higher order interactions involving $\mathbf{x}_s$ jointly with other variables, its pure interaction effect does not depend on the data distribution and reflects only properties of the target function $F(\mathbf{x})$.
The measure (28) indicates the strength of an $n$-variable interaction effect involving the corresponding variable subset $\mathbf{x}_s$ over the data distribution. One can search for substantial interaction effects by evaluating (26)–(28) over all variable subsets up to some maximum size. This approach can in principle be applied to a target function represented in any form. All that is required to compute partial dependence functions is the value of the function at various specified values of $\mathbf{x}$.
Figure 3 shows the strengths (28) of all interaction effects uncovered in the synthetic data example of Section 3 using this approach. The presence of more lower order interaction effects shown here than explicitly represented by the function tree (Fig. 1) is due to lower level marginals of higher order basis functions (9) being assigned to their proper interaction level.
5 Computation
For a total of $p$ variables, the number of distinct subsets involving $n$ variables is $\binom{p}{n}$. For each such subset the number of partial dependence functions required to extract its pure interaction effect (26) (27) grows exponentially in its size $n$. Thus for larger values of $p$ and $n$ the computation required by the standard method (18) to evaluate all of the necessary partial dependence functions is generally too severe to allow a search for interaction effects by complete enumeration over all variable subsets. However, for functions represented by a function tree, one can employ (20) to dramatically reduce the computation (21) required to evaluate each partial dependence function. This in turn allows the search for interactions to be performed over many more and larger variable subsets representing higher order interaction effects.
For small to moderate numbers of variables $p$, complete enumeration using (20) is generally feasible. However, for larger problems the rapid growth with respect to both $p$ and $n$ soon places the required computation out of reach. Parallel computation and smart strategies for reusing previously computed partial dependence functions can somewhat ease this burden. However, the properties of partial dependence functions allow a simple input variable screening approach that can often considerably reduce the size of the search.
If for a (centered) function $F(\mathbf{x})$ a variable $x_j$ participates in no interactions with any other variables, then it can be expressed as
$$F(\mathbf{x}) = PD_j(x_j) + PD_{\backslash j}(\mathbf{x}_{\backslash j}) \qquad (29)$$
where $PD_j(x_j)$ is its partial dependence on $x_j$ and $PD_{\backslash j}(\mathbf{x}_{\backslash j})$ is its partial dependence on all other variables (Friedman and Popescu 2008). One can define an overall interaction strength for each predictor variable $x_j$ as
$$H_j = \frac{\operatorname{std}\left[F(\mathbf{x}) - PD_j(x_j) - PD_{\backslash j}(\mathbf{x}_{\backslash j})\right]}{\operatorname{std}\left[F(\mathbf{x})\right]} \qquad (30)$$
Variables with small values of $H_j$ can be removed from the search for interaction effects. This often excludes many variables, thereby substantially reducing computation.
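A sketch of this screen, continuing the example above and reusing pd_on (again brute force, for illustration only):

```python
def variable_interaction_strength(F, X, j):
    """Eqs. (29)-(30): if x_j interacts with nothing, the centered F
    decomposes as PD_j plus the PD on all other variables, and the
    residual below vanishes."""
    others = [c for c in range(X.shape[1]) if c != j]
    Fc = F(X) - F(X).mean()
    resid = Fc - pd_on(F, X, [j], X) - pd_on(F, X, others, X)
    return resid.std() / Fc.std()

print([round(variable_interaction_strength(F, X, j), 2) for j in range(3)])
# x_2 screens out; x_0 and x_1 remain candidates for the search
```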
Computation can be further reduced by taking advantage of the function tree representation. The function tree strength of the contribution of input variable $x_j$ to interaction level $i$ is taken to be
$$S_{ji} = \sum_{m=1}^{M} I(o_m = i)\; I(x_j \in B_m)\; \operatorname{Var}\left[B_m(\mathbf{x})\right] \qquad (31)$$
Here $I(\cdot)$ is an indicator of the truth of its argument, $m$ labels a node of the $M$-node function tree, $B_m(\mathbf{x})$ is its corresponding basis function, $o_m$ its interaction order, and the variance is over the data distribution. This can be used as a filter to exclude variables with small values of (31) from the search for $i$-variable interaction effects. Since node basis functions $B_m(\mathbf{x})$ are not pure interaction effect functions (26) (27), the search for $i$-variable interaction effects should include the union of all relevant variables with substantial contributions (31) at that interaction level or higher.
In many applications the number of variables involved in higher order interactions is less than that involved in main or lower order interaction effects. Since computation increases very rapidly with the number of variables examined, further variable reduction at the interaction level (31) can substantially reduce computation, even for minor variable reductions.
For the simulation example in Section 6.2 (below), exhaustive search over all variables at each of four interaction levels requires computing a very large number of partial dependence functions with corresponding target function evaluations using (18). This would take on the order of days on a Dell XPS 13 laptop. Approximating the search by only including the ten variables estimated to be the most important reduces the necessary number of partial dependence functions and target function evaluations; this would take approximately seven hours.
Partial dependence screening using (30) identifies six interacting variables, further reducing the number of partial dependence functions computed and the corresponding target function evaluations using (18). This reduces the computing time to the order of minutes. Building the function tree for this example required on the order of seconds. Corresponding interaction level screening using (31) eliminated all four-variable interactions. Six variables remained for both the two-variable and three-variable interaction searches, and ten variables for main effects. Employing (20) for the required partial dependence functions involved target function evaluations requiring on the order of seconds.
5.1 Interaction investigation
Once discovered, interactions as well as main effects can be visualized using partial dependence plots. Figure 4 shows a heat map representation of the pure interaction effect (26) (27) between one of the interacting pairs of variables for the function tree model (Figs. 1 & 2). Colors black, blue, purple, violet, red, orange, and yellow, seen in the lower left legend, represent a continuum of increasing function values from lowest to highest.
One sees that joint high or joint low values of these two variables produce relatively large increases in function value (red, yellow), whereas opposite values, (low, high) or (high, low), give rise to large decreases (blue, black). Since Fig. 3 shows no higher order interactions involving these variables, this function represents the corresponding interaction effect associated with the target function estimate and does not reflect the nature of the data distribution. One can verify this behavior by examining the true target function (10).
Two of the variables are seen in Fig. 3 to have a relatively weak interaction. However, they are also seen to participate in a substantial three-variable interaction with a third variable, so that their two-variable interaction function (26) (27) may not provide a complete description of their joint effect on the target estimate.
Figure 5 shows a plot of the average (unconditional) pure interaction function of these two variables (upper left), and corresponding interaction functions conditioned at three values of the third variable, all plotted on the same vertical scale. Here a black color represents the most negative values, yellow the most positive, and violet the smallest absolute values. The upper left frame shows that the variation of the unconditional interaction function is quite small (violet), as indicated in Fig. 3. The lower left frame shows the same for the first conditioning value. The two right frames, however, show very strong interaction effects at the other two values. There is almost no interaction between the two variables at the first conditioning value; at the other two there are strong but opposite interaction effects. This leads to an overall weak two-variable interaction when averaged over the distribution of the third variable (upper left). Seeing only this weak two-variable interaction effect without knowledge of the corresponding three-variable interaction would leave the impression that these two variables influence the target in an additive, unrelated manner. As seen in the right two frames of Fig. 5, this is clearly not the case. Thus, in the presence of higher order interactions, lower order representations can be misleading.
6 Illustrations
The function tree methodology described above was applied to a number of popular public data sets to investigate the nature of the relationship between the joint values of their predictor variables and outcome. In many cases the corresponding function tree uncovered simple relationships involving additive effects or at most a few two-variable interactions. Others were seen to involve more complex structure. Examples of both are illustrated here.
The goal of the function tree approach is interpretational. As such, its form (7) (8) is somewhat more restrictive than other more flexible methods focused purely on prediction accuracy, such as XGBoost (Chen and Guestrin 2016) or Random Forests (Breiman 2001). However, if on any particular problem its accuracy is substantially less than these other methods, the quality of the corresponding interpretation may be questionable. For each of the examples presented in this section, root-mean-squared prediction error
$$RMSE = \sqrt{\hat E\left[\left(y - \hat F(\mathbf{x})\right)^2\right]} \qquad (32)$$
is reported for default versions of the three methods: no tuning for Random Forests and only model size tuning for function trees and XGBoost. In all examples presented here the estimated interaction effects are based on partial dependence functions (Section 4.1). Corresponding summaries using partial association functions (Section 4.2) produced nearly identical results in all cases.
6.1 Capital bikeshare data
This data set, taken from the UCI Machine Learning Repository, consists of 17379 hourly observations of counts of bicycle rentals in the Capital bikeshare system in Washington, D.C. between 2011 and 2012. The outcome variable is the rental count. The sixteen predictor variables are both categorical and numeric, consisting of corresponding time, weather, and seasonal information.
Figure 6 displays the tree for the function tree model and Fig. 7 shows the corresponding node functions. Its interaction effect summary is shown in Fig. 8. Figure 8 indicates the existence of substantial two-variable and three-variable interactions. An important question is the extent to which these effects are properties of the target function and not just properties of a particular training sample. A way to address this is through a bootstrap analysis (Efron 1979). Figure 9 shows box plots of the distributions over fifty bootstrap replications of test set root-mean-squared error (32) of function tree models built under three constraints on maximum interaction order: unconstrained (left), no three-variable interactions allowed (center), and no interactions allowed (right). As seen, both two-variable and three-variable interactions are significant.
The largest estimated two-variable interaction effect is between variables hour-of-day (hr) and the binary indicator for working/non-working days (wrk). Figure 10 shows the partial dependence of DC bike rentals on time-of-day conditioned on working days (top) and non-working days (bottom). They are seen to be quite different, indicating a strong interaction effect between these two variables. In particular, the sharp reduction in rentals beginning at 8am and continuing until 5pm on working days is seen not to exist on non-working days.
The upper left frame of Fig. 11 shows the partial dependence of bike rentals jointly on adjusted temperature (atemp) and time-of-day (hr), along with its pure interaction component (top right). Figure 8 indicates a three-variable interaction between variables hour (hr), working day indicator (wrk) and adjusted temperature (atemp). Thus, the nature of the interaction effect between hr and atemp may depend on the value of wrk. The bottom two frames of Fig. 11 show this interaction effect separately conditioned on the two values of wrk. One sees substantial difference in the nature of this interaction for the different values of wrk, reflecting the presence of the three-variable interaction.
Figure 8 indicates that the strongest three-variable interaction effect involves the working day indicator (wrk), day of the week (wekd) and hour of the day (hr). This is illustrated in Fig. 12, where the interaction function of (wekd, wrk) is displayed conditioned on hr = midnight, 6am, noon, and 6pm. Note that there are no observations for which Saturday and Sunday are labeled as working days. The absence of a three-variable interaction would imply that the (wekd, wrk) interaction function is the same for all values of the variable hr. As seen in Fig. 12 this is not the case, especially for non-working weekdays at noon (lower left).
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.2 Simulated data
This example was presented in Hu et al. (2023). They generated data using the target function
(33)
This is a function of ten variables involving numerous interactions of up to three variables. The response was simulated as
$$y = F(\mathbf{x}) + \varepsilon \qquad (34)$$
There were twenty predictor variables. The first ten were simulated from a multivariate Gaussian distribution with equal correlation between all pairs. Ten additional variables were included that were independent of the first ten (irrelevant variables). They were also simulated from a multivariate Gaussian distribution with equal correlation between all pairs. All predictors were truncated to lie within a fixed interval.
Figure 13 shows the largest estimated main and interaction effect strengths up through four variables. One sees that all of the main and interaction effects represented in (33) are shown with substantial strengths. Those not represented in (33), including all four-variable interactions, either do not appear or do so with very low estimated strength.
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.3 Pumadyn data
The other examples were chosen to represent data with somewhat complex target function structure. Here we illustrate on a data set where the function tree uncovers very simple structure. The data (Corke 1996) are produced from a realistic simulation of the dynamics of a Puma 560 robot arm. The task is to predict the angular acceleration of one of the robot arm's links. There are 8192 observations involving 32 input variables that include angular positions, velocities and torques of the robot arm.
Figure 14 shows the entire resulting function tree model. The upper left frame shows the tree and the other frames show the node functions. Figure 15 shows the (only) interaction effect, between tau4 and theta5. The upper left frame shows the partial dependence function; the upper right frame shows its additive component and the lower left frame its pure interaction effect. The lower right frame represents the partial dependence as a perspective mesh plot. Of the 32 input variables, the function tree model selected only two of them to explain 97% of the variance of the target.
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.4 SGEMM GPU kernel performance
This data set from the UCI Machine Learning Repository measures the running time required for a matrix-matrix product using a parameterizable SGEMM GPU kernel. There are 241600 observations with 14 predictor variables:
[Table: SGEMM GPU kernel performance predictor variables]
As suggested by the authors, the outcome was taken to be the logarithm of the running time. All predictor variables realize at most four distinct values and were therefore treated as being categorical. Figure 16 shows the estimated interaction effect profile for this data. Among the larger effects are four main effects, seven two-variable, four three-variable, and a large four-variable interaction effect.
Figure 17 shows box plots of the distributions over ten bootstrap replications of test set root-mean-squared error (32) of function tree models built under three interaction constraints: unconstrained (left), the four-variable interaction among variables (mwg, nwg, mdimc, ndimc) prohibited (center), and all four-variable interactions prohibited (right). Note that disallowing the four-variable interaction among (mwg, nwg, mdimc, ndimc) does not preclude any of these variables from participating in lower order interactions, or in four-variable interactions with other variables. One sees that incorporating the (mwg, nwg, mdimc, ndimc) four-variable interaction is very important to model accuracy, as are other four-variable interactions to a lesser extent.
Since the (mwg, nwg, mdimc, ndimc) interaction effect is inherently four dimensional, it cannot be completely represented by a two dimensional display. Unlike three-variable interactions, it cannot be represented by a series of two dimensional plots conditioned on the value of a third variable. However, it is possible to gain some insight about this four-variable interaction effect through graphical methods.
Figure 18 depicts the difference between interaction effects computed from two models
$$\Delta(\mathbf{x}) = I^{(1)}(\mathbf{x}) - I^{(2)}(\mathbf{x}) \qquad (35)$$
The first is the full model with all interactions allowed (Fig. 17 left). The second model was built under the constraint that no four-variable interaction effect among variables (mwg, nwg, mdimc, ndimc) was allowed (Fig. 17 center). Thus, any structure detected in the difference (35) is due to the presence of this four-variable interaction.
Figure 18 shows nine bivariate interaction effect differences (35) between these two models, conditioned on joint values of (mwg, nwg). For the top two rows of mdimc values, the model differences are seen to be small in absolute value (violet) for all joint values of the other variables, indicating that the four-variable interaction has little to no effect for those collective values. For the last row (largest mdimc) one sees that the absolute differences are larger (red, yellow, blue), especially for the largest value of ndimc. This four-variable interaction effect is seen to be mainly the result of a very large increase in (log) running time for the GPU matrix product when simultaneously mdimc and ndimc realize their largest values while at the same time mwg and nwg are at their smallest values.
The root-mean-squared error (32) of the function tree, XGBoost and Random Forest models on this example was , , and , respectively.
6.5 Global surrogate
Function trees can be used as interpretable surrogate models to investigate the predictions of black box models. One simply uses the black box predictions along with the corresponding predictor variables as input. To the extent that the function tree fit is accurate, all of its interpretational tools can then be applied to study the nature of the black box model. This is illustrated using the simulated data of Hu et al. (2023), shown in Section 6.2, where the true underlying structure (33) is known.
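The two-stage pattern is simple. The sketch below uses the real xgboost Python package for the black box, while fit_function_tree is a hypothetical stand-in for the (non-public) function tree procedure; any interpretable fitter, e.g. the forward-stepwise sketch of Section 3.1, could be substituted:

```python
import numpy as np
from xgboost import XGBRegressor

# Stage 1: fit the black box to the raw data (toy target for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 10))
y = X[:, 0] * X[:, 1] + np.sin(np.pi * X[:, 2]) + rng.normal(0, 0.1, 5000)
black_box = XGBRegressor(n_estimators=300, max_depth=4).fit(X, y)

# Stage 2: train the surrogate on the black box's *predictions*, not on y,
# so it approximates the model rather than the data.
y_model = black_box.predict(X)
# surrogate = fit_function_tree(X, y_model)   # hypothetical stand-in
```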
First, XGBoost is applied in regression mode to model the data (34), producing an estimate of the target function. The resulting test data root-mean-squared error
$$RMSE_{test} = \sqrt{\frac{1}{N_{test}} \sum_{i \in test} \left(y_i - \hat F(\mathbf{x}_i)\right)^2} \qquad (36)$$
was recorded. The output of the XGBoost model was then fit with a function tree; the error (36) for this fit indicated that the function tree accurately reflects the XGBoost model. Figure 19 displays the resulting largest main and interaction effects detected by the function tree in the XGBoost model. This can be compared to the function tree based on the original data shown in Fig. 13, as well as to the structure of the true target function (33). One sees that the XGBoost regression model here closely captures the true target function structure as reflected in its function tree representation.
Next, a Random Forest is applied to the same data, and the output of this random forest model was fit by a function tree with a corresponding error (36). Figure 20 shows the largest main and interaction effects detected in the Random Forest model. While capturing the nature of the target function reasonably well, the random forest appears to somewhat underestimate the interaction strengths and to identify several spurious three-variable interactions.
Finally, we apply the surrogate approach in a classification setting. The outcome variable is binary, with log-odds function given by (33). Logistic XGBoost was applied to this data, producing a log-odds estimate. The resulting root-mean-squared error (36) is in this case much larger, owing to the loss of information associated with providing the learning algorithm with only the sign of the outcome rather than its actual numerical value. This logistic XGBoost model was then fit with a function tree; although the resulting error is larger than in the regression case, this still represents a fairly close fit. The result is shown in Fig. 21. As would be expected, this lower accuracy logistic XGBoost model less perfectly captures the true target function (33). It uncovers all of the true main and interaction effects. It also indicates a number of spurious main and two-variable interactions at somewhat lower strength. The strength of the three-variable interaction is underestimated and there are several spurious three-variable interactions indicated at comparable strength. In spite of its lower accuracy, the logistic XGBoost model is still seen in Fig. 21 to reflect much of the intrinsic structure of the target log-odds (33).
7 Previous work
The origins of the function tree approach lie with the additive modeling procedure of Hastie and Tibshirani (1990). They used nonparametric univariate smoothers and backfitting to produce models with no interactions. Function trees can be viewed as generalizing that method to discover and include unspecified interaction effects.
The closest predecessor to function trees is MARS (Friedman 1991). It models a general multivariate function as a sum of products of simple basic univariate functions of the predictor variables and can in principle model interaction effects to high order. The basic univariate function used with MARS is a dReLU (double ReLU)
$$f(x) = a\,(t - x)_+ + b\,(x - t)_+ \qquad (37)$$
characterized by two slope parameters $a$, $b$ and knot location $t$. The MARS forward stepwise modeling strategy is similar to that used with the function tree approach.
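In Python the dReLU (37), under the parameterization assumed above, is one line:

```python
import numpy as np

def drelu(x, a, b, t):
    """dReLU basic function (37): slope a to the left of the knot t,
    slope b to the right (assumed parameterization)."""
    return a * np.maximum(t - x, 0.0) + b * np.maximum(x - t, 0.0)
```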
The principal difference between MARS and function trees involves their basic building blocks. MARS employs the simple dReLU (37) at every step. Function trees employ more general functions of the predictor variables estimated by user selected smoothers. This allows customizing the estimation method to the nature of each individual predictor variable distribution. It also produces more parsimonious models (smaller trees), since many dReLUs (37) can be required to approximate even simple curves.
A second difference with MARS is that function trees allow different functions of the same variable to appear multiple times in the products of a single basis function (8), as seen in Fig. 6. This provides more flexibility by incorporating stepwise multiplicative as well as additive modeling. Besides increased accuracy, this further reduces model size. In addition to being more interpretable, smaller models allow faster computation of their partial dependence functions.
The use of partial dependence functions to uncover interaction effects was proposed by Friedman and Popescu (2008). The innovation here is their rapid identification and computation afforded by the function tree representation. This allows a comprehensive search for all interaction effects up to high orders. Other techniques that have been proposed for interaction detection are mostly based on the functional ANOVA decomposition (see Roosen 1995, Hooker 2004 & 2007, Kim et al. 2009, Lengerich et al. 2020, Hu et al. 2023, and Walters et al. 2023). So far, computational considerations have limited these approaches to two-variable interactions. Lou et al. (2013) use a bivariate partitioning method to screen for two-variable interactions. Tang et al. (2016) combine the functional ANOVA technique with the polynomial dimensional decomposition to reduce computation with independent variables.
8 Discussion
While sometimes competitive in accuracy with other more flexible methods such as XGBoost and Random Forests, the focus of the function tree approach is on interpretation. The goal is to provide a representation of the target function that exposes its interaction effects and provides a framework for their rapid calculation, especially those involving more than two variables. Almost all research into interaction detection to date has been limited to that involving just two variables. In fact, in many settings the unqualified term "interaction effect" is meant to refer to two variables only.
Focusing only on two-variable interactions is natural because it reduces the size of the search, and once interactions are detected their functional forms are easily examined by traditional graphical methods such as heat maps, contour plots, or perspective mesh plots. Higher order interactions involve more variables and their higher dimensional structures are not as easily represented. As illustrated in Figs. 5, 11 and 12, three-variable interactions can be visualized by viewing a series of bivariate interaction plots conditioned on selected values of another variable. As seen in Fig. 18, interactions involving more variables can be investigated by contrasting models with and without those interactions included.
In addition to their direct interpretational value, knowledge of the existence and strengths of higher order interactions can be important because they place interpretational limits on the nature of those of lower order involving the same variables. As noted above, the functional form of an interaction effect involving a subset of variables depends on the value of another variable if there exists a higher order interaction involving both that subset and the other variable. If there are no such substantial higher order interactions, the functional form of the interaction is well defined and represents the isolated joint contribution of those variables to the target function. If such higher order interactions do exist, then the form of the lower order interaction is not well defined, as it depends on the values of the other interacting variables, and its corresponding functional form becomes an average over their joint distribution. As seen for example in Fig. 5, this can lead to highly misleading interpretations. Tan et al. (2023) discuss problems interpreting lower order effects in the presence of higher order interactions.
In applications involving training data with very large absolute correlations among subsets of predictor variables, main and interaction effects of various levels involving those variables tend not to be separable. This can cause substantial spurious interaction effects to be reported. These can be detected by comparing the lack-of-fit (32) for models constructed with and without the questionable effects included, as illustrated in Figs. 9 and 17.
A popular interpretational tool used to investigate predictive models is a measure of the impact or importance of the respective individual input variables on model predictions. There are a wide variety of definitions of variable importance, each providing different information. See Molnar (2023) for a comprehensive survey. In the presence of interactions the contribution of a given predictor variable depends on the values of the other variables with which it interacts. Interaction effect summaries (e.g. Figs. 8, 13 and 16) are thus more comprehensive than corresponding variable importance summaries. Variable importances can be derived from interaction based functional decompositions. For example, Gevaert and Saeys (2022) and Walters et al. (2023) use them to derive Shapley (1953) values.
In the examples presented here, partial dependence functions were used to both detect and examine interaction effects. Computational considerations largely dictate their use for the former. However, once uncovered, identified main and interaction effects can be examined by any appropriate method. For example, accumulated local effects (ALE) functions (Apley and Zhu 2020) can be employed for this purpose. Also, partial dependence based search results can be used to guide methods for constructing functional ANOVA decompositions.
References
[1] Apley, D. and Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. J. Roy. Statist. Soc. Series B 82(4), 1059–1086.
[2] Breiman, L. (2001). Random Forests. Machine Learning 45, 5–32.
[3] Chen, T. and Guestrin, C. (2016). XGBoost: a scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 785–794.
[4] Corke, P. I. (1996). A robotics toolbox for MATLAB. IEEE Robotics and Automation Magazine 3, 24–32.
[5] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1–26.
[6] Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76, 817–823.
[7] Friedman, J. (1991). Multivariate adaptive regression splines. Ann. Statist. 19, 1–67.
[8] Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Ann. Statist. 29, 1189–1232.
[9] Friedman, J. and Popescu, B. (2008). Predictive learning via rule ensembles. Ann. Appl. Statist. 2, 916–954.
[10] Gevaert, A. and Saeys, Y. (2022). PDD-SHAP: fast approximations for Shapley values using functional decomposition. arXiv:2208.12595 [cs.LG].
[11] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. CRC Press.
[12] Hooker, G. (2004). Discovering additive structure in black box functions. Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
[13] Hooker, G. (2007). Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comp. and Graph. Statist. 16, 709–732.
[14] Hu, L., Nair, V. J., Sudjianto, A., Zhang, A. and Chen, J. (2023). Interpretable machine learning based on functional ANOVA framework: algorithms and comparisons. arXiv:2305.15670 [stat.ML].
[15] Irizarry, R. A. (2019). Introduction to Data Science. Chapman & Hall.
[16] Kim, Y., Kim, J., Lee, S. and Kwon, S. (2009). Boosting on the functional ANOVA decomposition. Statistics and Its Interface 2, 361–368.
[17] Lengerich, B., Tan, S., Chang, C., Hooker, G. and Caruana, R. (2020). Purifying interaction effects with the functional ANOVA: an efficient algorithm for recovering identifiable additive models. Proc. 23rd Int. Conf. on Artificial Intelligence and Statistics (AISTATS).
[18] Lou, Y., Caruana, R., Gehrke, J. and Hooker, G. (2013). Accurate intelligible models with pairwise interactions. Proc. 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.
[19] Molnar, C. (2023). Interpretable Machine Learning. Lean Publishing.
[20] Roosen, C. (1995). Visualization and exploration of high-dimensional functions using the functional ANOVA decomposition. PhD thesis, Dept. of Statistics, Stanford University.
[21] Shapley, L. (1953). A value for n-person games. Contributions to the Theory of Games 2(28), 307–317.
[22] Tan, S., Hooker, G., Koch, P., Gordo, A. and Caruana, R. (2023). Considerations when learning additive explanations for black-box models. Machine Learning 112, 3333–3359.
[23] Tang, K., Congedo, P. and Abgrall, R. (2016). Adaptive surrogate modeling by ANOVA and sparse polynomial dimensional decomposition for global sensitivity analysis in fluid simulation. J. Comput. Physics 314, 557–589.
[24] Walters, B., Ortega-Martorell, S., Olier, I. and Lisboa, P. J. G. (2023). How to open a black box classifier for tabular data. Algorithms 16, 181.