We propose an extension of an entropy-based heuristic for constructing a decision tree from a large database with many numeric attributes. Conventional methods for handling numeric attributes are inefficient when any of those attributes are strongly correlated, and our approach offers one solution to this problem. For each pair of strongly correlated numeric attributes, we compute a two-dimensional association rule with respect to these attributes and the objective attribute of the decision tree. In particular, we consider a family ℛ of grid-regions in the plane associated with the pair of attributes. For R ∈ ℛ, the data can be split into two classes: data inside R and data outside R. We compute the region Ropt ∈ ℛ that minimizes the entropy of the splitting, and add the splitting associated with Ropt (for each pair of strongly correlated attributes) to the set of candidate tests in an entropy-based heuristic. We give efficient algorithms for cases in which ℛ is (1) x-monotone connected regions, (2) based-monotone regions, (3) rectangles, and (4) rectilinear convex regions. The algorithms have been implemented as a subsystem of SONAR (System for Optimized Numeric Association Rules), developed by the authors. We have confirmed that the optimal region can be computed efficiently. Diverse experiments also show that our approach can create compact trees whose accuracy is comparable with or better than that of conventional trees. More importantly, the region splitting captures non-linear correlations among numeric attributes that could not be found without it.
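To make the splitting criterion concrete, the following is a minimal sketch of the entropy of an inside/outside split, together with an exhaustive search over rectangles of a small grid. It is not the efficient algorithm of the paper: it assumes a binary objective attribute, a non-empty grid of per-cell counts, and the usual weighted binary entropy as the measure to minimize. The function names (`entropy`, `split_entropy`, `best_rectangle`) are hypothetical and chosen only for illustration.

```python
import math


def entropy(pos, total):
    """Binary entropy (in bits) of a group with `pos` positives among `total` tuples."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def split_entropy(pos_in, n_in, pos_all, n_all):
    """Weighted entropy of splitting the data into 'inside R' and 'outside R'."""
    n_out = n_all - n_in
    pos_out = pos_all - pos_in
    return (n_in / n_all) * entropy(pos_in, n_in) + (n_out / n_all) * entropy(pos_out, n_out)


def prefix_sums(grid):
    """2-D prefix sums so that any rectangle sum costs O(1)."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            ps[i + 1][j + 1] = grid[i][j] + ps[i][j + 1] + ps[i + 1][j] - ps[i][j]
    return ps


def rect_sum(ps, r0, c0, r1, c1):
    """Sum over the cells with row in [r0, r1) and column in [c0, c1)."""
    return ps[r1][c1] - ps[r0][c1] - ps[r1][c0] + ps[r0][c0]


def best_rectangle(count, pos):
    """Brute-force search for the rectangle minimizing the split entropy.

    count[i][j]: number of tuples falling in grid cell (i, j)
    pos[i][j]:   number of those tuples whose objective attribute is positive
    Returns (entropy, (r0, c0, r1, c1)) with half-open row/column ranges.
    Assumes the grid contains at least one tuple.
    """
    rows, cols = len(count), len(count[0])
    ps_n, ps_p = prefix_sums(count), prefix_sums(pos)
    n_all, pos_all = ps_n[rows][cols], ps_p[rows][cols]
    best = (float("inf"), None)
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    n_in = rect_sum(ps_n, r0, c0, r1, c1)
                    p_in = rect_sum(ps_p, r0, c0, r1, c1)
                    e = split_entropy(p_in, n_in, pos_all, n_all)
                    if e < best[0]:
                        best = (e, (r0, c0, r1, c1))
    return best
```

On a 3×3 grid with two tuples per cell and all positives concentrated in the center cell, `best_rectangle` returns the center cell with entropy 0, since both sides of the split are then pure. The exhaustive loop examines every rectangle of the grid, so it only serves to illustrate the objective on small grids.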