Weka - Attribute Selection Measure: Information Gain (ID3)

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains. - Wikipedia
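
At its core, ID3 picks the attribute to split on by computing the information gain: the drop in entropy of the class after the data is partitioned by that attribute. Below is a minimal sketch of that calculation in plain Java (this is not Weka code, and the counts in main are just illustrative, similar to the classic weather data):

    import java.util.Arrays;

    public class InfoGainSketch {

        // Entropy of a class distribution, given the count of each class value.
        static double entropy(int[] classCounts) {
            int total = Arrays.stream(classCounts).sum();
            double e = 0.0;
            for (int count : classCounts) {
                if (count == 0) continue;
                double p = (double) count / total;
                e -= p * (Math.log(p) / Math.log(2));   // log base 2
            }
            return e;
        }

        // Information gain = entropy(parent) - weighted sum of the child entropies.
        static double informationGain(int[] parentCounts, int[][] childCounts) {
            int total = Arrays.stream(parentCounts).sum();
            double gain = entropy(parentCounts);
            for (int[] child : childCounts) {
                int size = Arrays.stream(child).sum();
                gain -= ((double) size / total) * entropy(child);
            }
            return gain;
        }

        public static void main(String[] args) {
            // Illustrative counts: 9 "yes" and 5 "no" instances, split by a
            // three-valued attribute (something like Outlook).
            int[] parent = {9, 5};
            int[][] children = {{2, 3}, {4, 0}, {3, 2}};
            System.out.println("Information gain = " + informationGain(parent, children));
        }
    }

ID3 computes this gain for every remaining attribute, splits on the one with the highest value, and then repeats the process in each branch.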

The data that will be tested is:


We perform attribute selection using Information Gain as the Attribute Evaluator. Here are the results:
From the result above, we can see that Outlook gives the best split, while Day does not contribute anything to the output, so we can remove the Day attribute before testing with Id3.
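
If you prefer to drop the Day attribute from code rather than in the Preprocess tab, a sketch like the following should work with the Weka Java API (the ARFF file name and the position of the Day attribute are assumptions; adjust them to your copy of the data):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class RemoveDay {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather-with-day.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);               // class (play) is the last attribute

            Remove remove = new Remove();
            remove.setAttributeIndices("1");        // assuming Day is the first attribute
            remove.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, remove);

            System.out.println(filtered.toSummaryString());
        }
    }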

For those who don't have Id3 in their Weka, you can install it from the package manager; the package is named "simpleEducationalLearningSchemes".
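
If I am not mistaken, the same package can also be installed from the command line through Weka's package manager class (the exact invocation may differ between Weka versions):

    java weka.core.WekaPackageManager -install-package simpleEducationalLearningSchemes

You may need to restart Weka before Id3 shows up in the classifier tree.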

After the package is installed, go to the Classify tab, then Choose -> trees -> Id3 to test the data. Before you start, you can click on More Options and then Choose -> PlainText for the output.


After that, you can start the Id3 run. Here are the results.

Id3 can't visualize the tree, but we can draw it based on the results given above.

Here is the tree. From the Information Gain ranking, the best split is the Outlook attribute, so Id3 uses Outlook as the root node, followed by temperature and windy.
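
For reference, the same model can be built through the Weka Java API once the package is installed; printing the classifier gives the textual tree that the drawing above is based on. A rough sketch (the ARFF file name is an assumption, and the data must be fully nominal with the Day attribute already removed, since Id3 only handles nominal attributes):

    import weka.classifiers.trees.Id3;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Id3Tree {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather-no-day.arff");   // assumed: data with Day removed
            data.setClassIndex(data.numAttributes() - 1);

            Id3 id3 = new Id3();
            id3.buildClassifier(data);

            System.out.println(id3);   // prints the induced tree as text
        }
    }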

That's all. Thanks!

Weka - Information Gain and Gain Ratio Using Soybean Database

Notes: The large soybean database (soybean-large-data.arff) and its corresponding test database (soybean-large-test.arff) are combined into a single file (soybean-large.arff).

1. Title: Large Soybean Database

2. Sources:
      (a) R.S. Michalski and R.L. Chilausky "Learning by Being Told and
          Learning from Examples: An Experimental Comparison of the Two
      Methods of Knowledge Acquisition in the Context of Developing
      an Expert System for Soybean Disease Diagnosis", International
      Journal of Policy Analysis and Information Systems, Vol. 4,
      No. 2, 1980.
      (b) Donor: Ming Tan & Jeff Schlimmer (Jeff.Schlimmer@cs.cmu.edu)
      (c) Date: 11 July 1988

3. Past Usage:
     1. See above.
     2. Tan, M., & Eshelman, L. (1988). Using weighted networks to represent
        classification knowledge in noisy domains.  Proceedings of the Fifth
        International Conference on Machine Learning (pp. 121-134). Ann Arbor,
         Michigan: Morgan Kaufmann.
         -- IWN recorded a 97.1% classification accuracy
            -- 290 training and 340 test instances
      3. Fisher,D.H. & Schlimmer,J.C. (1988). Concept Simplification and
         Predictive Accuracy. Proceedings of the Fifth
         International Conference on Machine Learning (pp. 22-28). Ann Arbor,
         Michigan: Morgan Kaufmann.
         -- Notes why this database is highly predictable

4. Relevant Information Paragraph:
     There are 19 classes, only the first 15 of which have been used in prior
     work.  The folklore seems to be that the last four classes are
     unjustified by the data since they have so few examples.
     There are 35 categorical attributes, some nominal and some ordered.  The
     value ``dna'' means does not apply.  The values for attributes are
     encoded numerically, with the first value encoded as ``0,'' the second as
     ``1,'' and so forth.  An unknown value is encoded as ``?''.

5. Number of Instances: 683

6. Number of Attributes: 35 (all have been nominalized)

7. Attribute Information:
    -- 19 Classes
     diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot,
     phytophthora-rot, brown-stem-rot, powdery-mildew,
     downy-mildew, brown-spot, bacterial-blight,
     bacterial-pustule, purple-seed-stain, anthracnose,
     phyllosticta-leaf-spot, alternarialeaf-spot,
     frog-eye-leaf-spot, diaporthe-pod-&-stem-blight,
     cyst-nematode, 2-4-d-injury, herbicide-injury.   

    1. date:        april,may,june,july,august,september,october,?.
    2. plant-stand:    normal,lt-normal,?.
    3. precip:        lt-norm,norm,gt-norm,?.
    4. temp:        lt-norm,norm,gt-norm,?.
    5. hail:        yes,no,?.
    6. crop-hist:    diff-lst-year,same-lst-yr,same-lst-two-yrs,
                        same-lst-sev-yrs,?.
    7. area-damaged:    scattered,low-areas,upper-areas,whole-field,?.
    8. severity:    minor,pot-severe,severe,?.
    9. seed-tmt:    none,fungicide,other,?.
   10. germination:    '90-100','80-89','lt-80',?.
   11. plant-growth:    norm,abnorm,?.
   12. leaves:        norm,abnorm.
   13. leafspots-halo:    absent,yellow-halos,no-yellow-halos,?.
   14. leafspots-marg:    w-s-marg,no-w-s-marg,dna,?.
   15. leafspot-size:    lt-1/8,gt-1/8,dna,?.
   16. leaf-shread:    absent,present,?.
   17. leaf-malf:    absent,present,?.
   18. leaf-mild:    absent,upper-surf,lower-surf,?.
   19. stem:        norm,abnorm,?.
   20. lodging:        yes,no,?.
   21. stem-cankers:    absent,below-soil,above-soil,above-sec-nde,?.
   22. canker-lesion:    dna,brown,dk-brown-blk,tan,?.
   23. fruiting-bodies:    absent,present,?.
   24. external decay:    absent,firm-and-dry,watery,?.
   25. mycelium:    absent,present,?.
   26. int-discolor:    none,brown,black,?.
   27. sclerotia:    absent,present,?.
   28. fruit-pods:    norm,diseased,few-present,dna,?.
   29. fruit spots:    absent,colored,brown-w/blk-specks,distort,dna,?.
   30. seed:        norm,abnorm,?.
   31. mold-growth:    absent,present,?.
   32. seed-discolor:    absent,present,?.
   33. seed-size:    norm,lt-norm,?.
   34. shriveling:    absent,present,?.
   35. roots:        norm,rotted,galls-cysts,?.

-------------------------------------------------------------------------------------------------
For this task, we want to find the 5 and 10 best attributes using Information Gain and Gain Ratio, and do some analysis of both results.

1) Open Weka (I am using Weka 3.7) and click on Explorer



2) Load the data from the datasets folder in the Weka directory:
C:\Program Files\Weka-3-7\data\soybean.arff


3) Click on the Select attributes tab



4) Click on Choose [Attribute Evaluator -> InfoGainAttributeEval], [Search Method -> Ranker]


5) Back to our task: we need to find the 5 and 10 best attributes. First I will show how to get the best 5 attributes; then we just repeat this step for the 10 best attributes.

Click on Ranker, type 5 in numToSelect, then click OK.


6) For Attribute Selection Mode, Use full training set is ticked by default, so don't change anything.
Click Start, and the result will be displayed on the right-hand side.


From the result, we get the 5 best attributes based on Information Gain and the Ranker search method.

Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers
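
The same selection can be reproduced through the Weka Java API. The sketch below uses InfoGainAttributeEval with the Ranker search and numToSelect = 5; the file path is the one from step 2, and it assumes the class is the last attribute, which is also the Explorer's default:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SelectTopAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("C:/Program Files/Weka-3-7/data/soybean.arff");
            data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

            InfoGainAttributeEval evaluator = new InfoGainAttributeEval();
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(5);                       // same as numToSelect in the GUI

            AttributeSelection selection = new AttributeSelection();
            selection.setEvaluator(evaluator);
            selection.setSearch(ranker);
            selection.SelectAttributes(data);

            System.out.println(selection.toResultsString());
        }
    }

Swapping InfoGainAttributeEval for GainRatioAttributeEval and changing setNumToSelect covers the other three runs listed below.
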
Repeat the same steps for:

Information Gain - 10 best attributes
Gain Ratio - 5 best attributes
Gain Ratio - 10 best attributes

The results are shown below.

Information Gain - 10 best attributes

Gain Ratio - 5 best attributes

Gain Ratio - 10 best attributes

A newbie's analysis:

This analysis just aims to show that there are differences in the ranked attributes between Information Gain and Gain Ratio.

For the 5 best attributes:

Information Gain
Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers

Gain Ratio
Ranked attributes:
 0.944   27 sclerotia
 0.944   26 int-discolor
 0.833   18 leaf-mild
 0.773   15 leafspot-size
 0.753   35 roots

The best split for Information Gain is canker-lesion, while for Gain Ratio it is sclerotia. There is one attribute in common, which is leafspot-size.

For the 10 best attributes:

Information Gain
Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers
 0.8504   14 leafspots-marg
 0.8437   28 fruit-pods
 0.6918   19 stem
 0.6715    1 date
 0.6265   11 plant-growth

Gain Ratio
Ranked attributes:
 0.944   27 sclerotia
 0.944   26 int-discolor
 0.833   18 leaf-mild
 0.773   15 leafspot-size
 0.753   35 roots
 0.743   14 leafspots-marg
 0.702   13 leafspots-halo
 0.702   12 leaves
 0.698   19 stem
 0.678   11 plant-growth

This is the same comparison as above, just with 5 more attributes. For this one, there are 5 common attributes: leafspot-size, leafspots-halo, leafspots-marg, stem and plant-growth.
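
One likely reason the two rankings disagree: Gain Ratio divides an attribute's Information Gain by its split information, which penalizes attributes that break the data into many subsets. For example, canker-lesion has four values while sclerotia has only two (see the attribute information above). A minimal sketch of that normalization in plain Java, with hypothetical numbers:

    import java.util.Arrays;

    public class GainRatioSketch {

        // Split information: entropy of the partition induced by the attribute,
        // computed over the subset sizes only (the class is ignored).
        static double splitInformation(int[] subsetSizes) {
            int total = Arrays.stream(subsetSizes).sum();
            double si = 0.0;
            for (int size : subsetSizes) {
                if (size == 0) continue;
                double p = (double) size / total;
                si -= p * (Math.log(p) / Math.log(2));
            }
            return si;
        }

        static double gainRatio(double informationGain, int[] subsetSizes) {
            return informationGain / splitInformation(subsetSizes);
        }

        public static void main(String[] args) {
            // Hypothetical: two attributes with the same information gain, one
            // splitting the 683 instances into 2 subsets and one into 6 subsets.
            double gain = 0.9;
            System.out.println("2-way split: " + gainRatio(gain, new int[]{340, 343}));
            System.out.println("6-way split: " + gainRatio(gain, new int[]{120, 110, 115, 113, 112, 113}));
        }
    }

With the same information gain, the attribute with more values ends up with the lower gain ratio, which is why multi-valued attributes tend to drop in the Gain Ratio ranking.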

That's all. Thanks!