Notes: The large soybean database (soybean-large-data.arff) and its corresponding test database (soybean-large-test.arff) were combined into a single file (soybean-large.arff).
1. Title: Large Soybean Database
2. Sources:
(a) R.S. Michalski and R.L. Chilausky "Learning by Being Told and
Learning from Examples: An Experimental Comparison of the Two
Methods of Knowledge Acquisition in the Context of Developing
an Expert System for Soybean Disease Diagnosis", International
Journal of Policy Analysis and Information Systems, Vol. 4,
No. 2, 1980.
(b) Donor: Ming Tan & Jeff Schlimmer (Jeff.Schlimmercs.cmu.edu)
(c) Date: 11 July 1988
3. Past Usage:
1. See above.
2. Tan, M., & Eshelman, L. (1988). Using weighted networks to represent
classification knowledge in noisy domains. Proceedings of the Fifth
International Conference on Machine Learning (pp. 121-134). Ann Arbor,
Michigan: Morgan Kaufmann.
-- IWN recorded a 97.1% classification accuracy
-- 290 training and 340 test instances
3. Fisher,D.H. & Schlimmer,J.C. (1988). Concept Simplification and
Predictive Accuracy. Proceedings of the Fifth
International Conference on Machine Learning (pp. 22-28). Ann Arbor,
Michigan: Morgan Kaufmann.
-- Notes why this database is highly predictable
4. Relevant Information Paragraph:
There are 19 classes, only the first 15 of which have been used in prior
work. The folklore seems to be that the last four classes are
unjustified by the data since they have so few examples.
There are 35 categorical attributes, some nominal and some ordered. The
value ``dna'' means does not apply. The values for attributes are
encoded numerically, with the first value encoded as ``0,'' the second as
``1,'' and so forth. An unknown value is encoded as ``?''.
5. Number of Instances: 683
6. Number of Attributes: 35 (all have been nominalized)
7. Attribute Information:
-- 19 Classes
diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot,
phytophthora-rot, brown-stem-rot, powdery-mildew,
downy-mildew, brown-spot, bacterial-blight,
bacterial-pustule, purple-seed-stain, anthracnose,
phyllosticta-leaf-spot, alternarialeaf-spot,
frog-eye-leaf-spot, diaporthe-pod-&-stem-blight,
cyst-nematode, 2-4-d-injury, herbicide-injury.
1. date: april,may,june,july,august,september,october,?.
2. plant-stand: normal,lt-normal,?.
3. precip: lt-norm,norm,gt-norm,?.
4. temp: lt-norm,norm,gt-norm,?.
5. hail: yes,no,?.
6. crop-hist: diff-lst-year,same-lst-yr,same-lst-two-yrs,
same-lst-sev-yrs,?.
7. area-damaged: scattered,low-areas,upper-areas,whole-field,?.
8. severity: minor,pot-severe,severe,?.
9. seed-tmt: none,fungicide,other,?.
10. germination: '90-100','80-89','lt-80',?.
11. plant-growth: norm,abnorm,?.
12. leaves: norm,abnorm.
13. leafspots-halo: absent,yellow-halos,no-yellow-halos,?.
14. leafspots-marg: w-s-marg,no-w-s-marg,dna,?.
15. leafspot-size: lt-1/8,gt-1/8,dna,?.
16. leaf-shread: absent,present,?.
17. leaf-malf: absent,present,?.
18. leaf-mild: absent,upper-surf,lower-surf,?.
19. stem: norm,abnorm,?.
20. lodging: yes,no,?.
21. stem-cankers: absent,below-soil,above-soil,above-sec-nde,?.
22. canker-lesion: dna,brown,dk-brown-blk,tan,?.
23. fruiting-bodies: absent,present,?.
24. external-decay: absent,firm-and-dry,watery,?.
25. mycelium: absent,present,?.
26. int-discolor: none,brown,black,?.
27. sclerotia: absent,present,?.
28. fruit-pods: norm,diseased,few-present,dna,?.
29. fruit-spots: absent,colored,brown-w/blk-specks,distort,dna,?.
30. seed: norm,abnorm,?.
31. mold-growth: absent,present,?.
32. seed-discolor: absent,present,?.
33. seed-size: norm,lt-norm,?.
34. shriveling: absent,present,?.
35. roots: norm,rotted,galls-cysts,?.
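The numeric encoding described in the Relevant Information paragraph can be sketched in a few lines of Python. This is only an illustration: the `decode` helper, the `ATTRIBUTE_VALUES` table and the sample `row` are hypothetical names and made-up data, and only three of the 35 attributes are copied from the list above.

```python
# Value lists copied from the attribute list above (three attributes only).
ATTRIBUTE_VALUES = {
    "precip": ["lt-norm", "norm", "gt-norm"],
    "hail": ["yes", "no"],
    "leafspot-size": ["lt-1/8", "gt-1/8", "dna"],
}


def decode(attribute, code):
    """Map a numeric code such as "1" back to its label; "?" means unknown."""
    if code == "?":
        return None
    return ATTRIBUTE_VALUES[attribute][int(code)]


# A made-up row, not taken from the real data file.
row = {"precip": "2", "hail": "?", "leafspot-size": "2"}
decoded = {name: decode(name, code) for name, code in row.items()}
print(decoded)
```

The first value of each attribute gets code 0, so "2" for precip is gt-norm, and "?" is kept apart as a missing value instead of being treated as a category.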
-------------------------------------------------------------------------------------------------
For this task, we want to find the 5 and 10 best attributes using Information Gain and Gain Ratio, and then compare the two results.
1) Open Weka (I am using Weka 3.7) and click on Explorer.
2) Load the data from the datasets in the Weka directory:
C:\Program Files\Weka-3-7\data\soybean.arff
3) Click on the Select attributes tab.
4) Click on Choose and set [Attribute Evaluator -> InfoGainAttributeEval] and [Search Method -> Ranker].
5) We need the 5 and 10 best attributes. I will first show how to get the best 5; for the best 10, just repeat this step with 10 instead of 5.
Click on Ranker, type 5 in numToSelect, then click OK.
6) For Attribute Selection Mode, "Use full training set" is selected by default, so don't change anything.
Click Start. The result is displayed on the right-hand side.
From the result, we get the 5 best attributes based on Information Gain with the Ranker search method.
Ranked attributes:
1.1517 22 canker-lesion
1.0129 15 leafspot-size
0.9852 29 fruit-spots
0.8684 13 leafspots-halo
0.8535 21 stem-cankers
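To make the scores above less of a black box, here is a minimal Python sketch of information gain, computed as H(class) - H(class | attribute) in bits. It uses tiny made-up data and ignores missing values, so it is not expected to reproduce Weka's exact numbers; the function and variable names are my own.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the class entropy inside each attribute value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: four instances, two classes.
classes = ["a", "a", "b", "b"]
gain_perfect = information_gain(["x", "x", "y", "y"], classes)  # separates the classes
gain_useless = information_gain(["x", "y", "x", "y"], classes)  # tells us nothing
print(gain_perfect, gain_useless)
```

A score above 1 bit, like 1.1517 for canker-lesion, is possible here because with 19 classes the class entropy can be as high as log2(19), which is about 4.25 bits.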
Repeat the above steps for the remaining three cases (for Gain Ratio, choose GainRatioAttributeEval as the attribute evaluator in step 4):
Information Gain - 10 best attributes
Gain Ratio - 5 best attributes
Gain Ratio - 10 best attributes
The results for all of them are listed in the analysis below.
Newbie's analysis:
This analysis just shows that the attribute rankings differ between Information Gain and Gain Ratio.
For the 5 best attributes:
Information Gain           | Gain Ratio
Ranked attributes:         | Ranked attributes:
1.1517 22 canker-lesion    | 0.944 27 sclerotia
1.0129 15 leafspot-size    | 0.944 26 int-discolor
0.9852 29 fruit-spots      | 0.833 18 leaf-mild
0.8684 13 leafspots-halo   | 0.773 15 leafspot-size
0.8535 21 stem-cankers     | 0.753 35 roots
The best attribute for Information Gain is canker-lesion, while for Gain Ratio it is sclerotia (tied with int-discolor at 0.944). There is 1 common attribute, which is leafspot-size.
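One plausible reason the two rankings differ is that gain ratio divides the information gain by the entropy of the attribute itself, which penalizes attributes with many values. The sketch below illustrates this on made-up data. It follows the usual definition, gain ratio = (H(class) - H(class | attribute)) / H(attribute), ignores missing values, and uses names of my own, so treat it as an illustration and not as Weka's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute's own values."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split anything
    return information_gain(values, labels) / split_info


# Toy data: both attributes separate the two classes perfectly,
# but the second one does it with four distinct values instead of two.
classes = ["a", "a", "b", "b"]
binary_attr = ["x", "x", "y", "y"]
many_valued_attr = ["1", "2", "3", "4"]

ratio_binary = gain_ratio(binary_attr, classes)
ratio_many = gain_ratio(many_valued_attr, classes)
print(ratio_binary, ratio_many)
```

Both toy attributes have the same information gain, but the four-valued one gets half the gain ratio. In the same way, a two-valued attribute such as sclerotia (absent, present) can rank first on gain ratio even though a four-valued one such as canker-lesion ranks first on information gain.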
For the 10 best attributes:
Information Gain           | Gain Ratio
Ranked attributes:         | Ranked attributes:
1.1517 22 canker-lesion    | 0.944 27 sclerotia
1.0129 15 leafspot-size    | 0.944 26 int-discolor
0.9852 29 fruit-spots      | 0.833 18 leaf-mild
0.8684 13 leafspots-halo   | 0.773 15 leafspot-size
0.8535 21 stem-cankers     | 0.753 35 roots
0.8504 14 leafspots-marg   | 0.743 14 leafspots-marg
0.8437 28 fruit-pods       | 0.702 13 leafspots-halo
0.6918 19 stem             | 0.702 12 leaves
0.6715  1 date             | 0.698 19 stem
0.6265 11 plant-growth     | 0.678 11 plant-growth
This is the same as above, just with 5 more attributes. This time there are 5 common attributes: leafspot-size, leafspots-halo, leafspots-marg, stem and plant-growth.
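The common attributes can also be checked mechanically instead of by eye. A small Python sketch, with the two ranked lists typed in from the Weka output above (the variable and function names are my own):

```python
# Attribute names in rank order, copied from the ranked output above.
info_gain_top10 = [
    "canker-lesion", "leafspot-size", "fruit-spots", "leafspots-halo",
    "stem-cankers", "leafspots-marg", "fruit-pods", "stem", "date",
    "plant-growth",
]
gain_ratio_top10 = [
    "sclerotia", "int-discolor", "leaf-mild", "leafspot-size", "roots",
    "leafspots-marg", "leafspots-halo", "leaves", "stem", "plant-growth",
]


def common_attributes(first, second):
    """Attributes present in both lists, kept in the order of the first list."""
    second_set = set(second)
    return [name for name in first if name in second_set]


common_top5 = common_attributes(info_gain_top10[:5], gain_ratio_top10[:5])
common_top10 = common_attributes(info_gain_top10, gain_ratio_top10)
print(common_top5)
print(common_top10)
```

This gives one shared attribute for the top 5 and five shared attributes for the top 10, matching the counts in the analysis.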
That's all. Thanks!