Feature Store - Do we really need this?

It has been a long time since I updated this blog. I should probably start posting again so I can refer back to it in the future. Right now I have been given a task related to Feature Store, and I promise you this is one of the processes we need the most! Uber's Michelangelo architecture is depicted as follows:

Credit: https://medium.com/intuitionmachine/google-and-ubers-best-practices-for-deep-learning-58488a8899b6

It is the step after Data Preparation and before Modeling (predictive modeling, not the other kind of modeling ;)). From my understanding, a Feature Store focuses on storing the features that you use in your models. For instance, if you build a model for a recommendation system, the features used in that model will be stored in the Feature Store.

What's the point of storing these features?

- To keep track of which features you use
- To know which models use these features
- To know which projects use these features
- To see which features consistently produce higher accuracy
- To help your teammates share their features too!

Right now, I am building the Feature Store myself using Python, storing the features in a database.
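As a rough illustration of what I mean, here is a minimal sketch of that idea in Python with SQLite. The schema and names (feature_store.db, the features table) are made up for this example, not taken from any particular framework:

import sqlite3

# Connect to a local SQLite database (hypothetical file name).
conn = sqlite3.connect("feature_store.db")

# A minimal metadata table: which feature, which model, which project.
conn.execute("""
    CREATE TABLE IF NOT EXISTS features (
        name    TEXT NOT NULL,
        model   TEXT NOT NULL,
        project TEXT NOT NULL,
        PRIMARY KEY (name, model, project)
    )
""")

# Register the features used by a recommendation model (toy values).
rows = [
    ("user_age", "recsys_v1", "recommendation"),
    ("item_popularity", "recsys_v1", "recommendation"),
]
conn.executemany("INSERT OR IGNORE INTO features VALUES (?, ?, ?)", rows)
conn.commit()

# Answer questions like "which models use this feature?".
for row in conn.execute(
        "SELECT model FROM features WHERE name = ?", ("user_age",)):
    print(row[0])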

You can read here to learn more about how it is used in a big tech company (Uber).

Thank you for reading my post. :) Why not spare some time to watch this music video? Feeling nostalgic, right?


Weka - Attribute Selection Measure: Information Gain (ID3)

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains. - Wikipedia

The data that we will test is:


We perform attribute selection using Information Gain as the Attribute Evaluator. Here are the results:

From the above results, we can see that Outlook is the best split, while Day does not contribute anything to the output, so we can remove the Day attribute before testing with Id3.
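For readers who want to see what the Information Gain evaluator is actually computing, here is a small, self-contained Python sketch of entropy and information gain. The toy rows below are a made-up fragment in the spirit of the weather data, not the exact file used above:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Entropy of the labels minus the weighted entropy after
    splitting the rows on the attribute at attr_index."""
    total = len(rows)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in splits.values())
    return entropy(labels) - remainder

# Toy fragment: each row is (outlook, windy), label is play yes/no.
rows = [("sunny", "false"), ("sunny", "true"), ("overcast", "false"),
        ("rainy", "false"), ("rainy", "true")]
labels = ["no", "no", "yes", "yes", "no"]

print(info_gain(rows, 0, labels))  # gain of outlook
print(info_gain(rows, 1, labels))  # gain of windy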

For those who don't have Id3 in their Weka, you can download it from the package manager; the package is named "simpleEducationalLearningSchemes".

After you have downloaded it, you can go to Classify -> trees -> Id3 to test the data. Before you start, you can click on More Options, then click Choose -> PlainText.


After that, you can start the Id3 process. Here are the results.

Id3 can't visualize the tree, but we can draw the tree based on the results given above.

Here is the tree. From the Information Gain, the best split is the Outlook attribute, so Id3 uses Outlook as the root node, followed by Temperature and Windy.
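If you prefer scripting to the GUI, the same run can be reproduced from Python. This is a sketch assuming the python-weka-wrapper3 package is installed and that the Id3 package ("simpleEducationalLearningSchemes") has been added as above; the ARFF file name is a placeholder:

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

# Start the JVM with package support so the Id3 package is visible.
jvm.start(packages=True)

# Load the weather data (placeholder path).
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("weather.nominal.arff")
data.class_is_last()  # the play attribute is the class

# Build Id3 and print the textual tree, which you can then draw by hand.
id3 = Classifier(classname="weka.classifiers.trees.Id3")
id3.build_classifier(data)
print(id3)

jvm.stop()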

That's all. Thanks!

Weka - Information Gain and Gain Ratio Using Soybean Database

Note: the large soybean database (soybean-large-data.arff) and its corresponding test database (soybean-large-test.arff) are combined into a single file (soybean-large.arff).

1. Title: Large Soybean Database

2. Sources:
      (a) R.S. Michalski and R.L. Chilausky "Learning by Being Told and
          Learning from Examples: An Experimental Comparison of the Two
      Methods of Knowledge Acquisition in the Context of Developing
      an Expert System for Soybean Disease Diagnosis", International
      Journal of Policy Analysis and Information Systems, Vol. 4,
      No. 2, 1980.
      (b) Donor: Ming Tan & Jeff Schlimmer (Jeff.Schlimmer@cs.cmu.edu)
      (c) Date: 11 July 1988

3. Past Usage:
     1. See above.
     2. Tan, M., & Eshelman, L. (1988). Using weighted networks to represent
        classification knowledge in noisy domains.  Proceedings of the Fifth
        International Conference on Machine Learning (pp. 121-134). Ann Arbor,
         Michigan: Morgan Kaufmann.
         -- IWN recorded a 97.1% classification accuracy
            -- 290 training and 340 test instances
      3. Fisher,D.H. & Schlimmer,J.C. (1988). Concept Simplification and
         Predictive Accuracy. Proceedings of the Fifth
         International Conference on Machine Learning (pp. 22-28). Ann Arbor,
         Michigan: Morgan Kaufmann.
         -- Notes why this database is highly predictable

4. Relevant Information Paragraph:
     There are 19 classes, only the first 15 of which have been used in prior
     work.  The folklore seems to be that the last four classes are
     unjustified by the data since they have so few examples.
     There are 35 categorical attributes, some nominal and some ordered.  The
     value ``dna'' means does not apply.  The values for attributes are
     encoded numerically, with the first value encoded as ``0,'' the second as
     ``1,'' and so forth.  An unknown value is encoded as ``?''.

5. Number of Instances: 683

6. Number of Attributes: 35 (all have been nominalized)

7. Attribute Information:
    -- 19 Classes
     diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot,
     phytophthora-rot, brown-stem-rot, powdery-mildew,
     downy-mildew, brown-spot, bacterial-blight,
     bacterial-pustule, purple-seed-stain, anthracnose,
     phyllosticta-leaf-spot, alternarialeaf-spot,
     frog-eye-leaf-spot, diaporthe-pod-&-stem-blight,
     cyst-nematode, 2-4-d-injury, herbicide-injury.   

    1. date:        april,may,june,july,august,september,october,?.
    2. plant-stand:    normal,lt-normal,?.
    3. precip:        lt-norm,norm,gt-norm,?.
    4. temp:        lt-norm,norm,gt-norm,?.
    5. hail:        yes,no,?.
    6. crop-hist:    diff-lst-year,same-lst-yr,same-lst-two-yrs,
                        same-lst-sev-yrs,?.
    7. area-damaged:    scattered,low-areas,upper-areas,whole-field,?.
    8. severity:    minor,pot-severe,severe,?.
    9. seed-tmt:    none,fungicide,other,?.
   10. germination:    '90-100','80-89','lt-80',?.
   11. plant-growth:    norm,abnorm,?.
   12. leaves:        norm,abnorm.
   13. leafspots-halo:    absent,yellow-halos,no-yellow-halos,?.
   14. leafspots-marg:    w-s-marg,no-w-s-marg,dna,?.
   15. leafspot-size:    lt-1/8,gt-1/8,dna,?.
   16. leaf-shread:    absent,present,?.
   17. leaf-malf:    absent,present,?.
   18. leaf-mild:    absent,upper-surf,lower-surf,?.
   19. stem:        norm,abnorm,?.
   20. lodging:        yes,no,?.
   21. stem-cankers:    absent,below-soil,above-soil,above-sec-nde,?.
   22. canker-lesion:    dna,brown,dk-brown-blk,tan,?.
   23. fruiting-bodies:    absent,present,?.
   24. external decay:    absent,firm-and-dry,watery,?.
   25. mycelium:    absent,present,?.
   26. int-discolor:    none,brown,black,?.
   27. sclerotia:    absent,present,?.
   28. fruit-pods:    norm,diseased,few-present,dna,?.
   29. fruit spots:    absent,colored,brown-w/blk-specks,distort,dna,?.
   30. seed:        norm,abnorm,?.
   31. mold-growth:    absent,present,?.
   32. seed-discolor:    absent,present,?.
   33. seed-size:    norm,lt-norm,?.
   34. shriveling:    absent,present,?.
   35. roots:        norm,rotted,galls-cysts,?.

-------------------------------------------------------------------------------------------------
For this task, we want to find the 5 and 10 best attributes using Information Gain and Gain Ratio, and do some analysis of both results.

1) Open Weka (I am using Weka 3.7) and click on Explorer.



2) Load the data from the datasets in the Weka directory:
C:\Program Files\Weka-3-7\data\soybean.arff


3) Click on the Select Attributes tab.



4) Click on Choose [Attribute Evaluator -> InfoGainAttributeEval], [Search Method -> Ranker].


5) Back to our task, where we need to find the 5 and 10 best attributes. First I will show how to get the best 5 attributes; then we just repeat this step for the 10 best attributes.

Click on Ranker and type 5 in numToSelect. Then click OK.


6) For Attribute Selection Mode, Use full training set is ticked by default, so don't change anything.
Click Start. The result will then be displayed on the right-hand side.


From the result, we get the 5 best attributes based on Information Gain and the Ranker search method (a scripted equivalent is sketched after the listing).

Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers
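The same selection can also be scripted. Here is a sketch using the python-weka-wrapper3 package (my assumption; the GUI steps above are what the post actually uses):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection

jvm.start()

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("C:/Program Files/Weka-3-7/data/soybean.arff")
data.class_is_last()

# Ranker with numToSelect = 5, evaluated by Information Gain.
search = ASSearch(classname="weka.attributeSelection.Ranker",
                  options=["-N", "5"])
evaluator = ASEvaluation(
    classname="weka.attributeSelection.InfoGainAttributeEval")

attsel = AttributeSelection()
attsel.search(search)
attsel.evaluator(evaluator)
attsel.select_attributes(data)
print(attsel.results_string)

# Swap in weka.attributeSelection.GainRatioAttributeEval (and -N 10)
# to reproduce the other three runs below.

jvm.stop()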
Repeat the above steps for:

Information Gain - 10 best attributes
Gain Ratio - 5 best attributes
Gain Ratio - 10 best attributes

Here are the results:

Information Gain - 10 best attributes

Gain Ratio - 5 best attributes

Gain Ratio - 10 best attributes

Newbie's analysis:

This analysis just shows that the ranked attributes differ between Information Gain and Gain Ratio.
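As a quick reminder of why the two rankings differ, here are the textbook definitions (standard formulas, not something read off the Weka output). Gain Ratio divides Information Gain by the split information of the attribute, which penalises attributes that split the data into many small subsets:

$$ IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) $$

$$ \mathit{GainRatio}(S, A) = \frac{IG(S, A)}{\mathit{SplitInfo}(S, A)}, \qquad \mathit{SplitInfo}(S, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, \log_2 \frac{|S_v|}{|S|} $$

Here H(S) is the entropy of the class labels in S, and S_v is the subset of S where attribute A takes value v.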

For the 5 best attributes:

Information Gain
Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers

Gain Ratio
Ranked attributes:
 0.944   27 sclerotia
 0.944   26 int-discolor
 0.833   18 leaf-mild
 0.773   15 leafspot-size
 0.753   35 roots

The best split for Information Gain is canker-lesion, while for Gain Ratio it is sclerotia. There is one common attribute, which is leafspot-size.

For the 10 best attributes:

Information Gain
Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers
 0.8504   14 leafspots-marg
 0.8437   28 fruit-pods
 0.6918   19 stem
 0.6715    1 date
 0.6265   11 plant-growth

Gain Ratio
Ranked attributes:
 0.944   27 sclerotia
 0.944   26 int-discolor
 0.833   18 leaf-mild
 0.773   15 leafspot-size
 0.753   35 roots
 0.743   14 leafspots-marg
 0.702   13 leafspots-halo
 0.702   12 leaves
 0.698   19 stem
 0.678   11 plant-growth

This is just the same as above, with 5 more attributes. For this one, there are 5 common attributes: leafspot-size, leafspots-halo, leafspots-marg, stem, and plant-growth.
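The overlap is easy to double-check in Python against the two rankings above:

# Attribute names copied from the two 10-best rankings above.
ig10 = {"canker-lesion", "leafspot-size", "fruit-spots", "leafspots-halo",
        "stem-cankers", "leafspots-marg", "fruit-pods", "stem", "date",
        "plant-growth"}
gr10 = {"sclerotia", "int-discolor", "leaf-mild", "leafspot-size", "roots",
        "leafspots-marg", "leafspots-halo", "leaves", "stem", "plant-growth"}

print(sorted(ig10 & gr10))
# ['leafspot-size', 'leafspots-halo', 'leafspots-marg', 'plant-growth', 'stem']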

That's all. Thanks!


Weka - Attribute Selector Classifier

In Weka, there are three techniques for performing attribute selection:
  • the native approach, using the attribute selection classes directly
  • using a meta-classifier
  • the filter approach
This time, I will be using the meta-classifier approach. Basically, the meta-classifier (AttributeSelectedClassifier) first reduces the attributes, and the reduced attribute set is then used by another method.

For example:

You have a data set whose columns are:
  • name
  • age
  • smoking
  • heart rate
  • no. tel

After applying the Attribute Selected Classifier to the data, it will reduce the attributes to:
  • age
  • smoking
  • heart rate
So these attributes will be used by another method such as Multilayer Perceptron, Naive Bayes, or any other method. That's it.

Practical Session:

Open your Weka and load any data, or you can try downloading data from here.

After that, go to the Classify tab.


Then click the Choose button -> meta -> AttributeSelectedClassifier.


You can change the base method; for example, I chose Linear Regression.


Just click OK, then choose any of the Test Options; I chose Percentage split, with 70% for the training set and 30% for testing.
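For completeness, here is a sketch of the same setup driven from Python with the python-weka-wrapper3 package (my assumption; the post itself only uses the GUI). The evaluator and search options shown are Weka's defaults for this meta-classifier, and the data file is a placeholder:

import weka.core.jvm as jvm
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation

jvm.start()

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("cpu.arff")  # placeholder: any numeric-class data
data.class_is_last()

# AttributeSelectedClassifier: select attributes first, then hand the
# reduced data to the base learner (-W), here Linear Regression.
asc = Classifier(
    classname="weka.classifiers.meta.AttributeSelectedClassifier",
    options=["-E", "weka.attributeSelection.CfsSubsetEval",
             "-S", "weka.attributeSelection.BestFirst",
             "-W", "weka.classifiers.functions.LinearRegression"])

# Percentage split: 70% training, 30% testing.
evaluation = Evaluation(data)
evaluation.evaluate_train_test_split(asc, data, 70.0, Random(1))
print(evaluation.summary())

jvm.stop()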


Thank you.

Source: https://weka.wikispaces.com/Performing+attribute+selection