Feature Store - Do we really need this?

It has been a long time since I updated this blog. I should probably start posting again so I can refer back to it in the future. Right now I have been given a task related to Feature Store, and I promise you this is one of the processes we need the most! Uber's Michelangelo architecture is depicted as follows:

Credit: https://medium.com/intuitionmachine/google-and-ubers-best-practices-for-deep-learning-58488a8899b6

It is the step after Data Preparation and before Modeling (predictive modeling, not the other kind of modeling ;)). From my understanding, a Feature Store focuses on storing the features that you use in your models. For instance, if you build a model for a recommendation system, the features used in that model will be stored in the Feature Store.

What's the point of storing these features?

- To keep track of which features you use
- To know which models use these features
- To know which projects use these features
- To see which features consistently produce higher accuracy
- To help your teammates share their features too!

Right now, I am building the Feature Store myself using Python, storing the features in a database.
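As a rough illustration of what I mean, here is a minimal sketch of that idea in Python with SQLite. The schema and names (feature_store.db, the features table) are made up for this example, not taken from any particular framework:

import sqlite3

# Connect to a local SQLite database (hypothetical file name).
conn = sqlite3.connect("feature_store.db")

# A minimal metadata table: which feature, which model, which project.
conn.execute("""
    CREATE TABLE IF NOT EXISTS features (
        name    TEXT NOT NULL,
        model   TEXT NOT NULL,
        project TEXT NOT NULL,
        PRIMARY KEY (name, model, project)
    )
""")

# Register the features used by a recommendation model (toy values).
rows = [
    ("user_age", "recsys_v1", "recommendation"),
    ("item_popularity", "recsys_v1", "recommendation"),
]
conn.executemany("INSERT OR IGNORE INTO features VALUES (?, ?, ?)", rows)
conn.commit()

# Answer questions like "which models use this feature?".
for row in conn.execute(
        "SELECT model FROM features WHERE name = ?", ("user_age",)):
    print(row[0])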

You can read here to learn more about how it is used in a big tech company (Uber).

Thank you for reading my post. :) Why not spare some time to watch this music video? Feeling nostalgic, right?


Weka - Attribute Selection Measure: Information Gain (ID3)

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains. - Wikipedia

The data that we will test is:


We perform attribute selection using Information Gain as the Attribute Evaluator. Here are the results:

From the above results, we can see that Outlook is the best split, while Day does not contribute anything to the output, so we can remove the Day attribute before testing with Id3.
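For readers who want to see what the Information Gain evaluator is actually computing, here is a small, self-contained Python sketch of entropy and information gain. The toy rows below are a made-up fragment in the spirit of the weather data, not the exact file used above:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Entropy of the labels minus the weighted entropy after
    splitting the rows on the attribute at attr_index."""
    total = len(rows)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in splits.values())
    return entropy(labels) - remainder

# Toy fragment: each row is (outlook, windy), label is play yes/no.
rows = [("sunny", "false"), ("sunny", "true"), ("overcast", "false"),
        ("rainy", "false"), ("rainy", "true")]
labels = ["no", "no", "yes", "yes", "no"]

print(info_gain(rows, 0, labels))  # gain of outlook
print(info_gain(rows, 1, labels))  # gain of windy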

For those who don't have Id3 in their Weka, you can download it from the package manager; the package is named "simpleEducationalLearningSchemes".

After you have downloaded it, you can go to Classify -> trees -> Id3 to test the data. Before you start, you can click on More Options, then click Choose -> PlainText.


After that, you can start the Id3 process. Here are the results.

Id3 can't visualize the tree, but we can draw the tree based on the results given above.

Here is the tree. From the Information Gain, the best split is the Outlook attribute, so Id3 uses Outlook as the root node, followed by Temperature and Windy.
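If you prefer scripting to the GUI, the same run can be reproduced from Python. This is a sketch assuming the python-weka-wrapper3 package is installed and that the Id3 package ("simpleEducationalLearningSchemes") has been added as above; the ARFF file name is a placeholder:

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

# Start the JVM with package support so the Id3 package is visible.
jvm.start(packages=True)

# Load the weather data (placeholder path).
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("weather.nominal.arff")
data.class_is_last()  # the play attribute is the class

# Build Id3 and print the textual tree, which you can then draw by hand.
id3 = Classifier(classname="weka.classifiers.trees.Id3")
id3.build_classifier(data)
print(id3)

jvm.stop()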

That's all. Thanks!

Weka - Information Gain and Gain Ratio Using Soybean Database

Note: the large soybean database (soybean-large-data.arff) and its corresponding test database (soybean-large-test.arff) are combined into a single file (soybean-large.arff).

1. Title: Large Soybean Database

2. Sources:
      (a) R.S. Michalski and R.L. Chilausky "Learning by Being Told and
          Learning from Examples: An Experimental Comparison of the Two
      Methods of Knowledge Acquisition in the Context of Developing
      an Expert System for Soybean Disease Diagnosis", International
      Journal of Policy Analysis and Information Systems, Vol. 4,
      No. 2, 1980.
      (b) Donor: Ming Tan & Jeff Schlimmer (Jeff.Schlimmer@cs.cmu.edu)
      (c) Date: 11 July 1988

3. Past Usage:
     1. See above.
     2. Tan, M., & Eshelman, L. (1988). Using weighted networks to represent
        classification knowledge in noisy domains.  Proceedings of the Fifth
        International Conference on Machine Learning (pp. 121-134). Ann Arbor,
         Michigan: Morgan Kaufmann.
         -- IWN recorded a 97.1% classification accuracy
            -- 290 training and 340 test instances
      3. Fisher,D.H. & Schlimmer,J.C. (1988). Concept Simplification and
         Predictive Accuracy. Proceedings of the Fifth
         International Conference on Machine Learning (pp. 22-28). Ann Arbor,
         Michigan: Morgan Kaufmann.
         -- Notes why this database is highly predictable

4. Relevant Information Paragraph:
     There are 19 classes, only the first 15 of which have been used in prior
     work.  The folklore seems to be that the last four classes are
     unjustified by the data since they have so few examples.
     There are 35 categorical attributes, some nominal and some ordered.  The
     value ``dna'' means does not apply.  The values for attributes are
     encoded numerically, with the first value encoded as ``0,'' the second as
     ``1,'' and so forth.  An unknown value is encoded as ``?''.

5. Number of Instances: 683

6. Number of Attributes: 35 (all have been nominalized)

7. Attribute Information:
    -- 19 Classes
     diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot,
     phytophthora-rot, brown-stem-rot, powdery-mildew,
     downy-mildew, brown-spot, bacterial-blight,
     bacterial-pustule, purple-seed-stain, anthracnose,
     phyllosticta-leaf-spot, alternarialeaf-spot,
     frog-eye-leaf-spot, diaporthe-pod-&-stem-blight,
     cyst-nematode, 2-4-d-injury, herbicide-injury.   

    1. date:        april,may,june,july,august,september,october,?.
    2. plant-stand:    normal,lt-normal,?.
    3. precip:        lt-norm,norm,gt-norm,?.
    4. temp:        lt-norm,norm,gt-norm,?.
    5. hail:        yes,no,?.
    6. crop-hist:    diff-lst-year,same-lst-yr,same-lst-two-yrs,
                        same-lst-sev-yrs,?.
    7. area-damaged:    scattered,low-areas,upper-areas,whole-field,?.
    8. severity:    minor,pot-severe,severe,?.
    9. seed-tmt:    none,fungicide,other,?.
   10. germination:    '90-100','80-89','lt-80',?.
   11. plant-growth:    norm,abnorm,?.
   12. leaves:        norm,abnorm.
   13. leafspots-halo:    absent,yellow-halos,no-yellow-halos,?.
   14. leafspots-marg:    w-s-marg,no-w-s-marg,dna,?.
   15. leafspot-size:    lt-1/8,gt-1/8,dna,?.
   16. leaf-shread:    absent,present,?.
   17. leaf-malf:    absent,present,?.
   18. leaf-mild:    absent,upper-surf,lower-surf,?.
   19. stem:        norm,abnorm,?.
   20. lodging:        yes,no,?.
   21. stem-cankers:    absent,below-soil,above-soil,above-sec-nde,?.
   22. canker-lesion:    dna,brown,dk-brown-blk,tan,?.
   23. fruiting-bodies:    absent,present,?.
   24. external decay:    absent,firm-and-dry,watery,?.
   25. mycelium:    absent,present,?.
   26. int-discolor:    none,brown,black,?.
   27. sclerotia:    absent,present,?.
   28. fruit-pods:    norm,diseased,few-present,dna,?.
   29. fruit spots:    absent,colored,brown-w/blk-specks,distort,dna,?.
   30. seed:        norm,abnorm,?.
   31. mold-growth:    absent,present,?.
   32. seed-discolor:    absent,present,?.
   33. seed-size:    norm,lt-norm,?.
   34. shriveling:    absent,present,?.
   35. roots:        norm,rotted,galls-cysts,?.

-------------------------------------------------------------------------------------------------
For this task, we want to find the 5 and 10 best attributes using Information Gain and Gain Ratio, and do some analysis of both results.

1) Open Weka (I am using Weka 3.7) and click on Explorer.



2) Load the data from the datasets in the Weka directory:
C:\Program Files\Weka-3-7\data\soybean.arff


3) Click on the Select Attributes tab.



4) Click on Choose [Attribute Evaluator -> InfoGainAttributeEval], [Search Method -> Ranker].


5) Back to our task, where we need to find the 5 and 10 best attributes. First I will show how to get the best 5 attributes; then we just repeat this step for the 10 best attributes.

Click on Ranker and type 5 in numToSelect. Then click OK.


6) For Attribute Selection Mode, Use full training set is ticked by default, so don't change anything.
Click Start. The result will then be displayed on the right-hand side.


From the result, we get the 5 best attributes based on Information Gain and the Ranker search method (a scripted equivalent is sketched after the listing).

Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers
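The same selection can also be scripted. Here is a sketch using the python-weka-wrapper3 package (my assumption; the GUI steps above are what the post actually uses):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection

jvm.start()

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("C:/Program Files/Weka-3-7/data/soybean.arff")
data.class_is_last()

# Ranker with numToSelect = 5, evaluated by Information Gain.
search = ASSearch(classname="weka.attributeSelection.Ranker",
                  options=["-N", "5"])
evaluator = ASEvaluation(
    classname="weka.attributeSelection.InfoGainAttributeEval")

attsel = AttributeSelection()
attsel.search(search)
attsel.evaluator(evaluator)
attsel.select_attributes(data)
print(attsel.results_string)

# Swap in weka.attributeSelection.GainRatioAttributeEval (and -N 10)
# to reproduce the other three runs below.

jvm.stop()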
Repeat the above steps for:

Information Gain - 10 best attributes
Gain Ratio - 5 best attributes
Gain Ratio - 10 best attributes

Here are the results:

Information Gain - 10 best attributes

Gain Ratio - 5 best attributes

Gain Ratio - 10 best attributes

Newbie's analysis:

This analysis just shows that the ranked attributes differ between Information Gain and Gain Ratio.
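As a quick reminder of why the two rankings differ, here are the textbook definitions (standard formulas, not something read off the Weka output). Gain Ratio divides Information Gain by the split information of the attribute, which penalises attributes that split the data into many small subsets:

$$ IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) $$

$$ \mathit{GainRatio}(S, A) = \frac{IG(S, A)}{\mathit{SplitInfo}(S, A)}, \qquad \mathit{SplitInfo}(S, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, \log_2 \frac{|S_v|}{|S|} $$

Here H(S) is the entropy of the class labels in S, and S_v is the subset of S where attribute A takes value v.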

For the 5 best attributes:

Information Gain
Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers

Gain Ratio
Ranked attributes:
 0.944   27 sclerotia
 0.944   26 int-discolor
 0.833   18 leaf-mild
 0.773   15 leafspot-size
 0.753   35 roots

The best split for Information Gain is canker-lesion, while for Gain Ratio it is sclerotia. There is one common attribute, which is leafspot-size.

For the 10 best attributes:

Information Gain
Ranked attributes:
 1.1517   22 canker-lesion
 1.0129   15 leafspot-size
 0.9852   29 fruit-spots
 0.8684   13 leafspots-halo
 0.8535   21 stem-cankers
 0.8504   14 leafspots-marg
 0.8437   28 fruit-pods
 0.6918   19 stem
 0.6715    1 date
 0.6265   11 plant-growth

Gain Ratio
Ranked attributes:
 0.944   27 sclerotia
 0.944   26 int-discolor
 0.833   18 leaf-mild
 0.773   15 leafspot-size
 0.753   35 roots
 0.743   14 leafspots-marg
 0.702   13 leafspots-halo
 0.702   12 leaves
 0.698   19 stem
 0.678   11 plant-growth

This is just the same as above, with 5 more attributes. For this one, there are 5 common attributes: leafspot-size, leafspots-halo, leafspots-marg, stem, and plant-growth.
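The overlap is easy to double-check in Python against the two rankings above:

# Attribute names copied from the two 10-best rankings above.
ig10 = {"canker-lesion", "leafspot-size", "fruit-spots", "leafspots-halo",
        "stem-cankers", "leafspots-marg", "fruit-pods", "stem", "date",
        "plant-growth"}
gr10 = {"sclerotia", "int-discolor", "leaf-mild", "leafspot-size", "roots",
        "leafspots-marg", "leafspots-halo", "leaves", "stem", "plant-growth"}

print(sorted(ig10 & gr10))
# ['leafspot-size', 'leafspots-halo', 'leafspots-marg', 'plant-growth', 'stem']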

That's all. Thanks!


Weka - Attribute Selector Classifier

In Weka, there are three techniques for performing attribute selection:
  • the native approach, using the attribute selection classes directly
  • using a meta-classifier
  • the filter approach
This time, I will be using the meta-classifier approach. Basically, the meta-classifier (AttributeSelectedClassifier) first reduces the attributes, and the reduced attribute set is then used by another method.

For example:

You have a data set whose columns are:
  • name
  • age
  • smoking
  • heart rate
  • no. tel

After applying the Attribute Selected Classifier to the data, it will reduce the attributes to:
  • age
  • smoking
  • heart rate
So these attributes will be used by another method such as Multilayer Perceptron, Naive Bayes, or any other method. That's it.

Practical Session:

Open your Weka and load any data, or you can try downloading data from here.

After that, go to the Classify tab.


Then click the Choose button -> meta -> AttributeSelectedClassifier.


You can change the base method; for example, I chose Linear Regression.


Just click OK, then choose any of the Test Options; I chose Percentage split, with 70% for the training set and 30% for testing.
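For completeness, here is a sketch of the same setup driven from Python with the python-weka-wrapper3 package (my assumption; the post itself only uses the GUI). The evaluator and search options shown are Weka's defaults for this meta-classifier, and the data file is a placeholder:

import weka.core.jvm as jvm
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation

jvm.start()

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("cpu.arff")  # placeholder: any numeric-class data
data.class_is_last()

# AttributeSelectedClassifier: select attributes first, then hand the
# reduced data to the base learner (-W), here Linear Regression.
asc = Classifier(
    classname="weka.classifiers.meta.AttributeSelectedClassifier",
    options=["-E", "weka.attributeSelection.CfsSubsetEval",
             "-S", "weka.attributeSelection.BestFirst",
             "-W", "weka.classifiers.functions.LinearRegression"])

# Percentage split: 70% training, 30% testing.
evaluation = Evaluation(data)
evaluation.evaluate_train_test_split(asc, data, 70.0, Random(1))
print(evaluation.summary())

jvm.stop()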


Thank you.

Source: https://weka.wikispaces.com/Performing+attribute+selection