Feature selection
Machine learning and data mining 

Machine learning venues

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for three reasons:

 simplification of models to make them easier to interpret by researchers/users,^{[1]}
 shorter training times,
 enhanced generalization by reducing overfitting^{[2]}(formally, reduction of variance^{[1]})
The central premise when using a feature selection technique is that the data contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information.^{[2]} Redundant or irrelevant features are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.^{[3]}
Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features, and a few tens to hundreds of samples.
Contents
 1 Introduction
 2 Subset selection
 3 Optimality criteria
 4 Structure Learning
 5 Minimumredundancymaximumrelevance (mRMR) feature selection
 6 Correlation feature selection
 7 Regularized trees
 8 Overview on metaheuristics methods
 9 Feature selection embedded in learning algorithms
 10 See also
 11 References
 12 Further reading
 13 External links
Introduction
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.^{[3]}
Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a holdout set. Counting the number of mistakes made on that holdout set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model.
Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information,^{[3]} the pointwise mutual information,^{[4]} Pearson productmoment correlation coefficient, inter/intra class distance or the scores of significance tests for each class/feature combinations.^{[4]}^{[5]} Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via crossvalidation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems.
Embedded methods are a catchall group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have nonzero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples,^{[6]} and FeaLect which scores all the features based on combinatorial analysis of regression coefficients.^{[7]} One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. These approaches tend to be between filters and wrappers in terms of computational complexity.
In traditional statistics, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by crossvalidation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network.
Subset selection
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into Wrappers, Filters and Embedded. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to Wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in and specific to a model.
Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.
Alternative searchbased techniques are based on targeted projection pursuit which finds lowdimensional projections of the data that score highly: the features that have the largest projections in the lowerdimensional space are then selected.
Search approaches include:
 Exhaustive
 Best first
 Simulated annealing
 Genetic algorithm
 Greedy forward selection
 Greedy backward elimination^{[8]}^{[9]}^{[10]}
 Particle swarm optimization^{[11]}
 Targeted projection pursuit
 Scatter Search^{[12]}
 Variable Neighborhood Search^{[13]}^{[14]}
Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information;^{[15]} see here.
Other available filter metrics include:
 Class separability
 Error probability
 Interclass distance
 Probabilistic distance
 Entropy
 Consistencybased feature selection
 Correlationbased feature selection
Optimality criteria
The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common ones incorporate a measure of accuracy, penalised by the number of features selected (e.g. the Bayesian information criterion). The oldest are Mallows's C_{p} statistic and Akaike information criterion (AIC). These add variables if the tstatistic is bigger than .
Other criteria are Bayesian information criterion (BIC) which uses , minimum description length (MDL) which asymptotically uses , Bonnferroni / RIC which use , maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR) which use something close to .
Structure Learning
Filter feature selection is a specific case of a more general paradigm called Structure Learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a Bayesian Network, and so the structure is a directed graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target node, and in a Bayesian Network, there is a unique Markov Blanket for each node.^{[16]}
Minimumredundancymaximumrelevance (mRMR) feature selection
Peng et al.^{[17]} proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set S for the class c is defined by the average value of all mutual information values between the individual feature f_{i} and the class c as follows:
 .
The redundancy of all features in the set S is the average value of all mutual information values between the feature f_{i} and the feature f_{j}:
The mRMR criterion is a combination of two measures given above and is defined as follows:
Suppose that there are n fullset features. Let x_{i} be the set membership indicator function for feature f_{i}, so that x_{i}=1 indicates presence and x_{i}=0 indicates absence of the feature f_{i} in the globally optimal feature set. Let and . The above may then be written as an optimization problem:
The mRMR algorithm is an approximation of the theoretically optimal maximumdependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance^{[18]} when the features are individually useless, but are useful when combined (a pathological case is found when the class is a parity function of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal maxdependency selection, yet produces a feature set with little pairwise redundancy.
mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.^{[18]}^{[19]}
Global optimization formulations
mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global quadratic programing optimization problem as follows:^{[20]}
where is the vector of feature relevancy assuming there are n features in total, is the matrix of feature pairwise redundancy, and represents relative feature weights. QPFS is solved via quadratic programming. It is recently shown that QFPS is biased towards features with smaller entropy,^{[21]} due to its placement of the feature self redundancy term on the diagonal of H.
Another global formulation for the mutual information based feature selection problem is based on the conditional relevancy:^{[21]}
where and .
An advantage of SPEC_{CMI} is that it can be solved simply via finding the dominant eigenvector of Q, thus is very scalable. SPEC_{CMI} also handles secondorder feature interaction.
For highdimensional and small sample data (e.g., dimensionality > 10^{5} and the number of samples < 10^{3}), the HilbertSchmidt Independence Criterion Lasso (HSIC Lasso) is useful.^{[22]} HSIC Lasso optimization problem is given as
where is a kernelbased independence measure called the (empirical) HilbertSchmidt independence criterion (HSIC), denotes the trace, is the regularization parameter, and are input and output centered Gram matrices, and are Gram matrices, and are kernel functions, is the centering matrix, is the mdimensional identity matrix (m: the number of samples), is the mdimensional vector with all ones, and is the norm. HSIC always takes a nonnegative value, and is zero if and only if two random variables are statistically independent when a universal reproducing kernel such as the Gaussian kernel is used.
The HSIC Lasso can be written as
where is the Frobenius norm. The optimization problem is a Lasso problem, and thus it can be efficiently solved with a stateoftheart Lasso solver such as the dual augmented Lagrangian method.
Correlation feature selection
The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other".^{[23]}^{[24]} The following equation gives the merit of a feature subset S consisting of k features:
Here, is the average value of all featureclassification correlations, and is the average value of all featurefeature correlations. The CFS criterion is defined as follows:
The and variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's ρ. Dr. Mark Hall's dissertation uses neither of these, but uses three different measures of relatedness, minimum description length (MDL), symmetrical uncertainty, and relief.
Let x_{i} be the set membership indicator function for feature f_{i}; then the above can be rewritten as an optimization problem:
The combinatorial problems above are, in fact, mixed 0–1 linear programming problems that can be solved by using branchandbound algorithms.^{[25]}
Regularized trees
The features from a decision tree or a tree ensemble are shown to be redundant. A recent method called regularized tree^{[26]} can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees only need build one tree model (or one tree ensemble model) and thus are computationally efficient.
Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus, require little data preprocessing such as normalization. Regularized random forest (RRF) (RRF) is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.
Overview on metaheuristics methods
A metaheuristic is a general description of an algorithm dedicated to solve difficult (typically NPhard problem) optimization problems for which there is no classical solving methods. Generally, a metaheustochastic stochastics algorithm tending to reach a global optima. There are many metaheuristics, from a simple local search to a complex global search algorithm.
Main principles
The feature selection methods are typically presented in three classes based on how they combine the selection algorithm and the model building
Filter Method
Filter type methods select variables regardless of the model. They are based only on general features like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The others variables will be part of a the model classification, a regression used to classify or a data prediction. These methods are particularly effective in computation time and robust to overfitting.^{[27]}
However, filter methods tend to select redundant variables because they do not consider the relationships between variables. Therefore, they are mainly used as a preprocess method.
Wrapper Method
Wrapper methods evaluate subsets of variables which allows, unlike filter approaches, to detect the possible interactions between variables.^{[28]} The two main disadvantages of these methods are :
 The increasing overfitting risk when the number of observations is insufficient.
 The significant computation time when the number of variables is large.
Embedded Method
Recently, embedded methods have been proposed to reduce the classification of learning. They try to combine the advantages of both previous methods. The learning algorithm takes advantage of its own variable selection algorithm. So, it needs to know preliminary what a good selection is, which limits their exploitation.^{[29]}
Application of feature selection metaheuristics
This is a survey of the application of feature selection metaheuristics lately used in the literature. This survey was realized by J. Hammon in her thesis.^{[27]}
Application  Algorithm  Approach  classifier  Evaluation Function  Ref 

SNPs  Feature Selection Feature Similarity  Filter  r^{2}  Phuong 2005 ^{[28]}  
SNPs  Genetic Algorithm  Wrapper  Decision Tree  Classification accuracy (10fold)  Shah 2004 ^{[30]} 
SNPs  HillClimbing  Filter + Wrapper  Naive Bayesian  Predicted residual sum of squares  Long 2007 ^{[31]} 
SNPs  Simuleating Annealing  Naive bayesian  Classification accuracy (5fold)  Ustunkar 2011 ^{[32]}  
Segments parole  Ants colony  Wrapper  Artificial Neural Network  MSE  Alani 2005 ^{[33]} 
Marketing  Simulated Annealing  Wrapper  Regression  AIC, r^{2}  Meiri 2006 ^{[34]} 
Economy  Simulated Annealing, Genetic Algorithm  Wrapper  Regression  BIC  Kapetanios 2005 ^{[35]} 
Spectral Mass  Genetic Algorithm  Wrapper  Multiple Regression Linear, Partial Least Square  rootmeansquare error of prediction  Broadhurst 2007 ^{[36]} 
Microarray  Tabu Search + PSO  Wrapper  Support Vector Machine, K Nearest Neighbor  Euclidian Distance  Chuang 2009 ^{[37]} 
Microarray  PSO + Genetic Algorithm  Wrapper  Support Vector Machine  Classification accuracy (10fold)  Alba 2007 ^{[38]} 
Microarray  Genetic Algorithm + Iterated Local Search  Embedded  Support Vector Machine  Classification accuracy (10fold)  Duval 2009 ^{[29]} 
Microarray  Iterated Local Search  Wrapper  Regression  Posterior Probability  Hans 2007 ^{[39]} 
Microarray  Genetic Algorithm  Wrapper  K Nearest Neighbor  Classification accuracy (Leaveoneout crossvalidation)  JirapechUmpai 2005 ^{[40]} 
Microarray  Hybrid Genetic Algorithm  Wrapper  K Nearest Neighbor  Classification accuracy (Leaveoneout crossvalidation)  Oh 2004 ^{[41]} 
Microarray  Genetic Algorithm  Wrapper  Support Vector Machine  Sensibility Specificity  Xuan 2011 ^{[42]} 
Microarray  Genetic Algorithm  Wrapper  All paired, Support Vector Machine  Classification accuracy (Leaveoneout crossvalidation)  Peng 2003 ^{[43]} 
Microarray  Genetic Algorithm  Embedded  Support Vector Machine  Classification accuracy (10fold)  Hernandez 2007 ^{[44]} 
Microarray  Genetic Algorithm  Hybrid  Support Vector Machine  Classification accuracy (Leaveoneout crossvalidation)  Huerta 2006 ^{[45]} 
Microarray  Genetic Algorithm  Support Vector Machine  Classification accuracy (10fold)  Muni 2006 ^{[46]}  
Microarray  Genetic Algorithm  Wrapper  Support Vector Machine  EHDIALL, CLUMP  Jourdan 2004 ^{[47]} 
Feature selection embedded in learning algorithms
Some learning algorithms perform feature selection as part of their overall operation. These include:
 regularization techniques, such as sparse regression, LASSO, and SVM
 Regularized trees e.g. regularized random forest implemented in the RRF package
 Decision tree^{[citation needed]}
 Memetic algorithm
 Random multinomial logit (RMNL)
 Autoencoding networks with a bottlenecklayer
 Submodular feature selection, Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection Das el al [6]; Submodular feature selection for highdimensional acoustic score spaces, Liu et al [7]; Submodular Attribute Selection for Action Recognition in Video, Zheng et al[8]
See also
This article includes a list of references, but its sources remain unclear because it has insufficient inline citations. (July 2010)

References
 ↑ ^{1.0} ^{1.1} Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. 204.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ^{2.0} ^{2.1} Bermingham, Mairead L.; PongWong, Ricardo; Spiliopoulou, Athina; Hayward, Caroline; Rudan, Igor; Campbell, Harry; Wright, Alan F.; Wilson, James F.; Agakov, Felix; Navarro, Pau; Haley, Chris S. (2015). "Application of highdimensional feature selection: evaluation for genomic prediction in man". Sci. Rep. 5.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ^{3.0} ^{3.1} ^{3.2} Guyon, Isabelle; Elisseeff, André (2003). "An Introduction to Variable and Feature Selection". JMLR. 3.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ^{4.0} ^{4.1} Yang, Yiming; Pedersen, Jan O. (1997). A comparative study on feature selection in text categorization. ICML.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Forman, George (2003). "An extensive empirical study of feature selection metrics for text classification". Journal of Machine Learning Research. 3: 1289–1305.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Lua error in Module:Citation/CS1/Identifiers at line 47: attempt to index field 'wikibase' (a nil value).
 ↑ Lua error in Module:Citation/CS1/Identifiers at line 47: attempt to index field 'wikibase' (a nil value).
 ↑ Figueroa, Alejandro (2015). "Exploring effective features for recognizing the user intent behind web queries". Computers in Industry. 68: 162–169.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Figueroa, Alejandro; Guenter Neumann (2013). Learning to Rank Effective Paraphrases from Query Logs for Community Question Answering. AAAI.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Figueroa, Alejandro; Guenter Neumann (2014). "Categoryspecific models for ranking effective paraphrases in community Question Answering". Expert Systems with Applications. 41: 4730–4742.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Zhang, Y.; Wang, S.; Phillips, P. (2014). "Binary PSO with Mutation Operator for Feature Selection using Decision Tree applied to Spam Detection". KnowledgeBased Systems. 64: 22–31.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ F.C. GarciaLopez, M. GarciaTorres, B. Melian, J.A. MorenoPerez, J.M. MorenoVega. Solving feature subset selection problem by a Parallel Scatter Search, European Journal of Operational Research, vol. 169, no. 2, pp. 477–489, 2006.
 ↑ F.C. GarciaLopez, M. GarciaTorres, B. Melian, J.A. MorenoPerez, J.M. MorenoVega. Solving Feature Subset Selection Problem by a Hybrid Metaheuristic. In First International Workshop on Hybrid Metaheuristics, pp. 59–68, 2004.
 ↑ M. GarciaTorres, F. GomezVela, B. Melian, J.M. MorenoVega. Highdimensional feature selection via feature grouping: A Variable Neighborhood Search approach, Information Sciences, vol. 326, pp. 102118, 2016.
 ↑ Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger, "Hierarchical Clustering Based on Mutual Information", (2003) ArXiv qbio/0311039
 ↑ Aliferis, Constantin (2010). "Local causal and markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation" (PDF). Journal of Machine Learning Research. 11: 171–234.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ Lua error in Module:Citation/CS1/Identifiers at line 47: attempt to index field 'wikibase' (a nil value). Program
 ↑ ^{18.0} ^{18.1} Brown, G., Pocock, A., Zhao, M.J., Lujan, M. (2012). "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection", In the Journal of Machine Learning Research (JMLR). [1]
 ↑ Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic FeatureSelection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. [2]
 ↑ RodriguezLujan, I.; Huerta, R.; Elkan, C.; Santa Cruz, C. (2010). "Quadratic programming feature selection" (PDF). JMLR. 11: 1491–1516.<templatestyles src="Module:Citation/CS1/styles.css"></templatestyles>
 ↑ ^{21.0} ^{21.1} Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceeedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014. "[3]"
 ↑ M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, M. Sugiyama, HighDimensional Feature Selection by FeatureWise NonLinear Lasso. Neural Computation, vol.26, no.1, pp.185207, 2014.
 ↑ M. Hall 1999, Correlationbased Feature Selection for Machine Learning
 ↑ Senliol, Baris, et al. "Fast Correlation Based Filter (FCBF) with a different search strategy." Computer and Information Sciences, 2008. ISCIS'08. 23rd International Symposium on. IEEE, 2008. [4]
 ↑ Hai Nguyen, Katrin Franke, and Slobodan Petrovic, Optimizing a class of feature selection measures, Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Vancouver, Canada, December 2009. [5]
 ↑ H. Deng, G. Runger, "Feature Selection via Regularized Trees", Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012
 ↑ ^{27.0} ^{27.1} J. Hammon. Optimisation combinatoire pour la sélection de variables en régression en grande dimension : Application en génétique animale. November 2013
 ↑ ^{28.0} ^{28.1} T. M. Phuong, Z. Lin et R. B. Altman. Choosing SNPs using feature selection. Proceedings / IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference, pages 301309, 2005. PMID 16447987.
 ↑ ^{29.0} ^{29.1} B. Duval, J.K. Hao et J. C. Hernandez Hernandez. A memetic algorithm for gene selection and molecular classification of an cancer. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO '09, pages 201208, New York, NY, USA, 2009. ACM.
 ↑ S. C. Shah et A. Kusiak. Data mining and genetic algorithm based gene/SNP selection. Artificial intelligence in medicine, vol. 31, no. 3, pages 183196, July 2004. PMID 15302085.
 ↑ N. Long, D. Gianola, G. J.M Rosa et K. A Weigel. Dimension redu ction and variable selection for genomic selection : application to predicting milk yield in Holsteins. Journal of Animal Breeding and Genetics, vol. 128, no. 4, pages 247257, August 2011.
 ↑ G. Ustunkar, S. OzogurAkyuz, G. W. Weber, C. M. Friedrich et Yesim Aydin Son. Selection of representative SNP sets for genomewide association studies : a metaheuristic approach. Optimization Letters, November 2011.
 ↑ A. Alani. Ant Colony Optimization for Feature Subset Selection. In Proceedings of World Academy of Science, Engineering and Technology, pages 3538, 2005.
 ↑ R. Meiri et J. Zahavi. Using simulated annealing to optimize the feature selection problem in marketing applications. European Journal of Operational Research, vol. 171, no. 3, pages 842858, Juin 2006
 ↑ G. Kapetanios. Variable Selection using NonStandard Optimisation of Information Criteria. Working Paper 533, Queen Mary, University of London, School of Economics and Finance, 2005.
 ↑ D. Broadhurst, R. Goodacre, A. Jones, J. J. Rowland et D. B. Kell. Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry. Analytica Chimica Acta, vol. 348, no. 13, pages 7186, August 1997.
 ↑ L.Y. Chuang, C.H. Yang et C.H. Yang. Tabu search and binary particle swarm optimization for feature selection using microarray data. Journal of computational biology : a journal of computational molecular cell biology, vol. 16, no. 12, pages 16891703, Décembre 2009. PMID 20047491
 ↑ E. Alba, J. GariaNieto, L. Jourdan et E.G. Talbi. Gene Selection in Cancer Classification using PSOSVM and GASVM Hybrid Algorithms. Congress on Evolutionary Computation, Singapor : Singapore (2007), 2007
 ↑ C. Hans, A. Dobra et M. West. Shotgun stochastic search for 'large p' regression. Journal of the American Statistical Association, 2007.
 ↑ T. JirapechUmpai et S. Aitken. Feature selection and classification for microarray data analysis : Evolutionary methods for identifying predictive genes. BMC bioinformatics, vol. 6, no. 1, page 148, 2005.
 ↑ I. S. Oh, J. S. Lee et B. R. Moon. Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pages 14241437, November 2004.
 ↑ P. Xuan, M. Z. Guo, J.Wang, C. Y.Wang, X. Y. Liu et Y. Liu. Genetic algorithmbased efficient feature selection for classification of premiRNAs. Genetics and Molecular Research : GMR, vol. 10, no. 2, pages 588603, 2011. PMID 21491369.
 ↑ S. Peng. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Letters, vol. 555, no. 2, pages 358362, December 2003.
 ↑ J. C. H. Hernandez, B. Duval et J.K. Hao. A genetic embedded approach for gene selection and classification of microarray data. In Proceedings of the 5th European conference on Evolutionary computation, machine learning and data mining in bioinformatics, EvoBIO'07, pages 90101, Berlin, Heidelberg, 2007. SpringerVerlag.
 ↑ E. B. Huerta, B. Duval et J.K. Hao. A hybrid GA/SVM approach for gene selection and classification of microarray data. evoworkshops 2006, LNCS, vol. 3907, pages 3444, 2006.
 ↑ D. P. Muni, N. R. Pal et J. Das. Genetic programming for simultaneous feature selection and classifier design. IEEE Transactions on Systems, Man, and Cybernetics, Part B : Cybernetics, vol. 36, no. 1, pages 106117, February 2006.
 ↑ L. Jourdan, C. Dhaenens et E.G. Talbi. Linkage disequilibrium study with a parallel adaptive GA. International Journal of Foundations of Computer Science, 2004.
Further reading
 Feature Selection for Classification: A Review (Survey,2014)
 Feature Selection for Clustering: A Review (Survey,2013)
 Tutorial Outlining Feature Selection Algorithms, Arizona State University
 JMLR Special Issue on Variable and Feature Selection
 Feature Selection for Knowledge Discovery and Data Mining (Book)
 An Introduction to Variable and Feature Selection (Survey)
 Toward integrating feature selection algorithms for classification and clustering (Survey)
 Efficient Feature Subset Selection and Subset Size Optimization (Survey, 2010)
 Searching for Interacting Features
 Feature Subset Selection Bias for Classification Learning
 Y. Sun, S. Todorovic, S. Goodison, Local Learning Based Feature Selection for Highdimensional Data Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1610–1626, 2010.
External links
 A comprehensive package for Mutual Information based feature selection in Matlab
 Feature Selection Package, Arizona State University (Matlab Code)
 NIPS challenge 2003 (see also NIPS)
 Naive Bayes implementation with feature selection in Visual Basic (includes executable and source code)
 Minimumredundancymaximumrelevance (mRMR) feature selection program
 FEAST (Open source Feature Selection algorithms in C and MATLAB)