In machine learning applications, feature engineering plays a central role. Data and features determine the upper bound of what a machine learning algorithm can achieve; the choice and tuning of models and algorithms can only approach that bound. Feature engineering (Feature Engineering) sits between the "data" and the "model": it is the process of using domain knowledge and the available data to extract, from the raw data, the information that is useful for prediction and feed it into a machine learning algorithm. The American computer scientist Peter Norvig has two well-known sayings:

  • "A simple model based on a lot of data beats a complex model based on little data."
  • "More data beats clever algorithms, but better data beats more data."

  Andrew Ng has gone further and said that "applied machine learning is basically feature engineering." In industry, refining complex models is largely the job of senior data scientists; most practitioners spend their time running data pipelines, writing MapReduce jobs and Hive SQL, moving data around the warehouse, and doing business analysis, data cleaning, and feature engineering (finding features). In practice, feature engineering aims to remove noise and redundancy from the raw data and to design more effective features that capture the relationship between the problem being solved and the predictive model.

  The data that feature engineering deals with generally falls into two categories:

  1. Structured data.
    • Structured data can be viewed as a table in a relational database: every column has a clear definition and is of one of two basic types, numerical or categorical.
    • Each row represents one sample.
  2. Unstructured data.
    • Unstructured data mainly includes text, image, audio, and video data. The information it carries cannot be represented by a single numerical value, has no clearly defined schema, and each record may differ in size.

  Feature engineering is generally divided into two parts: feature extraction (Feature Extraction) and feature selection (Feature Selection).

  A rather tricky problem in feature engineering is: how do we keep the inputs of the training, validation, and test sets consistent?

  To avoid leaking test data into training, it is best to fit the feature-extraction steps (including normalization and similar operations) on the training data alone, and only then apply them to the validation and test sets. The problem is that if the same feature extraction is fit separately on the validation or test set, the resulting feature dimensions may not match those of the training set. How should this be handled? The usual answer is to fit every transformer once on the training data and reuse the fitted transformer on the other sets, as sketched below.
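
A minimal sketch of this pattern with scikit-learn (assuming a reasonably recent version, ≥ 1.0) is shown below. The column names `city` and `age` and the data are made up for illustration; the key point is that the encoder and scaler are fit on the training frame only, and `handle_unknown="ignore"` keeps the one-hot output width fixed even when the test set contains levels never seen during training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: "city" is categorical, "age" is numerical.
train = pd.DataFrame({"city": ["bj", "sh", "bj"], "age": [23, 31, 45]})
test = pd.DataFrame({"city": ["sh", "gz"], "age": [37, 29]})  # "gz" never seen in training

# Fit all transformers on the training set only.
enc = OneHotEncoder(handle_unknown="ignore").fit(train[["city"]])
scaler = StandardScaler().fit(train[["age"]])

def featurize(df):
    """Reuse the fitted transformers, so every split gets the same columns."""
    cat = pd.DataFrame(enc.transform(df[["city"]]).toarray(),
                       columns=enc.get_feature_names_out(), index=df.index)
    num = pd.DataFrame(scaler.transform(df[["age"]]), columns=["age"], index=df.index)
    return pd.concat([cat, num], axis=1)

X_train, X_test = featurize(train), featurize(test)
print(X_train.shape, X_test.shape)  # same number of columns in both splits
```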

  1. One can perform an analysis of residuals or log-odds (for a linear model) to check for strong nonlinearities.

  2. Create a feature that captures the frequency of occurrence of each level of a categorical variable. For high-cardinality variables this helps a lot. One might use the ratio/percentage of a particular level relative to all levels present (see the encoding sketch after this list).

  3. For every possible value of the variable, estimate the mean of the target variable; use the result as an engineered feature.

  4. Encode a categorical variable with the rate (ratio) of the target variable within each level.

  5. Take the two most important variables, add second-order interactions between them and the rest of the variables, and compare the resulting model to the original linear one.

  6. If you feel your solution should be smooth, you can apply a radial basis function (RBF) kernel; this is like applying a smoothing transform.

  7. If you feel you need covariates, you can apply a polynomial kernel or add the covariates explicitly.

  8. High-cardinality features: convert them to numeric during preprocessing, for example with out-of-fold averages of the target, possibly computed over two-variable combinations (see the encoding sketch after this list).

  9. Additive transformations.

  10. Differences relative to a baseline.

  11. Multiplicative transformations: interaction effects.

  12. Divisive transformations: scaling/normalization.

  13. Thresholding numerical features to obtain boolean values.

  14. Cartesian product transformation.

  15. Feature crosses: cross products of features. Consider a feature A with two possible values {A1, A2}, and a feature B with possible values {B1, B2}. A feature cross between A and B (call it AB) takes one of the following values: {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}. You can give these combinations any names you like; just remember that every combination denotes a synergy between the information contained in the corresponding values of A and B (see the feature-cross sketch after this list).

  16. Normalization transformation: one of the implicit assumptions often made by machine learning algorithms (and fairly explicitly by Gaussian Naive Bayes) is that the features follow a normal distribution. Sometimes, however, a feature follows a log-normal distribution instead. A common remedy is to take the log of such feature values so that the result is approximately normal; if the algorithm assumes normally distributed inputs, this transformation can improve its performance (see the transformation sketch after this list).

  17. Quantile binning transformation.

  18. Whitening the data.

  19. Windowing: if the points are distributed along a time axis, the previous points in the same window are often very informative (see the rolling-window sketch after this list).

  20. Min-max normalization: rescale each feature to a fixed range such as [0, 1] (note that it does not preserve zero entries of sparse data).

  21. Sigmoid / tanh / log transformations.

  22. Handle zeros distinctly; potentially important for count-based features.

  23. Decorrelate / transform variables.

  24. Reframe numerical quantities.

  25. Map infrequent categorical variables to a new/separate category.

  26. Sequentially apply a list of transforms (a pipeline).

  27. Categorical variable encodings (see the encoding sketch after this list):
    • One-hot encoding.
    • Target rate encoding.
    • Hashing trick.

  28. Multivariate methods:
    • PCA.
    • Model stacking.
    • Compressed sensing.

  29. "Guess the average", or "guess the average segmented by variable X".

  30. Projection onto a new basis:
    • Hack projection: perform clustering and use the distance from each point to its cluster center as a feature (see the clustering/PCA sketch after this list).
    • PCA/SVD: a useful technique for analyzing the interrelationships between variables and for dimensionality reduction with minimal loss of information (find the axis through the data with the highest variance, repeat with the next orthogonal axis, and so on until you run out of data or dimensions; each axis then acts as a new feature).

  31. Sparse coding: choose a basis and evaluate it by how well you can reconstruct the input and how sparse the resulting code is, then take a gradient step to improve that evaluation (efficient sparse coding algorithms, deep autoencoders).

  32. Random forest: train a bunch of decision trees and use each leaf as a feature (see the leaf-feature sketch after this list).
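
The sketch below makes the frequency-encoding and out-of-fold target (mean) encoding items above concrete. The column name `city`, the binary target `y`, and the fold count are made-up illustration choices; the out-of-fold loop is what prevents a row's own target value from leaking into its encoded feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical data: one high-cardinality categorical column and a binary target.
df = pd.DataFrame({
    "city": ["bj", "sh", "bj", "gz", "sh", "bj", "gz", "sh"],
    "y":    [1,    0,    1,    0,    1,    0,    0,    1],
})

# Frequency encoding: the share of rows taken by each level.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Out-of-fold target encoding: for every row, the mean of y per level is
# computed from the other folds only, then mapped back onto this fold.
global_mean = df["y"].mean()
df["city_target"] = np.nan
for fit_idx, enc_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[fit_idx].groupby("city")["y"].mean()
    df.loc[df.index[enc_idx], "city_target"] = (
        df.iloc[enc_idx]["city"].map(fold_means).fillna(global_mean).values
    )

print(df)
```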
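
For the feature-cross and second-order-interaction items above, a minimal sketch follows (column names `A`, `B`, `x1`, `x2` are made up; it assumes scikit-learn ≥ 1.0 for `get_feature_names_out`). Crossing two categorical columns is just concatenating their values into a new level; numerical interactions can be generated with `PolynomialFeatures`.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "A": ["A1", "A2", "A1"],   # hypothetical categorical feature
    "B": ["B1", "B1", "B2"],   # hypothetical categorical feature
    "x1": [1.0, 2.0, 3.0],     # hypothetical numerical features
    "x2": [0.5, 1.5, 2.5],
})

# Categorical cross: every (A, B) combination becomes one new level,
# which can then be one-hot encoded or hashed like any other category.
df["A_x_B"] = df["A"] + "_" + df["B"]

# Numerical interactions: keep x1, x2 and add the product x1*x2.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(df[["x1", "x2"]])

print(df["A_x_B"].tolist())
print(poly.get_feature_names_out(["x1", "x2"]))
print(interactions)
```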
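
The next sketch illustrates the normalization (log) transformation, quantile binning, and thresholding items above on a skewed, roughly log-normal column; the column name `income` and the synthetic distribution are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1000)})

# Log transform: a log-normally distributed feature becomes roughly normal
# (log1p is used so that zero values do not cause problems).
df["income_log"] = np.log1p(df["income"])

# Quantile binning: four equal-frequency buckets labelled 0..3.
df["income_bucket"] = pd.qcut(df["income"], q=4, labels=False)

# Thresholding: a boolean flag for "above the median".
df["income_high"] = (df["income"] > df["income"].median()).astype(int)

print(df.head())
```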
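
For the windowing item above, a common pattern on time-ordered data is lag and rolling-window aggregates; here is a sketch with a hypothetical daily `sales` series. Shifting before rolling keeps the current day's value out of its own window, which would otherwise leak the target.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series, already sorted by time.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "sales": np.random.default_rng(0).poisson(lam=20, size=30),
})

# Lag features: the values from one and seven days earlier.
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)

# Rolling-window features over the previous 7 days (shift first, then roll).
df["sales_roll_mean_7"] = df["sales"].shift(1).rolling(window=7).mean()
df["sales_roll_std_7"] = df["sales"].shift(1).rolling(window=7).std()

print(df.tail())
```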
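
The "hack projection" and PCA/SVD items above can be sketched as follows: fit k-means on the training data and use the distances to every cluster center as new columns, then keep the first few principal components as another new basis. The cluster count and component count are arbitrary choices for illustration, and everything is fit on the training split only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))   # hypothetical training matrix
X_test = rng.normal(size=(50, 10))     # hypothetical test matrix

# Standardize using statistics from the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Cluster-distance features: one column per cluster center.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train_s)
train_dist, test_dist = km.transform(X_train_s), km.transform(X_test_s)

# PCA features: project onto the three axes of highest variance.
pca = PCA(n_components=3).fit(X_train_s)
train_pca, test_pca = pca.transform(X_train_s), pca.transform(X_test_s)

X_train_new = np.hstack([X_train_s, train_dist, train_pca])
X_test_new = np.hstack([X_test_s, test_dist, test_pca])
print(X_train_new.shape, X_test_new.shape)
```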
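
Finally, the random-forest item above (use each leaf as a feature) can be sketched like this: train a forest, take the index of the leaf each sample lands in for every tree via `apply()`, and one-hot encode those leaf indices as new sparse features. The dataset here is synthetic, and the tree count and depth are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# apply() returns, for every sample, the index of the leaf it falls into
# in each tree, so the result has shape (n_samples, n_trees).
forest = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0).fit(X, y)
leaves = forest.apply(X)

# One-hot encode the leaf indices: each (tree, leaf) pair becomes a feature.
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
leaf_features = enc.transform(leaves)

print(leaves.shape, leaf_features.shape)
```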
