- Standardization
- Scaling features to a range
- Scaling sparse data
- Scaling data with outliers
- Normalization
The sklearn.preprocessing package provides several common utility functions and transformer classes.
Standardization
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
# scale: a quick way to do standardization
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
x_scale = preprocessing.scale(x)
print(x_scale)
# the scaled data has zero mean and unit variance
print("Scale mean is \n", x_scale.mean(axis=0))
print("Scale std is \n", x_scale.std(axis=0))
The output is
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
Scale mean is
[ 0. 0. 0.]
Scale std is
[ 1. 1. 1.]
sklearn also provides the class sklearn.preprocessing.StandardScaler, which computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to test data.
sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
with_mean: If True, center the data before scaling. with_std: If True, scale the data to unit variance. copy: If False, try to avoid a copy and scale in place instead.
Attributes:
- scale_: per-feature relative scaling of the data (the standard deviation when with_std=True).
- mean_: the mean value of each feature in the training set.
- var_: the variance of each feature in the training set.
- n_samples_seen_: the number of samples processed by the estimator.
StandardScaler Examples
# sklearn.preprocessing.StandardScaler
scaler = preprocessing.StandardScaler().fit(x)
print("Mean is \n", scaler.mean_)
# scale_ holds the per-feature standard deviation used for scaling
print("Scale is \n", scaler.scale_)
print(scaler.transform(x))
The output is
Mean is
[ 1. 0. 0.33333333]
Scale is
[ 0.81649658 0.81649658 1.24721913]
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
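Because the fitted scaler stores mean_ and scale_ computed on the training data, the same transformation can later be applied to new data with transform. Continuing the example above with a made-up test sample:
# reuse the statistics learned on x to transform unseen data
x_test = np.array([[-1., 1., 0.]])
print(scaler.transform(x_test))
# each feature is shifted by the training mean and divided by the training
# standard deviation, giving approximately [[-2.449  1.225 -0.267]]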
Scaling features to a range
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.
MinMaxScaler transformation formula:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
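For instance, for the first feature of x above (values 1, 2, 0), X.min(axis=0) is 0 and X.max(axis=0) is 2, so with the default feature_range=(0, 1) the values become (1 - 0)/2 = 0.5, (2 - 0)/2 = 1.0 and (0 - 0)/2 = 0.0, which matches the first column of the output below.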
sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
Attributes:
- min_: per-feature adjustment for the minimum.
- scale_: per-feature relative scaling of the data.
- data_min_: per-feature minimum seen in the data.
- data_max_: per-feature maximum seen in the data.
- data_range_: per-feature range (data_max_ - data_min_) seen in the data.
min_max_scaler = preprocessing.MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x)
print(x_minmax)
# scale_ and min_ are attributes of the scaler, not of the transformed array
print("min_max_scaler scale is \n", min_max_scaler.scale_)
print("min_max_scaler min is \n", min_max_scaler.min_)
The output is
[[ 0.5 0. 1. ]
[ 1. 0.5 0.33333333]
[ 0. 1. 0. ]]
min_max_scaler scale is
[ 0.5 0.5 0.33333333]
min_max_scaler min is
[ 0. 0.5 0.33333333]
MaxAbsScaler works in a very similar fashion, but scales the training data so that it lies within the range [-1, 1], by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data.
sklearn.preprocessing.MaxAbsScaler(copy=True)
MaxAbsScaler Example
max_abs_scaler = preprocessing.MaxAbsScaler()
x_maxabs = max_abs_scaler.fit_transform(x)
print(x_maxabs)
# scale_ holds the per-feature maximum absolute value
print(max_abs_scaler.scale_)
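For the same x as above, the transformed data and the per-feature maximum absolute values should come out approximately as
[[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]
[ 2.  1.  2.]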
Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.
MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this. However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor.
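As a minimal sketch of this, assuming a small CSR matrix built with scipy.sparse (the data here is made up): MaxAbsScaler keeps zero entries at zero, so sparsity is preserved, while StandardScaler accepts sparse input only when centering is disabled.
from scipy import sparse
from sklearn import preprocessing

# a small sparse matrix with features on very different scales
x_sparse = sparse.csr_matrix([[ 1., 0., 200.],
                              [ 0., 3.,   0.],
                              [ 2., 0., 100.]])

# recommended: divide each feature by its maximum absolute value
x_maxabs = preprocessing.MaxAbsScaler().fit_transform(x_sparse)
print(x_maxabs.toarray())

# StandardScaler also works on sparse input, but only with with_mean=False
scaler = preprocessing.StandardScaler(with_mean=False).fit(x_sparse)
print(scaler.transform(x_sparse).toarray())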
Scaling data with outliers
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
sklearn.preprocessing.robust_scale(X, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
class sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
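A minimal sketch with RobustScaler, using a made-up array whose last sample contains an obvious outlier; the scaler centers each feature on its median and scales by the interquartile range, so the outlier has little influence on how the other samples are transformed.
import numpy as np
from sklearn import preprocessing

x_outliers = np.array([[  1., 2.],
                       [  2., 3.],
                       [  3., 4.],
                       [100., 5.]])  # 100. is an outlier in the first feature
robust_scaler = preprocessing.RobustScaler()
print(robust_scaler.fit_transform(x_outliers))
# the function form gives the same result
print(preprocessing.robust_scale(x_outliers))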
Normalization
Normalization is the process of scaling individual samples (rows) to have unit norm, which is useful when the similarity of pairs of samples is measured with a dot product or another kernel.
# normalization: scale each sample (row) to unit L2 norm
x_normalized = preprocessing.normalize(x, norm='l2')
print(x_normalized)
# The preprocessing module further provides a utility class Normalizer
# that implements the same operation using the Transformer API
normalizer = preprocessing.Normalizer().fit(x)
print(normalizer)
print(normalizer.transform(x))
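For the same x, both the normalize function and the fitted Normalizer should return rows with unit L2 norm, approximately
[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]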