These are lecture notes for Feature Engineering, the fourth course in the Coursera specialization "Machine Learning with TensorFlow on Google Cloud Platform".
Raw Data to Feature
- Feature Engineering
- Scale to large datasets
- Find good features
- Preprocess with Cloud MLE
- What raw data do we need to collect to predict the price of a house?
- Lot Size
- Number of Rooms
- Location
- …
- Raw data must be mapped into numerical feature vectors
Good vs Bad Features
- What makes a good feature?
  - Be related to the objective
  - Be known at prediction-time
  - Be numeric with meaningful magnitude
  - Have enough examples
  - Bring human insight to the problem
- Different problems in the same domain may need different features
- Some data could be known immediately, while other data is not known in real time
  - You cannot train with current data and then predict with stale data
- Avoid values of which you don't have enough examples
Representing Features
- Raw data are converted to numeric features in different ways
- Numeric values can be used as-is (real value)
- Overly specific attributes should be discarded
- Categorical variables should be one-hot encoded
- Preprocess data to create a vocabulary of keys
  - The vocabulary and its mapping need to be identical at prediction time
- Options for encoding categorical data (sketched in the code after this list):
  - If you know the keys beforehand:
    tf.feature_column.categorical_column_with_vocabulary_list('employeeId', vocabulary_list=[...])
  - If your data is already indexed, i.e., has integers in [0, N):
    tf.feature_column.categorical_column_with_identity('employeeId', num_buckets=5)
  - If you don't have a vocabulary of all possible values:
    tf.feature_column.categorical_column_with_hash_bucket('employeeId', hash_bucket_size=500)
- Don't mix magic numbers with data (e.g., don't overload a value like -1 to mean "missing")
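A minimal sketch of the three encoding options using the tf.feature_column API; the 'employeeId' feature and its values are illustrative only:

```python
import tensorflow as tf

# 1. Keys known beforehand: supply the vocabulary explicitly.
#    (The values here are made up for illustration.)
col_vocab = tf.feature_column.categorical_column_with_vocabulary_list(
    'employeeId', vocabulary_list=['8345', '72345', '87654'])

# 2. Data already indexed as integers in [0, N): identity mapping.
col_identity = tf.feature_column.categorical_column_with_identity(
    'employeeId', num_buckets=5)

# 3. No vocabulary available: hash values into a fixed number of buckets.
col_hashed = tf.feature_column.categorical_column_with_hash_bucket(
    'employeeId', hash_bucket_size=500)

# Categorical columns are one-hot encoded for a linear model via
# indicator_column (or densified via embedding_column).
feature_cols = [tf.feature_column.indicator_column(col_vocab)]
```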
ML vs Statistics
- ML = lots of data, keep outliers, and build models for them
- Statistics = "I've got all the data I'll ever get", throw away outliers
- Exact floats are not meaningful
  - Discretize floating point values into bins (see the sketch after this list)
- Crazy outliers will hurt trainability
- Ideally, features should have a similar range (typically [0, 1] or [-1, 1])
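A minimal sketch of binning a numeric feature with tf.feature_column.bucketized_column; the 'latitude' feature and bin boundaries are illustrative (in practice the boundaries would come from the data, e.g. quantiles):

```python
import tensorflow as tf

# Exact floats carry little meaning, so discretize latitude into bins;
# the model then learns one weight per bin instead of one global slope.
latitude = tf.feature_column.numeric_column('latitude')
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=[32.0, 34.0, 36.0, 38.0, 40.0, 42.0])
```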
Preprocessing Feature Creation
- Feature engineering often requires global statistics and vocabularies
- Things that are commonly done in preprocessing (in TensorFlow; illustrated in the sketch after this list):
  - Scaling, discretization, etc. of numeric features
  - Splitting, lower-casing, etc. of textual features
  - Resizing of input images
  - Normalizing volume level of input audio
- There are two places for feature creation in TensorFlow: in the input function, or with feature columns
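A minimal sketch of the kinds of TensorFlow ops involved, using current TensorFlow 2.x op names; all tensors and values are illustrative:

```python
import tensorflow as tf

# Scaling a numeric tensor to [0, 1] given a known min and max.
price = tf.constant([120000.0, 450000.0, 300000.0])
price_scaled = (price - 120000.0) / (450000.0 - 120000.0)

# Lower-casing and splitting a textual feature.
words = tf.strings.split(tf.strings.lower(tf.constant(['Feature Engineering'])))

# Resizing an input image to a fixed size.
image = tf.zeros([480, 640, 3])
image_resized = tf.image.resize(image, [224, 224])

# Normalizing the volume level of input audio to peak at 1.0.
audio = tf.random.uniform([16000], minval=-0.3, maxval=0.3)
audio_normalized = audio / (tf.reduce_max(tf.abs(audio)) + 1e-8)
```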
Feature Cross
- Using non-linear inputs in a linear learner
- Dividing the input space with two lines yields four quadrants
  - The weight of a cell is essentially the prediction for that cell
- Feature crosses memorize
  - The goal of ML is generalization
  - Memorization works when you have lots of data
- Feature crosses bring a lot of power to linear models
  - Feature crosses + massive data is an efficient way to learn highly complex spaces
  - Feature crosses allow a linear model to memorize large datasets
  - Optimizing linear models is a convex problem
  - Feature crosses, as a preprocessor, make neural networks converge a lot quicker
- Feature crosses combine discrete/categorical features
- Feature crosses lead to sparsity
Implementing Feature Crosses
- Creating feature crosses using TensorFlow (see the sketch after this list)
- Choosing the number of hash buckets is an art, not a science
- The number of hash buckets controls sparsity and collisions
  - Small hash_bucket_size → lots of collisions
  - Large hash_bucket_size → very sparse
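A minimal sketch using tf.feature_column.crossed_column, assuming hypothetical day-of-week and hour-of-day features (a taxi-fare-style example):

```python
import tensorflow as tf

# Two categorical inputs, already indexed as integers.
dayofweek = tf.feature_column.categorical_column_with_identity(
    'dayofweek', num_buckets=7)
hourofday = tf.feature_column.categorical_column_with_identity(
    'hourofday', num_buckets=24)

# Cross them: 7 * 24 = 168 distinct combinations. hash_bucket_size
# trades off sparsity (too large) against collisions (too small).
day_hr = tf.feature_column.crossed_column(
    [dayofweek, hourofday], hash_bucket_size=168)
```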
Embedding Feature Crosses
- Creating an embedding column from a feature cross (see the sketch after this list)
- The weights in the embedding column are learned from data
- The model learns how to embed the feature cross in a lower-dimensional space
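A minimal sketch continuing the day-hour cross above; the embedding dimension of 2 is illustrative:

```python
import tensorflow as tf

# Cross two (hypothetical) string/integer features by name, then embed
# the sparse cross into a dense 2-dimensional space whose weights are
# learned jointly with the rest of the model.
day_hr = tf.feature_column.crossed_column(
    ['dayofweek', 'hourofday'], hash_bucket_size=168)
day_hr_embedded = tf.feature_column.embedding_column(day_hr, dimension=2)
```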
Where to Do Feature Engineering
- Three possible places to do feature engineering:
  - TensorFlow (feature columns, input_fn)
  - Dataflow
  - Dataflow + TensorFlow (tf.transform)
- Some preprocessing can be done in tf.feature_column
- Powerful preprocessing can be done in TensorFlow by creating a new feature column
Feature Creation in TensorFlow
- Create new features from existing features in TensorFlow
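A minimal sketch of creating a new feature inside the input pipeline; the pickup/dropoff feature names are hypothetical (a taxi-style example), and the same function must also be applied at serving time:

```python
import tensorflow as tf

def add_engineered(features):
    # New feature: Euclidean distance between pickup and dropoff,
    # computed from existing latitude/longitude features.
    latdiff = features['pickuplat'] - features['dropofflat']
    londiff = features['pickuplon'] - features['dropofflon']
    features['euclidean'] = tf.sqrt(latdiff * latdiff + londiff * londiff)
    return features

# Call add_engineered(features) from both the training input_fn and the
# serving input function so training and prediction see the same features.
```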
TensorFlow Transform
- Pros and cons of the three ways to do feature engineering
- tf.transform is a hybrid of Beam and TensorFlow
  - Analyze (Beam)
    - Find the min/max value of a numeric feature
    - Find all the unique values of a categorical feature
  - Transform (TensorFlow)
    - Scale inputs by the min & max
    - One-hot encode inputs based on the set of unique values
- tf.transform provides two PTransforms
  - AnalyzeAndTransformDataset: executed in Beam to create the training dataset
  - TransformDataset: executed in Beam to create the evaluation dataset
  - The underlying transformations are executed in TensorFlow at prediction time
- tf.transform has two phases (a preprocessing-function sketch follows this list)
  - Analysis phase (compute min/max/vocabularies etc. using Beam): executed in Beam while creating the training dataset
  - Transform phase (scale inputs, apply vocabularies etc. using TensorFlow): executed in Beam to create the training/evaluation datasets, and in TensorFlow during prediction
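A minimal sketch of a tf.transform preprocessing function; the feature names are illustrative, and depending on the library version the vocabulary analyzer may be named tft.string_to_int rather than tft.compute_and_apply_vocabulary:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Restricted to TensorFlow/tf.transform ops; analyzers run in Beam."""
    outputs = {}
    # Analysis phase finds the min/max in Beam; transform phase scales in TF.
    outputs['fare_scaled'] = tft.scale_to_0_1(inputs['fare'])
    # Analysis phase computes the vocabulary; transform phase maps each
    # string to its integer index.
    outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(
        inputs['payment_type'])
    return outputs
```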
Analysis phase
- First, set up the schema of the training dataset
- Next, run the analyze-and-transform PTransform on the training dataset to get back the preprocessed training data and the transform function
- Write out the preprocessed training data into TFRecords, the most efficient format for TensorFlow (sketched below)
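A minimal sketch of the analysis phase as a Beam pipeline, reusing the preprocessing_fn above; the inline records, feature names, and paths are illustrative:

```python
import apache_beam as beam
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# First, set up the schema of the training dataset.
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'fare': tf.io.FixedLenFeature([], tf.float32),
        'payment_type': tf.io.FixedLenFeature([], tf.string),
    }))

with beam.Pipeline() as pipeline, tft_beam.Context(temp_dir='/tmp/tft'):
    raw_data = pipeline | beam.Create([
        {'fare': 12.5, 'payment_type': 'cash'},
        {'fare': 40.0, 'payment_type': 'card'},
    ])

    # Run analyze-and-transform on the training dataset to get back the
    # preprocessed training data and the transform function.
    (transformed_data, transformed_metadata), transform_fn = (
        (raw_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

    # Write the preprocessed training data out as TFRecords.
    _ = transformed_data | beam.io.WriteToTFRecord(
        '/tmp/tft/train',
        coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))
```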
Transform phase
- The preprocessing function is restricted to TensorFlow functions you can call from the TensorFlow graph
- Writing out the eval dataset is similar, except that we reuse the transform function computed from the training data (sketched below)
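A minimal sketch of the transform phase for the eval dataset, continuing the pipeline above; raw_eval_data is an assumed PCollection with the same schema:

```python
import tensorflow_transform.beam as tft_beam

# No new analysis here: apply the transform_fn computed on the training
# data to the evaluation data, so both see identical transformations.
(transformed_eval_data, transformed_eval_metadata) = (
    ((raw_eval_data, raw_metadata), transform_fn)
    | tft_beam.TransformDataset())
```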