# [Coursera] Feature Engineering

## Machine Learning with TensorFlow on GCP

Posted by cyc1am3n on October 24, 2018

These are lecture notes for Feature Engineering, the fourth course in the Coursera specialization "Machine Learning with TensorFlow on Google Cloud Platform".

## Raw Data to Features

• Feature Engineering
• Scale to large datasets
• Find good features
• Preprocess with Cloud MLE
• What raw data do we need to collect to predict the price of a house?
• Lot Size
• Number of Rooms
• Location
• Raw data must be mapped into numerical feature vectors

## Good vs Bad Features

• What makes a good feature?
1. Be related to the objective
2. Be known at prediction-time
3. Be numeric with meaningful magnitude
4. Have enough examples
5. Bring human insight to problem
• Different problems in the same domain may need different features
• Some data is known immediately, while other data is not available in real time
• You cannot train with current data and predict with stale data
• Avoid having values of which you don’t have enough examples

## Representing Features

• Raw data are converted to numeric features in different ways

• Numeric values can be used as-is (real value)

• Overly specific attributes should be discarded

• Categorical variables should be one-hot encoded

• Preprocess data to create a vocabulary of keys

• The vocabulary and the mapping of the vocabulary needs to be identical at prediction time
• Options for encoding categorical data

• If you know the keys beforehand:
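For example, a minimal sketch with a hypothetical `employeeId` feature (the key values are made up):

```python
import tensorflow as tf

# Keys known beforehand: list them explicitly as the vocabulary.
# 'employeeId' and the key values below are hypothetical.
employee = tf.feature_column.categorical_column_with_vocabulary_list(
    key='employeeId',
    vocabulary_list=['8345', '72345', '87654', '98723', '23451'])
```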
• If your data is already indexed, i.e., has integers in [0, N):

```python
tf.feature_column.categorical_column_with_identity('employeeId', num_buckets=5)
```

• If you don’t have a vocabulary of all possible values:

```python
tf.feature_column.categorical_column_with_hash_bucket('employeeId', hash_bucket_size=500)
```

• Don't mix magic numbers with data

## ML vs Statistics

• ML = lots of data, keep outliers and build models for them
• Statistics = “I’ve got all the data I’ll ever get”, throw away outliers
• Exact floats are not meaningful
• Discretize floating point values into bins
• Crazy outliers will hurt trainability
• Ideally, features should have a similar range (Typically [0, 1] or [-1, 1])
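To make these ideas concrete, here is a plain-Python sketch of min-max scaling and binning (not the TensorFlow API; the boundary values are arbitrary):

```python
def scale_minmax(values):
    """Linearly rescale a batch of values into [0, 1] using its min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, boundaries):
    """Map a float to the index of the bin it falls into.

    boundaries must be sorted; values >= the last boundary
    land in the final bin."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)
```

Binning turns an exact float into a categorical bin index, so the model no longer distinguishes meaningless differences between nearby values.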

## Preprocessing Feature Creation

• Feature engineering often requires global statistics and vocabularies
• Things that are commonly done in preprocessing (In TensorFlow)

• Scaling, discretization, etc. of numeric features
• Splitting, lower-casing, etc. of textual features
• Resizing of input images
• Normalizing volume level of input audio
• There are two places for feature creation in TensorFlow

## Feature Cross

• Using non-linear inputs in a linear learner
• Dividing the input space with two lines yields four quadrants
• The weight of a cell is essentially the prediction for that cell
• Feature crosses memorize
• Goal of ML is generalization
• Memorization works when you have lots of data
• Feature crosses bring a lot of power to linear models
• Feature crosses + massive data is an efficient way for learning highly complex spaces
• Feature crosses allow a linear model to memorize large datasets
• Optimizing linear models is a convex problem
• Feature crosses, as a preprocessor, make neural networks converge a lot quicker
• Feature crosses combine discrete / categorical features
• Feature Crosses lead to sparsity

## Implementing Feature Crosses

• Creating feature crosses using TensorFlow
• Choosing the number of hash buckets is an art, not a science

• The number of hash buckets controls sparsity and collisions

• Small hash_bucket_size → lots of collisions
• Large hash_bucket_size → very sparse
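A sketch of a cross in TensorFlow, assuming hypothetical `dayofweek` and `hourofday` features; here `hash_bucket_size` is set to the number of possible combinations (7 × 24), trading a few collisions against sparsity:

```python
import tensorflow as tf

# Two categorical inputs (hypothetical feature names).
day = tf.feature_column.categorical_column_with_vocabulary_list(
    'dayofweek', ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
hour = tf.feature_column.categorical_column_with_identity(
    'hourofday', num_buckets=24)

# Cross them: each (day, hour) pair is hashed into one of 168 buckets.
day_hr = tf.feature_column.crossed_column([day, hour], hash_bucket_size=24 * 7)
```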

## Embedding Feature Crosses

• Creating an embedding column from a feature cross
• The weights in the embedding column are learned from data
• The model learns how to embed the feature cross in lower-dimensional space
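A minimal sketch, again assuming hypothetical `dayofweek` and `hourofday` features: the sparse cross is passed through an embedding column, and the dense embedding weights are learned during training:

```python
import tensorflow as tf

# Cross two (hypothetical) categorical features, then embed the
# sparse crossed column into a dense 2-dimensional space.
day_hr = tf.feature_column.crossed_column(
    ['dayofweek', 'hourofday'], hash_bucket_size=24 * 7)
day_hr_emb = tf.feature_column.embedding_column(day_hr, dimension=2)
```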

## Where to Do Feature Engineering

• Three possible places to do feature engineering

• TensorFlow (feature_column / input_fn)
• Dataflow
• Dataflow + TensorFlow (tf.transform)


• Some preprocessing can be done in tf.feature_column

• Powerful preprocessing can be done in TensorFlow by creating a new feature column

## Feature Creation in TensorFlow

• Create new features from existing features in TensorFlow
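As an illustration, a sketch of an `add_engineered`-style helper (the taxi-fare feature names are hypothetical) that could be called from the input functions to compute new features inside the TensorFlow graph:

```python
import tensorflow as tf

def add_engineered(features):
    """Create new features from existing ones inside the TF graph.

    The pickup/dropoff feature names are hypothetical examples."""
    lat1, lat2 = features['pickuplat'], features['dropofflat']
    lon1, lon2 = features['pickuplon'], features['dropofflon']
    features['latdiff'] = lat1 - lat2
    features['londiff'] = lon1 - lon2
    # Straight-line distance from the coordinate differences.
    features['euclidean'] = tf.sqrt(
        features['latdiff'] ** 2 + features['londiff'] ** 2)
    return features
```

Because the computation lives in the graph, exactly the same feature creation runs during training and at prediction time.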

## TensorFlow Transform

• Pros and Cons of three ways to do feature engineering

• tf.transform is a hybrid of Beam and TensorFlow
• Analyze - Beam
• Find min/max value of numeric feature
• Find all the unique values of a categorical feature
• Transform - TensorFlow
• Scale inputs by the min & max
• One-hot encode inputs based on set of unique values
• tf.transform provides two PTransforms
• AnalyzeAndTransformDataset - Executed in Beam to create the training dataset
• TransformDataset - Executed in Beam to create the evaluation dataset; the underlying transformations are also executed in TensorFlow at prediction time
• tf.transform has two phases
• Analysis phase (compute min/max/vocabulary etc. using Beam) - executed in Beam while creating the training dataset
• Transform phase (scale by min/max, map to vocabulary etc. using TensorFlow) - executed in Beam to create the training/evaluation datasets, and in TensorFlow at prediction time
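The two phases can be illustrated in plain Python (not the tf.transform API): the analysis phase computes global statistics once over the training data, and the transform phase applies those saved statistics unchanged to any later dataset:

```python
def analyze(train_values):
    """Analysis phase: compute global statistics over the training data."""
    return {'min': min(train_values), 'max': max(train_values)}

def transform(values, stats):
    """Transform phase: scale any dataset using the saved training stats."""
    span = stats['max'] - stats['min']
    return [(v - stats['min']) / span for v in values]
```

Reusing the saved statistics is what keeps training-time and prediction-time preprocessing identical.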

## Analysis phase

• First, set up the schema of the training dataset
• Next, run the analyze-and-transform PTransform on the training dataset to get back the preprocessed training data and the transform function
• Write out the preprocessed training data into TFRecords, the most efficient format for TensorFlow

## Transform phase

• The preprocessing function is restricted to TensorFlow functions that you can call from a TensorFlow graph
• Writing out the eval dataset is similar, except that we reuse the transform function computed from the training data