[Coursera] Feature Engineering

Machine Learning with TensorFlow on GCP

Posted by cyc1am3n on October 24, 2018

These are lecture notes for Feature Engineering, the fourth course in the Coursera specialization “Machine Learning with TensorFlow on Google Cloud Platform”.

Raw Data to Feature

  • Feature Engineering
    • Scale to large datasets
    • Find good features
    • Preprocess with Cloud MLE
  • What raw data do we need to collect to predict the price of a house?
    • Lot Size
    • Number of Rooms
    • Location
  • Raw data must be mapped into numerical feature vectors

Good vs Bad Features

  • What makes a good feature?
    1. Be related to the objective
    2. Be known at prediction-time
    3. Be numeric with meaningful magnitude
    4. Have enough examples
    5. Bring human insight to problem
  • Different problems in the same domain may need different features
  • Some data is known immediately, while other data is not available in real time
  • You cannot train with current data and predict with stale data
  • Avoid feature values for which you don’t have enough examples

Representing Features

  • Raw data are converted to numeric features in different ways

  • Numeric values can be used as-is (real value)

  • Overly specific attributes should be discarded

  • Categorical variables should be one-hot encoded

    tf.feature_column.categorical_column_with_vocabulary_list('employeeId', vocabulary_list=['8345', '72365', '87654', '23451'])
  • Preprocess data to create a vocabulary of keys

    • The vocabulary and its mapping need to be identical at prediction time
  • Options for encoding categorical data

    • If you know the keys beforehand:

      tf.feature_column.categorical_column_with_vocabulary_list('employeeId', vocabulary_list=['8345', '72365', '87654', '23451'])

    • If your data is already indexed, i.e., has integers in [0, N):

      tf.feature_column.categorical_column_with_identity('employeeId', num_buckets=5)

    • If you don't have a vocabulary of all possible values:

      tf.feature_column.categorical_column_with_hash_bucket('employeeId', hash_bucket_size=500)

  • Don't mix magic numbers with data
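
To make the encoding options above concrete, here is a minimal plain-Python sketch of what a vocabulary lookup and a hash bucket do conceptually. It is not the TensorFlow implementation: the helper names `one_hot_from_vocabulary` and `hash_bucket` are hypothetical, MD5 is only a stand-in for whatever hash the library uses, and mapping unknown keys to an all-zero vector is an assumption of this sketch.

```python
import hashlib


def one_hot_from_vocabulary(value, vocabulary):
    """One-hot encode `value` against a fixed, ordered vocabulary."""
    vector = [0] * len(vocabulary)
    if value in vocabulary:
        vector[vocabulary.index(value)] = 1
    # Assumption of this sketch: out-of-vocabulary values become all zeros.
    return vector


def hash_bucket(value, num_buckets):
    """Map any string to a stable bucket index in [0, num_buckets)."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would break train/serve consistency.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


vocabulary = ['8345', '72365', '87654', '23451']
encoded = one_hot_from_vocabulary('87654', vocabulary)   # known key
unknown = one_hot_from_vocabulary('99999', vocabulary)   # key not in vocabulary
bucket = hash_bucket('99999', 500)                       # no vocabulary needed
```

The vocabulary approach needs the same key list, in the same order, at training and prediction time. The hash approach needs no key list at all, at the cost of possible collisions between unrelated keys.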

ML vs Statistics

  • ML = lots of data, keep outliers and build models for them
  • Statistics = “I’ve got all the data I’ll ever get”, throw away outliers
  • Exact floats are not meaningful
    • Discretize floating point values into bins
  • Crazy outliers will hurt trainability
  • Ideally, features should have a similar range (Typically [0, 1] or [-1, 1])
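
The three numeric treatments mentioned above (capping outliers, scaling to a common range, and discretizing floats into bins) can be sketched in plain Python as follows. The helper names, the room counts, and the bin boundaries are all made up for illustration.

```python
import bisect


def clip(value, low, high):
    """Cap a value so extreme outliers cannot dominate training."""
    return max(low, min(high, value))


def min_max_scale(value, low, high):
    """Linearly rescale a value from [low, high] to [0, 1]."""
    return (value - low) / (high - low)


def bucketize(value, boundaries):
    """Return the bin index for `value`; `boundaries` must be sorted.

    Bin 0 is everything below boundaries[0]; each later bin includes
    its left boundary.
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical room counts with one extreme outlier.
rooms = [1, 2, 3, 4, 250]
capped = [clip(r, 1, 4) for r in rooms]
scaled = [min_max_scale(r, 1, 4) for r in capped]

# Hypothetical latitudes discretized into three bins.
latitudes = [32.5, 37.77, 41.9]
lat_bins = [bucketize(lat, [34.0, 38.0]) for lat in latitudes]
```

After clipping, the outlier 250 is treated like any other 4-room example, and every scaled value falls in [0, 1].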

Preprocessing Feature Creation

  • Feature engineering often requires global statistics and vocabularies
  features['scaled_price'] = (features['price'] - min_price) / (max_price - min_price)

  	keys=['San Diego', 'Los Angeles', 'San Francisco', 'Sacramento'])
  • Things that are commonly done in preprocessing (In TensorFlow)

    • Scaling, discretization, etc. of numeric features
    • Splitting, lower-casing, etc. of textual features
    • Resizing of input images
    • Normalizing volume level of input audio
  • There are two places for feature creation in TensorFlow

   1. Features are preprocessed in input_fn (train, eval, serving)

  features['capped_rooms'] = tf.clip_by_value(

   2. Feature columns are passed into the estimator during construction

  lat = tf.feature_column.numeric_column('latitude')
  dlat = tf.feature_column.bucketized_column(lat,

Feature Cross

  • Using non-linear inputs in a linear learner
  • Dividing the input space with two lines yields four quadrants
  • The weight of a cell is essentially the prediction for that cell
  • Feature crosses memorize
  • Goal of ML is generalization
  • Memorization works when you have lots of data
  • Feature crosses bring a lot of power to linear models
    • Feature crosses + massive data is an efficient way for learning highly complex spaces
    • Feature crosses allow a linear model to memorize large datasets
    • Optimizing linear models is a convex problem
    • Feature crosses, as a preprocessor, make neural networks converge a lot quicker
  • Feature crosses combine discrete / categorical features
  • Feature crosses lead to sparsity
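
As a rough illustration of "the weight of a cell is essentially the prediction for that cell", the sketch below crosses two sign features into four quadrant cells. For a linear model over one-hot cell indicators with squared loss and no bias term, the best weight for each cell is the mean label of the training points in that cell, so the sketch computes that mean directly instead of running an optimizer. The toy data and function names are hypothetical.

```python
from collections import defaultdict


def cell(x1, x2):
    """Cross two binary features (sign of x1, sign of x2) into one of 4 cells."""
    return (x1 >= 0, x2 >= 0)


# Hypothetical toy data: the label is 1 when x1 and x2 share a sign,
# a pattern that no single line through the plane can separate.
points = [(1.0, 2.0), (2.0, 0.5), (-1.0, -3.0), (-2.0, -0.5),
          (1.5, -1.0), (0.5, -2.0), (-1.0, 1.0), (-2.5, 2.0)]
labels = [1 if (x1 >= 0) == (x2 >= 0) else 0 for x1, x2 in points]

# One weight per crossed cell: the mean label of the points in that cell.
sums = defaultdict(float)
counts = defaultdict(int)
for (x1, x2), y in zip(points, labels):
    key = cell(x1, x2)
    sums[key] += y
    counts[key] += 1
weights = {key: sums[key] / counts[key] for key in counts}


def predict(x1, x2):
    """The prediction is simply the weight of the cell the point falls in."""
    return weights[cell(x1, x2)]
```

The model has effectively memorized one value per quadrant, which is why crosses need plenty of data in every cell to be trustworthy.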

Implementing Feature Crosses

  • Creating feature crosses using TensorFlow
  day_hr = tf.feature_column.crossed_column([dayofweek, hourofday], 24*7)
  • Choosing the number of hash buckets is an art, not a science

  • The number of hash buckets controls sparsity and collisions

    • Small hash_bucket_size → lots of collisions
    • Large hash_bucket_size → very sparse
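
The trade-off can be explored with a small plain-Python simulation of the day-of-week × hour-of-day cross (7 × 24 = 168 combinations). This is only a sketch: MD5 stands in for the library's hash function, and `crossed_bucket` and `bucket_stats` are hypothetical helpers.

```python
import hashlib


def crossed_bucket(day, hour, num_buckets):
    """Hash one (day, hour) combination into a bucket index."""
    key = f"{day}_x_{hour}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def bucket_stats(num_buckets):
    """Return (collisions, empty_buckets) for all 168 (day, hour) pairs."""
    used = {crossed_bucket(d, h, num_buckets)
            for d in range(7) for h in range(24)}
    total = 7 * 24
    collisions = total - len(used)   # pairs forced to share a bucket
    empty = num_buckets - len(used)  # buckets that never receive a pair
    return collisions, empty


for n in (10, 24 * 7, 10000):
    collisions, empty = bucket_stats(n)
    print(f"hash_bucket_size={n}: collisions={collisions}, empty={empty}")
```

With only 10 buckets, at least 158 combinations must share a bucket; with 10000 buckets, at least 9832 buckets stay empty. Because hashing does not guarantee a one-to-one mapping, some collisions are possible even when the bucket count equals the number of combinations.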

Embedding Feature Crosses

  • Creating an embedding column from a feature cross
  • The weights in the embedding column are learned from data
  • The model learns how to embed the feature cross in lower-dimensional space
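
Conceptually, an embedding column is a weight table with one low-dimensional row per cross bucket, and looking up a bucket is equivalent to multiplying its one-hot vector by that table. The sketch below shows the equivalence in plain Python; the table is randomly initialized here, whereas in a real model these weights would be learned during training. The dimensions and names are assumptions for illustration.

```python
import random

num_buckets = 24 * 7   # size of the crossed feature
embedding_dim = 2      # hypothetical target dimensionality

# One row per bucket. Random values stand in for weights that a real
# model would learn from data.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in range(num_buckets)]


def embed(bucket_index):
    """Embedding lookup: pick the row that belongs to the bucket."""
    return embedding_table[bucket_index]


def embed_via_one_hot(bucket_index):
    """Same result, computed as one_hot(bucket) times the weight table."""
    one_hot = [0.0] * num_buckets
    one_hot[bucket_index] = 1.0
    return [sum(one_hot[i] * embedding_table[i][j] for i in range(num_buckets))
            for j in range(embedding_dim)]
```

The lookup form avoids materializing the sparse 168-dimensional one-hot vector, which is the practical reason embeddings pair well with sparse feature crosses.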

Where to Do Feature Engineering

  • Three possible places to do feature engineering

    • TensorFlow feature_column input_fn
    • Dataflow
    • Dataflow + TensorFlow (tf.transform)

  • Some preprocessing can be done in tf.feature_column

  • Powerful preprocessing can be done in TensorFlow by creating a new feature column

  latbuckets = np.linspace(38.0, 42.0, nbuckets).tolist()
  lonbuckets = np.linspace(-76.0, -72.0, nbuckets).tolist()

  b_lat = tf.feature_column.bucketized_column(house_lat, latbuckets)
  b_lon = tf.feature_column.bucketized_column(house_lon, lonbuckets)

  # feature cross and embed
  loc = tf.feature_column.crossed_column([b_lat, b_lon], nbuckets*nbuckets)

  eloc = tf.feature_column.embedding_column(loc, nbuckets//4)

Feature Creation in TensorFlow

  • Create new features from existing features in TensorFlow
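  def add_engineered(features):
  	lat1 = features['lat']
  	lat2 = features['metro_lat']
  	# 'lon' and 'metro_lon' are assumed key names, by analogy with the latitude keys
  	lon1 = features['lon']
  	lon2 = features['metro_lon']
  	latdiff = lat1 - lat2
  	londiff = lon1 - lon2
  	dist = tf.sqrt(latdiff*latdiff + londiff*londiff)
  	features['euclidean'] = dist
  	return features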
  def train_input_fn():
  	features = ...
  	return add_engineered(features), label

  def serving_input_fn():
  	return ServingInputReceiver(

TensorFlow Transform

  • Pros and Cons of three ways to do feature engineering

  • tf.transform is a hybrid of Beam and TensorFlow
  • Analyze - Beam
    • Find min/max value of numeric feature
    • Find all the unique values of a categorical feature
  • Transform - TensorFlow
    • Scale inputs by the min & max
    • One-hot encode inputs based on set of unique values
  • tf.transform provides two PTransforms
    • AnalyzeAndTransformDataset - Executed in Beam to create the training dataset
    • TransformDataset - Executed in Beam to create the evaluation dataset; the underlying transformations are executed in TensorFlow at prediction time
  • tf.transform has two phases
    • Analysis phase (compute min/max/vocab etc. using Beam): executed in Beam while creating the training dataset
    • Transform phase (scale/vocabulary etc. using TensorFlow): executed in TensorFlow during prediction, and in Beam to create the training/evaluation datasets
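
The split between the two phases can be sketched in plain Python, without Beam or tf.transform. `analyze` makes one full pass over the training data to collect global statistics, and `transform` is a per-row function that only uses those stored statistics, so it can be reused unchanged for evaluation data and at prediction time. The function names, the sample rows, the alphabetical vocabulary order, and the `-1` index for unseen values are all assumptions of this sketch.

```python
def analyze(training_rows):
    """Analysis phase: a full pass over the training data for global stats."""
    lats = [row["dropofflat"] for row in training_rows]
    vocab = sorted({row["dayofweek"] for row in training_rows})
    return {"lat_min": min(lats), "lat_max": max(lats), "vocab": vocab}


def transform(row, stats):
    """Transform phase: per-row work that relies only on the stored stats."""
    span = stats["lat_max"] - stats["lat_min"]
    day = row["dayofweek"]
    return {
        "dropofflat": (row["dropofflat"] - stats["lat_min"]) / span,
        # Assumption: values unseen during analysis map to index -1.
        "dayofweek": stats["vocab"].index(day) if day in stats["vocab"] else -1,
    }


# Hypothetical training rows.
train = [
    {"dropofflat": 40.0, "dayofweek": "Mon"},
    {"dropofflat": 41.0, "dayofweek": "Tue"},
    {"dropofflat": 40.5, "dayofweek": "Mon"},
]
stats = analyze(train)                           # computed once, on training data only
train_t = [transform(r, stats) for r in train]   # training dataset
eval_row = transform({"dropofflat": 40.25, "dayofweek": "Sun"}, stats)  # reuse, no re-analysis
```

Statistics are never recomputed on the evaluation or serving data, which is what keeps the scaling and vocabulary mapping identical across training and prediction.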

Analysis phase

  • First, set up the schema of the training dataset
  raw_data_schema = {
  	colname : dataset_schema.ColumnSchema(tf.string, ...)
  		for colname in 'dayofweek,key'.split(',')
  }
  raw_data_schema.update({
  	colname : dataset_schema.ColumnSchema(tf.float32, ...)
  		for colname in 'fare_amount,pickuplon,...,dropofflat'.split(',')
  })
  raw_data_metadata = 
  • Next, run the analyze-and-transform PTransform on training dataset to get back preprocessed training data and the transform function
  raw_data = (p # 1. Read in data as usual for Beam
  	| beam.io.Read(beam.io.BigQuerySource(query=myquery, use_standard_sql=True))
  	| beam.Filter(is_valid)) # 2. Filter out data that you don't want to train with

  # 3. Pass raw data + metadata template to AnalyzeAndTransformDataset
  # 4. Get back transformed dataset and a reusable transform function
  transformed_dataset, transform_fn = ((raw_data, raw_data_metadata)
  	| beam_impl.AnalyzeAndTransformDataset(preprocess)) 
  • Write out the preprocessed training data into TFRecords, the most efficient format for TensorFlow
  transformed_data |
  		os.path.join(OUTPUT_DIR, 'train'),



Transform phase

  • The preprocessing function is restricted to TensorFlow functions that can be called from a TensorFlow graph
  def preprocess(inputs):
  	result = {} # Create features from the input tensors and put them into the "result" dict
  	result['fare_amount'] = inputs['fare_amount'] # Pass through
  	result['dayofweek'] = tft.string_to_int(inputs['dayofweek']) # vocabulary
  	result['dropofflat'] = (tft.scale_to_0_1(inputs['dropofflat'])) # scaling
  	result['passengers'] = tf.cast(inputs['passengers'], tf.float32) # Other TF fns
  	return result 
  • Writing out the eval dataset is similar, except that we reuse the transform function computed from the training data