Topic 5: Practice Questions
Which is the preferred method for avoiding hotspotting in time-series data in Bigtable?
A. Field promotion
B. Randomization
C. Salting
D. Hashing
Answer: A. Field promotion
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
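As a rough illustration of field promotion, here is a minimal Python sketch for a hypothetical per-device metric table (the `device_id#timestamp` key format is an assumption for the example, not a prescribed schema):

```python
# Sketch of field promotion for a Bigtable time-series row key, assuming a
# hypothetical schema of one metric reading per device. Promoting the device
# ID into the key prefix spreads writes across many key ranges instead of
# piling sequential timestamps onto one node.
def promoted_row_key(device_id: str, timestamp_iso: str) -> str:
    # Field-promoted key: device ID first, so rows for different devices
    # land in different key ranges.
    return f"{device_id}#{timestamp_iso}"

def hotspot_row_key(device_id: str, timestamp_iso: str) -> str:
    # Anti-pattern: timestamp first, so all concurrent writes share a prefix
    # and hit the same tablet.
    return f"{timestamp_iso}#{device_id}"

print(promoted_row_key("sensor-042", "2024-01-15T10:00:00"))
```

Because the promoted field leads the key, queries for a single device also become simple prefix scans.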
What are all of the BigQuery operations that Google charges for?
A. Storage, queries, and streaming inserts
B. Storage, queries, and loading data from a file
C. Storage, queries, and exporting data
D. Queries and streaming inserts
Answer: A. Storage, queries, and streaming inserts
Explanation
Google charges for storage, queries, and streaming inserts. Loading data from a file and exporting data are free
operations.
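A back-of-the-envelope cost sketch makes the billable/free split concrete. The per-unit prices below are illustrative assumptions only, not current list prices; always check the official pricing page:

```python
# Rough BigQuery cost sketch. All prices here are ASSUMED placeholder values
# for illustration -- not actual Google Cloud list prices.
PRICE_PER_TB_QUERIED = 5.00    # assumed on-demand query price, USD per TB
PRICE_PER_GB_STORED = 0.02     # assumed storage price, USD per GB per month
PRICE_PER_GB_STREAMED = 0.01   # assumed streaming-insert price, USD per GB

def monthly_cost(tb_queried, gb_stored, gb_streamed, gb_loaded_from_file=0.0):
    # Loading from a file and exporting are free operations, so
    # gb_loaded_from_file contributes nothing to the bill.
    return (tb_queried * PRICE_PER_TB_QUERIED
            + gb_stored * PRICE_PER_GB_STORED
            + gb_streamed * PRICE_PER_GB_STREAMED)

print(monthly_cost(tb_queried=2, gb_stored=500, gb_streamed=100))
```

Note that `gb_loaded_from_file` is accepted but ignored, mirroring the fact that file loads are not billed.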
Which of the following job types are supported by Cloud Dataproc? (Select 3 answers.)
A. Hive
B. Pig
C. YARN
D. Spark
Answer: A, B, and D. Hive, Pig, and Spark
Cloud Dataproc runs Hadoop-ecosystem jobs such as Hive, Pig, and Spark. YARN is the cluster's resource manager, not a job type you submit.
The CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of which types of cluster nodes?
A. Workers
B. Masters, workers, and parameter servers
C. Workers and parameter servers
D. Parameter servers
Answer: C. Workers and parameter servers
The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use
this tier, set values to configure your processing cluster according to these guidelines:
You must set TrainingInput.masterType to specify the type of machine to use for your master node.
You may set TrainingInput.workerCount to specify the number of workers to use.
You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use.
You can specify the type of machine for the master node, but you can't specify more than one master node.
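The guidelines above can be sketched as a training-job configuration. The dict below mirrors the fields of the job request's trainingInput object; the machine-type names are placeholders chosen for illustration, so check the service documentation for valid values:

```python
# Hypothetical CUSTOM-tier configuration for a Cloud ML Engine training job,
# expressed as the dict you would serialize into the request's trainingInput.
# Machine type names are placeholder assumptions for this sketch.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "large_model",           # required: machine type for the single master node
    "workerType": "complex_model_m",       # required if workerCount > 0
    "workerCount": 9,                      # optional: number of workers
    "parameterServerType": "large_model",  # required if parameterServerCount > 0
    "parameterServerCount": 3,             # optional: number of parameter servers
}

# You can size workers and parameter servers, but there is always exactly
# one master node -- no masterCount field exists.
assert "masterCount" not in training_input
```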
Why do you need to split a machine learning dataset into training data and test data?
A. So you can try two different sets of features
B. To make sure your model is generalized for more than just the training data
C. To allow you to create unit tests in your code
D. So you can use one dataset for a wide model and one for a deep model
Answer: B. To make sure your model is generalized for more than just the training data
The flaw with evaluating a predictive model on its training data is that it tells you nothing about how well the model generalizes to new, unseen data. A model selected for its accuracy on the training set, rather than on a held-out test set, is likely to perform worse on new data because it has specialized to the structure of the training set. This is called overfitting.
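A minimal train/test split can be sketched with only the standard library (the 80/20 fraction and fixed seed are illustrative choices, not requirements):

```python
import random

# Minimal sketch of a train/test split using only the standard library.
# Holding out a test set lets you estimate how the model generalizes to
# data it never saw during training.
def train_test_split(rows, test_fraction=0.2, seed=42):
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

Shuffling before the cut matters: if the rows are ordered (say, by date), a naive head/tail split would give the test set a different distribution than the training set.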
Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?
A. You expect to store at least 10 TB of data.
B. You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.
C. You need to integrate with Google BigQuery.
D. You will not use the data to back a user-facing or latency-sensitive application.
Answer: C. You need to integrate with Google BigQuery.
Explanation
For example, if you plan to store extensive historical data for a large number of remote-sensing devices and
then use the data to generate daily reports, the cost savings for HDD storage may justify the performance
tradeoff. On the other hand, if you plan to use the data to display a real-time dashboard, it probably would not
make sense to use HDD storage—reads would be much more frequent in this case, and reads are much slower
with HDD storage.
Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)
A. The wide model is used for memorization, while the deep model is used for generalization.
B. A good use for the wide and deep model is a recommender system.
C. The wide model is used for generalization, while the deep model is used for memorization.
D. A good use for the wide and deep model is a small-scale linear regression problem.
Answer: A and B. The wide model is used for memorization, while the deep model is used for generalization; a good use for the wide and deep model is a recommender system.
Explanation
Can we teach computers to learn like humans do, by combining the power of memorization and
generalization? It's not an easy question to answer, but by jointly training a wide linear model (for
memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to
bring us one step closer. At Google, we call it Wide & Deep Learning. It's useful for generic large-scale
regression and classification problems with sparse inputs (categorical features with a large number of possible
feature values), such as recommender systems, search, and ranking problems.
Which of the following statements is NOT true regarding Bigtable access roles?
A. Using IAM roles, you cannot give a user access to only one table in a project, rather than all tables in a project.
B. To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
C. You can configure access control only at the project level.
D. To give a user access to only one table in a project, you must configure access through your application.
Answer: B. To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
Explanation
For Cloud Bigtable, you can configure access control at the project level. For example, you can grant the
ability to:
Read from, but not write to, any table within the project.
Read from and write to any table within the project, but not manage instances.
Read from and write to any table within the project, and manage instances.
Which software libraries are supported by Cloud Machine Learning Engine?
A. Theano and TensorFlow
B. Theano and Torch
C. TensorFlow
D. TensorFlow and Torch
Answer: C. TensorFlow
Explanation
Cloud ML Engine mainly does two things:
Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud.
Hosts those trained models for you in the cloud so that you can use them to get predictions about new data.
You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?
A. Both batch and streaming
B. BigQuery cannot be used as a sink
C. Only batch
D. Only streaming
Answer: A. Both batch and streaming
When you apply a BigQueryIO.Write transform in batch mode to write to a single table, Dataflow invokes a BigQuery load job. When you apply a BigQueryIO.Write transform in streaming mode, or in batch mode using a function to specify the destination table, Dataflow uses BigQuery's streaming inserts.
Which of the following is not possible using primitive roles?
A. Give a user viewer access to BigQuery and owner access to Google Compute Engine instances.
B. Give UserA owner access and UserB editor access for all datasets in a project.
C. Give a user access to view all datasets in a project, but not run queries on them.
D. Give GroupA owner access and GroupB editor access for all datasets in a project.
Answer: C. Give a user access to view all datasets in a project, but not run queries on them.
Primitive roles can be used to give owner, editor, or viewer access to a user or group, but they can't be used to separate data access permissions from job-running permissions.
Which methods can be used to reduce the number of rows processed by BigQuery?
A. Splitting tables into multiple tables; putting data in partitions
B. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
C. Putting data in partitions; using the LIMIT clause
D. Splitting tables into multiple tables; using the LIMIT clause
Answer: A. Splitting tables into multiple tables; putting data in partitions
If you split a table into multiple tables (such as one table for each day), you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be partitioned by date. Note that the LIMIT clause restricts only the rows returned, not the rows scanned, so it does not reduce processing.
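The "one table per day" pattern can be sketched as follows; given a date range, you compute which daily shard tables the query needs to touch so that only those tables are scanned (the `base_YYYYMMDD` naming format is an assumption for this example):

```python
from datetime import date, timedelta

# Sketch of the "one table per day" sharding pattern: compute which daily
# shard tables a query over a date range must touch, so only those tables
# are scanned. The table-name format is an assumed convention.
def shard_tables(base: str, start: date, end: date) -> list[str]:
    tables = []
    d = start
    while d <= end:
        tables.append(f"{base}_{d.strftime('%Y%m%d')}")  # e.g. events_20240115
        d += timedelta(days=1)
    return tables

print(shard_tables("events", date(2024, 1, 15), date(2024, 1, 17)))
```

With a partitioned table, the same pruning happens automatically from a date filter in the WHERE clause, which is why partitioning is the preferred approach.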