Professional-Data-Engineer Practice Test Questions

114 Questions


Topic 5: Practice Questions

When you store data in Cloud Bigtable, what is the recommended minimum amount of stored data?


A.

500 TB


B.

1 GB


C.

1 TB


D.

500 GB





C.
  

1 TB



Explanation
Cloud Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions. It
is not a good solution for less than 1 TB of data.
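To make the data model concrete, here is a minimal sketch of writing a single row with the google-cloud-bigtable Python client; the project, instance, table, and column-family names are hypothetical placeholders. Note that access is by row key and column family, not SQL:

```python
# Minimal sketch: writing one row with the google-cloud-bigtable client.
# All names below are hypothetical placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("customer-events")

row = table.direct_row(b"customer#42#2023-01-01")  # the row key drives all lookups
row.set_cell("profile", b"name", b"Tom")           # column family "profile", column "name"
row.commit()
```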

In order to securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster, you should use a(n) _____.


A.

VPN connection


B.

Special browser


C.

SSH tunnel


D.

FTP connection





C.
  

SSH tunnel



Explanation
To connect to the web interfaces, it is recommended to use an SSH tunnel to create a secure connection to the master node.
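As an illustration, one common pattern is to open a SOCKS proxy to the master node with gcloud's SSH wrapper and point the browser at it. The sketch below shells out from Python; the cluster name, zone, and local port are assumptions:

```python
# Sketch: open an SSH tunnel (SOCKS proxy) to a Dataproc master node by
# shelling out to gcloud, then point your browser at localhost:1080.
# Cluster name, zone, and port are hypothetical placeholders.
import subprocess

subprocess.run([
    "gcloud", "compute", "ssh", "my-cluster-m",  # master node = cluster name + "-m"
    "--zone=us-central1-a",
    "--",          # everything after -- is passed to ssh itself
    "-D", "1080",  # dynamic port forwarding (SOCKS proxy)
    "-N",          # no remote command; tunnel only
], check=True)
```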

Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face. To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?


A.

Use K-means Clustering to detect faces in the pixels.


B.

Use feature engineering to add features for eyes, noses, and mouths to the input data.


C.

Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.


D.

Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two
categories.





C.
  

Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.



Explanation
Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as "deep" learning. So "deep" is a strictly defined, technical term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer's output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
A neural network with only one hidden layer would be unable to automatically recognize high-level features of faces, such as eyes, because it wouldn't be able to "build" these features using previous hidden layers that detect low-level features, such as lines.
Feature engineering is difficult to perform on raw image data.
K-means Clustering is an unsupervised learning method used to categorize unlabeled data.
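As a rough illustration of answer C (not code from the exam), a Keras model with several hidden layers stacked between the pixel input and the binary output might look like the sketch below; the image size and layer widths are arbitrary choices:

```python
# Illustrative sketch: a multi-hidden-layer network for binary
# face / no-face classification on small grayscale images.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64)),                  # raw pixels in
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),   # low-level features (edges)
    tf.keras.layers.Dense(128, activation="relu"),   # mid-level features (eyes, noses)
    tf.keras.layers.Dense(64, activation="relu"),    # high-level features (whole faces)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # face vs. no face
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```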

You are planning to use Google's Dataflow SDK to analyze customer data such as that shown below. Your project requirement is to extract only the customer name from the data source and then write it to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?


A.

ParDo


B.

Sink API


C.

Source API


D.

Data extraction





A.
  

ParDo
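ParDo is the transform for element-wise processing, which is exactly what extracting one field from each record requires. A minimal sketch with the Apache Beam Python SDK (the successor to the Dataflow SDK) is shown below; the DoFn name is hypothetical:

```python
# Sketch: a ParDo that emits only the customer name from each
# "name,address" line of the sample data.
import apache_beam as beam

class ExtractName(beam.DoFn):
    def process(self, element):
        # "Tom,555 X street" -> "Tom"
        yield element.split(",")[0].strip()

with beam.Pipeline() as pipeline:
    names = (
        pipeline
        | "Read" >> beam.Create(
            ["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"])
        | "ExtractName" >> beam.ParDo(ExtractName())
    )
```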



Cloud Dataproc charges you only for what you really use with _____ billing.


A.

month-by-month


B.

minute-by-minute


C.

week-by-week


D.

hour-by-hour





B.
  

minute-by-minute



Explanation
One of the advantages of Cloud Dataproc is its low cost. Dataproc charges for what you really use with
minute-by-minute billing and a low, ten-minute-minimum billing period.

Which of the following statements about Legacy SQL and Standard SQL is not true?


A.

Standard SQL is the preferred query language for BigQuery.


B.

If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.

C.

One difference between the two query languages is how you specify fully-qualified table names (i.e.
table names that include their associated project name).


D.

You need to set a query language for each dataset and the default is Standard SQL





D.
  

You need to set a query language for each dataset and the default is Standard SQL



Explanation
You do not set a query language for each dataset. It is set each time you run a query and the default query
language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard
SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
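For illustration, here is a sketch with the google-cloud-bigquery Python client showing how the dialect is chosen per query and how the table-name syntax differs; the project, dataset, and table names are placeholders:

```python
# Sketch: the same logical query in both dialects.
from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL (the default): backtick-quoted, period-separated table name.
client.query("SELECT name FROM `my-project.my_dataset.my_table`")

# Legacy SQL: opted into per query; bracketed, colon-separated table name.
legacy_config = bigquery.QueryJobConfig(use_legacy_sql=True)
client.query(
    "SELECT name FROM [my-project:my_dataset.my_table]",
    job_config=legacy_config,
)
```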

Which of these statements about exporting data from BigQuery is false?


A.

To export more than 1 GB of data, you need to put a wildcard in the destination filename.


B.

The only supported export destination is Google Cloud Storage.


C.

Data can only be exported in JSON or Avro format.


D.

The only compression option available is GZIP.





C.
  

Data can only be exported in JSON or Avro format.



Explanation
Data can be exported in CSV, JSON, or Avro format. If you are exporting nested or repeated data, then CSV format is not supported.
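A minimal sketch of such an export with the google-cloud-bigquery Python client, combining the Cloud Storage destination, a wildcard for exports over 1 GB, and GZIP compression; the bucket and table names are placeholders:

```python
# Sketch: export a BigQuery table to Cloud Storage.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(
    destination_format="CSV",  # also: NEWLINE_DELIMITED_JSON, AVRO
    compression="GZIP",        # the only compression option for CSV/JSON exports
)
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/export/part-*.csv.gz",  # wildcard shards exports over 1 GB
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes
```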

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four
values are required: project, region, name, and ____.


A.

zone


B.

node


C.

label


D.

type





A.
  

zone



Explanation
At a minimum, you must specify four values when creating a new cluster with the projects.regions.clusters.create operation:
The project in which the cluster will be created
The region to use
The name of the cluster
The zone in which the cluster will be created
You can specify many more details beyond these minimum requirements. For example, you can also specify the number of workers, whether preemptible compute should be used, and the network settings.
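As a sketch, the four required values map onto a create_cluster request in the google-cloud-dataproc Python client (which wraps projects.regions.clusters.create) roughly as follows; the project, region, zone, and cluster name are placeholders:

```python
# Sketch: minimal cluster creation with the google-cloud-dataproc client.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
operation = client.create_cluster(
    request={
        "project_id": "my-project",            # 1. project
        "region": region,                      # 2. region
        "cluster": {
            "cluster_name": "my-cluster",      # 3. name
            "config": {
                "gce_cluster_config": {
                    "zone_uri": f"{region}-a", # 4. zone
                },
            },
        },
    }
)
operation.result()  # block until the cluster is created
```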

The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster
____.


A.

application node


B.

conditional node


C.

master node


D.

worker node





C.
  

master node



Explanation
The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster master node. The cluster master-host-name is the name of your Cloud Dataproc cluster followed by an -m suffix; for example, if your cluster is named "my-cluster", the master-host-name would be "my-cluster-m".

Which of the following is NOT true about Dataflow pipelines?


A.

Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner


B.

Dataflow pipelines can consume data from other Google Cloud services


C.

Dataflow pipelines can be programmed in Java


D.

Dataflow pipelines use a unified programming model, so can work both with streaming and batch data
sources





A.
  

Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner



Explanation
Dataflow pipelines can also run on alternate runtimes like Spark and Flink, as they are built using the Apache Beam SDKs.
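For example, switching runners is just a pipeline-option change in Beam. A minimal sketch follows (the runner names are Beam's documented ones; everything else is arbitrary):

```python
# Sketch: the same Beam pipeline can target different runners
# just by changing its pipeline options.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner", "SparkRunner", or "FlinkRunner".
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | beam.Create([1, 2, 3])
        | beam.Map(lambda x: x * 2)
    )
```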

What are two of the characteristics of using online prediction rather than batch prediction?


A.

It is optimized to handle a high volume of data instances in a job and to run more complex models.


B.

Predictions are returned in the response message.


C.

Predictions are written to output files in a Cloud Storage location that you specify.


D.

It is optimized to minimize the latency of serving predictions





B.
  

Predictions are returned in the response message.



D.
  

It is optimized to minimize the latency of serving predictions



Explanation
Online prediction
Optimized to minimize the latency of serving predictions.
Predictions returned in the response message.
Batch prediction
Optimized to handle a high volume of instances in a job and to run more complex models.
Predictions written to output files in a Cloud Storage location that you specify.
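As an illustration of the online path, a request against the (now legacy) AI Platform "ml" v1 API returns predictions inline in the response message rather than in Cloud Storage files; the project, model, and instance payload below are hypothetical:

```python
# Sketch: online prediction via the legacy AI Platform REST API.
from googleapiclient import discovery

service = discovery.build("ml", "v1")
response = service.projects().predict(
    name="projects/my-project/models/my_model",
    body={"instances": [{"values": [1.0, 2.0, 3.0]}]},
).execute()
predictions = response["predictions"]  # returned inline, not written to GCS
```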

When running a pipeline that has a BigQuery source on your local machine, you continue to get permission denied errors. What could be the reason for that?


A.

Your gcloud does not have access to the BigQuery resources


B.

BigQuery cannot be accessed from local machines


C.

You are missing gcloud on your machine


D.

Pipelines cannot be run locally





A.
  

Your gcloud does not have access to the BigQuery resources



Explanation
When reading from a Dataflow source or writing to a Dataflow sink using DirectPipelineRunner, the Cloud Platform account that you configured with the gcloud executable will need access to the corresponding source/sink.
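In practice the fix is usually to refresh application default credentials (gcloud auth application-default login) with an account that can read the table. Below is a sketch of such a locally run pipeline using the modern equivalent of DirectPipelineRunner (Beam's DirectRunner); all names and the temp bucket are placeholders:

```python
# Sketch: a locally run pipeline with a BigQuery source. The credentials
# gcloud configured (application default credentials) must have access
# to the queried table.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DirectRunner",
    project="my-project",
    temp_location="gs://my-bucket/tmp",  # staging area for the BigQuery export
)

with beam.Pipeline(options=options) as pipeline:
    rows = pipeline | beam.io.ReadFromBigQuery(
        query="SELECT name FROM `my-project.my_dataset.customers`",
        use_standard_sql=True,
    )
```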

