Professional-Data-Engineer Practice Test Questions

114 Questions


Topic 5: Practice Questions

Which of these is not a supported method of putting data into a partitioned table?


A.

If you have existing data in a separate file for each day, then create a partitioned table and upload each
file into the appropriate partition.


B.

Run a query to get the records for a specific day from an existing table, and for the destination table,
specify a partitioned table ending with the day in the format "$YYYYMMDD".


C.

Create a partitioned table and stream new records to it every day.


D.

Use ORDER BY to put a table's rows into chronological order and then change the table's type to
"Partitioned".





D.
  

Use ORDER BY to put a table's rows into chronological order and then change the table's type to
"Partitioned".



Explanation
You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch.
Then you can either stream data into it every day and the data will automatically be put in the right partition,
or you can load data into a specific partition by using "$YYYYMMDD" at the end of the table name.
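
As a minimal sketch of the load-into-a-partition approach (using the google-cloud-bigquery Python client; the project, dataset, table, and Cloud Storage paths below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# The destination uses a partition decorator: the table name followed by "$YYYYMMDD".
# Here, "my-project.my_dataset.events$20240101" targets the 2024-01-01 partition.
destination = bigquery.TableReference.from_string("my-project.my_dataset.events$20240101")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events_20240101.csv", destination, job_config=job_config
)
load_job.result()  # Wait for the load job to finish.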

If you're running a performance test that depends upon Cloud Bigtable, all of the choices below except one are recommended steps. Which is NOT a recommended step to follow?


A.

Do not use a production instance.


B.

Run your test for at least 10 minutes.


C.

Before you test, run a heavy pre-test for several minutes.


D.

Use at least 300 GB of data.





A.
  

Do not use a production instance.



If you're running a performance test that depends upon Cloud Bigtable, be sure to follow these steps as you
plan and execute your test:
Use a production instance. A development instance will not give you an accurate sense of how a production
instance performs under load.
Use at least 300 GB of data. Cloud Bigtable performs best with 1 TB or more of data. However, 300 GB of
data is enough to provide reasonable results in a performance test on a 3-node cluster. On larger clusters, use
100 GB of data per node.
Before you test, run a heavy pre-test for several minutes. This step gives Cloud Bigtable a chance to balance
data across your nodes based on the access patterns it observes.
Run your test for at least 10 minutes. This step lets Cloud Bigtable further optimize your data, and it helps
ensure that you will test reads from disk as well as cached reads from memory.

All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud
Bigtable node.


A.

before


B.

after


C.

only if


D.

once





A.
  

before



In a Cloud Bigtable architecture, all client requests go through a front-end server before they are sent to a
Cloud Bigtable node.
The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, which is a
container for the cluster. Each node in the cluster handles a subset of the requests to the cluster.
Adding nodes to a cluster increases the number of simultaneous requests that the cluster can handle, as well
as the maximum throughput for the entire cluster.

Which of the following is not true about Dataflow pipelines? (Choose one)


A.

Pipelines are a set of operations


B.

Pipelines represent a data processing job


C.

Pipelines represent a directed graph of steps


D.

Pipelines can share data between instances





D.
  

Pipelines can share data between instances



The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can
create multiple pipelines, pipelines cannot share data or transforms.
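
As a rough sketch of this ownership model (using the Apache Beam Python SDK, in which Dataflow pipelines are written; the sample data is arbitrary), a pipeline is a directed graph of transforms that all belong to that one pipeline:

import apache_beam as beam

# Each "|" step adds an operation to this pipeline's directed graph.
# The resulting PCollections are owned by this pipeline and cannot be
# shared with any other pipeline instance.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create words" >> beam.Create(["data", "flow", "data"])
        | "Count words" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Print" >> beam.Map(print)
    )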

What Dataflow concept determines when a Window's contents should be output based on certain criteria being met?


A.

Sessions


B.

OutputCriteria


C.

Windows


D.

Triggers





D.
  

Triggers



Explanation
Triggers control when the elements for a specific key and window are output. As elements arrive, they are put
into one or more windows by a Window transform and its associated WindowFn, and then passed to the
associated Trigger to determine if the Window's contents should be output.
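
A hedged sketch of this in the Beam/Dataflow Python SDK (the window size, trigger settings, and sample data are arbitrary illustrations):

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("user1", 1), ("user2", 1), ("user1", 1)])
        # Attach event timestamps so the elements can be windowed
        # (1700000000 is just an arbitrary epoch timestamp).
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1700000000))
        | beam.WindowInto(
            window.FixedWindows(60),  # 60-second fixed windows
            # The trigger decides when each window's contents are emitted:
            # at the watermark, with early firings every 30 seconds of processing time.
            trigger=trigger.AfterWatermark(early=trigger.AfterProcessingTime(30)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )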

What is the HBase Shell for Cloud Bigtable?


A.

The HBase shell is a GUI based interface that performs administrative tasks, such as creating and
deleting tables.


B.

The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting
tables.


C.

The HBase shell is a hypervisor based shell that performs administrative tasks, such as creating and
deleting new virtualized instances.


D.

The HBase shell is a command-line tool that performs only user account management functions to grant
access to Cloud Bigtable instances.





B.
  

The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting
tables.



Explanation
The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting
tables. The Cloud Bigtable HBase client for Java makes it possible to use the HBase shell to connect to Cloud
Bigtable.
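For example, typical HBase shell commands against a Cloud Bigtable instance include create 'my-table', 'cf1' to create a table with a single column family, list to show existing tables, and disable 'my-table' followed by drop 'my-table' to delete it (the table and column-family names here are only illustrative).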

What are two methods that can be used to denormalize tables in BigQuery?


A.

1) Split table into multiple tables; 2) Use a partitioned table


B.

1) Join tables into one table; 2) Use nested repeated fields


C.

1) Use a partitioned table; 2) Join tables into one table


D.

1) Use nested repeated fields; 2) Use a partitioned table





B.
  

1) Join tables into one table; 2) Use nested repeated fields



Explanation
The conventional method of denormalizing data involves simply writing a fact, along with all its dimensions,
into a flat table structure. For example, if you are dealing with sales transactions, you would write each
individual fact to a record, along with the accompanying dimensions such as order and customer information.
The other method for denormalizing data takes advantage of BigQuery’s native support for nested and
repeated structures in JSON or Avro input data. Expressing records using nested and repeated structures can
provide a more natural representation of the underlying data. In the case of the sales order, the outer part of a
JSON structure would contain the order and customer information, and the inner part of the structure would
contain the individual line items of the order, which would be represented as nested, repeated elements.
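
As a small illustration of the second method (a sketch only; the project, dataset, and field names are hypothetical), a denormalized sales-order table with nested, repeated line items could be defined with the BigQuery Python client like this:

from google.cloud import bigquery

client = bigquery.Client()

# Order and customer information live at the top level of each row;
# the individual line items are a nested, repeated RECORD field.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_name", "STRING"),
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.sales.orders", schema=schema)
client.create_table(table)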

To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?


A.

gcloud ml-engine local train


B.

gcloud ml-engine jobs submit training


C.

gcloud ml-engine jobs submit training local


D.

You can't run a TensorFlow program on your own computer using Cloud ML Engine.





A.
  

gcloud ml-engine local train



gcloud ml-engine local train - run a Cloud ML Engine training job locally
This command runs the specified module in an environment similar to that of a live Cloud ML Engine
Training Job.
This is especially useful in the case of testing distributed models, as it allows you to validate that you are
properly interacting with the Cloud ML Engine cluster configuration.
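A typical invocation might look like the following (the module name, package path, and the arguments after "--", which are passed straight to your trainer code, are placeholders):

gcloud ml-engine local train --module-name trainer.task --package-path trainer/ -- --train-files data/train.csv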

Cloud Bigtable is a recommended option for storing very large amounts of
____________________________?


A.

multi-keyed data with very high latency


B.

multi-keyed data with very low latency


C.

single-keyed data with very low latency


D.

single-keyed data with very high latency





C.
  

single-keyed data with very low latency



Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns,
allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is
known as the row key. Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very
low latency. It supports high read and write throughput at low latency, and it is an ideal data source for
MapReduce operations.
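
As a hedged sketch of what a single-keyed, low-latency lookup looks like with the google-cloud-bigtable Python client (the project, instance, table, and row key are made up for illustration):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("user-events")

# Cloud Bigtable indexes exactly one value per row: the row key.
# A point read by row key is the low-latency access pattern it is built for.
row = table.read_row(b"user123#20240101")
if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            print(family, qualifier, cells[0].value)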

Which of these statements about BigQuery caching is true?


A.

By default, a query's results are not cached.


B.

BigQuery caches query results for 48 hours.


C.

Query results are cached even if you specify a destination table.


D.

There is no charge for a query that retrieves its results from cache.





D.
  

There is no charge for a query that retrieves its results from cache.



When query results are retrieved from a cached results table, you are not charged for the query.
BigQuery caches query results for 24 hours, not 48 hours.
Query results are not cached if you specify a destination table.
A query's results are always cached except under certain conditions, such as when you specify a destination table.
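
For illustration, a small Python sketch (the query runs against a public sample table; everything else uses client defaults) showing how to check whether a result was served from cache and how to bypass the cache:

from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`"

# Default behavior: cached results are used when available, at no charge.
job = client.query(sql)
job.result()
print("Served from cache:", job.cache_hit)

# Setting use_query_cache=False (or specifying a destination table) bypasses the cache.
no_cache_config = bigquery.QueryJobConfig(use_query_cache=False)
job = client.query(sql, job_config=no_cache_config)
job.result()
print("Served from cache:", job.cache_hit)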

When a Cloud Bigtable node fails, ____ is lost.


A.

all data


B.

no data


C.

the last transaction


D.

the time dimension





B.
  

no data



Explanation
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload
of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated
with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are
stored on Colossus. As a result:
Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud
Bigtable simply updates the pointers for each node.
Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to
the replacement node.
When a Cloud Bigtable node fails, no data is lost.

Which Google Cloud Platform service is an alternative to Hadoop with Hive?


A.

Cloud Dataflow


B.

Cloud Bigtable


C.

BigQuery


D.

Cloud Datastore





C.
  

BigQuery



Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data
summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse that likewise offers SQL-based analysis over very large
datasets, which makes it the closest Google Cloud Platform alternative to Hadoop with Hive.

