A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records. In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
A. Set the configuration delta.deduplicate = true.
B. VACUUM the Delta table after each batch completes.
C. Perform an insert-only merge with a matching condition on a unique key.
D. Perform a full outer join on a unique key and overwrite existing data.
E. Rely on Delta Lake schema enforcement to prevent duplicate records.
Explanation:
To deduplicate data against previously processed records as it is inserted
into a Delta table, you can use the merge operation with an insert-only clause. This allows
you to insert new records that do not match any existing records based on a unique key,
while ignoring duplicate records that match existing records. For example, you can use the
following syntax:
MERGE INTO target_table USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *
This will insert only the records from the source table that have a unique key that is not
present in the target table, and skip the records that have a matching key. This way, you
can avoid inserting duplicate records into the Delta table.
References:
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge
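For illustration, below is a minimal PySpark sketch of the same insert-only merge, assuming placeholder names (target_table, source_df, unique_key) that are not part of the question; the incoming batch is also de-duplicated on the key before merging:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")  # `spark` is the active SparkSession

(target.alias("t")
    .merge(
        source_df.dropDuplicates(["unique_key"]).alias("s"),  # de-duplicate within the batch
        "t.unique_key = s.unique_key")
    .whenNotMatchedInsertAll()  # insert only keys not already present in the target
    .execute())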
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?
A. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.
B. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.
C. All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.
D. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
E. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.
Explanation:
When a Databricks job runs multiple tasks, each task executes independently and there is no job-level transaction spanning the dependency graph. Because tasks A and B finished successfully, all logic in their notebooks completed and their results persist. Task C's failure does not trigger any rollback: any writes task C committed before the point of failure remain in place, while operations after that point never ran. Delta Lake guarantees atomicity only for an individual transaction against a single table, not for every command in a notebook or every task in a job, so some operations in task C may have completed successfully.
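To make the dependency graph concrete, here is a hypothetical Jobs API 2.1-style settings sketch for the three tasks; the job name and notebook paths are placeholders, not taken from the question:

# Hypothetical job settings expressing the dependency graph described above.
job_settings = {
    "name": "example_three_task_job",
    "tasks": [
        {"task_key": "A",
         "notebook_task": {"notebook_path": "/Jobs/task_a"}},
        {"task_key": "B", "depends_on": [{"task_key": "A"}],
         "notebook_task": {"notebook_path": "/Jobs/task_b"}},
        {"task_key": "C", "depends_on": [{"task_key": "A"}],
         "notebook_task": {"notebook_path": "/Jobs/task_c"}},
    ],
}
# B and C depend only on A, so they start in parallel once A succeeds; a failure in C
# does not undo anything A or B committed, and it does not roll back C's earlier writes.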
Which statement describes the default execution mode for Databricks Auto Loader?
A. New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
B. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
C. A webhook triggers a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.
D. New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
Explanation:
Databricks Auto Loader simplifies and automates the process of loading data
into Delta Lake. The default execution mode of the Auto Loader identifies new files by
listing the input directory. It incrementally and idempotently loads these new files into the
target Delta Lake table. This approach ensures that files are not missed and are processed
exactly once, avoiding data duplication. The other options describe different mechanisms
or integrations that are not part of the default behavior of the Auto Loader.
References:
Databricks Auto Loader Documentation: Auto Loader Guide
Delta Lake and Auto Loader: Delta Lake Integration
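As a hedged illustration of the default (directory listing) mode, the PySpark sketch below uses placeholder paths and table names that are not part of the question:

# `spark` is the active SparkSession in a Databricks notebook.
(spark.readStream
    .format("cloudFiles")                    # Auto Loader
    .option("cloudFiles.format", "json")     # format of the arriving files
    .load("/mnt/raw/events/")                # new files are discovered by listing this directory
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events/")  # enables incremental, exactly-once loading
    .toTable("bronze.events"))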
Which statement characterizes the general programming model used by Spark Structured Streaming?
A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve subsecond latency for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
Explanation:
This is the correct answer because it characterizes the general programming
model used by Spark Structured Streaming, which is to treat a live data stream as a table
that is being continuously appended. This leads to a new stream processing model that is
very similar to a batch processing model, where users can express their streaming
computation using the same Dataset/DataFrame API as they would use for static data. The
Spark SQL engine will take care of running the streaming query incrementally and
continuously and updating the final result as streaming data continues to arrive.
Verified
References: [Databricks Certified Data Engineer Professional], under “Structured
Streaming” section; Databricks Documentation, under “Overview” section.
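A minimal sketch of this model, using the built-in rate test source and illustrative names; each micro-batch is treated as new rows appended to an unbounded input table, and the aggregation is written with the ordinary DataFrame API:

# `spark` is the active SparkSession.
stream = (spark.readStream
    .format("rate")                  # test source that continuously emits timestamp/value rows
    .option("rowsPerSecond", 10)
    .load())

counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

query = (counts.writeStream
    .outputMode("complete")          # emit the full, continuously updated result table
    .format("console")
    .start())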
A Delta Lake table was created with the below query:
Realizing that the original query had a typographical error, the below code was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command?
A. The table reference in the metastore is updated and no data is changed.
B. The table name change is recorded in the Delta transaction log.
C. All related files and metadata are dropped and recreated in a single ACID transaction.
D. The table reference in the metastore is updated and all data files are moved.
E. A new Delta transaction log is created for the renamed table.
Explanation:
The query uses the CREATE TABLE USING DELTA syntax to create a Delta
Lake table from an existing Parquet file stored in DBFS. The query also uses the
LOCATION keyword to specify the path to the Parquet file as
/mnt/finance_eda_bucket/tx_sales.parquet. By using the LOCATION keyword, the query
creates an external table, which is a table whose data is stored outside of the default warehouse
directory and whose data files are not managed by Databricks (only the table's metadata is
tracked in the metastore). An external table can be
created from an existing directory in a cloud storage system, such as DBFS or S3, that
contains data files in a supported format, such as Parquet or CSV.
The result that will occur after running the second command is that the table reference in
the metastore is updated and no data is changed. The metastore is a service that stores
metadata about tables, such as their schema, location, properties, and partitions. The
metastore allows users to access tables using SQL commands or Spark APIs without
knowing their physical location or format. When renaming an external table using the
ALTER TABLE RENAME TO command, only the table reference in the metastore is
updated with the new name; no data files or directories are moved or changed in the
storage system. The table will still point to the same location and use the same format as
before. However, if renaming a managed table, which is a table whose metadata and data
are both managed by Databricks, both the table reference in the metastore and the data
files in the default warehouse directory are moved and renamed accordingly.
Verified
References: [Databricks Certified Data Engineer Professional], under “Delta Lake” section;
Databricks Documentation, under “ALTER TABLE RENAME TO” section; Databricks
Documentation, under “Metastore” section; Databricks Documentation, under “Managed
and external tables” section.
Which statement describes the correct use of pyspark.sql.functions.broadcast?
A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
Explanation:
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.broadcast.html
The broadcast function in PySpark is used in the context of joins. When you mark a
DataFrame with broadcast, Spark tries to send this DataFrame to all worker nodes so that
it can be joined with another DataFrame without shuffling the larger DataFrame across the
nodes. This is particularly beneficial when the DataFrame is small enough to fit into the
memory of each node. It helps to optimize the join process by reducing the amount of data
that needs to be shuffled across the cluster, which can be a very expensive operation in
terms of computation and time.
The pyspark.sql.functions.broadcast function in PySpark is used to hint to Spark that a
DataFrame is small enough to be broadcast to all worker nodes in the cluster. When this
hint is applied, Spark can perform a broadcast join, where the smaller DataFrame is sent to
each executor only once and joined with the larger DataFrame on each executor. This can
significantly reduce the amount of data shuffled across the network and can improve the
performance of the join operation.
In a broadcast join, the entire smaller DataFrame is sent to each executor, not just a
specific column or a cached version on attached storage. This function is particularly useful
when one of the DataFrames in a join operation is much smaller than the other, and can fit
comfortably in the memory of each executor node.
References:
Databricks Documentation on Broadcast Joins: Databricks Broadcast Join Guide
PySpark API Reference: pyspark.sql.functions.broadcast
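A small self-contained sketch of a broadcast join; the example DataFrames and column names are illustrative only:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# A larger fact-like DataFrame and a small dimension-like DataFrame.
orders = spark.createDataFrame([(1, 10.0), (2, 25.5), (1, 7.25)], ["store_id", "amount"])
stores = spark.createDataFrame([(1, "Downtown"), (2, "Airport")], ["store_id", "store_name"])

# broadcast() hints that `stores` is small enough to ship to every executor,
# so the join avoids shuffling `orders` across the cluster.
joined = orders.join(broadcast(stores), on="store_id", how="left")
joined.show()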
Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code. Which statement describes a main benefit that offsets this additional effort?
A. Improves the quality of your data
B. Validates a complete use case of your application
C. Troubleshooting is easier since all steps are isolated and tested individually
D. Yields faster deployment and execution times
E. Ensures that all steps interact correctly to achieve the desired end result
A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function. Which kind of test does the above line exemplify?
A. Integration
B. Unit
C. Manual
D. Functional
Explanation:
A unit test is designed to verify the correctness of a small, isolated piece of
code, typically a single function. Testing a mathematical function that calculates the area
under a curve is an example of a unit test because it is testing a specific, individual function
to ensure it operates as expected.
References:
Software Testing Fundamentals: Unit Testing
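As a hedged illustration, the pytest-style sketch below tests a hypothetical area-under-a-curve helper in isolation; the integrate function and its trapezoidal implementation are stand-ins, not the code from the question:

import math

def integrate(f, a, b, n=10_000):
    """Approximate the area under f on [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

def test_integrate_constant_function():
    # The area under y = 2 from 0 to 3 should be exactly 6.
    assert math.isclose(integrate(lambda x: 2.0, 0, 3), 6.0, rel_tol=1e-9)

def test_integrate_sine_over_half_period():
    # The integral of sin(x) from 0 to pi is 2; the numerical result should be close.
    assert math.isclose(integrate(math.sin, 0, math.pi), 2.0, rel_tol=1e-6)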
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write. Which consideration will impact the decisions made by the engineer while migrating this workload?
A. All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.
B. Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.
C. Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.
D. Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
Explanation:
In Databricks and Delta Lake, transactions are indeed ACID-compliant, but
this compliance is limited to single table transactions. Delta Lake does not inherently
enforce foreign key constraints, which are a staple in relational database systems for
maintaining referential integrity between tables. This means that when migrating workloads
from a relational database system to Databricks Lakehouse, engineers need to reconsider
how to maintain data integrity and relationships that were previously enforced by foreign
key constraints. Unlike traditional relational databases where foreign key constraints help in
maintaining the consistency across tables, in Databricks Lakehouse, the data engineer has
to manage data consistency and integrity at the application level or through careful design
of ETL processes.
References:
Databricks Documentation on Delta Lake: Delta Lake Guide
Databricks Documentation on ACID Transactions in Delta Lake: ACID
Transactions in Delta Lake
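One hedged way to replace the enforcement the source system provided is an explicit referential-integrity check in the ETL code before writing; the table and column names below are illustrative:

# `spark` is the active SparkSession.
sales = spark.table("silver.sales")
stores = spark.table("silver.stores")

# Fact rows whose foreign key has no match in the dimension table.
orphans = sales.join(stores, on="store_id", how="left_anti")
orphan_count = orphans.count()

if orphan_count > 0:
    raise ValueError(f"{orphan_count} sales rows reference unknown store_id values")

# Each Delta write below is ACID against this single table only.
sales.write.format("delta").mode("append").saveAsTable("gold.validated_sales")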
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone. A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before. Why are the cloned tables no longer working?
A. The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.
B. Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
C. The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.
D. Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.
Explanation:
In Delta Lake, a shallow clone creates a new table by copying the metadata
of the source table without duplicating the data files. When the vacuum command is run on
the source table, it removes old data files that are no longer needed to maintain the
transactional log's integrity, potentially including files referenced by the shallow clone's
metadata. If these files are purged, the shallow cloned tables will reference non-existent
data files, causing them to stop working properly. This highlights the dependency of
shallow clones on the source table's data files and the impact of data management
operations like vacuum on these clones.
References: Databricks documentation on Delta
Lake, particularly the sections on cloning tables (shallow and deep cloning) and data
retention with the vacuum command (https://docs.databricks.com/delta/index.html).
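A hedged sketch of the scenario, with illustrative table names; the shallow clone copies only metadata, so vacuuming the source can purge files the clone still references:

# `spark` is the active SparkSession.
spark.sql("CREATE TABLE IF NOT EXISTS dev.sales_clone SHALLOW CLONE prod.sales")

# The clone reads the source table's data files directly; nothing was copied.
spark.sql("SELECT COUNT(*) FROM dev.sales_clone").show()

# Type 1 SCD updates rewrite files in prod.sales; a later VACUUM removes the old files,
# which the shallow clone's metadata may still point to -- breaking the cloned table.
spark.sql("VACUUM prod.sales RETAIN 168 HOURS")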
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?
A. The job_id is returned in this field.
B. The job_id and the number of times the job has been run are concatenated and returned.
C. The number of times the job definition has been run in the workspace.
D. The globally unique ID of the newly triggered run.
Explanation:
When triggering a job run using the Databricks CLI, the run_id field in the
response represents a globally unique identifier for that particular run of the job. This
run_id is distinct from the job_id. While the job_id identifies the job definition and is
constant across all runs of that job, the run_id is unique to each execution and is used to
track and query the status of that specific job run within the Databricks environment. This
distinction allows users to manage and reference individual executions of a job directly.
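A hedged sketch of triggering a run through the Jobs REST API that backs the CLI; the workspace URL, token, and job_id are placeholders (the CLI equivalent is along the lines of: databricks jobs run-now --job-id 123):

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                         # placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123},                                 # identifies the job definition
)
resp.raise_for_status()

# run_id uniquely identifies this specific run; job_id stays constant across runs,
# while every trigger returns a new run_id.
print(resp.json()["run_id"])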
A junior data engineer has been asked to develop a streaming data pipeline with a grouped
aggregation using DataFrame df. The pipeline needs to calculate the average humidity and
average temperature for each non-overlapping five-minute interval. Incremental state
information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this
task.
A. withWatermark("event_time", "10 minutes")
B. awaitArrival("event_time", "10 minutes")
C. await("event_time + '10 minutes'")
D. slidingWindow("event_time", "10 minutes")
Explanation:
The correct answer is A. withWatermark(“event_time”, “10 minutes”). This is
because the question asks for incremental state information to be maintained for 10
minutes for late-arriving data. The withWatermark method is used to define the watermark
for late data. The watermark is a timestamp column and a threshold that tells the system
how long to wait for late data. In this case, the watermark is set to 10 minutes. The other
options are incorrect because they are not valid methods or syntax for watermarking in
Structured Streaming.
References:
Watermarking: https://docs.databricks.com/spark/latest/structured-streaming/watermarks.html
Windowed aggregations: https://docs.databricks.com/spark/latest/structured-streaming/window-operations.html
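Since the original code block is not reproduced above, the following is a hedged reconstruction of how the completed pipeline might look; the sink table and checkpoint path are illustrative:

from pyspark.sql.functions import window, avg

# df is the streaming DataFrame with the schema given in the question.
agg = (df
    .withWatermark("event_time", "10 minutes")        # keep state for late data up to 10 minutes
    .groupBy(window("event_time", "5 minutes"))       # non-overlapping five-minute intervals
    .agg(avg("humidity").alias("avg_humidity"),
         avg("temp").alias("avg_temp")))

query = (agg.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/humidity_agg/")
    .toTable("gold.humidity_by_window"))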