Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code. Which statement describes a main benefit that offsets this additional effort?
A. Improves the quality of your data
B. Validates a complete use case of your application
C. Troubleshooting is easier since all steps are isolated and tested individually
D. Yields faster deployment and execution times
E. Ensures that all steps interact correctly to achieve the desired end result
A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function. Which kind of test does this scenario exemplify?
A. Integration
B. Unit
C. Manual
D. Functional
Explanation:
A unit test is designed to verify the correctness of a small, isolated piece of
code, typically a single function. Testing a mathematical function that calculates the area
under a curve is an example of a unit test because it is testing a specific, individual function
to ensure it operates as expected.
References:
Software Testing Fundamentals: Unit Testing
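For illustration, a minimal pytest-style unit test for such a function might look like the sketch below; the name area_under_curve and its midpoint-rule implementation are assumptions for the example, not details from the question.

import pytest

def area_under_curve(f, a, b, n=10_000):
    # Hypothetical implementation: approximate the integral of f on [a, b]
    # using the midpoint rule with n equal-width slices.
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def test_constant_function():
    # A unit test exercises one small, isolated piece of code against a known result.
    assert area_under_curve(lambda x: 2.0, 0.0, 3.0) == pytest.approx(6.0)

def test_linear_function():
    # The integral of f(x) = x over [0, 1] is 0.5.
    assert area_under_curve(lambda x: x, 0.0, 1.0) == pytest.approx(0.5, rel=1e-3)

Each test calls the function directly with inputs whose expected output is known in advance, which is what distinguishes a unit test from an integration or functional test.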
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write. Which consideration will impact the decisions made by the engineer while migrating this workload?
A. All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.
B. Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.
C. Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.
D. Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
Explanation:
In Databricks and Delta Lake, transactions are indeed ACID-compliant, but
this compliance is limited to single table transactions. Delta Lake does not inherently
enforce foreign key constraints, which are a staple in relational database systems for
maintaining referential integrity between tables. This means that when migrating workloads
from a relational database system to Databricks Lakehouse, engineers need to reconsider
how to maintain data integrity and relationships that were previously enforced by foreign
key constraints. Unlike traditional relational databases where foreign key constraints help in
maintaining the consistency across tables, in Databricks Lakehouse, the data engineer has
to manage data consistency and integrity at the application level or through careful design
of ETL processes.
References:
Databricks Documentation on Delta Lake: Delta Lake Guide
Databricks Documentation on ACID Transactions in Delta Lake: ACID
Transactions in Delta Lake
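Because constraint checking moves into the pipeline itself, one common pattern is an explicit referential-integrity check during ETL. The sketch below assumes hypothetical tables fact_orders, dim_customers, and quarantine_orders; none of these names come from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.table("fact_orders")
customers = spark.table("dim_customers")

# Orphan records: fact rows whose customer_id has no match in the dimension table.
# This replaces the foreign key check the source database enforced on write.
orphans = orders.join(customers, "customer_id", "left_anti")

if orphans.count() > 0:
    # Route violating records to a quarantine table for review instead of loading them.
    orphans.write.mode("append").saveAsTable("quarantine_orders")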
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone. A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before. Why are the cloned tables no longer working?
A. The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.
B. Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
C. The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.
D. Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.
Explanation:
In Delta Lake, a shallow clone creates a new table by copying the metadata
of the source table without duplicating the data files. When the vacuum command is run on
the source table, it removes old data files that are no longer needed to maintain the
transactional log's integrity, potentially including files referenced by the shallow clone's
metadata. If these files are purged, the shallow cloned tables will reference non-existent
data files, causing them to stop working properly. This highlights the dependency of
shallow clones on the source table's data files and the impact of data management
operations like vacuum on these clones.
References: Databricks documentation on Delta
Lake, particularly the sections on cloning tables (shallow and deep cloning) and data
retention with the vacuum command (https://docs.databricks.com/delta/index.html).
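For reference, the two clone variants look like the sketch below; the table names are placeholders, since the actual tables are not given in the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shallow clone: copies only the metadata and keeps referencing the source table's
# data files, so VACUUM on the source can remove files the clone still points to.
spark.sql("CREATE OR REPLACE TABLE dev.sales SHALLOW CLONE prod.sales")

# Deep clone: also copies the data files, so it is unaffected by VACUUM on the source.
spark.sql("CREATE OR REPLACE TABLE dev.sales DEEP CLONE prod.sales")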
A junior data engineer has been asked to develop a streaming data pipeline with a grouped
aggregation using DataFrame df. The pipeline needs to calculate the average humidity and
average temperature for each non-overlapping five-minute interval. Incremental state
information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this
task.
A. withWatermark("event_time", "10 minutes")
B. awaitArrival("event_time", "10 minutes")
C. await("event_time + '10 minutes'")
D. slidingWindow("event_time", "10 minutes")
Explanation:
The correct answer is A. withWatermark("event_time", "10 minutes"). The question asks for incremental state information to be maintained for 10 minutes for late-arriving data, and the withWatermark method is how a watermark for late data is defined: it takes a timestamp column and a threshold that tell the system how long to wait for late records. In this case, the watermark is set to 10 minutes on the event_time column. The other options are incorrect because they are not valid methods or syntax for watermarking in Structured Streaming.
References:
Watermarking: https://docs.databricks.com/spark/latest/structured-streaming/watermarks.html
Windowed aggregations: https://docs.databricks.com/spark/latest/structured-streaming/window-operations.html
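A sketch of the completed query is shown below, using the streaming DataFrame df from the question; only the withWatermark call is dictated by the answer, while the output column names and the memory sink are assumptions made for illustration.

from pyspark.sql import functions as F

agg_df = (
    df.withWatermark("event_time", "10 minutes")            # keep state for late-arriving data
      .groupBy(F.window("event_time", "5 minutes"))         # non-overlapping five-minute windows
      .agg(F.avg("humidity").alias("avg_humidity"),
           F.avg("temp").alias("avg_temp"))
)

query = (
    agg_df.writeStream
          .outputMode("append")
          .format("memory")                                 # sink chosen only for illustration
          .queryName("five_minute_averages")
          .start()
)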
An hourly batch job is configured to ingest data files from a cloud object storage container
where each batch represents all records produced by the source system in a given hour.
The batch job to process these records into the Lakehouse is sufficiently delayed to ensure
no late-arriving data is missed. The user_id field represents a unique key for the data,
which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login
BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full
record of all data in the same schema as the source. The next table in the system is named
account_current and is implemented as a Type 1 table representing the most recent value
for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed
hourly, which implementation can be used to efficiently update the described
account_current table as part of each hourly batch job?
A. Use Auto Loader to subscribe to new files in the account history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account current table.
B. Overwrite the account current table with each batch using the results of a query against the account history table grouping by user id and filtering for the max value of last updated.
C. Filter records in account history using the last updated field and the most recent hour processed, as well as the max last login by user id; write a merge statement to update or insert the most recent value for each user id.
D. Use Delta Lake version history to get the difference between the latest version of account history and one version prior, then write these records to account current.
E. Filter records in account history using the last updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
Explanation:
This is the correct answer because it efficiently updates the account current
table with only the most recent value for each user id. The code filters records in account
history using the last updated field and the most recent hour processed, which means it will
only process the latest batch of data. It also filters by the max last login by user id, which
means it will only keep the most recent record for each user id within that batch. Then, it
writes a merge statement to update or insert the most recent value for each user id into
account current, which means it will perform an upsert operation based on the user id
column. Verified References: [Databricks Certified Data Engineer Professional], under
“Delta Lake” section; Databricks Documentation, under “Upsert into a table using merge”
section.
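A sketch of such a merge is shown below; it assumes the filtered, deduplicated batch (one row per user_id) has already been registered as a temporary view named batch_updates, which is an assumption for the example rather than something stated in the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Upsert the latest record per user_id from the hourly batch into the Type 1 table.
spark.sql("""
    MERGE INTO account_current AS t
    USING batch_updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")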
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
A. /jobs/runs/list
B. /jobs/runs/get-output
C. /jobs/runs/get
D. /jobs/get
E. /jobs/list
Explanation:
This is the correct answer because it is the REST API call that can be used
to review the notebooks configured to run as tasks in a multi-task job. The REST API is an
interface that allows programmatically interacting with Databricks resources, such as
clusters, jobs, notebooks, or tables. The REST API uses HTTP methods, such as GET,
POST, PUT, or DELETE, to perform operations on these resources. The /jobs/get endpoint
is a GET method that returns information about a job given its job ID. The information
includes the job settings, such as the name, schedule, timeout, retries, email notifications,
and tasks. The tasks are the units of work that a job executes. A task can be a notebook
task, which runs a notebook with specified parameters; a jar task, which runs a JAR
uploaded to DBFS with specified main class and arguments; or a python task, which runs a
Python file uploaded to DBFS with specified parameters. A multi-task job is a job that has
more than one task configured to run in a specific order or in parallel. By using the /jobs/get
endpoint, one can review the notebooks configured to run as tasks in a multi-task job.
Verified References: [Databricks Certified Data Engineer Professional], under “Databricks
Jobs” section; Databricks Documentation, under “Get” section; Databricks Documentation,
under “JobSettings” section.
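A sketch of calling this endpoint and listing the notebook tasks is shown below; the workspace URL, token, job ID, and the 2.1 API version prefix are placeholders or assumptions, not details from the question.

import requests

host = "https://<databricks-instance>"
token = "<personal-access-token>"

resp = requests.get(
    f"{host}/api/2.1/jobs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123},
)

# Each task in a multi-task job may include a notebook_task with the notebook path.
for task in resp.json().get("settings", {}).get("tasks", []):
    notebook = task.get("notebook_task", {}).get("notebook_path")
    if notebook:
        print(task["task_key"], notebook)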
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds. Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Explanation:
The adjustment that will meet the requirement of processing records in less than 10 seconds is to decrease the trigger interval to 5 seconds, because triggering batches more frequently may prevent records from backing up and large batches from causing spill. Spill occurs when the data being processed exceeds the available memory and must be written to disk, which slows processing and increases execution time. By reducing the trigger interval, the streaming query processes smaller batches of data more quickly, avoiding spill and improving the latency and throughput of the streaming job.
The other options are not correct:
Option A is incorrect because triggering batches more frequently does not allow idle executors to begin processing the next batch while longer-running tasks from previous batches finish. In fact, the opposite is true: more frequent triggers can cause concurrent batches to compete for the same resources, leading to contention and backpressure, which degrades the performance and stability of the streaming job.
Option B is incorrect because increasing the trigger interval to 30 seconds is not a good practice for ensuring no records are dropped. A larger trigger interval means the streaming query processes larger batches of data less frequently, which increases the risk of spill, memory pressure, and timeouts, and also increases latency and reduces the throughput of the streaming job.
Option C is incorrect because the trigger interval can be modified without modifying the checkpoint directory. The checkpoint directory stores the metadata and state of the streaming query, such as the offsets, schema, and configuration. Changing the trigger interval does not affect the state of the streaming query and does not require a new checkpoint directory. However, changing the number of shuffle partitions may affect the state of the streaming query and may require a new checkpoint directory.
Option D is incorrect because using the trigger-once option and configuring a Databricks job to execute the query every 10 seconds does not ensure that all backlogged records are processed with each batch. The trigger-once option means that the streaming query processes all the available data in the source and then stops. However, this does not guarantee that the query will finish within 10 seconds, especially if there are many records in the source. Moreover, configuring a Databricks job to execute the query every 10 seconds may cause overlapping or missed batches, depending on the execution time of the query.
References:
Memory Management Overview, Structured Streaming Performance Tuning
Guide, Checkpointing, Recovery Semantics after Changes in a Streaming Query, Triggers
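The change itself is a one-line adjustment to the trigger configuration, sketched below; the rate source, Delta sink, and paths are placeholders, since the actual job is not shown in the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source standing in for the production stream.
stream_df = spark.readStream.format("rate").load()

query = (
    stream_df.writeStream
             .trigger(processingTime="5 seconds")   # reduced from the current 10 seconds
             .format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/example")
             .start("/tmp/tables/example")
)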
A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task. Which statement explains what is preventing this privilege transfer?
A. Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.
B. The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.
C. Other than the default "admins" group, only individual users can be granted privileges on jobs.
D. A user can only transfer job ownership to a group if they are also a member of that group.
E. Only workspace administrators can grant "Owner" privileges to a group.
Explanation:
The reason why the junior data engineer cannot transfer “Owner” privileges
to the “DevOps” group is that Databricks jobs must have exactly one owner, and the owner
must be an individual user, not a group. A job cannot have more than one owner, and a job
cannot have a group as an owner. The owner of a job is the user who created the job, or
the user who was assigned the ownership by another user. The owner of a job has the
highest level of permission on the job, and can grant or revoke permissions to other users
or groups. However, the owner cannot transfer the ownership to a group, only to another
user. Therefore, the junior data engineer’s attempt to transfer “Owner” privileges to the
“DevOps” group is not possible.
References:
Jobs access control: https://docs.databricks.com/security/access-control/table-acls/index.html
Job permissions: https://docs.databricks.com/security/access-control/table-acls/privileges.html#job-permissions
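If the goal is simply to let the DevOps group manage these jobs, a sketch using the Databricks Permissions API is shown below; the endpoint path, permission level, host, token, and job ID are assumptions or placeholders, and ownership itself would still remain with an individual user.

import requests

host = "https://<databricks-instance>"
token = "<personal-access-token>"

# Grant the group CAN_MANAGE on the job rather than attempting to transfer ownership.
requests.patch(
    f"{host}/api/2.0/permissions/jobs/123",
    headers={"Authorization": f"Bearer {token}"},
    json={"access_control_list": [
        {"group_name": "DevOps", "permission_level": "CAN_MANAGE"}
    ]},
)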
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels. The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams. Which statement exemplifies best practices for implementing this system?
A. Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.
B. Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.
C. Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.
D. Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.
E. Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.
Explanation:
This is the correct answer because it exemplifies best practices for
implementing this system. By isolating tables in separate databases based on data quality
tiers, such as bronze, silver, and gold, the data engineering team can achieve several
benefits. First, they can easily manage permissions for different users and groups through
database ACLs, which allow granting or revoking access to databases, tables, or views.
Second, they can physically separate the default storage locations for managed tables in
each database, which can improve performance and reduce costs. Third, they can provide
a clear and consistent naming convention for the tables in each database, which can
improve discoverability and usability. Verified References: [Databricks Certified Data
Engineer Professional], under “Lakehouse” section; Databricks Documentation, under
“Database object privileges” section.
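A sketch of this tier-per-database layout is shown below; the storage path and group names are placeholders invented for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One database per quality tier, each with its own default storage location.
spark.sql("CREATE DATABASE IF NOT EXISTS silver LOCATION 'abfss://silver@exampleacct.dfs.core.windows.net/db'")

# Database ACLs then scope access per tier and per team.
spark.sql("GRANT USAGE, SELECT ON DATABASE silver TO `machine-learning`")
spark.sql("GRANT USAGE, SELECT, MODIFY ON DATABASE silver TO `data-engineering`")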
A Delta Lake table was created with the below query:
Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?
A. Nothing will occur until a COMMIT command is executed.
B. The table will be removed from the catalog but the data will remain in storage.
C. The table will be removed from the catalog and the data will be deleted.
D. An error will occur because Delta Lake prevents the deletion of production data.
E. Data will be marked as deleted but still recoverable with Time Travel.
Explanation:
When a managed table is dropped in Delta Lake, the table is removed from the catalog and its data is deleted. Because the metastore manages both the table definition and its storage location, dropping the table removes the metadata entry and deletes the data files from the underlying storage.
References:
https://docs.databricks.com/delta/quick-start.html#drop-a-table
https://docs.databricks.com/delta/delta-batch.html#drop-table
What statement is true regarding the retention of job run history?
A. It is retained until you export or delete job run logs
B. It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
C. it is retained for 60 days, during which you can export notebook run results to HTML
D. It is retained for 60 days, after which logs are archived
E. It is retained for 90 days or until the run-id is re-used through custom run configuration