An insurance company has raw data in JSON format that is sent without a predefined schedule through an
Amazon Kinesis Data Firehose delivery stream to an Amazon S3 bucket. An AWS Glue crawler is scheduled
to run every 8 hours to update the schema in the data catalog of the tables stored in the S3 bucket. Data
analysts analyze the data using Apache Spark SQL on Amazon EMR set up with AWS Glue Data Catalog as
the metastore. Data analysts say that, occasionally, the data they receive is stale. A data engineer needs to
provide access to the most up-to-date data.
Which solution meets these requirements?
A.
Create an external schema based on the AWS Glue Data Catalog on the existing Amazon Redshift
cluster to query new data in Amazon S3 with Amazon Redshift Spectrum.
B.
Use Amazon CloudWatch Events with the rate(1 hour) expression to execute the AWS Glue crawler
every hour.
C.
Using the AWS CLI, modify the execution schedule of the AWS Glue crawler from 8 hours to 1 minute.
D.
Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event
notification on the S3 bucket.
Answer: A
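As a rough illustration of the Redshift Spectrum approach in option A, the sketch below runs a CREATE EXTERNAL SCHEMA statement through the Redshift Data API with boto3. The cluster identifier, database, secret ARN, IAM role, and Glue Data Catalog database name are all placeholders, not values from the question.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Map a Glue Data Catalog database to an external schema so Spectrum can
# query the raw JSON files in S3 as they land (placeholder names throughout).
sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS raw_events
FROM DATA CATALOG
DATABASE 'insurance_raw'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
    Sql=sql,
)
print(response["Id"])  # statement ID; poll describe_statement() for completion
```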
A media content company has a streaming playback application. The company wants to collect and analyze
the data to provide near-real-time feedback on playback issues. The company needs to consume this data and
return results within 30 seconds according to the service-level agreement (SLA). The company needs the
consumer to identify playback issues, such as quality during a specified timeframe. The data will be emitted as
JSON and may change schemas over time.
Which solution will allow the company to collect data for processing while meeting these requirements?
A.
Send the data to Amazon Kinesis Data Firehose with delivery to Amazon S3. Configure an S3 event to
trigger an AWS Lambda function to process the data. The Lambda function will consume the data and
process it to identify potential playback issues. Persist the raw data to Amazon S3.
B.
Send the data to Amazon Managed Streaming for Apache Kafka (Amazon MSK) and configure an Amazon
Kinesis Data Analytics for Java application as the consumer. The application will consume the data and
process it to identify potential playback issues. Persist the raw data to Amazon DynamoDB.
C.
Send the data to Amazon Kinesis Data Firehose with delivery to Amazon S3. Configure Amazon S3 to
trigger an event for AWS Lambda to process. The Lambda function will consume the data and process it to identify potential playback issues. Persist the raw data to Amazon DynamoDB.
D.
Send the data to Amazon Kinesis Data Streams and configure an Amazon Kinesis Data Analytics for Java
application as the consumer. The application will consume the data and process it to identify potential
playback issues. Persist the raw data to Amazon S3.
Answer: B
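For context on the ingestion side of option B, here is a minimal sketch that publishes a JSON playback event to an MSK topic. It assumes the third-party kafka-python client; the broker endpoint, topic name, and event fields are illustrative placeholders.

```python
import json
from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Placeholder MSK bootstrap broker and topic name.
producer = KafkaProducer(
    bootstrap_servers=["b-1.example-msk.amazonaws.com:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Schemas may evolve over time, so events are sent as free-form JSON documents.
playback_event = {
    "session_id": "abc-123",
    "timestamp": "2020-01-01T12:00:00Z",
    "bitrate_kbps": 2800,
    "buffering_ms": 450,
}

producer.send("playback-events", value=playback_event)
producer.flush()
```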
A company is migrating its existing on-premises ETL jobs to Amazon EMR. The code consists of a series of jobs written in Java. The company needs to reduce overhead for the system administrators without changing the underlying code. Due to the sensitivity of the data, compliance requires that the company use root device volume encryption on all nodes in the cluster. Corporate standards require that environments be provisioned through AWS CloudFormation when possible.
Which solution satisfies these requirements?
A.
Install open-source Hadoop on Amazon EC2 instances with encrypted root device volumes. Configure
the cluster in the CloudFormation template.
B.
Use a CloudFormation template to launch an EMR cluster. In the configuration section of the cluster,
define a bootstrap action to enable TLS.
C.
Create a custom AMI with encrypted root device volumes. Configure Amazon EMR to use the custom
AMI using the CustomAmiId property in the CloudFormation template.
D.
Use a CloudFormation template to launch an EMR cluster. In the configuration section of the cluster,
define a bootstrap action to encrypt the root device volume of every node.
Answer: C
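A boto3 sketch of the idea behind option C is shown below: copy a base AMI with root volume encryption enabled, then launch an EMR cluster that references it through CustomAmiId (the same property used on an AWS::EMR::Cluster resource in CloudFormation). The source AMI, KMS key, and instance settings are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
emr = boto3.client("emr")

# Copy a base AMI, encrypting the root device volume in the copy.
copy = ec2.copy_image(
    Name="emr-encrypted-root-ami",
    SourceImageId="ami-0123456789abcdef0",   # placeholder base AMI
    SourceRegion="us-east-1",
    Encrypted=True,
    KmsKeyId="alias/emr-root-volume-key",    # placeholder KMS key
)
custom_ami_id = copy["ImageId"]
# (In practice, wait for the copied image to become available before launching.)

# Launch the cluster from the encrypted custom AMI.
emr.run_job_flow(
    Name="etl-cluster",
    ReleaseLabel="emr-5.30.0",
    CustomAmiId=custom_ami_id,
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```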
A retail company is building its data warehouse solution using Amazon Redshift. As a part of that effort, the company is loading hundreds of files into the fact table created in its Amazon Redshift cluster. The company wants the solution to achieve the highest throughput and optimally use cluster resources when loading data into the company’s fact table.
How should the company meet these requirements?
A.
Use multiple COPY commands to load the data into the Amazon Redshift cluster.
B.
Use S3DistCp to load multiple files into the Hadoop Distributed File System (HDFS) and use an HDFS
connector to ingest the data into the Amazon Redshift cluster.
C.
Use LOAD commands equal to the number of Amazon Redshift cluster nodes and load the data in
parallel into each node.
D.
Use a single COPY command to load the data into the Amazon Redshift cluster.
Answer: B
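Several of the options turn on how the Redshift COPY command loads files from Amazon S3. As a hedged sketch, the snippet below issues a single COPY against an S3 prefix through the Redshift Data API so the cluster can split the load across its slices; the table, bucket, IAM role, cluster, and secret names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# One COPY against a prefix lets Redshift load all the files in parallel
# across the cluster's slices (placeholder table, bucket, and role below).
copy_sql = """
COPY sales_fact
FROM 's3://example-bucket/fact-files/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
    Sql=copy_sql,
)
```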
A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files
and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon
Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day.
Which solution will improve the data loading performance?
A.
Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.
B.
Split large .csv files, then use a COPY command to load data into Amazon Redshift.
C.
Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.
D.
Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.
Answer: C
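Independent of which option is chosen, the file-splitting idea from option B is easy to sketch: the snippet below breaks a large daily .csv into several gzip parts and uploads them under one S3 prefix so a single COPY can load them in parallel. The file name, chunk size, bucket, and prefix are assumptions.

```python
import gzip
import boto3

s3 = boto3.client("s3")

SOURCE_FILE = "sales_2020-01-01.csv"     # placeholder local export
BUCKET = "example-bucket"
PREFIX = "sales/dt=2020-01-01/"
LINES_PER_PART = 1_000_000               # tune so the parts are roughly equal in size

part, buffer = 0, []
with open(SOURCE_FILE, "r") as src:
    header = src.readline()              # keep or drop the header to match the COPY options
    for line in src:
        buffer.append(line)
        if len(buffer) >= LINES_PER_PART:
            body = gzip.compress("".join(buffer).encode("utf-8"))
            s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}part-{part:04d}.csv.gz", Body=body)
            part, buffer = part + 1, []
    if buffer:
        body = gzip.compress("".join(buffer).encode("utf-8"))
        s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}part-{part:04d}.csv.gz", Body=body)
```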
A company is building a data lake and needs to ingest data from a relational database that has time-series data. The company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental data only from the source into Amazon S3.
What is the MOST cost-effective approach to meet these requirements?
A.
Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.
B.
Use AWS Glue to connect to the data source using JDBC Drivers. Store the last updated key in an
Amazon DynamoDB table and ingest the data using the updated key as a filter.
C.
Use AWS Glue to connect to the data source using JDBC Drivers and ingest the entire dataset. Use
appropriate Apache Spark libraries to compare the dataset, and find the delta.
D.
Use AWS Glue to connect to the data source using JDBC Drivers and ingest the full data. Use AWS
DataSync to ensure the delta only is written into Amazon S3.
Answer: B
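A minimal sketch of the watermark pattern described in option B follows, assuming a hypothetical DynamoDB table named etl_watermarks and an updated_at column in the source table: the job reads the stored key, builds an incremental query for the JDBC read, and writes back the new high-water mark after a successful load.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
watermarks = dynamodb.Table("etl_watermarks")   # hypothetical tracking table

# 1. Read the last key that was successfully loaded.
item = watermarks.get_item(Key={"job_name": "daily_timeseries_load"}).get("Item", {})
last_key = item.get("last_updated_at", "1970-01-01T00:00:00Z")

# 2. Build the incremental query the Glue JDBC read would use as a filter.
incremental_query = (
    "SELECT * FROM source.timeseries "
    f"WHERE updated_at > '{last_key}'"
)
# ... pass incremental_query to the Glue JDBC read and write the results to S3 ...

# 3. After a successful run, store the new high-water mark.
watermarks.put_item(Item={
    "job_name": "daily_timeseries_load",
    "last_updated_at": "2020-01-02T00:00:00Z",  # placeholder for max(updated_at) just loaded
})
```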
A financial company uses Apache Hive on Amazon EMR for ad-hoc queries. Users are complaining of
sluggish performance. A data analyst notes the following:
Approximately 90% of queries are submitted 1 hour after the market opens.
Hadoop Distributed File System (HDFS) utilization never exceeds 10%.
Which solution would help address the performance issues?
A.
Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance fleet based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch CapacityRemainingGB metric.
B.
Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance fleet based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch
YARNMemoryAvailablePercentage metric.
C.
Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch
CapacityRemainingGB metric.
D.
Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch
YARNMemoryAvailablePercentage metric.
Answer: C
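For reference, attaching an automatic scaling policy to an EMR instance group with boto3 looks roughly like the sketch below; the cluster ID, instance group ID, thresholds, and the CloudWatch metric name are placeholders to adapt to whichever metric the policy is based on.

```python
import boto3

emr = boto3.client("emr")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",            # placeholder cluster ID
    InstanceGroupId="ig-XXXXXXXXXXXX",      # placeholder task instance group ID
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "scale-out-rule",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",  # placeholder metric name
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)
```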
A technology company is creating a dashboard that will visualize and analyze time-sensitive data. The data will come in through Amazon Kinesis Data Firehose with the buffer interval set to 60 seconds. The dashboard must support near-real-time data.
Which visualization solution will meet these requirements?
A.
Select Amazon Elasticsearch Service (Amazon ES) as the endpoint for Kinesis Data Firehose. Set up a Kibana dashboard using the data in Amazon ES with the desired analyses and visualizations.
B.
Select Amazon S3 as the endpoint for Kinesis Data Firehose. Read data into an Amazon SageMaker
Jupyter notebook and carry out the desired analyses and visualizations.
C.
Select Amazon Redshift as the endpoint for Kinesis Data Firehose. Connect Amazon QuickSight with
SPICE to Amazon Redshift to create the desired analyses and visualizations.
D.
Select Amazon S3 as the endpoint for Kinesis Data Firehose. Use AWS Glue to catalog the data and
Amazon Athena to query it. Connect Amazon QuickSight with SPICE to Athena to create the desired
analyses and visualizations.
Answer: A
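A hedged boto3 sketch of option A's delivery stream follows, using the 60-second buffer interval from the question; the stream name, domain ARN, index name, IAM role, and backup bucket are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="playback-metrics",          # placeholder name
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseEsRole",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/dashboard-domain",
        "IndexName": "metrics",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},  # 60-second buffer interval
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseEsRole",
            "BucketARN": "arn:aws:s3:::example-backup-bucket",
        },
    },
)
```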
A company’s marketing team has asked for help in identifying a high-performing long-term storage service for
their data based on the following requirements:
The data size is approximately 32 TB uncompressed.
There is a low volume of single-row inserts each day.
There is a high volume of aggregation queries each day.
Multiple complex joins are performed.
The queries typically involve a small subset of the columns in a table.
Which storage service will provide the MOST performant solution?
A.
Amazon Aurora MySQL
B.
Amazon Redshift
C.
Amazon Neptune
D.
Amazon Elasticsearch Service (Amazon ES)
Answer: B
A company has 1 million scanned documents stored as image files in Amazon S3. The documents contain
typewritten application forms with information including the applicant first name, applicant last name,
application date, application type, and application text. The company has developed a machine learning
algorithm to extract the metadata values from the scanned documents. The company wants to allow internal
data analysts to analyze and find applications using the applicant name, application date, or application text.
The original images should also be downloadable. Cost control is secondary to query performance.
Which solution organizes the images and metadata to drive insights while meeting the requirements?
A.
For each image, use object tags to add the metadata. Use Amazon S3 Select to retrieve the files based on the applicant name and application date.
B.
Index the metadata and the Amazon S3 location of the image file in Amazon Elasticsearch Service.
Allow the data analysts to use Kibana to submit queries to the Elasticsearch cluster.
C.
Index the metadata and the Amazon S3 location of the image file in Amazon Elasticsearch Service.
Allow the data analysts to use Kibana to submit queries to the Elasticsearch cluster.
D.
Store the metadata and the Amazon S3 location of the image files in an Apache Parquet file in Amazon
S3, and define a table in the AWS Glue Data Catalog. Allow data analysts to use Amazon Athena to
submit custom queries.
Answer: A
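Option A's tagging step, sketched with boto3 below; the bucket, key, and metadata values are placeholders (S3 object tags are limited in count and length, which is worth verifying against the real metadata before relying on this approach).

```python
import boto3

s3 = boto3.client("s3")

# Attach the extracted metadata to a scanned image as S3 object tags.
s3.put_object_tagging(
    Bucket="example-applications-bucket",            # placeholder bucket
    Key="scans/application-000001.png",              # placeholder object key
    Tagging={
        "TagSet": [
            {"Key": "applicant_first_name", "Value": "Jane"},
            {"Key": "applicant_last_name", "Value": "Doe"},
            {"Key": "application_date", "Value": "2020-01-15"},
            {"Key": "application_type", "Value": "auto"},
        ]
    },
)
```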
A company wants to use an automatic machine learning (ML) Random Cut Forest (RCF) algorithm to
visualize complex real-world scenarios, such as detecting seasonality and trends, excluding outliers, and
imputing missing values.
The team working on this project is non-technical and is looking for an out-of-the-box solution that will
require the LEAST amount of management overhead.
Which solution will meet these requirements?
A.
Use an AWS Glue ML transform to create a forecast and then use Amazon QuickSight to visualize the
data.
B.
Use Amazon QuickSight to visualize the data and then use ML-powered forecasting to forecast the key business metrics.
C.
Use a pre-built ML AMI from the AWS Marketplace to create forecasts and then use Amazon
QuickSight to visualize the data.
D.
Use calculated fields to create a new forecast and then use Amazon QuickSight to visualize the data.
Answer: A
A smart home automation company must efficiently ingest and process messages from various connected
devices and sensors. The majority of these messages consist of a large number of small files. These
messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream
consumer application. The Amazon S3 message data is then passed through a processing pipeline built on
Amazon EMR running scheduled PySpark jobs.
The data platform team manages data processing and is concerned about the efficiency and cost of
downstream data processing. They want to continue to use PySpark.
Which solution improves the efficiency of the data processing jobs and is well architected?
A.
Send the sensor and device data directly to a Kinesis Data Firehose delivery stream that delivers the data to Amazon S3 with Apache Parquet record format conversion enabled. Use Amazon EMR running
PySpark to process the data in Amazon S3.
B.
Set up an AWS Lambda function with a Python runtime environment. Process individual Kinesis data
stream messages from the connected devices and sensors using Lambda.
C.
Launch an Amazon Redshift cluster. Copy the collected data from Amazon S3 to Amazon Redshift and
move the data processing jobs from Amazon EMR to Amazon Redshift.
D.
Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.
Answer: A
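Whichever ingestion path is chosen, the downstream PySpark jobs benefit from reading fewer, larger Parquet objects; a minimal PySpark sketch under assumed S3 paths and column names is shown below.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-message-processing").getOrCreate()

# Placeholder S3 path to the Parquet-converted sensor messages.
messages = spark.read.parquet("s3://example-bucket/sensor-messages/")

# Example aggregation: message counts per device per hour (assumed columns).
summary = (
    messages
    .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
    .groupBy("device_id", "hour")
    .agg(F.count("*").alias("message_count"))
)

summary.write.mode("overwrite").parquet("s3://example-bucket/sensor-summary/")
```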