
Free Questions on AWS Data Engineer Associate

AWS Certified Data Engineer - Associate Exam Dumps

Domain Weightage
Domain 1: Data Ingestion and Transformation (34%)
Domain 2: Data Store Management (26%)
Domain 3: Data Operations and Support (22%)
Domain 4: Data Security and Governance (18%)

1. A company is preparing to use a provisioned Amazon EMR cluster to run Apache Spark jobs for big data analysis, with high reliability as a priority. The big data team wants to follow best practices for managing cost-efficient, long-running workloads on Amazon EMR while maintaining the company's current level of performance.

Which combination of resources will most efficiently meet these requirements? (Select TWO.)

2. A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR.
Which solution will meet these requirements with the LEAST operational overhead?

3. A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data engineer has set up the necessary AWS Glue connection details and an associated IAM role. However, when the data engineer attempts to run the job, an error message indicates problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.
Which solution will meet this requirement?

4. A company is storing data in an Amazon S3 bucket. The company is in the process of adopting a new data lifecycle and retention policy. The policy is defined as follows:

  • Any newly created data must be available online and will occasionally need to be analyzed with SQL.
  • Data older than 3 years must be securely stored and made available when needed for compliance evaluation within 12 hours.
  • Data older than 10 years must be securely deleted.

A data engineer must configure a solution that ensures the data is stored cost-effectively according to the lifecycle and retention policy.

Which solution will meet these requirements?
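
For context, the retention policy above maps onto an S3 lifecycle configuration. The boto3 sketch below is illustrative only: the bucket name is hypothetical, and it assumes an archive tier such as S3 Glacier Flexible Retrieval, whose standard retrievals complete within 12 hours.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; the day counts approximate 3 and 10 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "retention-policy",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                # After ~3 years, move to an archive tier that can be
                # restored within 12 hours.
                "Transitions": [{"Days": 1095, "StorageClass": "GLACIER"}],
                # After ~10 years, delete the objects.
                "Expiration": {"Days": 3650},
            }
        ]
    },
)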

5. A finance company is storing paid invoices in an Amazon S3 bucket. After the invoices are uploaded, an AWS Lambda function uses Amazon Textract to process the PDF data and persist the data to Amazon DynamoDB. Currently, the Lambda execution role has the following S3 permission:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleStmt",
      "Action": ["s3:*"],
      "Effect": "Allow",
      "Resource": ["*"]
    }
  ]
}

The company wants to correct the role permissions specific to Amazon S3 according to security best practices.

Which solution will meet these requirements?
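
For context, security best practice here is least privilege: scope the statement to only the action and resource the function needs. The boto3 sketch below is illustrative; the role name, policy name, bucket name, and the choice of s3:GetObject as the required action are all assumptions.

import json

import boto3

iam = boto3.client("iam")

# Hypothetical names. The statement grants only read access to the
# invoice bucket instead of s3:* on all resources.
least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ExampleStmt",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::invoice-bucket/*"],
        }
    ],
}

iam.put_role_policy(
    RoleName="invoice-processor-lambda-role",
    PolicyName="InvoiceBucketReadOnly",
    PolicyDocument=json.dumps(least_privilege_policy),
)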

6. A financial institution needs to ingest large volumes of data from various sources, including stock market feeds and customer transactions, for analysis. What type of data ingestion method would be most appropriate in this case?

7. How can you improve the replayability of data ingestion pipelines in AWS?

8. An insurance company is using vehicle insurance data to build a risk analysis machine learning (ML) model. The data contains personally identifiable information (PII). The ML model should not use the PII. Regulations also require the data to be encrypted with an AWS Key Management Service (AWS KMS) key. A data engineer must select the appropriate services to deliver insurance data for use with the ML model.

Which combination of steps will meet these requirements in the MOST cost-effective manner? (Select TWO.)

9. Which of the following scenarios best represents a stateful data transaction in an AWS environment?

10. A studio wants to enhance its media content recommendation system based on user behavior and preferences by integrating insights from third-party datasets into its existing analytics platform. Which option will incorporate the third-party datasets with the LEAST operational overhead?

11. Your company has a diverse set of data sources in different formats stored in Amazon S3, and you want to create a unified catalog capturing metadata and schema details. Which AWS service can automate this process without the need for manual schema definition?
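
As background for this question, an AWS Glue crawler can infer schemas from mixed formats in S3 and register them in the AWS Glue Data Catalog without manual schema definition. A minimal boto3 sketch; the crawler name, role ARN, database, path, and schedule are hypothetical.

import boto3

glue = boto3.client("glue")

# A crawler infers the schema of each format it finds under the S3
# path and registers tables in the Data Catalog.
glue.create_crawler(
    Name="s3-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="data_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-bucket/raw/"}]},
    Schedule="cron(0 2 * * ? *)",  # recrawl nightly to catch schema changes
)
glue.start_crawler(Name="s3-data-crawler")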

12. A company is running an Amazon Redshift cluster. A data engineer must design a solution that gives the company the ability to run analyses in a separate Amazon Redshift test environment that uses the data from the main Redshift cluster. The second cluster is expected to be used for only 2 hours every 2 weeks as part of the new testing process.

Which solution will meet these requirements in the MOST cost-effective manner?

13. An Amazon Kinesis application is trying to read data from a Kinesis data stream. However, the read data call is rejected. The following error message is displayed: ProvisionedThroughputExceededException.

Which combination of steps will resolve the error? (Select TWO.)
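
As background, a common consumer-side remediation for this error is to retry reads with exponential backoff; the usual stream-side fix is to increase the shard count. A Python sketch of the backoff approach, assuming a shard iterator has already been obtained:

import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def get_records_with_backoff(shard_iterator, max_retries=5):
    # Read from a shard, backing off exponentially when the provisioned
    # throughput limit is exceeded.
    for attempt in range(max_retries):
        try:
            return kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
        except ClientError as err:
            if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
                time.sleep(2 ** attempt)
            else:
                raise
    raise RuntimeError("Stream still throttled after retries")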

14. ABC Corporation, a leading financial institution, wants to modernize its data infrastructure on AWS to strengthen its analytics capabilities and regulatory compliance. The corporation collects large datasets from diverse sources, including market transactions, customer interactions, and regulatory filings. It must establish a framework for configuring data pipelines that accommodates scheduling requirements and interdependencies within the data workflows.
Which combination of AWS services offers the most suitable solution?

15. A financial institution is looking to improve the performance and cost efficiency of its data analytics platform. The institution has a massive amount of historical financial transaction data stored in Avro format and aims to transform this data into Apache Parquet format to optimize query performance and reduce storage costs. The institution's compliance requirements dictate that the data transformation process must be auditable and trackable.
Considering the stringent compliance requirements and the need for efficient data transformation, which combination of AWS services can the financial institution utilize to achieve the desired data format transformation and compliance adherence?

16. A data analyst needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?

17. A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options. The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.
Which extract, transform, and load (ETL) service will meet these requirements?

18. A data engineer is designing an application that will add data for transformation to an Amazon Simple Queue Service (Amazon SQS) queue. A microservice will receive messages from the queue. The data engineer wants to ensure message persistence.

Which events can remove messages from an SQS queue? (Select THREE.)
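
As background for this question, receiving a message does not remove it from an SQS queue; the message is only hidden for the visibility timeout and is removed when a consumer explicitly deletes it (or when the retention period expires). A short boto3 sketch; the queue URL and the process function are hypothetical.

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/transform-queue"

def process(body):
    # Hypothetical transformation step.
    print("processing", body)

# Receiving only hides the message for the visibility timeout.
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)

for message in response.get("Messages", []):
    process(message["Body"])
    # The explicit delete is what actually removes the message.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])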

19. A company operates a frontend ReactJS website that calls REST APIs through Amazon API Gateway. A data engineer is tasked with developing a Python script that will be invoked occasionally through API Gateway and must return its results to API Gateway.

Which solution presents the LEAST operational overhead to meet these specifications?

20. A company is running a cloud-based software application in an Amazon EC2 instance backed by an Amazon RDS for Microsoft SQL Server database. The application collects, processes, and stores confidential information and records in the database. The company wants to eliminate the risk of credential exposure.

Which solution will meet this requirement?
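
As background, the usual pattern for eliminating credential exposure is to fetch database credentials at run time from a secrets store instead of embedding them in code or configuration. A minimal boto3 sketch, assuming a hypothetical secret named prod/app/sqlserver that stores a JSON username and password:

import json

import boto3

secrets = boto3.client("secretsmanager")

# Fetch credentials at run time; nothing sensitive lives in the code.
secret_value = secrets.get_secret_value(SecretId="prod/app/sqlserver")
credentials = json.loads(secret_value["SecretString"])

connection_string = (
    f"Server=example-host;Database=records;"
    f"User Id={credentials['username']};Password={credentials['password']};"
)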

21. A Cloud Data Engineering Team is implementing a system for real-time data ingestion through an API. The architecture needs to include data transformation before storage. The system must handle large files and store them efficiently post-transformation. The team is focused on using a serverless architecture on AWS, with an emphasis on Infrastructure as Code (IaC) for standardized and repeatable deployments across various environments.
Which combination of actions should the Cloud Data Engineering Team take to implement IaC for serverless deployments of data ingestion and transformation pipelines? (Select THREE.)

22. A data engineer maintains custom Python scripts utilized by numerous AWS Lambda functions for data formatting processes. Currently, whenever modifications are made to the Python scripts, the data engineer manually updates each Lambda function, which is time-consuming. The data engineer seeks a more streamlined approach for updating the Lambda functions. Which solution addresses this requirement?

23. A global manufacturing company is modernizing its data architecture and adopting cloud-based solutions for data processing and analytics. As part of this transformation, the company needs to implement intermediate data staging locations to efficiently manage and process large volumes of data from multiple sources before loading it into a data warehouse for analysis. The company's data engineering team is tasked with designing a scalable and fault-tolerant solution using AWS services.
Which AWS services will meet these requirements?

24. A company stores datasets in JSON format and .csv format in an Amazon S3 bucket. The company has Amazon RDS for Microsoft SQL Server databases, Amazon DynamoDB tables that are in provisioned capacity mode, and an Amazon Redshift cluster. A data engineering team must develop a solution that will give data scientists the ability to query all data sources by using syntax similar to SQL.

Which solution will meet these requirements with the LEAST operational overhead?

25. In a data engineering pipeline, a company is using multiple applications and teams to access a shared Amazon S3 bucket. To streamline access and simplify permissions management for these different entities, which S3 feature should the company utilize?

26. You are developing a serverless application on AWS that requires orchestrating a complex workflow involving multiple AWS services. The workflow includes processing data from an S3 bucket, performing data transformations using AWS Lambda functions, and triggering subsequent steps based on the output. Which service provides a fully managed solution for orchestrating this serverless workflow?

27. A healthcare organization needs to perform complex analytical queries on patient records stored in a scalable data warehouse. The solution should offer fast query performance and seamless integration with existing BI tools. Which AWS service should they consider?

28. A multinational corporation with regional offices worldwide utilizes Amazon Redshift for its data warehousing needs. The company needs to ensure that critical tables are not accessed by multiple users simultaneously to prevent data corruption and maintain data consistency. Which locking mechanism in Amazon Redshift would best address this requirement?

29. A consultant company uses a cloud-based time-tracking system to track employee work hours. The company has thousands of employees who are globally distributed. The time-tracking system provides a REST API to obtain the records from the previous day in CSV format. The company has an on-premises cron job that is scheduled to run a Python program each morning at the same time. The program saves the data into an Amazon S3 bucket that serves as a data lake. A data engineer must provide a solution with AWS services that reuses the same Python code and cron configuration.

Which combination of steps will meet these requirements with the LEAST operational overhead? (Select TWO.)

30. An ecommerce company runs several applications on AWS. The company wants to design a centralized streaming log ingestion solution. The solution needs to be able to convert the log files to Apache Parquet format. Then, the solution must store the log files in Amazon S3. The number of log files being created varies throughout the day. A data engineer must configure a solution that ensures the log files are delivered in near real time.

Which solution will meet these requirements with the LEAST operational overhead?

31. An ecommerce company is running an application on AWS. The application sources recent data from tables in Amazon Redshift. Data that is older than 1 year is accessible in Amazon S3. Recently, a new report has been written in SQL. The report needs to compare a few columns from the current year sales table with the same columns from tables with sales data from previous years. The report runs slowly, with poor performance and long wait times to get results.

A data engineer must optimize the back-end storage to accelerate the query.

Which solution will meet these requirements MOST efficiently?

32. A company is collecting data that is generated by its users for analysis by using an Amazon S3 data lake. Some of the data being collected and stored in Amazon S3 includes personally identifiable information (PII).

The company wants a data engineer to design an automated solution to identify new and existing data that contains PII that must be masked before analysis is performed. Additionally, the data engineer must provide an overview of the data that is identified. The task of masking the data will be handled by an application already created in the AWS account. The data engineer needs to design a solution that can invoke this application in real time when PII is found.

Which solution will meet these requirements with the LEAST operational overhead?

33. A financial institution intends to implement a data mesh framework. The framework should facilitate centralized data governance, data analysis, and data access control. The organization has opted to use AWS Glue for managing data catalogs and executing extract, transform, and load (ETL) processes. Which pair of AWS services is suitable for realizing the data mesh framework? (Select TWO.)

34. A data engineer needs to store configuration parameters for different data processing workflows, such as Spark job configurations and database connection details. Which feature of AWS Systems Manager Parameter Store should the engineer use to maintain organization and structure?
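
As background, Parameter Store supports hierarchical parameter names, which keep related configuration grouped by workflow and queryable (and IAM-scopable) by path. A short boto3 sketch with hypothetical paths and values:

import boto3

ssm = boto3.client("ssm")

# One subtree per workflow keeps parameters organized.
ssm.put_parameter(Name="/workflows/spark/executor-memory", Value="4g",
                  Type="String", Overwrite=True)
ssm.put_parameter(Name="/workflows/db/connection-host",
                  Value="db.example.internal",
                  Type="String", Overwrite=True)

# Retrieve everything under one workflow's subtree in a single call.
spark_params = ssm.get_parameters_by_path(Path="/workflows/spark", Recursive=True)
for param in spark_params["Parameters"]:
    print(param["Name"], param["Value"])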

35. A data engineer needs to analyze API call patterns to identify potential optimization opportunities within their AWS data processing infrastructure. Which AWS CloudTrail feature can provide insights into API usage trends and patterns?

36. A company runs its workloads in a production AWS account. The security team has established a separate security AWS account to store and analyze security logs sourced from Amazon CloudWatch Logs in the production account.
The company needs to deliver the security logs to the security AWS account by using Amazon Kinesis Data Streams.
Which solution will meet these requirements?

37. A data analytics company operates several AWS accounts across different countries. The company needs to ensure consistent configuration compliance across all accounts and regions while minimizing administrative overhead. Which approach should the company consider?

38. During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.
A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials.
Which combination of steps should the data engineer take to meet these requirements? (Choose two.)

39. A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.
The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata.
Which solution will meet these requirements with the LEAST operational overhead?

40. A financial institution requires a message queuing service to decouple various components of its microservices architecture. The system needs to handle variable message volumes with minimal latency. Which AWS service should they use?

41. A finance company has developed a machine learning (ML) model to enhance its investment strategy. The model uses various sources of data about stock, bond, and commodities markets. The model has been approved for production. A data engineer must ensure that the data being used to run ML decisions is accurate, complete, and trustworthy. The data engineer must automate the data preparation for the model's production deployment.

Which solution will meet these requirements?

42. As a Data Engineering Consultant, you are implementing a data processing solution using AWS Glue, which leverages Apache Spark under the hood. You need to explain to your team how AWS Glue, using Apache Spark, manages data processing jobs differently than a standalone Apache Spark environment.
Which of the following points would you emphasize as a key difference in the AWS Glue implementation of Spark?

43. A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column. Which solution will MOST speed up the Athena query performance?
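
As background, converting uncompressed CSV to a compressed columnar format such as Apache Parquet is the standard way to speed up column-selective Athena queries, because Athena then scans only the referenced column. A hedged sketch using a CTAS statement via boto3; the database, table names, and output location are hypothetical.

import boto3

athena = boto3.client("athena")

# Rewrite the CSV-backed table as compressed, columnar Parquet.
ctas = """
CREATE TABLE analytics.events_parquet
WITH (format = 'PARQUET', parquet_compression = 'SNAPPY')
AS SELECT * FROM analytics.events_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)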

44. You are tasked with designing a comprehensive serverless workflow for a real-time analytics platform that processes streaming data from various sources. The platform needs to ingest data, perform near real-time analysis, store aggregated results, and trigger alerts based on predefined conditions. You have chosen AWS services to architect this solution.
To perform real-time analytics on the streaming data, which AWS service can be leveraged to process and analyze data as it arrives?

45. A company is using an Amazon S3 data lake. The company ingests data into the data lake by using Amazon Kinesis Data Streams. The company reads and processes the incoming data from the stream by using AWS Lambda. The data being ingested has highly variable and unpredictable volume. Currently, the IteratorAge metric is high at peak times when a high volume of data is being posted to the stream. A data engineer must design a solution to increase performance when reading Kinesis Data Streams with Lambda.

Which combination of steps will meet these requirements? (Select THREE.)
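
As background, several of the usual remedies for a high IteratorAge are configured on the Lambda event source mapping. An illustrative boto3 sketch with hypothetical ARNs and names; raising ParallelizationFactor allows up to 10 concurrent batches per shard.

import boto3

lambda_client = boto3.client("lambda")

# More parallel batches per shard is one way to drain a backlog
# faster at peak ingest times.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/ingest-stream",
    FunctionName="stream-processor",
    StartingPosition="LATEST",
    BatchSize=500,
    ParallelizationFactor=10,
)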

46. A company is building a real-time monitoring system for analyzing web traffic logs. The logs are continuously generated and stored in an Amazon Kinesis Data Firehose delivery stream. The company wants to perform real-time analysis on the incoming logs and extract insights using custom logic.
Which architecture pattern should the company adopt to achieve this requirement?

47. A data engineer is designing an application that will transform data in containers managed by Amazon Elastic Kubernetes Service (Amazon EKS). The containers run on Amazon EC2 nodes. Each containerized application will transform independent datasets and then store the data in a data lake. Data does not need to be shared to other containers. The data engineer must decide where to store data before transformation is complete.

Which solution will meet these requirements with the LOWEST latency?

48. A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
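
For context, the write itself is straightforward; the question turns on what should invoke the function. A minimal handler sketch, with a hypothetical table name and event shape:

from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
status_table = dynamodb.Table("redshift-load-status")  # hypothetical name

def handler(event, context):
    # Record one table's load status; the event shape is assumed.
    status_table.put_item(
        Item={
            "table_name": event["table_name"],
            "load_date": datetime.now(timezone.utc).isoformat(),
            "status": event["status"],  # e.g. "LOADED" or "PENDING"
        }
    )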

49. A company has data in an on-premises NFS file share. The company plans to migrate to AWS. The company uses the data for data analysis. The company has written AWS Lambda functions to analyze the data. The company wants to continue to use NFS for the file system that Lambda accesses. The data must be shared across all concurrently running Lambda functions.

Which solution should the company use for this data migration?

50. A company is running an Amazon Redshift data warehouse on AWS. The company has recently started using a software as a service (SaaS) sales application that is supported by several AWS services. The company wants to transfer some of the data in the SaaS application to Amazon Redshift for reporting purposes.

A data engineer must configure a solution that can continuously send data from the SaaS application to Amazon Redshift.

Which solution will meet these requirements with the LEAST operational overhead?

51. A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?

52. A data engineer has created a new account to deploy an AWS Glue extract, transform, and load (ETL) pipeline. The pipeline jobs need to ingest raw data from a source Amazon S3 bucket. Then, the pipeline jobs write the transformed data to a destination S3 bucket in the same account. The data engineer has written an IAM policy with permissions for AWS Glue to access the source S3 bucket and destination S3 bucket. The data engineer needs to grant the permissions in the IAM policy to AWS Glue to run the ETL pipeline.

Which solution will meet these requirements?

53. A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.
Which solution will meet this requirement?

54. A Data Engineering Team at a financial services company is developing a data API to serve real-time, user-specific transactional data from an Amazon RDS for PostgreSQL database to their mobile banking application.
The data is highly dynamic, with frequent reads and writes. The API must offer low latency and high availability, and be capable of scaling automatically to handle peak loads during business hours.
Given these requirements, which architecture should the team implement?

55. A company ingests data into an Amazon S3 data lake from multiple operational sources. The company then ingests the data into Amazon Redshift for a business analysis team to analyze. The business analysis team requires access to only the last 3 months of customer data.

Additionally, once a year, the company runs a detailed analysis of the past year's data to compare the overall results of the previous 12 months. After the analysis and comparison, the data is no longer accessed. However, the data must be kept after 12 months for compliance reasons.

Which solution will meet these requirements in the MOST cost-effective manner?

56. A company has deployed a data pipeline that uses AWS Glue to process records. The records include a JSON-formatted event and can sometimes include base64-encoded images. The AWS Glue job is configured with 10 data processing units (DPUs). However, the AWS Glue job regularly scales to several hundred DPUs and can take a long time to run.

A data engineer must monitor the data pipeline to determine the appropriate DPU capacity.

Which solution will meet these requirements?

57. A data engineer must deploy a centralized metadata storage solution on AWS. The solution needs to be reliable and scalable. The solution needs to ensure that fine-grained permissions can be controlled at the database, table, column, row, and cell levels.

Which solution will meet these requirements with the LEAST operational overhead?

58. At a healthcare firm, a data engineer wants to schedule a workflow that executes a set of AWS Glue jobs daily, without requiring the jobs to start or finish at precise times. Which solution provides the most cost-effective method for running the Glue jobs?

59. Which of the following scenarios best exemplifies an event-driven architecture on AWS?

60. A company uses AWS Step Functions to orchestrate a data pipeline. The pipeline consists of Amazon EMR jobs that ingest data from data sources and store the data in an Amazon S3 bucket. The pipeline also includes EMR jobs that load the data to Amazon Redshift.
The company's cloud infrastructure team manually built a Step Functions state machine. The cloud infrastructure team launched an EMR cluster into a VPC to support the EMR jobs. However, the deployed Step Functions state machine is not able to run the EMR jobs.

Which combination of steps should the company take to identify the reason the Step Functions state machine is not able to run the EMR jobs? (Choose two.)

61. A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends. The company must ensure that the application performs consistently during peak usage times.
Which solution will meet these requirements in the MOST cost-effective way?
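
As background, a predictable weekly pattern like this maps naturally onto scheduled scaling through Application Auto Scaling. A hedged boto3 sketch: the table name and capacity numbers are hypothetical, and it assumes the table is already registered as a scalable target.

import boto3

autoscaling = boto3.client("application-autoscaling")

# Raise provisioned write capacity ahead of the Monday-morning spike;
# a second, mirror-image action could scale down for the weekend.
autoscaling.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ResourceId="table/app-data",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    ScheduledActionName="monday-morning-scale-up",
    Schedule="cron(0 6 ? * MON *)",
    ScalableTargetAction={"MinCapacity": 500, "MaxCapacity": 1000},
)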

62. A healthcare provider organization has a large amount of patient data stored in an AWS-based data warehouse. The organization wants to make this data available to other systems within the organization and to third-party applications through a modern API interface. Additionally, the organization's compliance requirements mandate that access to patient data must be secure and auditable.
Which combination of AWS services will meet this requirement?

63. A multinational retail corporation is looking to modernize its data processing infrastructure by implementing ETL (Extract, Transform, Load) pipelines on AWS to handle a variety of data sources, including transactional data from its online stores, customer interaction logs from its mobile applications, and inventory data from its physical stores. The company's primary goal is to build a scalable and cost-effective solution that can handle large volumes of data while providing real-time insights to support decision-making processes.

Which AWS services and architecture would best suit the company's requirements?

64. A startup uses Amazon Athena to run one-time queries on data stored in Amazon S3. The startup has multiple use cases and needs to enforce permission controls that segregate query processes and access to query history among users, teams, and applications within the same AWS account. Which solution aligns with these requirements?

65. A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data. Which solution will meet these requirements with the LEAST operational overhead?
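
As background, for a one-time read of a single column from a Parquet object, S3 Select can project the column directly out of the object with no infrastructure to manage. A hedged boto3 sketch; the bucket, key, and column names are hypothetical.

import boto3

s3 = boto3.client("s3")

# Project one column straight out of the Parquet object.
response = s3.select_object_content(
    Bucket="legacy-data-bucket",
    Key="exports/records.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.customer_id FROM s3object s",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)

for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))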
