Thinking of becoming a certified Google Cloud Professional Data Engineer?
After becoming certified as a Google Cloud Professional Architect, I wanted to continue the momentum and earn the Google Cloud Professional Data Engineer certification. It took a month and a half to prepare for the certification while working my full-time job. For those familiar with Google’s Cloud Architect exam, the Data Engineer exam questions are slightly more complex, but the scope is much smaller.
The Data Engineer certification covers many subjects, including Google Cloud Platform data storage, analytics, machine learning, and data processing products. In this article, you’ll learn more about each product and find tips and resources to help you prepare for the certification yourself.
Cloud Storage and Cloud Datastore
As you may know, cloud storage refers to storing data in the cloud rather than locally on a computer. This allows users to access their files from any device connected to the internet. In Google Cloud Storage, users store objects in containers called buckets.
Cloud Datastore is a managed NoSQL document database provided by Google for storing large amounts of structured data for web and mobile applications.
Surprisingly, these products aren’t covered as much in the exam, possibly because they are covered more extensively in the Cloud Architect exam. Just know the basic concepts of each product and when it’s appropriate to use each product.
Cloud SQL
Google’s Cloud SQL is a fully managed database service for MySQL, PostgreSQL, and SQL Server. The service runs on Google’s cloud infrastructure, so users don't need to worry about managing servers, storage, backups, or other technology.
There were a few questions on this product in the exam. If you have practical experience using the product, you should be able to answer any questions.
As with questions on the other data storage products, be sure to know which scenarios suit Cloud SQL and when Datastore, BigQuery, Bigtable, or another product would be the better fit.
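The product choice described above can be sketched as a small rule-of-thumb function. This is only an illustration: the `Workload` fields and the order of the checks are my own assumptions, not an official Google decision tree, and real scenarios in the exam usually combine several requirements.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; the field names are illustrative."""
    needs_sql_transactions: bool = False   # relational schema with ACID transactions
    analytical_queries: bool = False       # large scans and aggregations over history
    low_latency_key_lookups: bool = False  # very high read/write throughput by key
    document_model: bool = False           # hierarchical entities for web/mobile apps


def suggest_product(w: Workload) -> str:
    """Return a rule-of-thumb suggestion (an assumption, not an official rule)."""
    if w.analytical_queries:
        return "BigQuery"
    if w.low_latency_key_lookups:
        return "Bigtable"
    if w.needs_sql_transactions:
        return "Cloud SQL"
    if w.document_model:
        return "Datastore"
    return "Cloud Storage"  # unstructured objects and files


print(suggest_product(Workload(needs_sql_transactions=True)))
print(suggest_product(Workload(analytical_queries=True)))
```

Working through a handful of scenarios this way is a useful check that you can justify each choice, which is what the exam questions tend to ask for.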
Bigtable
Bigtable is Google’s fully managed, scalable NoSQL database service for very large datasets. It’s a wide-column store that distributes data across the nodes of a cluster.
Rows are stored sorted by row key and split into contiguous ranges that are spread across the nodes, rather than each machine holding an unrelated slice of the data. This allows you to scale up by adding more nodes to a cluster.
This product is covered quite extensively in the exam. You should know the basic concepts of the product:
- How to design an appropriate schema and row key.
- Instances, clusters, and nodes.
- Whether Bigtable supports transactions and ACID operations.
- An overview of the cbt command-line tool.
- Schema design for time series data.
- Access control with IAM.
- What the size limits are for Bigtable, such as the cell and row size and the maximum number of tables.
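To make the row key and time series items above concrete, here is a minimal sketch in plain Python (it does not call the Bigtable client library). It assumes a hypothetical key layout of `device_id#reversed_timestamp`; the constant `MAX_TS_MS` and the function name `row_key` are illustrative choices of mine. Because Bigtable keeps rows sorted lexicographically by row key, leading with the device ID keeps each device’s rows together, and reversing the timestamp puts the newest reading first.

```python
# Assumed upper bound for millisecond timestamps, used only to reverse them.
MAX_TS_MS = 9_999_999_999_999


def row_key(device_id: str, ts_ms: int) -> str:
    """Build 'device#reversed_timestamp', zero-padded so string order matches numeric order."""
    reversed_ts = MAX_TS_MS - ts_ms
    return f"{device_id}#{reversed_ts:013d}"


keys = [
    row_key("sensor-a", 1_000),
    row_key("sensor-b", 1_500),
    row_key("sensor-a", 2_000),
]

# Sorting the strings mimics the order in which the rows would be stored.
for key in sorted(keys):
    print(key)
```

A key that starts with the timestamp would instead send all new writes to the same part of the key space, which is the hotspotting problem that the schema design questions tend to probe.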
BigQuery
Google’s BigQuery is a serverless cloud data warehouse for analyzing large amounts of structured and semi-structured data. It provides fast SQL queries over petabytes of data held in its own managed storage, and it can also query external data in sources such as Google Cloud Storage.
The exam covers BigQuery thoroughly, and if you know the product well, you can answer many of the questions. You should know the following details:
- The fundamental capabilities of BigQuery and what problem domains it can solve.
- BigQuery security and the level at which security can be applied (project and dataset level, but, at the time of writing, not table or view level).
- Partitioned tables and wildcard queries (“backtick” syntax).
- Views and their usage scenarios.
- Importing/Exporting data to/from BigQuery.
- The methods available to connect external systems or tools to BigQuery for analytics purposes.
- How the BigQuery billing model works, and who gets billed when queries cross projects and billing account boundaries.
- Access control in IAM.
- BigQuery Best Practices.
- Query plan explanation.
- Legacy vs. Standard SQL.
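As an illustration of the wildcard and “backtick” items in the list above, the sketch below builds a Standard SQL wildcard query as a Python string. The helper name `wildcard_query` is hypothetical, and the example assumes the public `bigquery-public-data.noaa_gsod` dataset, whose tables are named `gsod` followed by a year. Nothing here contacts BigQuery; it only shows the shape of the query text.

```python
def wildcard_query(table_prefix: str, first_suffix: str, last_suffix: str) -> str:
    """Build a Standard SQL query over every table whose name starts with table_prefix."""
    # Illustration only: real code should pass values as query parameters
    # instead of formatting them into the SQL string.
    return (
        "SELECT COUNT(*) AS row_count\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{first_suffix}' AND '{last_suffix}'"
    )


query = wildcard_query("bigquery-public-data.noaa_gsod.gsod", "1940", "1944")
print(query)
```

The backticks around the table reference and the `_TABLE_SUFFIX` pseudo-column are Standard SQL features. Legacy SQL writes table references in square brackets and uses table wildcard functions such as `TABLE_DATE_RANGE` instead, which is one of the differences worth recognizing on sight.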
Pub/Sub
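Pub/Sub is an asynchronous messaging service that decouples the services that publish messages from the services that consume them. It is commonly used to connect microservices and to ingest event streams for streaming analytics.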
The exam contains many questions about this product, but they are all reasonably high-level. So, it’s essential to know the basic concepts (topics, subscriptions, push and pull delivery flows, etc.).
Most importantly, you should know when to introduce Pub/Sub as a messaging layer in architecture for a given set of requirements.
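The basic concepts mentioned above (topics, subscriptions, and pull delivery with acknowledgement) can be sketched with a toy in-memory model. This is not the Pub/Sub client library: the class names `Topic` and `PullSubscription` are hypothetical, and the model ignores ordering, retries, and acknowledgement deadlines. With push delivery, the service would call the subscriber’s endpoint instead of waiting to be asked.

```python
from collections import deque


class Topic:
    """Toy topic: every attached subscription receives its own copy of each message."""

    def __init__(self, name: str):
        self.name = name
        self.subscriptions = []

    def publish(self, data: str) -> None:
        for sub in self.subscriptions:
            sub.backlog.append(data)


class PullSubscription:
    """Toy pull subscription: the subscriber asks for messages, then acknowledges them."""

    def __init__(self, name: str, topic: Topic):
        self.name = name
        self.backlog = deque()
        topic.subscriptions.append(self)

    def pull(self, max_messages: int = 10) -> list:
        # Messages stay in the backlog until acknowledged, so an
        # unacknowledged message is returned again by the next pull.
        return list(self.backlog)[:max_messages]

    def ack(self, data: str) -> None:
        self.backlog.remove(data)


orders = Topic("orders")
billing = PullSubscription("billing", orders)
analytics = PullSubscription("analytics", orders)

orders.publish("order-1")
orders.publish("order-2")

for message in billing.pull():
    billing.ack(message)

print(len(billing.backlog), len(analytics.backlog))  # prints "0 2"
```

The point to take away is the fan-out: the publisher knows nothing about its consumers, and each subscription tracks its own backlog, which is why Pub/Sub works well as a buffer between systems that run at different speeds.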
Apache Hadoop
Apache Hadoop is a software framework for storing and processing large amounts of data across clusters of commodity servers. While it’s technically not part of the Google Cloud Platform, there are questions about this technology in the exam since it’s the underlying technology for Dataproc.
Expect some questions on what HDFS, Hive, Pig, Oozie, or Sqoop are, but basic knowledge of what each technology is and when to use it should be sufficient.
Cloud Dataflow
Cloud Dataflow is a fully managed service for running batch and streaming data processing pipelines. Pipelines are written with the Apache Beam SDKs; Beam is the open-source project that grew out of Google’s Dataflow SDK.
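A pipeline in this model is a chain of transforms applied to collections of elements. The sketch below imitates that shape in plain Python with a word count: an element-wise transform, a group-by-key, and a per-key combine. It deliberately does not use the Apache Beam SDK, and the helper names `flat_map`, `group_by_key`, and `combine_per_key` are my own stand-ins rather than Beam APIs.

```python
from collections import defaultdict


def flat_map(fn, elements):
    """Apply fn to each element and flatten the results (element-wise transform)."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs):
    """Collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(fn, grouped):
    """Reduce each key's list of values to a single result."""
    return {key: fn(values) for key, values in grouped.items()}


lines = ["the quick fox", "the lazy dog"]
words = flat_map(str.split, lines)          # split every line into words
pairs = [(word, 1) for word in words]       # emit (word, 1) for each word
counts = combine_per_key(sum, group_by_key(pairs))

print(counts["the"])  # prints 2
```

In a real pipeline the same steps run in parallel across workers, and the managed service decides how to distribute them, which is the part you do not have to build yourself.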
There are numerous questions about this product. Since it’s a crucial focus for Google regarding data processing on the Google Cloud Platform, it’s not surprising that many questions focus on this topic.
In addition to knowing the basic capabilities of the product, you will also need to understand concepts like:
Cloud Dataproc
Cloud Dataproc is Google’s managed service for running Apache Hadoop and Apache Spark clusters. Clusters of virtual machines can be created quickly, resized, and deleted when a job finishes, and jobs can be submitted from the web console, the gcloud command-line tool, or the API.
There are only a few questions on this product besides the Hadoop questions mentioned above. Just be sure to understand the differences between Dataproc and Dataflow and when to use one or the other.
Dataflow is typically preferred for new development, whereas Dataproc would be required if you migrate existing on-premises Hadoop or Spark infrastructure to Google Cloud Platform without redevelopment efforts.
TensorFlow, Machine Learning, Cloud DataLab
TensorFlow is an open-source software library for machine learning. It provides tools for researchers and developers who want to build and train models, particularly deep learning models such as neural networks.
The exam contains a significant number of questions on this product. You should understand all the basic concepts of designing and developing a machine learning solution on TensorFlow, including topics such as data correlation analysis in Datalab, overfitting, and how to correct it.
Detailed TensorFlow or Cloud machine learning programming knowledge is not required, but a concrete understanding of machine learning design and implementation is essential.
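Overfitting typically shows up as training loss that keeps falling while validation loss starts to rise. One common correction is early stopping, sketched below in plain Python; regularization, dropout, and more training data are other options. The function name, the `patience` parameter, and the loss values are all made up for illustration and do not come from a real training run or from the TensorFlow API.

```python
def best_epoch_with_early_stopping(val_losses, patience=2):
    """Return the epoch to keep, stopping once validation loss has failed
    to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss is no longer improving
    return best_epoch


# Made-up losses: training keeps improving, validation turns around after epoch 2.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.95, 0.70, 0.55, 0.58, 0.64, 0.75]

print(best_epoch_with_early_stopping(val_losses))  # prints 2
```

Being able to read a pair of curves like these and name a sensible fix is the level of understanding the exam expects, rather than the ability to write the training code itself.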
Stackdriver
Stackdriver provides visibility into how your applications behave at scale on Google Cloud and AWS. With Stackdriver, you can monitor application performance, identify bottlenecks, troubleshoot issues, and gain insights into how users interact with your app.
There are many questions about Stackdriver. However, it is more of an “ops” product than a “data engineering” product. Be sure to know the sub-products of Stackdriver, such as Debugger, Error Reporting, Alerting, Trace, and Logging, what they do, and when they should be used.
Data Studio
Google’s Data Studio allows users to create interactive dashboards and reports using data from Google Analytics, BigQuery, and, through connectors, third-party sources such as Facebook Ads and Salesforce.
There were a few questions on this topic, including caching concepts and setting up metrics, dimensions, and filters in a report.
How Do I Prepare?
Here are several courses and resources I recommend:
The Data Engineering, Big Data, and Machine Learning course on Coursera provides students with a comprehensive introduction to data engineering, big data analytics, machine learning, cloud computing, and other related topics.
This specialization covers all significant data science concepts, such as databases, SQL, NoSQL, Hadoop, Spark, MapReduce, R, Python, Java, C++, and others. Students learn how to use Google Cloud Platform for scalable solutions using these technologies.
This course is divided into five modules of increasing complexity. Each module opens with slides and discussion, followed by labs that run through Google’s Codelabs, a free-to-use training platform for hands-on labs in the Google Cloud Platform.
- A Cloud Guru: Google Certified Professional Data Engineer
This course has 20 thorough chapters to prepare you for the exam, including deep dives into machine learning, data analytics with BigQuery, and NoSQL data with Cloud Bigtable.
It’s designed for those who want to learn how to build scalable cloud solutions using Google’s BigQuery database service. This course covers all aspects of building a big data solution, from designing the architecture to deploying the application.
It also includes labs on Cloud Run Data in GCS and Firestore and on running a PySpark job on Cloud Dataproc using Google Cloud Storage.
- Cloud Academy: Google Professional Data Engineer Exam Preparation
The Google Professional Data Engineer exam preparation course covers the exam topics from the basics to more advanced material. It also includes real-world projects that help students understand how to apply the concepts they have learned. Students who complete this course should be well prepared for the Google Professional Data Engineer certification exam.
Exam Guide & Sample Questions
Google has an official exam guide and sample questions for the Professional Data Engineer certification.