Study guide for clearing the “Databricks Certified Associate Developer for Apache Spark 3.0” exam (Python)
I recently cleared the “Databricks Certified Associate Developer for Apache Spark 3.0” exam. Many people have asked me for guidance on how to prepare for this certification.
This exam contains 60 multiple-choice questions and lasts 120 minutes. To pass, you need at least 42 correct answers (70%). For full details of the exam and syllabus, refer to this link: exam
Go through this link for exam prep: Databricks Certified Associate Developer for Apache Spark 3.0
I started with the book Spark: The Definitive Guide. Read chapters 1–11 and 14–19 thoroughly and make sure you understand them. Practice the code given in the book, especially Part II (Structured APIs: DataFrames, SQL, and Datasets), which accounts for the largest share of the exam questions (~70%). This book will really help you crack the exam; it is the only book I used.
If you want, you can also refer to chapters 1 to 7 of Learning Spark, 2nd edition.
Apart from this, I read various articles and blogs on Spark topics to clear up my concepts on:
- Partition and coalesce
- Cache and persist
- Shuffling
- Wide and narrow transformations
- Catalyst optimizer
- Job, stage, task, slot concepts
- Executor, worker, driver, cluster manager
- Adaptive Query Execution and Dynamic Partition Pruning, and the configuration to enable them
- Deployment modes
- How Spark executes a job
- Broadcast variables and accumulators
- Spark SQL and DataFrame APIs, especially withColumn()
- Reading and writing files: know the different parameters used
- Date and time functions
- Performance tuning and optimization
- These blogs are also very informative: link1 and link2
- This is also helpful: justEnough Spark
Some tips and tricks that might help you crack this exam:
- The answer choices for a given question can be very confusing, especially for the DataFrame APIs. Practice the code as much as you can, so you know which syntax produces the desired output and which produces an error.
- Knowing the Spark concepts only briefly will not work in most cases; the questions are not that straightforward. Try to understand how each concept is actually used in Spark. For example, we know that coalesce is used to efficiently decrease the number of partitions, but do you know how coalesce behaves when it is asked to increase the number of partitions? I recommend reading as many articles as you can to clear up your understanding of each Spark topic.
- Be thorough with the Spark SQL and DataFrame functions and the order of their parameters. The answer choices often differ only in these details, which can be really confusing.
- You are given a notepad and the Spark documentation. Note that there is no search button in the documentation. I recommend getting familiar with the documentation beforehand so that you know where the various topics are located. It helped me a lot when I was reviewing my answers in the leftover time.
- You will usually be able to complete all 60 questions in ~90 minutes. In the leftover time, while reviewing questions, try to go through the Spark documentation from the start; you will come across many topics relevant to the questions in your exam. One thing to note: complete all the questions first and only then refer to the documentation. Don't refer to it in between if you are not very familiar with it, or it will consume a lot of your time.
If you have any questions, please feel free to connect with me on LinkedIn: Shruti Bhawsar
I hope it helps you!
All the best and happy learning :)