Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit https://www.nerdfortech.org/.

Apache Spark — Multi-part Series: Introduction

This is a new series of blogs aimed at developers in the engineering and analytical space who want to build and expand their knowledge of the inner workings of the Spark APIs (Application Programming Interfaces). A personal goal of the series is to enrich my current understanding of Apache Spark while sharing my learnings and resources in an easily digestible way with you, the reader. So no matter what your current level of understanding of distributed computing is, there will be content and material for you to use in your journey with Apache Spark. This series is aimed at Data Scientists, Data Engineers or anyone else who is new to Spark. Even if you have some knowledge, hopefully I can help to fill some of those blind spots!

During these tough times (COVID-19), I am going to endeavour to release a new section every few days, so that I can continue to build on and expand my skillset and hopefully yours too. I have been using Apache Spark on and off for around three and a half years, both in working environments and in my personal projects.

Series Coverage:

This series will cover almost all aspects of Apache Spark using personal knowledge as well as reputable resources written by the creators of Apache Spark. The areas that will be covered in reasonable depth will include but may not be limited to:

All about Spark:

  • What is Apache Spark?
  • Spark Architecture
  • Spark Ecosystem and Languages
  • Spark APIs

In depth functionality:

  • Apache Spark and Koalas
  • Spark Data Types and Ecosystem Variables
  • Spark Stream Processing

Analytics and Machine Learning:

  • Overview of Analytics and Machine Learning Using Spark
  • Preprocessing and Feature Engineering in Spark
  • ML Modelling in Spark
  • Graph Analytics in Spark
  • Deep Learning in Spark
  • Apache Spark and MLflow

A vast amount of my knowledge and experience with Apache Spark has come from books and e-learning provided by the creators of Apache Spark, not to mention hands-on training by a Senior Instructor from Databricks. Two books I have found invaluable during my learning process:

Another key way to hone your Spark skills is to practise, either locally on your machine or with a cloud-based solution such as Azure or AWS. Alternatively, one free place to do this is the Databricks Community Edition, which can be found below:

https://community.cloud.databricks.com/

This environment allows you to create your own Spark cluster, create notebooks, upload data and try anything Spark related! There are some limitations, but for a free service it's fantastic! The Community Edition even allows you to try out MLflow, one of Databricks’ open-source projects, first released in 2018.

There are likely to be code snippets embedded within each of my learning sections, which you will be able to run in a Spark environment using your own datasets. If you don’t have your own datasets, there is an abundance of them available on the Kaggle website. All you need to do is sign up for free to access them for life.

Alternatively, the Databricks free community edition has a number of datasets mounted to the environment on cluster creation. You can run the code below inside a Databricks notebook to retrieve a list of datasets.

%py
display(dbutils.fs.ls("/databricks-datasets"))

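You can also print out a README.md file. The snippet below reads the top-level one; for an individual dataset, point the path at that dataset's folder, provided it ships with a README.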

%py
# Read the top-level README through the /dbfs local file mount and print it.
with open("/dbfs/databricks-datasets/README.md") as f:
    x = f.read()

print(x)
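If you would rather see the README files for several datasets at once, the same idea can be wrapped in a small helper. This is only a rough sketch, not an official Databricks utility: the function names `find_readmes` and `print_readmes` are my own, and it assumes each dataset lives in its own sub-folder with an optional README.md directly inside it, which may not hold for every dataset.

```python
from pathlib import Path


def find_readmes(root):
    """Return README.md paths found directly under root or one folder below it."""
    root = Path(root)
    candidates = [root / "README.md"]
    # Assumption: each sub-folder of root is one dataset with an optional README.md.
    candidates += [
        child / "README.md" for child in sorted(root.iterdir()) if child.is_dir()
    ]
    return [path for path in candidates if path.is_file()]


def print_readmes(root, max_chars=500):
    """Print the first max_chars characters of every README found under root."""
    for path in find_readmes(root):
        print(f"===== {path} =====")
        print(path.read_text(encoding="utf-8", errors="replace")[:max_chars])


# Hypothetical usage inside a Databricks notebook, where the datasets are
# reachable through the /dbfs local file mount used in the snippet above:
# print_readmes("/dbfs/databricks-datasets")
```

The helper only looks one level down and truncates each file to keep the notebook output short, so raise `max_chars` if you want the full text.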

Where a code example needs extra functionality or libraries, I will try to list the prerequisites alongside it.

Finally:

It is going to be a challenge to release all of these sections in rapid succession, but I will do my best to do so. If you have any questions or advice, please send them over to me via LinkedIn:

Thanks for coming along this journey with me, stay safe!

Series Sections:

Introduction

  1. What is Apache Spark
  2. Spark Architecture

Published in Nerd For Tech

Written by Luke Thorp

Learning new skills, tools and methodologies is what I do best! I am currently trying to master Apache Spark, I am not going to lie… it’s quite fun!
