Learn PySpark locally without an AWS cluster

David Liao
Published in Grubhub Bytes
9 min read · Apr 30, 2018


I’m a relatively new data engineer at Grubhub — this means I needed to quickly learn how to use Apache Spark, which is the data processing engine our Enterprise Data Warehouse runs on. At my previous job, I designed and built enterprise data warehouses (EDW) for Fortune 100-type clients using business intelligence (BI) tools supplied by SAP and Oracle. In the world of my old job, all databases were relational and visual tools provided enhanced modelling and construction. They emphasized understanding business processes as part of the report cube design.

In my new world, Grubhub defines database tables with a Hive metadata layer and a Presto DB engine on top to facilitate data analysis with SQL. All of this lives in AWS S3 backend physical cloud storage. Because we can interact with our cloud data via SQL, it was easy for me to get fooled into thinking I was still working in a relational DB world when I wasn’t. We do not have relational-type indexes on our data platform to speed up queries. Because of this, we have to think about our physical partitioning of Hive tables carefully.

In other words, I cannot be a good data engineer if I only have a surface understanding of Spark/Hive/AWS S3. PySpark is our extract, transform, load (ETL) language workhorse. I had a difficult time initially trying to learn it in terminal sessions connected to a server on an AWS cluster. It looked like the green code streams on Neo’s screen saver in the Matrix movies.

This led me on a quest to install the Apache Spark libraries on my local Mac OS and use Anaconda Jupyter notebooks as my PySpark learning environment. I prefer a visual programming environment with the ability to save code examples and learnings from mistakes. I went down the rabbit hole, reading a lot of sites, blogs, and Github links to figure out what the heck the correct installation sequence was.

This article is intended to prevent you from wasting time chasing parallel rabbit holes and simplify the process of learning PySpark. You should be able to do so locally without buying time on a cloud server.

Requirements — Java JDK and Jupyter Python notebook pre-installed

I’m going to assume that you have a working Anaconda Jupyter notebook running a version of Python you are happy with. I will give you tips on installing Java, too, since it’s critical to getting PySpark working.

This article will take you through three major processes:

  1. Installing Java.
  2. Finding and installing a Spark version of your choice.
  3. Setting up the minimum set of Python environment variables to run Spark inside a Jupyter notebook session.

All instructions here assume you run a Mac OS.

Installing Java on your local machine

I have found that Java 8 works with PySpark 2.0+, but higher versions of Java (9 and 10) gave me errors.

There are a couple of ways you can install the JDK. You can use Homebrew on a Mac to install Java with a terminal session command: "brew cask install java8". Prior to installing, you can get more info with "brew cask info java". Or you can go to Oracle's download site, choose the JDK platform you want, and download and install the *.dmg file. It is possible to have multiple JDK versions installed on your machine, but be sure to set the JAVA_HOME environment variable to point to the Java 8 installation.

Verify the exact folder on your machine’s hard drive where the Java library resides. Now update your .bash_profile with the two environment variable statements shown below so that any application that needs Java will know where it is.

export JAVA_HOME=/Library/Java/JavaVirtualMachines/<your_jdk_version_directory>.jdk/Contents/Home
export PATH=$PATH:$JAVA_HOME/bin
An example of a JDK install path

So in this example, the value, “/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin”, is appended to what is already in $PATH.
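Since everything else in this setup depends on Java being visible, I like to double-check it from inside a notebook cell as well. Here is a minimal sketch, assuming the notebook inherits a PATH that includes $JAVA_HOME/bin:

import os
import subprocess

print 'JAVA_HOME: ', os.environ.get('JAVA_HOME')
# java -version writes to stderr, so redirect stderr to capture the text
print subprocess.check_output(['java', '-version'], stderr=subprocess.STDOUT)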

Installing Apache Spark on your local machine

1. Go to the download site for Apache Spark and choose the version you want. You’ll want it to match either your organization’s version or an appropriate learning version.

2. Download the "*.tgz" file and unzip it by double-clicking it or using an unzip application.

3. I consider it safest to manually install/copy developer tools into /usr/local/, a safe-haven root directory for manually installed software. Future software installs should avoid overwriting existing files here (/usr/local).

4. I chose to create a new Spark directory named “/usr/local/spark/” and copy the downloaded Spark files there.

5. Either use a GUI tool like Finder (press Command + Shift + . to show hidden folders/files) or a terminal session to recursively copy the entire contents of the Apache Spark files directory with the Unix cp command and its -r (recursive) flag:

cp -r <source directory> /usr/local/spark/

6. Do a quick test after a successful Spark install. In a terminal session, go to your new Spark directory (/usr/local/spark/) and run ./bin/pyspark to see if a successful Spark session starts up.

If this doesn't work, verify your installation location from step 4 above. Otherwise, you should see the "Spark" ASCII logo, along with the version number you selected, which indicates a successful Spark install.

7. If there is an environment variable IPYTHON in your .bash_profile, you will need to reset that variable with this command: export IPYTHON="". Make sure it's blank with this command: echo $IPYTHON. If it's not blank, Spark will post a message like this:

→ “Error in pyspark startup: IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these variables from the environment and set variables PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead”.

If you want to start a Spark session with IPython from the terminal, launch it with the driver variable set inline: PYSPARK_DRIVER_PYTHON=ipython pyspark, as suggested by this Coursera Big Data Intro Course.

PySpark environment variables to add to your .bash_profile

1. Make note of where the files listed below are located.

I also recommend you add these four lines to your .bash_profile:

export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export PATH=$PATH:$SPARK_HOME/bin

You may need to adjust them depending on where you installed Spark on your machine. Take note of the two Zip files listed below; they'll be referenced in a later section:

‘/usr/local/spark/python/lib/pyspark.zip’,
‘/usr/local/spark/python/lib/py4j-0.10.3-src.zip’

2. Notice that we set up a root home directory for SPARK_HOME. We also re-use the SPARK_HOME value and concatenate it with other values to add to whatever is already in $PYTHONPATH. This is similar to what we did above with the $JAVA_HOME environment variable. You could enter these lines in a Jupyter notebook session, but it’s likely more convenient to update your .bash_profile.

Once you make any changes to your .bash_profile, you need to re-run it to load all the changes into your terminal session. If your .bash_profile is in directory /Users/<username>/, then run:

source /Users/<username>/.bash_profile

Here’s a list of the new environment variables to add to your .bash_profile so far:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home
export PATH=$PATH:$JAVA_HOME/bin
export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export PATH=$PATH:$SPARK_HOME/bin
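Assuming you start Jupyter from a terminal session that has sourced the updated .bash_profile, a quick sanity check inside a notebook cell might look like this sketch:

import os
# Each of these should print a real path rather than None.
for var in ['JAVA_HOME', 'SPARK_HOME', 'PYTHONPATH']:
    print var, ': ', os.environ.get(var)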

Environment variables to set up inside a Jupyter notebook

1. The “PYLIB” environment variable seems to only be required when running Spark in a Jupyter notebook. I discovered I needed the next three environment variables by experimenting with the directions in this link from the Anaconda.org website.

Run the following commands inside your Jupyter notebook:

import os
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
print 'PYLIB: ', os.environ.get('PYLIB')

Output: PYLIB: /usr/local/spark/python/lib
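Before moving on, it's worth confirming that this directory really contains the two Zip files referenced later: pyspark.zip and a py4j-*-src.zip whose exact version depends on your Spark download. A quick check (sketch) inside the notebook:

# List the Spark python/lib directory contents.
for name in sorted(os.listdir(os.environ['PYLIB'])):
    print name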

2. The PYSPARK_PYTHON environment variable must be set to point to the Python version you are using in your Jupyter notebook. In this example, I want to specify the Python 2.7 environment, as I have both Python 3.x and Python 2.7 installed.

Find the path to your Anaconda Python installation and then execute the commands below, adjusted to reflect your own Anaconda install location, inside your Jupyter notebook. Take special care to determine the correct path, as you can get misleading permission errors if it's not correct.

os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda3/envs/Py27root/bin/python"
print 'PYSPARK_PYTHON: ', os.environ.get('PYSPARK_PYTHON')

Output: PYSPARK_PYTHON: /usr/local/anaconda3/envs/Py27root/bin/python
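If you are not sure which interpreter is actually backing your notebook kernel (and therefore what PYSPARK_PYTHON should line up with), sys.executable will tell you:

import sys
# Path of the Python interpreter running the current notebook kernel.
print sys.executable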

3. Add the two Spark Zip file locations to Python's module search path (sys.path).

Back in the section about modifying your .bash_profile with PySpark environment variables, I specified two Zip files to add to your .bash_profile:

‘/usr/local/spark/python/lib/pyspark.zip’,
‘/usr/local/spark/python/lib/py4j-0.10.3-src.zip’

Now the two Spark Zip files should also be added to sys.path inside the notebook. Apply the two commands below in your Jupyter notebook. To future-proof the paths, concatenate the file names with the PYLIB path reference:

import sys
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.3-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

In addition, insert an entry for “SPARK_HOME/bin” with this command:

sys.path.insert(0, os.environ["SPARK_HOME"] + "/bin")

Verify the three new sys.path entries were added properly with a print command:

print 'sys.path: ', '\n'.join(sys.path)

4. Two final environment variables should be considered: PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS.

I have found these last two environment variables to be very important when running PySpark inside a Jupyter notebook. Add the following:

os.environ["PYSPARK_DRIVER_PYTHON"] = "ipython"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"

5. Start your PySpark session using the “shell.py” program, located inside the SPARK_HOME subdirectories.

If you can’t find the “shell.py” program, print out your SPARK_HOME environment variable and then search its subdirectories. It should be very similar to the example below:

Double-check your SPARK_HOME environment variable by fetching it and printing it:

spark_home = os.environ.get('SPARK_HOME', None)
print spark_home

Output: /usr/local/spark

Now, set up a variable to reference the path location of “shell.py” (as shown below), and print it to verify:

spark_shell = spark_home + "/python/pyspark/shell.py"
print spark_shell

Output: /usr/local/spark/python/pyspark/shell.py

Now you are ready to initialize the PySpark session inside your Jupyter notebook with this command:

exec(open(spark_shell).read())


You should see the “Welcome to Spark” logo display, which indicates that you have successfully initiated a PySpark session on your local Jupyter notebook.
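A quick way to confirm the session is usable is the sketch below, assuming a Spark 2.x install whose shell.py exposes a SparkSession as spark and a SparkContext as sc; the toy rows are invented:

# Sanity-check the new session with a tiny local DataFrame.
print 'Spark version: ', spark.version
test_df = spark.createDataFrame([(1, 'pizza'), (2, 'tacos')], ['id', 'item'])
test_df.show()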

I hope this tutorial is of help on your journey to mastering PySpark. I recently had to rebuild my Mac after a lousy OS update that crashed my machine; thankfully, I had my own blog to help me set up PySpark again. It was so much easier the second time around with a guide like this. In fact, I often kick-start a PySpark session inside a local notebook to play with code.

Here are some quick examples of where learning locally is an advantage:

Looping through the rows of a DataFrame and filtering on each row's value:

for b in brands.rdd.collect():
    brand_df = df_s3.filter(df_s3["brand"] == b["brand"])

It was not initially obvious how the df.filter function works, and testing locally helped me grasp the syntax of new functions.
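Here is a toy version of that loop that runs entirely locally; the brand names and order counts are invented, and it assumes the spark session from the setup above:

# A hypothetical stand-in for df_s3 with made-up brand data.
df_s3 = spark.createDataFrame(
    [('acme', 10), ('acme', 12), ('zeta', 7)], ['brand', 'orders'])
brands = df_s3.select('brand').distinct()
for b in brands.rdd.collect():
    brand_df = df_s3.filter(df_s3['brand'] == b['brand'])
    print b['brand'], ': ', brand_df.count()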

The example below was actually thorny to figure out. When I have rows where the brand is null mixed in with good values, I have to sort the result set in descending order to ensure the value I want ends up in the first row.

def fill_brand_null_values(self, df):
    brands_df = df.select('brand').distinct()
    num_brands = brands_df.count()
    print 'Number of distinct brands: ', num_brands, '\n'
    if num_brands > 1:
        # when there are null values, sort descending so the non-null value is the 1st record
        curr_brand = str(brands_df.sort('brand', ascending=False).collect()[0][0])
        curr_brand = str.lower(curr_brand)
    else:
        curr_brand = str(df.select('brand').distinct().head(1)[0][0])
        curr_brand = str.lower(curr_brand)
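To convince myself that the descending sort really pushes the null below the good value, I ran a toy check like this sketch (the brand value is invented, it assumes an active spark session, and in the real job the recovered brand would then be used to fill the null rows):

# Two distinct values remain after distinct(): 'acme' and null.
brands_df = spark.createDataFrame(
    [('acme',), (None,), ('acme',)], ['brand']).distinct()
# Spark sorts nulls last in descending order, so the first row holds the real brand.
print brands_df.sort('brand', ascending=False).collect()[0][0]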

I wish you many happy future learnings on your PySpark journey!

Do you want to learn more about opportunities with our team? Visit the Grubhub careers page.
