Configuring and using SparkSession, SparkContext, DataFrameReader and DataStreamReader objects in a Python REPL or IDE with interactive code-completion and docstring support.

Note: An adaptation of this article was used in my answer to a StackOverflow question here.
Note: A companion JupyterLab notebook attachment appears at the end of this post.

After pip-installing the PySpark Python package, issuing the pyspark(1) command launches a terminal REPL session using your Python interpreter (for example, /usr/bin/python). However, that session lacks code-completion and accompanying docstring support, making it difficult to explore and interactively learn the Spark API. Matters worsen in a proper Python IDE, where there's no way to issue the pyspark(1) command at all. How, then, does one set up an environment in which a terminal REPL or Python IDE has access to the full PySpark framework (including plugins), yet provides code-completion and accompanying docstring support? In other words, something similar to the screen capture below; note the beige code-completion pop-up near the bottom:

[Screen capture: Interactive PySpark Session]

We illustrate how to do this now.

Configure SparkSession, SparkContext, DataFrameReader and DataStreamReader objects. Assuming you've pip-installed the pyspark and ptpython Python packages, start an ad-hoc interactive session with code-completion and docstring support by saving the following code block to, say, ./pyspark_init.py, then running it as follows:
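A minimal sketch of such an initialization module (the app name, master URL, and conf key/value pairs below are illustrative assumptions; adjust them for your environment):

    # ./pyspark_init.py -- a sketch; tune the conf key/value pairs for your setup.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .setAppName("interactive-pyspark")          # illustrative app name
            .setMaster("local[*]")                      # or a spark:// / yarn master URL
            .set("spark.sql.shuffle.partitions", "8"))  # illustrative tuning value

    # SparkSession is the unified entry point; it owns a SparkContext.
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    sc = spark.sparkContext

    # Reader objects for batch and streaming sources.
    dfr  = spark.read        # DataFrameReader
    dsfr = spark.readStream  # DataStreamReader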

nmvega@fedora$ ptpython -i ./pyspark_init.py

Note that, with very minor tweaks (mostly to the conf object key/value pairs), the above initialization snippet is designed to be usable virtually anywhere. To use it within a proper Python IDE, simply paste it into a Python helper module and import that module (… pyspark(1) command not needed). \o/

With a code-completion- and docstring-enabled interactive PySpark session loaded, let's now perform some basic Spark data engineering within it.

Configure the DataFrameReader object. DataFrameReader objects offer methods to load various serialized formats (e.g. csv, json, parquet, etc.) into a DataFrame object, as well as methods to set options related to that format. The following example is for the CSV file format:
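A sketch of one such configuration, reusing the spark session created by the initialization snippet above (the option choices are illustrative):

    # Configure a DataFrameReader for CSV input.
    reader = (spark.read
              .format("csv")
              .option("header", "true")        # first line holds column names
              .option("inferSchema", "true"))  # sample the data to infer column types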

 

For this example, we'll save real-estate sales data from Zillow to a pseudo-randomly named temporary file, which we'll delete at the end. The file's contents will be the data source for our resulting DAGs.
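As a stand-in for the actual Zillow export, the sketch below writes a few hypothetical rows (column names and values are illustrative, not real Zillow data) to a pseudo-randomly named temporary file, then loads it through the reader configured above:

    import os, tempfile

    # Illustrative sample rows standing in for the real-estate sales export.
    csv_rows = (
        "street,city,zip,state,beds,baths,sq_ft,sale_date,price\n"
        "123 Elm St,Sacramento,95816,CA,3,2,1450,2024-01-15,425000\n"
        "9 Oak Ave,Sacramento,95818,CA,4,3,2100,2024-02-02,610000\n"
    )

    # Pseudo-randomly named temporary file that will act as our data source.
    fd, csv_path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as f:
        f.write(csv_rows)

    # Read the temporary CSV file into a DataFrame and inspect it.
    df = reader.load(csv_path)
    df.printSchema()
    df.show(truncate=False)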

 

 

Perform Transformations and Actions on this DataFrame, using either the DataFrame API or SQL statements. The following statements return DataFrames with identical contents and execution plans:
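Sketches of both versions follow, using the illustrative column names from the sample data above: the DataFrame-API query first, then the equivalent SQL against a temporary view, with explain() to compare the plans:

    # DataFrame API: project, filter and order the rows.
    df_api = (df.select("city", "beds", "baths", "price")
                .where(df["price"] > 500000)
                .orderBy(df["price"].desc()))
    df_api.show()

    # SQL: register a temporary view and express the same query in SQL.
    df.createOrReplaceTempView("homes")
    df_sql = spark.sql("""
        SELECT city, beds, baths, price
        FROM homes
        WHERE price > 500000
        ORDER BY price DESC
    """)
    df_sql.show()

    # Both queries should yield equivalent physical plans.
    df_api.explain()
    df_sql.explain()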

 

 

Finally, we can delete the temporary file and shut down the SparkSession …
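A minimal sketch of that cleanup:

    os.remove(csv_path)  # delete the temporary data file
    spark.stop()         # stop the SparkSession (and its SparkContext)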

The end! =:)

ATTACHMENTS:
   ‣ Jupyter notebook of above session in HTML format