Configuring and using SparkSession, SparkContext, DataFrameReader and DataStreamReader objects

Configure SparkSession, SparkContext, DataFrameReader, and DataStreamReader objects. With minor tweaks (mostly to the conf object's k/v pairs), the following initialization code can be used virtually anywhere. Assuming you’ve pip-installed pyspark, to start an ad-hoc interactive session, save the first code block to, say, ./pyspark_init.py, then run it as follows:

nmvega@fedora$ ptpython -i ./pyspark_init.py # Use python(1) if you don’t use ptpython.
# NOTE: For REPL sessions, your humble author prefers ptpython with vim(1) key bindings.

 
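A minimal sketch of what ./pyspark_init.py might contain (the conf k/v values shown are illustrative assumptions; substitute your own app name, master URL, and tuning settings):

# pyspark_init.py -- minimal SparkSession / SparkContext initialization sketch.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.app.name", "pyspark-demo")       # illustrative app name
conf.set("spark.master", "local[*]")             # or your cluster's master URL
conf.set("spark.sql.shuffle.partitions", "8")    # tune for your workload

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc  = spark.sparkContext     # the SparkContext
dfr = spark.read             # a DataFrameReader (configured further below)
dsr = spark.readStream       # a DataStreamReader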

Configure the DataFrameReader object. DataFrameReader objects offer a method to load various kinds of serialized formats (e.g. csv, json, parquet, etc.) into a DataFrame object, as well as a method to set options related to each format. The following example is for the CSV file format:

 
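A sketch of that configuration (the option names below are standard Spark CSV options; which ones you need depends on your file):

# Configure a DataFrameReader for CSV input.
dfr = (spark.read.format("csv")
                 .option("header", "true")        # first line holds column names
                 .option("inferSchema", "true")   # let Spark deduce column types
                 .option("sep", ","))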

For this example, we’ll load real-estate sales data from Zillow into a pseudo-randomly named temporary filesystem file, which we’ll delete at the end. The file contents will be the data source for our resulting DAGs.

 
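As a stand-in for the downloaded data, the sketch below writes a few placeholder rows to a temporary file; the column names and values are hypothetical, not actual Zillow figures:

import os
import tempfile

# Create a pseudo-randomly named temporary file and write a small,
# illustrative CSV payload to it (placeholder rows, not real Zillow data).
fd, tmp_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("city,zip,beds,baths,sq_ft,sale_price\n")
    f.write("Sacramento,95816,2,1,836,59222\n")
    f.write("Sacramento,95823,3,1,1167,168212\n")
    f.write("Elk Grove,95758,4,2,2235,339201\n")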

 
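We then load the file through the reader configured above, and register a temp view (the name real_estate is arbitrary) so the same data can also be queried with SQL:

# Load the temporary CSV into a DataFrame and register it as a temp view.
df = dfr.load(tmp_path)
df.createOrReplaceTempView("real_estate")
df.printSchema()
df.show()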

Perform Transformations and Actions on this DataFrame, using either the DataFrame API or SQL statements. The following statements return DataFrames with identical contents and execution plans:

 
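For example, using the DataFrame API (column names follow the placeholder schema above):

from pyspark.sql import functions as F

# DataFrame API version: filter and project, then inspect the plan.
df_api = (df.where(F.col("sale_price") > 100000)
            .select("city", "zip", "sale_price"))
df_api.explain()
df_api.show()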

 
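And the equivalent SQL statement against the temp view registered earlier; comparing the two explain() outputs confirms the plans match:

# SQL version: same contents and same execution plan as df_api above.
df_sql = spark.sql("""
    SELECT city, zip, sale_price
    FROM real_estate
    WHERE sale_price > 100000
""")
df_sql.explain()
df_sql.show()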

Finally, we can delete the temporary file and shut down the SparkSession …
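Continuing the sketch above (tmp_path and spark were created earlier):

import os

# Clean up: remove the temporary data file and stop the SparkSession.
os.remove(tmp_path)
spark.stop()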

The end! =:)

ATTACHMENTS:
   ‣ Jupyter notebook of the above session in HTML format