Munging MovieLens
Using the dataset to tour Spark DataFrames, SparkSQL, HiveQL, MySQL, Kafka, Parquet, HDFS, and more.

Preamble:
The JupyterLab Notebook that this post builds up to can be found here.

In a previous post we illustrated the R&D Lab that we use to stand up arbitrary technology stacks using LXC/Linux containers and Docker containers. Here we’ll make use of that lab. Underlying the JupyterLab Notebook session that follows is a collection of LXC/Linux containers (vps00, vps01, vps02, vps03, vps04 and vps10) that form a Cloudera/CDH 6.1 Hadoop cluster.

CDH 6.x Hadoop Cluster

Additionally, nested within the vps00 LXC/Linux Container is a running Docker engine that provides the following technology stack components:

Although Cloudera/CDH offers Kafka, Zookeeper and Spark services, those offerings inevitably lag the latest upstream releases; this is particularly problematic for components like Spark and Kafka, which receive fixes and new features quickly. We therefore adopt a best-of-breed approach: we leave Cloudera/CDH to do what it does best (being a Hadoop cluster that provides HDFS, Hive, HBase, MapReduce, and related services) and use Docker to supply the rest of the stack:

Basically, we use the CDH cluster as a remote Hadoop services provider.
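To make the "remote Hadoop services provider" idea concrete, here is a minimal PySpark sketch of how our Docker/Anaconda-provided Spark can be pointed at the CDH cluster's HDFS and Hive metastore. It is a sketch under assumptions, not the notebook's exact configuration: the hostnames and ports (vps00, 8020 for the NameNode, 9083 for the metastore) are illustrative, and setting hive.metastore.uris on the session builder is one common way to reach a remote metastore (a hive-site.xml on the classpath works just as well).

```python
from pyspark.sql import SparkSession

# A Spark session that treats the CDH cluster as a remote services provider:
# HDFS for storage, the Hive metastore for table metadata.
# Hostnames and ports below are illustrative placeholders.
spark = (SparkSession.builder
         .appName("remote-cdh-services")
         .config("spark.hadoop.fs.defaultFS", "hdfs://vps00:8020")   # CDH NameNode
         .config("hive.metastore.uris", "thrift://vps00:9083")       # CDH Hive metastore
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()  # databases come from the remote CDH metastore
```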

Note: Because some CDH services require Zookeeper, the reader may wonder about a port conflict between CDH Zookeeper and Docker Zookeeper. To avoid this, we run CDH Zookeeper on port 2181 (the well-known Zookeeper port) and the Docker Zookeeper on port 22181 (seen in the output above). The Docker-provided Zookeeper is used by the Docker-provided Kafka.
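As a quick sanity check that the two instances really are listening on distinct ports, a few lines of Python can send ZooKeeper's four-letter "ruok" command to each. This is a minimal sketch under assumptions: it presumes both instances are reachable from vps00 on the ports shown and that ZooKeeper's four-letter-word commands are enabled (newer ZooKeeper releases require them to be whitelisted).

```python
import socket

def zk_ruok(host, port, timeout=5):
    """Send ZooKeeper's 'ruok' four-letter command; a healthy server replies 'imok'."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"ruok")
        return conn.recv(16).decode()

print(zk_ruok("vps00", 2181))    # CDH Zookeeper on the well-known port
print(zk_ruok("vps00", 22181))   # Docker Zookeeper, remapped to avoid the conflict
```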

Finally, vps00 also hosts our UNIX home directory (/home/nmvega/agile.ds.d/) and, within that directory, our Anaconda Python 3.x virtual environment. Among the rich standard and third-party libraries that Python provides, the following feature most prominently in this post:

One of the great features of Spark 2.x is its Catalyst engine, which among other things makes Spark language-agnostic. Whether one uses Python, R, Java or Scala to drive Spark makes little-to-no difference, since DataFrame and SQL operations written in any of these languages are compiled by Catalyst into the same optimized execution plan.
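The point is easy to see with DataFrame.explain(). The sketch below is a minimal, self-contained example (the toy rows and appName are illustrative, not taken from the notebook); it prints the parsed, analyzed, optimized and physical plans that Catalyst produces, and an equivalent Scala or SQL formulation yields the same optimized plan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# A throwaway DataFrame; the rows are purely illustrative.
df = spark.createDataFrame(
    [(1, "Toy Story", 4.0), (2, "Jumanji", 3.5)],
    ["movieId", "title", "rating"],
)

# Whether this pipeline is expressed in Python, Scala, Java, R or SQL,
# Catalyst compiles it to the same optimized plan before execution.
(df.filter(F.col("rating") >= 4.0)
   .select("movieId", "title")
   .explain(True))  # prints parsed, analyzed, optimized and physical plans
```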

With the DevOps talk behind us, we turn our attention to the heart of this post: using the MovieLens dataset to tour Spark DataFrames, SparkSQL, HiveQL, MySQL, Kafka, Parquet, HDFS, and more. Here are the tables of the MovieLens schema (and although ERD relationships aren't included as of this writing, they're not too difficult to figure out):
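If you'd rather inspect the schema yourself, a few lines of Python against the MySQL database will list each table and its columns. This is a minimal sketch under assumptions: it presumes the PyMySQL client is installed in the Anaconda environment and that the database is named movielens on vps00; the user and password shown are placeholders.

```python
import pymysql  # assumed to be installed in the Anaconda environment

# Host, user, password and database name are placeholders; substitute your own.
conn = pymysql.connect(host="vps00", user="movielens_user",
                       password="CHANGE_ME", database="movielens")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        for (table,) in cur.fetchall():
            cur.execute(f"DESCRIBE `{table}`")
            cols = [row[0] for row in cur.fetchall()]
            print(table, "->", ", ".join(cols))
finally:
    conn.close()
```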

We will ask questions of this dataset via MySQL, Spark DataFrames, SparkSQL, and HiveQL (via beeline(1)), along the way learning how to accomplish the same things using different techniques. In particular:
   ‣ We show how to programmatically plug functionality into Spark (like Avro as a data source and Kafka as a data sink).
   ‣ We sqoop(1) table data from MySQL into Hive/HDFS, serialized as Parquet files, and later ingest those files into Spark DataFrames.
   ‣ We push Spark DataFrames (which represent MovieLens tables) through Kafka topic-partitions and recover them at the other end, along the way showing how to embed DataFrame metadata and ephemeral keys to drive per-table Kafka topic-partitioning, as well as how to pack (denormalize) data for ingress and unpack (re-normalize) data for egress.
   ‣ And so on.
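To make the Parquet-to-Kafka leg more concrete, here is a minimal PySpark sketch. It is not the notebook's exact scheme: the HDFS path, broker address and topic naming are illustrative assumptions, it uses a simple literal key rather than the notebook's embedded DataFrame metadata and ephemeral keys, and it requires the spark-sql-kafka integration package on the classpath (e.g. via --packages).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("movielens-kafka-sketch").getOrCreate()

table = "ratings"  # one Kafka topic per MovieLens table

# 1. Ingest the sqoop-produced Parquet files from HDFS into a DataFrame.
#    The warehouse path below is an illustrative placeholder.
df = spark.read.parquet(f"hdfs://vps00:8020/user/hive/warehouse/movielens.db/{table}")

# 2. Pack (denormalize) each row into a single JSON string, keyed by table name
#    so Kafka can route rows of this table to its topic-partitions.
packed = df.select(
    F.lit(table).alias("key"),
    F.to_json(F.struct(*df.columns)).alias("value"),
)

# 3. Publish the DataFrame to the Docker-provided Kafka broker (address is illustrative).
(packed.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "vps00:9092")
    .option("topic", f"movielens.{table}")
    .save())
```

On the consuming side, the same integration package lets spark.read.format("kafka") recover the topic into a DataFrame, after which from_json() unpacks (re-normalizes) the value column back into typed columns.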

The following JupyterLab Notebook session is version v0.9.0, meaning your humble author still intends to add markdown-cell comments and in-line annotations for clarity; there were too many moving parts to complete everything in one go. Hopefully there's already enough there for readers to understand what is taking place. The next update will improve on this, and perhaps a dedicated follow-on post will cover how Avro containers can make life easier when working with Kafka.

The notebook:

ATTACHMENTS:
   ‣ Munging MovieLens (JupyterLab Notebook Session)