Open source Java projects: Apache Spark

25.08.2015
Big data adoption has been growing by leaps and bounds over the past few years, which has necessitated new technologies to analyze that data holistically. Individual big data solutions provide their own mechanisms for data analysis, but how do you analyze data that is spread across Hadoop, Splunk, files on a file system, a local database, and so forth?

The answer is that you need an abstraction that can pull data from all of these sources and analyze potentially petabytes of information very rapidly.

Spark is a computational engine that manages tasks across a collection of worker machines in what is called a computing cluster. It provides the necessary abstraction, integrates with a host of different data sources, and analyzes data very quickly. This installment in the Open source Java projects series reviews Spark, describes how to set up a local environment, and demonstrates how to use Spark to derive business value from your data.

Let's begin by writing a simple word-counting application using Spark in Java. After this hands-on demonstration we'll explore Spark's architecture and how it works.

Similar to the standard "Hello, Hadoop" application, the "Hello, Spark" application will take a source text file and count the number of times each unique word occurs in it. To start, create a new project using Maven with the following command:
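For example, the standard quickstart archetype will do (the group and artifact IDs here are placeholders -- substitute your own):

    mvn archetype:generate \
        -DgroupId=com.example.spark \
        -DartifactId=spark-wordcount \
        -DarchetypeArtifactId=maven-archetype-quickstart \
        -DinteractiveMode=false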

Next, modify your pom.xml file to include the following Spark dependency:
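For example (the version shown here is a Spark 1.x release that was current around the time of writing; use whatever version matches your Spark installation):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.4.1</version>
    </dependency>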

Listing 1 shows the complete contents of my pom.xml file.

Note the three plugins I added to the build directive:

With that out of the way, Listing 2 shows the source code for the WordCount application. Note that it shows how to write the Spark code in both Java 7 and Java 8. I'll discuss highlights of both below.

The WordCount application's main method accepts the source text file name from the command line and then invokes the wordCountJava8() method. It defines two helper methods -- wordCountJava7() and wordCountJava8() -- that perform the same function (counting words), first in Java 7's notation and then in Java 8's.

The wordCountJava7() method is more explicit, so we'll start there. We first create a SparkConf object that points to our Spark instance, which in this case is "local." This means that we're going to be running Spark locally in our Java process space. In order to start interacting with Spark, we need a SparkContext instance, so we create a new JavaSparkContext that is configured to use our SparkConf. Now we have four steps:

1. Load the input text file into an RDD of Strings.
2. Split each line into individual words.
3. Map each word to a count of 1 and then reduce by key to sum the counts for each distinct word.
4. Save the results to a text file.

The first step is to leverage the JavaSparkContext's textFile() method to load our input from the specified file. This method reads the file from either the local file system or from a Hadoop Distributed File System (HDFS) and returns a resilient distributed dataset (RDD) of Strings. An RDD is Spark's core data abstraction and represents a distributed collection of elements. You'll find that we perform operations on RDDs, in the form of Spark transformations, and ultimately we leverage Spark actions to translate an RDD into our desired result set.
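A minimal sketch of this setup looks like the following (the class and variable names are illustrative rather than lifted from Listing 2):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WordCount {
        public static void main(String[] args) {
            String fileName = args[0];

            // Point Spark at a local, in-process instance
            SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Step 1: load the input file (a local path or an hdfs:// URI) as an RDD of lines
            JavaRDD<String> input = sc.textFile(fileName);

            // Steps 2-4: transformations and actions, shown in the sections that follow

            sc.stop();
        }
    }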

In this case, the first transformation we want to apply to the RDD is the flat map transformation. Transformations come in many flavors, but the most common are as follows:

- map(): applies a function to each element in the RDD and returns a new RDD of the results
- flatMap(): applies a function that returns zero or more elements for each input element and flattens the results into a single RDD
- filter(): returns a new RDD containing only the elements for which a predicate function returns true

The flatMap() transformation in Listing 2 returns an RDD that contains one element for each word, split by a space character. The flatMap() method expects a function that accepts a String and returns an Iterable interface to a collection of Strings.

In the Java 7 example, we create an anonymous inner class of type FlatMapFunction and override its call() method. The call() method is passed the input String and returns an Iterable reference to the results. The Java 7 example leverages the Arrays class's asList() method to create an Iterable interface to the String[] returned by String's split() method. In the Java 8 example we use a lambda expression to create the same function without creating an anonymous inner class:
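Here is a sketch of both forms, assuming the input RDD from the setup sketch above (in Spark 1.x, FlatMapFunction's call() method returns an Iterable; the snippet also needs imports for java.util.Arrays and org.apache.spark.api.java.function.FlatMapFunction):

    // Java 7: an anonymous FlatMapFunction that splits each line into words
    JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    });

Or, equivalently, with a Java 8 lambda expression:

    // Java 8: the same function expressed as a lambda
    JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" ")));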

Given an input s, this function splits s into words separated by spaces, and wraps the resultant String[] into an Iterable collection by calling Arrays.asList(). You can see that this is the exact same implementation, but it's much more succinct.

At this point we have an RDD that contains all of our words. Our next step is to transform the words RDD into a pair RDD that maps each word to a count of 1; then we'll sum those counts. The mapToPair() method iterates over every element in the RDD and executes a PairFunction on the element. The PairFunction implements a call() method that accepts an input String (the word from the previous step) and returns a Tuple2 instance.
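In Java 7 notation that step looks something like this sketch (assuming the words RDD from above; it requires imports for scala.Tuple2, org.apache.spark.api.java.JavaPairRDD, and org.apache.spark.api.java.function.PairFunction):

    // Map each word to a (word, 1) pair
    JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String word) {
            return new Tuple2<String, Integer>(word, 1);
        }
    });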

The reduceByKey() method iterates over the JavaPairRDD, finds all distinct keys in the tuples, and executes the provided Function2's call() method against all of the values for each key. Stated another way, it finds all instances of the same word (such as apple) and then passes each count (each of the 1 values) to the supplied function to count occurrences of the word. Our call() function simply adds the two counts it is given and returns the result.
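A sketch of that reduction in Java 7 notation (assuming the pairs RDD from the previous sketch and an import of org.apache.spark.api.java.function.Function2):

    // Sum the counts for each distinct word
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer x, Integer y) {
            return x + y;
        }
    });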

Note that this is similar to how we would implement a word count in a Hadoop MapReduce application: map each word in the text file to a count of 1 and then reduce the results by adding all of the counts for each unique key.

Once again, the Java 8 example performs all of these same functions, but does it in a single, succinct line of code:
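It looks something like this (again assuming the words RDD from above):

    JavaPairRDD<String, Integer> counts = words
            .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
            .reduceByKey((x, y) -> (int) x + (int) y);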

The mapToPair() method is passed a function. Given a String t, which in our case is a word, the function returns a Tuple2 that maps the word to the number 1. We then chain the reduceByKey() method and pass it a function that reads: given inputs x and y, return their sum. Note that we needed to cast the inputs to int so that we could perform the addition operation.

When writing Spark applications you will find yourself frequently working with pairs of elements, so Spark provides a set of common transformations that can be applied specifically to key/value pairs:

- reduceByKey(): combines the values that share a key using the supplied function
- groupByKey(): groups all of the values that share a key into a single collection
- mapValues(): applies a function to each value, leaving the keys unchanged
- sortByKey(): returns an RDD sorted by key (used in the sketch below)
- keys() and values(): return RDDs containing just the keys or just the values
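For example, we could sort the word counts alphabetically before writing them out (a sketch, assuming the counts RDD from the earlier step):

    // sortByKey() is a transformation on pair RDDs: it returns a new RDD ordered by key (the word)
    JavaPairRDD<String, Integer> sorted = counts.sortByKey();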

After we have transformed our RDD pairs, we invoke the saveAsTextFile() action method on the JavaPairRDD to create a directory called output and store the results in files in that directory.
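In code this comes down to a single call (assuming the counts RDD from above):

    // Spark creates the "output" directory and writes the results as part-* files inside it
    counts.saveAsTextFile("output");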

With our data compiled into the format that we want, our final step is to build and execute the application, then use Spark actions to derive some result sets.

Build the WordCount application with the following command:
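A plain Maven build is all that's needed; for example:

    mvn clean install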

Execute it from the target directory with the following command:

Let's create a short text file to test that it works. Enter the following text into a file called test.txt:

From your target folder, execute the following command:
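The exact invocation depends on how your build packages the application; if your pom produces an executable jar that bundles the Spark dependencies, it will look something like this (the jar name here is a placeholder):

    java -jar spark-wordcount-1.0-SNAPSHOT.jar test.txt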

Spark will create an output folder with a new file called part-00000. If you look at this file, you should see the following output:

The output contains each word and the number of times it occurs. If you want to clean up the output, you might convert all of the words to lower case (note the two "the"s), but I'll leave that as an exercise for you.

Finally, if you really want to take Spark for a spin, check out Project Gutenberg and download the full text of a large book. (I parsed Homer's Odyssey to test this application.)

The saveAsTextFile() method is a Spark action. Earlier we saw transformations, which transform one RDD into another RDD; actions are what generate our actual results. The following, in addition to the various save actions, are the most common actions:

- collect(): returns the entire RDD to the driver program as a local collection
- count(): returns the number of elements in the RDD
- countByValue(): returns a map of each distinct element to the number of times it occurs
- take(n): returns the first n elements of the RDD
- reduce(): combines the elements of the RDD using the supplied function
- foreach(): applies a function to every element of the RDD

The actions listed are some of the most common that you'll use, but others exist, and some operations are available only on specific types of RDDs: for example, mean() and variance() operate on RDDs of numbers, while join() operates on RDDs of key/value pairs.
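For example, instead of saving to a text file we could have pulled the results back into the driver with collect() (a sketch, assuming the counts RDD from the example, plus imports for java.util.List and scala.Tuple2; collect() brings the entire dataset into the driver's memory, so it's only appropriate for small result sets):

    List<Tuple2<String, Integer>> results = counts.collect();
    for (Tuple2<String, Integer> wordCount : results) {
        System.out.println(wordCount._1() + ": " + wordCount._2());
    }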

The steps for analyzing data with Spark can be grossly summarized as follows:

1. Load your data into an RDD from its source.
2. Apply transformations to massage the data into the form you want to analyze.
3. Apply actions to derive your results.

An important note is that while you may specify transformations, they do not actually execute until you specify an action. This lazy evaluation allows Spark to optimize transformations and reduce the amount of redundant work it needs to do. Another important thing to note is that once an action has executed, the transformations are re-applied for any further actions you run. If you know that you're going to execute multiple actions, you can persist the RDD before executing the first action by invoking its persist() method; just be sure to release it by invoking unpersist() when you're done.
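A sketch of that pattern, again using the counts RDD from our example (it requires an import of org.apache.spark.storage.StorageLevel):

    counts.persist(StorageLevel.MEMORY_ONLY()); // keep the RDD in memory across actions

    long distinctWords = counts.count();        // first action
    counts.saveAsTextFile("output");            // second action reuses the persisted RDD

    counts.unpersist();                         // release the cached data when we're done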

Now that you've seen an overview of the programming model for Spark, let's briefly review how Spark works in a distributed environment. Figure 1 shows the distributed model for executing Spark analysis.

Spark consists of two main components: the Spark Driver and the Spark Executors.

The Spark Driver is the process that contains your main() method and defines what the Spark application should do. This includes creating RDDs, transforming RDDs, and applying actions. Under the hood, when the Spark Driver runs, it performs two key activities:

1. It converts your program into tasks, turning the logical graph of transformations and actions into a physical execution plan.
2. It schedules those tasks on executors and tracks them to completion.

Spark Executors are processes running on distributed machines that execute Spark tasks. Executors start when the application starts and typically run for the duration of the application. They provide two key roles:

1. They run the tasks that the driver assigns to them and return results to the driver.
2. They provide in-memory storage for RDDs that the application asks to cache or persist.

The cluster manager is the glue that wires together drivers and executors. Spark provides support for different cluster managers, including Hadoop YARN and Apache Mesos. It is the component that deploys and launches executors when the driver starts. You configure the cluster manager in your Spark Context configuration.
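For example, the master URL you pass to SparkConf (or to spark-submit's --master option) determines which cluster manager runs your application; in Spark 1.x the common forms look like this (host names and thread counts are placeholders):

    new SparkConf().setMaster("local");               // run in-process with a single worker thread
    new SparkConf().setMaster("local[4]");            // run locally with four worker threads
    new SparkConf().setMaster("spark://host:7077");   // connect to a standalone Spark cluster
    new SparkConf().setMaster("mesos://host:5050");   // run on an Apache Mesos cluster
    new SparkConf().setMaster("yarn-client");         // run on Hadoop YARN in client mode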

The following steps summarize the execution model for distributed Spark applications:

1. The user submits an application, typically with the spark-submit script.
2. spark-submit launches the driver program, which invokes the application's main() method.
3. The driver contacts the cluster manager to request resources for executors.
4. The cluster manager launches executors on the worker machines.
5. The driver runs the application, sending tasks to the executors as transformations and actions are evaluated.
6. The executors run the tasks and save or return their results.
7. When main() exits (or the SparkContext is stopped), the executors are terminated and the cluster manager releases its resources.

We've reviewed the Spark programming model and seen how Spark applications are distributed across a Spark cluster. We'll conclude with a quick look at how Spark can be used to analyze different data sources. Figure 2 shows the logical components that make up the Spark stack.

At the center of the Spark stack is the Spark Core. The Spark Core contains all the basic functionality of Spark, including task scheduling, memory management, fault recovery, integration with storage systems, and so forth. It also defines the resilient distributed datasets (RDDs) that we saw in our WordCount example and the APIs to interact with RDDs, including all of the transformations and actions we explored in the previous section.

Four libraries are built on top of the Spark Core that allow you to analyze data from other sources:

- Spark SQL: for working with structured data and querying it with SQL
- Spark Streaming: for processing live streams of data
- MLlib: Spark's machine learning library
- GraphX: for graph processing and graph-parallel computation

So where do you go from here? As we have already seen, you can load data from a local file system, from a distributed file system such as HDFS, or from Amazon S3. Spark provides capabilities to read plain text, JSON, sequence files, protocol buffers, and more. Additionally, Spark allows you to read structured data through Spark SQL, and it allows you to read key/value data from data sources such as Cassandra, HBase, and Elasticsearch. For all of these data sources the process is the same: load the data into an RDD, apply transformations to shape it, and apply actions to derive your results.

As more small and large operations realize the benefits of big data, we are seeing an increase in solutions addressing specific problem domains for big data analysis. One of the challenges for big data is how to analyze collections of data distributed across different technology stacks. Apache Spark provides a computational engine that can pull data from multiple sources and analyze it using the common abstraction of resilient distributed datasets, or RDDs. Its core operation types include transformations, which massage your data and convert it from its source format into the form you want to analyze, and actions, which derive your business value.

In this article we've built a small, locally run Spark application whose purpose is to count words. You've practiced several different transformations and one save action, and had an overview of Spark's programming and execution models. I've also discussed Spark's support for running distributed across multiple machines, with drivers and executors coordinated by a cluster manager such as Hadoop YARN or Apache Mesos. Finally, I discussed the libraries built on top of Spark Core for querying structured data with SQL, processing streaming data sources, and performing machine learning and graph analysis.

We've only scratched the surface of what is possible with Spark, but I hope that this article has both inspired you and equipped you with the fundamentals to start analyzing Internet-scale collections of data using Spark.

(www.javaworld.com)

Steven Haines
