Spark RDD Partition and Spark RDD Architecture


Spark RDD Partition and Spark RDD Architecture: In our previous posts we discussed the Apache Spark introduction, the advantages and disadvantages of Apache Spark, and the most asked Apache Spark interview questions and answers for experienced candidates and freshers. Now we are discussing one of the main topics in Spark core: Spark RDD Partition and Spark RDD Architecture. In this post we will cover what a Spark RDD is, why we need Spark RDDs, the Spark RDD architecture, and Spark RDD partitions.

What is Spark RDD?

The full form of Spark RDD is Resilient Distributed Dataset (RDD). The RDD is the main feature of Spark core. The definition of a Spark RDD is simple: an RDD in Spark is an immutable distributed collection of objects. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
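
To make the idea of partitions concrete, here is a minimal sketch in Python (assuming a SparkContext named sc is already available, as in the examples later in this post) that creates an RDD with an explicit number of partitions and inspects how the elements are split across them:

# partition inspection sketch in Python (assumes an existing SparkContext named sc)
data = range(1, 101)

# ask Spark to split the collection into 4 partitions across the cluster
rdd = sc.parallelize(data, 4)

print(rdd.getNumPartitions())   # prints 4
print(rdd.glom().collect())     # shows the elements grouped per partition

Each partition is computed by a separate task, so the number of partitions controls how much parallelism Spark can use for that RDD.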

Why do we need Spark RDD?

Spark is much faster and more powerful than MapReduce because Spark is a cluster computing platform designed for fast and general-purpose processing. In-memory cluster computing is the main feature of Apache Spark. Iterative and interactive operations are very slow in MapReduce compared to Apache Spark. Both iterative and interactive applications require fast data sharing across parallel jobs, but data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce

Because the intermediate data is stored on disk, replication, serialization, and disk I/O all occur between iterations.

Iterative Operations on Apache Spark

In Spark, the intermediate data is kept in distributed memory (RAM) instead of being written to disk, and this is possible because of Spark RDDs.

Note: If the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), Spark will store those results on the disk.
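
As a rough sketch of how this in-memory data sharing helps iterative jobs (again assuming an existing SparkContext named sc; numbers.txt is a hypothetical file containing one integer per line), an RDD can be cached once and then reused by every iteration:

# iterative reuse sketch in Python (numbers.txt is a hypothetical input file)
numbers = sc.textFile("numbers.txt").map(lambda line: int(line))

# cache() keeps the RDD in distributed memory (RAM) after the first computation,
# so later iterations read it from memory instead of going back to disk
numbers.cache()

for i in range(3):
    # each reduce() is an action; without cache() every pass would re-read the file
    total = numbers.map(lambda x: x * (i + 1)).reduce(lambda a, b: a + b)
    print(total)

In MapReduce, the equivalent loop would write and read intermediate results through HDFS on every pass, which is exactly the overhead described above.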

Creating Spark RDD

Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection in your driver program. The simplest way to create an RDD is to take an existing collection in your program and pass it to SparkContext's parallelize() method. The parallelize() method is mostly useful for testing and learning purposes, but it is not practical for large datasets, because it requires the entire collection to fit in memory on one machine.

# parallelize() method in Python
lines = sc.parallelize(["pandas", "i like pandas"])

// parallelize() method in Scala
val lines = sc.parallelize(List("pandas", "i like pandas"))

// parallelize() method in Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));

A more common way to create RDDs is to load data from external storage. The method for that is SparkContext.textFile().

# textFile() method in Python
lines = sc.textFile("/path/to/README.md")

// textFile() method in Scala
val lines = sc.textFile("/path/to/README.md")

// textFile() method in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");

Spark RDD Operations

RDDs support two types of operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, such as map() and filter(). Actions are operations that return a result to the driver program or write it to storage, and they kick off a computation, such as count() and first(). Spark treats transformations and actions very differently, so understanding which type of operation you are performing is important. Below are examples of transformations and actions in Spark.

# filter() transformation in Python
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)

// filter() transformation in Scala
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))

// filter() transformation in Java
JavaRDD<String> inputRDD = sc.textFile("log.txt");
JavaRDD<String> errorsRDD = inputRDD.filter(
  new Function<String, Boolean>() {
    public Boolean call(String x) { return x.contains("error"); }
  }
);

In the above example, inputRDD is the parent RDD and errorsRDD is the new RDD derived from it. Here filter() is a transformation; transformations are lazy and only define a new RDD rather than computing anything immediately. The final result is computed only when an action is applied to the RDD.
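
As a small illustration of this lazy behaviour (reusing the log.txt example above), the transformations below return immediately without reading any data, and Spark only scans the file when the count() action runs:

# lazy evaluation sketch in Python
inputRDD = sc.textFile("log.txt")                      # returns immediately, file is not read yet
errorsRDD = inputRDD.filter(lambda x: "error" in x)    # also returns immediately

# only this action forces Spark to read log.txt and apply the filter
print(errorsRDD.count())

The next example applies the count() and take() actions to errorsRDD to report how many error lines were found and print a few of them.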

# Python error count using actions
print("Input had " + str(errorsRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in errorsRDD.take(10):
    print(line)

// Scala error count using actions
println("Input had " + errorsRDD.count() + " concerning lines")
println("Here are 10 examples:")
errorsRDD.take(10).foreach(println)

// Java error count using actions
System.out.println("Input had " + errorsRDD.count() + " concerning lines");
System.out.println("Here are 10 examples:");
for (String line: errorsRDD.take(10)) {
  System.out.println(line);
}

In this example, we used take() to retrieve a small number of elements from the RDD to the driver program. take() is one example of an action.

This is the main concept of Spark RDD Partition and Spark RDD Architecture. If you have any doubts, comment below.
