This article is about 2,979 characters long; reading it takes roughly 9 minutes.
As everyone knows, word count occupies the same place in big data that "Hello, World" occupies in programming languages. This article does not analyze the word-count algorithm itself; it simply presents the code, in order to compare Java, Python, and Scala on Spark.
Clearly, the Java version is the most verbose, Python is simple and readable, and Scala, as Spark's native language, is the most concise.

Complete Java code (the generic type parameters below were stripped by the original page's HTML and have been restored so the program compiles):

```java
import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class wordcount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wc");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // read a text file
        JavaRDD<String> text = sc.textFile("/home/vagrant/speech.txt");

        // split each line on spaces
        JavaRDD<String> words = text.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });

        // word => (word, 1)
        JavaPairRDD<String, Integer> counts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) throws Exception {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });

        // reduceByKey: sum the 1s per word
        JavaPairRDD<String, Integer> results = counts.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer v1, Integer v2) throws Exception {
                    return v1 + v2;
                }
            });

        // print each (word:count) pair
        results.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println("(" + t._1() + ":" + t._2() + ")");
            }
        });
    }
}
```
Complete PySpark code:
```python
# Imports the PySpark libraries
from pyspark import SparkConf, SparkContext

# Configure the Spark context to give a name to the application
sparkConf = SparkConf().setAppName("MyWordCounts")
sc = SparkContext(conf=sparkConf)

# The text file containing the words to count
textFile = sc.textFile('/home/vagrant/speech.txt')

# The code for counting the words (note that the execution mode is lazy)
# Uses the same Map and Reduce paradigm as Hadoop, but fully in memory
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)

# Executes the DAG (Directed Acyclic Graph) and collects the result
for wc in wordCounts.collect():
    print(wc)
```
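The flatMap → map → reduceByKey pipeline above can be sketched in plain Python, with no Spark installation, to show what the lazily built DAG actually computes. The `lines` list here is a hypothetical stand-in for the contents of speech.txt:

```python
# Hypothetical stand-in for the lines of speech.txt
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words and flatten into one list
words = [w for line in lines for w in line.split()]

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the 1s per word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"])  # 2
```

Spark performs the same three steps, but distributes them across partitions and only materializes results when an action such as `collect()` or `foreach()` runs.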
Complete Scala code:
```scala
import org.apache.spark.{SparkContext, SparkConf}

object test {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("MyWordCounts")
    val sc = new SparkContext(sparkConf)
    sc.textFile("/home/vagrant/speech.txt")
      .flatMap(_.split(' '))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreach(println)
  }
}
```