Java, Python, Scala Compared (Part 3): WordCount
Published: 2019-06-21


As everyone knows, wordcount holds the same place in big data that helloworld holds in programming languages. This article does not analyze how wordcount is computed; it presents the code directly, in order to compare Java, Python, and Scala on Spark.

Clearly, the Java version is the most verbose, the Python version is simple and easy to follow, and Scala, as Spark's native language, is by far the most concise.
Complete Java code:

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class wordcount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wc");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a text file
        JavaRDD<String> text = sc.textFile("/home/vagrant/speech.txt");

        // Split each line on " "
        JavaRDD<String> words = text.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });

        // word => (word, 1)
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // reduceByKey
        JavaPairRDD<String, Integer> results = counts.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        // Print each (word:count) pair
        results.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println("(" + t._1() + ":" + t._2() + ")");
            }
        });
    }
}
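
Much of the Java verbosity above comes from pre-Java-8 anonymous inner classes. As a minimal sketch (assuming Java 8+ and the same Spark 2.x Java API as above; the class name WordCountLambda is made up for illustration), the same job can be written with lambdas, which narrows the gap with Python and Scala considerably:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountLambda {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wc");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Same pipeline as above: read, split on " ", pair each word with 1, sum by key
        JavaPairRDD<String, Integer> results = sc.textFile("/home/vagrant/speech.txt")
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((v1, v2) -> v1 + v2);

        // Print each (word:count) pair
        results.foreach(t -> System.out.println("(" + t._1() + ":" + t._2() + ")"));
    }
}

With lambdas, what remains of the extra Java length is mostly the class and main boilerplate plus the explicit types.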

Complete PySpark code:

# Imports the PySpark libraries
from pyspark import SparkConf, SparkContext

# Configure the Spark context to give a name to the application
sparkConf = SparkConf().setAppName("MyWordCounts")
sc = SparkContext(conf=sparkConf)

# The text file containing the words to count
textFile = sc.textFile('/home/vagrant/speech.txt')

# The code for counting the words (note that execution is lazy)
# Uses the same Map and Reduce paradigm as Hadoop, but fully in memory
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Executes the DAG (Directed Acyclic Graph) and collects the result
for wc in wordCounts.collect():
    print(wc)

Complete Scala code:

import org.apache.spark.{SparkContext, SparkConf}

object test {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("MyWordCounts")
    val sc = new SparkContext(sparkConf)
    sc.textFile("/home/vagrant/speech.txt").flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _).foreach(println)
  }
}


That's all for this post; comments and corrections are welcome.

