RHadoop Experiment: Counting Email Domain Occurrences

2017-12-04 08:00, Monday

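Requirements: using the rmr package from RHadoop, implement a MapReduce algorithm that:

1. Counts how many times each email domain occurs
2. Sorts the domains by count, from largest to smallest

For example: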

163.com,14
sohu.com,2
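
As a point of comparison, the target "domain,count" format can be produced in plain R without Hadoop. The following is only a minimal sketch: the three addresses are hypothetical and are not the tutorial's data set, and `out_lines` is an illustrative name.

```r
# Hypothetical sample addresses (not the tutorial's data set).
emails <- c("a@163.com", "b@163.com", "c@sohu.com")

# Keep everything after the '@' as the domain.
domains <- sub("^.*@", "", emails)

# Count each domain, then sort by count, largest first.
counts <- sort(table(domains), decreasing = TRUE)

# Format the result as "domain,count" lines.
out_lines <- paste(names(counts), as.integer(counts), sep = ",")
print(out_lines)
```

A baseline like this is handy for checking the MapReduce result on a small sample before running the job on a cluster.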

Experiment data:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Algorithm implementation

1. Count how many times each email domain occurs

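library(rmr2)
# Read the addresses, one per line, into a one-column data frame (column V1)
data <- read.table(file = "hadoop15.txt")
# Write the data to the DFS as key-value pairs; the constant key 1 is recycled for every row
d0 <- to.dfs(keyval(1, data))
# Read the pairs back to inspect them
from.dfs(d0)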

Output:

$key
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1
$val
V1
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
5 [email protected]
6 [email protected]
7 [email protected]
8 [email protected]
9 [email protected]
10 [email protected]
11 [email protected]
12 [email protected]
13 [email protected]
14 [email protected]
15 [email protected]
16 [email protected]
17 [email protected]
18 [email protected]
19 [email protected]
20 [email protected]
21 [email protected]
22 [email protected]
23 [email protected]
24 [email protected]
25 [email protected]
26 [email protected]
27 [email protected]
28 [email protected]
29 [email protected]
30 [email protected]
31 [email protected]
32 [email protected]
33 [email protected]
34 [email protected]
35 [email protected]
36 [email protected]
37 [email protected]
38 [email protected]
39 [email protected]
40 [email protected]
41 [email protected]


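library(stringr)  # provides word() and fixed(), used in the map function

mr <- function(input = d0) {
  # Map: emit (domain, 1) for every address; the domain is the part after '@'
  map <- function(k, v) {
    keyval(word(as.character(v$V1), 2, sep = fixed('@')), 1)
  }
  # Reduce: sum the counts emitted for each domain
  reduce <- function(k, v) {
    keyval(k, sum(v))
  }
  # Summation is associative, so the reducer can also run as the combiner
  mapreduce(input = input, map = map, reduce = reduce, combine = TRUE)
}
d1 <- mr(d0)
from.dfs(d1)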

Output:

$key
[1] "126.com" "163.com" "21cn.com" "gmail.com" "qq.com"
[6] "sina.com" "sohu.com" "yahoo.cn" "yahoo.com.cn"
$val
[1] 9 14 1 1 9 2 2 1 2
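
The three phases behind this result (map, group by key, reduce) can be imitated in base R to see what each one contributes. This is a rough sketch under stated assumptions: the addresses are hypothetical, `split` stands in for the shuffle that Hadoop performs between map and reduce, and the names `map_keys`, `map_vals` and `grouped` are illustrative only.

```r
# Hypothetical sample addresses (not the tutorial's data set).
emails <- c("u1@163.com", "u2@qq.com", "u3@163.com",
            "u4@126.com", "u5@qq.com", "u6@163.com")

# Map phase: emit one (domain, 1) pair per address.
map_keys <- sub("^.*@", "", emails)
map_vals <- rep(1, length(emails))

# Shuffle phase: group the emitted values by key.
grouped <- split(map_vals, map_keys)

# Reduce phase: sum the values within each group.
counts <- vapply(grouped, sum, numeric(1))
print(counts)
```

Because the reduce step is a plain sum, partial sums can be computed on each node first, which is what `combine=TRUE` allows in the rmr2 job.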

2. Sort by count, from largest to smallest


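# Sort the (domain, count) pairs by count in a second MapReduce job
sort_by_count <- function(input = d1) {
  # Map: send every (domain, count) row to the single key 1
  map <- function(k, v) {
    keyval(1, data.frame(k, v))
  }
  # Reduce: order the rows by count, largest first
  reduce <- function(k, v) {
    v2 <- v[order(as.integer(v$v), decreasing = TRUE), ]
    keyval(1, v2)
  }
  mapreduce(input = input, map = map, reduce = reduce, combine = TRUE)
}
d2 <- sort_by_count(d1)
result <- from.dfs(d2)
result$val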

Output:

k v
2 163.com 14
1 126.com 9
5 qq.com 9
6 sina.com 2
7 sohu.com 2
9 yahoo.com.cn 2
3 21cn.com 1
4 gmail.com 1
8 yahoo.cn 1
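
The ordering step can be tried outside Hadoop on the counts reported by the counting job earlier in this post. This is a small sketch in base R, assuming those counts are typed in by hand as a data frame with the same column names `k` and `v`; the variable names `counts` and `sorted` are illustrative.

```r
# Domain counts as reported by the counting job in step 1.
counts <- data.frame(
  k = c("126.com", "163.com", "21cn.com", "gmail.com", "qq.com",
        "sina.com", "sohu.com", "yahoo.cn", "yahoo.com.cn"),
  v = c(9, 14, 1, 1, 9, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Same ordering expression as the reducer: largest count first.
sorted <- counts[order(as.integer(counts$v), decreasing = TRUE), ]
print(sorted)
```

Routing every row to one key means a single reducer sees the whole table, which is fine for a handful of domains but would not scale to a very large key set.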

 
