Use case:

MySQL schema:

table (user company table)
uid, company
========
1, tianji
2, tianji
3, tianji
4, ganji
5, ganji
6, ganji
7, ganji
8, 58
....

Aggregation:
select company,count(company) as num
from t_company group by company
having num>3 and num<=300
order by num desc;


Result:
company,num
===========
ganji,4

10 million rows, 800 MB; MySQL execution time: 2 minutes.
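
For a quick sanity check of the query logic, here is a minimal base-R sketch that applies the same count-and-filter to just the eight sample rows listed above. The data frame is rebuilt by hand for illustration and is not the real 10-million-row table; the variable names `counts` and `result` are my own.

```r
# Rebuild the eight sample rows by hand (illustration only).
t_company <- data.frame(
  uid = 1:8,
  company = c("tianji", "tianji", "tianji",
              "ganji", "ganji", "ganji", "ganji", "58"),
  stringsAsFactors = FALSE
)

# GROUP BY company with COUNT(company) AS num.
counts <- as.data.frame(table(company = t_company$company),
                        responseName = "num",
                        stringsAsFactors = FALSE)

# HAVING num > 3 AND num <= 300, then ORDER BY num DESC.
result <- counts[counts$num > 3 & counts$num <= 300, ]
result <- result[order(-result$num), ]

# On these eight rows only ganji (4 rows) passes num > 3.
print(result)
```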


R data processing
Read in the CSV (user company table)
1, tianji
2, tianji
3, tianji
4, ganji
5, ganji
6, ganji
7, ganji
8, 58



  library(plyr)  # provides ddply() and .()

  # Read the CSV (no header row) and name the two columns.
  file='comapng'
  companyData<-read.table(file=file, header = FALSE, sep=",", quote = "\"'",
             na.strings="NA",fileEncoding="utf-8",encoding="utf-8")
  names(companyData)<-c('uid','company')
  print(paste('Total Company =>',nrow(companyData)))

  # Count rows per company, then keep companies with 3 < count <= 300.
  nset<-ddply(companyData, .(company), "nrow")
  nset<-nset[which(nset$nrow<=300 & nset$nrow>3),]

  # Collect the row indices of every qualifying company, one company at a time.
  include<-c()
  for(i in seq_len(nrow(nset))){
    t<-which(companyData$company==nset$company[i])
    include<-c(include,t)
  }
  print(paste('Available Company =>',length(include)))
  companyData<-companyData[include,]

10 million rows, 800 MB, 1.5 GB of memory used; R execution time: 30+ minutes.


====================
Looking for a way to optimize this!!
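
One possible direction, as a minimal sketch rather than a definitive answer: the loop in the R code scans the whole `company` column once per qualifying company and regrows `include` on every pass, so its cost grows with both the row count and the number of companies. Counting once with `table()` and filtering with `%in%` avoids both. The helper name `filter_by_company_count` and its parameters are hypothetical (not from the original post), and the demo uses the eight sample rows instead of the real file.

```r
# Hypothetical helper: keep only the rows whose company occurs more than
# `min_excl` times and at most `max_incl` times.
filter_by_company_count <- function(companyData, min_excl = 3, max_incl = 300) {
  counts <- table(companyData$company)   # one pass: rows per company
  n <- as.vector(counts)
  keep <- names(counts)[n > min_excl & n <= max_incl]
  # One vectorized membership test instead of one which() per company.
  companyData[companyData$company %in% keep, , drop = FALSE]
}

# Demo on the sample rows from the post (illustration only).
companyData <- data.frame(
  uid = 1:8,
  company = c("tianji", "tianji", "tianji",
              "ganji", "ganji", "ganji", "ganji", "58"),
  stringsAsFactors = FALSE
)
filtered <- filter_by_company_count(companyData)
print(paste("Available Company =>", nrow(filtered)))
```

Unlike the loop, this keeps the rows in their original file order instead of grouping them by company; the set of rows kept is the same. When reading the real file, passing `colClasses` to `read.table` may also cut the load time, since R then does not have to guess the column types.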



2017-12-04