Spark needs the org.postgresql:postgresql driver library for the matching PostgreSQL version; we use org.postgresql:postgresql:9.4-1201-jdbc41. Through Cloudera Manager we set Spark's spark.jars.packages parameter, which currently includes the following libraries:
spark.jars.packages=org.apache.commons:commons-csv:1.2,com.databricks:spark-csv_2.10:1.4.0,org.mongodb.spark:mongo-spark-connector_2.10:1.0.0,org.postgresql:postgresql:9.4-1201-jdbc41
Reading data from PostgreSQL as a DataFrame via the JDBC library
df <- read.df(sqlContext, source="jdbc", url="jdbc:postgresql://Master:5432/r_test", dbtable="r_test_table", "driver"="org.postgresql.Driver", "user"="postgres", "password"="163700gf")
Here Master:5432 is the current PostgreSQL host and port, r_test is the test database, r_test_table is a table in r_test, and user and password are the database credentials.
Once we have df, we can print its contents:
> show(df)
DataFrame[id:int, name:string, count:int]
> collect(df)
id name count
1 1 a1 1
2 2 a2 2
3 3 a3 3
4 4 a4 4
Writing a DataFrame into PostgreSQL via the JDBC library
> write.df(df, path="NULL", source="jdbc", url="jdbc:postgresql://Master:5432/r_test", "dbtable"="r_test_table", "driver"="org.postgresql.Driver", "user"="postgres", "password"="163700gf", mode="append")
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:2027)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelIn
Appending the data back into the table raises an error. From the mailing-list archive this turns out to be a known Spark bug, already fixed in Spark 2.0.
This test run covers only the wordcount workload.
Via 99-user_defined_properties.conf in conf, we set hibench.scale.profile to gigantic, i.e. roughly 300 GB of test data.
Edit the benchmarks.lst file in the conf directory to keep only the single wordcount entry, then run bin/run-all.sh.
During execution, MapReduce showed its usual stability, with no exceptions at all over the whole run, just a long running time. On the Spark side, Scala and Java were quite stable, but Python repeatedly had executors shut down by YARN for hitting YARN's memory cap, which failed several stages and dragged down overall efficiency.
The cluster consists of 4 hosts in total, running Hadoop, PostgreSQL, MongoDB, Spark, Hive, and other services.
Before the cluster benchmark, some preparation is needed, such as getting tools and data ready.
Self-generated electricity-meter readings of 2 GB, 20 GB, and 200 GB:
Name,Date,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20,T21,T22,T23,T24
tpm1_R1-12-47-1_tm_1_7c3c3e39dd7a9d33,2006-01-01,3.1272327499999997,10.394874999999999,5.391951000000001,10.32085,10.383375000000001,10.424949999999999,0.8456075000000001,3.4217275000000003,3.3851625,3.3802950000000003,0.911879,0.90341175,0.87138125,3.2868905,0.82968875,5.8568282499999995,6.073462499999999,6.343275,7.5142075,6.383839999999999,8.80295,6.207680000000001,10.90155,3.2379179999999996
tpm1_R1-12-47-1_tm_2_7c3c3e39dd7a9d33,2006-01-01,0.41440199999999994,0.370113,0.35526250000000004,0.3458965,0.35339425,0.39913449999999995,0.53686825,0.6503465,0.61882925,0.6125167499999999,0.610577,0.6083605,1.8160477499999996,0.5580125,0.5445135,0.57932425,0.7016105,0.90579425,0.9704775,0.93271975,0.89949725,0.8396155,0.70478725,0.5901097499999999
tpm2_R1-12-47-1_tm_2_7c3c3e39dd7a9d33,2006-01-01,0.60541825,0.627177,0.5498865,0.6200257499999999,0.6303885,0.6933225000000001,0.8008432499999999,1.01453125,0.8872774999999999,0.9067015,0.9040405,0.8735145,0.8466045,0.8330110000000001,0.8138880000000001,0.8343885,1.0614575,1.3469475,1.43749,1.384125,1.3378824999999999,1.2823624999999998,1.065415,0.79980075
Make sure Maven is installed on the machine; if not, download the Maven binary tarball, extract it, and add its bin folder to the system PATH variable.
$ mvn --version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /opt/apache-maven
Java version: 1.8.0_77, vendor: Oracle Corporation
Java home: /usr/java/jdk1.8.0_77/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.13.0-83-generic", arch: "amd64", family: "unix"
Since our cluster sits behind a proxy, Maven's proxy has to be configured separately; otherwise the build cannot proceed.
$ mkdir ~/.m2
$ cp /opt/apache-maven/conf/settings.xml ~/.m2/settings.xml # Maven lives in /opt/apache-maven
$ vi ~/.m2/settings.xml
# Find the proxies section, uncomment it, and fill in your proxy details.
Now download the HiBench source package. I downloaded the packaged code of release 5.0 directly; the download link is here. After downloading and extracting, enter the src directory to build:
$ cd HiBench-HiBench-5.0/src
$ mvn clean package -D spark1.6 -D MR2 # set the Spark and MapReduce versions, then start the build
After the build finishes, configure HiBench:
$ cd conf
$ cp 99-user_defined_properties.conf.template 99-user_defined_properties.conf
Configure it according to the Hadoop and Spark versions in our cluster:
hibench.hadoop.home /opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop
hibench.spark.home /opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark
hibench.hdfs.master hdfs://Slave2:8020
hibench.spark.master yarn-client
hibench.hadoop.executable /opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/bin/hadoop
hibench.hadoop.version hadoop2
hibench.hadoop.release cdh5
hibench.hadoop.mapreduce.home /opt/cloudera/parcels/CDH/jars
hibench.spark.version spark1.6
Also set the memory and related parameters for MapReduce and Spark.
In Cloudera Manager, find the Spark service and open its Configuration page. In the filters, choose Advanced under Category, find the item "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf", and add the following:
spark.jars.packages=com.databricks:spark-csv_2.10:1.4.0,org.mongodb.spark:mongo-spark-connector_2.10:1.0.0,org.postgresql:postgresql:9.4-1201-jdbc41
Here I added the spark-csv, mongo-spark-connector, and PostgreSQL JDBC libraries.
After redeploying the modified configuration, start e.g. spark-shell in a terminal and you can watch the dependencies being resolved automatically:
free@Slave2:~$ spark-shell
Ivy Default Cache set to: /home/free/.ivy2/cache
The jars for the packages stored in: /home/free/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
org.mongodb.spark#mongo-spark-connector_2.10 added as a dependency
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.4.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
found org.mongodb.spark#mongo-spark-connector_2.10;1.0.0 in central
found org.mongodb#mongo-java-driver;3.2.2 in central
found org.postgresql#postgresql;9.4-1201-jdbc41 in central
downloading https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.4.0/spark-csv_2.10-1.4.0.jar ...
Since our cluster is on a LAN and reaches the Internet through an HTTP/HTTPS proxy, I tried the following commands:
export http_proxy=<proxyHost>:<proxyPort>
export https_proxy=<proxyHost>:<proxyPort>
export JAVA_OPTS="-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>"
None of them had any effect. In the end I found the proxy can be made to work via the spark.driver.extraJavaOptions setting, e.g.:
spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>"
Alternatively, in Cloudera Manager find the Spark service, open its Configuration page, choose Advanced under Category in the filters, find "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf", and add the following:
spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>
spark.executor.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>
Then redeploy the configuration.
The stock Cloudera CDH does not include the Zeppelin component; deploying it requires building and installing it yourself.
This post records my process for installing Zeppelin with SparkR support into an already-deployed CDH cluster, for reference only.
All operations in this post were performed on Ubuntu 14.04 Trusty, using Cloudera CDH 5.7.1 as the Hadoop cluster version.
Hadoop 2.6.0-cdh5.7.1
Cloudera Spark 1.6.0, self-compiled with SparkR support added.
The build can be done on another machine; I built on a VM that also runs Ubuntu 14.04 Trusty.
Building requires JDK, Git, Maven, Node.js, and npm.
Make sure JDK and Git are installed; I used:
$ git version
git version 1.9.1
$ javac -version
javac 1.8.0_91
Install Maven 3.3.9 (skip this if you already have it). The version from apt-get install maven was too old, so I installed a copy from the binary distribution instead:
$ curl -OL http://mirror.olnevhost.net/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
$ tar -zxf apache-maven-3.3.9-bin.tar.gz -C /usr/local/
$ ln -s /usr/local/apache-maven-3.3.9/bin/mvn /usr/bin/mvn
$ mvn -v
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /usr/local/apache-maven-3.3.9
Java version: 1.8.0_91, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.13.0-32-generic", arch: "amd64", family: "unix"
Fetch the Zeppelin source from the GitHub repo and switch to the branch of the corresponding version:
$ git clone https://github.com/apache/incubator-zeppelin.git
$ git checkout branch-0.6 # the latest release is 0.6, so we use that
# I forgot to switch branches at first and kept building master, hitting all kinds of errors
$ git pull # make sure the code is up to date
Determine the Hadoop and Spark version numbers by logging into any machine in the CDH cluster:
$ hadoop version
Hadoop 2.6.0-cdh5.7.1
Subversion http://github.com/cloudera/hadoop -r ae44a8970a3f0da58d82e0fc65275fff8deabffd
Compiled by jenkins on 2016-06-01T23:26Z
Compiled with protoc 2.5.0
From source with checksum 298b68dc3b308983f04cb37e8416f13
This command was run using /opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/hadoop-common-2.6.0-cdh5.7.1.jar
$ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
My Hadoop version is 2.6.0-cdh5.7.1 and my Spark version is 1.6.0, so everything is ready and the build can begin.
Every build flag is documented here; the main ones we need to pay attention to are:
$ mvn clean package -Pbuild-distr -Pyarn -Pspark-1.6 -Dspark.version=1.6.0 -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.1 -Ppyspark -Psparkr -Pvendor-repo -DskipTests
The build takes a while. If it fails partway, fix the problem according to the logs and resume with -rf :step. For example, I got stuck at the :zeppelin-web step because of network problems, so I could resume the build with mvn clean package -Pbuild-distr -Pyarn -Pspark-1.6 -Dspark.version=1.6.0 -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.1 -Ppyspark -Psparkr -Pvendor-repo -DskipTests -rf :zeppelin-web.
After the build, the zeppelin-distribution/target directory contains the archive zeppelin-0.6.1-SNAPSHOT.tar.gz (the version in the file name differs for other versions). Copy that file to a node of the CDH cluster.
Extract Zeppelin into the chosen directory:
$ tar zxf zeppelin-0.6.1-SNAPSHOT.tar.gz -C /opt/
$ mv /opt/zeppelin-0.6.1-SNAPSHOT /opt/zeppelin
Configure Zeppelin:
$ mv /opt/zeppelin/conf /etc/zeppelin/conf
$ cd /opt/zeppelin
$ ln -s /etc/zeppelin/conf conf
$ cd /etc/zeppelin/conf
$ cp zeppelin-env.sh{.template,}
$ cp zeppelin-site.xml{.template,}
Edit the zeppelin-env.sh file so it contains the following:
export JAVA_HOME=/usr/java/jdk1.8.0_77
export MASTER=yarn-client
export ZEPPELIN_JAVA_OPTS="-Dmaster=yarn-client -Dspark.yarn.jar=/opt/zeppelin/interpreter/spark/zeppelin-spark-0.6.1-SNAPSHOT.jar"
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n "$HADOOP_HOME" ]; then
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
export ZEPPELIN_LOG_DIR=/var/log/zeppelin
export ZEPPELIN_PID_DIR=/var/run/zeppelin
export ZEPPELIN_WAR_TEMPDIR=/var/tmp/zeppelin
Check that each path in the configuration is correct, then create the corresponding directories:
$ mkdir /var/log/zeppelin
$ mkdir /var/run/zeppelin
$ mkdir /var/tmp/zeppelin
Create a user for Zeppelin and fix up the related path permissions:
$ useradd zeppelin
$ chown -R zeppelin:zeppelin /opt/zeppelin/notebook
$ chown zeppelin:zeppelin /etc/zeppelin/conf/interpreter.json
$ chown -R zeppelin:zeppelin /var/log/zeppelin
$ chown -R zeppelin:zeppelin /var/run/zeppelin
$ chown -R zeppelin:zeppelin /var/tmp/zeppelin
$ su hdfs
$ hadoop fs -mkdir /user/zeppelin # create the user's HDFS directory
$ hadoop fs -chmod 777 /user/zeppelin
All set; start Zeppelin:
$ su zeppelin
$ cd /opt/zeppelin/
$ bin/zeppelin-daemon.sh start
Once it is up, open port 8080 on that node in a browser.
Before using SparkR, install the R package knitr:
$ R
> install.packages('knitr', dependencies = TRUE)
Open Zeppelin's R Tutorial notebook and start testing.
Other tests may fail with missing-package errors; install the packages yourself from the R shell.
As mentioned in the previous post, the Spark officially released for CDH does not include some special components, such as SparkR and the Thrift server. To make Spark support them, the CDH version of Spark must be rebuilt.
Setting up a Spark build environment locally is tedious, and according to the official documentation, Spark built on some RedHat systems can even be faulty, so we use a ready-made VM with the build environment baked in.
If you are in China, a connection that can get through the firewall is required so that all needed components download smoothly.
On your local machine you need the following tools:
You can install each required component from the links above.
Clone the VM definition I mentioned, vagrant-sparkbuilder, from its GitHub repo:
$ git clone https://github.com/teamclairvoyant/vagrant-sparkbuilder.git
$ cd vagrant-sparkbuilder
Bring up the Vagrant VM instance. It contains CentOS 7.0 and installs a Puppet agent, which provisions the basic build environment (Oracle Java and Cloudera Spark):
$ vagrant up
This step takes some time; please be patient.
Once it has started successfully, you can log into the VM:
$ vagrant ssh
Inside the VM, switch to the Spark directory:
$ cd spark
During provisioning the Cloudera Spark repo was already cloned; we only need to check out the branch or tag of the desired version to start building:
$ git checkout cdh5-1.6.0_5.7.1
$ git pull # make sure the code is up to date
Since I need the SparkR component, the R environment must be installed before building:
$ sudo yum -y -e1 -d1 install epel-release
$ sudo yum -y -e1 -d1 install R
Once that is installed, the build can start; depending on configuration it takes roughly 10 to 20 minutes.
The following commands build a Cloudera Spark that includes SparkR and the Thrift server:
$ patch -p0 </vagrant/undelete.patch # patch make-distribution.sh so the build output keeps SparkR and friends
$ ./make-distribution.sh -DskipTests \
-Dhadoop.version=2.6.0-cdh5.7.1 \
-Phadoop-2.6 \
-Pyarn \
-Psparkr \
-Phive \
-Pflume-provided \
-Phadoop-provided \
-Phbase-provided \
-Phive-provided \
-Pparquet-provided \
-Phive-thriftserver
After a successful build, sync the artifacts back to our machine:
$ rsync -a dist/ /vagrant/dist-cdh5.7.1-nodeps
If you need to build another CDH version afterwards, remember to discard our earlier change to make-distribution.sh:
$ git checkout -- make-distribution.sh
When everything is done, exit the SSH session and shut down the Vagrant VM:
$ vagrant halt
$ vagrant destroy # destroy it, if you want
If the build fails with an out-of-memory error (Cannot allocate memory), edit the Vagrantfile, raise v.memory and v.cpus, and bring the VM back up with vagrant up.
All operations in this post were performed on Ubuntu 14.04 Trusty, using Cloudera CDH 5.7.1 as the Hadoop cluster version.
Hadoop 2.6.0-cdh5.7.1
Spark 1.6.0
Cloudera officially states that CDH 5.7.1 still does not support some Spark components, such as SparkR, and the discussion threads on the official forum give no concrete timeline for a fix.
Building on this blog post, this article experiments with deploying an R environment and sparkR in a CDH cluster.
Make sure the following steps are executed on every node of the cluster.
This section follows that blog post, though my process differs slightly.
First pick a suitable CRAN mirror by location at https://cran.r-project.org/mirrors.html; I chose Tsinghua University's https://mirrors.tuna.tsinghua.edu.cn/CRAN/bin/linux/ubuntu
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
LVERSION=`lsb_release -c | cut -c 11-` #trusty
echo deb https://mirrors.tuna.tsinghua.edu.cn/CRAN/bin/linux/ubuntu $LVERSION/ | sudo tee -a /etc/apt/sources.list #deb https://mirrors.tuna.tsinghua.edu.cn/CRAN/bin/linux/ubuntu trusty/
sudo apt-get update
sudo apt-get install r-base -y #r-base package
sudo apt-get install r-base-dev -y #r-base-dev package
Once installed, test whether R works properly:
$ R
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> q()
Configure rJava:
$ su # use root (in my environment sudo R CMD javareconf -e did not work)
$ R CMD javareconf -e
Java interpreter : /usr/java/jdk1.8.0_77/jre/bin/java
Java version : 1.8.0_77
Java home path : /usr/java/jdk1.8.0_77
Java compiler : /usr/java/jdk1.8.0_77/bin/javac
Java headers gen.: /usr/java/jdk1.8.0_77/bin/javah
Java archive tool: /usr/java/jdk1.8.0_77/bin/jar
trying to compile and link a JNI program
detected JNI cpp flags : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
detected JNI linker flags : -L$(JAVA_HOME)/jre/lib/amd64/server -ljvm
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/java/jdk1.8.0_77/include -I/usr/java/jdk1.8.0_77/include/linux -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c conftest.c -o conftest.o
gcc -std=gnu99 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o conftest.so conftest.o -L/usr/java/jdk1.8.0_77/jre/lib/amd64/server -ljvm -L/usr/lib/R/lib -lR
The following Java variables have been exported:
JAVA_HOME JAVA JAVAC JAVAH JAR JAVA_LIBS JAVA_CPPFLAGS JAVA_LD_LIBRARY_PATH
Running: /bin/bash
$ export LD_LIBRARY_PATH=$JAVA_LD_LIBRARY_PATH
$ R
> install.packages("rJava")
Make sure the following steps are executed on every node of the cluster.
From the Spark download page, get the same release as the Spark bundled with CDH; I downloaded spark-1.6.0-bin-hadoop2.6.
Extract it:
tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
cd spark-1.6.0-bin-hadoop2.6
Check the file permissions inside CDH:
$ ll /opt/cloudera/parcels/CDH/lib/spark/
total 80
drwxr-xr-x 9 root root 4096 6月 2 08:20 ./
drwxr-xr-x 37 root root 4096 6月 2 08:23 ../
drwxr-xr-x 3 root root 4096 6月 2 07:48 assembly/
drwxr-xr-x 2 root root 4096 6月 2 07:48 bin/
drwxr-xr-x 2 root root 4096 6月 2 07:48 cloudera/
lrwxrwxrwx 1 root root 15 6月 2 07:48 conf -> /etc/spark/conf/
drwxr-xr-x 3 root root 4096 6月 2 07:48 examples/
drwxr-xr-x 2 root root 4096 6月 2 08:24 lib/
-rw-r--r-- 1 root root 17352 6月 2 07:48 LICENSE
-rw-r--r-- 1 root root 23529 6月 2 07:48 NOTICE
drwxr-xr-x 6 root root 4096 6月 2 07:48 python/
-rw-r--r-- 1 root root 0 6月 2 07:48 RELEASE
drwxr-xr-x 2 root root 4096 6月 2 07:48 sbin/
lrwxrwxrwx 1 root root 19 6月 2 07:48 work -> /var/run/spark/work
Copy the R folder into /opt/cloudera/parcels/CDH/lib/spark/:
sudo cp -R R /opt/cloudera/parcels/CDH/lib/spark/R
Then check again that the permissions are correct:
ll /opt/cloudera/parcels/CDH/lib/spark/
total 84
drwxr-xr-x 10 root root 4096 7月 13 14:52 ./
drwxr-xr-x 37 root root 4096 6月 2 08:23 ../
drwxr-xr-x 3 root root 4096 6月 2 07:48 assembly/
drwxr-xr-x 2 root root 4096 6月 2 07:48 bin/
drwxr-xr-x 2 root root 4096 6月 2 07:48 cloudera/
lrwxrwxrwx 1 root root 15 6月 2 07:48 conf -> /etc/spark/conf/
drwxr-xr-x 3 root root 4096 6月 2 07:48 examples/
drwxr-xr-x 2 root root 4096 6月 2 08:24 lib/
-rw-r--r-- 1 root root 17352 6月 2 07:48 LICENSE
-rw-r--r-- 1 root root 23529 6月 2 07:48 NOTICE
drwxr-xr-x 6 root root 4096 6月 2 07:48 python/
drwxr-xr-x 3 root root 4096 7月 13 14:52 R/
-rw-r--r-- 1 root root 0 6月 2 07:48 RELEASE
drwxr-xr-x 2 root root 4096 6月 2 07:48 sbin/
lrwxrwxrwx 1 root root 19 6月 2 07:48 work -> /var/run/spark/work
Back up the bin and sbin directories of the Spark inside CDH:
sudo cp -R /opt/cloudera/parcels/CDH/lib/spark/bin /opt/cloudera/parcels/CDH/lib/spark/bin.bak
sudo cp -R /opt/cloudera/parcels/CDH/lib/spark/sbin /opt/cloudera/parcels/CDH/lib/spark/sbin.bak
Overwrite them with the bin and sbin from the downloaded Spark:
sudo cp bin/* /opt/cloudera/parcels/CDH/lib/spark/bin/
sudo cp sbin/* /opt/cloudera/parcels/CDH/lib/spark/sbin/
Set up the local environment variables:
source /etc/spark/conf/spark-env.sh
Switch to the right user and sparkR can be started:
$ su hdfs
$ sparkR
.............................
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Spark context is available as sc, SQL context is available as sqlContext
> x <- 0
> x
[1] 0
We can test one of the examples bundled with the Spark source to see the effect:
ml.R
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# To run this example use
# ./bin/sparkR examples/src/main/r/ml.R
# Load SparkR library into your R session
library(SparkR)
# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-ML-example")
sqlContext <- sparkRSQL.init(sc)
# Train GLM of family 'gaussian'
training1 <- suppressWarnings(createDataFrame(sqlContext, iris))
test1 <- training1
model1 <- glm(Sepal_Length ~ Sepal_Width + Species, training1, family = "gaussian")
# Model summary
summary(model1)
# Prediction
predictions1 <- predict(model1, test1)
head(select(predictions1, "Sepal_Length", "prediction"))
# Train GLM of family 'binomial'
training2 <- filter(training1, training1$Species != "setosa")
test2 <- training2
model2 <- glm(Species ~ Sepal_Length + Sepal_Width, data = training2, family = "binomial")
# Model summary
summary(model2)
# Prediction (Currently the output of prediction for binomial GLM is the indexed label,
# we need to transform back to the original string label later)
predictions2 <- predict(model2, test2)
head(select(predictions2, "Species", "prediction"))
# Stop the SparkContext now
sparkR.stop()
Submit it to YARN with spark-submit:
$ spark-submit --master=yarn --num-executors 4 ml.R
If it runs to completion without errors you can see the results, and the job also shows up in YARN's web UI. To roll the changes out to every node, I used a script like the following:
nodes="`cat /etc/hosts | grep -i node | awk '{print $2}'`"
for target in $nodes; do
echo $target "=================="
scp -r /home/free/spark-1.6.0-bin-hadoop2.6/R root@$target:/opt/cloudera/parcels/CDH/lib/spark/R
done
for target in $nodes; do
echo $target "=================="
ssh -o StrictHostKeyChecking=no root@$target "cp -R /opt/cloudera/parcels/CDH/lib/spark/bin /opt/cloudera/parcels/CDH/lib/spark/bin.bak"
done
for target in $nodes; do
echo $target "=================="
ssh -o StrictHostKeyChecking=no root@$target "cp -R /opt/cloudera/parcels/CDH/lib/spark/sbin /opt/cloudera/parcels/CDH/lib/spark/sbin.bak"
done
for target in $nodes; do
echo $target "=================="
scp -r /home/free/spark-1.6.0-bin-hadoop2.6/bin root@$target:/opt/cloudera/parcels/CDH/lib/spark/bin
done
for target in $nodes; do
echo $target "=================="
scp -r /home/free/spark-1.6.0-bin-hadoop2.6/sbin root@$target:/opt/cloudera/parcels/CDH/lib/spark/sbin
done
for target in $nodes; do
echo $target "=================="
ssh -o StrictHostKeyChecking=no root@$target "mv /opt/cloudera/parcels/CDH/lib/spark/bin/bin/* /opt/cloudera/parcels/CDH/lib/spark/bin/"
done
for target in $nodes; do
echo $target "=================="
ssh -o StrictHostKeyChecking=no root@$target "mv /opt/cloudera/parcels/CDH/lib/spark/sbin/sbin/* /opt/cloudera/parcels/CDH/lib/spark/sbin/"
done
for target in $nodes; do
echo $target "=================="
ssh -o StrictHostKeyChecking=no root@$target "rm -r /opt/cloudera/parcels/CDH/lib/spark/bin/bin/"
done
for target in $nodes; do
echo $target "=================="
ssh -o StrictHostKeyChecking=no root@$target "rm -r /opt/cloudera/parcels/CDH/lib/spark/sbin/sbin/"
done
To avoid compatibility problems between the release from the official site and Cloudera CDH, I later rebuilt the matching Spark version from the Cloudera Spark source; see this post for the detailed process.
http://www.slideshare.net/perlcareers/how-to-write-a-developer-cvrsum-that-will-get-you-hired
Simply put, interface{} refers to two things:
In Go, interface{} usually denotes an empty interface, that is, an interface with no methods at all.
Unlike Java, where an interface is taken on explicitly with a keyword such as implements, in Go every type, even one with no methods, automatically satisfies the empty interface (the interface{} mentioned in the previous paragraph).
So if a function takes interface{} as a parameter, that parameter accepts a value of any type.
Suppose we have a function defined as follows:
func DoSomething(v interface{}) {
// ...
}
Then inside the DoSomething function, what is the type of v?
"v is of any type."
But that view is clearly wrong: v is not of any type, it is of type interface{}.
When DoSomething is called and an argument is passed to it, the Go runtime performs a type conversion at the necessary point and converts the passed value into a value of type interface{}. So from the runtime's point of view, every value has a single type, and interface{} is the static type of v.
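To make this concrete, here is a minimal runnable sketch (my own illustration, not from the referenced article): the static type of v is always interface{}, while a type switch recovers the dynamic type stored inside it:

package main

import "fmt"

func DoSomething(v interface{}) {
    // The static type of v is interface{}; the type switch below
    // inspects the dynamic type carried inside the interface value.
    switch x := v.(type) {
    case int:
        fmt.Println("an int:", x)
    case string:
        fmt.Println("a string:", x)
    default:
        fmt.Printf("some other type: %T\n", x)
    }
}

func main() {
    DoSomething(42)      // the int is converted to interface{} at the call site
    DoSomething("hello") // so is the string
}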
To understand this further, I consulted parts of Go Data Structures: Interfaces.
Suppose an interface is defined as follows:
type Stringer interface {
String() string
}
Interface values are represented as a two-word pair giving a pointer to information about the type stored in the interface and a pointer to the associated data.
Assigning b to an interface value of type Stringer sets both words of the interface value.
That is, a value of an interface type contains two words.
The first word in the interface value points at what I call an interface table or itable.
The first word points at what the article calls an interface table, or itable: it begins with some metadata about the types involved, followed by a list of pointers to the methods.
Note that the itable corresponds to the interface type, not the dynamic type.
So the itable contains only the methods related to the interface type, not all the methods of the dynamic type.
In other words, even though the itable of the Stringer type here was built for the Binary type, its method list contains only String; Binary's other methods (such as Get) do not appear in the itable.
The second word in the interface value points at the actual data.
The second word of the interface value is a pointer to the memory holding the value itself; the Go runtime allocates fresh memory to store a copy of the concrete value. So when we declare s with var s Stringer = b, the contents of b are copied rather than pointing directly at b's memory, and modifying b afterwards does not change s.
Usually the value stored in an interface can be very large, yet the interface type spends only one word on it: Go allocates a big block of heap memory for the value and uses that one word to record its pointer.
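A minimal sketch of that copy behavior, using a Binary type modeled on the one in the article (the concrete values are my own):

package main

import "fmt"

type Stringer interface {
    String() string
}

// Binary is modeled on the type from "Go Data Structures: Interfaces".
type Binary uint64

func (b Binary) String() string { return fmt.Sprintf("%b", b) }

func main() {
    b := Binary(6)
    var s Stringer = b      // the interface value stores a *copy* of b
    b = Binary(7)           // reassigning b does not touch that copy
    fmt.Println(s.String()) // prints 110, not 111
}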
A positive integer is called a palindrome if its representation in the decimal system is the same when read from left to right and from right to left. For a given positive integer K of not more than 1000000 digits, write the value of the smallest palindrome larger than K to output. Numbers are always displayed without leading zeros.
The first line contains integer t, the number of test cases. Integers K are given in the next t lines.
For each K, output the smallest palindrome larger than K.
Input:
2
808
2133
Output:
818
2222
Time limit: 2s-9s
Source limit: 50000B
Memory limit: 1536MB
For some reason it keeps timing out, so I am leaving it at this for now. Because I wrote a number of test cases, the code may contain some odd-looking functions.
The rough idea: first build a palindrome from the first half of the string, then handle the carry. The carry adds one to the middle digit (or middle pair of digits); any '9' there becomes '0' and one is carried into the next digit (or pair) outward.
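As a standalone sketch of that mirror-then-carry idea (my own simplified version, not the submitted solution below), assuming decimal strings without leading zeros:

package main

import "fmt"

// nextPalindrome returns the smallest palindrome strictly larger than s.
func nextPalindrome(s string) string {
    n := len(s)
    d := []byte(s)
    for i := 0; i < n/2; i++ {
        d[n-1-i] = d[i] // mirror the left half onto the right half
    }
    if string(d) > s { // the mirrored value may already be large enough
        return string(d)
    }
    i := (n - 1) / 2
    for i >= 0 && d[i] == '9' { // carry: 9s around the middle become 0s
        d[i], d[n-1-i] = '0', '0'
        i--
    }
    if i < 0 { // all nines: 999 -> 1001
        out := make([]byte, n+1)
        for j := range out {
            out[j] = '0'
        }
        out[0], out[n] = '1', '1'
        return string(out)
    }
    d[i]++ // add one to the middle digit (or pair) and re-mirror it
    d[n-1-i] = d[i]
    return string(d)
}

func main() {
    fmt.Println(nextPalindrome("808"))  // 818
    fmt.Println(nextPalindrome("2133")) // 2222
}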
Go code:
package main
import (
"fmt"
"strconv"
)
var t, l int
var k [1000001]int
func IsPalindrome(n string) bool {
s := n
l := len(s)
if n == "" || n[0] == '0' {
return false
}
var left, right string
switch {
case l%2 == 0:
left = s[:l/2]
right = s[l/2:]
case l%2 != 0:
left = s[:l/2]
right = s[l/2+1:]
}
for i, j := 0, len(right)-1; i < len(left); i, j = i+1, j-1 {
if left[i] != right[j] {
return false
}
}
return true
}
func GetNumberString(s, e int) string {
ret := ""
for i := s; i < e; i++ {
if i == 0 && k[i] == 0 {
continue
}
ret = ret + strconv.Itoa(k[i])
}
return ret
}
func NextPalindrome_(str string) string {
var c rune
l = 1
k[0] = 0
for _, c = range str {
k[l] = int(c) - '0'
l++
}
return NextPalindrome()
}
func NextPalindrome() string {
var str = ""
var i, j int
var lift bool = true
if l == 2 {
switch k[1] {
case 9:
return "11"
default:
return strconv.Itoa(k[1] + 1)
}
}
for i = 1; i <= l/2; i++ {
j = l - i
if k[i] > k[j] {
lift = false
} else if k[i] < k[j] {
lift = true
}
k[j] = k[i]
}
// fmt.Println(GetNumberString(0, l))
i--
if lift {
if k[j] == 9 && k[i] == 9 {
for k[j] == 9 && k[i] == 9 {
k[j], k[i] = 0, 0
i--
j++
}
if i == 0 {
str = GetNumberString(0, l)
if str[0] != '0' {
return str
}
}
}
if i == 0 {
j--
}
if i != j {
k[i]++
k[j]++
} else {
k[i]++
}
}
str = GetNumberString(0, l)
return str
}
func main() {
fmt.Scanln(&t)
for i := 0; i < t; i++ {
var c rune
var str string
l = 1
k[0] = 0
fmt.Scanln(&str)
for _, c = range str {
k[l] = int(c) - '0'
l++
}
fmt.Print(NextPalindrome())
if t-i != 1 {
fmt.Println()
}
}
}
Still over the time limit…
Kostya likes the number 4 very much. Of course! This number has such a lot of properties, like:
Impressed by the power of this number, Kostya has begun to look for occurrences of four anywhere. He has a list of T integers, for each of them he wants to calculate the number of occurrences of the digit 4 in the decimal representation. He is too busy now, so please help him.
The first line of input consists of a single integer T, denoting the number of integers in Kostya’s list.
Then, there are T lines, each of them contain a single integer from the list.
Output T lines. Each of these lines should contain the number of occurences of the digit 4 in the respective integer from Kostya’s list.
1 ≤ T ≤ 10^5
(Subtask 1): 0 ≤ Numbers from the list ≤ 9 - 33 points
(Subtask 2): 0 ≤ Numbers from the list ≤ 10^9 - 67 points
Input
5
447474
228
6664
40
81
Output:
4
0
1
1
0
Too easy… treat the input as a string and count how many '4' characters it contains.
Go solution:
package main
import (
"fmt"
)
var i, j, t int
var number string
var answers []int
func main() {
fmt.Scanln(&t)
answers = make([]int, t)
for i = 0; i < t; i++ {
fmt.Scanln(&number)
answers[i] = 0
for j = 0; j < len(number); j++ {
if number[j] == '4' {
answers[i]++
}
}
}
for i = 0; i < t; i++ {
fmt.Println(answers[i])
}
}
Transform the algebraic expression with brackets into RPN form (Reverse Polish Notation). Two-argument operators: +, -, *, /, ^ (priority from the lowest to the highest), brackets ( ). Operands: only letters: a,b,…,z. Assume that there is only one RPN form (no expressions like a*b*c).
t [the number of expressions <= 100]
expression [length <= 400]
[other expressions]
Text grouped in [ ] does not appear in the input file.
The expressions in RPN form, one per line.
Input:
3
(a+(b*c))
((a+b)*(z+x))
((a+t)*((b+(a+c))^(c+d)))
Output:
abc*+
ab+zx+*
at+bac++cd+^*
Time limit: 5s
Source limit: 50000B
Memory limit: 1536MB
Very simple: handle the brackets recursively and output operands and operators in the right order.
Go code:
package main
import (
"fmt"
)
var i, j, t, pos int
var experssion string
var expressions []string
func transform_and_print(result string) string {
if experssion[pos] == '(' {
pos++
result = result + transform_and_print("")
operator := experssion[pos]
pos++
result = result + transform_and_print("")
result = result + string(operator)
pos++
} else {
result = result + string(experssion[pos])
pos++
}
return result
}
func Transform_to_rpn(exp string) string {
experssion = exp
pos = 0
result := ""
result = transform_and_print(result)
return result
}
func main() {
fmt.Scanln(&t)
expressions = make([]string, t)
for i = 0; i < t; i++ {
fmt.Scanf("%s\n", &expressions[i])
}
for i = 0; i < t; i++ {
fmt.Println(Transform_to_rpn(expressions[i]))
}
}
TIME 0.01
MEMORY 771M (this must be a bug: every Go submission on SPOJ starts from 771M of reported memory, so real usage should be under 1M)
Peter wants to generate some prime numbers for his cryptosystem. Help him! Your task is to generate all prime numbers between two given numbers!
The input begins with the number t of test cases in a single line (t<=10). In each of the next t lines there are two numbers m and n (1 <= m <= n <= 1000000000, n-m<=100000) separated by a space.
For every test case print all prime numbers p such that m <= p <= n, one number per line, test cases separated by an empty line.
Input:
2
1 10
3 5
Output:
2
3
5
7
3
5
Time limit: 6s
Source limit: 50000B
Memory limit: 1536MB
A problem about primes. The difficulty is generating primes for multiple ranges, so time complexity is the concern (with a 1.5 GB memory limit, space is essentially never exceeded). I chose the Sieve of Eratosthenes to generate the primes. To avoid rebuilding the sieve array over and over, we sieve only up to sqrt(n) for the largest n, and index each range's values by its own offset to keep the memory footprint as small as possible.
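As a standalone sketch of that idea (my own simplified version, not the submission below): sieve the base primes up to sqrt(n), then cross out their multiples inside each segment [m, n]:

package main

import (
    "fmt"
    "math"
)

// simpleSieve returns all primes up to limit (classic Sieve of Eratosthenes).
func simpleSieve(limit int64) []int64 {
    composite := make([]bool, limit+1)
    var primes []int64
    for i := int64(2); i <= limit; i++ {
        if !composite[i] {
            primes = append(primes, i)
            for j := i * i; j <= limit; j += i {
                composite[j] = true
            }
        }
    }
    return primes
}

// segmentedPrimes returns the primes in [m, n]; primes up to sqrt(n)
// suffice to cross out every composite in the segment.
func segmentedPrimes(m, n int64) []int64 {
    base := simpleSieve(int64(math.Sqrt(float64(n))))
    composite := make([]bool, n-m+1)
    for _, p := range base {
        start := p * p // smaller multiples were handled by smaller primes
        if start < m {
            start = (m + p - 1) / p * p // first multiple of p that is >= m
        }
        for j := start; j <= n; j += p {
            composite[j-m] = true
        }
    }
    var out []int64
    for i := range composite {
        if v := m + int64(i); v > 1 && !composite[i] {
            out = append(out, v)
        }
    }
    return out
}

func main() {
    fmt.Println(segmentedPrimes(1, 10)) // [2 3 5 7]
    fmt.Println(segmentedPrimes(3, 5))  // [3 5]
}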
Go code:
package main
import (
"fmt"
"math"
)
func main() {
var k, j, i, max_m, max_n, test_cases, kase int64
fmt.Scanln(&test_cases)
case_m, case_n := make([]int64, test_cases), make([]int64, test_cases)
EratosthenesArray := make(map[int64][]bool)
max_m = 0
max_n = 0
for i = 0; i < test_cases; i++ {
fmt.Scanf("%d %d", &case_m[i], &case_n[i])
if case_m[i] > case_n[i] {
case_m[i] = 0
case_n[i] = 0
}
if max_m < case_m[i] {
max_m = case_m[i]
}
if max_n < case_n[i] {
max_n = case_n[i]
}
length := case_n[i] - case_m[i] + 1
EratosthenesArray[i] = make([]bool, length)
}
if max_m <= max_n {
upperbound := int64(math.Sqrt(float64(max_n)))
UpperboundArray := make([]bool, upperbound+1)
for i = 2; i <= upperbound; i++ {
if !UpperboundArray[i] {
for k = i * i; k <= upperbound; k += i {
UpperboundArray[k] = true
}
for kase = 0; kase < test_cases; kase++ {
start := (case_m[kase] - i*i) / i
if case_m[kase]-i*i < 0 {
start = i
}
for k = start * i; k <= case_n[kase]; k += i {
if k >= case_m[kase] && k <= case_n[kase] {
EratosthenesArray[kase][k-case_m[kase]] = true
}
}
}
}
}
}
for i = 0; i < test_cases; i++ {
k = 0
for j = 0; j <= case_n[i]-case_m[i]; j++ {
if !EratosthenesArray[i][j] {
ret := case_m[i] + j
if ret > 1 {
fmt.Println(ret)
}
}
}
fmt.Println()
}
}
TIME 1.08
MEMORY 772M (again, this must be a bug: every Go submission on SPOJ starts from 771M of reported memory, so this is about 1M of real usage)
Mr. Chef has been given a number N. He has a tendency to double whatever he gets. So now he has got the number N with him and he has multiplied the number N by 2. Now Chef is superstitious. He believes in something known as Lucky Number. His lucky number is defined as any number, which when multiplied by 2 has no other factors other than 1, 2, and N. If the number is lucky all you have to do is print "LUCKY NUMBER". If the number is not a lucky number, print "Sorry".
The first line consists of T, which is the number of test cases. Every line of the next T lines consists of N.
Print "LUCKY NUMBER" if the number is lucky and "Sorry" if the number is not lucky, followed by a new line.
1<=T<=1000
1<=N<=1000000
Input
3
26
12
11
Output:
Sorry
Sorry
LUCKY NUMBER
A very simple problem: the so-called Lucky Number test amounts to checking whether n is prime or a power of two.
For the primality check I used the Sieve of Eratosthenes, which cuts the time complexity dramatically.
For the power-of-two check I used a bitwise operation.
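The bitwise trick, as a standalone sketch (the function name is mine): a power of two has exactly one set bit, so clearing the lowest set bit with n & (n-1) must leave zero:

package main

import "fmt"

// isPowerOfTwo reports whether n is a power of two: n & (n-1) clears
// the lowest set bit, and a power of two has only that one bit set.
func isPowerOfTwo(n int64) bool {
    return n > 0 && n&(n-1) == 0
}

func main() {
    fmt.Println(isPowerOfTwo(8))  // true
    fmt.Println(isPowerOfTwo(12)) // false
}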
Go solution:
package main
import (
"fmt"
"math"
)
var ns []int64
var UpperboundArray []bool
var upperbound, i, j, k int64
func is_pow_of_two(n int64) bool {
return n&(n-1) == 0 // a power of two has exactly one set bit
}
func generate_prime_number_table(n int64) {
upperbound = int64(math.Sqrt(float64(n)))
UpperboundArray = make([]bool, n+1)
for i = 2; i <= upperbound; i++ {
if !UpperboundArray[i] {
for k = i * i; k <= n; k += i {
UpperboundArray[k] = true
}
}
}
}
func is_prime_number(n int64) bool {
return !UpperboundArray[n]
}
func is_lucky_number(n int64) bool {
if n < 1 || n > 1000000 {
return false
}
return is_pow_of_two(n) || is_prime_number(n)
}
func main() {
var testcases, n int64
fmt.Scanln(&testcases)
ns = make([]int64, testcases)
n = 0
for i = 0; i < testcases; i++ {
fmt.Scanln(&ns[i])
if n < ns[i] {
n = ns[i]
}
}
generate_prime_number_table(n)
for i = 0; i < testcases; i++ {
if is_lucky_number(ns[i]) {
fmt.Println("LUCKY NUMBER")
} else {
fmt.Println("Sorry")
}
}
}
That approach is simple and crude, and it brings a problem: uwsgi itself ends up serving the app's static assets, and in 伴旅's production environment uwsgi's efficiency was already a tangible issue. Time to fix it!
The fix, unsurprisingly, is to let nginx handle static requests directly. Docker aside, it would only take a few extra lines in the nginx configuration, such as:
location /static {
alias /static;
}
Inside docker, however, the flask app's static directory is isolated from nginx's environment, and without modifying the already-running containers it is hard to give nginx direct access to the static files.
Fortunately our nginx supports caching, which yields a relatively simple and effective option.
Turning on nginx's cache is thoroughly documented (any search turns up excellent guides), so we only need to modify the nginx conf and add the following:
proxy_cache_path /tmp/cache levels=1:2 keys_zone=cache:30m max_size=1G;
upstream app_upstream {
server app:5000;
}
# in the server block
location /static {
proxy_cache cache;
proxy_cache_valid 30m;
proxy_pass http://app_upstream;
}
This approach is quite extensible and can effectively serve several flask apps, but its one flaw is that refreshed static assets take up to 30 minutes to appear, unless the nginx instance is restarted manually.
So we have a second, somewhat more involved approach, which puts docker's volume concept to good use.
It avoids the cache-refresh delay, fits my current project architecture better, and is already live in production with a clear improvement.
We currently have more than ten flask apps deployed on the server as docker containers; each app is started by uwsgi and exposes a socket port to the host.
MySQL, Redis, and Nginx-Proxy are also deployed via docker.
Originally each flask app's content lived inside its container, which is exactly why nginx could not reach the files in each app's static directory. So we made some changes to the containers, including:
With that done, static files can be accelerated by modifying the nginx-proxy template.
The relevant template section looks roughly like this:
{{ range $host, $containers := groupByMulti $ "Env.VIRTUAL_HOST" "," }}
upstream {{ $host }} {
{{ range $container := $containers }}
{{ $addrLen := len $container.Addresses }}
{{/* If only 1 port exposed, use that */}}
{{ if eq $addrLen 1 }}
{{ with $address := index $container.Addresses 0 }}
# {{$container.Name}}
server {{ $address.IP }}:{{ $address.Port }};
{{ end }}
{{/* If more than one port exposed, use the one matching VIRTUAL_PORT env var */}}
{{ else if $container.Env.VIRTUAL_PORT }}
{{ range $address := .Addresses }}
{{ if eq $address.Port $container.Env.VIRTUAL_PORT }}
# {{$container.Name}}
server {{ $address.IP }}:{{ $address.Port }};
{{ end }}
{{ end }}
{{/* Else default to standard web port 80 */}}
{{ else }}
{{ range $address := $container.Addresses }}
{{ if eq $address.Port "80" }}
# {{$container.Name}}
server {{ $address.IP }}:{{ $address.Port }};
{{ end }}
{{ end }}
{{ end }}
{{ end }}
}
server {
server_name {{ $host }};
# client_max_body_size 50M;
{{ if (exists (printf "/etc/nginx/vhost.d/%s" $host)) }}
include {{ printf "/etc/nginx/vhost.d/%s" $host }};
{{ end }}
location / {
uwsgi_pass {{ $host }};
include uwsgi_params;
{{ if (exists (printf "/etc/nginx/htpasswd/%s" $host)) }}
auth_basic "Restricted {{ $host }}";
auth_basic_user_file {{ (printf "/etc/nginx/htpasswd/%s" $host) }};
{{ end }}
}
location ^~ /static/ {
{{ $host_array := split $host "."}}
{{ $loc := first $host_array}}
{{ $suf := "app-dev.xyzabc.com"}}
{{ if hasSuffix $suf $host}}
root /app_contents/volumes/dev/{{$loc}}/app/;
{{ end }}
{{ $suf := "app.xyzabc.com"}}
{{ if hasSuffix $suf $host}}
root /app_contents/volumes/production/{{$loc}}/app/;
{{ end }}
{{ $suf := "app-staging.xyzabc.com"}}
{{ if hasSuffix $suf $host}}
root /app_contents/volumes/staging/{{$loc}}/app/;
{{ end }}
}
}
{{ end }}
At the very start I was far too optimistic about uwsgi's performance. After deploying to production I found uwsgi frequently blocked while serving static content, which also degraded the other REST endpoints. At first I did not suspect uwsgi at all, until I rebuilt the whole stack locally and saw a clear delay loading static files even there.
Judging from the current production results, serving static assets with nginx was the right call. Still, I have not given up on uwsgi: its docs include an article on tuning static-file serving (see here), which mentions quite a few options and techniques I had overlooked. I may try them later and see whether uwsgi's sealed powers can be unlocked.
But when some services do not run inside containers, docker becomes less convenient. For instance, suppose a container needs to connect to a MySQL database running on the host. The most obvious solution is to hard-code the host's external IP in the code, but that requires exposing the port to the public network.
Clearly that is unscientific and against docker's way of doing things; instead, we can handle it by maintaining the container's hosts file.
On a Linux host, we can run
ip route show 0.0.0.0/0 | grep -Eo 'via \S+' | awk '{ print $2 }'
to get the host's LAN IP, an address that is directly reachable from inside containers.
For mac users running boot2docker, the default IP is 192.168.59.3, so no lookup is needed.
With the --add-host flag of the docker run command, we can conveniently add hosts entries when starting a container. For ease of use, first alias the IP-lookup command from above:
alias hostip="ip route show 0.0.0.0/0 | grep -Eo 'via \S+' | awk '{ print \$2 }'"
and then run docker run:
docker run --add-host=hostip:$(hostip) -it debian
Now we can reach the local host from inside the container via the name hostip.
If you are connecting to MySQL or similar services, you may also need to adjust the host field of the database user.
Installing shipyard
Following the official site's instructions, we simply run
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock shipyard/deploy start
to get a docker container with shipyard deployed. Note that this container uses port 8080 as the entry point of the shipyard API controller.
When it finishes, it prints the default shipyard admin user and password, e.g.:
Unable to find image 'shipyard/deploy:latest' locally
Pulling repository shipyard/deploy
ec8a310a5557: Download complete
511136ea3c5a: Download complete
19e1e1d132d3: Download complete
e153b2ff5a59: Download complete
Status: Downloaded newer image for shipyard/deploy:latest
Pulling image: shipyard/rethinkdb
Starting Rethinkdb Data
Starting Rethinkdb
Starting Shipyard
Pulling image: shipyard/shipyard:latest
Shipyard Stack started successfully
Username: admin Password: shipyard
Here admin is the username and shipyard the password, and we can log in and manage the cluster at http://ip:8080/.

If the login page comes up, the installation succeeded; log in with the username and password.
Configuring shipyard
- CLI mode: on the server, run docker run -ti --rm shipyard/shipyard-cli to start a shipyard-cli container; typing help inside it prints the command reference.
- Once shipyard-cli is up, the first thing to do is log into the current shipyard. Enter shipyard login in shipyard-cli; it prompts in turn for URL, Username, and Password. Use http://ip:8080/ as the URL and the credentials from above, i.e. admin and shipyard. If no error is reported, the login succeeded, and shipyard accounts now lists the current users.
- To change the password, run shipyard change-password and type the new password twice.
- Edit the /etc/default/docker file and add
DOCKER_OPTS="$DOCKER_OPTS -H tcp://0.0.0.0:4243 -H unix:///var/run/docker.sock"
then restart docker:
sudo service docker stop
sudo service docker start
- Use ifconfig to look up the IP of the docker0 interface; mine is 172.17.42.1, so that is what I fill in.
references:
http://www.freezefamily.net/2014/11/docker-and-shipyard-on-ubuntu-trusty-14-04/
http://shipyard-project.com/docs/quickstart/
First, modify the vassal configuration file. Borrowing uwsgi's smart-attach-daemon, starting celery takes just one extra line:
smart-attach-daemon = /tmp/celery_%(vassal_name).pid celery -A application.tasks worker --pidfile=/tmp/celery_%(vassal_name).pid
But right after the change, the following error appeared: AttributeError: 'Flask' object has no attribute 'user_options'. A search turned up:
The your_application string has to point to your application's package or module that creates the celery object.
The problem is that application.tasks is exactly the module that creates the celery object, so I tried pointing directly at the created celery object instead, changing the vassal config to:
smart-attach-daemon = /tmp/celery_%(vassal_name).pid celery -A application.tasks.celery_app worker --pidfile=/tmp/celery_%(vassal_name).pid
After uploading the update it worked. Tracing the cause again: in application.tasks I had bound application.flask_app to the name app, and celery by default looks for an object named app to use as the celery application, which is why starting celery failed on the Flask object.
YuvImage yuv_image = new YuvImage(imageByte, ImageFormat.NV21, width, height, null);
Rect rect = new Rect(0, 0, width, height);
ByteArrayOutputStream output_stream = new ByteArrayOutputStream();
yuv_image.compressToJpeg(rect, 100, output_stream);
byte[] byt = output_stream.toByteArray();
Bitmap full = BitmapFactory.decodeByteArray(byt, 0, byt.length);
Mat mYuv = new Mat(height, width, CvType.CV_8UC1);
mYuv.put(0, 0, data);
Note the CvType.CV_8UC1 here: if you cannot construct the Mat correctly, try CvType.CV_8UC3 instead.
Downloading the NDK
Download the mac version of the NDK from the official page https://developer.android.com/tools/sdk/ndk/index.html, usually android-ndk-r10d-darwin-x86_64.bin, then extract it:
chmod a+x android-ndk-r10d-darwin-x86_64.bin
./android-ndk-r10d-darwin-x86_64.bin
This yields a directory named android-ndk-r10d; put it wherever you like. I put it in ~/Android_ndk.
Project setup
Create a new empty-Activity project in AS (Android Studio), switch the AS Project view to Project mode, and create the jni and jniLibs directories we need.
Then open local.properties in the project root and add the following line, pointing at the NDK folder you set up in the previous step:
ndk.dir=/path/to/your/ndk/folder
Hello World from JNI
Configure AS to invoke javah directly, which makes generating the header file for a class easy.
In AS choose Preferences->External Tools and add a new Tool configured as follows:
Program: /usr/bin/javah
Parameters: -v -jni -d $ModuleFileDir$/src/main/jni $FileClass$
Working directory: $SourcepathEntry$
Now modify our MainActivity to declare the function we need, e.g. hello():
static{
System.loadLibrary("hello");
}
private TextView textView ;
public native String hello();
@Override
public void setContentView(int layoutResID) {
super.setContentView(layoutResID);
textView = (TextView) findViewById(R.id.textview);
textView.setText(hello());
}
Now we can use this tool to create the header file we need: right-click MainActivity, choose the javah tool, and a header named package_name_class_name.h is generated in the jni folder, in our case sh_rui_demo_project_opencv_MainActivity.h, which looks roughly like this:
// sh_rui_demo_project_opencv_MainActivity.h
/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class sh_rui_demo_project_opencv_MainActivity */
#ifndef _Included_sh_rui_demo_project_opencv_MainActivity
#define _Included_sh_rui_demo_project_opencv_MainActivity
#ifdef __cplusplus
extern "C" {
#endif
/*
* Class: sh_rui_demo_project_opencv_MainActivity
* Method: hello
* Signature: ()Ljava/lang/String;
*/
JNIEXPORT jstring JNICALL Java_sh_rui_demo_project_opencv_MainActivity_hello
(JNIEnv *, jobject);
#ifdef __cplusplus
}
#endif
#endif
Then create `main.c` in the `jni` folder, include the generated header, and implement the corresponding `hello()` function:
// main.c
#include "sh_rui_demo_project_opencv_MainActivity.h"
JNIEXPORT jstring JNICALL Java_sh_rui_demo_project_opencv_MainActivity_hello (JNIEnv * env, jobject obj){
return (*env)->NewStringUTF(env, "Hello from JNI");
}
Finally, invoke the NDK from build.gradle:
defaultConfig {
applicationId "sh.rui.demo.project.opencv"
minSdkVersion 9
targetSdkVersion 21
versionCode 1
versionName "1.0"
ndk{
moduleName "hello"
}
}
Note that moduleName must match the library name loaded in the Activity, hello in this post.
The final result looks like this: