Hadoop Deployment

Basic environment

Three Linux virtual machine servers

node1: Master (HDFS/YARN/Spark) and Worker (HDFS/YARN/Spark)
node2: Worker (HDFS/YARN/Spark)
node3: Worker (HDFS/YARN/Spark) and Hive

Hadoop 3 cluster, JDK 1.8, CentOS 7.6

192.168.1.4 master  (bound to a public IP)
192.168.1.7 worker1
192.168.1.9 worker2

hostnamectl set-hostname master    # on 192.168.1.4
hostnamectl set-hostname worker1   # on 192.168.1.7
hostnamectl set-hostname worker2   # on 192.168.1.9

Set up passwordless SSH login:
http://t.zoukankan.com/Nanaya-p-13202946.html

ssh-keygen -t rsa

scp /root/.ssh/id_rsa.pub ec2-user@192.168.1.7:/home/ec2-user

scp /root/.ssh/id_rsa.pub ec2-user@192.168.1.9:/home/ec2-user

# on each worker: append the copied public key to the login user's authorized_keys
cat id_rsa.pub >> ~/.ssh/authorized_keys

vim /etc/ssh/sshd_config

PermitRootLogin yes  # allow root login
PubkeyAuthentication yes  # enable public/private key authentication
AuthorizedKeysFile .ssh/authorized_keys  # path of the public key file (the same file appended to above)
systemctl restart sshd

## configure hosts
echo '192.168.1.4 master
192.168.1.7 worker1
192.168.1.9 worker2' >> /etc/hosts
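
With the keys distributed and the hosts entries in place, passwordless login can be spot-checked from master (a quick sanity check; adjust the user name to whichever account actually received the key):

ssh worker1 hostname   # should print "worker1" without a password prompt
ssh worker2 hostname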

Basic principle of Spark Local mode

Essence: start a single JVM process (one process containing multiple threads) to execute the tasks.
Local mode limits the number of threads used to simulate a Spark cluster, i.e. Local[N] or Local[*], where N is the number of usable threads and each thread gets one CPU core. If N is not specified, the default is 1 thread (with 1 core). Normally you set N to the number of CPU cores to make full use of the machine; with Local[*], the thread count is set to the number of CPU cores automatically.
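As a short illustration (assuming a Spark installation is on hand; this is not part of the Hadoop deployment below), the Local mode variants are selected via the --master option:

# one JVM with 4 task threads (4 cores)
spark-shell --master local[4]
# one JVM with as many threads as the machine has CPU cores
spark-shell --master local[*]
# same idea for spark-submit (your_app.py is a placeholder)
spark-submit --master local[2] your_app.py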

Download and install (on all three nodes)

References:
https://blog.csdn.net/dream_an/article/details/80258283

https://blog.csdn.net/weixin_53227758/article/details/121977047 [very detailed]

https://blog.csdn.net/zyx1260168395/article/details/120921697?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-0-120921697-blog-123827164.pc_relevant_antiscanv2&spm=1001.2101.3001.4242.1&utm_relevant_index=3

https://blog.csdn.net/weixin_49167174/article/details/123827164


Find a Java 8 installation package online; a Baidu Netdisk share is the easier route, since the official download requires a pile of registration steps.

mkdir /opt/java  # upload the JDK tarball here
tar -zxf jdk-8u271-linux-x64.tar.gz
mv jdk1.8.0_271/ /opt/java/
## configure the Java environment variables
echo 'export JAVA_HOME=/opt/java/jdk1.8.0_271
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin' > /etc/profile.d/jdk-1.8.sh

# apply the environment variables
source /etc/profile

# verify the Java installation
java -version

Hadoop 3+ configuration files (GitHub link to the configuration file source)

Hadoop official site: http://hadoop.apache.org/
yum install wget  # too slow; download in a browser and upload instead
https://downloads.apache.org/hadoop/common/hadoop-3.2.3/

mkdir /opt/hadoop
# six files under /opt/hadoop/hadoop-3.2.3/etc/hadoop/ need to be configured:
# hadoop-env.sh、core-site.xml、hdfs-site.xml、yarn-site.xml、mapred-site.xml、workers
tar -zxf hadoop-3.2.3.tar.gz
mv hadoop-3.2.3 /opt/hadoop/
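
A quick sanity check that the unpacked distribution can find the JDK installed above (JAVA_HOME is picked up from the /etc/profile.d/jdk-1.8.sh script):

/opt/hadoop/hadoop-3.2.3/bin/hadoop version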

Modify the configuration files on master

The configuration files are identical on master and workers, which causes no problems unless you decide not to run a DataNode on the master node.

Back up the configuration files

cd /opt/hadoop/hadoop-3.2.3/etc/hadoop/
cp hadoop-env.sh hadoop-env.sh.bak
cp core-site.xml core-site.xml.bak
cp hdfs-site.xml hdfs-site.xml.bak
cp yarn-site.xml yarn-site.xml.bak
cp mapred-site.xml mapred-site.xml.bak
cp workers workers.bak

Create the directories

mkdir -p /data/hadoop/tmp /data/hadoop/hdfs/data /data/hadoop/hdfs/name /data/hadoop/pid/ /data/hadoop/log/

Modify hadoop-env.sh environment variables

https://baijiahao.baidu.com/s?id=1668099113918809329&wfr=spider&for=pc

echo 'export JAVA_HOME=/opt/java/jdk1.8.0_271
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.3
export HADOOP_PID_DIR=/data/hadoop/pid
export HADOOP_LOG_DIR=/data/hadoop/log

export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop

# export HIVE_HOME=/opt/hive
# export HBASE_HOME=/opt/hbase

export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root" ' >> /opt/hadoop/hadoop-3.2.3/etc/hadoop/hadoop-env.sh

These variables can also be configured in /etc/profile.d/ instead, as sketched below.
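
For reference, a minimal sketch of the /etc/profile.d/ alternative (the file name hadoop.sh is arbitrary); this also puts the hadoop/hdfs/yarn commands on every login shell's PATH:

echo 'export HADOOP_HOME=/opt/hadoop/hadoop-3.2.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' > /etc/profile.d/hadoop.sh
source /etc/profile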

Modify core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9710</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop/tmp</value>
    </property>
</configuration>

name: fs.defaultFS (fs.default.name is the older, deprecated key for the same setting; in an HA setup the value points at the nameservice ID rather than a single host); value: the NameNode URI

name: io.file.buffer.size: Size of read/write buffer used in SequenceFiles

name: hadoop.tmp.dir: A base for other temporary directories.
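
Once the file is in place (and HADOOP_CONF_DIR points at it), the effective values can be double-checked with hdfs getconf, e.g.:

/opt/hadoop/hadoop-3.2.3/bin/hdfs getconf -confKey fs.defaultFS     # hdfs://master:9710
/opt/hadoop/hadoop-3.2.3/bin/hdfs getconf -confKey hadoop.tmp.dir   # /data/hadoop/tmp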

Modify hdfs-site.xml

<configuration>
    <!-- Configurations for NameNode: -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9711</value>
    </property>
    <property>
        <name>dfs.namenode.handler.count</name>
        <value>100</value>
    </property>
    <!-- Configurations for DataNode: -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/hdfs/data</value>
    </property>
</configuration>

name: dfs.blocksize: HDFS blocksize of 256MB for large file-systems.

name: dfs.namenode.secondary.http-address: The secondary NameNode HTTP server address and port.

name: dfs.namenode.handler.count: More NameNode server threads to handle RPCs from large number of DataNodes.

name: dfs.datanode.data.dir: If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.

Modify yarn-site.xml

<configuration>
    <!-- Configurations for ResourceManager and NodeManager: not configured -->
    <!-- Configurations for ResourceManager: -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:9712</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:9713</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:9714</value>
    </property>

    <!-- Configurations for NodeManager: -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Configurations for History Server: not configured for now -->

</configuration>
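
Once the cluster has been started (see below), the ResourceManager web UI configured above should be reachable at http://master:9712, and the registered NodeManagers can be listed with:

/opt/hadoop/hadoop-3.2.3/bin/yarn node -list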

Modify mapred-site.xml

<configuration>
    <!-- Configurations for MapReduce Applications: -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Configurations for MapReduce JobHistory Server: -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:9715</value>
    </property>
</configuration>
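
Note: with only the two properties above, MapReduce jobs on Hadoop 3 frequently fail with "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster" (this is what the troubleshooting links at the end deal with). A commonly used fix is to point the application master and tasks at the MapReduce installation, e.g. by adding to the <configuration> block of mapred-site.xml:

<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.3</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.3</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.3</value>
</property>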

Modify workers

master
worker1
worker2

Copy the Hadoop files to the other nodes, configure the Hadoop environment variables, format HDFS, start the cluster, check it, stop it, and reset it

# on worker1 and worker2: create the same data directories as on master
mkdir -p /data/hadoop/tmp /data/hadoop/hdfs/data /data/hadoop/hdfs/name /data/hadoop/pid/ /data/hadoop/log/

# on master, from /opt/hadoop/hadoop-3.2.3/etc/hadoop/: copy the configuration files to the workers
scp hadoop-env.sh 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp core-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp hdfs-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp yarn-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp mapred-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp workers 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/

scp hadoop-env.sh 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp core-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp hdfs-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp yarn-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp mapred-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp workers 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
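
The same copy can also be written as a loop over hosts and files (equivalent to the commands above, just more compact):

cd /opt/hadoop/hadoop-3.2.3/etc/hadoop/
for host in 192.168.1.7 192.168.1.9; do
    for f in hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml workers; do
        scp $f $host:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
    done
done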

Format and start the cluster

# format the NameNode (run once, on master)
/opt/hadoop/hadoop-3.2.3/bin/hdfs namenode -format
# start HDFS and YARN together ...
/opt/hadoop/hadoop-3.2.3/sbin/start-all.sh

# ... or start them separately
/opt/hadoop/hadoop-3.2.3/sbin/start-dfs.sh
/opt/hadoop/hadoop-3.2.3/sbin/start-yarn.sh
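
The section heading also mentions stopping and resetting the cluster; a sketch of those steps (the reset deletes all HDFS data, so handle with care):

# stop HDFS and YARN
/opt/hadoop/hadoop-3.2.3/sbin/stop-all.sh

# reset: after stopping, clear the data directories on every node, then re-format the NameNode on master
rm -rf /data/hadoop/tmp/* /data/hadoop/hdfs/data/* /data/hadoop/hdfs/name/*
/opt/hadoop/hadoop-3.2.3/bin/hdfs namenode -format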

Test

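The original test snippet is not preserved here; a minimal smoke test could look like the following (the example jar path is the one shipped inside the Hadoop 3.2.3 tarball):

# every node should show its expected daemons (NameNode/SecondaryNameNode/ResourceManager on master, DataNode/NodeManager on all nodes)
jps

# HDFS round trip
/opt/hadoop/hadoop-3.2.3/bin/hdfs dfs -mkdir -p /test
/opt/hadoop/hadoop-3.2.3/bin/hdfs dfs -put /etc/hosts /test/
/opt/hadoop/hadoop-3.2.3/bin/hdfs dfs -ls /test

# run the bundled pi example on YARN
/opt/hadoop/hadoop-3.2.3/bin/hadoop jar /opt/hadoop/hadoop-3.2.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar pi 2 10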


## Troubleshooting references:
https://mathsigit.github.io/blog_page/2017/11/16/hole-of-submitting-mr-of-hadoop300RC0/

https://blog.csdn.net/qianhuihan/article/details/83379837

https://www.jianshu.com/p/9ebe74753794

https://stackoverflow.com/questions/50927577/could-not-find-or-load-main-class-org-apache-hadoop-mapreduce-v2-app-mrappmaster