Basic environment: three Linux virtual machine servers

- node1: Master (HDFS/YARN/Spark) and Worker (HDFS/YARN/Spark)
- node2: Worker (HDFS/YARN/Spark)
- node3: Worker (HDFS/YARN/Spark) and Hive

Hadoop 3 cluster, JDK 1.8, CentOS 7.6

192.168.1.4 master   (bound to the public IP)
192.168.1.7 worker1
192.168.1.9 worker2
```bash
hostnamectl set-hostname master    # on node1
hostnamectl set-hostname worker1   # on node2
hostnamectl set-hostname worker2   # on node3
```
Set up passwordless SSH login: http://t.zoukankan.com/Nanaya-p-13202946.html
```bash
# Generate a key pair on the master and copy the public key to the workers
ssh-keygen -t rsa
scp /root/.ssh/id_rsa.pub ec2-user@192.168.1.7:/home/ec2-user
scp /root/.ssh/id_rsa.pub ec2-user@192.168.1.9:/home/ec2-user

# On each worker, append the public key to authorized_keys
cat id_rsa.pub >> authorized_keys

# Edit /etc/ssh/sshd_config and make sure the following settings are enabled:
vim /etc/ssh/sshd_config
#   PermitRootLogin yes                       # allow root login
#   PubkeyAuthentication yes                  # enable public/private key authentication
#   AuthorizedKeysFile .ssh/authorized_keys   # path to the public key file (the file written above)
systemctl restart sshd

# Configure hosts
echo '192.168.1.4 master
192.168.1.7 worker1
192.168.1.9 worker2' >> /etc/hosts
```
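To confirm passwordless login works, a quick check from the master (a minimal sketch; it assumes the public key ended up in root's authorized_keys on the workers, per the sshd_config settings above):

```bash
# Each command should print the remote hostname without prompting for a password
ssh root@worker1 hostname
ssh root@worker2 hostname
```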
Basic principle of Spark Local mode: in essence, a single JVM process is started (one process containing multiple threads) to execute the Tasks. Local mode can limit the number of threads used to simulate a Spark cluster environment via local[N] or local[*], where N is the number of threads that may be used and each thread owns one CPU core. If N is not specified, the default is 1 thread (with 1 core). Usually you set N to the number of CPU cores to make full use of the machine's compute power; with local[*], the thread count is set to the number of available CPU cores.
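As a concrete illustration (a minimal sketch; it assumes Spark is unpacked under /opt/spark, which is not covered in this section):

```bash
# Interactive shell using one thread per available CPU core
/opt/spark/bin/spark-shell --master 'local[*]'

# Run the bundled SparkPi example with 4 worker threads
/opt/spark/bin/spark-submit --master 'local[4]' \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_*.jar 100
```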
Download and install (on all three nodes). Reference: https://blog.csdn.net/dream_an/article/details/80258283
https://blog.csdn.net/weixin_53227758/article/details/121977047 (very detailed)
https://blog.csdn.net/zyx1260168395/article/details/120921697?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-0-120921697-blog-123827164.pc_relevant_antiscanv2&spm=1001.2101.3001.4242.1&utm_relevant_index=3
https://blog.csdn.net/weixin_49167174/article/details/123827164
```bash
# Find a Java 8 tarball online (a Baidu Netdisk mirror is easiest; the official download requires registration)
mkdir /opt/java
tar -zxf jdk-8u271-linux-x64.tar.gz
mv jdk1.8.0_271/ /opt/java/

echo 'export JAVA_HOME=/opt/java/jdk1.8.0_271
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin' > /etc/profile.d/jdk-1.8.sh

source /etc/profile
java -version
```
Hadoop 3+ configuration files (GitHub link to the configuration file sources)
```bash
# Hadoop website: http://hadoop.apache.org/
yum install wget
# Download the release tarball from https://downloads.apache.org/hadoop/common/hadoop-3.2.3/
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
mkdir /opt/hadoop
tar -zxf hadoop-3.2.3.tar.gz
mv hadoop-3.2.3 /opt/hadoop/
```
Modify the configuration files on the master

The configuration files are identical on master and workers; this causes no problems unless you decide not to run a DataNode on the master node.
Back up the configuration files

```bash
cd /opt/hadoop/hadoop-3.2.3/etc/hadoop/
cp hadoop-env.sh hadoop-env.sh.bak
cp core-site.xml core-site.xml.bak
cp hdfs-site.xml hdfs-site.xml.bak
cp yarn-site.xml yarn-site.xml.bak
cp mapred-site.xml mapred-site.xml.bak
cp workers workers.bak
```
Create the data directories

```bash
mkdir -p /data/hadoop/tmp /data/hadoop/hdfs/data /data/hadoop/hdfs/name /data/hadoop/pid/ /data/hadoop/log/
```
Modify hadoop-env.sh. Environment variable reference: https://baijiahao.baidu.com/s?id=1668099113918809329&wfr=spider&for=pc
```bash
echo 'export JAVA_HOME=/opt/java/jdk1.8.0_271
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.3
export HADOOP_PID_DIR=/data/hadoop/pid
export HADOOP_LOG_DIR=/data/hadoop/log
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# export HIVE_HOME=/opt/hive
# export HBASE_HOME=/opt/hbase
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
' >> /opt/hadoop/hadoop-3.2.3/etc/hadoop/hadoop-env.sh
```
These variables can also be configured in /etc/profile.d/.
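For example, a minimal sketch of that alternative (the file name hadoop.sh is arbitrary; the paths match the installation above):

```bash
cat > /etc/profile.d/hadoop.sh <<'EOF'
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
EOF
source /etc/profile
```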
Modify core-site.xml

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9710</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```
name: fs.default.name vs. fs.defaultFS: fs.default.name is the old, deprecated property name; fs.defaultFS is the current one (and the one required for HA configurations); value: NameNode URI
name: io.file.buffer.size : Size of read/write buffer used in SequenceFiles
name: hadoop.tmp.dir : A base for other temporary directories.
Modify hdfs-site.xml

```xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:9711</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/hdfs/data</value>
  </property>
</configuration>
```
name: dfs.blocksize : HDFS blocksize of 256MB for large file-systems.
name: dfs.namenode.http-address : The address and the base port where the dfs namenode web ui will listen on.
name: dfs.namenode.handler.count : More NameNode server threads to handle RPCs from large number of DataNodes.
name: dfs.datanode.data.dir : If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
Modify yarn-site.xml

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:9712</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:9713</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:9714</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
```
Modify mapred-site.xml

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:9715</value>
  </property>
</configuration>
```
Modify workers
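The contents of the workers file are not shown in the original notes; based on the node layout at the top (the master also runs Worker/DataNode roles), it would presumably list all three hostnames:

```bash
# Presumed contents, derived from the cluster layout described above
cat > /opt/hadoop/hadoop-3.2.3/etc/hadoop/workers <<'EOF'
master
worker1
worker2
EOF
```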
Copy the Hadoop files to the other nodes, configure the Hadoop environment variables, format HDFS, then start, inspect, stop, or reset the cluster.

```bash
# Create the same data directories on the worker nodes
mkdir -p /data/hadoop/tmp /data/hadoop/hdfs/data /data/hadoop/hdfs/name /data/hadoop/pid/ /data/hadoop/log/

# Distribute the configuration files from the master
scp hadoop-env.sh 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp core-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp hdfs-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp yarn-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp mapred-site.xml 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp workers 192.168.1.7:/opt/hadoop/hadoop-3.2.3/etc/hadoop/

scp hadoop-env.sh 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp core-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp hdfs-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp yarn-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp mapred-site.xml 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
scp workers 192.168.1.9:/opt/hadoop/hadoop-3.2.3/etc/hadoop/
```
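Equivalently, the per-file scp commands above can be collapsed into a loop, which is easier to keep in sync when files or nodes are added:

```bash
for host in 192.168.1.7 192.168.1.9; do
  scp hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml workers \
    "$host":/opt/hadoop/hadoop-3.2.3/etc/hadoop/
done
```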
Format and start the cluster

```bash
/opt/hadoop/hadoop-3.2.3/bin/hdfs namenode -format
/opt/hadoop/hadoop-3.2.3/sbin/start-all.sh
# or start HDFS and YARN separately:
/opt/hadoop/hadoop-3.2.3/sbin/start-dfs.sh
/opt/hadoop/hadoop-3.2.3/sbin/start-yarn.sh
```
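The heading above also mentions stopping and resetting the cluster; a minimal sketch of those steps (the reset is destructive and assumes the data directories created earlier):

```bash
# Stop everything
/opt/hadoop/hadoop-3.2.3/sbin/stop-all.sh

# Reset: wipe the HDFS data directories on every node, then re-format the NameNode
rm -rf /data/hadoop/tmp/* /data/hadoop/hdfs/data/* /data/hadoop/hdfs/name/*
/opt/hadoop/hadoop-3.2.3/bin/hdfs namenode -format
```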
Testing
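A few basic smoke tests (a minimal sketch; the examples jar path follows the standard Hadoop layout, and the web UI port is the yarn.resourcemanager.webapp.address configured above):

```bash
# Check that the expected daemons are running on each node
jps

# Simple HDFS read/write test
/opt/hadoop/hadoop-3.2.3/bin/hdfs dfs -mkdir -p /test
/opt/hadoop/hadoop-3.2.3/bin/hdfs dfs -put /etc/hosts /test/
/opt/hadoop/hadoop-3.2.3/bin/hdfs dfs -ls /test

# Run the bundled MapReduce pi example on YARN, then check it at http://master:9712
/opt/hadoop/hadoop-3.2.3/bin/hadoop jar \
  /opt/hadoop/hadoop-3.2.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar pi 10 100
```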
## Troubleshooting references:
https://mathsigit.github.io/blog_page/2017/11/16/hole-of-submitting-mr-of-hadoop300RC0/
https://blog.csdn.net/qianhuihan/article/details/83379837
https://www.jianshu.com/p/9ebe74753794
https://stackoverflow.com/questions/50927577/could-not-find-or-load-main-class-org-apache-hadoop-mapreduce-v2-app-mrappmaster