Migrating Data Between Hadoop Clusters (Consolidated Edition)
Published: 2019-05-25


1. What is DistCp

  DistCp (distributed copy) is a tool for copying data at large scale, both within a cluster and between clusters. It uses Map/Reduce for file distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to map tasks, and each task copies a subset of the files in the source list. Because it is built on Map/Reduce, the tool has some peculiarities in both its semantics and its execution.

1.1 Notes on using DistCp

  1. DistCp tries to divide the content to be copied evenly, so that each map copies roughly the same amount of data. However, because a file is the smallest unit of copying, raising the configured number of simultaneous copiers (i.e. maps) does not necessarily increase the actual number of concurrent copies or the total throughput.

  2. If the -m option is not used, DistCp will try to set the number of maps when scheduling work to min(total_bytes / bytes.per.map, 20 * num_task_trackers), where bytes.per.map defaults to 256 MB.

  3. For long-running or regularly scheduled jobs, it is recommended to tune the number of maps according to the source and destination cluster sizes, the amount of data to copy, and the available bandwidth.

  4. For copies between different Hadoop versions, users should use HftpFileSystem. This is a read-only file system, so DistCp must be run on the destination cluster (more precisely, on TaskTrackers that can write to the destination cluster). The source URI has the form hftp://<dfs.http.address>/<path> (by default, dfs.http.address is <namenode>:50070).
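The default map-count rule above (used when -m is absent) can be sketched as follows. This is a minimal illustration of the formula from the text, not DistCp's actual scheduling code; the 256 MB default and the factor of 20 come from the note above.

```python
def default_distcp_maps(total_bytes, num_task_trackers,
                        bytes_per_map=256 * 1024 * 1024):
    """Approximate DistCp's default map count when -m is not given:
    min(total_bytes / bytes.per.map, 20 * num_task_trackers)."""
    # At least one map is needed for any non-empty copy.
    by_size = max(1, total_bytes // bytes_per_map)
    cap = 20 * num_task_trackers
    return min(by_size, cap)

# A 100 GB copy on a 5-tracker cluster is capped at 20 * 5 = 100 maps.
print(default_distcp_maps(100 * 1024**3, 5))   # 100
# A 1 GB copy only needs 4 maps of 256 MB each.
print(default_distcp_maps(1 * 1024**3, 5))     # 4
```

This also shows why adding maps beyond the cap (or beyond the file count, per note 1) does not help: the smaller of the two terms wins.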

2. Hadoop DistCp command-line usage

[root@node105 ~]# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -blocksperchunk <arg>         If set to a positive value, files with more
                               blocks than this value will be split into
                               chunks of <arg> blocks to be transferred in
                               parallel, and reassembled on the destination.
                               By default, <arg> is 0 and the files will be
                               transmitted in their entirety without
                               splitting. This switch is only applicable
                               when the source file system implements
                               getBlockLocations method and the target file
                               system implements concat method
 -copybuffersize <arg>         Size of the copy buffer to use. By default
                               <arg> is 8192B.
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max 40).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattr preservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -rdiff <arg>                  Use target snapshot diff report to identify
                               changes made on target
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missing files
                               or directories

3. Test case

  1. Inspect the files that are going to be migrated:

[root@calculation101 ~]# hdfs dfs -du -h /test/2018/10/

  2. Create the test directory on the new cluster:

[hdfs@node105 root]$ hdfs dfs -mkdir -p /yangjianqiu/data/
[hdfs@node105 root]$ hdfs dfs -chown -R root:root /yangjianqiu/data/
[hdfs@node105 root]$ exit
exit
[root@node105 ~]# hdfs dfs -ls /yangjianqiu
Found 1 items
drwxr-xr-x   - root root          0 2018-10-29 03:29 /yangjianqiu/data

  3. Start migrating the data, logging the output and recording the time the migration takes:

[root@node105 ~]# mkdir /yangjianqiu
[root@node105 ~]# nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
[1] 11125
[root@node105 ~]# jobs
[1]+  Running    nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
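After the job finishes, a quick sanity check is to compare the total bytes reported by hdfs dfs -du -s on the source path and on the copied target path. The sketch below only parses sample du output lines; the exact column layout (logical bytes first, optionally followed by raw disk usage) varies by Hadoop release, so treat the format here as an assumption.

```python
def parse_du_bytes(du_line):
    """Parse the first (logical bytes) column of an `hdfs dfs -du -s` line.
    Assumes the line starts with the byte count, e.g.
    '52428800  /test/2018/10/23' or '52428800  157286400  /path'."""
    return int(du_line.split()[0])

def sizes_match(src_du_line, dst_du_line):
    # Matching logical sizes make the copy plausible; a thorough check
    # would also compare file counts and checksums (DistCp's CRC check).
    return parse_du_bytes(src_du_line) == parse_du_bytes(dst_du_line)

# Illustrative du lines, not real cluster output.
print(sizes_match("52428800  /test/2018/10/23",
                  "52428800  157286400  /yangjianqiu/data/23"))  # True
```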

4. Calling the DistCp interface from an application
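The canonical way to drive DistCp from application code is the Java API (org.apache.hadoop.tools.DistCp run through ToolRunner); see the DistCp documentation for that route. As a language-neutral sketch, an application can also assemble and launch the same command line shown in section 2. The helper below only builds the argument list; the paths and option values are illustrative, not from the test case above.

```python
def build_distcp_command(src, dst, maps=None, bandwidth_mb=None,
                         update=False, preserve=None):
    """Assemble a `hadoop distcp` command line from common options."""
    cmd = ["hadoop", "distcp"]
    if maps is not None:
        cmd += ["-m", str(maps)]            # max concurrent maps
    if bandwidth_mb is not None:
        cmd += ["-bandwidth", str(bandwidth_mb)]  # MB per map
    if update:
        cmd.append("-update")               # copy only missing files
    if preserve:
        # Flags are concatenated onto -p, e.g. "bugp" for block-size,
        # user, group, permission.
        cmd.append("-p" + preserve)
    cmd += [src, dst]
    return cmd

cmd = build_distcp_command(
    "hdfs://calculation101:8020/test/2018/10/23",
    "hdfs://node105:8020/yangjianqiu/data",
    maps=20, bandwidth_mb=50, update=True, preserve="bugp")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Launching the CLI this way trades the finer control of the Java API (DistCpOptions, job monitoring) for simplicity; for anything beyond fire-and-forget copies, the Java route is preferable.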

Summary


