This time I took the quick route and installed Hadoop and Hive with brew.
The install is just this:
% brew install hive hadoop
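Both end up on the PATH via brew's symlinks. A quick sanity check that the right binaries resolve (`hadoop version` is a standard subcommand; output depends on the formula you got):

```
% which hive
% hadoop version
```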
From there I just followed a walkthrough article on Qiita.
Note that if you write the article's ${arbitrary directory} placeholder as a relative path, it resolves from /usr/local/Cellar/hadoop/2.2.0/libexec.
I used a relative path and pointed the HDFS files at the var directory.
After running through things once in standalone mode, I set up ssh keys, turned on Remote Login under the Sharing preferences (so ssh to localhost works), and edited core-site.xml, hdfs-site.xml, and yarn-site.xml.
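For the record, a minimal sketch of that setup. The ssh part only has to allow passphrase-less login to localhost, which start-dfs.sh relies on:

```
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% ssh localhost   # should get a shell with no password prompt
```

As for the XML, the standard single-node values from the Hadoop docs are fs.defaultFS = hdfs://localhost:9000 in core-site.xml, dfs.replication = 1 in hdfs-site.xml, and yarn.nodemanager.aux-services = mapreduce_shuffle in yarn-site.xml; ports and paths may differ depending on your install.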
After that it was just a matter of formatting HDFS and starting Hadoop, a thoroughly simple job.
$ hdfs namenode -format
$ sbin/start-dfs.sh
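If the daemons came up, jps (bundled with the JDK) should list them; the process names below are the stock Hadoop ones, the PIDs are just placeholders:

```
% jps
12001 NameNode
12002 DataNode
12003 SecondaryNameNode
```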
Some of the hdfs commands I typed while googling around spat out deprecation warnings here and there, which makes it obvious that quite a lot changed in 2.x.
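For example, the old `hadoop dfs` entry point still works but complains and points you at `hdfs` (warning text reproduced from memory, so treat it as approximate):

```
% hadoop dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
% hdfs dfs -ls /   # the 2.x spelling
```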
While running the MapReduce samples I also played around a bit with Hive. Writing raw MapReduce looked quite painful when I read the post below, so I'm starting from Hive, which seems easier.
Hadoopのいろんな言語でwordcount(1) | Tech Blog
Hadoop Streaming also seems fine, but I'll start with the easiest option.
Once my understanding improves and I have some room to spare, I'd like to try Go + Hadoop Streaming too (see the sketch below).
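The appeal of Streaming is that the mapper and reducer are just executables reading stdin and writing stdout, which is exactly why a Go binary would slot in. A minimal sketch with shell tools standing in for those binaries; the streaming jar path is the usual 2.x layout but may differ under brew:

```
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /hadoop/input \
    -output /hadoop/out-streaming \
    -mapper /bin/cat \
    -reducer '/usr/bin/uniq -c'   # map output is sorted before the reduce, so uniq -c yields counts
```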
So, Hive. I tried a word count following this article:
- Hive WordCountサンプル(Hishidama's Apache Hive WordCount Sample)
A few hdfs commands to upload the file, and preparation is done.
% cat input/file2
xyz
xyz
bar
baz
foo
foo
bar
bar
bar
123
% hdfs dfs -mkdir /hadoop
% hdfs dfs -put input/file2 /hadoop/input
% hdfs dfs -ls -R /hadoop/input
15/04/26 00:24:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 hoge supergroup   40 2015-04-25 23:52 /hadoop/input/file2
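By the way, the NativeCodeLoader warning just means there is no native-hadoop build for this platform and the pure-Java classes are used instead, as the message itself says; for playing around like this it's harmless.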
All that's left is to fire a query at it from Hive and aggregate.
% export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
% hive
<snip>
hive> CREATE EXTERNAL TABLE docs (line STRING) LOCATION '/hadoop/input';
hive> select line, COUNT(*) FROM docs group by line;
Query ID = hoge_20150426002727_4be0f399-64fc-423b-8262-06fa346599ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-26 00:28:05,906 Stage-1 map = 100%, reduce = 0%
2015-04-26 00:28:11,149 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1134999526_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 568 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
123	1
bar	4
baz	1
foo	2
xyz	2
Time taken: 21.975 seconds, Fetched: 5 row(s)
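One caveat: this groups by whole line, and it only acts as a word count because file2 happens to hold one word per line. For multi-word lines the usual pattern is to split each line and explode the resulting array (split and explode are built-in Hive UDFs; docs and line are the table and column created above):

```
hive> SELECT word, COUNT(*) FROM docs
    > LATERAL VIEW explode(split(line, ' ')) t AS word
    > GROUP BY word;
```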
The result comes back in about 20 seconds. Given that it's running on Hadoop, that's about the response you'd expect.
Let's add a sort.
hive> select line, COUNT(*) c FROM docs group by line ORDER BY c DESC LIMIT 3;
Query ID = hoge_20150426004444_3984e3f7-f20f-4521-b6bf-f5aaddfdfb7c
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-26 00:45:07,773 Stage-1 map = 100%, reduce = 0%
2015-04-26 00:45:11,901 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1391458279_0009
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2015-04-26 00:45:28,319 Stage-2 map = 100%, reduce = 0%
2015-04-26 00:45:32,439 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local1950979420_0010
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 888 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 888 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
bar	4
xyz	2
foo	2
Time taken: 41.991 seconds, Fetched: 3 row(s)
That came to about 42 seconds. Adding the sort adds one more Stage-Stage to the plan, and the query takes roughly twice as long as before orz
From the look of it, execution time grows roughly in proportion to the Total jobs count.
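Incidentally, you can see how many stages a query compiles to without actually running it; EXPLAIN prints the stage plan Hive generates:

```
hive> EXPLAIN select line, COUNT(*) c FROM docs group by line ORDER BY c DESC LIMIT 3;
```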
It feels like the more complex the query, the more time it eats, so this is pointless in itself, but let's use a subquery to add one more job and throw that at it.
hive> SELECT COUNT(1) FROM (select line, COUNT(*) c FROM docs group by line ORDER BY c DESC LIMIT 3) t;
Query ID = yuokada_20150426014444_d754f6cc-6ffe-43ca-9269-58cc18e0b497
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-26 01:44:48,875 Stage-1 map = 100%, reduce = 0%
2015-04-26 01:44:52,975 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local547425555_0013
Launching Job 2 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2015-04-26 01:45:09,271 Stage-2 map = 100%, reduce = 0%
2015-04-26 01:45:13,366 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local658393957_0014
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0
2015-04-26 01:45:29,673 Stage-3 map = 100%, reduce = 0%
2015-04-26 01:45:33,788 Stage-3 map = 100%, reduce = 100%
Ended Job = job_local1529346977_0015
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 1048 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 1048 HDFS Write: 0 SUCCESS
Stage-Stage-3: HDFS Read: 1048 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3
Time taken: 60.001 seconds, Fetched: 1 row(s)
As predicted, it finishes in roughly three times the time of the single-job query.
Judging from the documentation in this area, treating it like an ordinary RDBMS is clearly the wrong idea.
Reference: SQL感覚でHiveQLを書くと痛い目にあう例 — still deeper
With that, the environment for getting my head around Hadoop and Hive is more or less built, so I'm hoping to dig into the deeper parts over Golden Week or so.