uokadaの見逃し三振は嫌いです

Everything written here is based on my personal opinions and has nothing to do with any organization I belong to.

Set up a Hadoop and Hive environment

I built a Hadoop environment on my own Mac.

This time I took the quick route and installed hadoop and hive with brew.

Installation is just this one line:

% brew install hive hadoop
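
If the install went through, both commands should now be on the PATH; a quick sanity check (output varies by installed version):

% hadoop version
% hive --version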

After that I just followed a walkthrough article on Qiita.

When the ${任意のディレクトリ} placeholder is written as a relative path, it resolves starting from /usr/local/Cellar/hadoop/2.2.0/libexec. I used a relative path so that the HDFS files live under the var directory.

After running through everything in standalone mode, I set up ssh keys, turned on Remote Login under the Sharing preferences (so sshd accepts connections), and edited core-site.xml, hdfs-site.xml, and yarn-site.xml, roughly as sketched below.
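
The ssh part follows the usual single-node setup steps (skip the keygen if you already have a key):

% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% ssh localhost        # confirm passwordless login works

As for the config edits, a minimal sketch of core-site.xml for pseudo-distributed mode looks like this; the hadoop.tmp.dir value is my assumption about where the article's ${任意のディレクトリ} ends up, written as a relative path so it resolves under libexec as noted above:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <!-- assumption: relative path, so it lands under .../hadoop/2.2.0/libexec -->
    <name>hadoop.tmp.dir</name>
    <value>var/hdfs</value>
  </property>
</configuration>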

From there it was an extremely simple job: just format HDFS (the NameNode) and start Hadoop.

$ hdfs namenode -format
$ sbin/start-dfs.sh
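
If the daemons came up cleanly, jps should list something like this (pids will differ):

$ jps
12345 NameNode
12346 DataNode
12347 SecondaryNameNode
12348 Jps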

Some of the hdfs commands I typed while googling around spat out deprecation warnings here and there, which makes it obvious that quite a lot changed in 2.x.
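
For example, the old hadoop dfs spelling still runs but complains:

% hadoop dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.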

While running some MapReduce samples, I also played around a little with Hive. Writing raw MR looked pretty painful when I saw the article below, so I decided to start with Hive, which seems easier. Hadoopのいろんな言語でwordcount(1) | Tech Blog

Hadoop Streaming would probably be fine too, but I'll start with the easiest option.
Once my understanding improves and I have some breathing room, I'd also like to try Go with Hadoop Streaming.

Adventures in Go: Go on Hadoop
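
As a side note, streaming can be smoke-tested without writing any mapper or reducer code by pointing it at existing binaries. The jar path below is my guess at where the brew install keeps it, and the input/output paths are placeholders:

% hadoop jar /usr/local/Cellar/hadoop/2.2.0/libexec/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /hadoop/input -output /hadoop/out-streaming \
    -mapper /bin/cat -reducer /usr/bin/wc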

Now, on to Hive. I looked at a few articles in this area and tried a word count.

A few hdfs commands to upload the file, and preparation is complete.

% cat input/file2
xyz
xyz
bar
baz
foo
foo
bar
bar
bar
123
% hdfs dfs -mkdir /hadoop
% hdfs dfs -put input/file2 /hadoop/input
% hdfs dfs -ls -R /hadoop/input
15/04/26 00:24:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 hoge supergroup         40 2015-04-25 23:52 /hadoop/input/file2

After that, it's just a matter of throwing a query at Hive to do the aggregation. (The HADOOP_OPTS export at the top of the session below is the commonly cited workaround for the Kerberos "Unable to load realm info from SCDynamicStore" error Hadoop hits on OS X.)

% export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
% hive
<中略>
hive> CREATE EXTERNAL TABLE docs (line STRING) LOCATION '/hadoop/input';
hive> select line, COUNT(*) FROM docs group by line;
Query ID = hoge_20150426002727_4be0f399-64fc-423b-8262-06fa346599ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-26 00:28:05,906 Stage-1 map = 100%,  reduce = 0%
2015-04-26 00:28:11,149 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1134999526_0004
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 568 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
123    1
bar 4
baz 1
foo 2
xyz 2
Time taken: 21.975 seconds, Fetched: 5 row(s)
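
Strictly speaking this counts lines rather than words; it passes as a word count only because each line of file2 holds exactly one token. For multi-word lines, a sketch using Hive's split plus explode would count actual words:

hive> SELECT word, COUNT(*) AS c
    > FROM docs LATERAL VIEW explode(split(line, '\\s+')) t AS word
    > GROUP BY word;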

The result comes back in about 20 seconds. It's running on Hadoop, so that response time feels about right.

Let's add a SORT step.

hive> select line, COUNT(*) c FROM docs group by line ORDER BY c DESC LIMIT 3 ;
Query ID = hoge_20150426004444_3984e3f7-f20f-4521-b6bf-f5aaddfdfb7c
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-26 00:45:07,773 Stage-1 map = 100%,  reduce = 0%
2015-04-26 00:45:11,901 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1391458279_0009
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2015-04-26 00:45:28,319 Stage-2 map = 100%,  reduce = 0%
2015-04-26 00:45:32,439 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1950979420_0010
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 888 HDFS Write: 0 SUCCESS
Stage-Stage-2:  HDFS Read: 888 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
bar 4
xyz 2
foo 2
Time taken: 41.991 seconds, Fetched: 3 row(s)

That came to about 41 seconds. Adding the SORT added one more Stage, and it took twice as long as the previous query orz.
From the looks of it, execution time grows roughly in proportion to the Total jobs count.
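
Incidentally, the stage plan can be checked without running anything: Hive's EXPLAIN prints the stage dependencies a query compiles to, so you can see up front how many jobs you are in for.

hive> EXPLAIN select line, COUNT(*) c FROM docs group by line ORDER BY c DESC LIMIT 3;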

It feels like the more complex the query, the more time it will eat, so this is kind of pointless, but let's throw in a subquery to add one more job and run it.

hive> SELECT COUNT(1) FROM (select line, COUNT(*) c FROM docs group by line ORDER BY c DESC LIMIT 3 ) t;
Query ID = yuokada_20150426014444_d754f6cc-6ffe-43ca-9269-58cc18e0b497
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-04-26 01:44:48,875 Stage-1 map = 100%,  reduce = 0%
2015-04-26 01:44:52,975 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local547425555_0013
Launching Job 2 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2015-04-26 01:45:09,271 Stage-2 map = 100%,  reduce = 0%
2015-04-26 01:45:13,366 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local658393957_0014
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0
2015-04-26 01:45:29,673 Stage-3 map = 100%,  reduce = 0%
2015-04-26 01:45:33,788 Stage-3 map = 100%,  reduce = 100%
Ended Job = job_local1529346977_0015
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 1048 HDFS Write: 0 SUCCESS
Stage-Stage-2:  HDFS Read: 1048 HDFS Write: 0 SUCCESS
Stage-Stage-3:  HDFS Read: 1048 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3
Time taken: 60.001 seconds, Fetched: 1 row(s)

As expected, it finished in roughly three times the time of the single-job query.

From the look of the documentation in this area, you clearly can't use it the way you'd use an ordinary RDBMS. Reference: SQL感覚でHiveQLを書くと痛い目にあう例 — still deeper

With the environment for understanding Hadoop and Hive now more or less in place, I'm in the mood to dig into the deeper parts around Golden Week.