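Hadoop Streaming is used when you want to write map/reduce programs in a language other than Java (e.g., Ruby, Python).
Because the mapper and reducer exchange data as lines of text over standard input and output, it is inherently well suited to text processing.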
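Streaming passes records to the mapper and reducer one line at a time. By default, the text before the first tab on an output line is treated as the key and the rest as the value. A minimal sketch of that split follows; the helper name `split_key_value` is hypothetical and used only for illustration.

```python
def split_key_value(line, sep="\t"):
    """Split one streaming output line into (key, value).

    Mirrors the default Hadoop Streaming convention: the text before the
    first separator is the key; a line without a separator is all key.
    The function name is hypothetical, used only for this illustration.
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(sep)
    # partition() returns an empty separator when none was found.
    return (key, value if found else None)


if __name__ == "__main__":
    print(split_key_value("1950\t+0022\n"))
    print(split_key_value("no-tab-line\n"))
```

Only the first tab matters here, so a value may itself contain tabs.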
1. ruby
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files test_map.rb,test_reduce.rb \
-input /tmp/test \
-output /tmp/output \
-mapper test_map.rb \
-combiner test_reduce.rb \
-reducer test_reduce.rb
( -files : files to distribute to the Hadoop cluster, here the mapper and reducer scripts; the list is comma-separated with no spaces )
2. python
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files test_map.py,test_reduce.py \
-input /tmp/test \
-output /tmp/output \
-mapper test_map.py \
-combiner test_reduce.py \
-reducer test_reduce.py
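The command above reuses the reducer script as the combiner. That works for this job because the reducer computes a maximum, and the maximum of partial maxima equals the maximum over all values. A small sketch contrasting this with a mean, which does not combine this way; the values and the helper `combine_then_reduce` are made up for illustration.

```python
def combine_then_reduce(chunks, func):
    """Apply func per chunk (combiner step), then to the partial results (reducer step)."""
    partials = [func(chunk) for chunk in chunks]
    return func(partials)


def mean(values):
    return sum(values) / float(len(values))


# Hypothetical values for one key, split across two map tasks.
chunks = [[0, 20, 10], [25, 15]]
all_values = [v for chunk in chunks for v in chunk]

# True only when combining first gives the same answer as reducing everything at once.
max_matches = combine_then_reduce(chunks, max) == max(all_values)
mean_matches = combine_then_reduce(chunks, mean) == mean(all_values)

if __name__ == "__main__":
    print("max combines safely:", max_matches)
    print("mean combines safely:", mean_matches)
```

A reducer that computes something like a mean would need a separate combiner that emits partial sums and counts instead.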
1) map
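#!/usr/bin/env python
import re
import sys

# Each input line is one fixed-width record.
for line in sys.stdin:
    val = line.strip()
    # Slice out the year, the temperature, and the quality code.
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    # Skip missing readings (+9999) and records with an unaccepted quality code.
    if temp != "+9999" and re.match("[01459]", q):
        # print() with a single argument runs on both Python 2 and Python 3.
        print("%s\t%s" % (year, temp))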
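To check the column offsets without real input data, the mapper's rule can be wrapped in a function and fed a synthetic line. This is a sketch under assumptions: `parse_record` and `make_record` are hypothetical helpers, and the synthetic line fills only the three fields the mapper reads.

```python
import re


def parse_record(line):
    """Return (year, temp) for a usable record, else None (same rule as the mapper)."""
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        return (year, temp)
    return None


def make_record(year, temp, quality):
    """Build a synthetic 93-character line with only the three fields filled in.

    Expects a 4-character year, a 5-character temperature, and a
    1-character quality code so that the offsets stay aligned.
    """
    chars = ["0"] * 93
    chars[15:19] = year
    chars[87:92] = temp
    chars[92:93] = quality
    return "".join(chars)


if __name__ == "__main__":
    print(parse_record(make_record("1950", "+0022", "1")))  # kept
    print(parse_record(make_record("1950", "+9999", "1")))  # missing reading
    print(parse_record(make_record("1950", "+0022", "2")))  # rejected quality code
```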
2) reduce
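#!/usr/bin/env python
import sys

# Input lines arrive sorted by key, so a key change marks the end of a group.
# sys.maxsize is used instead of sys.maxint, which does not exist in Python 3.
(last_key, max_val) = (None, -sys.maxsize)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        # Key changed: emit the maximum of the previous group.
        print("%s\t%s" % (last_key, max_val))
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

# Emit the final group.
if last_key:
    print("%s\t%s" % (last_key, max_val))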
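The reducer above relies on its input being sorted by key. A rough local imitation of that flow (parse, sort by key, take the maximum per key) can be sketched as follows. It uses `itertools.groupby` rather than the `last_key` tracking in the script, and `reduce_max`, `run_local`, and the sample lines are hypothetical.

```python
from itertools import groupby


def reduce_max(sorted_pairs):
    """Yield (key, max value) per key; input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (key, max(int(val) for _, val in group))


def run_local(mapped_lines):
    """Mimic the shuffle: parse tab-separated lines, sort by key, then reduce."""
    pairs = [tuple(line.strip().split("\t")) for line in mapped_lines]
    pairs.sort(key=lambda kv: kv[0])
    return list(reduce_max(pairs))


if __name__ == "__main__":
    # Hypothetical mapper output, in arbitrary order.
    sample = ["1950\t+0022", "1949\t+0111", "1950\t-0011", "1949\t+0078"]
    for key, val in run_local(sample):
        print("%s\t%s" % (key, val))
```

Running the logic locally like this is a quick way to catch parsing mistakes before submitting the job to the cluster.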