Hadoop streaming

2016. 7. 1. 14:25

1. 실행

$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \

    -input myInputDir \
    -output myOutputDir \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

1) mapper

$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input myInputDir \
    -output myOutputDir \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

2) reducer

from operator import itemgetter
import sys

currentWord = None
currentCount = 0

for line in sys.stdin:
line = line.strip()

# Get elements of pair created by the mapper
word, count = line.split()

# Convert count to an integer
    try:
        count = int(count)
    except ValueError:
        continue

    if currentWord != word:
        if currentWord is not None:
            print('%s\t%d' % (currentWord, currentCount))
        currentWord = word
        currentCount = 0

currentCount += count

# Output last word group if needed
if currentCount > 0:
print('%s\t%d' % (currentWord, currentCount))

'NoSQL > Hadoop' 카테고리의 다른 글

yarn 구조 (0)	2017.03.08
hadoop read & write (0)	2017.03.06
hadoop streaming (0)	2017.03.06
hadoop locality (0)	2017.03.06
hadoop distcp (0)	2017.03.02

세모데

Hadoop streaming

'NoSQL > Hadoop' 카테고리의 다른 글

+ Recent posts

티스토리툴바