Spark file (data) formats

2016. 7. 5. 10:56

1. File formats supported by Spark

1) File formats: text, JSON, SequenceFile, protocol buffers, etc.

2) File systems: NFS, HDFS, S3, etc.

3) Key/value stores: Cassandra, HBase, Elasticsearch, JDBC-compatible databases, etc.

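2. Text file

Loading a text file with the Spark shell

- Single file (each record is one line)

input = sc.textFile("file:///~~~~/text.file")

- Multiple files (each record is a (file path, file content) pair)

input = sc.wholeTextFiles("file:///~~/")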

rdd_result.saveAsTextFile(outfile)
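To make the difference between the two loaders concrete: `textFile` yields one record per line, while `wholeTextFiles` yields one `(path, content)` pair per file. The following is only a minimal sketch, assuming it is pasted into the PySpark shell where `sc` already exists. The helper names `count_lines` and `lines_per_file` are hypothetical and not part of Spark.

```python
def count_lines(content):
    """Count the lines in one file's full text (one wholeTextFiles value)."""
    return len(content.splitlines())


def lines_per_file(sc, directory):
    """Return {file path: line count} for every file under `directory`.

    Assumption: `sc` is an existing SparkContext (the PySpark shell
    provides one) and `directory` is a path such as the one passed to
    wholeTextFiles above.
    """
    pairs = sc.wholeTextFiles(directory)  # RDD of (path, whole file content)
    return dict(pairs.mapValues(count_lines).collect())
```

Because every file arrives as a single string, `wholeTextFiles` is probably best kept for directories of small files; for large files, `textFile` keeps records line-sized.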

3. JSON

JSON data can be loaded as text and parsed with a JSON serialization library, or read through a Hadoop format.

import json

data = input.map(lambda x: json.loads(x))

(data.filter(lambda x: x['lovepandas']).map(lambda x: json.dumps(x)).saveAsTextFile(outfile))
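The parse, filter, and serialize steps above can be tried without a cluster. The sketch below runs the same logic on a plain Python list; the helper names and the sample records are made up for illustration. Two things are additions that the snippet above does not have: malformed lines are skipped instead of raising, and a missing `lovepandas` key counts as false.

```python
import json


def parse_record(line):
    """Parse one line of JSON text; return None when the line is malformed."""
    try:
        return json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def keep_record(record, key="lovepandas"):
    """True for parsed dict records whose `key` field is truthy."""
    return isinstance(record, dict) and bool(record.get(key))


def filter_json_lines(lines, key="lovepandas"):
    """Local stand-in for map(json.loads).filter(...).map(json.dumps)."""
    records = (parse_record(line) for line in lines)
    return [json.dumps(r) for r in records if keep_record(r, key)]


# Hypothetical sample input: one match, one non-match, one broken line.
sample = [
    '{"name": "a", "lovepandas": true}',
    '{"name": "b", "lovepandas": false}',
    'not json',
]
result = filter_json_lines(sample)
```

In the Spark shell the same helpers could be passed straight to the RDD, for example `input.map(parse_record).filter(keep_record).map(json.dumps)`.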

4. SequenceFile

data = sc.sequenceFile(infile, "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")

data = sc.parallelize([('a', 1), ('b', 2)])

data.saveAsSequenceFile(outfile)

