Slides da segunda aula - IC

INF550 - Computação em Nuvem I
Apache Spark
Islene Calciolari Garcia
Instituto de Computação - Unicamp
Julho de 2016
Sumário
Revisão da aula passada...
Objetivos
HDFS
MapReduce
Ecossistema do Hadoop
Spark
RDDs e SparkContext
Como testar?
Revisão rápida de Python
pyspark
Laboratório
Revisão da aula passada...
Objetivos
I
Primeira parte do curso
I
I
I
Cloud computing
Data centers
Segunda parte do curso
I
I
Modelo de Programação MapReduce
Spark e Ecossistema do Hadoop
Arquitetura do HDFS
Fonte: http://hadoop.apache.org
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the name‐
HDFS
node, and the datanodes, consider Figure 3-2, which shows the main sequence of events
when reading a file.
Leitura de arquivo
Figure 3-2. A client reading data from HDFS
The client opens the file it wishesFonte:
to readHadoop—The
by calling open()Definitive
on the FileSystem
object,
Guide, Tom
White
which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2).
DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to
to understand the data flow because it clarifies HDFS’s coherency model.
HDFS
We’re going to consider the case of creating a new file, writing data to it, then closing
theem
file.arquivo
This is illustrated in Figure 3-4.
Escrita
Figure 3-4. A client writing data to HDFS
Fonte: Hadoop—The Definitive Guide, Tom White
The client creates the file by calling create() on DistributedFileSystem (step 1 in
MapReduce
Visão colorida
http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
Word Count
http://www.cs.uml.edu/~jlu1/doc/source/report/img/MapReduceExample.png
Ecossistema do Hadoop
Ecossistema do Hadoop
http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
Hadoop 1.0
Fonte: http://hadoop.apache.org
Hadoop 1.0
I
JobTracker
I
I
I
Gerenciamento dos Task Trackers (recursos e falhas)
Gerenciamento do ciclo de vida dos jobs
TaskTracker
I
I
iniciar e encerrar task
enviar status para o JobTracker
I
Escalabilidade?
I
Outros modelos de programação?
YARN
Yet Another Resource Negotiator
Fonte: http://hadoop.apache.org
YARN
I
JobTracker estava sobrecarregado
I
I
I
I
Gerenciamento de recursos
Gerenciamento de aplicações
Container: abstração que incorpora recursos como cpu,
memória, disco, rede...
ResourceManager
I
Escalonador de recursos
I
NodeManager
I
ApplicationMaster
Testando o YARN
$ sbin/start-yarn.sh
$ bin/hadoop dfs -put input /input
$ bin/yarn jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
wordcount /input /output
$ bin/hadoop dfs -get /output output
I
Verifique os jobs em http://localhost:8088/
I
Quando terminar de usar
$ sbin/stop-yarn.sh
Gerenciamento de múltiplas aplicações
http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/
Define reducers
Como
o Spark
tão mais
For
complex
work,consegue
chain jobs ser
together
rápido?
Modelo de processamento MapReduce
– Use a higher level language or DSL that does this for you
Typical MapReduce Workflows
Job 1
Last Job
Job 2
© 2014 MapR Technologies
Maps
Reduces
SequenceFile
Input to
Job 1
Maps
Maps
Reduces
SequenceFile
SequenceFile
Output from
Job 1
Reduces
Output from
Job 2
Input to
last job
Output from
last job
HDFS
© 2014 MapR Technologies
37
Carol McDonald: An Overview of Apache Spark
1
Como oIterations
Spark consegue ser tão mais rápido?
Resilient Distributed Datasets
Step
Step
Step
Step
Step
In-memory Caching
Carol McDonald: An Overview of Apache Spark
• Data Partitions read from RAM instead of disk
I
RDD: principal abstração em Spark
I
Imutável
I
Tolerante a falhas
© 2014 MapR Technologies
Spark Programming Model
SparkContext
Driver Program
sc=new SparkContext
rDD=sc.textfile(“hdfs://…”)
rDD.map
SparkContext
cluster
Worker Node
Task
Worker Node
Task
Task
Carol McDonald: An Overview of Apache Spark
SparkContext
Cluster overview
http://spark.apache.org/docs/latest/cluster-overview.html
SparkContext
RDDs
e partições Datasets (RDD)
ient
Distributed
es around RDDs
ant
collection of
on in parallel
memory
eley.edu/~matei/papers/
df
http://www.cs.berkeley.edu/
~matei/papers/2012/nsdi_spark.pdf
© 2014 MapR
Technologies
57
��
Operações
com RDDs
��
��
��
I Muito mais do que Map e Reduce
��
I
Transformações e Ações
I
Processamento lazy das transformações
��
��
��
��
��
��
��
http://databricks.com
Como testar?
Como testar?
Spark e Python
I
Spark pode facilmente ser utilizado com Scala, Java ou
Python
I
I
Veja Spark Quick Start
Shells
I
I
python shell
pyspark
Revisão rápida de Python
Operações com Strings
Comandos
astring = "Spark"
print astring
print astring.len
print astring[0]
print astring[1:3]
print astring[3:]
print astring[0:5:2]
print astring[::-1]
Saı́da
Spark
5
S
pa
rk
Sak
krapS
Python: Mais operações com Strings
Comandos
line = " GNU is not Unix.
line = line.strip()
print line
words = line.split()
print words
print words[1]
Saı́da
"
GNU is not Unix.
[’GNU’, ’is’, ’not’, ’Unix’]
is
Revisão rápida de Python
Funções
def soma(a,b):
return a + b
def mult(a,b):
return a * b
def invertString(s):
return s[::-1]
Revisão rápida de Python
Funções lambda
Funções que não recebem um nome em tempo de execução
>>> lista = range(1,10)
>>> print lista
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> def impar(i) :
...
return i % 2 != 0
>>>
[1,
>>>
[1,
filter (impar, lista)
3, 5, 7, 9]
filter (lambda x: x % 2 != 0, lista)
3, 5, 7, 9]
pyspark
orking
With RDDs
Primeiro SparkContext
textFile = sc.textFile(”SomeFile.txt”)
RDD
Welcome to
____
__
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ ‘/ __/ ’_/
/__ / .__/\_,_/_/ /_/\_\
/_/
Carol McDonald: An Overview of Apache Spark
version 1.6.1
Using Python version 2.7.11 (default, Jun 20 2016 14:45:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> lines = sc.textFile("tcpdump.list");
© 2014 MapR Technologies
1998 DARPA Intrusion Detection Evaluation
1
2
3
4
5
7
8
9
10
11
13
Start
Start
Date
Time
01/27/1998 00:00:01
01/27/1998 05:04:43
01/27/1998 06:04:36
01/27/1998 08:45:01
01/27/1998 09:23:45
01/27/1998 15:11:32
01/27/1998 21:53:17
01/27/1998 21:58:21
01/27/1998 22:57:53
01/27/1998 23:57:28
01/27/1998 25:38:00
Duration
00:00:23
67:59:01
00:00:59
00:00:01
00:00:01
00:00:12
00:00:45
00:00:01
26:59:00
130:23:08
00:00:01
Src
Dest Src
Dest
Attack
Serv
Port Port IP
IP
Score Name
ftp
1755 21 192.168.1.30 192.168.0.20 0.31 telnet 1042 23 192.168.1.30 192.168.0.20 0.42 smtp
43590 25 192.168.1.30 192.168.0.40 12.0 finger 1050 79 192.168.0.40 192.168.1.30 2.56 guess
http
1031 80 192.168.1.30 192.168.0.40 -1.3 sunrpc 2025 111 192.168.1.30 192.168.0.20 3.10 rpc
exec
2032 512 192.168.1.30 192.168.0.40 2.95 exec
http
1031 80 192.168.1.30 192.168.0.20 0.45 login 2031 513 192.168.0.40 192.168.1.20 7.00 shell 1022 514 192.168.1.30 192.168.0.20 0.52 guess
eco/i 192.168.0.40 192.168.1.30 0.01 -
1998 DARPA Intrusion Detection Evaluation
https://www.ll.mit.edu/ideval/docs/index.html
Filter
Working With RDDs
textFile = sc.textFile(”SomeFile.txt”)
RDD
RDD
RDD
RDD
Transformations
linesWithSpark = textFile.filter(lambda line: "Spark” in line)
MapR Technologies
Carol McDonald: An Overview© 2014
of Apache
Spark
Como imprimir e filtrar um RDD
>>> lines = sc.textFile("tcpdump.list")
>>> lines.take(10)
>>> for x in lines.collect():
...
print x
>>> telnet = lines.filter(lambda x: "telnet" in x)
>>> for x in telnet.collect():
...
print x
Algumas transformações simples
map(func)
flatmap(func)
filter(func)
groupByKey()
reduceByKey(func)
sortByKey(ascending)
todo elemento do RDD original será
transformado por func
todo elemento do RDD original será
transformado em 0 ou mais itens por func
retorna apenas elementos selecionados por func
Dado um dataset (k, v)
retorna (k, Iterable<v>)
Dado um dataset (k, v)
retorna outro, com chaves agrupadas por func
Dado um dataset (k,v) retorna outro
ordenado em ordem ascendente ou descendente
Veja mais em Spark Programming Guide
Algumas ações
count()
collect()
take(n)
retorna o número de elementos no dataset
retorna todos elementos do dataset
retorna os n primeiros elementos do dataset
Veja mais em Spark Programming Guide
Laboratório
I
Instale o Spark
I
Obtenha uma versão do darpa dataset
I
Elabore questões interessantes e opere com os dados
I
Entrega de código e relatório via Moodle
Exemplo simples
Como ordenar os acessos por tipo de serviço
>>> lines = sc.textFile("tcpdump.list")
>>> servicePairs = lines.map(lambda x: (str(x.split()[4]),
>>> sortedServ = servicePairs.sortByKey()
Como poderı́amos rastrear os dados para identificar os ataques???
Referências
I
I
Python Tutorial
Apache Spark
I
I
Spark Programming Guide
Clash of Titans: MapReduce vs. Spark for Large Scale Data
Analytics, Juwei Shi e outros, IBM Research, China