Differences

Shows the differences between two versions of the page.

--- wiki2:hadoop:ecosystem [2019/05/08 11:55]
alfred
+++ wiki2:hadoop:ecosystem [2020/05/09 09:25] (current)
@@ Line 28: / Line 28: @@
 Hive can be extended with User Defined Functions. You can also load data from various applications or formats (Avro, XML...). It can also be used with Spark (Spark can use Hive to obtain data).
+**Avro** is a format optimized for loading data into clusters. Another format for Hadoop is **Parquet**: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
 ===== Ways to read data in real time =====
@@ Line 91: / Line 92: @@
 Apache Sqoop is a tool that uses MapReduce to transfer data between Hadoop clusters and relational databases very efficiently. It works by spawning tasks on multiple data nodes to download various portions of the data in parallel. When the transfer finishes, each piece of data is replicated to ensure reliability and spread across the cluster so that it can be processed in parallel.
+The nice thing about Sqoop is that we can automatically load our relational data from MySQL into HDFS, while preserving the structure.
+Hive and Impala also let you create a schema for the HDFS files using ''CREATE EXTERNAL TABLE'' commands. However, Sqoop does that automatically.
 ===== Notes =====
   * **Cloudbase** is a group of tools pre-installed on a Linux distribution to make Hadoop technologies easier to use.
   * {{ :wiki2:hadoop:traditional_etl_vs_elt_on_hadoop.pdf |ETL and ELT}}
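
The Sqoop hunk above says that tasks download "various portions of the data in parallel". A minimal sketch of that idea follows, assuming the table has an integer key column that can be divided into contiguous ranges, one per task (Sqoop exposes this through its ''--split-by'' and ''--num-mappers'' options). The helper names ''split_ranges'' and ''range_query'', the table ''orders'', and the column ''id'' are hypothetical and used only for illustration; Sqoop's real splitting logic may differ in its details.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks.

    Illustrative only: this is not Sqoop's own code, just the general idea
    of giving each parallel task its own slice of the key space.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")

    total = max_id - min_id + 1
    # Never create more chunks than there are keys.
    chunks = min(num_mappers, total)
    base, extra = divmod(total, chunks)

    ranges = []
    start = min_id
    for i in range(chunks):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_query(table: str, column: str, lo: int, hi: int) -> str:
    """Build the bounded query one task would run (hypothetical table/column)."""
    return f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"


if __name__ == "__main__":
    for lo, hi in split_ranges(1, 100, 4):
        print(range_query("orders", "id", lo, hi))
```

Each printed query covers a disjoint slice of the key range, so the tasks can run independently and their outputs can be written to separate HDFS files.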
