Apache Hive

Developer(s) Contributors
Stable release 2.0.0[1] / February 15, 2016
Development status Active
Written in Java
Operating system Cross-platform
Type Database management system
License Apache License 2.0
Website hive.apache.org

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[2] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).[3][4] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.[5]

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL[6] with schema on read and transparently converts queries to MapReduce, Apache Tez,[7] and Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.[8] Other features of Hive include:

  • Indexing to accelerate queries. Index types include compaction and, as of 0.10, bitmap indexes; more index types are planned.
  • Different storage types such as plain text, RCFile, HBase, ORC, and others.
  • Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
  • Built-in user-defined functions (UDFs) for manipulating dates and strings, along with other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
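
As a rough illustration of two ideas mentioned above, schema on read and the conversion of a grouped query into map and reduce phases, the following Python sketch may help. It is a toy model, not Hive's implementation: the sample data, the SCHEMA definition, and all function names are assumptions made for this example, and turning unparseable fields into None is a choice of the sketch.

```python
from collections import defaultdict

# Raw file contents as they might sit in storage: delimited text, no schema attached.
RAW_LINES = [
    "2016-02-15\talice\t3",
    "2016-02-15\tbob\t5",
    "2016-02-16\talice\t7",
    "malformed line",
]

# Hypothetical table schema, applied only when the data is read.
SCHEMA = [("day", str), ("user", str), ("clicks", int)]


def read_with_schema(line, schema=SCHEMA, delimiter="\t"):
    """Parse one raw line against the schema; fields that do not fit become None."""
    parts = line.split(delimiter)
    row = {}
    for i, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(parts[i])
        except (IndexError, ValueError):
            row[name] = None
    return row


def map_phase(rows):
    """Emit (user, clicks) pairs, skipping rows whose key or value is missing."""
    for row in rows:
        if row["user"] is not None and row["clicks"] is not None:
            yield row["user"], row["clicks"]


def reduce_phase(pairs):
    """Group pairs by key and sum the values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_query(lines):
    """Toy equivalent of: SELECT user, SUM(clicks) FROM t GROUP BY user."""
    rows = (read_with_schema(line) for line in lines)
    return reduce_phase(map_phase(rows))
```

Calling run_query(RAW_LINES) sums the clicks per user; the malformed line is stored without complaint and only fails to match the schema when it is read.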

By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.[9]

Hive supports four file formats: TEXTFILE,[10] SEQUENCEFILE, ORC,[11] and RCFILE.[12][13][14] Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting at 0.13.[15][16] Additional Hive plugins support querying of the Bitcoin blockchain.[17]

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but offers only basic support for indexes. HiveQL originally lacked support for transactions and materialized views, and it offers only limited subquery support.[18][19] Support for insert, update, and delete with full ACID functionality became available in release 0.14.[20]
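
The multitable insert extension can be pictured as a single scan of a source table that feeds several destination tables at once. The Python sketch below imitates that behavior under stated assumptions: the table names, column names, and the multi_table_insert helper are hypothetical, and the HiveQL in the comment only shows the general shape of such a statement.

```python
# HiveQL shape being imitated (table and column names are hypothetical):
#   FROM page_views pv
#   INSERT OVERWRITE TABLE views_us   SELECT pv.* WHERE pv.country = 'US'
#   INSERT OVERWRITE TABLE views_long SELECT pv.* WHERE pv.seconds > 60


def multi_table_insert(source_rows, targets):
    """Scan source_rows once; append each row to every target whose predicate matches."""
    outputs = {name: [] for name in targets}
    for row in source_rows:
        for name, predicate in targets.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs


PAGE_VIEWS = [
    {"user": "alice", "country": "US", "seconds": 90},
    {"user": "bob", "country": "DE", "seconds": 120},
    {"user": "carol", "country": "US", "seconds": 10},
]

TARGETS = {
    "views_us": lambda r: r["country"] == "US",
    "views_long": lambda r: r["seconds"] > 60,
}

result = multi_table_insert(PAGE_VIEWS, TARGETS)
```

A row may land in more than one destination (here, the row for alice matches both predicates), while the source is still read only once.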

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.[21]
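
To make the idea of a directed acyclic graph of jobs more concrete, the following Python sketch orders a small, made-up plan so that each stage runs only after the stages it depends on. The stage names and the execution_waves helper are assumptions for illustration; the sketch uses the standard-library graphlib module (Python 3.9 or later) and says nothing about how Hive's own compiler or scheduler is written.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
PLAN = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def execution_waves(plan):
    """Return lists of stages; each wave depends only on stages in earlier waves."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

In this plan the two scans have no dependencies and could run in parallel in the first wave, followed by the join, the aggregation, and the final write.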

References

  1. "Apache Hive Download News".
  2. Venner, Jason (2009). Pro Hadoop. Apress. ISBN 978-1-4302-1942-2.
  3. Use Case Study of Hive/Hadoop
  4. OSCON Data 2011, Adrian Cockcroft, "Data Flow at Netflix" on YouTube
  5. Amazon Elastic MapReduce Developer Guide
  6. HiveQL Language Manual
  7. Apache Tez
  8. Working with Students to Improve Indexing in Apache Hive
  9. Lam, Chuck (2010). Hadoop in Action. Manning Publications. ISBN 1-935182-19-6.
  10. Optimising Hadoop and Big Data with Text and Hive
  11. LanguageManual ORC
  12. Faster Big Data on Hadoop with Hive and RCFile
  13. Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop
  14. Yongqiang He; Rubao Lee; Yin Huai; Zheng Shao; Namit Jain; Xiaodong Zhang; Zhiwei Xu. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (PDF).
  15. "Parquet". 18 Dec 2014. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
  16. Massie, Matt (21 August 2013). "A Powerful Big Data Trio: Spark, Parquet and Avro". zenfractal.com. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
  17. Franke, Jörn. "Hive & Bitcoin: Analytics on Blockchain data with SQL". https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/
  18. White, Tom (2010). Hadoop: The Definitive Guide. O'Reilly Media. ISBN 978-1-4493-8973-4.
  19. Hive Language Manual
  20. ACID and Transactions in Hive
  21. Hive A Warehousing Solution Over a MapReduce Framework