Apache Hadoop is an emerging technology designed to address the specific requirements of Big Data: it can process petabytes of structured and unstructured data. The project was created in 2005 by Doug Cutting and Mike Cafarella, with Yahoo! as a major early contributor, and it got its name from a toy elephant belonging to Cutting's son. Hadoop does not work alone, however; it sits at the center of a growing ecosystem of associated technologies such as HBase, Hive, Pig, Oozie, and ZooKeeper.
Hadoop has three key characteristics:
- It is a fault-tolerant, open-source software framework that can cope with both software and hardware failures.
- It scales well as processors, memory, or storage devices are added.
- It supports parallel programming models through MapReduce.
Typically, Hadoop consists of two basic elements:
- The first part, the Hadoop Distributed File System (HDFS), provides high-bandwidth, cluster-based storage by distributing large datasets across multiple servers.
- The second part, MapReduce, is the programming model that processes those datasets in parallel across the cluster.
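To make the MapReduce part of this division concrete, the sketch below implements the model's three phases (map, shuffle, reduce) in plain Python for a word-count job. This is a single-machine illustration of the idea only, not the actual Hadoop API; the function names are our own, and in a real cluster each phase would run in parallel on many nodes.

```python
# Minimal sketch of the MapReduce model (not the Hadoop API):
# map emits (word, 1) pairs, shuffle groups pairs by key,
# reduce sums the counts per word.
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group all values by key, mimicking Hadoop's shuffle-and-sort step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(documents):
    # In Hadoop these phases run in parallel across the cluster;
    # here we simply run them one after another on one machine.
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))

counts = word_count(["big data is big", "data moves fast"])
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The value of the model is that the map and reduce functions are stateless and per-key, so the framework can split the input across thousands of machines without the programmer writing any coordination code.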
Hadoop is useful in situations where we have complex information, unstructured data, heavily recursive algorithms, machine learning implementations, fault-tolerant tasks, parallelized algorithms (such as geo-spatial analysis), sequence processing, or datasets too big to fit on a single storage device.
For all its strengths, Hadoop is still not the technology of choice when we need to analyze conventional structured data such as call records, customer information, and transaction data. In such cases, traditional relational database management systems (RDBMS) can perform better.