Hadoop is currently the most widely used Big Data platform, although other technologies also play a role in the scene. While there are proprietary Hadoop distributions developed by large Big Data companies, such commercial products rely heavily on open-source projects.
The Hadoop ecosystem includes a set of tools that work alongside MapReduce and HDFS (the two main Hadoop core components) and help them store and manage data, as well as perform analytic tasks. As the number of technologies surrounding Hadoop keeps growing, it is important to realize that certain products may be more appropriate for certain requirements than others.
In this article, we survey a number of technologies that collectively form the Hadoop environment, including the core components, database and data management tools, analytic tools, data transfer tools, security tools, cloud computing for Hadoop, and data serialization.
The Hadoop core technologies provide a fault-tolerant mechanism for storing large datasets. The Hadoop Distributed File System (HDFS) is where data are stored: data files are broken into blocks, which are distributed over the servers of a cluster. HDFS is designed to run on large clusters and to be resilient to failures, since it keeps several copies of each data block. MapReduce, on the other hand, is a paradigm for processing data. It was the first programming model for developing Hadoop applications, comprising two kinds of programs written in Java: mappers, which extract data from HDFS and emit it as key-value pairs, and reducers, which aggregate the results produced by the mappers.
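The mapper/reducer division of labor can be illustrated with a minimal in-memory sketch. This is not the Hadoop API (real jobs implement Mapper and Reducer classes in Java and run over HDFS blocks); the function names and the word-count task are ours, chosen only to show the map, shuffle, and reduce phases:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) key-value pair for every word in a line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: aggregate all counts emitted under the same key.
    return (word, sum(counts))

def run_job(lines):
    # "Shuffle" phase: gather all mapper output and group pairs by key,
    # mimicking what the Hadoop framework does between the two phases.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    grouped = groupby(pairs, key=itemgetter(0))
    return dict(reducer(word, (c for _, c in group)) for word, group in grouped)

print(run_job(["big data on Hadoop", "data data everywhere"]))
# {'big': 1, 'data': 3, 'everywhere': 1, 'hadoop': 1, 'on': 1}
```

The same three-phase structure scales out in Hadoop because mappers run independently on separate data blocks and each reducer only sees the pairs for its own keys.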
Database and data management tools store and manage data, mostly using the NoSQL paradigm. They do not use the familiar Structured Query Language (SQL), fixed database schemas, or other common internal operations of relational databases. Common NoSQL storage types include document databases (e.g. MongoDB and CouchDB), graph databases (e.g. Neo4j and Giraph), key-value databases (e.g. Cassandra) and others.
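To make the contrast with relational schemas concrete, here is a toy sketch of the document model used by stores such as MongoDB. The class and the records are invented for illustration; real document databases add indexing, querying, and replication on top of this basic key-to-document idea:

```python
class DocumentStore:
    """A toy in-memory document store: each key maps to a schema-less record."""

    def __init__(self):
        self._docs = {}  # key -> document (a plain dict)

    def put(self, key, document):
        # Unlike a relational table, documents need not share the same fields.
        self._docs[key] = document

    def get(self, key):
        return self._docs.get(key)

store = DocumentStore()
store.put("user:1", {"name": "Alice", "tags": ["hadoop", "nosql"]})
store.put("user:2", {"name": "Bob", "city": "Cairo"})  # different fields, no schema change
print(store.get("user:2")["city"])  # Cairo
```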
The analytic tools help perform data pre-processing operations such as data cleaning, data integration, data transformation and data reduction. They also implement machine learning algorithms (e.g. classification or regression) to build insights from data and thus support business intelligence (BI). For example, Apache Mahout is a collection of machine learning algorithms (e.g. k-means for data clustering, and random forest and logistic regression for data classification) that perform sophisticated analytic operations. The library has been widely used in recent years to develop recommender systems for online businesses. Designed especially for Hadoop, Pig is another project for processing data; it shortens development time and makes data pipelines easier to write. It can extract, transform and load data (usually referred to in database jargon as ETL). Pig uses a procedural data processing approach, in contrast to another popular project, Hive, which is based on writing declarative, SQL-like queries.
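As a flavor of what an algorithm like k-means does, the following is a from-scratch toy version on made-up 2D points. Mahout's value is running this kind of algorithm at scale on Hadoop; the implementation below is only a sketch of the underlying assign-and-update loop:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means on 2D points: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids

# Two obvious groups of points; k-means should recover their two centers.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 8)]
print(sorted(kmeans(points, k=2)))
```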
Data transfer tools help move data between Hadoop clusters and external data sources. Apache Flume, for example, is responsible for collecting, aggregating and moving data into HDFS. It supports multi-hop flows as well as fan-in and fan-out patterns. Its sources, sinks and channels are defined in a configuration file. The Flume project was originally designed to make data collection easy and scalable by running agents on the source machines; the agents send data updates to collectors, which in turn aggregate them into large chunks that are later saved as HDFS files. Sqoop (SQL to Hadoop), another tool, is meant to transfer data between Hadoop clusters and relational databases (such as Oracle or Microsoft SQL Server) that traditionally use SQL instructions.
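A Flume configuration file wires a named agent's sources, channels and sinks together as properties. The fragment below is a hedged illustration: the agent and component names (agent1, src1, ch1, sink1), the log path and the HDFS URL are all invented, and a real deployment would tune many more properties:

```properties
# Hypothetical agent "agent1": tail a log file, buffer events in memory,
# and write them out to HDFS.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1
```

An agent reading such a file would typically be launched with the flume-ng command, naming the agent and the configuration file to use.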
The need to secure Hadoop against malicious attacks has also received attention from major software developers in recent years. The security issues involved relate to authentication (checking identity), authorization (checking privileges), encryption (protecting data in transit), and auditing (logging services). Hence, tools such as Knox and Sentry were developed. Kerberos, for instance, provides authentication services in Hadoop clusters, although it was originally developed as a network authentication protocol that issues encrypted tickets between clients and servers.
Cloud computing for Hadoop offers low upfront costs and the ability to scale up. Cloud computing provides important services such as resource sharing, rapid elasticity, network access, on-demand self-service and measured resource service to Hadoop clusters in the cloud. However, cloud computing and the closely related concept of virtualization (creating virtual computing entities) do come at a cost: usually some degree of performance degradation. While Amazon Web Services (AWS) is the most popular cloud computing service, Serengeti is a Hadoop virtualization tool that helps build virtual Hadoop clusters in the cloud.
Data serialization concerns what the data should look like (i.e. its internal representation and references) as it moves from one place to another, especially since Big Data processing requires transferring data between the different parts of the system (perhaps several times). The different stages of data processing may require different languages and Application Programming Interfaces (APIs). There is a wide range of data serialization tools, so it is important to keep a number of factors in mind when choosing a serialization format: the size of the data, the speed at which computers can read and write it, whether it can be easily understood, and its ease of use. In addition to Thrift, Avro, Protocol Buffers (Protobuf), BSON (Binary JSON), and Parquet, JSON (JavaScript Object Notation) has become a preferred way to transfer data inside the Hadoop environment. It is self-describing, hierarchical, has a fairly simple format, and uses the key-value method to describe data. The big advantage of using JSON is that it maps onto the data structures of most programming languages and keeps the parsing code and schema design as simple as possible.
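A short round trip shows why JSON maps so directly onto common data structures: field names travel with the values (self-describing), records nest (hierarchical), and the parsed result comes back as native dictionaries and lists. The record below is made up for illustration:

```python
import json

# A hierarchical, self-describing record: keys name the values, and
# nested objects/arrays map to dicts and lists.
record = {
    "id": 42,
    "user": {"name": "Alice", "active": True},
    "events": ["login", "query"],
}

encoded = json.dumps(record)   # serialize: native objects -> JSON text
decoded = json.loads(encoded)  # deserialize: JSON text -> native objects

print(decoded == record)  # True: the round trip preserves the structure
```

Binary formats such as Avro or Protobuf trade this human readability for smaller size and faster parsing, which is exactly the kind of trade-off the factors above are meant to weigh.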