In a previous post, we addressed the Hadoop ecosystem and the set of tools that reside and operate around the two core components of Hadoop (i.e. MapReduce and HDFS), helping them store and manage data and perform various analytic tasks. However, the Big Data landscape is more than Hadoop alone.
In this post, we will widen the circle a little and address the many technologies involved in Big Data processes. The Big Data landscape can be daunting: the vast proliferation of technologies in this competitive market means there is no single go-to solution. However, the different tools and frameworks can be grouped, by similarity of goal and functionality, into a number of main components:
- Distributed file systems: file systems that run on multiple servers and allow files to be accessed from multiple hosts, so that files and storage resources can be shared by many users. They can hold very large files, replicate content across multiple servers, and scale easily; storing, retrieving and deleting files are the most common operations. Distributed file systems differ in their performance, the mutability of content, the handling of concurrent writes, the handling of permanent or temporary loss of nodes or storage, and their policy for placing content. The best-known example is the Hadoop Distributed File System (HDFS), the open-source implementation of the Google File System. Other examples include the Tachyon File System, XtreemFS, Ceph File System, OpenAFS and several others.
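The block-and-replica idea behind HDFS can be sketched in a few lines. This is an illustrative toy, not real HDFS code: the constants (`BLOCK_SIZE`, `REPLICATION`), node names, and the round-robin placement policy are all invented for the example (HDFS itself uses 128 MB blocks and rack-aware placement).

```python
# Toy sketch of HDFS-style block storage: split a file into fixed-size
# blocks, then replicate each block on several data nodes.
BLOCK_SIZE = 4          # bytes per block (HDFS default is 128 MB)
REPLICATION = 3         # copies of each block (HDFS default is 3)
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical data nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_block(block_id: int, replication: int = REPLICATION):
    """Pick `replication` distinct nodes for a block, round-robin style."""
    return [NODES[(block_id + r) % len(NODES)] for r in range(replication)]

data = b"hello distributed file systems"
blocks = split_into_blocks(data)
placement = {i: place_block(i) for i in range(len(blocks))}
```

Losing one node is survivable because every block still has copies on two other nodes; a real system would also re-replicate under-replicated blocks.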
- Distributed programming frameworks: open-source frameworks that help deal with distributed data and with the complexities of distributed programming, such as restarting failed jobs and tracking results. The best-known example here is the Apache Hadoop software library, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Other Hadoop-related projects include Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Tez, and ZooKeeper.
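The "simple programming model" at the heart of Hadoop is MapReduce, and its shape is easy to show in miniature. The sketch below simulates the three phases (map, shuffle, reduce) of the classic word-count job in plain Python, on one machine; the framework's real value is running exactly this pattern across thousands of nodes with failure handling.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (word, 1) pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))  # e.g. counts["big"] == 2
```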
- Data integration frameworks: tools, such as Apache Sqoop and Apache Flume, that move data from one data source to another. The process is similar to ETL (extract, transform, and load) in traditional data warehousing systems. Apache Flume, for example, is responsible for collecting, aggregating and moving data into HDFS; it supports multi-hop flows as well as fan-in and fan-out topologies. Sqoop (SQL to Hadoop), on the other hand, transfers data between Hadoop clusters and relational databases (such as Oracle or Microsoft SQL Server) that are queried with SQL.
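To make the extract-and-load idea concrete, here is a toy ETL in the spirit of a Sqoop import: rows are extracted from a relational table and written out as CSV records, the kind of flat files that would land in HDFS. The table name, columns, and data are invented for this sketch; a real Sqoop job would parallelize the extraction across mappers.

```python
import csv
import io
import sqlite3

# Extract: a small in-memory relational table stands in for Oracle/SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Linus")])

# Load: write each row as a CSV record, ready for a distributed file system.
buf = io.StringIO()
writer = csv.writer(buf)
for row in conn.execute("SELECT id, name FROM customers ORDER BY id"):
    writer.writerow(row)

csv_text = buf.getvalue()
```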
- Machine learning frameworks: with the amount of data we need to analyze today, it has become crucial to develop specialized frameworks and libraries that can handle it. Machine learning is ideal for exploiting the opportunities hidden in Big Data and well suited to the complexity of disparate data sources and the huge variety of variables and volumes of data involved. A large number of machine learning frameworks exist, including Amazon Machine Learning (visualization tools and wizards that guide you through creating ML models without having to learn complex ML algorithms and technology), Apache Mahout (a machine learning and math library on top of MapReduce), scikit-learn (machine learning in Python), Spark MLlib (a Spark implementation of common ML functionality), Microsoft Azure Machine Learning (built on the machine learning capabilities already available in several Microsoft products, including Xbox and Bing, and using predefined templates and workflows), Ayasdi Core, brain, Cloudera Oryx, Concurrent Pattern, convnetjs, and others.
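What these frameworks automate at scale can be illustrated with the simplest possible model. The sketch below fits a straight line by ordinary least squares in plain Python; libraries like scikit-learn or Spark MLlib wrap this kind of computation (and far richer models) behind a fit/predict API and, in Spark's case, distribute it across a cluster.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, computed by hand."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x   # intercept from the means
    return a, b

a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear toy data
```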
- NoSQL databases: non-traditional paradigms that can deal with large (Web-scale) datasets and address the challenges of Big Data processes under constraints of scope, time, and cost. They offer several advantages over traditional database systems, such as high scalability, easier management and administration, low cost, schemaless data representation, shorter development time, speed, and flexible data models. Many different types of NoSQL databases have arisen, but they can be categorized as document-oriented (e.g. MongoDB, CouchDB and SimpleDB), columnar (e.g. BigTable and HBase), key-value (e.g. MemcacheDB, Redis and Riak), XML (e.g. MarkLogic, BaseX and eXist) and graph (e.g. Neo4j, GraphDB and Giraph).
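The "schemaless" point is easiest to see in the document-oriented style. Below is a toy in-memory document store whose `find` mimics a MongoDB-flavoured query-by-example; the class and its methods are invented for illustration, and documents need not share any fixed set of fields.

```python
class TinyDocStore:
    """Toy schemaless document store with a query-by-example find()."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(dict(doc))   # no schema enforced on insert

    def find(self, query):
        """Return documents whose fields match every key/value in `query`."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

store = TinyDocStore()
store.insert({"name": "Ada", "field": "math"})
store.insert({"name": "Linus", "lang": "C"})   # different fields: fine
hits = store.find({"lang": "C"})
```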
- Scheduling tools: job schedulers and workload automation solutions that provide end-to-end automation of ETL, data warehousing, and reporting, including triggering new jobs (e.g. starting a MapReduce task whenever a new data file is added to a folder), passing data, and managing dependencies between systems. Apache Oozie, for instance, is a scalable, reliable and extensible workflow scheduler that serves as the default job scheduler on top of Hadoop in many Hadoop distributions. It is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). Other scheduling tools include Apache Aurora, Apache Falcon, Chronos, LinkedIn Azkaban, Pinterest Pinball, and Sparrow.
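The dependency management these schedulers perform amounts to ordering a directed acyclic graph of jobs. The sketch below uses Python's standard-library topological sorter on a hypothetical workflow (the job names are invented); Oozie expresses the same idea as a workflow of actions with transitions.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
deps = {
    "transform": {"ingest"},
    "validate": {"ingest"},
    "report": {"transform", "validate"},
}

# static_order() yields jobs so that every dependency runs first.
order = list(TopologicalSorter(deps).static_order())
```

A scheduler would additionally run independent jobs (here `transform` and `validate`) in parallel and restart failed ones.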
- Benchmarking tools: tools developed to optimize Big Data installations by providing standardized profiling suites, each built from a representative set of Big Data jobs. Examples include Apache Hadoop Benchmarking, Berkeley SWIM Benchmark, Big-Bench, Hive-benchmarks, Hive-testbench, Intel HiBench, Mesosaurus, Netflix Inviso, PUMA Benchmarking, and Yahoo Gridmix3.
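At their core, these suites time representative jobs and compare the results across configurations. A minimal sketch of that idea, with an invented `benchmark` helper standing in for a full suite:

```python
import time

def benchmark(job, repeats=3):
    """Run `job` several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        job()                                  # the representative workload
        best = min(best, time.perf_counter() - start)
    return best

# A trivial stand-in workload; a real suite runs sort, join, ML jobs, etc.
elapsed = benchmark(lambda: sum(range(100_000)))
```

Taking the best of several runs reduces noise from caches and other tenants, which matters on shared clusters.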
- System deployment: deploying new applications into the Big Data cluster (e.g. a Hadoop cluster) and automating the installation and configuration of Big Data components. The benefit expected from a Big Data infrastructure will be compromised if deployment is done the wrong way; correct deployment of Big Data processes is critical and requires specific skills and resources. For instance, Apache Ambari is an easy-to-use Hadoop management web UI backed by RESTful APIs. It can deploy a complete Hadoop system from scratch; however, it is not possible to use this GUI on a Hadoop system that is already running. Apache Mesos is a cluster manager that provides resource sharing and isolation across cluster applications. Apache YARN is a sub-project of Hadoop at the Apache Software Foundation, introduced in Hadoop 2.0, that separates the resource-management and processing components. Other examples include Ankush, Apache Bigtop, Apache Helix, Apache Slider, Apache Whirr, Brooklyn, Buildoop, etc.
- Service programming: dealing with how to obtain data and how to deploy the results of analysis back into the Web and other systems. The two best-known examples of such services are Representational State Transfer (REST) and Service-Oriented Architecture (SOA) APIs that connect Big Data components and services to the rest of the application. SOA is more appropriate for applications that mainly consume the results of specific analytic or reduction processes. REST, on the other hand, makes sense when an application needs to address Big Data resources directly, without abstraction into high-level services.
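The REST idea — addressing resources directly by URL and HTTP verb — can be shown with a tiny dispatcher. Everything here is hypothetical (the `/datasets/<id>` path scheme, the dict-backed store, the `handle` function); a real service would sit behind an HTTP server and return proper status codes.

```python
def handle(method, path, store):
    """Dispatch a REST-style request onto a dict-backed resource store."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "datasets":
        if method == "GET":                # GET /datasets/<id> -> read
            return store.get(parts[1])
        if method == "PUT":                # PUT /datasets/<id> -> create
            store[parts[1]] = "stored"
            return "created"
    return "unsupported"

store = {}
handle("PUT", "/datasets/sales2024", store)
result = handle("GET", "/datasets/sales2024", store)
```

The contrast with SOA is that nothing here exposes an operation like `computeSalesReport`; the client manipulates the resource itself.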
- Security: Big Data has offered a tremendous opportunity for enterprises across industries. However, without appropriate encryption solutions in place, Big Data can mean big problems, particularly because data sources can include personally identifiable information, payment card data, intellectual property, health records, and more. Big Data security tools provide control over access to data. Encryption tools can protect Big Data by encrypting and controlling access at the file-system level, or by encrypting specific columns in an application before the field is written to a database. Another form of protection is encrypting data stored in system logs, configuration files, error logs, disk caches, etc., and allowing only privileged users to access it. Apache Knox Gateway is a system that provides a single point of secure access for Apache Hadoop clusters. Other examples include Apache Ranger (formerly Apache Argus), a framework to enable, monitor and manage comprehensive data security across the Hadoop platform, as well as Apache Sentry, PacketPig, and Voltage SecureData.
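Column-level encryption — encrypting the sensitive field in the application before it reaches the database — can be sketched as follows. The XOR "cipher" here is a deliberately insecure toy chosen to keep the example self-contained; production systems use vetted ciphers such as AES via a proper cryptography library, and the key, field names, and card number are all invented.

```python
import base64

KEY = b"demo-key"   # toy key; never hard-code keys in real systems

def toy_encrypt(plaintext: str, key: bytes = KEY) -> str:
    """XOR-based 'encryption' for illustration only -- NOT secure."""
    raw = plaintext.encode()
    out = bytes(b ^ key[i % len(key)] for i, b in enumerate(raw))
    return base64.b64encode(out).decode()

def toy_decrypt(token: str, key: bytes = KEY) -> str:
    raw = base64.b64decode(token)
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(raw)).decode()

# Only the sensitive column is encrypted before the record is persisted.
record = {"name": "Ada", "card": toy_encrypt("4111-1111")}
```

The database (or an attacker reading its files) sees only the opaque token in the `card` column; the application decrypts it for privileged users.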