Hadoop is a distributed system comprising many components: multiple nodes, different services, and user applications. Security in this environment is multi-faceted and requires authentication between the different components as they communicate with each other. There are back-end components such as Hadoop services and applications launched by users, and front-end components such as web services and web-based management consoles. All of these have different security requirements. Then there is the data itself that needs to be protected.
To function as a system, all these distributed components communicate with each other. Users, and the applications they launch, interact with Hadoop services to gain access to compute and storage resources and use them. Many internal interactions also occur between the Hadoop services themselves to manage cluster resources. The core services used by applications are the Resource Manager and Node Manager, together known as YARN, and the Name Node and Data Node, which make up HDFS. Back-end interactions between core services, or between applications and the core services, typically occur over RPC. Front-end interactions are typically used for management-type operations and are web-service based. Both of these need to be protected.
In Hadoop, when security (also called “secure mode”) is enabled, a combination of Kerberos and delegation tokens is used for authentication between back-end components. Kerberos is a mutual, distributed authentication mechanism that authenticates users to services and services to each other, serving as a gatekeeper for access to services. Kerberos authentication requires Kerberos credentials from the participating parties; for clients that don’t have Kerberos credentials, delegation token authentication is used instead. Once authentication succeeds, requests are processed by the service. To protect the data flowing between services, SSL encryption can be used. The responsibility falls on applications and application platforms to interact with Kerberos-secured back-end components in secure mode. They also need to support delegation tokens where Kerberos authentication is not possible. More information about Hadoop’s security can be found in the Hadoop security design document.
Hadoop supports a few different mechanisms for front-end authentication over HTTP/HTTPS: basic digest authentication and SPNEGO, which is Kerberos over HTTP. Other authentication mechanisms, such as hybrid Kerberos or custom mechanisms, can be plugged in as long as they implement a specific interface. Applications making front-end web service requests to Hadoop should support authentication using SPNEGO.
Complementarily, applications may also need to authenticate users accessing their own portals. In a secure Hadoop environment, users may already be using Kerberos SPNEGO to authenticate with Hadoop front-end services, so applications should support this mode of authentication on their portals as well. However, they may also need to support other common front-end authentication mechanisms, such as LDAP or PAM, to integrate with authentication infrastructure already in use in the enterprise.
In a multi-user environment, authentication alone isn’t sufficient: there are many users on the system, each allowed or disallowed to do certain things. An authorization mechanism is needed to complement authentication. A comprehensive security solution should also support Role Based Access Control, or RBAC for short, making it possible to specify what each user can and can’t do with fine-grained permission control. Some examples of these permissions: which users can launch applications, which users can see other users’ applications and which can’t, and which users are administrators with access to all functionality. Traditionally, Hadoop has provided limited authorization support, with only broad service-level access controls such as who can communicate with HDFS services, YARN services, etc., as described here. The responsibility for providing RBAC therefore falls on the application platform, and a comprehensive solution should come with a built-in RBAC implementation.
So the security demands on applications running on Hadoop are numerous and varied. A comprehensively designed security solution for an application or an application platform should support all these back-end and front-end interactions.
Apache Apex is an open source application platform and runtime engine that lets users develop and run high performance distributed applications capable of processing big data in a Hadoop YARN environment. DataTorrent RTS is an enterprise software product offering on top of Apex that provides many productivity tools and ease of use features. These include graphical application assembly, application UI dashboards, and an operational and management UI console. The software suite works seamlessly with Kerberos in secure Hadoop. Just a few configuration parameters such as Kerberos credentials are needed; no special setup or user involvement is required. In the rest of the discussion we will look at how the different pieces of the software work in a secure environment and how they interact with Hadoop and with each other. This should serve as a blueprint for applications striving to support and run on secure Hadoop.
Different front-end authentication mechanisms for users, along with RBAC, are supported out of the box as well, and one can be selected simply by specifying it in the configuration. Plugging in custom authentication mechanisms is also supported via a JAAS-compliant plugin. We will not cover the front-end mechanisms in this discussion, as they are a substantial topic in their own right and deserve their own document; they will be discussed in a companion document in the future.
In this section we will see how security works for applications built on Apache Apex and DataTorrent RTS. We will look at the different ways applications are run, and in each case examine the components involved, their architecture, and the security mechanisms in play.
To launch applications in Apache Apex, the command line client dtcli can be used. The application artifacts, such as binaries and properties, are supplied as an application package. During the various steps involved in launching the application, the client needs to communicate with both the Resource Manager and the Name Node. The Resource Manager communication involves the client requesting new resources to run the application master and start the application launch process; the steps, along with sample Java code, are described in Writing YARN Applications. The Name Node communication involves copying the application artifacts to HDFS so that they are available across the cluster for launching the different application containers.
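The client-side launch flow can be sketched with the standard YARN client API. This is a simplified, hypothetical example of the steps a launch client performs; dtcli does this internally along with Apex-specific setup, and the file names and paths here are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        // Name Node communication: copy application artifacts to HDFS so
        // containers on any node in the cluster can localize them.
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("myapp.apa"),           // placeholder
                new Path("/user/me/apps/myapp.apa"));         // placeholder

        // Resource Manager communication: create a new application and
        // submit the request to start the application master.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-apex-app");
        // ... set the ContainerLaunchContext, resource requirements, etc. ...
        yarnClient.submitApplication(appContext);
    }
}
```

In a secure cluster these calls only succeed after Kerberos authentication, which the Hadoop client libraries perform transparently once the process has logged in with valid credentials.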
In secure mode, communication with both the Resource Manager and the Name Node requires authentication, and the mechanism is Kerberos. Below is an illustration showing this.
The client dtcli supports Kerberos authentication and will automatically enable it in a secure environment. To authenticate, the client needs some Kerberos configuration, namely the Kerberos credentials. There are two parameters: the Kerberos principal and the keytab to use for the client. These can be specified in the dt-site.xml configuration file. The properties are shown below.
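A sketch of what these entries look like in dt-site.xml. The property names `dt.authentication.principal` and `dt.authentication.keytab` follow the Apex security documentation, but should be verified against the CLI Configuration section referenced below; the principal and keytab values are placeholders:

```xml
<property>
  <name>dt.authentication.principal</name>
  <value>appuser/host@EXAMPLE.COM</value>
</property>
<property>
  <name>dt.authentication.keytab</name>
  <value>/path/to/appuser.keytab</value>
</property>
```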
Refer to the Operation and Installation Guide, section Multi Tenancy and Security, subsection CLI Configuration, in the documentation for more information. The document can also be accessed here: client configuration.
There is another important function performed by the client: retrieving what are called delegation tokens from the Resource Manager and Name Node to seed the application master container that is to be launched. This is detailed in the next section.
When the application is up and running, its different components run as separate processes, possibly on different nodes in the cluster, since it is a distributed application. These components interact with each other and with the Hadoop services. In secure mode, all these interactions have to be authenticated before they can be successfully processed. The interactions are illustrated in the diagram below to give a complete overview; each is explained in subsequent sections.
Every Apache Apex application has a master process, akin to any YARN application. In our case it is called STRAM (Streaming Application Master). It runs in its own container and manages the different distributed components of the application. Among other tasks, it requests new resources from the Resource Manager as they are needed and gives back resources that are no longer needed. STRAM also needs to communicate with the Name Node from time to time to access the persistent HDFS file system.
In secure mode, STRAM has to authenticate with both the Resource Manager and the Name Node before it can send any requests, and this is achieved using delegation tokens. Since STRAM runs as a managed application master, it runs in a Hadoop container, which could have been allocated on any node based on what resources were available. Since there is no fixed node where STRAM runs, it does not have Kerberos credentials, and hence, unlike the launch client dtcli, it cannot authenticate with the Resource Manager and Name Node using Kerberos. Instead, delegation tokens are used for authentication.
Delegation tokens are tokens dynamically issued by a service; clients use them to authenticate with that service. The service stores the delegation tokens it has issued in a cache and checks the delegation token sent by a client against the cache. If a match is found the authentication succeeds, otherwise it fails. This is the second mode of authentication in secure Hadoop, after Kerberos. More details can be found in the Hadoop security design document. In this case the delegation tokens are issued by the Resource Manager and the Name Node, and STRAM uses these tokens to authenticate with them. But how does it get them in the first place? This is where the launch client dtcli comes in.
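The cache-and-match check described above can be illustrated with a minimal sketch. This is purely illustrative: real Hadoop delegation tokens carry an identifier, a password derived from a secret key, an expiry time, and a renewer, rather than a plain string:

```java
import java.util.HashSet;
import java.util.Set;

public class DelegationTokenSketch {
    // Tokens the service has issued and not yet cancelled or expired.
    private final Set<String> issuedTokens = new HashSet<>();

    // Issue a new token and remember it in the cache.
    public String issueToken(String owner, long sequenceNumber) {
        String token = owner + ":" + sequenceNumber;
        issuedTokens.add(token);
        return token;
    }

    // A client request authenticates only if its token matches the cache.
    public boolean authenticate(String token) {
        return issuedTokens.contains(token);
    }

    public static void main(String[] args) {
        DelegationTokenSketch service = new DelegationTokenSketch();
        String token = service.issueToken("stram", 1);
        System.out.println(service.authenticate(token));        // true
        System.out.println(service.authenticate("forged:99"));  // false
    }
}
```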
Since the client dtcli possesses Kerberos credentials, as explained in the Application Launch section, it is able to authenticate with the Resource Manager and Name Node using Kerberos. It then requests delegation tokens over the Kerberos-authenticated connection, and the servers return the delegation tokens in the response payload. When the client asks the Resource Manager to start the application master container for STRAM, it seeds the container with these tokens, so that when STRAM starts it already has them and can use them to authenticate with the Hadoop services.
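The token retrieval and seeding steps can be sketched with the standard Hadoop APIs. This is a simplified, hypothetical example (the renewer principal is a placeholder, and the RM delegation token step is only indicated in a comment):

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class TokenSeedingSketch {
    // Collect delegation tokens and attach them to the application master's
    // container launch context, so STRAM starts with the tokens it needs.
    static ContainerLaunchContext seedTokens(Configuration conf, String renewer)
            throws Exception {
        Credentials credentials = new Credentials();

        // Over the Kerberos-authenticated connection, ask the Name Node for
        // HDFS delegation tokens; they are added to the credentials object.
        FileSystem fs = FileSystem.get(conf);
        fs.addDelegationTokens(renewer, credentials);

        // The Resource Manager delegation token is obtained similarly,
        // e.g. via YarnClient.getRMDelegationToken(...).

        // Serialize the credentials and set them on the launch context that
        // will be submitted to the Resource Manager.
        DataOutputBuffer dob = new DataOutputBuffer();
        credentials.writeTokenStorageToStream(dob);
        ByteBuffer tokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());

        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setTokens(tokens);
        return amContainer;
    }
}
```

When YARN starts the container, these serialized credentials are made available to the process, which is how STRAM finds its tokens on startup.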
A streaming container is a process that runs a part of the application's business logic. It is deployed in a container on a node in the cluster. The unit of business logic is implemented in what we call an operator. Multiple operators connected together make up the complete application, and hence there are multiple streaming containers in an application. The streaming containers have different types of communication going on, as illustrated in the diagram above and described below.
The streaming containers periodically communicate with the application master, STRAM. They send what are called heartbeats, carrying information such as statistics, and receive commands from STRAM, such as deployment or un-deployment of operators, changes to operator properties, etc. In secure mode this communication cannot occur without authentication. To facilitate it, special tokens called STRAM delegation tokens are used. These tokens are created and managed by STRAM. When a new streaming container is started, STRAM, being the one negotiating resources from the Resource Manager for the container and requesting its start, seeds it with the STRAM delegation token necessary to communicate with it. Thus, a streaming container has the STRAM delegation token it needs to successfully authenticate and communicate with STRAM.
As mentioned earlier, an operator implements a piece of the business logic of the application, and multiple operators together complete the application. In creating the application, the operators are assembled into a directed acyclic graph, a pipeline, with the output of some operators becoming the input of others. At runtime the streaming containers hosting the operators are connected to each other and send data to each other. In secure mode these connections should be authenticated too, even more importantly than the others, as they transfer application data.
When operators are running, there will be differences in effective processing rates between them, due to intrinsic reasons such as operator logic, or external reasons such as differing availability of CPU, memory, or network bandwidth, since the operators run in different containers. To maximize performance and utilization, the data flow is handled asynchronously from the regular operator function, and a buffer is used to stage the data being produced by an operator. This buffered data is served by a buffer server over the network connection to the downstream streaming container containing the operator that is to receive the data. This connection is secured by a token called the buffer server token. These tokens are also generated and seeded by STRAM when the streaming containers are deployed and started, and STRAM uses a different token for each buffer server for better security.
Like STRAM, streaming containers also need to communicate with the Name Node to use HDFS persistence, for example to save the state of their operators. In secure mode they too use Name Node delegation tokens for authentication, and these tokens are also seeded by STRAM for the streaming containers.
DataTorrent RTS provides a graphical UI management console called dtmanage for managing Apex applications. This includes functionality such as importing applications, starting them, monitoring all aspects of a running application, such as individual operator performance, and shutting applications down.
The UI console is powered by a back-end service called Gateway. This service is installed as part of the DataTorrent installation on an edge node or a node in the Hadoop cluster. The UI console operated by the user communicates with the Gateway service using REST-based web service calls, and the Gateway service in turn communicates with the STRAMs of the individual applications using REST web services.
Gateway interacts with different Hadoop services and application components; these interactions are illustrated in the diagram below.
Gateway communicates with the Hadoop services Resource Manager and Name Node much like the command line client described earlier. In secure mode it authenticates with these services using Kerberos credentials. Since Gateway runs on a fixed node, it can have Kerberos credentials, namely a Kerberos principal and keytab. These can be configured in the dt-site.xml configuration file as follows.
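A sketch of the configuration, assuming the same property scheme as the CLI configuration; the exact property names should be checked against the DT Gateway Configuration section referenced below, and the values are placeholders:

```xml
<property>
  <name>dt.authentication.principal</name>
  <value>dtadmin/host@EXAMPLE.COM</value>
</property>
<property>
  <name>dt.authentication.keytab</name>
  <value>/path/to/dtadmin.keytab</value>
</property>
```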
Refer to the Operation and Installation Guide in the DataTorrent RTS documentation, section Multi Tenancy and Security, subsection DT Gateway Configuration, for more information. The document can also be accessed here: gateway configuration.
Gateway communicates with the individual application STRAMs to manage the applications and to provide operational information about them to the UI console for the user. There are two kinds of communication: the first is an indirect request via a proxy service run by Hadoop, called the Resource Manager Web Services Proxy, and the second is direct requests to the individual STRAMs.
The request to STRAM via the Web Services Proxy happens before any direct communication between Gateway and STRAM. It is made as part of application discovery and is also used to establish some security credentials; this form of indirect request is infrequent. In secure mode, an administrator can additionally enable Kerberos over HTTP for the Web Services Proxy in Hadoop. In this case Gateway automatically detects it and authenticates with the Proxy using SPNEGO, which is Kerberos over HTTP. Once successful, the request is forwarded to STRAM and the response is sent back to Gateway.
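A SPNEGO-authenticated HTTP request of this kind can be sketched with Hadoop's hadoop-auth client library. This is a simplified, hypothetical example: the proxy URL and application id are placeholders, and a valid Kerberos ticket must already be present in the process's login context:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;

public class SpnegoSketch {
    public static void main(String[] args) throws Exception {
        // The token is populated during the SPNEGO handshake and can be
        // reused for subsequent requests to the same service.
        AuthenticatedURL.Token token = new AuthenticatedURL.Token();

        // Placeholder URL for a request routed through the RM Web Services Proxy.
        URL url = new URL(
            "http://rm-proxy.example.com:8088/proxy/application_1234_0001/ws/v2/stram/info");

        // openConnection performs SPNEGO (Kerberos over HTTP) using the
        // Kerberos credentials of the current login context.
        HttpURLConnection conn = new AuthenticatedURL().openConnection(url, token);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```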
In the second form of communication, Gateway connects directly to STRAM to make requests. In secure mode we want this connection to be authenticated as well. In this type of connection, Gateway passes a web service token in the request, and STRAM checks this token: if the token is valid the request is processed, otherwise it is denied.
How does Gateway get the web service token in the first place? This happens as part of the initial communication via the Web Services Proxy. In that connection Gateway has already authenticated with the Proxy using SPNEGO, and when STRAM receives that initial request it generates and sends back a web service token, similar to a delegation token. This token is then used by Gateway in subsequent requests it makes directly to STRAM, and STRAM is able to validate it since it generated the token in the first place.
We looked at the different security requirements for distributed applications running in a secure Hadoop environment, and at how Apex and DataTorrent RTS address them. We focused mainly on back-end authentication mechanisms, touching only briefly on front-end authentication when discussing how Gateway interacts with the Hadoop front-end services using SPNEGO. There are additional front-end security considerations, as described in the introduction, such as authenticating the user to the UI console and RBAC for users. These are beyond the purview of this blog and will be covered in companion blogs.
DataTorrent helps our customers get into production quickly using open source technologies to handle real-time, data-in-motion, fast, big data for critical business outcomes. Using the best-of-breed solutions in open source, we harden the open source applications to make them act and look as one through our unique IP that we call Apoxi™. Apoxi is a framework that gives you the power to manage those open source applications reliably, just like you do with your proprietary enterprise software applications, lowering the time to value (TTV) with a lower total cost of ownership (TCO). To get started, you can download DataTorrent RTS or micro data services from the DataTorrent AppFactory.