Cloudera Enterprise 6.3.x

        • Accessing HBase by using the HBase Shell
        • HBase Online Merge
        • Using MapReduce with HBase
        • Configuring HBase Garbage Collection
        • Configuring the HBase Canary
        • Configuring the Blocksize for HBase
        • Configuring the HBase BlockCache
        • Configuring Quotas
        • Configuring the HBase Scanner Heartbeat
        • Limiting the Speed of Compactions
        • Configuring and Using the HBase REST API
        • Configuring HBase MultiWAL Support
        • Storing Medium Objects (MOBs) in HBase
        • Configuring the Storage Policy for the Write-Ahead Log (WAL)
      • Using & Managing
        • Starting and Stopping HBase
        • Accessing HBase by using the HBase Shell
        • Using HBase Command-Line Utilities
        • Using the HBCK2 Tool to Remediate HBase Clusters
        • Hedged Reads
        • Reading Data from HBase
        • HBase Filtering
        • Writing Data to HBase
        • Importing Data Into HBase
        • Exposing HBase Metrics to a Ganglia Server
        • Using HashTable and SyncTable Tool
      • Security
      • Troubleshooting
    • Hive
      • Installation and Upgrade
      • Configuring
        • Configuring HiveServer2
        • File System Permissions
        • Starting, Stopping, & Using HS2
        • Using Hive w/HBase
        • Installing JDBC/ODBC Drivers
        • Setting HADOOP_MAPRED_HOME
      • Using & Managing
        • Managing Hive with Cloudera Manager
        • Ingesting & Querying Data
        • Using Parquet Tables
        • Running Hive on Spark
        • Using HS2 Web UI
        • Using Query Plan Graph View
        • Accessing Table Statistics
        • Managing UDFs
        • Hive ETL Jobs on S3
        • Hive with ADLS
        • Erasure Coding with Hive
        • Removing the Hive Compilation Lock
        • Sqoop HS2 Import
      • Tuning
        • Tuning Hive on Spark
        • Tuning Hive on S3
        • Configuring HS2 HA
        • Enabling Query Vectorization
      • Hive Metastore (HMS)
        • Configuring
          • Configuring HMS
          • Configuring HMS HA
          • Configuring HMS for HDFS HA
          • Configuring Shared Amazon RDS as HMS
        • Using & Managing
          • Starting the Metastore
          • Using Metastore Schema Tool
      • Data Replication
      • Security
      • HCatalog
        • HCatalog Prerequisites
        • Configuration Change on Hosts Used with HCatalog
        • Accessing Table Information with the HCatalog Command-line API
        • Accessing Table Data with MapReduce
        • Accessing Table Data with Pig
        • Accessing Table Information with REST
        • Viewing the HCatalog Documentation
      • Troubleshooting
    • Hue
      • Hue Versions
      • Reference Architecture
      • Installation & Upgrade
      • Using
        • Enable SQL Editor Autocompleter
        • Use Governance-Based Data Discovery
        • Use S3 as Source or Sink in Hue
      • Administration
        • Configuring
        • Customize Hue Web UI
        • Enable Governance-Based Data Discovery
        • Enable S3 Cloud Storage
        • Run Shell Commands
        • Connecting a Database
          • Connect to MySQL or MariaDB
          • Connect to PostgreSQL
          • Connect to Oracle (Parcel)
          • Connect to Oracle (Package)
          • Custom Database Tutorial
        • Migrate the Database
        • Populate the Database
      • Performance Tuning
        • Add Load Balancer
        • Configure High Availability
        • Hue/HDFS High Availability
      • Security
        • User Permissions
        • Create Password Scripts
        • Authenticate Users with LDAP
        • Synchronize with LDAP Server
        • Authenticate Users with SAML
        • Authorize Groups with Sentry
      • Troubleshooting
        • Potential Misconfiguration
        • Unable to connect to database with provided credential
        • Unable to view Snappy-compressed files
        • “Unknown Attribute Name” exception while enabling SAML
        • Invalid query handle
        • Services backed by Postgres fail or hang
        • Downloading query results from Hue takes long time
        • Error validating LDAP user in Hue
        • 502 Proxy Error while accessing Hue from the Load Balancer
        • Hue Load Balancer does not start after enabling TLS
        • Unable to kill Hive queries from Job Browser
        • 1040, 'Too many connections' exception
        • Unable to connect Oracle database to Hue using SCAN
        • Increasing the maximum number of processes for Oracle database
        • Unable to authenticate to Hbase when using Hue
    • Impala
      • Concepts and Architecture
        • Components
        • Developing Applications
        • Role in the Hadoop Ecosystem
      • Deployment Planning
        • Impala Requirements
        • Designing Schemas
      • Tutorials
      • Administration
        • Setting Timeouts
        • Load-Balancing Proxy for HA
        • Managing Disk Space
        • Auditing
        • Viewing Lineage Info
      • SQL Reference
        • Comments
        • Data Types
          • ARRAY Complex Type (CDH 5.5 or higher only)
          • BIGINT
          • BOOLEAN
          • CHAR
          • DECIMAL
          • DOUBLE
          • FLOAT
          • INT
          • MAP Complex Type (CDH 5.5 or higher only)
          • REAL
          • SMALLINT
          • STRING
          • STRUCT Complex Type (CDH 5.5 or higher only)
          • TIMESTAMP
            • Customizing Time Zones
          • TINYINT
          • VARCHAR
          • Complex Types (CDH 5.5 or higher only)
        • Literals
        • SQL Operators
        • Schema Objects and Object Names
          • Aliases
          • Databases
          • Functions
          • Identifiers
          • Tables
          • Views
        • SQL Statements
          • DDL Statements
          • DML Statements
          • ALTER DATABASE
          • ALTER TABLE
          • ALTER VIEW
          • COMMENT
          • COMPUTE STATS
          • CREATE DATABASE
          • CREATE FUNCTION
          • CREATE ROLE
          • CREATE TABLE
          • CREATE VIEW
          • DELETE
          • DESCRIBE
          • DROP DATABASE
          • DROP FUNCTION
          • DROP ROLE
          • DROP STATS
          • DROP TABLE
          • DROP VIEW
          • EXPLAIN
          • GRANT
          • INSERT
          • INVALIDATE METADATA
          • LOAD DATA
          • REFRESH
          • REFRESH AUTHORIZATION
          • REFRESH FUNCTIONS
          • REVOKE
          • SELECT
            • Joins
            • ORDER BY Clause
            • GROUP BY Clause
            • HAVING Clause
            • LIMIT Clause
            • OFFSET Clause
            • UNION Clause
            • Subqueries
            • TABLESAMPLE Clause
            • WITH Clause
            • DISTINCT Operator
          • SET
            • Query Options for the SET Statement
              • ABORT_ON_ERROR
              • ALLOW_ERASURE_CODED_FILES
              • ALLOW_UNSUPPORTED_FORMATS
              • APPX_COUNT_DISTINCT
              • BATCH_SIZE
              • BUFFER_POOL_LIMIT
              • COMPRESSION_CODEC
              • COMPUTE_STATS_MIN_SAMPLE_SIZE
              • DEBUG_ACTION
              • DECIMAL_V2
              • DEFAULT_JOIN_DISTRIBUTION_MODE
              • DEFAULT_SPILLABLE_BUFFER_SIZE
              • DISABLE_CODEGEN
              • DISABLE_CODEGEN_ROWS_THRESHOLD
              • DISABLE_ROW_RUNTIME_FILTERING
              • DISABLE_STREAMING_PREAGGREGATIONS
              • DISABLE_UNSAFE_SPILLS
              • ENABLE_EXPR_REWRITES
              • EXEC_SINGLE_NODE_ROWS_THRESHOLD
              • EXEC_TIME_LIMIT_S
              • EXPLAIN_LEVEL
              • HBASE_CACHE_BLOCKS
              • HBASE_CACHING
              • IDLE_SESSION_TIMEOUT
              • KUDU_READ_MODE
              • LIVE_PROGRESS
              • LIVE_SUMMARY
              • MAX_ERRORS
              • MAX_MEM_ESTIMATE_FOR_ADMISSION
              • MAX_NUM_RUNTIME_FILTERS
              • MAX_ROW_SIZE
              • MAX_SCAN_RANGE_LENGTH
              • MEM_LIMIT
              • MIN_SPILLABLE_BUFFER_SIZE
              • MT_DOP
              • NUM_NODES
              • NUM_ROWS_PRODUCED_LIMIT
              • NUM_SCANNER_THREADS
              • OPTIMIZE_PARTITION_KEY_SCANS
              • PARQUET_COMPRESSION_CODEC
              • PARQUET_ANNOTATE_STRINGS_UTF8
              • PARQUET_ARRAY_RESOLUTION
              • PARQUET_DICTIONARY_FILTERING
              • PARQUET_FALLBACK_SCHEMA_RESOLUTION
              • PARQUET_FILE_SIZE
              • PARQUET_READ_STATISTICS
              • PREFETCH_MODE
              • QUERY_TIMEOUT_S
              • REPLICA_PREFERENCE
              • REQUEST_POOL
              • RESOURCE_TRACE_RATIO
              • RUNTIME_BLOOM_FILTER_SIZE
              • RUNTIME_FILTER_MAX_SIZE
              • RUNTIME_FILTER_MIN_SIZE
              • RUNTIME_FILTER_MODE
              • RUNTIME_FILTER_WAIT_TIME_MS
              • S3_SKIP_INSERT_STAGING
              • SCAN_BYTES_LIMIT
              • SCHEDULE_RANDOM_REPLICA
              • SCRATCH_LIMIT
              • SHUFFLE_DISTINCT_EXPRS
              • SUPPORT_START_OVER
              • SYNC_DDL
              • THREAD_RESERVATION_AGGREGATE_LIMIT
              • THREAD_RESERVATION_LIMIT
              • TIMEZONE
              • TOPN_BYTES_LIMIT
          • SHOW
          • SHUTDOWN
          • TRUNCATE TABLE
          • UPDATE
          • UPSERT
          • USE
          • VALUES
          • Optimizer Hints
        • Built-In Functions
          • Mathematical Functions
          • Bit Functions
          • Type Conversion Functions
          • Date and Time Functions
          • Conditional Functions
          • String Functions
          • Miscellaneous Functions
          • Aggregate Functions
            • APPX_MEDIAN
            • AVG
            • COUNT
            • GROUP_CONCAT
            • MAX
            • MIN
            • NDV
            • STDDEV, STDDEV_SAMP, STDDEV_POP
            • SUM
            • VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP
          • Analytic Functions
        • User-Defined Functions (UDFs)
        • SQL Differences Between Impala and Hive
        • Porting SQL
      • Resource Management
        • Admission Control and Query Queuing
        • Configuring Resource Pools and Admission Control
        • Admission Control Sample Scenario
      • Performance Tuning
        • Performance Best Practices
        • Join Performance
        • Table and Column Statistics
        • Benchmarking
        • Controlling Resource Usage
        • Runtime Filtering
        • HDFS Caching
        • HDFS Block Skew
        • Data Cache for Remote Reads
        • Testing Impala Performance
        • EXPLAIN Plans and Query Profiles
      • Scalability Considerations
        • Scaling Limits and Guidelines
        • Dedicated Coordinators
        • Metadata Management
      • Partitioning
      • File Formats
        • Text Data Files
        • Parquet Data Files
        • ORC Data Files
        • Avro Data Files
        • RCFile Data Files
        • SequenceFile Data Files
      • Using Impala to Query Kudu Tables
      • HBase Tables
      • S3 Tables
        • Configure with Cloudera Manager
        • Configure from Command Line
      • ADLS Tables
      • Logging
      • Impala Client Access
        • The Impala Shell
          • Configuration Options
          • Connecting to impalad
          • Running Commands and SQL Statements
          • Command Reference
        • Configuring Impala to Work with ODBC
        • Configuring Impala to Work with JDBC
      • Troubleshooting Impala
        • Web User Interface
        • Breakpad Minidumps
      • Ports Used by Impala
      • Impala Reserved Words
      • Impala Frequently Asked Questions
    • Kafka
      • Setup
      • Cloudera Manager
      • Clients
      • Brokers
      • Integration
        • Security
        • Managing Multiple Kafka Versions
        • Managing Topics across Multiple Kafka Clusters
        • Setting up an End-to-End Data Streaming Pipeline
        • Developing Kafka Clients
        • Metrics
      • Administration
        • Administration Basics
        • Broker Migration
        • User Limits for Kafka
        • Quotas
        • Kafka Command Line Tools
        • Disk Management
        • JBOD
          • Setup and Migration
        • Delegation Tokens
          • Enable Delegation Tokens
          • Managing Individual Delegation Tokens
          • Rotating the Master Key/Secret
          • Client Authentication
          • Kafka Security Hardening with Zookeeper ACLs
        • Kafka Streams
      • Performance Tuning
        • Handling Large Messages
        • Cluster Sizing
        • Broker Configuration
        • System-Level Broker Tuning
        • Kafka-ZooKeeper Performance Tuning
      • Reference
        • Metrics Reference
        • Useful Shell Command Reference
      • Kafka Public APIs
      • FAQ
    • Kudu
      • Concepts and Architecture
      • Usage Limitations
      • Installation and Upgrade
      • Configuration
      • Administration
      • Developing Applications with Kudu
      • Using Apache Impala with Kudu
      • Using the Hive Metastore with Kudu
      • Schema Design
      • Transaction Semantics
      • Background Tasks
      • Scaling Guide
      • Troubleshooting
      • More Resources
    • Oozie
      • Configuration
        • Configuring an External Database for Oozie
        • Oozie High Availability
        • Configuring Oozie to Use HDFS HA
        • Oozie Authentication
        • Using Sqoop Actions with Oozie
        • Configuring Oozie to Enable MapReduce Jobs To Read/Write from Amazon S3
        • Configuring Oozie to Enable MapReduce Jobs To Read/Write from Microsoft Azure (ADLS)
      • Oozie
        • Starting, Stopping, and Accessing the Oozie Server
        • Adding the Oozie Service Using Cloudera Manager
        • Redeploying the Oozie ShareLib
        • Configuring Oozie Data Purge Settings Using Cloudera Manager
        • Dumping and Loading an Oozie Database Using Cloudera Manager
        • Adding Schema to Oozie Using Cloudera Manager
        • Enabling the Oozie Web Console on Managed Clusters
        • Enabling Oozie SLA with Cloudera Manager
        • Setting the Oozie Database Timezone
        • Scheduling in Oozie Using Cron-like Syntax
    • Phoenix
      • Release Notes
      • Prerequisites
      • Installing Apache Phoenix using Cloudera Manager
      • Using Apache Phoenix to Store and Access Data
        • Orchestrating SQL and APIs with Apache Phoenix
        • Configuring Phoenix Query Server
          • Connecting to PQS
        • Creating and Using User-Defined Functions (UDFs) in Phoenix
        • Mapping Phoenix Schemas to HBase Namespaces
        • Associating Tables of a Schema to a Namespace
        • Using Phoenix Client to Load Data
        • Using the Index in Phoenix
      • Understanding Apache Phoenix-Spark Connector
      • Understanding Apache Phoenix-Hive Connector
      • Performance Tuning
      • Frequently Asked Questions
      • Uninstalling Phoenix Parcel
    • Search
      • Search
        • Understanding
        • Search and Other CDH Components
        • Architecture
        • Tasks and Processes
      • Tutorial
        • Validating Search Deployment
        • Preparing to Index Sample Tweets
        • Using MapReduce Batch Indexing to Index Sample Tweets
        • Near Real Time (NRT) Indexing Tweets Using Flume
        • Using Hue with Search
      • Deployment Planning
        • Schemaless Mode
      • Deploying
        • Using Search through a Proxy for High Availability
        • Using Custom JAR Files with Search
        • Cloudera Search Security
          • Enable Kerberos Authentication in Cloudera Search
      • Managing
        • Configuration
        • Collections
        • solrctl Reference
        • Example solrctl Usage
        • Migrating Solr Replicas
        • Backing Up and Restoring
      • ETL with Cloudera Morphlines
        • Example Morphline Usage
      • Indexing Data
        • Near Real Time Indexing
          • Flume NRT Indexing
            • Flume MorphlineSolrSink Configuration Options
            • Flume MorphlineInterceptor Configuration Options
            • Flume Solr UUIDInterceptor Configuration Options
            • Flume Solr BlobHandler Configuration Options
            • Flume Solr BlobDeserializer Configuration Options
          • Lily HBase NRT Indexing
            • Using the Lily HBase NRT Indexer Service
            • Configuring Lily HBase Indexer Security
        • Batch Indexing
          • Spark Indexing
          • MapReduce Indexing
            • MapReduceIndexerTool
            • Lily HBase Batch Indexing
      • FAQ
      • Troubleshooting
        • Configuration and Log Files
        • Identifying Problems
        • Solr Query Returns no Documents when Executed with a Non-Privileged User
    • Sentry
      • Before You Install Sentry
      • Installing and Upgrading the Sentry Service
      • Configuring
        • Sentry High Availability
        • Enabling Sentry Authorization for Impala
        • Configuring Sentry Authorization for Cloudera Search
      • Using & Managing
        • Synchronizing HDFS ACLs and Sentry Permissions
        • Authorization Privilege Model for Hive and Impala
        • Authorization Privilege Model for Cloudera Search
        • Hive SQL Syntax for Use with Sentry
        • Object Ownership
        • Using the Sentry Web Server
        • Sentry Debugging and Failure Scenarios
      • Troubleshooting
      • How-To Guides
        • Enabling High Availability
        • Verify HDFS ACL Sync
        • Managing Table Access in Hue
    • Spark
      • Running Your First Spark Application
      • Troubleshooting for Spark
      • Frequently Asked Questions about Apache Spark in CDH
      • Spark Application Overview
      • Developing Spark Applications
        • Developing and Running a Spark WordCount Application
        • Using Spark Streaming
        • Using Spark SQL
        • Using Spark MLlib
        • Accessing External Storage
          • Accessing Data Stored in Amazon S3 through Spark
          • Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark
          • Accessing Avro Data Files From Spark SQL Applications
          • Accessing Parquet Files From Spark SQL Applications
        • Building Spark Applications
        • Configuring Spark Applications
      • Running Spark Applications
        • Running Spark Applications on YARN
        • Using PySpark
          • Running Spark Python Applications
          • Spark and IPython and Jupyter Notebooks
        • Tuning Spark Applications
      • Spark and Hadoop Integration
        • Building and Running a Crunch Application with Spark
    • File Formats and Compression
      • Parquet
        • Predicate Pushdown in Parquet
      • Avro
      • Data Compression
      • Snappy Compression
  • Glossary

Configuring Extraction for Amazon S3

Depending on the specifics of the Amazon S3 bucket targeted for extraction by Cloudera Navigator, the configuration process follows one of two alternative paths:
  • Default configuration: The default configuration is available for Amazon S3 buckets that have no existing Amazon SQS or Amazon SNS services configured. During configuration, Cloudera Navigator accesses the configured AWS account, performs an initial bulk extract from the Amazon S3 bucket, sets up Amazon SQS queues in each region with buckets, and sets up event notifications for each bucket for subsequent incremental extracts. All of this is transparent to the Cloudera Manager administrator handling the configuration process.
  • Custom configuration: Custom configuration is required for any Amazon S3 bucket that is currently using Amazon SQS (for example, a bucket that has queues set up for other applications) or is set up for notifications using Amazon SNS. In these cases you must manually configure a new queue (bring your own queue) and, in some cases, also configure Amazon SNS for fanout.
Continue reading:
  • AWS Credentials Requirements
  • Default Configuration
  • Custom Configurations
    • Configuring Your Own Queues
      • Configure the Queue for Cloudera Navigator
      • Configure Event Notification for the Queues
      • Configuring Amazon SNS Fan-out
    • Defining and Attaching Policies
      • Event Notification Policy for Custom Queues
      • Extraction Policies for Custom Queues
      • Extraction Policies JSON Reference
    • Setting Properties with Advanced Configuration Snippets
  • Cloudera Navigator Extraction Behavior for Amazon S3
  • Cloudera Navigator Properties for Amazon S3

AWS Credentials Requirements

Cloudera Manager can have multiple AWS credentials configured for various purposes at any given time. These are listed by name on the AWS Credentials page which is accessible from the Cloudera Manager Admin Console, under the Administration menu. However, there are specific constraints on AWS credentials for Cloudera Navigator as follows:
  • Navigator supports a single key for authentication; only one AWS credential can be used for a given Navigator instance. Navigator can extract metadata from any number of S3 buckets, assuming the buckets can be accessed with the configured credential.
  • An AWS credential configured for connectivity from one Cloudera Navigator instance cannot be used by another Cloudera Navigator instance. Configuring the same AWS credentials for use with different Cloudera Navigator instances can result in unpredictable behavior.
  • Cloudera Navigator requires an AWS credential associated with an IAM user identity rather than an IAM role.
  • Any changes to the AWS credentials (for example, if you rotate credentials on a regular basis) must be for the same AWS account (IAM user). Changing the AWS credentials to those of a different IAM user results in errors from the Amazon Simple Queue Service (used transparently by Cloudera Navigator). If a new key is provided to Cloudera Navigator, the key must belong to the same AWS account as the prior key.
  • For the default configuration, the account for this AWS credential must have administrator privileges for:
    • Amazon S3
    • Amazon Simple Queue Service (SQS)
  • For the custom configurations, the account needs privileges for Amazon S3, Amazon SQS, and for Amazon Simple Notification Service (SNS).

Default Configuration

The steps below assume that you have the required AWS credentials for the Amazon Web Services (AWS) account (an IAM user account) associated with the Amazon S3 bucket, and that you can use the AWS Management Console. The AWS credentials for the IAM user are configured for Cloudera Navigator using the Cloudera Manager Admin Console during the configuration process below.

Important: If the Amazon S3 bucket is already configured for queuing or notification, do not follow the steps in this section. See Custom Configurations instead.

Configuring Cloudera Navigator for Amazon S3

At the end of the configuration process detailed below, Cloudera Navigator authenticates to AWS using the credentials and performs an initial bulk extract of entities from the Amazon S3 bucket. It also sets up the necessary Amazon SQS queue (or queues, one for each region) to use for subsequent incremental extracts.

The steps below assume you have the required AWS credentials available.

  1. Log in to the Cloudera Manager Admin Console.
  2. Click Administration > AWS Credentials. The AWS Credentials page displays, listing any existing credentials that have been set up for the Cloudera Manager cluster.
  3. Click the Add Access Key Credentials button. In the Add Access Key Credentials page:
    1. Enter a Name for the credentials. The name can contain alphanumeric characters and can include hyphens, underscores, and spaces but should be meaningful in the context of your production environment. Use the name of the Amazon S3 bucket or its functionality, for example, cust-data-raw or post-proc-results.
    2. Enter the AWS Access Key ID.
    3. Enter the AWS Secret Key.

  4. Click Add. The Edit S3Guard:aws-cred-name page displays, giving you the option to enable S3Guard for the S3 bucket.
    • Click the Enable S3Guard box only if the AWS credential has privileges on Amazon DynamoDB and if the preliminary S3Guard configuration is complete. See Cloudera Administration for details about Configuring and Managing S3Guard.
  5. Click Save.
    Note: Cloudera Manager stores the AWS credential securely in a non-world readable location. The access key ID and secret values are masked in the Cloudera Manager Admin Console, encrypted before being passed to other processes, and redacted in the log files.
    The Connect to Amazon Web Services page displays the credential name and the services available for its use.

  6. Click the Enable S3 Metadata Extraction link in the Cloudera Navigator section of the page. A Confirm prompt displays, notifying you that Cloudera Navigator must be manually restarted after this change.
  7. Click OK to enable the connection.
  8. At the top of the Cloudera Manager Admin Console, click the Stale Configuration restart button when you are ready to restart Cloudera Navigator.

Metadata and lineage for Amazon S3 buckets will be available in the Cloudera Navigator console along with other sources, such as HDFS, Hive, and so on. It may take several minutes to complete the initial extraction depending on the number of objects stored on the Amazon S3 bucket.

Custom Configurations

Follow these steps for Amazon S3 buckets that are already configured with queues or event notifications. Custom configurations include configuring your own queue (BYOQ) and BYOQ with Fan-out, as detailed below.

Configuring Your Own Queues

Sometimes referred to as Bring Your Own Queue (BYOQ), configuring your own queue is required if the Amazon S3 bucket being targeted for extraction by Cloudera Navigator already has existing queues or is configured for notifications. The process involves stopping Cloudera Navigator and then using the AWS Management Console for the following tasks:
  • Creating and configuring an Amazon Simple Queue Service (SQS) queue for Cloudera Navigator for each region in which the AWS (IAM user) account has Amazon S3 buckets.
  • Configuring Amazon Simple Notification Service (SNS) on each bucket to send Create, Rename, Update, Delete (CRUD) events to the Cloudera Navigator queue.
  • Configuring the bucket for Notification Fan-Out if needed to support existing notifications configured for other applications.
  • Adding a Policy for the appropriate extraction process (Bulk + Incremental, Bulk Only) to the IAM user account.
  • Adding the Policy for event notifications to the IAM user account.
Important: Always make sure any newly created Amazon S3 buckets are configured for event notifications before adding data so the queues are properly updated.

Configure the Queue for Cloudera Navigator

This manual configuration process requires stopping Cloudera Navigator. You must create a queue for each region that has S3 buckets.

  1. Log in to Cloudera Manager Admin Console and stop Cloudera Navigator:
    • Select Clusters > Cloudera Management Service.
    • Click the Instances tab.
    • Click the checkboxes next to Navigator Audit Server and Navigator Metadata Server in the Role Type list to select these roles.
    • From the Actions for Selected (2) menu button, select Stop.
  2. Log in to the AWS Management Console with the AWS account (IAM user) and open the Simple Queue Service setup page: select Services > Messaging > Simple Queue Service, and then click Create New Queue (or Get Started Now if the region has no configured queues).
  3. For each region that has Amazon S3 buckets, create a queue as follows:
    1. Click the Create New Queue button. Enter a Queue Name, select Standard Queue (not FIFO), and then click Configure Queue. Configure the queue using the following settings:
      Setting                      Value
      Default Visibility Timeout   10 minutes
      Message Retention Period     14 days
      Delivery Delay               0 seconds
      Receive Message Wait Time    0 seconds
    2. Select the queue you created, click the Permissions tab, click Add a Permission, and configure the following in the Add a Permission to... dialog box:
      Setting     Value
      Effect      Allow
      Principal   Everybody
      Actions     SendMessage
    3. Click the Add Conditions (optional) link to open the condition fields and enter the following values:
      Setting     Value
      Qualifier   None
      Condition   ArnLike
      Key         aws:SourceArn
      Value       arn:aws:s3::*:*
    4. Click Add Condition to save the settings.
    5. Click Add Permission to save all settings for the queue.


Repeat this process for each region that has Amazon S3 buckets.
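
The console settings above can also be written down as data, which may help when the same queue has to be created in several regions. The following is a minimal sketch in Python that only builds the dictionaries and does not contact AWS. The attribute names are the ones used by the Amazon SQS API; the function names and the example queue ARN are hypothetical and are not part of Cloudera Navigator.

```python
import json


def navigator_queue_attributes():
    """Queue settings from the steps above, expressed in seconds as strings."""
    return {
        "VisibilityTimeout": str(10 * 60),              # 10 minutes
        "MessageRetentionPeriod": str(14 * 24 * 3600),  # 14 days
        "DelaySeconds": "0",                            # Delivery Delay
        "ReceiveMessageWaitTimeSeconds": "0",           # Receive Message Wait Time
    }


def navigator_queue_policy(queue_arn, source_arn_pattern="arn:aws:s3::*:*"):
    """Permission from the steps above: allow SendMessage from matching S3 sources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",  # "Everybody" in the console
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnLike": {"aws:SourceArn": source_arn_pattern}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical queue ARN, for illustration only.
    arn = "arn:aws:sqs:us-west-1:123456789012:nav-queue"
    print(json.dumps(navigator_queue_attributes(), indent=2))
    print(json.dumps(navigator_queue_policy(arn), indent=2))
```

If you script queue creation with an AWS SDK, dictionaries of this shape are what the queue-creation call would typically receive as attributes, with the policy serialized to a JSON string under the Policy attribute. Treat that as an assumption to check against the SDK documentation for your version.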

Configure Event Notification for the Queues

After creating queues for all regions with Amazon S3 buckets, you must configure event notification for each Amazon S3 bucket. Assuming you are still logged into the AWS Management Console:
  1. Navigate to the Amazon S3 bucket for the region (Services > Storage > S3).
  2. Select the bucket.
  3. Click the Properties tab.
  4. Click the Events settings box.
  5. Click Add notification.
  6. Configure event notification for the bucket as follows:
    Setting     Value
    Name        nav-send-metadata-on-change
    Events      ObjectCreated(All), ObjectRemoved(All)
    Send to     SQS Queue
    SQS queue   Enter the name of your queue
Important: Cloudera Navigator extracts metadata from one queue only for each region.
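
For reference, the notification settings above correspond to a small configuration document. The sketch below, in Python, builds that document as a dictionary without calling AWS. The keys and event names follow the Amazon S3 bucket notification configuration format; the function name and the example queue ARN are hypothetical.

```python
import json


def navigator_notification_configuration(queue_arn):
    """Bucket notification settings from the table above, as a dictionary."""
    return {
        "QueueConfigurations": [
            {
                "Id": "nav-send-metadata-on-change",
                "QueueArn": queue_arn,
                "Events": [
                    "s3:ObjectCreated:*",  # ObjectCreated(All)
                    "s3:ObjectRemoved:*",  # ObjectRemoved(All)
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical queue ARN, for illustration only.
    arn = "arn:aws:sqs:us-west-1:123456789012:nav-queue"
    print(json.dumps(navigator_notification_configuration(arn), indent=2))
```

Note that applying a notification configuration through the S3 API replaces the configuration already on the bucket, so a bucket with existing notifications needs those entries carried over as well. That is an AWS behavior to confirm in the Amazon S3 documentation before scripting this step.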

Configuring Amazon SNS Fan-out

Configure SNS fanout if the bucket has existing S3 event notifications. For more information about SNS fanout, see the Amazon documentation for Common SNS Scenarios.

  1. Add the following to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties. See Setting Properties with Advanced Configuration Snippets for details about using Cloudera Manager Admin Console if necessary.
    nav.s3.extractor.incremental.enable=true
    nav.s3.extractor.incremental.auto_setup.enable=false
    nav.s3.extractor.incremental.queues=queue_json
    
    Specify the queue properties using the following JSON template (without any spaces). Escape commas (,) by preceding them with two backslashes (\\), as shown in the template:
    [{"region":"us-west-1"\\,"queueUrl":"https://sqs.aws_region.amazonaws.com/account_num/queue_name"}\\,{queue_2}\\,
    ...
    {queue_n}]
    
  2. Restart Cloudera Navigator.
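
Escaping the queue list by hand is error-prone, so the value can be generated instead. This is a minimal sketch under the rules stated in step 1 (no spaces, each comma preceded by two backslashes); the function name, account number, and queue name are hypothetical.

```python
import json


def queues_property(queues: list) -> str:
    """Render the queues property line for cloudera-navigator.properties."""
    # Compact separators produce JSON without any spaces
    compact = json.dumps(queues, separators=(",", ":"))
    # Precede every comma with two literal backslashes
    escaped = compact.replace(",", "\\\\,")
    return "nav.s3.extractor.incremental.queues=" + escaped


line = queues_property(
    [
        {
            "region": "us-west-1",
            "queueUrl": "https://sqs.us-west-1.amazonaws.com/123456789012/nav-queue",
        }
    ]
)
print(line)
```

Paste the printed line into the advanced configuration snippet in place of the queue_json placeholder line.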

Defining and Attaching Policies

Event Notification Policy for Custom Queues

To enable event notification for a custom queue, copy the following policy text, paste it into the policy editor to create the policy, and then attach the policy to the Cloudera Navigator user in AWS.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481678612000",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:ReceiveMessage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1481678744000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

Extraction Policies for Custom Queues

Custom configurations require that a valid extraction policy be defined and attached to the AWS user account associated with the Amazon S3 bucket. The policy is a JSON document that specifies the type of extraction. As mentioned in Overview of Amazon S3 Extraction Processes, the two types of extraction are as follows:

  • Bulk + Incremental—This is the recommended approach for both cost and performance reasons and is used by the Default Configuration process automatically.
  • Bulk Only—This approach is recommended for proof-of-concept deployments. It is required for the BYOQ with Fan-out configuration. In addition to applying the policy as detailed below, this approach also requires setting the nav.s3.extractor.incremental.enable property to false. See Setting Properties with Advanced Configuration Snippets and the Cloudera Navigator Properties for Amazon S3 for details.
To configure the policy:
  • Log in to the AWS Management Console using the IAM user account associated with the target Amazon S3 bucket.
  • Copy the appropriate JSON text from the Extraction Policies JSON Reference table below and paste it into the policy editor for the Navigator user account (the IAM user) in the AWS Management Console.
Extraction Policies JSON Reference
Bulk + Incremental (Recommended)
  • Initial bulk process extracts all metadata.
  • Subsequent incremental process extracts changes only (CRUD).
  • Cannot be used with Amazon S3 buckets that use event notification.
  • Policy: the first JSON document below (sqs and s3 actions).
Bulk Only
  • Must be used for Amazon S3 buckets configured for event notifications.
  • Policy: the second JSON document below (s3 actions only).
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481678612000",
            "Effect": "Allow",
            "Action": [
                "sqs:CreateQueue",
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SetQueueAttributes"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1481678744000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481676614000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
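
To see exactly what the two policies above differ by, you can compare their allowed actions. The sketch below is illustrative only: the helper name is hypothetical, and the two dictionaries are condensed copies of the policies above (Sid and Resource entries omitted for brevity).

```python
def allowed_actions(policy: dict) -> set:
    """Collect every action granted by the Allow statements of a policy."""
    actions = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        listed = statement.get("Action", [])
        if isinstance(listed, str):  # a single action may be a plain string
            listed = [listed]
        actions.update(listed)
    return actions


S3_READ = ["s3:GetBucketLocation", "s3:ListAllMyBuckets", "s3:ListBucket",
           "s3:GetObject", "s3:GetObjectAcl"]

BULK_ONLY = {"Statement": [{"Effect": "Allow", "Action": S3_READ}]}

BULK_INCREMENTAL = {"Statement": [
    {"Effect": "Allow", "Action": [
        "sqs:CreateQueue", "sqs:DeleteMessage", "sqs:DeleteMessageBatch",
        "sqs:GetQueueAttributes", "sqs:GetQueueUrl", "sqs:ReceiveMessage",
        "sqs:SetQueueAttributes"]},
    {"Effect": "Allow", "Action": S3_READ + [
        "s3:GetBucketNotification", "s3:PutBucketNotification"]},
]}

# Actions that only the Bulk + Incremental policy grants
extra = sorted(allowed_actions(BULK_INCREMENTAL) - allowed_actions(BULK_ONLY))
for action in extra:
    print(action)
```

The difference consists of the sqs actions plus the two bucket notification actions, which is why the Bulk Only policy cannot change event notifications on a bucket.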

Setting Properties with Advanced Configuration Snippets

Certain features require additional settings or changes to the Cloudera Navigator configuration. For example, configuring BYOQ queues to use bulk-only extraction requires not only creating and attaching the extraction policy but also adding the following snippet to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties setting:

nav.s3.extractor.incremental.enable=false

To change property values by adding an advanced configuration snippet:

  • Log in to the Cloudera Manager Admin Console.
  • Select Clusters > Cloudera Management Service.
  • Click Configuration.
  • Click Navigator Metadata Server under the Scope filter, and click Advanced under the Category filter.
  • Enter Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in the Search field to find the property.
  • Enter the property and its setting as a key-value pair, for example:
    property=your_setting
  • Click Save Changes.
  • Restart the Navigator Metadata Server instance.

Cloudera Navigator Extraction Behavior for Amazon S3

Objects from S3 buckets appear as you would expect directories and files to appear in Navigator. However, there are some behaviors that are specific to S3 source types:
  • Unnamed directories

    It is possible to place files in an unnamed directory in an S3 bucket, such as s3://mybucket//myfile. Navigator does not extract files inside unnamed directories.

  • Deleted implicit directories

    When adding files to an S3 bucket, S3 may create implicit directories for the file if the directories specified in the file path do not already exist. When a file is deleted and its implicit directories are also removed on S3, Navigator will show the file as deleted but will not delete the implicit directories. You can filter these directories from the Navigator search results by setting implicit:false in the search query.

  • Inconsistency delays

    Inconsistencies that occur in AWS can delay Navigator extraction of metadata and lineage from Amazon S3. When Cloudera Navigator detects an inconsistency, extraction may stop until the inconsistency is resolved in AWS. Cloudera Navigator will retry at the next scheduled extraction.

Cloudera Navigator Properties for Amazon S3

The table below lists the Navigator Metadata Server properties (cloudera-navigator.properties) that control extraction and other features related to Amazon S3. Set these properties in the Cloudera Manager Admin Console by adding them to the advanced configuration snippet, as described in Setting Properties with Advanced Configuration Snippets.

Changing any of the values in the table requires a restart of Cloudera Navigator.

Option Description
nav.aws.api.limit Default is 5,000,000. Maximum number of Amazon Web Services (AWS) API calls that Cloudera Navigator can make per month.
nav.sqs.max_receive_count Default is 10. Number of retries for inconsistent SQS messages (inconsistent due to eventual consistency).
nav.s3.extractor.enable Default is true when an AWS credential has been configured to extract metadata from Amazon S3.
nav.s3.extractor.incremental.auto_setup.enable Default is true. Enables Cloudera Navigator to set up Amazon SQS queues to receive notifications from Amazon S3 events. Set to false to disable the automatic setup and custom configure your own queue (BYOQ with Fan-out).
nav.s3.extractor.incremental.batch_size Default is 1000. Number of messages held in memory during the extraction process.
nav.s3.extractor.incremental.enable Default is true. Enables incremental extraction. Setting to false disables incremental extraction and effectively enables bulk-only extraction.
nav.s3.extractor.incremental.event.overwrite Default is false. Prevents any existing event notifications from being overwritten by Cloudera Navigator auto-generated queues created during default configuration process.
Important: Do not set to true unless you fully understand the impact of overwriting event notifications. Setting to true may overwrite critical existing business logic.
nav.s3.extractor.incremental.queues No default queue. Used by the custom configuration only. Specify a list of queues for the custom configuration using the JSON template. The list of queues should include the existing queues already in use and the newly configured queue that Cloudera Navigator will use for incremental extracts.
nav.s3.extractor.max_threads Default is 3. The number of extractors (worker processes) to run in parallel.
nav.s3.home_region Default is us-west-1. AWS region closest to the cluster and the Cloudera Navigator instance. Select the same AWS region (or the nearest one geographically) to minimize latency for API requests.
nav.s3.implicit.batch_size Default is 1000. Number of Solr documents held in memory when updating the state of implicit directories.
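
When preparing an advanced configuration snippet, it can help to emit only the properties that differ from the documented defaults. The following sketch assumes the defaults listed in the table above; the function and constant names are hypothetical, and the two properties without a fixed default are accepted but never filtered out.

```python
# Defaults copied from the table above, stored as property-file strings
DEFAULTS = {
    "nav.aws.api.limit": "5000000",
    "nav.sqs.max_receive_count": "10",
    "nav.s3.extractor.incremental.auto_setup.enable": "true",
    "nav.s3.extractor.incremental.batch_size": "1000",
    "nav.s3.extractor.incremental.enable": "true",
    "nav.s3.extractor.incremental.event.overwrite": "false",
    "nav.s3.extractor.max_threads": "3",
    "nav.s3.home_region": "us-west-1",
    "nav.s3.implicit.batch_size": "1000",
}

# Documented properties whose default is conditional or absent
NO_FIXED_DEFAULT = {
    "nav.s3.extractor.enable",
    "nav.s3.extractor.incremental.queues",
}


def render_snippet(overrides: dict) -> str:
    """Return key=value lines for overrides that differ from the defaults."""
    lines = []
    for key, value in overrides.items():
        if key not in DEFAULTS and key not in NO_FIXED_DEFAULT:
            raise KeyError(f"not a documented S3 property: {key}")
        if DEFAULTS.get(key) == value:
            continue  # already the default, nothing to set
        lines.append(f"{key}={value}")
    return "\n".join(lines)


snippet = render_snippet({
    "nav.s3.extractor.incremental.enable": "false",  # bulk-only extraction
    "nav.s3.extractor.max_threads": "3",             # same as default, skipped
})
print(snippet)
```

As with any change to these properties, restart Cloudera Navigator after saving the snippet.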

© 2021 Cloudera, Inc. All rights reserved. Apache Hadoop and associated open source project names are trademarks of the Apache Software Foundation. For a complete list of trademarks, click here.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required notices. A copy of the Apache License Version 2.0 can be found here.

Page generated September 29, 2021.