Steps to set up an open-source real-time IoT data pipeline in Azure cloud

Architecture diagram:

Through this and a subsequent series of articles I would like to lay out a detailed, step-by-step process for building a real-time IoT data pipeline, based primarily on Apache open-source technologies, on Azure cloud.

I start with the detailed architecture and instructions to build production-grade clusters.

In the subsequent article we shall see how the various processors provided by NiFi help us string together each of the tools in the architecture above, and help us debug the data ingestion, processing and storage process using NiFi's data provenance feature.

Installing Cloudbreak on Azure:

Refer to https://docs.cloudera.com/HDPDocuments/Cloudbreak/Cloudbreak-2.8.0/install-azure/content/cb_vm-requirements.html

Step 1:

Spawn a CentOS 7 server in Azure cloud with 16GB RAM, 40GB disk, 4 cores

Do a sudo -i to execute all the following commands as root.

yum -y update

systemctl enable firewalld

systemctl start firewalld

systemctl status firewalld

setenforce 0

sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

getenforce (The command should return "Permissive" now, and "Disabled" after a reboot.)

Step 2:

yum install -y docker

systemctl start docker

systemctl enable docker

In case you face problems installing Docker using the above commands, check:

https://github.com/NaturalHistoryMuseum/scratchpads2/wiki/Install-Docker-and-Docker-Compose-(Centos-7)

https://docs.docker.com/install/linux/docker-ce/centos/

Step 3:

Check the Docker Logging Driver configuration:

docker info | grep "Logging Driver"

If it is set to "Logging Driver: journald", you must set it to "json-file" instead. To do that:

  1. Open the docker configuration file for editing:

vi /etc/sysconfig/docker

  2. Edit the OPTIONS line so that it looks like below (with --log-driver=json-file):

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='--selinux-enabled --log-driver=json-file --signature-verification=false'

  3. Restart Docker:

systemctl restart docker

systemctl status docker
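Note: the OPTIONS line above applies to the CentOS-packaged docker. If you installed Docker CE from docker.com instead, /etc/sysconfig/docker may not exist; a sketch of the equivalent setting, placed in /etc/docker/daemon.json:

{
  "log-driver": "json-file"
}

Then restart Docker again with systemctl restart docker.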

Step 4:

Install Cloudbreak on a VM:

yum -y install unzip tar

curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_2.8.0_$(uname)_x86_64.tgz | sudo tar -xz -C /bin cbd

cbd --version

mkdir cloudbreak-deployment

cd cloudbreak-deployment

In the directory, create a file called Profile with the following content:

export UAA_DEFAULT_SECRET=MY-SECRET

export UAA_DEFAULT_USER_PW=MY-PASSWORD

export UAA_DEFAULT_USER_EMAIL=MY-EMAIL

export PUBLIC_IP=MY_VM_IP

For example

export UAA_DEFAULT_SECRET=GravitySecret123

export UAA_DEFAULT_USER_PW=GravitySecurePassword123

export UAA_DEFAULT_USER_EMAIL=peter.smith@family.com

export PUBLIC_IP=172.22.222.222

Generate configurations by executing

rm *.yml

cbd generate

The cbd start command includes the cbd generate command which applies the following steps:

  • Creates the docker-compose.yml file, which describes the configuration of all the Docker containers required for the Cloudbreak deployment.
  • Creates the uaa.yml file, which holds the configuration of the identity server used to authenticate users with Cloudbreak.

Start the Cloudbreak application by using the following commands

cbd pull-parallel

cbd start

Check the Cloudbreak application logs:

cbd logs cloudbreak

If you observe any jq-related failure in the logs, install jq:

wget -O jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64

chmod +x ./jq

cp jq /usr/bin

You should see a message like this in the log: Started CloudbreakApplication in 36.823 seconds.

Step 5:

Log in to the Cloudbreak UI using https://ip_address, or run

cbd start

again; it prints the UI address and login information on completion.

Log in to the Cloudbreak web UI using the credentials that you configured in your Profile file:

  • The username is the UAA_DEFAULT_USER_EMAIL
  • The password is the UAA_DEFAULT_USER_PW

Creating a NiFi cluster:

Refer to: https://community.cloudera.com/t5/Community-Articles/Using-Cloudbreak-to-create-a-Flow-Management-NiFi-cluster-on/ta-p/249132

Step 1:

Here we shall be creating a NiFi cluster using a custom blueprint; in my experience, trying to create a NiFi cluster from the default blueprints fails.

Click on the CREATE BLUEPRINT button. You should see the Create Blueprint screen.

Step 2:

Click on the Upload JSON File button and upload the blueprint given below:

{
  "Blueprints": {
    "blueprint_name": "hdf-nifi-no-kerberos",
    "stack_name": "HDF",
    "stack_version": "3.2"
  },
  "configurations": [
    {
      "nifi-ambari-config": {
        "nifi.security.encrypt.configuration.password": "changethisplease",
        "nifi.max_mem": "1g",
        "nifi.sensitive.props.key": "changethisplease"
      }
    },
    {
      "nifi-bootstrap-env": {
        "content": "#\n# Licensed to the Apache Software Foundation (ASF) under one or more\n# contributor license agreements. See the NOTICE file distributed with\n# this work for additional information regarding copyright ownership.\n# The ASF licenses this file to You under the Apache License, Version 2.0\n# (the \"License\"); you may not use this file except in compliance with\n# the License. You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n#\n\n# Java command to use when running NiFi\njava=java\n\n# Username to use when running NiFi. This value will be ignored on Windows.\nrun.as={{nifi_user}}\n##run.as=root\n\n# Configure where NiFi's lib and conf directories live\nlib.dir={{nifi_install_dir}}/lib\nconf.dir={{nifi_config_dir}}\n\n# How long to wait after telling NiFi to shutdown before explicitly killing the Process\ngraceful.shutdown.seconds=20\n\n{% if security_enabled %}\njava.arg.0=-Djava.security.auth.login.config={{nifi_jaas_conf}}\n{% endif %}\n\n# Disable JSR 199 so that we can use JSP's without running a JDK\njava.arg.1=-Dorg.apache.jasper.compiler.disablejsr199=true\n\n# JVM memory settings\njava.arg.2=-Xms{{nifi_initial_mem}}\njava.arg.3=-Xmx{{nifi_max_mem}}\n\n# Enable Remote Debugging\n#java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000\n\njava.arg.4=-Djava.net.preferIPv4Stack=true\n\n# allowRestrictedHeaders is required for Cluster/Node communications to work properly\njava.arg.5=-Dsun.net.http.allowRestrictedHeaders=true\njava.arg.6=-Djava.protocol.handler.pkgs=sun.net.www.protocol\n\n# The G1GC is still considered experimental but has proven to be very advantageous in providing great\n# performance without significant \"stop-the-world\" delays.\njava.arg.13=-XX:+UseG1GC\n\n#Set headless mode by default\njava.arg.14=-Djava.awt.headless=true\n\n#Ambari Metrics Collector URL - passed in to flow.xml for AmbariReportingTask\njava.arg.15=-Dambari.metrics.collector.url=http://{{metrics_collector_host}}:{{metrics_collector_port}}/ws/v1/timeline/metrics\n\n#Application ID - used in flow.xml - passed into flow.xml for AmbariReportingTask\njava.arg.16=-Dambari.application.id=nifi\n\n#Sets the provider of SecureRandom to /dev/urandom to prevent blocking on VMs\njava.arg.17=-Djava.security.egd=file:/dev/urandom\n\n# Requires JAAS to use only the provided JAAS configuration to authenticate a Subject, without using any \"fallback\" methods (such as prompting for username/password)\n# Please see https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/single-signon.html, section \"EXCEPTIONS TO THE MODEL\"\njava.arg.18=-Djavax.security.auth.useSubjectCredsOnly=true\n\n###\n# Notification Services for notifying interested parties when NiFi is stopped, started, dies\n###\n\n# XML File that contains the definitions of the notification services\nnotification.services.file={{nifi_config_dir}}/bootstrap-notification-services.xml\n\n# In the case that we are unable to send a notification for an event, how many times should we retry?\nnotification.max.attempts=5\n\n# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to 
notify when NiFi is started?\n#nifi.start.notification.services=email-notification\n\n# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is stopped?\n#nifi.stop.notification.services=email-notification\n\n# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi dies?\n#nifi.dead.notification.services=email-notification\n\njava.arg.tmpdir=-Djava.io.tmpdir=/usr/hdf/current/nifi/lib"
      }
    },
    {
      "nifi-properties": {
		"nifi.sensitive.props.key": "changemeplease",
		"nifi.security.identity.mapping.pattern.kerb": "^(.*?)@(.*?)$",
		"nifi.security.identity.mapping.value.kerb": "$1",
		"nifi.security.user.login.identity.provider": ""
      }
    },
    {
      "nifi-ambari-ssl-config": {
		"nifi.toolkit.tls.token": "changemeplease",
		"nifi.node.ssl.isenabled": "false",
		"nifi.toolkit.dn.prefix": "CN=",
		"nifi.toolkit.dn.suffix": ", OU=NIFI"
      }
    },
    {
      "nifi-registry-ambari-config": {
        "nifi.registry.security.encrypt.configuration.password": "changethisplease"
      }
    },
    {
      "nifi-registry-properties": {
        "nifi.registry.security.identity.mapping.pattern.kerb": "^(.*?)@(.*?)$",
        "nifi.registry.security.identity.mapping.value.kerb": "$1",
        "nifi.registry.db.password": "changethisplease"
      }
    },
    {
      "nifi-registry-ambari-ssl-config": {
		"nifi.registry.ssl.isenabled": "false",
		"nifi.registry.toolkit.dn.prefix": "CN=",
		"nifi.registry.toolkit.dn.suffix": ", OU=NIFI"
      }
    }
  ],
  "host_groups": [
    {
      "name": "Services",
      "components": [
        {
          "name": "NIFI_REGISTRY_MASTER"
        },
        {
          "name": "METRICS_COLLECTOR"
        },
        {
          "name": "METRICS_MONITOR"
        },
        {
          "name": "METRICS_GRAFANA"
        },
        {
          "name": "ZOOKEEPER_CLIENT"
        }
      ],
      "cardinality": "1"
    },
    {
      "name": "NiFi",
      "components": [
        {
          "name": "NIFI_MASTER"
        },
        {
          "name": "METRICS_MONITOR"
        },
        {
          "name": "ZOOKEEPER_CLIENT"
        }
      ],
      "cardinality": "1+"
    },
    {
      "name": "ZooKeeper",
      "components": [
        {
          "name": "ZOOKEEPER_SERVER"
        },
        {
          "name": "METRICS_MONITOR"
        },
        {
          "name": "ZOOKEEPER_CLIENT"
        }
      ],
      "cardinality": "3+"
    }
  ]
}
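Before uploading, it is worth sanity-checking that the blueprint is valid JSON. A quick way, assuming you saved it as hdf-nifi-no-kerberos.json and have jq available where the file lives (it was installed in Step 4 above):

jq empty hdf-nifi-no-kerberos.json && echo "valid JSON"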

In the left menu, click on Clusters. Cloudbreak will display configured clusters. Click the CREATE CLUSTER button. Cloudbreak will display the Create Cluster wizard.

Step 3:

To create a credential in Azure, follow these steps.

We shall be creating an interactive credential

  • In the Cloudbreak web UI, select Credentials from the navigation pane.
  • Click Create Credential.
  • Under Cloud provider, select “Microsoft Azure”.
  • Select Interactive Login.
  • Provide the following information:
  • Name: Enter a name for your credential.
  • Description: (Optional) Enter a description.
  • Subscription Id: Copy and paste the Subscription ID from your Subscriptions.
  • Tenant Id: Copy and paste your Directory ID from your Active Directory > Properties.
  • Azure role type: "Use existing Contributor role" (default).

While creating credentials from Azure for Cloudbreak, get the tenant details from Azure Active Directory -> Properties -> Directory ID.
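If you have the Azure CLI installed, a quick way to look up both IDs is az account show; the JMESPath query below just relabels its id and tenantId fields:

az account show --query "{subscriptionId:id, tenantId:tenantId}" --output json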

In case of errors, refer to: https://docs.cloudera.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-credential-azure/content/cb_create-interactive-credential.html

Step 4:

General Configuration

By default, the General Configuration screen is displayed using the BASIC view.

  • Credential: Select the Azure credential you created in the above step.
  • Cluster Name: Enter a name for your cluster. The name must be between 5 and 40 characters, must start with a letter, and must only include lowercase letters, numbers, and hyphens.
  • Region: Select the region in which you would like to launch your cluster.
  • Platform Version: Cloudbreak currently defaults to HDP 2.6. Select the dropdown arrow and select the HDF version matching the blueprint (HDF 3.2 here).
  • Cluster Type: Make sure you select the blueprint you just created (hdf-nifi-no-kerberos).

Click the green NEXT button.

Step 5:

Hardware and Storage

Cloudbreak will display the Hardware and Storage screen. On this screen, you have the ability to change the instance types, attached storage and where the Ambari server will be installed. As you can see, we will deploy 1 NiFi and 1 Zookeeper node. In a production environment you would typically have at least 3 Zookeeper nodes. We will use the defaults.

Click the green NEXT button.

Step 6:

Gateway Configuration

Cloudbreak will display the Gateway Configuration screen. On this screen, you have the ability to enable a protected gateway. This gateway uses Knox to provide a secure access point for the cluster. We will leave this option disabled.

Click the green NEXT button.

Step 7:

Network

Cloudbreak will display the Network screen. On this screen, you have the ability to specify the Network, Subnet, and Security Groups. Cloudbreak defaults to creating new configurations. We will use the default options to create new configurations.

Because we are using a custom blueprint which disables SSL, we need to update the security groups with correct ports for the NiFi and NiFi Registry UIs. In the SERVICES security group, add the port 61080 with TCP. Click the + button to add the rule. In the NIFI security group, add the port 9090 with TCP. Click the + button to add the rule.

You should see something similar to the following:

Click the green NEXT button.

Step 8:

Security

Cloudbreak will display the Security screen. On this screen, you have the ability to specify the Ambari admin username and password. You can create a new SSH key or select an existing one.

ssh-keygen -t rsa

Copy the public key generated by the command above into the SSH public key field.
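To print the public key for pasting, assuming it was generated in the default location:

cat ~/.ssh/id_rsa.pub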

You have the ability to display a JSON version of the blueprint. You also have the ability to display a JSON version of the cluster definition. Both of these can be used with the Cloudbreak CLI to programmatically automate these operations.

Click the green CREATE CLUSTER button.

Step 9:

Cluster Summary

Cloudbreak will display the Cluster Summary page. It will generally take 10 to 15 minutes for the cluster to be fully deployed. As you can see, this screen looks similar to an HDP cluster. The big difference is the Blueprint and HDF Version.

Click on the Ambari URL to open the Ambari UI.

Step 10:

Ambari

You will likely see a browser warning when you first open the Ambari UI. That is because we are using self-signed certificates.

Click on the ADVANCED button. Then click the link to Proceed.

You will be presented with the Ambari login page. Log in using the username and password you specified when you created the cluster.

You should see the cluster summary screen. As you can see, we have a cluster with Zookeeper, NiFi, and the NiFi Registry.

Click on the NiFi service in the left hand menu. Now you can access the Quick Links menu for a shortcut to the NiFi UI.

You should see the NiFi UI.

Back in the Ambari UI, click on the NiFi Registry service in the left hand menu. Now you can access the Quick Links menu for a shortcut to the NiFi Registry UI.

You should see the NiFi Registry UI.

Creating a Kafka cluster:

Step 1:

Follow the instructions as above on the Cloudbreak UI (similar to the NiFi cluster) for creating Kafka, and you should be able to create it easily. Unlike NiFi, it does not require a custom blueprint.
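Once the cluster is up, a quick smoke test is to create and list a topic from one of the broker nodes. The path below assumes the standard HDF layout (an assumption; adjust to your installation), and <zookeeper-host> is a placeholder for one of your ZooKeeper nodes:

/usr/hdf/current/kafka-broker/bin/kafka-topics.sh --create --zookeeper <zookeeper-host>:2181 --replication-factor 1 --partitions 1 --topic test

/usr/hdf/current/kafka-broker/bin/kafka-topics.sh --list --zookeeper <zookeeper-host>:2181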

Creating a Cassandra Cluster:

Step 1:

Spawn 2 CentOS 7 machines in Azure cloud

Step 2:

Install Cassandra

vi /etc/yum.repos.d/datastax.repo

[datastax]

name = DataStax Repo for Apache Cassandra

baseurl = http://rpm.datastax.com/community

enabled = 1

gpgcheck = 0

yum install dsc30

yum install cassandra30-tools

Step 3:

Open the following ports in Azure for inbound traffic

7000

7001

7199

9042

9160

9142

Also open them in CentOS using the following commands for each of the ports:

# firewall-cmd --zone=public --add-port=7000/tcp --permanent

# firewall-cmd --reload

# iptables-save | grep 7000
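Rather than repeating the command for each port, a small loop covers all six (same firewall-cmd flags as above):

for port in 7000 7001 7199 9042 9160 9142; do firewall-cmd --zone=public --add-port=${port}/tcp --permanent; done

firewall-cmd --reload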

Step 4:

Modify the following entry in /etc/cassandra/conf/cassandra-env.sh on both machines, using each machine's own hostname:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=martians.eastus.cloudapp.azure.com"

Step 5:

Modify the following entries in /etc/cassandra/conf/cassandra.yaml on both machines:

- seeds: "martians.eastus.cloudapp.azure.com,martiansext.eastus.cloudapp.azure.com"

On machine 1 only

listen_address: martians.eastus.cloudapp.azure.com

rpc_address: 0.0.0.0

rpc_port: 9160

start_rpc: true

broadcast_rpc_address: martians.eastus.cloudapp.azure.com

On machine 2 only

listen_address: martiansext.eastus.cloudapp.azure.com

rpc_address: 0.0.0.0

rpc_port: 9160

start_rpc: true

broadcast_rpc_address: martiansext.eastus.cloudapp.azure.com

Step 6:

Modify the /etc/hosts file to add the following entries:

10.8.8.8 martians.eastus.cloudapp.azure.com # private IPs

10.9.9.9 martiansext.eastus.cloudapp.azure.com

13.7.7.7 martians.eastus.cloudapp.azure.com # public IPs

40.6.66.66 martiansext.eastus.cloudapp.azure.com

Step 7:

systemctl restart cassandra

nodetool status

Datacenter: datacenter1

=======================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address   Load       Tokens       Owns (effective)  Host ID                               Rack

UN  10.8.8.8  263.24 KB  256          100.0%            943b9686-ec28-49ed-a8f8-5223d88df8b1  rack1

UN  10.9.9.9  287.54 KB  256          100.0%            29a37841-5af4-47cd-9065-5a1c953274f1  rack1

Step 8:

cqlsh martians.eastus.cloudapp.azure.com
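Note: the Device keyspace must exist before the table can be created. A minimal sketch (SimpleStrategy with a replication factor of 2 is an assumption suitable for this two-node setup):

CREATE KEYSPACE IF NOT EXISTS Device WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};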

CREATE TABLE Device.data (
    Serial_number text,
    dimension int,
    param_reading1 text,
    param_reading2 text,
    param_readingN text,
    PRIMARY KEY (Serial_number)
);

INSERT INTO Device.data (Serial_number, dimension, param_reading1, param_reading2, param_readingN) VALUES ('Boeing737', 45, 'LCM-100', 'HCM-200', 'TFT-400');

Run cqlsh martiansext.eastus.cloudapp.azure.com on the second machine:

select * from Device.data;

You should see the record that was added from the first machine.

Creating a Spark Cluster:

Step 1:

Spawn 3 CentOS-based Linux VMs in Azure cloud. One of these will serve as master and the other 2 as slaves.

gravitymaster.eastus.cloudapp.azure.com

52.888.88.888 (public ip)

10.9.9.9 (private ip)

gravityslave1.eastus.cloudapp.azure.com

52.777.777.777

10.6.6.6

gravityslave2.eastus.cloudapp.azure.com

13.55.555.555

10.4.4.4

Ensure that you use the same username on all 3 machines, say gravitymaster.

Step 2:

Add the following 3 entries to the /etc/hosts file of gravitymaster.

52.888.88.888 gravitymaster.eastus.cloudapp.azure.com

52.777.777.777 gravityslave1.eastus.cloudapp.azure.com

13.55.555.555 gravityslave2.eastus.cloudapp.azure.com

Step 3:

Install Java on the master and slave machines:

wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.rpm

yum localinstall jdk-8u131-linux-x64.rpm

sudo alternatives --config java

sudo sh -c "echo export JAVA_HOME=/usr/java/jdk1.8.0_131/jre >> /etc/environment"

java -version

Step 4:

Install Scala on the master and slave machines:

wget https://downloads.lightbend.com/scala/2.13.0/scala-2.13.0.rpm

yum install scala-2.13.0.rpm

scala -version

Step 5:

Configure passwordless ssh from master to the slave machines.

[gravitymaster@gravitymaster ~]$ ssh-keygen -t rsa

Press enter wherever prompted

Execute the following commands

ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com mkdir -p .ssh

cat .ssh/id_rsa.pub | ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com 'cat >> .ssh/authorized_keys'

ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com

exit

ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com mkdir -p .ssh

cat .ssh/id_rsa.pub | ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com 'cat >> .ssh/authorized_keys'

ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com
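If ssh-copy-id is available on the master, it is an equivalent shortcut for the copy-and-chmod sequence above:

ssh-copy-id gravitymaster@gravityslave1.eastus.cloudapp.azure.com

ssh-copy-id gravitymaster@gravityslave2.eastus.cloudapp.azure.com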

Step 6:

Install Spark on all 3 VMs:

wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

$ tar xvf spark-2.4.4-bin-hadoop2.7.tgz

mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark

vi ~/.bashrc

Add the following line to ~/.bashrc file.

export PATH=$PATH:/usr/local/spark/bin

$ source ~/.bashrc

Step 7:

Make the following changes to the master server.

cd /usr/local/spark/conf

cp spark-env.sh.template spark-env.sh

vi spark-env.sh

export SPARK_MASTER_HOST='<MASTER-IP>' (make sure this is the private IP of the Azure server)

export JAVA_HOME=/usr/java/jdk1.8.0_131/jre

Step 8:

Edit the list of slaves on the master machine:

cd /usr/local/spark/conf

cp slaves.template slaves

vi slaves

Add the following entries:

gravitymaster.eastus.cloudapp.azure.com

gravityslave1.eastus.cloudapp.azure.com

gravityslave2.eastus.cloudapp.azure.com

Step 9:

Start the Spark cluster using the following commands on the master:

$ cd /usr/local/spark

$ ./sbin/start-all.sh

Step 10:

Access the Spark web UI at:

http://gravitymaster.eastus.cloudapp.azure.com:8080/
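To verify that the cluster accepts jobs, you can submit the bundled SparkPi example from the master (assuming the master listens on the default port 7077):

/usr/local/spark/bin/run-example --master spark://<MASTER-IP>:7077 SparkPi 10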

In case of issues, check the master and slave logs located at:

/usr/local/spark/logs/spark-gravitymaster-org.apache.spark.deploy.master.Master-1-gravitymaster.eastus.cloudapp.azure.com.out

/usr/local/spark/logs/spark-gravitymaster-org.apache.spark.deploy.worker.Worker-1-gravityslave1.out

/usr/local/spark/logs/spark-gravitymaster-org.apache.spark.deploy.worker.Worker-1-gravityslave2.out

Step 11:

To stop the above processes:

$ cd /usr/local/spark
$ ./sbin/stop-all.sh

Or check the process ids from the logfiles in the previous step and

kill -9 <processid>
