Architecture diagram:


In this and the articles that follow, I would like to lay out a detailed, step-by-step process for building a real-time IoT data pipeline on the Azure cloud, based primarily on Apache open-source technologies.
This article covers the detailed architecture and instructions for building production-grade clusters.
In the subsequent article we shall see how the various processors provided by NiFi help us string together each of the tools listed above, and how NiFi's data provenance feature helps us debug the data ingestion, processing, and storage flow.
Installing Cloudbreak on Azure:
Step 1:
Spawn a CentOS 7 server in the Azure cloud with 16 GB RAM, a 40 GB disk, and 4 cores.
Run sudo -i to execute all the following commands as root.
yum -y update
systemctl enable firewalld
systemctl start firewalld
systemctl status firewalld
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
getenforce (The command should return "Disabled".)
Step 2:
yum install -y docker
systemctl start docker
systemctl enable docker
In case you face problems installing Docker using the above commands, check:
https://docs.docker.com/install/linux/docker-ce/centos/
Step 3:
Check the Docker Logging Driver configuration:
docker info | grep "Logging Driver"
If it is set to Logging Driver: journald, you must set it to "json-file" instead. To do that:
- Open the docker sysconfig file for editing:
vi /etc/sysconfig/docker
- Edit the following part of the file so that it looks like below (note log-driver=json-file):
# Modify these options if you want to change the way the docker daemon runs
OPTIONS='--selinux-enabled --log-driver=json-file --signature-verification=false'
- Restart Docker:
systemctl restart docker
systemctl status docker
Step 4:
Install Cloudbreak on a VM:
yum -y install unzip tar
curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_2.8.0_$(uname)_x86_64.tgz | sudo tar -xz -C /bin cbd
cbd --version
mkdir cloudbreak-deployment
cd cloudbreak-deployment
In the directory, create a file called Profile with the following content:
export UAA_DEFAULT_SECRET=MY-SECRET
export UAA_DEFAULT_USER_PW=MY-PASSWORD
export UAA_DEFAULT_USER_EMAIL=MY-EMAIL
export PUBLIC_IP=MY_VM_IP
For example:
export UAA_DEFAULT_SECRET=GravitySecret123
export UAA_DEFAULT_USER_PW=GravitySecurePassword123
export UAA_DEFAULT_USER_EMAIL=peter.smith@family.com
export PUBLIC_IP=172.22.222.222
Generate configurations by executing
rm *.yml
cbd generate
The cbd start command includes the cbd generate step, which applies the following:
- Creates the docker-compose.yml file, which describes the configuration of all the Docker containers required for the Cloudbreak deployment.
- Creates the uaa.yml file, which holds the configuration of the identity server used to authenticate users with Cloudbreak.
Start the Cloudbreak application by using the following commands
cbd pull-parallel
cbd start
Check the Cloudbreak application logs:
cbd logs cloudbreak
If you observe any failures, install jq manually:
wget -O jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
chmod +x ./jq
cp jq /usr/bin
You should see a message like this in the log: Started CloudbreakApplication in 36.823 seconds.
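Rather than scrolling through the whole log, you can wait for that line directly (assuming cbd is on the path):

```shell
# Block until the startup message appears in the Cloudbreak log.
cbd logs cloudbreak | grep -m1 "Started CloudbreakApplication"
```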
Step 5:
Log in to the Cloudbreak UI using https://ip_address, or run cbd start to obtain the login information.
Log in to the Cloudbreak web UI using the credentials that you configured in your Profile file:
- The username is the UAA_DEFAULT_USER_EMAIL
- The password is the UAA_DEFAULT_USER_PW
Creating a NiFi cluster:
Refer to url : https://community.cloudera.com/t5/Community-Articles/Using-Cloudbreak-to-create-a-Flow-Management-NiFi-cluster-on/ta-p/249132
Step 1:
Here we shall create a NiFi cluster using a custom blueprint. If you try to create a NiFi cluster from the default blueprints, it fails.
Click on the CREATE BLUEPRINT button. You should
see the Create Blueprint screen.
Step 2:
Click on the Upload JSON File button and upload the blueprint below:
{
"Blueprints": {
"blueprint_name": "hdf-nifi-no-kerberos",
"stack_name": "HDF",
"stack_version": "3.2"
},
"configurations": [
{
"nifi-ambari-config": {
"nifi.security.encrypt.configuration.password": "changethisplease",
"nifi.max_mem": "1g",
"nifi.sensitive.props.key": "changethisplease"
}
},
{
"nifi-bootstrap-env": {
"content": "#\n# Licensed to the Apache Software Foundation (ASF) under one or more\n# contributor license agreements. See the NOTICE file distributed with\n# this work for additional information regarding copyright ownership.\n# The ASF licenses this file to You under the Apache License, Version 2.0\n# (the \"License\"); you may not use this file except in compliance with\n# the License. You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n#\n\n# Java command to use when running NiFi\njava=java\n\n# Username to use when running NiFi. This value will be ignored on Windows.\nrun.as={{nifi_user}}\n##run.as=root\n\n# Configure where NiFi's lib and conf directories live\nlib.dir={{nifi_install_dir}}/lib\nconf.dir={{nifi_config_dir}}\n\n# How long to wait after telling NiFi to shutdown before explicitly killing the Process\ngraceful.shutdown.seconds=20\n\n{% if security_enabled %}\njava.arg.0=-Djava.security.auth.login.config={{nifi_jaas_conf}}\n{% endif %}\n\n# Disable JSR 199 so that we can use JSP's without running a JDK\njava.arg.1=-Dorg.apache.jasper.compiler.disablejsr199=true\n\n# JVM memory settings\njava.arg.2=-Xms{{nifi_initial_mem}}\njava.arg.3=-Xmx{{nifi_max_mem}}\n\n# Enable Remote Debugging\n#java.arg.debug=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000\n\njava.arg.4=-Djava.net.preferIPv4Stack=true\n\n# allowRestrictedHeaders is required for Cluster/Node communications to work properly\njava.arg.5=-Dsun.net.http.allowRestrictedHeaders=true\njava.arg.6=-Djava.protocol.handler.pkgs=sun.net.www.protocol\n\n# The G1GC is still considered experimental but has proven to be very 
advantageous in providing great\n# performance without significant \"stop-the-world\" delays.\njava.arg.13=-XX:+UseG1GC\n\n#Set headless mode by default\njava.arg.14=-Djava.awt.headless=true\n\n#Ambari Metrics Collector URL - passed in to flow.xml for AmbariReportingTask\njava.arg.15=-Dambari.metrics.collector.url=http://{{metrics_collector_host}}:{{metrics_collector_port}}/ws/v1/timeline/metrics\n\n#Application ID - used in flow.xml - passed into flow.xml for AmbariReportingTask\njava.arg.16=-Dambari.application.id=nifi\n\n#Sets the provider of SecureRandom to /dev/urandom to prevent blocking on VMs\njava.arg.17=-Djava.security.egd=file:/dev/urandom\n\n# Requires JAAS to use only the provided JAAS configuration to authenticate a Subject, without using any \"fallback\" methods (such as prompting for username/password)\n# Please see https://docs.oracle.com/javase/8/docs/technotes/guides/security/jgss/single-signon.html, section \"EXCEPTIONS TO THE MODEL\"\njava.arg.18=-Djavax.security.auth.useSubjectCredsOnly=true\n\n###\n# Notification Services for notifying interested parties when NiFi is stopped, started, dies\n###\n\n# XML File that contains the definitions of the notification services\nnotification.services.file={{nifi_config_dir}}/bootstrap-notification-services.xml\n\n# In the case that we are unable to send a notification for an event, how many times should we retry?\nnotification.max.attempts=5\n\n# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is started?\n#nifi.start.notification.services=email-notification\n\n# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi is stopped?\n#nifi.stop.notification.services=email-notification\n\n# Comma-separated list of identifiers that are present in the notification.services.file; which services should be used to notify when NiFi 
dies?\n#nifi.dead.notification.services=email-notification\n\njava.arg.tmpdir=-Djava.io.tmpdir=/usr/hdf/current/nifi/lib"
}
},
{
"nifi-properties": {
"nifi.sensitive.props.key": "changemeplease",
"nifi.security.identity.mapping.pattern.kerb": "^(.*?)@(.*?)$",
"nifi.security.identity.mapping.value.kerb": "$1",
"nifi.security.user.login.identity.provider": ""
}
},
{
"nifi-ambari-ssl-config": {
"nifi.toolkit.tls.token": "changemeplease",
"nifi.node.ssl.isenabled": "false",
"nifi.toolkit.dn.prefix": "CN=",
"nifi.toolkit.dn.suffix": ", OU=NIFI"
}
},
{
"nifi-registry-ambari-config": {
"nifi.registry.security.encrypt.configuration.password": "changethisplease"
}
},
{
"nifi-registry-properties": {
"nifi.registry.security.identity.mapping.pattern.kerb": "^(.*?)@(.*?)$",
"nifi.registry.security.identity.mapping.value.kerb": "$1",
"nifi.registry.db.password": "changethisplease"
}
},
{
"nifi-registry-ambari-ssl-config": {
"nifi.registry.ssl.isenabled": "false",
"nifi.registry.toolkit.dn.prefix": "CN=",
"nifi.registry.toolkit.dn.suffix": ", OU=NIFI"
}
}
],
"host_groups": [
{
"name": "Services",
"components": [
{
"name": "NIFI_REGISTRY_MASTER"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "ZOOKEEPER_CLIENT"
}
],
"cardinality": "1"
},
{
"name": "NiFi",
"components": [
{
"name": "NIFI_MASTER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "ZOOKEEPER_CLIENT"
}
],
"cardinality": "1+"
},
{
"name": "ZooKeeper",
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "ZOOKEEPER_CLIENT"
}
],
"cardinality": "3+"
}
]
}
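Before uploading, it is worth validating that the blueprint you saved is well-formed JSON. A minimal check, assuming you saved it as hdf-nifi-no-kerberos.json (python -m json.tool ships with the stock CentOS Python):

```shell
# Fails with a parse error and a non-zero exit code if the JSON is malformed.
python -m json.tool hdf-nifi-no-kerberos.json > /dev/null && echo "blueprint JSON is valid"
```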
In the left menu, click on Clusters. Cloudbreak will display the configured clusters. Click the CREATE CLUSTER button. Cloudbreak will display the Create Cluster wizard.
Step 3:
To create a credential in Azure, follow the steps below. We shall be creating an interactive credential.
- In the Cloudbreak web UI, select Credentials from the navigation pane.
- Click Create Credential.
- Under Cloud provider, select “Microsoft Azure”.
- Select Interactive Login.
- Provide the following information:
Parameter | Description
Name | Enter a name for your credential.
Description | (Optional) Enter a description.
Subscription Id | Copy and paste the Subscription ID from your Subscriptions.
Tenant Id | Copy and paste your Directory ID from your Active Directory > Properties.
Azure role type | You have the following options: "Use existing Contributor role" (default).
While creating credentials from Azure for Cloudbreak, get the tenant details from Azure Active Directory -> Properties -> Directory ID.
In case of errors, refer to: https://docs.cloudera.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-credential-azure/content/cb_create-interactive-credential.html
Step 4:
General Configuration
By default, the General Configuration screen is displayed using the BASIC view.
- Credential: Select the Azure credential you created in the above step.
- Cluster Name: Enter a name for your cluster. The name must be between 5 and 40 characters, must start with a letter, and must only include lowercase letters, numbers, and hyphens.
- Region: Select the region in which you would like to launch your cluster.
- Platform Version: Cloudbreak currently defaults to HDP 2.6. Select the dropdown arrow and select HDF 3.1.
- Cluster Type: As mentioned previously, there are two supported cluster types. Make sure you select the blueprint you just created.
Click the green NEXT button.
Step 5:
Hardware and Storage
Cloudbreak will display the Hardware and Storage screen. On this screen, you have the ability to change the instance types, the attached storage, and where the Ambari server will be installed. As you can see, we will deploy 1 NiFi and 1 ZooKeeper node. In a production environment you would typically have at least 3 ZooKeeper nodes. We will use the defaults.

Click the green NEXT button.
Step 6:
Gateway Configuration
Cloudbreak will display the Gateway Configuration screen. On this screen, you have the ability to enable a protected gateway. This gateway uses Knox to provide a secure access point for the cluster. We will leave this option disabled.
Click the green NEXT button.
Step 7:
Network
Cloudbreak will display the Network screen. On this screen, you have the ability to specify the Network, Subnet, and Security Groups. Cloudbreak defaults to creating new configurations, and we will use those defaults.
Because we are using a custom blueprint which disables SSL, we need to update the security groups with the correct ports for the NiFi and NiFi Registry UIs. In the SERVICES security group, add port 61080 with TCP and click the + button to add the rule. In the NIFI security group, add port 9090 with TCP and click the + button to add the rule.
You should see something similar to the following:

Click the green NEXT button.
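Later, once the cluster is running, a quick way to confirm these security group rules took effect is to probe the ports from your workstation. A sketch (substitute your node IPs; nc is in the nmap-ncat package on CentOS 7):

```shell
# Each probe exits 0 if the port is open and reachable.
nc -zv <services-node-ip> 61080   # NiFi Registry UI
nc -zv <nifi-node-ip> 9090        # NiFi UI
```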
Step 8:
Security
Cloudbreak will display the Security screen. On this screen, you have the ability to specify the Ambari admin username and password. You can create a new SSH key or select an existing one.
ssh-keygen -t rsa
Copy the public key generated by the command above into the field in the screenshot below.
You have the ability to display a JSON version of the blueprint. You also have the ability to display a JSON version of the cluster definition. Both of these can be used with the Cloudbreak CLI to programmatically automate these operations.

Click the green CREATE CLUSTER button.
Step 9:
Cluster Summary
Cloudbreak will display the Cluster Summary page. It will generally take between 10 and 15 minutes for the cluster to be fully deployed. As you can see, this screen looks similar to an HDP cluster's; the big difference is the Blueprint and the HDF Version.

Click on the Ambari URL to open the Ambari UI.
Step 10:
Ambari
You will likely see a browser warning when you first open the Ambari UI. That is because we are using self-signed certificates.
Click on the ADVANCED button, then click the link to Proceed.
You will be presented with the Ambari login page. You will login using the username and password you specified when you created the cluster.
You should see the cluster summary screen. As you can see, we have a cluster with Zookeeper, NiFi, and the NiFi Registry.
Click on the NiFi service in the left-hand menu. Now you can access the Quick Links menu for a shortcut to the NiFi UI.
You should see the NiFi UI.
Back in the Ambari UI, click on the NiFi Registry service in the left-hand menu. Now you can access the Quick Links menu for a shortcut to the NiFi Registry UI.
You should see the NiFi Registry UI.
Creating a Kafka cluster:
Step 1:
Follow the instructions above in the Hortonworks UI (similar to the NiFi cluster) to create a Kafka cluster; you should be able to create it easily. Unlike NiFi, it does not require a custom blueprint.
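Once the Kafka cluster is up, a quick smoke test from one of the broker nodes looks like the following. This is a sketch: the /usr/hdf path and port 6667 are the HDF defaults and may differ in your deployment, and the topic name is just an example.

```shell
# Create a topic, list topics, and push one message through the console producer.
cd /usr/hdf/current/kafka-broker/bin
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic iot-smoke-test
./kafka-topics.sh --list --zookeeper localhost:2181
echo "hello from the pipeline" | ./kafka-console-producer.sh --broker-list localhost:6667 --topic iot-smoke-test
```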
Creating a Cassandra Cluster:
Step 1:
Spawn 2 CentOS 7 machines in Azure cloud
Step 2:
Install Cassandra
vi /etc/yum.repos.d/datastax.repo
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
yum install dsc30
yum install cassandra30-tools
Step 3:
Open the following ports in Azure for inbound traffic
7000
7001
7199
9042
9160
9142
Also open them in CentOS using the following commands (repeat for each of the ports):
# firewall-cmd --zone=public --add-port=7000/tcp --permanent
# firewall-cmd --reload
# iptables-save | grep 7000
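The per-port commands above can be collapsed into one loop, reloading the firewall only once at the end (run as root):

```shell
# Open every Cassandra port in the public zone, then reload the firewall.
for port in 7000 7001 7199 9042 9160 9142; do
  firewall-cmd --zone=public --add-port=${port}/tcp --permanent
done
firewall-cmd --reload
```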
Step 4:
Modify the following entry in /etc/cassandra/conf/cassandra-env.sh on both machines:
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=martians.eastus.cloudapp.azure.com"
Step 5:
Modify the following entries in /etc/cassandra/conf/cassandra.yaml on both machines:
- seeds: "martians.eastus.cloudapp.azure.com,martiansext.eastus.cloudapp.azure.com"
On machine 1 only
listen_address: martians.eastus.cloudapp.azure.com
rpc_address: 0.0.0.0
rpc_port: 9160
start_rpc: true
broadcast_rpc_address: martians.eastus.cloudapp.azure.com
On machine 2 only
listen_address: martiansext.eastus.cloudapp.azure.com
rpc_address: 0.0.0.0
rpc_port: 9160
start_rpc: true
broadcast_rpc_address: martiansext.eastus.cloudapp.azure.com
Step 6:
Modify the /etc/hosts file to add the following entries:
10.8.8.8 martians.eastus.cloudapp.azure.com (private ips)
10.9.9.9 martiansext.eastus.cloudapp.azure.com
13.7.7.7 martians.eastus.cloudapp.azure.com (public ips)
40.6.66.66 martiansext.eastus.cloudapp.azure.com
Step 7:
systemctl restart cassandra
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.8.8.8 263.24 KB 256 100.0% 943b9686-ec28-49ed-a8f8-5223d88df8b1 rack1
UN 10.9.9.9 287.54 KB 256 100.0% 29a37841-5af4-47cd-9065-5a1c953274f1 rack1
Step 8:
cqlsh martians.eastus.cloudapp.azure.com
First create the keyspace, since a table cannot be created in a keyspace that does not exist:
CREATE KEYSPACE IF NOT EXISTS Device WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
CREATE TABLE Device.data (
Serial_number text,
dimension int,
param_reading1 text,
param_reading2 text,
param_readingN text,
PRIMARY KEY (Serial_number)
);
INSERT INTO Device.data (Serial_number, dimension, param_reading1, param_reading2, param_readingN) VALUES ('Boeing737', 45, 'LCM-100', 'HCM-200', 'TFT-400');
Run cqlsh martiansext.eastus.cloudapp.azure.com and query:
select * from Device.data;
You should see the record that has been added.
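The same check can be run non-interactively with cqlsh -e, which is handy for scripting (assuming the hosts from the steps above):

```shell
# Query the second node directly from the shell; -e runs one statement and exits.
cqlsh martiansext.eastus.cloudapp.azure.com -e "SELECT * FROM Device.data;"
```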
Creating a Spark Cluster:
Step 1:
Spawn 3 CentOS-based Linux VMs in the Azure cloud. One of these will serve as the master and the other 2 as slaves.
gravitymaster.eastus.cloudapp.azure.com
52.888.88.888 (public ip)
10.9.9.9 (private ip)
gravityslave1.eastus.cloudapp.azure.com
52.777.777.777
10.6.6.6
gravityslave2.eastus.cloudapp.azure.com
13.55.555.555
10.4.4.4
Ensure that you use the same user ID on all 3 machines, say gravitymaster.
Step 2:
Add the following 3 entries to the /etc/hosts file of gravitymaster.
52.888.88.888 gravitymaster.eastus.cloudapp.azure.com
52.777.777.777 gravityslave1.eastus.cloudapp.azure.com
13.55.555.555 gravityslave2.eastus.cloudapp.azure.com
Step 3:
Install Java on the master and slave machines:
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.rpm
yum localinstall jdk-8u131-linux-x64.rpm
sudo alternatives --config java
sudo sh -c "echo export JAVA_HOME=/usr/java/jdk1.8.0_131/jre >> /etc/environment"
java -version
Step 4:
Install Scala on the master and slave machines:
wget https://downloads.lightbend.com/scala/2.13.0/scala-2.13.0.rpm
yum install scala-2.13.0.rpm
scala -version
Step 5:
Configure passwordless ssh from master to the slave machines.
[gravitymaster@gravitymaster ~]$ ssh-keygen -t rsa
Press enter wherever prompted
Execute the following commands
ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com mkdir -p .ssh
cat .ssh/id_rsa.pub | ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com 'cat >> .ssh/authorized_keys'
ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"
ssh gravitymaster@gravityslave1.eastus.cloudapp.azure.com
exit
ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com mkdir -p .ssh
cat .ssh/id_rsa.pub | ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com 'cat >> .ssh/authorized_keys'
ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"
ssh gravitymaster@gravityslave2.eastus.cloudapp.azure.com
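The per-slave commands above are identical except for the hostname, so they can be collapsed into a loop:

```shell
# Push the master's public key to each slave and lock down permissions.
for host in gravityslave1.eastus.cloudapp.azure.com gravityslave2.eastus.cloudapp.azure.com; do
  ssh gravitymaster@${host} mkdir -p .ssh
  cat ~/.ssh/id_rsa.pub | ssh gravitymaster@${host} 'cat >> .ssh/authorized_keys'
  ssh gravitymaster@${host} 'chmod 700 .ssh; chmod 640 .ssh/authorized_keys'
done
```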
Step 6:
Install Spark on all 3 VMs:
wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar xvf spark-2.4.4-bin-hadoop2.7.tgz
mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark
vi ~/.bashrc
Add the following line to the ~/.bashrc file:
export PATH=$PATH:/usr/local/spark/bin
source ~/.bashrc
Step 7:
Make the following changes to the master server.
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
export SPARK_MASTER_HOST='<MASTER-IP>' (make sure this is the private IP of the Azure server)
export JAVA_HOME=/usr/java/jdk1.8.0_131/jre
Step 8:
Edit the list of slaves on the master machine (vi /usr/local/spark/conf/slaves) and add the following hosts:
gravitymaster.eastus.cloudapp.azure.com
gravityslave1.eastus.cloudapp.azure.com
gravityslave2.eastus.cloudapp.azure.com
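Rather than editing by hand, the file can be written in one shot (Spark 2.x reads the worker list from conf/slaves):

```shell
# Write the list of worker hosts that start-all.sh will ssh into.
cat > /usr/local/spark/conf/slaves <<'EOF'
gravitymaster.eastus.cloudapp.azure.com
gravityslave1.eastus.cloudapp.azure.com
gravityslave2.eastus.cloudapp.azure.com
EOF
```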
Step 9:
Start the Spark cluster using the following commands on the master:
$ cd /usr/local/spark
$ ./sbin/start-all.sh
Step 10:
Access the Spark web UI at:
http://gravitymaster.eastus.cloudapp.azure.com:8080/
In case of issues, check the master and slave logs located at:
/usr/local/spark/logs/spark-gravitymaster-org.apache.spark.deploy.master.Master-1-gravitymaster.eastus.cloudapp.azure.com.out
/usr/local/spark/logs/spark-gravitymaster-org.apache.spark.deploy.worker.Worker-1-gravityslave1.out
/usr/local/spark/logs/spark-gravitymaster-org.apache.spark.deploy.worker.Worker-1-gravityslave2.out
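A quick way to confirm the cluster actually schedules work is to submit the bundled SparkPi example. A sketch: 7077 is the default standalone master port, and the examples jar name matches the Spark 2.4.4 / Hadoop 2.7 build installed above.

```shell
# Submit SparkPi to the standalone master; look for "Pi is roughly ..." in the output.
/usr/local/spark/bin/spark-submit \
  --master spark://gravitymaster.eastus.cloudapp.azure.com:7077 \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_2.11-2.4.4.jar 10
```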
Step 11:
To stop the above processes:
$ cd /usr/local/spark
$ ./sbin/stop-all.sh
Or find the process IDs from the log files in the previous step and run:
kill -9 <processid>