Automation of Extracting JIRA Issues and Loading to Hive Table Using Python and Shell Script

Orchestrate the end-to-end flow of extracting JIRA issues (including custom fields) with Python and loading them into a Hive table with a shell script.

Requirements:

Python Modules:

jira

pandas
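
Both packages are available on PyPI and can typically be installed with pip (assuming a standard Python 3 environment):

pip install jira pandas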

 

Process:

The jira Python module extracts issues through the JIRA REST API; for more information see https://pypi.org/project/jira/.

The pandas module is used to convert the list of issues to a DataFrame and write it to a CSV file with a specified delimiter, which can then be loaded into the Hive table.

CODE:

 

from jira import JIRA

import pandas as pd

 

#Specify your company's base URL for the JIRA board

URL='https://xxx.jira.com'

 

#Authenticate to JIRA using basic authentication (username and password)

jira = JIRA(server=URL, basic_auth=('username', 'password'))
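#Note: JIRA Cloud instances generally require an API token in place of the password for basic auth
#(an assumption about your environment, not covered in the original post), e.g.:
#jira = JIRA(server=URL, basic_auth=('user@example.com', '<api_token>'))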

 

#Pull all the available issues in JIRA for the specified project.

blockSize=1000

blockNo=0

#initialize the list for adding the issues

jiraIssues=[]

while True:

    startIDX=blockNo*blockSize

    projectAllTkts=jira.search_issues('project=<project_code>',startIDX,blockSize)

    if len(projectAllTkts) == 0:

        break

       

    blockNo +=1

    for tkt in projectAllTkts:

        jiraIssues.append(tkt)

 

#Display the count of issues extracted

print(len(jiraIssues))

 

tktsDF=pd.DataFrame()

 

for tkt in jiraIssues:

    #specify dictionary to extract required fields from JIRA board

    fieldsDict={

    'PROJECT_CODE': tkt.fields.project.key,    #4 CHARS Project Code in Key

    'JIRA_TKT': tkt.key,                       #4 CHARS PROJECT CODE + NUMBER 

    'ISSUE_TYPE': tkt.fields.issuetype.name,  #Bug,Story etc

    'ASSIGNEE': tkt.fields.assignee,

    'CREATOR': tkt.fields.creator,

    'CREATED_DATE': tkt.fields.created,

    'REPORTER': tkt.fields.reporter,

    'SUMMARY': tkt.fields.summary,

    'PRIORITY': tkt.fields.priority.name,

    'STATUS': tkt.fields.status.name,

    'LAST_UPDATED_DATE': tkt.fields.updated

    }

   

    tktsDF=tktsDF.append(fieldsDict,ignore_index=True)

   

 

#Reorder the columns so the CSV layout matches the Hive table schema defined in the wrapper script below

colORDER=['CREATED_DATE','PROJECT_CODE','JIRA_TKT','ISSUE_TYPE','STATUS','PRIORITY','CREATOR','REPORTER','ASSIGNEE','SUMMARY','LAST_UPDATED_DATE']

tktsDF=tktsDF.reindex(columns=colORDER)

 

#Convert CREATED_DATE and LAST_UPDATED_DATE to date format

tktsDF['CREATED_DATE']=tktsDF['CREATED_DATE'].apply(lambda field:pd.Timestamp(field).strftime('%Y-%m-%d'))

tktsDF['LAST_UPDATED_DATE']=tktsDF['LAST_UPDATED_DATE'].apply(lambda field:pd.Timestamp(field).strftime('%Y-%m-%d'))

 

tktsDF.to_csv("all_jira_tkts.csv",encoding='utf-8',header=True,index=False,sep='|')
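
Note: DataFrame.append is deprecated and has been removed in recent pandas releases. On newer environments the same table can be built from a list of dictionaries instead; a minimal sketch under that assumption (it also converts the assignee/creator/reporter resources to their display names, which is an assumption about the desired output, and guards against unassigned issues):

rows = []
for tkt in jiraIssues:
    rows.append({
        'PROJECT_CODE': tkt.fields.project.key,
        'JIRA_TKT': tkt.key,
        'ISSUE_TYPE': tkt.fields.issuetype.name,
        'ASSIGNEE': tkt.fields.assignee.displayName if tkt.fields.assignee else None,
        'CREATOR': tkt.fields.creator.displayName if tkt.fields.creator else None,
        'CREATED_DATE': tkt.fields.created,
        'REPORTER': tkt.fields.reporter.displayName if tkt.fields.reporter else None,
        'SUMMARY': tkt.fields.summary,
        'PRIORITY': tkt.fields.priority.name if tkt.fields.priority else None,
        'STATUS': tkt.fields.status.name,
        'LAST_UPDATED_DATE': tkt.fields.updated
    })
tktsDF = pd.DataFrame(rows)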

 

 

Shell wrapper script to execute the Python script and load the data into a Hive external table.

 

#!/bin/ksh

TODAY=`date +'%Y%m%d'`

 

echo "starting shell script ${0} on ${TODAY}"

 

#Remove the previous run's output file

rm -f /output/all_jira_tkts.csv

 

#activate conda environment

export PATH="/app/minconda3/bin:$PATH"

source activate py_env37

 

#start of python script

python <python_script.py>

 

if [ $? != 0 ] ; then

        echo "ALERT: Encountered Error in Python Script Execution"

        exit 1

else

        echo "SUCESS: Python Script Compelted"

fi

 

#hive command to drop existing table

hive -S -e "DROP TABLE IF EXISTS schema.tbl_name;"

 

echo "Dropped Table "

 

#Remove the HDFS all_jira_tkts.csv file from the previous run

hdfs dfs -rm -skipTrash hdfs://<cluster>/user/jira/all_jira_tkts.csv

 

#copy created csv file from local to HDFS

chmod 777 /output/all_jira_tkts.csv

hdfs dfs -copyFromLocal /output/all_jira_tkts.csv hdfs://<cluster>/user/jira

 

#Verify the file copied successfully

hadoop fs -test -f hdfs://<cluster>/user/jira/all_jira_tkts.csv

if [ $? != 0 ] ; then

        echo "ALERT: Encountered Error in previous command csv File Not Present"

        exit 2

else

        echo "SUCESS: Copied File from Local to HDFS "

fi

 

#Create an external table whose schema matches the columns in all_jira_tkts.csv

hive -S -e "CREATE EXTERNAL TABLE IF NOT EXISTS schema.tbl_name (created_date date,

project_code string,jira_issue string,issue_type string,status string,priority string,

creator string,reporter string,assignee string,summary string,last_updated_date date)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '|'

STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION 'hdfs://<cluster>/user/jira'

tblproperties ('skip.header.line.count'='1');

"

if [ $? != 0 ] ; then

        echo "ALERT: Encountered Error in creating external table"

        exit 3

else

        echo "SUCESS: Created External Table "

 

fi
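
#Optional sanity check (not in the original script): confirm the load with a quick row count

hive -S -e "SELECT COUNT(*) FROM schema.tbl_name;"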

 

echo "Completed Execution of Script ${0}"

 

 

 

 

 

Schedule it to run twice a day, at 8:45 AM and 12:45 PM, on the Hadoop edge node using the cron scheduler.

 

crontab -e

 

45 8,12 * * * /<wrapperScriptwithloc>.sh > /log/$(date +\%Y\%m\%d\%H\%M\%S)_<wrapperScript>.log 2>&1

 

 

 
