From unique values of first column, get all the values of the 2nd column and print them in the same line by separating them with commas

The what:

I have a file as below:

$ cat sample.csv 
DEPT_ID,EMP_NAME
100,Keshava
100,Madhava
100,Narayana
100,Govinda
100,Vishnu
110,Madhusudana
110,Trivikrama
110,Vamana
110,Shridhara
110,Hrishikesha
110,Padmanabha
120,Damodara
120,Sankashana
120,Vasudeva
120,Pradyumna
130,Aniruddha
130,Purushottama
130,Adhokshaja
130,Narasimha
140,Achyuta
140,Janardana
140,Upendra
140,Hari
140,Krishna

 The challenge is to make it look like below:

$ cat sample.csv 
DEPT_ID: EMP_NAME (SEPARATED BY COMMA)
100: Keshava, Madhava, Narayana, Govinda, Vishnu
110: Madhusudana, Trivikrama, Vamana, Shridhara, Hrishikesha, Padmanabha
120: Damodara, Sankashana, Vasudeva, Pradyumna, Aniruddha
130: Purushottama, Adhokshaja, Narasimha
140: Achyuta, Janardana, Upendra, Hari, Krishna


The How:

Method 1 - Using pure shell commands

tail -n +2 sample.csv | cut -d "," -f1 | sort -u | while read dept_id; do emp_names=$(grep "${dept_id}" sample.csv | cut -d "," -f2 | tr '\n' ','); echo $dept_id: $emp_names; done | sed -e 's/,$//g'
100: Keshava,Madhava,Narayana,Govinda,Vishnu
110: Madhusudana,Trivikrama,Vamana,Shridhara,Hrishikesha,Padmanabha
120: Damodara,Sankashana,Vasudeva,Pradyumna
130: Aniruddha,Purushottama,Adhokshaja,Narasimha
140: Achyuta,Janardana,Upendra,Hari,Krishna

Explanation:

The logic:

  1. Get the unique dept_id values
  2. Using the unique dept_id values, get the employee names who belong to one particular department

1. tail -n +2 sample.csv ==> Print from the second line of the file (since the first line contains the column names, the first line can be ignored). 
2. cut -d"," -f1 ==> Get the first column of the CSV file by splitting it up and printing only the first column
3. sort -u ==> Sort the resultant output (by default, alphabetically) and print only the unique values (-u)
### From Steps 1-3, I have obtained the unique dept_id values. Now, for each dept_id, get the list of employees (EMP_NAME column) and separate them using comma
4. while read dept_id ==> Looping through the unique values of the dept_id to retrieve the EMP_NAME
5. emp_names=$(grep "${dept_id}" sample.csv | cut -d "," -f2 | tr '\n' ',') ==> Using variable emp_names for the sake of simplicity

grep "${dept_id}" sample.csv | cut -d "," -f2 ==> The command "grep" gets the dept_id and emp_name for a particular dept_id and "cut" command will get the list of all EMP_NAME columns for the given dept_id

tr '\n' ',' ==> This command converts the emp_names from
nnewline (\n) separated to comma separated.
6. echo $dept_id: $emp_names ==> prints out the dept_id and the related emp_names

 

Method 2 - Using Python

# Import the csv package
import csv

# Declare source and target files as variables
source_file_name = '/full/path/of/source/file/sample.csv'

# Open the file for "read"
with open(source_file_name, 'r') as fh:
    lines = fh.readlines()

# Print the lines to ensure that it's read properly. Comment out after testing.
# print(type(lines), "\n", lines[1:])

# Store the contents of the converted file in a dict object
single_line = {}

# Loop through the source file
for line in lines[1:]:

    # Split the contents of each line based on comma. The first column would be dept_id and the 2nd - emp_name
    dept_id, emp_name = str(line).split(',')

    # Strip each column value to ensure that there are no special/extra characters.
    dept_id = str(dept_id).strip('\r').strip()
    emp_name = str(emp_name).strip('\r').strip()

    # Check if dept_id already exists in the single_line dictionary. If it does, then append the emp_name to the
    # existing value. If it doesn't, then enter a new value with the dept_id as key and emp_name as value
    if dept_id in single_line:
        single_line[dept_id].append(emp_name)
    else:
        single_line[dept_id] = [emp_name]

# Print the dict object to ensure that it's converted properly. Comment out after testing.
print("\nsingle_line Dict object Contents:\n-----------------------------\n",
single_line, "\n\nOutput in Custom Format:\n-----------------------------\n")

# Loop through the dictionary object - single_line and print the values.
for key, v in single_line.items():
    # print(key)
    # Concatenate all the values of emp_name, separated by comma, for one dept_id into a single variable
    val = ','.join(v)
    print(key,":",val)
print("\n\n")



Output:


$ python csv_to_custom_format.py

single_line Dict object Contents:
-----------------------------
 {'100': ['Keshava', 'Madhava', 'Narayana', 'Govinda', 'Vishnu'], '110': ['Madhusudana', 'Trivikrama', 'Vamana', 'Shridhara', 'Hrishikesha', 'Padmanabha'], '120': ['Damodara', 'Sankashana', 'Vasudeva', 'Pradyumna'], '130': ['Aniruddha', 'Purushottama', 'Adhokshaja', 'Narasimha'], '140': ['Achyuta', 'Janardana', 'Upendra', 'Hari', 'Krishna']} 

Output in Custom Format:
-----------------------------

100 : Keshava,Madhava,Narayana,Govinda,Vishnu
110 : Madhusudana,Trivikrama,Vamana,Shridhara,Hrishikesha,Padmanabha
120 : Damodara,Sankashana,Vasudeva,Pradyumna
130 : Aniruddha,Purushottama,Adhokshaja,Narasimha
140 : Achyuta,Janardana,Upendra,Hari,Krishna

Comments

Popular posts from this blog

java.lang.ExceptionInInitializerError while trying to Access login page

Solution to "End Program - WMS Idle"

WGET shell Script for downloading patches