Monday, August 12, 2019

Machine Learning basics using Python


In this article you will learn the basics of machine learning using Python, such as loading a dataset and displaying the dataset and its attributes. Use the dataset given below.

Numeric.csv dataset: click here to download.


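Open colab.research.google.com, create a new notebook, and upload the dataset file to the notebook session so that the code below can read it.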



Loading the dataset using pandas
import pandas as pd     
df=pd.read_csv("numeric.csv") 
print(df)

Check the number of rows and columns present in the dataset
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)



Displaying column names and data
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_names=df.columns   
print(column_names)       #displays column names
data=df.values
print(data)               #displays data present in the dataset without headings
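
Printing the whole dataset becomes unwieldy once it has many rows. A minimal sketch of a shorter preview is given below; the CSV text and the column names x and y are made up for illustration, and in the notebook you would call the same methods on df.

```python
import io
import pandas as pd

# Made-up CSV text standing in for numeric.csv (assumption for illustration)
csv_text = "x,y\n1,2.5\n3,4.5\n5,6.5\n"
sample = pd.read_csv(io.StringIO(csv_text))

preview = sample.head(2)          # first two rows only
column_types = sample.dtypes      # data type pandas inferred for each column
print(preview)
print(column_types)
```

Here head() limits the output to a few rows, and dtypes shows whether pandas read each column as integers, floats, or text.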



Analyzing attribute values to identify qualitative or quantitative type
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)            # to display frequency of values in each attribute.
  print("value range", attribute_data.unique())
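
The frequency counts above can be turned into a rough rule of thumb. The sketch below is one possible way to label each column, not a definitive test: the helper name attribute_type, the cutoff of 10 distinct values, and the small example table are all assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def attribute_type(series, max_categories=10):
    """Guess whether a column is qualitative or quantitative.

    Hypothetical helper: non-numeric columns, and numeric columns with
    only a few distinct values, are treated as qualitative.
    """
    if not is_numeric_dtype(series):
        return "qualitative"
    if series.nunique() <= max_categories:
        return "qualitative"
    return "quantitative"

# Made-up table standing in for numeric.csv
sample = pd.DataFrame({
    "grade": ["a", "b", "a", "c"] * 5,                 # text labels
    "rating": [1, 2, 3, 1] * 5,                        # numbers, but only 3 distinct values
    "height": [150.0 + 1.5 * i for i in range(20)],    # 20 distinct measurements
})

types = {name: attribute_type(sample[name]) for name in sample.columns}
print(types)
```

The cutoff is a judgment call. A numeric column with few distinct values may still be a genuine measurement, so check the result against what each attribute means.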


Finding Outliers
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)            # to display frequency of values in each attribute.
import seaborn as sns
sns.boxplot(x=df[column_name[0]])    # to draw boxplot for first column. Here, x denotes x axis


You can see 5 outliers in the boxplot, drawn as separate points beyond the whiskers. These values occur less frequently in the first column.
To remove outliers, we can use the following approaches.

  • Z score approach
  • IQR approach
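
Both approaches can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions: the function names, the threshold of 3 for the Z score, the factor 1.5 for the IQR, and the example data are choices made here, not values taken from numeric.csv.

```python
import pandas as pd

def remove_outliers_zscore(df, column, threshold=3.0):
    """Keep rows whose z-score in `column` is within +/- threshold."""
    values = df[column]
    std = values.std(ddof=0)           # population standard deviation
    if std == 0:
        return df.copy()               # all values equal, nothing to drop
    z = (values - values.mean()) / std
    return df[z.abs() <= threshold]    # rows with missing values are dropped too

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]   # bounds are inclusive

# Made-up data: twenty ordinary values and one extreme value
sample = pd.DataFrame({"value": list(range(10, 30)) + [500]})

by_z = remove_outliers_zscore(sample, "value")
by_iqr = remove_outliers_iqr(sample, "value")
print(len(sample), len(by_z), len(by_iqr))
```

The Z score approach measures distance from the mean in standard deviations, so a very large outlier can inflate the standard deviation and hide smaller ones. The IQR approach depends only on the quartiles and is less affected by extreme values.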

Before removing outliers, we first have to identify them.
For this, we can use a heatmap.

import pandas as pd
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)
  print("value range", attribute_data.unique())
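
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df[column_name[0]])    # boxplot of the first column
plt.figure()                         # start a new figure so the heatmap is not drawn over the boxplot
sns.heatmap(df,cbar=False)           # heatmap of all values in the dataset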


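We can also use the missingno module to inspect the dataset, but you have to import it before using it. Its matrix plot shows where values are missing in each column, which complements the outlier plots above.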
import pandas as pd
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)
  print("value range", attribute_data.unique())

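import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df[column_name[0]])    # boxplot of the first column
plt.figure()                         # start a new figure so the heatmap is not drawn over the boxplot
sns.heatmap(df,cbar=False)           # heatmap of all values in the dataset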

import missingno as msno
msno.matrix(df)


We can also draw a dendrogram, which groups columns according to how similarly their missing values occur.
msno.dendrogram(df)
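
The missingno plots are visual. As a plain-text cross-check, pandas can count the missing values in each column. This is a minimal sketch on a made-up table; the column names a, b and c are assumptions, and in the notebook you would call the same methods on df.

```python
import numpy as np
import pandas as pd

# Made-up table with a few missing entries
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [8.0, 9.0, 10.0, 11.0],
})

missing_per_column = sample.isna().sum()      # number of missing values in each column
missing_share = sample.isna().mean() * 100    # percentage of rows missing in each column
print(missing_per_column)
print(missing_share.round(1))
```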

Thank you...
