
Monday, August 12, 2019

Machine Learning basics using Python


In this article you will learn the basics of machine learning using Python, such as loading a dataset and displaying the dataset and its attributes. Use the dataset given below.

numeric.csv dataset: click here to download.


Open colab.research.google.com



Loading the dataset using pandas
import pandas as pd     
df=pd.read_csv("numeric.csv") 
print(df)

Check the number of rows and columns present in the dataset
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)



Displaying column names and data
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_names=df.columns   
print(column_names)       #displays column names
data=df.values
print(data)               #displays data present in the dataset without headings
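For larger datasets, printing everything can be unwieldy. The sketch below shows one way to preview only the first rows and to get the raw values; it uses a small hypothetical DataFrame in place of numeric.csv, so the column names "a" and "b" are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for numeric.csv so the snippet runs on its own.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})

column_names = list(df.columns)   # column labels as a plain Python list
preview = df.head(2)              # first two rows, headings included
data = df.to_numpy()              # same content as df.values, as a NumPy array

print(column_names)
print(preview)
print(data.shape)                 # (rows, columns) of the raw data
```

With the real dataset, replace the hypothetical DataFrame with pd.read_csv("numeric.csv").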



Analyzing attribute values to identify qualitative or quantitative type
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)            # to display frequency of values in each attribute.
  print("value range", attribute_data.unique())
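The frequency counts above are read by eye. As a rough sketch, the same decision can be automated with a simple heuristic: non-numeric columns, or numeric columns with very few distinct values, are treated as qualitative. The sample data, the function name guess_attribute_type and the max_categories cut-off are all assumptions made for illustration, not part of the original dataset.

```python
import pandas as pd

# Hypothetical data standing in for numeric.csv.
df = pd.DataFrame({
    "height": [150.5, 162.0, 171.2, 158.9, 180.3, 165.4],
    "grade": [1, 2, 1, 3, 2, 1],
    "city": ["A", "B", "A", "C", "B", "A"],
})

def guess_attribute_type(series, max_categories=3):
    """Heuristic only: few distinct values or non-numeric data -> qualitative."""
    if not pd.api.types.is_numeric_dtype(series):
        return "qualitative"
    if series.nunique() <= max_categories:
        return "qualitative"
    return "quantitative"

# Classify every column of the DataFrame.
attribute_types = {name: guess_attribute_type(df[name]) for name in df.columns}
print(attribute_types)
```

A suitable cut-off depends on the dataset, so treat the result as a first guess and confirm it against the printed frequencies.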


Finding Outliers
import pandas as pd     
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)            # to display frequency of values in each attribute.
import seaborn as sns
sns.boxplot(x=df[column_name[0]])    # to draw boxplot for first column. Here, x denotes x axis


The boxplot shows 5 outliers. These values occur with low frequency in the first column.
To remove outliers, we can use the following approaches:

  • Z score approach
  • IQR approach
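The two approaches listed above can be sketched as follows. This is a minimal illustration on a hypothetical column of values, not on numeric.csv. The Z score cut-off of 3 and the 1.5 × IQR fences are common conventions assumed here, and should be tuned for real data.

```python
import pandas as pd

# Hypothetical column standing in for the first column of numeric.csv.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Z score approach: flag values far from the mean, measured in standard deviations.
z_threshold = 3.0  # assumed cut-off
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > z_threshold]

# IQR approach: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Removing outliers means keeping only the values inside the fences.
cleaned = values[(values >= lower) & (values <= upper)]

print("Z score outliers:", z_outliers.tolist())
print("IQR outliers:", iqr_outliers.tolist())
print("rows kept:", len(cleaned))
```

On small samples a single extreme value inflates the standard deviation, so the Z score approach may miss outliers that the IQR approach catches.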

Before removing outliers, we first have to identify them.
A heatmap of the values can help with this.

import pandas as pd
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)
  print("value range", attribute_data.unique())
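import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df[column_name[0]])    # boxplot for the first column
plt.figure()                         # new figure so the heatmap is not drawn over the boxplot
sns.heatmap(df,cbar=False)           # heatmap of all values in the dataset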


We can also inspect the dataset with the missingno module, which visualizes missing values. You have to install and import the required modules before using them.
import pandas as pd
df=pd.read_csv("numeric.csv") 
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns   

from collections import Counter
for i in range(columns):
  attribute_data=df[column_name[i]]
  cnt=Counter(attribute_data)
  cnt=dict(cnt)
  print("attribute " ,(i+1),cnt)
  print("value range", attribute_data.unique())

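import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df[column_name[0]])    # boxplot for the first column
plt.figure()                         # new figure so the heatmap is not drawn over the boxplot
sns.heatmap(df,cbar=False)           # heatmap of all values in the dataset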

import missingno as msno
msno.matrix(df)


We can also use the following, which draws a dendrogram grouping columns by how their missing values occur together.
msno.dendrogram(df)
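The missingno plots show where values are missing. The same information can be obtained as numbers with plain pandas. The sketch below uses a small hypothetical DataFrame, with column names "a", "b" and "c" assumed for illustration, rather than numeric.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries (NaN).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 7.0, 8.0],
    "c": [1, 2, 3, 4],
})

missing_per_column = df.isnull().sum()    # count of missing values in each column
missing_fraction = df.isnull().mean()     # share of missing values in each column
complete_rows = df.dropna()               # keep only rows with no missing values

print(missing_per_column)
print(missing_fraction)
print("complete rows:", len(complete_rows))
```

Dropping rows is only one option. Whether to drop or fill missing values depends on how much data is affected.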

Thank you...