In this article you will learn the basics of machine learning using Python, such as loading a dataset and displaying the dataset and its attributes. Use the dataset given below.
Numeric.csv dataset Click here to download.
Open colab.research.google.com
Loading the dataset using pandas
import pandas as pd
df=pd.read_csv("numeric.csv")
print(df)
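When the table is long, printing all of it is rarely useful. A minimal sketch of inspecting only the first rows and the column types is shown here. The small inline CSV and its column names (age, score) are made up for illustration and stand in for numeric.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for numeric.csv so the example runs anywhere.
csv_text = "age,score\n21,3.5\n35,4.0\n42,2.5\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first two rows only
print(df.dtypes)   # data type pandas inferred for each column
```

With the real file, the same two calls can be applied to the DataFrame returned by pd.read_csv("numeric.csv").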
Check the number of rows and columns present in the dataset
import pandas as pd
df=pd.read_csv("numeric.csv")
rows,columns=df.shape
print("number of rows:", rows)
print("number of columns:", columns)
Displaying column names and data
import pandas as pd
df=pd.read_csv("numeric.csv")
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_names=df.columns
print(column_names) #displays column names
data=df.values
print(data) #displays data present in the dataset without headings
Analyzing attribute values to identify qualitative or quantitative type
import pandas as pd
df=pd.read_csv("numeric.csv")
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns
from collections import Counter
for i in range(columns):
    attribute_data=df[column_name[i]]
    cnt=Counter(attribute_data)
    cnt=dict(cnt)
    print("attribute " ,(i+1),cnt) # displays the frequency of each value in the attribute
    print("value range", attribute_data.unique()) # displays the distinct values in the attribute
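One rough way to turn these frequency tables into a qualitative or quantitative decision is to look at each column's data type and its number of distinct values. The sketch below is only an assumption about how that decision could be automated, not a rule built into pandas. The sample columns, the function name attribute_type and the max_categories threshold are all hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def attribute_type(series, max_categories=5):
    """Heuristic guess: numeric columns with many distinct values are quantitative."""
    if is_numeric_dtype(series) and series.nunique() > max_categories:
        return "quantitative"
    return "qualitative"

# Hypothetical sample data, not taken from numeric.csv.
df = pd.DataFrame({
    "height_cm": [150, 162, 171, 158, 180, 175, 169],
    "grade": [1, 2, 1, 3, 2, 1, 3],
    "city": ["A", "B", "A", "C", "B", "A", "C"],
})

types = {name: attribute_type(df[name]) for name in df.columns}
print(types)
```

A numeric column with only a few repeated values (such as grade here) is treated as qualitative, which is why the threshold should be chosen after looking at the frequency output above.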
Finding Outliers
import pandas as pd
df=pd.read_csv("numeric.csv")
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns
from collections import Counter
for i in range(columns):
    attribute_data=df[column_name[i]]
    cnt=Counter(attribute_data)
    cnt=dict(cnt)
    print("attribute " ,(i+1),cnt) # displays the frequency of each value in the attribute
import seaborn as sns
sns.boxplot(x=df[column_name[0]]) # draws a boxplot for the first column; x places the values on the x axis
You can see 5 outliers in the boxplot. These values occur with low frequency in the first column.
To remove outliers, we can use the following approaches.
- Z-score approach
- IQR approach
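Both approaches can be sketched in a few lines of pandas. The sample values below are hypothetical (95 is a planted extreme value), and the cut-offs used, 3 standard deviations for the Z-score and 1.5 times the IQR, are common conventions rather than values taken from this article.

```python
import pandas as pd

# Hypothetical sample column; 95 is the planted extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95])

# Z-score approach: distance of each value from the mean, in standard deviations.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

# IQR approach: values more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Removing the outliers means keeping only the values inside the fences.
cleaned = values[(values >= lower) & (values <= upper)]

print("Z-score outliers:", z_outliers.tolist())
print("IQR outliers:", iqr_outliers.tolist())
```

On very small samples the Z-score of a single extreme value is bounded, so the IQR approach is often the safer choice there.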
To remove outliers, we first have to identify them. For this, we can use a heatmap.
import pandas as pd
df=pd.read_csv("numeric.csv")
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns
from collections import Counter
for i in range(columns):
    attribute_data=df[column_name[i]]
    cnt=Counter(attribute_data)
    cnt=dict(cnt)
    print("attribute " ,(i+1),cnt)
    print("value range", attribute_data.unique())
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df[column_name[0]])
plt.figure() # start a new figure so the heatmap is not drawn over the boxplot
sns.heatmap(df,cbar=False)
We can also use the missingno library to inspect the dataset. Note that it visualizes missing values rather than outliers, and the required modules must be imported before use.
import pandas as pd
df=pd.read_csv("numeric.csv")
rows,columns=df.shape
print( "number of rows:",rows)
print( "number of columns:",columns)
column_name=df.columns
from collections import Counter
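for i in range(columns):
    attribute_data=df[column_name[i]]
    cnt=Counter(attribute_data)
    cnt=dict(cnt)
    print("attribute " ,(i+1),cnt)
    print("value range", attribute_data.unique())
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df[column_name[0]])
plt.figure() # start a new figure so the heatmap is not drawn over the boxplot
sns.heatmap(df,cbar=False)
import missingno as msno
msno.matrix(df) # draws a matrix showing where values are missing in each column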
We can also draw a dendrogram, which groups columns by how similarly their missing values are distributed:
msno.dendrogram(df)
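The missingno plots show missing values visually. The same information can be read as plain numbers with pandas alone. This is a minimal sketch on a hypothetical three-column table, not on numeric.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few missing entries (NaN).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 7.0, 8.0],
    "c": [1, 2, 3, 4],
})

missing_per_column = df.isna().sum()  # count of missing values in each column
rows_complete = df.dropna()           # keep only rows without any missing value

print(missing_per_column)
print("complete rows:", len(rows_complete))
```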
Thank you...