Sunday, March 31, 2024

STAIR FIVE: RANDOM FOREST

RANDOM FOREST 

Is it similar to the forests we visit in our free time???

Random forest is a supervised learning algorithm. It builds a "forest" from an ensemble of decision trees, usually trained with the bagging method. The basic idea of bagging is that combining many learning models improves the overall result.

The random forest algorithm extends bagging by adding feature randomness on top of it, producing an uncorrelated forest of decision trees. Feature randomness, sometimes referred to as "the random subspace method" or "feature bagging", generates a random selection of features, which guarantees low correlation among the decision trees. This is a significant distinction between random forests and decision trees: a decision tree considers all potential feature splits, whereas each tree in a random forest chooses from merely a portion of those features.

Random Forest


How Exactly Does Random Forest Work?
Three primary hyperparameters must be set before training a random forest: node size, the number of trees, and the number of features sampled. The random forest classifier can then be used to solve both regression and classification problems.

The random forest technique is an ensemble of decision trees, and each tree in the ensemble is built from a bootstrap sample, i.e. a sample drawn from the training set with replacement. About one-third of the training samples are left out of each bootstrap sample and can serve as test data; these are referred to as the out-of-bag (OOB) samples, which we will discuss again below.
 
Feature bagging then introduces yet another layer of randomness, increasing dataset diversity and decreasing the correlation among the decision trees. How the prediction is determined depends on the type of problem: in a regression task the outputs of the individual decision trees are averaged, while in a classification task the predicted class is decided by a majority vote, i.e. the most frequent categorical variable.
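To make this concrete, here is a minimal sketch (my own illustration, not taken from the notebook linked below) using scikit-learn's RandomForestClassifier; the toy dataset from make_classification is an assumption standing in for any labelled training set:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 500 labelled samples with 10 features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features sampled at each split (feature bagging)
    min_samples_leaf=2,   # node size: minimum samples required in a leaf
    oob_score=True,       # score the model on the out-of-bag samples
    random_state=42,
)
model.fit(X, y)
print("OOB accuracy:", model.oob_score_)  # estimated from samples left out of each bootstrap

The three hyperparameters discussed above map directly onto n_estimators, min_samples_leaf, and max_features, and oob_score=True reuses the out-of-bag samples as a built-in validation set.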

Random Forest with Classification and Regression:
The hyperparameters of a random forest are almost identical to those of a decision tree or a bagging classifier. Fortunately, the random forest classifier class can be used directly, eliminating the need to combine a decision tree with a bagging classifier. Random forest can also handle regression tasks by using the regressor variant of the method.

Random forest introduces extra randomness into the model as the trees grow. When splitting a node, it looks for the best feature within a randomly selected subset of features rather than the most significant feature overall. This leads to a great deal of diversity among the trees and, in general, a better model.

As such, the process for splitting a node in a random forest classifier considers only a random subset of the features. Using random thresholds for each feature, in addition to searching for the optimal thresholds (as a typical decision tree does), is another way to make the trees even more random.
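To round out this section, a small sketch of the regressor variant mentioned above, again on synthetic data of my own choosing:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))  # each prediction is the average of all 200 trees' outputs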

Random Forest Applications:
Several industries have used the random forest algorithm to help them make better business decisions. Among the use cases are:

Finance: this method is preferred over others because it saves time on data administration and pre-processing tasks. It can be used to assess high-risk credit applicants, detect fraud, and tackle option-pricing problems.
Healthcare: the random forest approach can be used to solve problems in gene expression classification, biomarker discovery, and sequence annotation in computational biology. It also allows doctors to estimate drug responses to specific medications.
E-commerce: it is used for recommendation engines and cross-selling.

Random Forest Advantages:
  • Adaptable applications.
  • Simple-to-read hyperparameters.
  • Adding more trees does not make the classifier overfit.
Random Forest Disadvantages:
  • More trees are needed for higher accuracy.
  • More trees make the model slower.
  • Unable to describe relationships within the data.
So, up to this point we have gone through supervised machine learning, covering classification and regression and their respective algorithms, with plenty of information on how they work, their applications, uses, advantages and disadvantages, etc. From here on we move to unsupervised machine learning. I hope you readers have really enjoyed the journey so far, and I hope you also enjoy the next blogs on this same channel!
As I said previously, during this journey I will take readers on a hands-on journey. I am providing a link to a freely accessible folder where I have posted various documents in which I have implemented the models at the simplest level; any beginner can easily understand them.
Those models are implemented in a Jupyter Notebook, a platform for implementing Python projects.

Kindly refer to the link provided below:

References:
[1] https://medium.com/@roiyeho/random-forests-98892261dc49

    * THANK YOU *

    STAIR FOUR: DECISION TREE

    DECISION TREE

What is a Decision Tree? Does it sound like our normal trees????
    • Although decision trees are a supervised learning technique, they are primarily employed to solve classification problems. 
    • However, they can also be used to solve regression problems. 
    • This classifier is tree-structured, with internal nodes standing in for dataset attributes, branches for decision rules, and leaf nodes for each outcome.
    • The Decision Node and the Leaf Node are the two nodes that make up a decision tree. 
    • While leaf nodes represent the result of decisions and do not have any further branches, decision nodes are used to make decisions and have numerous branches. The characteristics of the provided dataset are used to inform the decisions or the test.
    • It is a graphical tool that shows all of the options for solving a problem or making a decision given certain parameters.
    • It is named a decision tree because, like a tree, it begins with the root node and grows on subsequent branches to form a structure like a tree.
    • The Classification and Regression Tree algorithm, or CART algorithm, is what we use to construct trees.
    • A decision tree only poses a question, and then divides the tree into subtrees according to the response (Yes/No).
    • The diagram below explains the general structure of a decision tree:


    Decision Tree
    Terminologies for Decision Trees:
    • Root Node: The decision tree originates at the root node. It depicts the complete dataset, which is then split up into two or more sets of similar data.
    • Leaf Node: After obtaining a leaf node, the tree cannot be further divided; leaf nodes are the ultimate output nodes.
    • Splitting: The process of splitting the decision node/root node into sub-nodes in accordance with the specified parameters is known as splitting.
    • Branch/Sub Tree: a subtree formed by splitting off part of the main tree.
    • Pruning: the process of removing unwanted branches from the tree.
    • Parent/Child Node: a node that splits into sub-nodes is called a parent node, and its sub-nodes are child nodes; the root node is the topmost parent.
    What is the Process of the Decision Tree Algorithm?

To predict the class of a given dataset, the procedure in a decision tree begins at the root node. The algorithm compares the value of the root attribute with the corresponding attribute of the record (the actual dataset), follows the matching branch, and advances to the next node.

The decision tree process can be followed in these steps (a minimal code sketch follows the list):
    1. Start the tree at the root node, which has the entire dataset.
    2. Use the Attribute Selection Measure (ASM) to determine which attribute in the dataset is the best.
    3. Divide S into subsets that contain possible values for the best attribute.
    4. Generate a decision tree node that contains the best attribute.
    5. Using the dataset subsets generated in step 3, recursively develop new decision trees. This process should be continued until the nodes can no longer be classified further; at this point, the final node is referred to as a leaf node.
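Here is a minimal sketch of these steps using scikit-learn (the Iris dataset is just a stand-in example of my own choosing; criterion="entropy" selects information gain, the attribute selection measure discussed next):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Prints the root node, the splits chosen at each decision node, and the leaves
print(export_text(tree))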
    Attribute Selection Measures:
The primary problem that arises while implementing a decision tree is figuring out which attribute is ideal for the root node and its child nodes. To address this, a method known as the attribute selection measure (ASM) has been developed. Using this measure, we can easily choose the best attribute for the tree's nodes. There are two widely used ASM techniques:

    A] Information Gain:
    • The measurement of changes in entropy following the attribute-based dataset segmentation is known as information gain.
    • It determines the amount of knowledge a feature gives us about a class.
    • We divide the node and create the decision tree based on the information gain value.
    • A node or attribute with the largest information gain is split first in a decision tree algorithm, which always seeks to maximise the value of information gain. It can be calculated using the below formula:
    Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

    Entropy: a metric used to quantify the impurity of a given attribute; it describes the randomness in the data. It can be calculated as:
    E(S) = −Σ p_i log2(p_i)
    where E(S) denotes the entropy of the set S and p_i denotes the probability of class i.
    B] Gini Index:
    • When building a decision tree using the CART (Classification and Regression Tree) technique, the Gini index is a measure of purity or impurity.
    • It is better to choose an attribute with a low Gini index over one with a high index.
    • The CART algorithm only produces binary splits, and it does so by utilising the Gini index.
    • It can be calculated as:
    Gini Index = 1 − Σ (p_i)²
    where p_i is the probability of an object being classified to a particular class.
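Both measures are easy to compute by hand. A small sketch (with toy labels of my own choosing) implementing the formulas above:

import numpy as np

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini Index = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # IG = Entropy(S) - weighted average entropy of the child nodes
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

labels = np.array([1, 1, 1, 0, 0, 1])  # toy node: four "yes", two "no"
print("entropy:", entropy(labels), "gini:", gini(labels))
print("gain of a split:", information_gain(labels, labels[:3], labels[3:]))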



    Advantages of the Decision Tree:
    • It is easy to understand, since it follows the same procedure a person uses when making decisions in the real world.
    • It can be quite helpful in resolving decision-related problems.
    • It helps in considering every possible outcome of a problem.
    • Compared to other algorithms, less data cleaning is required.
    Disadvantages of the Decision Tree:
    • The decision tree is complicated since it has several tiers.
    • It might have an overfitting problem, which the Random Forest method can remedy.

    In the provided notebook, decision tree and random forest models are implemented on a dataset of employees, and on this basis the employees are categorized by the tree.
     
    As I said previously, during this journey I will take readers on a hands-on journey. I am providing a link to a freely accessible folder where I have posted various documents in which I have implemented the models at the simplest level; any beginner can easily understand them.
    Those models are implemented in a Jupyter Notebook, a platform for implementing Python projects.

    Kindly refer to the link provided below:


    STAIR THREE: LOGISTIC REGRESSION

    LOGISTIC REGRESSION

    Logistic regression is a popular supervised machine learning technique for binary classification tasks. It may be used to determine whether an email is spam or not, or to diagnose illnesses by determining whether certain symptoms are present or absent based on patients' test results.
    • This method converts a linear combination of input features into a probability value between 0 and 1 by using the logistic (or sigmoid) function.
    •  This probability shows how likely it is that an input falls into one of the two predetermined categories. 
    • The capacity of the logistic function to accurately describe the probability of binary events is the basis of the fundamental mechanism of logistic regression. 
    • The logistic function efficiently transfers every real-valued number to a value between 0 and 1 thanks to its characteristic S-shaped curve. 

    Logistic Regression [1]

    Sigmoid Function [2]



    Logistic Function:
    A mathematical function called the sigmoid function is used to convert predicted values into probabilities.
    It maps any real number to a value between 0 and 1.
    The result of logistic regression must lie between 0 and 1; since it cannot exceed these bounds, its curve takes the shape of an "S".
    Logistic regression uses the idea of a threshold value, which indicates the likelihood of either 0 or 1. For example, values above the threshold tend towards one, and those below the threshold tend towards zero.
    The dependent variable must be categorical.

    Because of this property, logistic regression is especially well suited to binary classification jobs such as classifying emails as "spam" or "not spam". By estimating the likelihood that the dependent variable falls into a particular group, it offers a probabilistic framework that facilitates well-informed decision-making.
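A minimal sketch of this in scikit-learn (the synthetic data is my own stand-in for a real spam dataset); predict_proba returns the sigmoid output, which we threshold at 0.5:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

clf = LogisticRegression()
clf.fit(X, y)

proba = clf.predict_proba(X[:5])[:, 1]   # probability of class 1, between 0 and 1
labels = (proba >= 0.5).astype(int)      # apply the threshold
print(proba, labels)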


    Comparison with Linear Regression:

    Similar to linear regression, logistic regression is a specific example of the generalized linear model. However, compared to linear regression, the logistic regression model is predicated on quite different assumptions regarding the relationship between the dependent and independent variables. 

    • Linear regression is used to predict a continuous dependent variable from a given set of independent variables, whereas logistic regression is used to predict a categorical dependent variable.
    • Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
    • In linear regression, we predict the values of continuous variables; in logistic regression, we predict the values of categorical variables.
    • The output of linear regression must be a continuous value, such as a price or an age; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.


    In the notebook, logistic regression is implemented on a heart disease prediction dataset for residents of the city of Framingham. The dataset contains smoking status, cigarettes smoked per day, BMI, heart rate, blood pressure, diabetes, and various other factors and habits. Based on those factors, the model predicts whether a person will get heart disease or not.

    As I said previously, during this journey I will take readers on a hands-on journey. I am providing a link to a freely accessible folder where I have posted various documents in which I have implemented the models at the simplest level; any beginner can easily understand them.
    Those models are implemented in a Jupyter Notebook, a platform for implementing Python projects.

    Kindly refer to the link provided below:

    References:
    [1] https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-logistic-regression/

    STAIR TWO: Model Evaluation Metrics

    Model Evaluation Metrics



    Before moving on to the next model, we have to understand how a model is evaluated. Whenever we build a model, we already have some inputs, and after building the model we get some output according to those inputs. The output depends on the various factors present in the inputs. Once we get the output, we have to evaluate the model we built. Measuring the performance of the trained model is just as crucial as preparing the data and training the machine learning model, which are both essential steps in the process. Machine learning models are classified as adaptive or non-adaptive based on how well they generalise to new data.
    Before we deploy a model to production on unseen data, we should enhance its overall predictive capacity by evaluating its performance with several measures. If a machine learning model is evaluated solely on accuracy rather than on several metrics, it may cause issues when applied to unknown data and produce inaccurate predictions. This happens because in such situations the model memorises the data instead of learning from it, and therefore cannot generalise adequately to unseen data. With the purpose of evaluation measures defined, let's look at how a machine learning model's performance is measured, which is a crucial part of any data science project: the goal is to estimate the model's generalisation accuracy on future (unseen/out-of-sample) data.


    A] Confusion Matrix:
    A confusion matrix describes the performance of a classification model (also known as a "classifier") on a set of test data for which the true values are known. This matrix representation of the prediction outcomes is frequently employed for any binary test.
    Although the confusion matrix itself is not too difficult to understand, the associated terminology can cause some confusion.
    Confusion Matrix [1]

    Each prediction falls into one of four outcomes, based on how it compares with the actual value (a short code sketch follows the list):
    • True Positive (TP): Predicted True and True in reality.
    A true positive is simply the situation in which both the predicted and actual values are true: the model predicted that the patient had cancer, and the patient has indeed been diagnosed with cancer.
    • True Negative (TN): Predicted False and False in reality.
    In this instance, both the predicted value and the actual value are false. Stated differently, our model predicted that the patient did not have cancer, and the patient has not been diagnosed with cancer.
    • False Positive (FP): Predicted True and False in reality. 
    A false positive occurs when the predicted value is true while the actual value is false: the model predicted that the patient had cancer, but the patient does not actually have cancer.
    • False Negative (FN): Predicted False and True in reality.
    In this instance, the predicted value is false while the actual value is true: the model predicted that the patient did not have cancer, but the patient actually does.
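The four counts can be read straight out of scikit-learn's confusion_matrix; the tiny label lists below are made up for illustration (1 = has cancer, 0 = does not):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)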

    B] Performance Metrics:

    Performance metrics for:
    1] Classification:
    1: Accuracy:
    One of the easiest classification metrics to use is accuracy, which is calculated as the ratio of correct predictions to total predictions.
    Accuracy = (TP + TN) / (TP + TN + FP + FN)  [2]
    2: Precision:
    Precision is the proportion of the positive predictions generated by our model that actually come true.
    Precision = TP / (TP + FP)  [3]
    3: Recall and Sensitivity:
    Recall is simply a metric that indicates what percentage of the patients who actually have cancer were also predicted to have cancer. It answers the question: "How sensitive is the classifier in detecting positive instances?"
    Recall = TP / (TP + FN)  [4]
    4: Specificity:
    It answers the question: "How specific or selective is the classifier in predicting positive cases?"
    Specificity = TN / (TN + FP)  [5]

    5: F1 Score:
    This is nothing but the harmonic mean of precision and recall.
    F1 Score = 2 × (Precision × Recall) / (Precision + Recall)  [6]
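All five classification metrics follow from those four counts. A short sketch reusing the made-up labels from the confusion-matrix example (specificity has no dedicated scikit-learn helper, so it is computed from TN and FP directly):

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy:   ", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("precision:  ", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("recall:     ", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("specificity:", tn / (tn + fp))                   # TN/(TN+FP)
print("f1 score:   ", f1_score(y_true, y_pred))         # harmonic mean of P and R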

    2] Regression:
    1: Mean Absolute Error:
    One of the most basic metrics, mean absolute error (MAE) quantifies the absolute difference between the actual and predicted values; "absolute" means each difference is taken as a positive number.
    MAE = (1/N) × Σ |y − ŷ|  [7]
    2: Mean Squared Error:
    Mean squared error (MSE) is one of the best metrics for evaluating regression. It calculates the average of the squared differences between the actual values and the values predicted by the model.
    MSE = (1/N) × Σ (y − ŷ)²  [8]

    3: R2 Score:
    R squared error, also known as the coefficient of determination, is another widely used statistic for evaluating regression models. The R-squared metric lets us assess the model's performance by comparing it with a constant baseline; to obtain that baseline, we take the mean of the data and draw a line at the mean.
    R² = 1 − (Sum of squared residual errors / Total sum of squared errors)  [9]

    4: Adjusted R2:
    As the name implies, adjusted R squared is an improved form of R squared. Plain R squared can mislead data scientists: its score keeps rising as more terms are added, even when the model is not actually improving. Adjusted R squared gets around this problem, though it always displays a lower value than R², because it adjusts for the growing number of predictors and only shows an improvement when there is a real one.
    Adjusted R² = 1 − [(1 − R²)(N − 1) / (N − P − 1)]  [10]
    where N is the number of samples and P is the number of predictors.
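A sketch computing all four regression metrics on made-up numbers (adjusted R² has no scikit-learn helper, so it is derived from r2_score using the formula above, assuming a single predictor):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

mae = mean_absolute_error(y_true, y_pred)   # average |y - y_hat|
mse = mean_squared_error(y_true, y_pred)    # average (y - y_hat)^2
r2 = r2_score(y_true, y_pred)

n, p = len(y_true), 1                       # n samples, p predictors (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(mae, mse, r2, adj_r2)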

    References:
    [1] https://www.kdnuggets.com/2020/05/model-evaluation-metrics-machine-learning.html
    [2,3,4,5,6] https://intellipaat.com/blog/confusion-matrix-python/
    [7,8,9,10] https://www.javatpoint.com/performance-metrics-in-machine-learning


    STAIR ONE: LINEAR REGRESSION

     LINEAR REGRESSION

    One of the simplest and most widely used machine learning methods is linear regression. It is a statistical technique used for predictive analysis. Linear regression produces predictions for continuous/real or numerical variables such as age, product price, sales, and so on.

    The term "LINEAR REGRESSION" refers to a procedure that displays a linear connection between one or more independent (y) variables and a dependent (y) variable. Given that it displays a linear connection, linear regression determines how the value of the dependent variable varies in response to the value of the independent variable.

    The linear regression model represents the relationship between the variables as a sloped straight line. Take a look at the picture below:

    Linear Regression in Machine Learning
    A linear regression can be mathematically represented as:

    y = a0 + a1x + ε

    Here,

    y = dependent variable (target variable)
    x = independent variable (predictor variable)
    a0 = intercept of the line (gives an additional degree of freedom)
    a1 = linear regression coefficient (scale factor applied to each input value)
    ε = random error

    Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.

  • The strength of the relationship between the specified variables.
  • Example: the link between rising temperatures and pollution levels.
  • The value of the dependent variable at a given value of the independent variable.
  • Example: the pollution level at a particular temperature.

  • The diagram above is taken from the notebook, in which a linear regression model is fitted on a dataset of employees of a random company that includes their years of experience and salary.
    Years of experience is the independent variable and salary is the dependent variable.
     
    In the Jupyter notebook, all the required libraries, such as pandas, seaborn, and matplotlib, are imported first. Then the data is split into training and test sets, and the model is implemented. To implement the model, the linear regression class must also be imported; it can be called through sklearn.linear_model, part of scikit-learn, a free and open-source ML library for Python.
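Here is a minimal sketch of that workflow; the tiny YearsExperience/Salary table below is invented for illustration, standing in for the employee dataset in the notebook:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the employee dataset
df = pd.DataFrame({
    "YearsExperience": [1, 2, 3, 4, 5, 6, 7, 8],
    "Salary": [40000, 45000, 52000, 58000, 65000, 71000, 78000, 85000],
})

X = df[["YearsExperience"]]   # independent variable
y = df["Salary"]              # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.intercept_, model.coef_)   # a0 and a1 from y = a0 + a1*x + ε
print(model.predict(X_test))           # predicted salaries for the test set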


    As I said previously, during this journey I will take readers on a hands-on journey. I am providing a link to a freely accessible folder where I have posted various documents in which I have implemented the models at the simplest level; any beginner can easily understand them.
    Those models are implemented in a Jupyter Notebook, a platform for implementing Python projects.


    Kindly refer to the link provided below:

    References:

    [1] https://www.javatpoint.com/linear-regression-in-machine-learning


    FLOOR ONE: SUPERVISED LEARNING

    SUPERVISED LEARNING

    What is Supervised Learning in Machine Learning?
    So, guys, have you ever heard of the spam e-mail concept? That is something that can be detected by supervised learning.
    The algorithm is given a dataset with inputs and corresponding outputs, and it learns to map the inputs to the correct outputs. Supervised learning can be broadly categorized into two main types: classification and regression.
    In REGRESSION, the algorithm's goal is to predict a continuous value based on input data. For instance, imagine we want to predict the price of a house based on features like its size, number of bedrooms, and location. Here, the data type used would also be structured data, but instead of discrete labels, we'd have a continuous target variable (the price of the house).
    Now, in CLASSIFICATION, the algorithm's task is to categorize data into different classes or categories. For example, let's say we want to build a system that can classify whether an email is spam or not spam. Here, the data type used would typically be structured data containing features of the email like the sender, subject, and body, along with a label indicating whether it is spam or not.


    A] Algorithms used for Regression:
    For regression, the inputs are numerical data and the output is a predicted value.

    1. Linear Regression:
    The core concept of linear regression revolves around fitting a straight line to the data points in such a way that the line best represents the relationship between the independent and dependent variables. This line is represented by the equation:

    y = β0 + β1x + ε

    Where:

    • y is the dependent variable (the variable we want to predict),
    • x is the independent variable (the variable used for prediction),
    • β0 is the intercept (the value of y when x is zero),
    • β1 is the slope (the change in y for a one-unit change in x),
    • ε is the error term (the difference between the actual and predicted values of y).
    2. Logistic Regression:
    Logistic regression uses a logistic function called the sigmoid function to map predictions to their probabilities. The sigmoid function is an S-shaped curve that converts any real value to a range between 0 and 1. Moreover, if the output of the sigmoid function (the estimated probability) is greater than a predefined threshold, the model predicts that the instance belongs to that class. If the estimated probability is less than the predefined threshold, the model predicts that the instance does not belong to the class. The sigmoid function is referred to as the activation function for logistic regression and is defined as:


    f(x) = 1 / (1 + e^(−value(x)))  [2]

    where,

    • e = base of natural logarithm
    • value(x) = numerical value one wishes to transform
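A quick sketch of the sigmoid and the threshold rule in plain NumPy (the threshold of 0.5 is the usual default, chosen here purely for illustration):

import numpy as np

def sigmoid(x):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ≈ [0.018, 0.5, 0.982]

threshold = 0.5
print(sigmoid(2.0) >= threshold)  # True → predict membership in the class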
    B] Algorithms used for Classification:
    1. Decision Tree:
    Although decision trees are a supervised learning approach, they are mostly employed to solve classification problems. However, they may also be used to solve regression problems. This classifier is tree-structured, with internal nodes standing in for dataset attributes, branches for decision rules, and leaf nodes for each outcome. The decision node and the leaf node are the two kinds of nodes that make up a decision tree. While leaf nodes represent the result of decisions and have no further branches, decision nodes are used to make decisions and have numerous branches. The characteristics of the provided dataset are used to inform the decisions or the test. A decision tree simply poses a question, and then divides the tree into subtrees according to the response (Yes/No).
    The decision tree's general structure is illustrated in the diagram below:
    Decision Tree

    2. Random Forest:
    As the name suggests, "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.
    The greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
    The below diagram explains the working of the Random Forest algorithm:



    Random Forest 

    References:
    [1] https://www.javatpoint.com/linear-regression-in-machine-learning
    [2] https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-logistic-regression/#:~:text=Practices%20for%202022-,What%20Is%20Logistic%20Regression%3F,1%2C%20or%20true%2Ffalse.
