In modern society, we rely heavily on computers to process huge volumes of data. For economic and business reasons, there is great demand to quickly input mountains of printed and handwritten information into computers. Nepali handwritten character recognition using CNN is a field of research in artificial intelligence, computer vision, and pattern recognition. A computer performing handwriting recognition is able to acquire and detect characters in paper documents, pictures, touch-screen devices, and other sources and convert them into machine-encoded form. Its applications include optical character recognition, transcription of handwritten documents into digital documents, and more advanced intelligent character recognition systems.
Handwriting recognition can be separated into two main types: offline handwriting recognition and online handwriting recognition. They use different input signals. For offline recognition, the input signal is generally a document of characters captured by a scanner.
The input signal for online recognition is captured while the writer is writing: the tip movements (strokes) of a digital pen are converted into a list of coordinates.
In the twenty-first century, handwritten communication still has its own standing, and in daily life it is often used as a means of conversation and of recording information to be shared with others. One of the main challenges in handwritten character recognition lies in the variation and distortion of the handwritten character set, because different communities use diverse styles of handwriting and draw the same characters of a shared script in different ways. In the
coming days, character recognition systems might serve as a cornerstone for a paperless environment by digitizing and processing existing paper documents.
Use Case Diagram
The Nepali Handwritten Character Recognition using CNN system takes a handwritten character as input from an HTML canvas. The character is converted into a 32×32 image with 1 channel (greyscale), and that image is fed to the model, which has been pre-trained on the training dataset. The model takes the image as input and returns a prediction, which is displayed and pronounced in an HTML document.
IMPLEMENTATION AND TESTING
The user draws a character on the HTML canvas. The canvas data is captured and converted into an image file of size 32×32. That image is loaded and converted into an array of shape (1, 32, 32, 1), signifying a single image with width and height of 32 and 1 channel, since it is greyscale.
The model we created in the steps above is then loaded and given this array as input, and it returns the character it predicted. After that, the image that was saved from the canvas is compared with one of the images of the predicted character (loaded from the training dataset) in order to compute their similarity. The result is displayed if the comparison gives a satisfactory similarity value; otherwise a message is displayed saying the prediction failed.
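A minimal sketch of this input pipeline, assuming the canvas has already been decoded into a 32×32 greyscale array (in the real app the HTML canvas PNG would be decoded first, and the commented lines would call the saved Keras model):

```python
import numpy as np

# Hypothetical canvas data: a 32x32 greyscale image with pixel values 0-255.
canvas = np.random.randint(0, 256, size=(32, 32))

# Reshape to the 4D input the CNN expects: (batch, height, width, channels),
# and scale pixel values to the 0-1 range.
x = canvas.reshape(1, 32, 32, 1).astype("float32") / 255.0

# prediction = model.predict(x)   # pre-trained Keras model, loaded beforehand
# label = np.argmax(prediction)   # index of the predicted character class
```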
We will be converting our entire dataset to a CSV (comma-separated values) file so that it is easier to feed to our model. The CSV will contain the image data along with the image label, which we will separate.
When loaded, the image data forms 3D arrays, but the input shape our CNN expects is a 4D array (batch, height, width, channels).
Channels signify whether the image is greyscale or colored (in our case we use greyscale images, so channels is 1), so we will be reshaping our image data accordingly. It is always good to normalize data: each pixel holds a value between 0 and 255, and we will scale it to 0–1. Character recognition is a multi-class classification problem, and every output class matters equally to us, so we will use one-hot encoding for the image labels. One-hot encoding transforms an integer into a binary vector that contains only one ‘1’, with all other elements ‘0’.
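The preprocessing steps above — separating labels from pixels, reshaping to 4D, normalizing, and one-hot encoding — can be sketched in NumPy (the two fake rows are illustrative, not real dataset samples):

```python
import numpy as np

# Each CSV row: a label followed by 32*32 = 1024 pixel values in 0-255.
rows = np.array([
    [3] + [128] * 1024,   # fake sample labelled class 3
    [0] + [255] * 1024,   # fake sample labelled class 0
])

labels = rows[:, 0].astype(int)   # first column: image labels
pixels = rows[:, 1:]              # remaining columns: image data

# Reshape to the 4D shape the CNN expects and scale pixels to 0-1.
images = pixels.reshape(-1, 32, 32, 1).astype("float32") / 255.0

# One-hot encode the labels into binary vectors with 46 classes.
num_classes = 46
one_hot = np.eye(num_classes)[labels]
```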
We’ll be using Convolutional Neural Network to develop a Nepali Handwritten Character Recognition using CNN system in Python language. Preprocessed datasets will be fed to Convolutional Neural Network which will include feature extraction and classification. We will be designing our model by using different layers. They are as follows:
- The first hidden layer will be a convolutional layer called Convolution2D. The layer has 32 filters/output channels, with a kernel size of 5×5 and an activation function. This layer is able to detect patterns (edges, shapes, textures, objects, etc.) in the images. A filter with kernel size 5×5 is just a 5×5 matrix initialized with random numbers. After receiving an input channel, the filter slides over the input, and at each position computes the sum of the element-wise multiplication of itself and the 5×5 input block and stores it, until it has slid over every 5×5 block of pixels in the entire image.
A = 5×5 input block
B = 5×5 filter
C = result matrix, where each entry is the sum of the element-wise product of A and B
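The sliding element-wise multiply-and-sum described above can be sketched as a plain NumPy loop (a real layer uses random filter values and optimized kernels; the all-ones inputs here just make the result easy to check):

```python
import numpy as np

# Single-channel "valid" convolution: slide the kernel over the image and,
# at each position, sum the element-wise products.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            block = image[i:i+kh, j:j+kw]       # A: the 5x5 input block
            out[i, j] = np.sum(block * kernel)  # C[i, j] = sum(A * B)
    return out

image = np.ones((32, 32))
kernel = np.ones((5, 5))           # B: the filter (random in a real layer)
feature_map = convolve2d(image, kernel)
print(feature_map.shape)           # (28, 28): 32 - 5 + 1 in each dimension
```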
- The second layer will be the Max Pooling layer with pool size 2×2. Max pooling reduces the dimensionality of images by reducing the number of pixels in the output from the previous convolutional layer. It takes the first 2×2 region of the input (the output of the convolutional layer above) and calculates the maximum of the values in that 2×2 block. This value is stored in the output channel, which makes up the full output of the max pooling operation. It then slides over by 2, calculates the max value in the next 2×2 block, stores it in the output, and the process continues. Once it reaches the edge, it moves down by 2 and repeats. This process is carried out for the entire image, and when it is finished, the new representation of the image is output.
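The 2×2 max pooling described above can be sketched with a small NumPy function on a toy 4×4 input (names are illustrative):

```python
import numpy as np

# 2x2 max pooling with stride 2: keep the maximum of each 2x2 block.
def max_pool2d(x, size=2):
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = x[i:i+size, j:j+size].max()
    return out

x = np.array([
    [1, 3, 2, 1],
    [4, 2, 0, 1],
    [5, 1, 9, 2],
    [0, 6, 3, 8],
])
print(max_pool2d(x))
# [[4. 2.]
#  [6. 9.]]
```

A 4×4 input becomes a 2×2 output: each output value is the maximum of one 2×2 block.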
- The third layer will be another convolutional layer with 32 filters/output channels, a kernel size of 3×3, and an activation function (just like the first layer, but with a 3×3 kernel instead of 5×5).
- The fourth layer will be another Max Pooling layer, exactly like the second layer.
- The next layer will be a regularization layer called Dropout. It is configured to randomly exclude 20% of the neurons in the layer in order to reduce overfitting. Overfitting occurs when our model becomes very good at classifying data from the training set but performs worse on data it was not trained on. Randomly dropping 20% of neurons helps reduce this effect.
- The next layer, called Flatten, converts the 2D matrix data to a vector. It allows the output to be processed by the standard fully connected layers that come next in the model; outputs must be flattened before a fully connected layer will accept them. Flattening the 2D matrix from the layer above simply means converting the 2D data to 1D: every row of the 2D matrix is taken in order and placed serially into a 1D array. So if the input matrix is 4×4, the flattened array is a 1D array of length 16.
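The row-by-row flattening described above is exactly what NumPy's reshape/flatten does, as a quick check shows:

```python
import numpy as np

# A 4x4 matrix flattens row by row into a 1D array of length 16.
m = np.arange(16).reshape(4, 4)
flat = m.flatten()
print(flat.shape)   # (16,)
print(flat[:4])     # the first row of m: [0 1 2 3]
```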
- The next layer will be a fully connected layer (dense layer) with 128 neurons (“fully connected” means that every node in the first layer is connected to every node in the second layer). It performs classification based on the features extracted by the previous layers.
Each connection between two nodes has an associated weight, which is just a number. Each weight represents the strength of the connection between the two nodes. When the network receives an input at a given node in the input layer, this input is passed to the next node via a connection, and the input will be multiplied by the weight assigned to that connection.
For each node in the second layer, a weighted sum is then computed with each of the incoming connections. This sum is then passed to an activation function, which performs some type of transformation on the given sum.
node output = activation (weighted sum of inputs)
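The weighted-sum-then-activation computation above can be illustrated for a single node with three incoming connections (the numbers are arbitrary, and ReLU is assumed as the activation):

```python
import numpy as np

inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.2, 0.4, 0.1])    # one weight per incoming connection

# Weighted sum of the incoming connections: 0.5*0.2 + (-1.0)*0.4 + 2.0*0.1 ≈ -0.1
weighted_sum = np.dot(inputs, weights)

# ReLU activation: negative sums become 0.
node_output = max(0.0, weighted_sum)
print(node_output)   # 0.0
```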
- The next (last) layer will be the output layer with 46 neurons (46 because there are 10 digits and 36 consonants, ‘ka’ to ‘gya’), and it uses the softmax activation function. Each neuron gives the probability of its class. Because this is multi-class classification, the softmax activation function is used instead of the usual sigmoid activation function.
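Since the text refers to the layers by their Keras names (Convolution2D, Dropout, Flatten), the layer stack above might be sketched in Keras as follows (the optimizer and loss are assumptions, not stated in the text):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Sketch of the architecture described above for 32x32 greyscale inputs.
model = Sequential([
    Conv2D(32, (5, 5), activation="relu", input_shape=(32, 32, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.2),                       # randomly drop 20% of neurons
    Flatten(),
    Dense(128, activation="relu"),      # fully connected layer
    Dense(46, activation="softmax"),    # one neuron per class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # matches one-hot labels
              metrics=["accuracy"])
```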
We will be using the Rectified Linear Unit (ReLU) activation function in every layer except the last output layer. The function returns 0 if it receives any negative input, but for any positive value x it returns that value back. It is the simplest non-linear activation function. When the input is positive, the derivative is just 1, so there is no squeezing effect. It can be written as: f(x) = max(0, x).
In the last output layer, we used a softmax activation function. The softmax function squashes the outputs of each unit to be between 0 and 1, just like a sigmoid function. But it also divides each output such that the total sum of the outputs is equal to 1.
The output of the softmax function is equivalent to a categorical probability distribution; it tells you the probability that each of the classes is true. Mathematically, the softmax function is softmax(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k), where z is the vector of inputs to the output layer (if you have 10 output units, then there are 10 elements in z), and j indexes the output units, so j = 1, 2, …, K.
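As a quick check of the definition above, softmax can be sketched in a few lines of NumPy (subtracting the max before exponentiating is a numerical-stability detail, not part of the mathematical definition):

```python
import numpy as np

# Softmax: exponentiate each element of z, then divide by the sum so the
# outputs form a probability distribution over the classes.
def softmax(z):
    e = np.exp(z - np.max(z))   # max-subtraction avoids overflow
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p.sum())      # 1.0: outputs sum to one
print(p.argmax())   # 0: the largest input gets the highest probability
```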
This project implements Nepali handwritten character recognition using a CNN. We first reviewed the approaches currently used in similar applications, then delved into the inner workings of CNNs. With this knowledge, we specified the project's requirements and planned the solution, focusing mainly on taking the user's input character, processing it, and returning as accurate a result as possible. The project met the expected outcome: it is able to recognize the 36 consonants of the Nepali alphabet. The results show that this approach performs well compared to traditional methods.
A character image is a collection of numbers that make up a matrix.
Full document: here
Full Code: github
Resource Provider: PJMessi