Sentiment Analysis of Nepali Sentences (Part-2) | Data Preprocessing

This post covers data collection and some preprocessing. Data for this project was collected manually from different websites and Facebook groups. Sentiment analysis of Nepali sentences needs Nepali data: sentences written in pure Nepali letters, or romanized Nepali sentences that are still purely in the Nepali language.

E.g.

“मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु”,2
“अनुहार हेरे पनि लोद्दर लाग्छ यो बिदेशी हरु को पाल्तु कुकुर को”,0
“सबै पाटिका 60 काटेका सबै नेतालाई बरखास्त गनुँ पछँ”,0

The sentences inside double quotes are the actual user sentiments, and the numeric value is the corresponding polarity (sentiment label) of each sentence. Sentiment analysis of Nepali sentences is not easy because of some grammatical complexity in the language.

Let’s start coding

Libraries Used
[Image: important libraries used for sentiment analysis of Nepali text]

The math library in Python handles the more complex mathematical operations. The function tools and collections modules provided by Python (functools and collections) are used for small calculations and for handling data. Sklearn is a machine learning library that provides regression, classification, and optimization capabilities. Another important package is pickle, which stores Python objects in a file. Here it stores the trained model: once the model has been trained on the data, it is saved as a pickle file.
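As a minimal sketch of the pickle step, the snippet below saves an object to a pickle file and loads it back. The `model` dictionary, the file name `model.pickle`, and the temporary directory are assumptions made for illustration. In the project the object would be the trained classifier, which is left out here so the sketch runs with the standard library alone.

```python
import os
import pickle
import tempfile
from collections import Counter

# Hypothetical stand-in for a trained model: word counts per sentiment label.
model = {
    0: Counter(["पाल्तु", "कुकुर"]),
    2: Counter(["मनपरौछु"]),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")  # assumed file name

    # Save the object; pickle files must be opened in binary mode.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)

    # Load it back, as a later prediction script would.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == model)  # True
```

Only load pickle files that you created yourself, because unpickling untrusted data can execute arbitrary code.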

Read Data
[Image: reading the data for sentiment analysis of Nepali text through pandas]

The first line of code reads the data file (merge.csv); the delimiter is the separator between each sentence and its label. The first_col variable in the second line takes only the first column, i.e. the collection of sentences, and second_col takes the collection of labels. Three more variables are declared in advance to store data; they are explained further on.
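The reading step can be sketched roughly as follows. The post reads merge.csv through pandas; this sketch swaps in the standard-library csv module so that it runs without extra packages. The comma delimiter, the straight double quotes, the in-memory sample, and the helper name `read_labeled_sentences` are all assumptions for illustration.

```python
import csv
import io

# Two rows in the sentence,label layout shown earlier. In the project this
# content lives in merge.csv; an in-memory file keeps the sketch standalone.
sample = (
    '"अनुहार हेरे पनि लोद्दर लाग्छ यो बिदेशी हरु को पाल्तु कुकुर को",0\n'
    '"मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु",2\n'
)


def read_labeled_sentences(handle, delimiter=","):
    """Return (sentences, labels) read from a delimited text file."""
    sentences, labels = [], []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sentences.append(row[0])      # first column: the sentence
        labels.append(int(row[1]))    # second column: the polarity label
    return sentences, labels


first_col, second_col = read_labeled_sentences(io.StringIO(sample))
print(len(first_col), second_col)
```

With a real file, the same function can be called on `open("merge.csv", encoding="utf-8", newline="")`. The explicit UTF-8 encoding matters because the sentences are not ASCII.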

Split Documents
[Image: document split function for sentiment analysis of Nepali text]

This function splits a document (each line) into an array of words. This is part of preprocessing, and the split is done before collecting the bag of words.

E.g.

“मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु”

After the split:

[“मैले”, “मेरो”, “फायर”, “एचडीलाई”, “8”, “दुई”, “हप्तामा”, “गरेको”, “छु”, “र”, “म”, “यो”, “मनपरौछु”]
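A minimal sketch of such a split function is shown below. The name `split_doc` is hypothetical, and splitting on whitespace is an assumption that matches the example above; the original function may handle punctuation differently.

```python
def split_doc(document):
    """Split one sentence into a list of whitespace-separated words."""
    return document.split()


words = split_doc("मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु")
print(len(words))  # 13
print(words[:5])
```

Calling `split()` without arguments also drops leading, trailing, and repeated spaces, so no token ends up with stray whitespace attached.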

Did you read my previous blog on the introduction of this series? here

For more about data preprocessing in sentiment analysis, go here

Polarity Analysis of Sentences Makes Content Valuable.
