Here we discuss data collection and some preprocessing. The data for this project was collected manually from different websites and Facebook groups. Sentiment analysis of Nepali sentences needs Nepali data: sentences that are purely in the Nepali language, written either in Nepali letters or in romanized form.
“मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु”,2
“अनुहार हेरे पनि लोद्दर लाग्छ यो बिदेशी हरु को पाल्तु कुकुर को”,0
“सबै पाटिका 60 काटेका सबै नेतालाई बरखास्त गनुँ पछँ”,0
The sentences inside double quotes are actual user text, and the number after each one is its polarity (sentiment label). Sentiment analysis of Nepali sentences is not easy because of the grammatical complexity of the language.
Let’s start coding
Math libraries in Python help with complex mathematical operations, and the function tools and collections that Python provides are used for small calculations and package management. Sklearn (scikit-learn) is a machine learning library with regression, classification, and optimization capabilities. Another important package is pickle, which stores Python objects in a file; here it stores the trained model. Once our model has been trained on the data, it is saved as a pickle file.
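To make the pickle step concrete, here is a minimal sketch of saving and reloading an object. The helper names `save_model` and `load_model`, the file name `toy_model.pickle`, and the dictionary standing in for a trained model are all assumptions for illustration; the real project would pass its trained sklearn model instead.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Write any picklable Python object (e.g. a trained model) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Read back an object that was stored with save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for a trained model: a plain dictionary.
toy_model = {"labels": [0, 1, 2], "note": "stand-in for a trained classifier"}

# Hypothetical file name, placed in the system temp directory.
model_path = os.path.join(tempfile.gettempdir(), "toy_model.pickle")

save_model(toy_model, model_path)
restored = load_model(model_path)
```

Only load pickle files that you created yourself, because unpickling runs code stored in the file.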
The first line of code reads the data file (merge.csv); the delimiter is the separator between each sentence and its label. The first_col variable on the second line takes only the first column, i.e. the collection of sentences, and second_col takes the collection of labels. Three variables are pre-declared to store data; they are explained further on.
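The loading step described above can be sketched roughly as follows. This is not the original code: it assumes a comma delimiter, straight double quotes around each sentence, and Python's built-in csv module, and the function name `load_dataset` is hypothetical. The two rows are taken from the samples shown earlier and are read from an in-memory string so the sketch runs without merge.csv.

```python
import csv
import io


def load_dataset(file_obj, delimiter=","):
    """Split each row into a sentence (first column) and a label (second column)."""
    first_col, second_col = [], []
    for row in csv.reader(file_obj, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        first_col.append(row[0])
        second_col.append(int(row[1]))
    return first_col, second_col


# Two sample rows from this post, in the assumed "sentence",label layout.
sample = io.StringIO(
    '"मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु",2\n'
    '"अनुहार हेरे पनि लोद्दर लाग्छ यो बिदेशी हरु को पाल्तु कुकुर को",0\n'
)

first_col, second_col = load_dataset(sample)
```

For the real file, the same function could be called on `open("merge.csv", encoding="utf-8", newline="")`.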
This function splits a document (each line) into an array of words. This is part of preprocessing: the document is split before the bag of words is collected.
“मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु”
[“मैले”, “मेरो”, “फायर”, “एचडीलाई”, “8”, “दुई”, “हप्तामा”, “गरेको”, “छु”, “र”, “म”, “यो”, “मनपरौछु”]
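A splitting function along these lines would produce the word array shown above. This is a minimal sketch that assumes words are separated by whitespace only; the name `split_doc` is hypothetical and the original function may differ.

```python
def split_doc(document):
    """Split one document (a single line of text) into a list of words."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading and trailing spaces.
    return document.split()


words = split_doc("मैले मेरो फायर एचडीलाई 8 दुई हप्तामा गरेको छु र म यो मनपरौछु")
```

Punctuation attached to a word stays attached with this approach, so any further cleaning has to happen in a separate preprocessing step.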
Did you read my previous blog, the introduction to this series? Find it here
For more about data preprocessing for sentiment analysis, go here