Deep Dive | To Pick a Lipstick for My Girlfriend for Qixi, I Scraped JD.com and Analyzed the Data

2020-08-26 22:21:33

Create a new column "weight": every entry in positive gets 1, and every entry in negative gets -1:

positive.columns = ['review']
positive['weight'] = pd.Series([1]*len(positive))
negative.columns = ['review']
negative['weight'] = pd.Series([-1]*len(negative))

Concatenate positive and negative:

pos_neg = pd.concat([positive,negative],axis=0)

In pos_neg, positive sentiment words carry a weight of 1 and negative sentiment words a weight of -1.
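The two steps above can be sketched on toy data; the word lists here are hypothetical stand-ins for the real sentiment lexicons:

```python
import pandas as pd

# toy stand-ins for the real positive/negative sentiment lexicons
positive = pd.DataFrame(['好用', '喜欢', '显色'])
negative = pd.DataFrame(['拔干', '难用'])

# name the word column and attach the weights
positive.columns = ['review']
positive['weight'] = pd.Series([1] * len(positive))
negative.columns = ['review']
negative['weight'] = pd.Series([-1] * len(negative))

# stack the two dictionaries into one weighted lexicon
pos_neg = pd.concat([positive, negative], axis=0)
print(pos_neg)
```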

# Merge into reviews_long_clean
data = reviews_long_clean.copy()
reviews_mltype = pd.merge(data, pos_neg, how='left', left_on='word', right_on='review')

Fill the NaN values in the review and weight columns of reviews_mltype with 0, marking those words as neutral:

reviews_mltype = reviews_mltype.drop(['review'], axis=1)
reviews_mltype = reviews_mltype.replace(np.nan, 0)
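The effect of the left join plus the NaN fill can be seen on toy data (the words and weights below are invented for illustration):

```python
import pandas as pd
import numpy as np

# toy word-level table and weighted lexicon
data = pd.DataFrame({'word': ['颜色', '喜欢', '太', '干']})
pos_neg = pd.DataFrame({'review': ['喜欢', '干'], 'weight': [1, -1]})

# words missing from the lexicon get NaN from the left join...
reviews_mltype = pd.merge(data, pos_neg, how='left', left_on='word', right_on='review')
# ...and NaN becomes 0, i.e. neutral
reviews_mltype = reviews_mltype.drop(['review'], axis=1)
reviews_mltype = reviews_mltype.replace(np.nan, 0)
print(reviews_mltype)
```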

Next we correct the sentiment polarity, because a negation word can completely change the meaning of a sentence. For example:

"I like durian" means the opposite once a "not" is inserted: "I don't like durian."

Add another negation, "It's not that I don't like durian," and the two negatives cancel out.

# Correct sentiment polarity
# With multiple negations, an odd number negates and an even number affirms
notdict = pd.read_csv('not.csv')
notdict['freq'] = [1]*len(notdict)

# Prep 1
reviews_mltype['amend_weight'] = reviews_mltype['weight']
reviews_mltype['id'] = np.arange(0, reviews_mltype.shape[0])

# Prep 2: keep only the rows that contain a sentiment word
only_reviews_mltype = reviews_mltype[reviews_mltype['weight'] != 0]
only_reviews_mltype.index = np.arange(0, only_reviews_mltype.shape[0])

# Judge negation from the two words before each sentiment word:
# at the start of a sentence there is no negation to look for;
# if it is the second word of the sentence, check the single word before it
index = only_reviews_mltype['id']
for i in range(0, only_reviews_mltype.shape[0]):
    # the review that contains the i-th sentiment word
    reviews_i = reviews_mltype[reviews_mltype['index_content'] == only_reviews_mltype['index_content'][i]]
    reviews_i.index = np.arange(0, reviews_i.shape[0])  # after resetting, the index matches index_word
    word_ind = only_reviews_mltype['index_word'][i]     # position of the i-th sentiment word in its review

    # Case 1: first word of the review, nothing to check
    # Case 2: second word of the review
    if word_ind == 2:
        # .values is needed: `in` on a Series tests the index, not the entries
        ne = sum([reviews_i['word'][word_ind-1] in notdict['term'].values])
        if ne == 1:
            reviews_mltype['amend_weight'][index[i]] = -(reviews_mltype['weight'][index[i]])
    # Case 3: third word of the review or later
    elif word_ind > 2:
        ne = sum([word in notdict['term'].values for word in reviews_i['word'][[word_ind-1, word_ind-2]]])
        if ne == 1:
            reviews_mltype['amend_weight'][index[i]] = -(reviews_mltype['weight'][index[i]])

reviews_mltype.shape
reviews_mltype[(reviews_mltype['weight'] - reviews_mltype['amend_weight']) != 0]  # the flipped rows; empty output means the two columns are identical
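The look-back rule can be checked on a toy sentence; the tokenized review and the negation list below are made up for illustration:

```python
import pandas as pd

# hypothetical negation dictionary, standing in for not.csv
notdict = pd.DataFrame({'term': ['不', '没有', '不是']})

# tokenized toy review: "我 不 喜欢 吃 榴莲", where "喜欢" is a sentiment word of weight 1
words = ['我', '不', '喜欢', '吃', '榴莲']
word_ind = 3            # 1-based position of the sentiment word "喜欢"
weight = 1

# look at the (up to) two preceding words; a single negation flips the sign
prev_words = words[max(word_ind - 3, 0):word_ind - 1]
ne = sum(w in notdict['term'].values for w in prev_words)
amend_weight = -weight if ne == 1 else weight
print(amend_weight)   # the "不" before "喜欢" flips 1 to -1
```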

Use groupby to compute each sentence's sentiment score, and save it to a csv file:

emotion_value = reviews_mltype.groupby('index_content', as_index=False)['amend_weight'].sum()
emotion_value.to_csv('1_emotion.csv', index=True, header=True)
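What the groupby does here can be seen on a toy word-level table (the column values are invented for illustration):

```python
import pandas as pd

# toy word-level table: index_content identifies the review each word came from
reviews_mltype = pd.DataFrame({
    'index_content': [1, 1, 1, 2, 2],
    'word': ['颜色', '好看', '喜欢', '太', '干'],
    'amend_weight': [0, 1, 1, 0, -1],
})

# sum the word weights within each review to get a sentence-level score
emotion_value = reviews_mltype.groupby('index_content', as_index=False)['amend_weight'].sum()
print(emotion_value)
```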

Filter for the sentences that carry a sentiment:

# keep the reviews whose amend_weight total is non-zero
content_emotion_value = emotion_value.copy()
content_emotion_value.shape
content_emotion_value = content_emotion_value[content_emotion_value['amend_weight'] != 0]

Add a new column: mark a sentence pos when its amend_weight is greater than 0, and neg when it is less than 0:

content_emotion_value['ml_type'] = ''
content_emotion_value.loc[content_emotion_value['amend_weight'] > 0, 'ml_type'] = 'pos'
content_emotion_value.loc[content_emotion_value['amend_weight'] < 0, 'ml_type'] = 'neg'

Merge content_emotion_value with reviews_mltype, joining on index_content. This means that if a sentence is pos, every word in that sentence gets the pos label, and likewise for neg:

# Merge back into the main table
content_emotion_value = content_emotion_value.drop(['amend_weight'], axis=1)
reviews_mltype.shape
reviews_mltype = pd.merge(reviews_mltype, content_emotion_value, how='left', left_on='index_content', right_on='index_content')
reviews_mltype = reviews_mltype.drop(['id'], axis=1)
reviews_mltype.to_csv('1_reviews_mltype.csv')
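How the left join propagates the sentence label down to every word can be shown on toy tables (values invented for illustration):

```python
import pandas as pd

# toy word table and sentence-level labels
reviews_mltype = pd.DataFrame({'index_content': [1, 1, 2],
                               'word': ['好看', '喜欢', '太干']})
content_emotion_value = pd.DataFrame({'index_content': [1, 2],
                                      'ml_type': ['pos', 'neg']})

# a left join on index_content copies each sentence's label onto all of its words
reviews_mltype = pd.merge(reviews_mltype, content_emotion_value,
                          how='left', on='index_content')
print(reviews_mltype)
```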

Next, use a confusion matrix to check how well the sentiment analysis works, i.e. how the machine labels (ml_type) differ from the human labels (content_type). For background on confusion matrices, see my earlier article: What is a "confusion matrix"? A small example gives you the answer.

# Confusion matrix
cate = ['index_content', 'content_type', 'ml_type']
data_type = reviews_mltype[cate].drop_duplicates()
confusion_matrix = pd.crosstab(data_type['content_type'], data_type['ml_type'], margins=True)
confusion_matrix
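pd.crosstab builds the confusion matrix directly from two label columns; a toy example with invented labels:

```python
import pandas as pd

# toy human vs. machine labels, one row per review
data_type = pd.DataFrame({'content_type': ['pos', 'pos', 'neg', 'neg', 'pos'],
                          'ml_type':      ['pos', 'neg', 'neg', 'neg', 'pos']})

# rows: human label, columns: machine label; margins=True adds the "All" totals
confusion_matrix = pd.crosstab(data_type['content_type'], data_type['ml_type'],
                               margins=True)
print(confusion_matrix)
```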

[Figure: confusion matrix]

classification_report shows how accurate the machine labels are relative to the human labels:

from sklearn.metrics import classification_report

data = data_type[['content_type', 'ml_type']]
data = data.dropna(axis=0)
print(classification_report(data['content_type'], data['ml_type']))
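The numbers classification_report prints can be reproduced by hand. A sketch with hypothetical labels, computing per-class precision and recall plus overall accuracy from their definitions:

```python
# toy labels, the same shape of input classification_report takes
y_true = ['pos', 'pos', 'neg', 'neg', 'pos']
y_pred = ['pos', 'neg', 'neg', 'neg', 'pos']

def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one class, from raw counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

prec_pos, rec_pos = precision_recall(y_true, y_pred, 'pos')
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(prec_pos, rec_pos, accuracy)
```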

The accuracy is 0.92.

Finally, pull the positive and negative sentiment words out of the reviews and draw a word cloud for each:

'''Word clouds'''
import imageio
from collections import Counter
from wordcloud import WordCloud

data = reviews_mltype.copy()
data = data[data['amend_weight'] != 0]
word_data_pos = data[data['ml_type'] == 'pos']
word_data_neg = data[data['ml_type'] == 'neg']

font = '微软雅黑 Light.ttc'            # a Chinese font is required to render the words
picture = imageio.imread('kh4.png')    # mask image that shapes the cloud
w = WordCloud(font_path=font, max_words=100, background_color='white', mask=picture, colormap='Reds')
w.generate_from_frequencies(Counter(word_data_pos.word.values))
w.to_file('1_正面情感词词云图.png')
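generate_from_frequencies expects a word-to-count mapping, which is exactly what Counter produces from a word column. A minimal sketch with an invented word list standing in for word_data_pos.word.values:

```python
from collections import Counter

# toy positive-word column, standing in for word_data_pos.word.values
words = ['喜欢', '好看', '喜欢', '显色', '喜欢']

# map each word to its frequency, the input shape WordCloud wants
freq = Counter(words)
print(freq.most_common(1))
```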