Python feature extraction code

新星源码网 · August 20

The following is a simple Python example of text feature extraction using scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Build a small text corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Learn the vocabulary and turn the corpus into a sparse count matrix
X = vectorizer.fit_transform(corpus)

# Print the feature matrix as a dense array
print(X.toarray())

# Print the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(vectorizer.get_feature_names_out())

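Running the code above prints the feature matrix and the vocabulary. The feature matrix is a two-dimensional array: each row corresponds to one document, each column corresponds to one vocabulary term, and each value is the number of times that term occurs in that document. The vocabulary lists every distinct term the vectorizer extracted from the corpus, one per column.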
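To make the meaning of the rows and columns concrete, here is a minimal sketch that rebuilds the same kind of count matrix using only the standard library. It is not scikit-learn's implementation; the helper `tokenize` and the variable names are illustrative assumptions, and the regular expression is chosen to imitate CountVectorizer's default behavior (lowercasing, tokens of at least two word characters).

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens made of
    # two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all distinct tokens; each term's
# position in this list is its column index.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

In the second row, the column for "document" holds 2 because that word appears twice in the second sentence.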

Besides CountVectorizer, other feature extraction methods are available, such as TF-IDF (Term Frequency-Inverse Document Frequency).

The following example uses TF-IDF for feature extraction:

from sklearn.feature_extraction.text import TfidfVectorizer

# Build a small text corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then turn the corpus into a sparse TF-IDF matrix
X = vectorizer.fit_transform(corpus)

# Print the feature matrix as a dense array
print(X.toarray())

# Print the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(vectorizer.get_feature_names_out())

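Running the code above prints the TF-IDF feature matrix and the vocabulary. The TF-IDF feature matrix is a two-dimensional array: each row corresponds to one document, each column corresponds to one vocabulary term, and each value is the TF-IDF weight of that term in that document. The vocabulary lists every distinct term the vectorizer extracted from the corpus, one per column.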
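To show where the TF-IDF weights come from, the sketch below computes them by hand with the standard library. It assumes TfidfVectorizer's default settings, namely the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; the helper `tokenize` and the variable names are illustrative, not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(corpus)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = {
    term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    for term in vocabulary
}

tfidf_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Raw term count times idf for every vocabulary term.
    weights = [counts[term] * idf[term] for term in vocabulary]
    # L2-normalize the row so that every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf_matrix.append([w / norm for w in weights])

for row in tfidf_matrix:
    print([round(w, 3) for w in row])
```

Terms that occur in every document, such as "is", receive the minimum idf of 1, while rarer terms such as "and" are weighted more heavily.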

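Beyond CountVectorizer and TfidfVectorizer, there are other feature extraction methods, such as scikit-learn's HashingVectorizer and Word2Vec (available in separate libraries such as gensim). Choose the method that fits your specific requirements.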
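As an illustration of the idea behind hashing-based vectorizers, here is a minimal sketch of the "hashing trick" using only the standard library. Each token is mapped straight to a column index by a hash function, so no vocabulary has to be stored. The names `N_FEATURES`, `tokenize`, and `hashed_features` are hypothetical, and MD5 is used only as a convenient stable hash; scikit-learn's HashingVectorizer uses a different hash function and extra options, so its output will not match this one.

```python
import hashlib
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Hypothetical, deliberately small number of columns so rows are easy to read.
N_FEATURES = 16


def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def hashed_features(text, n_features=N_FEATURES):
    # Map every token to a column via a stable hash and count occurrences.
    vector = [0] * n_features
    for token in tokenize(text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector


hashed_matrix = [hashed_features(doc) for doc in corpus]

for row in hashed_matrix:
    print(row)
```

The trade-off is that different tokens can collide in the same column, and columns can no longer be mapped back to words. In exchange, the width of the matrix is fixed in advance and no fitting step is required.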
