Count the total number of rows across a multi-sheet Excel workbook and choose a processing strategy based on data size, extract values along a specific dimension and deduplicate them for counting, then generate summary and detail reports.
This sub-skill covers one capability of the Excel workflow. For reading/counting/Parquet optimization, see the parent workflow SKILL.md.
Step 1: Load the target sheet and run an initial data preview and structure check.
```python
import pandas as pd

file_path = 'input_file.xlsx'
target_sheet = 'Sheet1'  # set to the actual sheet name

# header=None handles files with no header row or a non-standard header
df = pd.read_excel(file_path, sheet_name=target_sheet, header=None)
print(f"Shape: {df.shape}")
print("First 5 rows:")
print(df.head())
```
Step 2: Iterate over the rows, extract target values by keyword, and clean the data (strip whitespace, filter out empty values).
```python
import pandas as pd

# Target column index and filter keywords
target_col_idx = 1
keywords = ["关键词A", "关键词B"]  # e.g. "综合楼", "控制中心"

extracted_data = []
for idx, row in df.iterrows():
    cell_val = str(row[target_col_idx]) if pd.notna(row[target_col_idx]) else ""
    # Clean: strip surrounding whitespace, then match against keywords
    clean_val = cell_val.strip()
    if any(k in clean_val for k in keywords):
        if clean_val and clean_val.lower() not in ["nan", "null", ""]:
            extracted_data.append(clean_val)

print(f"Extracted {len(extracted_data)} matching records")
```
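For large sheets, the row-by-row `iterrows` loop above can be slow. The same extraction can be done with vectorized string operations; a sketch assuming the same column index and keywords (the sample frame below is a hypothetical stand-in for the `df` loaded in Step 1):

```python
import pandas as pd

# Hypothetical stand-in for the df loaded in Step 1
df = pd.DataFrame({0: ["r1", "r2", "r3"], 1: ["综合楼A", None, "控制中心B"]})
keywords = ["综合楼", "控制中心"]

# Cast to string, strip whitespace, keep rows matching any keyword
col = df[1].dropna().astype(str).str.strip()
pattern = "|".join(keywords)  # assumes keywords contain no regex metacharacters
extracted_data = col[col.str.contains(pattern)].tolist()
print(extracted_data)
```

Dropping NaN cells before the cast also removes the need for the `"nan"`/`"null"` string filter used in the loop version.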
Step 3: Classify and deduplicate the extracted values, and count the unique items in each category.
```python
# Deduplicate efficiently with sets, one per category
category_a_items = set()
category_b_items = set()
for item in extracted_data:
    # Classify by keyword; adjust the conditions to your categories
    if "关键词A" in item:
        category_a_items.add(item)
    else:
        category_b_items.add(item)

print(f"Category A unique items: {len(category_a_items)}")
print(f"Category B unique items: {len(category_b_items)}")
```

Install this skill with: `npx skills add OpenSenseNova/SenseNova-Skills --skill duplicate-removal`
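The skill description also calls for summary and detail reports. A minimal sketch of that last step using `pandas.ExcelWriter` (the output file name, sheet names, and category labels are assumptions, and the item sets stand in for the Step 3 results):

```python
import pandas as pd

# Hypothetical deduplicated results from Step 3
category_a_items = {"综合楼A", "综合楼B"}
category_b_items = {"控制中心1"}

# Summary: one row per category with its unique count
summary = pd.DataFrame({
    "category": ["A", "B"],
    "unique_count": [len(category_a_items), len(category_b_items)],
})

# Detail: one row per unique item, tagged with its category
detail = pd.DataFrame(
    [("A", v) for v in sorted(category_a_items)]
    + [("B", v) for v in sorted(category_b_items)],
    columns=["category", "item"],
)

# Write both sheets into a single workbook
with pd.ExcelWriter("report.xlsx") as writer:
    summary.to_excel(writer, sheet_name="Summary", index=False)
    detail.to_excel(writer, sheet_name="Detail", index=False)
```

Writing both sheets through one `ExcelWriter` context keeps the summary and detail views in a single workbook for review.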