
Hands-On Notes: Translating Massive Product Data into an AI-Friendly Markdown Knowledge Base

Recently, I've been tinkering with a challenge: how to transform our company's messy industrial product data into "food" that AI can digest, with the ultimate goal of building a smart customer service or product Q&A bot.

The data at hand mainly consists of two tables: one is product, storing basic information for over 8,000 products; the other is prosn, recording various specific models under each product, totaling about 3 million models, each with details like price, weight, part number, and a bunch of attribute parameters.

The core requirement is: when a user asks about any model, the AI must be able to quickly and accurately retrieve the relevant information.

After some research, using Markdown files as the AI's "textbook" (i.e., knowledge base) seemed quite reliable. Why? Markdown is simple, easy for humans to read, and relatively straightforward for AI to process. But the question arose: how can we elegantly convert the data from these two database tables into well-structured, AI-friendly Markdown?

I hit a few bumps along the way but also figured out some tricks. Here's a share of my journey and the final solution.

Initial Idea: Simple and Direct, One Product Per .md File?

Intuitively, the clearest approach seemed to be creating a separate Markdown file for each product (each row in the product table). The filename could be something like ProductID_ProductName.md, making it easy to identify.

The internal file structure was planned roughly like this:

```markdown
# Product: XXX Sensor (ID: 123)
**Series Number:** ABC-1000
**Category:** Pressure Sensor
**Brand:** Dali Brand
... (Basic product info like description, details, image links) ...

---
## Model List for This Product
---
### Model: ABC-1001
*   **Part Number:** ABC-1001
*   **Price:** ¥500
*   **Weight:** 0.2kg
*   **Attributes:**
    *   Range: 0-10 Bar
    *   Output: 4-20mA
... (Other detailed info for this model) ...
---
### Model: ABC-1002
... (Same structure for the next model) ...
```

Looks neat, right? Clear structure, logical organization.

Problem Arises: 8,000+ Files Dumped in One Folder. Who Can Manage That?

But reality was harsh—there are over 8,000 products! Following this plan would instantly fill the folder with 8,000+ .md files. Imagine managing that mess: finding, updating, and maintaining them would be a nightmare.

This path was clearly unworkable.

Alternative Approach: Bundle Up! Can Multiple Products Go Into One File?

What about "consolidating the pieces," bundling multiple products' information into a single Markdown file? For example, merging info for every 10, 20, or even 50 products into one file. This would drastically reduce the file count (e.g., 8000 / 50 ≈ 160 files), making it much more manageable!

This idea felt much more feasible! But new questions emerged: With so much content in one file, how will the AI know which information belongs to which product? Could different model details get mixed up and confuse the AI?

This meant our internal structure design for the Markdown files had to be more rigorous—we needed a very clear, consistent way to separate content.

Key "Aha!" Moment: Don't Forget How AI "Reads" Documents! (Embedding & Chunking)

Right here, I suddenly remembered a core step in AI document processing: Embedding. Simply put, AI doesn't read Markdown word by word like humans do. It typically first chunks the document, breaking it into meaningful text segments (Chunks), then converts each segment into a series of numbers (i.e., vectors) for similarity calculations, enabling information retrieval and Q&A.

This realization was enlightening:

  1. Chunking Strategy is Crucial! If chunks are poorly defined—like splitting a complete model's info across two chunks, or having a chunk contain partial info from two unrelated models—the AI might struggle to answer accurately or mix up information.
  2. Our designed Markdown structure must serve this "chunking" process! The goal is to enable chunking tools to easily and accurately split the document according to our intent (e.g., ideally, each model's info is an independent, complete Chunk).

Further research revealed that many RAG (Retrieval-Augmented Generation) framework chunking tools support splitting documents based on Markdown heading levels (#, ##, ###, ####, etc.). This was perfect! We could cleverly use heading hierarchies to organize content structure and guide the AI to "segment" correctly.
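To make the idea concrete, here is a minimal, standard-library-only sketch of heading-based chunking. Real RAG frameworks (e.g. LangChain's `MarkdownHeaderTextSplitter`) do the same thing with richer metadata handling; the function and variable names below are my own, not from any particular library.

```python
# Minimal sketch of heading-based chunking: start a new chunk at every
# Markdown heading of level 1..3, so each "### Model: ..." section becomes
# its own chunk. Names here are illustrative, not from a specific framework.
import re

def split_by_headings(markdown: str, level: int = 3) -> list[str]:
    """Split a Markdown document into chunks at headings up to `level`."""
    pattern = re.compile(rf"^#{{1,{level}}} ", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(markdown)]
    if not starts:
        return [markdown]
    starts.append(len(markdown))
    return [markdown[a:b].strip() for a, b in zip(starts, starts[1:])]

sample = """# Product: Valve (ID: 270)

## Product Overview
Some overview text.

### Model: 3L110-06 (Belongs to Product: Valve, ID: 270)
*   **Price:** 27.00

### Model: 3L210-06 (Belongs to Product: Valve, ID: 270)
*   **Price:** 32.00
"""

chunks = split_by_headings(sample)
# Each model section lands in its own chunk, and because the product name
# and ID are repeated in the heading, every chunk carries its own context.
```

Note that the bounded quantifier `#{1,3}` followed by a space will not match `#### ` headings, so level-4 subsections stay attached to their parent model chunk.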

Final Solution: Smart Use of Heading Hierarchies to Build Structured "AI Food"

Combining the ideas of "multiple products per file" and "using heading levels to aid chunking," the finalized plan is as follows:

  1. File Organization Strategy: Merge every N products (e.g., 10 or 20; N can be adjusted based on later testing) into one .md file. Filenames can be standardized, like products_group_001.md, products_group_002.md, etc.
  2. Internal File Structure (The Core of the Core!):
    • Use the top-level heading # to mark the start of each new product. Example: # Product: 2-Position 3-Port Manual Valve (ID: 270). This is the most important separator between different products.
    • Use the secondary heading ## to organize different information sections within a product. For example: ## Product Overview, ## Product Image Links, ## Model List - 2-Position 3-Port Manual Valve. This adds structure to the product's internal info.
    • Use the next level heading ### to mark each specific model. This is key to ensuring the AI can precisely locate and answer model-related questions! Example: ### Model: 3L110-06 (Belongs to Product: 2-Position 3-Port Manual Valve, ID: 270).
      • Heads up, key point! In the model's ### heading or immediately following content, you must, absolutely, include the product information this model belongs to (e.g., product name, product ID)! The purpose is to provide sufficient context for each potentially chunked segment. Otherwise, if the AI gets a model's chunk alone, it might not know "Who am I? Where do I come from?" (which product this model belongs to).
    • If a model's information is particularly complex with many fields, consider using #### headings for further subdivision, like #### Detailed Parameters, #### Price & Stock. This allows for smaller, more focused chunks.

So, the final Markdown file structure looks roughly like this (based on the earlier example):

```markdown
# Product: 2-Position 3-Port Manual Valve (ID: 270)

## Product Overview
*   **Series Number:** 3L
*   **Category:** Manual Valve
*   **Brand:** SMC
... (Other basic product info)

## Product Image Links
*   /path/to/image1.jpg
*   /path/to/image2.jpg

## Model List - 2-Position 3-Port Manual Valve

### Model: 3L110-06 (Belongs to Product: 2-Position 3-Port Manual Valve, ID: 270)
*   **Internal ID (prosn.id):** 270
*   **Model Number (Bianhao):** 3L110-06
*   **Belongs to Product:** 2-Position 3-Port Manual Valve (Product ID: 270)  <-- **See! Context info here, very important!**
*   **Price (Price):** 27.00
*   **Port Size:** PT1/8
*   **Status:** Available
... (Other model attributes, parameters, etc.)

### Model: 3L210-06 (Belongs to Product: 2-Position 3-Port Manual Valve, ID: 270)
*   **Internal ID (prosn.id):** 271
*   **Model Number (Bianhao):** 3L210-06
*   **Belongs to Product:** 2-Position 3-Port Manual Valve (Product ID: 270)
*   **Price (Price):** 32.00
*   **Port Size:** PT1/8
*   **Status:** Pre-sale (Lead Time: Within 3 days)
...

---

# Product: High-Speed Cylinder (ID: 271)

## Product Overview
... (Same structure as above) ...

## Model List - High-Speed Cylinder

### Model: HGC-20-100 (Belongs to Product: High-Speed Cylinder, ID: 271)
*   **Internal ID (prosn.id):** 350
*   **Model Number (Bianhao):** HGC-20-100
*   **Belongs to Product:** High-Speed Cylinder (Product ID: 271)
*   **Price (Price):** 150.00
*   **Bore Size:** 20mm
*   **Stroke:** 100mm
...
```

Some "Pitfalls" and Considerations in Practice:

  • Data "Translation" is a Must: Codes stored in the database like status (e.g., 0, 1, 2), huoqi (e.g., 1, 2, 3, 4 representing different days), pricetype (e.g., 1 for real price, 2 for negotiable) must be converted into plain text understandable by both humans and AI (e.g., "Discontinued", "Lead time within 3 days", "Real price") when generating Markdown.
  • Query All Related Information: Don't just put category_id or pinpai_id in the Markdown. Look up the corresponding names from related category and brand tables (e.g., "Pressure Sensor", "Dali Brand") and include them to provide richer context.
  • Format Special Fields: For text in fields like shuxing that might be "one attribute per line, attribute=value", write scripts to parse and convert them into Markdown unordered list format. Similarly, handle comma-separated image paths in the pic field as lists.
  • Keep Content Focused and Concise: Not all database fields are useful for Q&A. Fields like seo_title, seo_keywords, views, buys, mainly for website operations or statistics, don't help much with AI answering user questions about the product itself. Consider excluding them from the Markdown export to keep the knowledge base "pure."
  • Multi-language Support: If your product info needs to support English, you can add English content below the corresponding Chinese info blocks, using database fields with _en suffixes, following the same Markdown structure.
  • Consistency! Consistency! Consistency! Say it three times! All products and all models must strictly follow the exact same heading hierarchy and format specifications. Any arbitrary changes or format inconsistencies could cause automatic chunking tools to "derail," producing messy Chunks.
  • Automation Scripts are Essential: With large data volumes, manual conversion is impossible. You must write scripts (using Python, PHP, Node.js, or any language you're good with). The core script logic is roughly:
    1. Connect to the database.
    2. Set a counter count and file handle file_handler, deciding how many products (N) per file.
    3. Loop through product records in the venshop_product table.
    4. For each product, query all its model records from the venshop_prosn table based on product_id.
    5. Assemble the current product's basic info and all its models' detailed info into a properly formatted string according to the designed Markdown structure.
    6. Remember to prefix each product's info with a level-1 heading like # Product Name (ID: xxx).
    7. Prefix each model's info with a level-3 heading like ### Model: Model Name (Belongs to Product: Product Name, ID: xxx), ensuring key context is included.
    8. Write the assembled string to the currently open file file_handler.
    9. Increment the product counter count. If count reaches N, close the current file, open a new Markdown file (e.g., increment filename number), and reset count to 0.
    10. After processing all products, ensure the last file is closed.
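The loop above can be sketched in Python. For a runnable illustration I use SQLite; the real `venshop_product` / `venshop_prosn` tables presumably live in MySQL, so swap the connection accordingly. Column names beyond those the post mentions (`name`, `bianhao`, `price`, `status`, `shuxing`) and the exact status-code mapping are assumptions. Instead of juggling a file handle and counter, this version yields `(filename, content)` pairs, which is easier to test; writing them to disk is a one-line loop at the end.

```python
# Sketch of the export loop: N products per Markdown file, with code-to-text
# translation and "key=value" attribute parsing. Table/column names are taken
# from the post where available; the rest are illustrative assumptions.
import sqlite3

STATUS_TEXT = {0: "Discontinued", 1: "Available", 2: "Pre-sale"}  # assumed mapping

def format_attributes(shuxing: str) -> str:
    """Turn 'key=value' lines from the shuxing field into a Markdown sub-list."""
    items = [line.split("=", 1) for line in shuxing.splitlines() if "=" in line]
    return "\n".join(f"    *   {k.strip()}: {v.strip()}" for k, v in items)

def export_markdown(conn: sqlite3.Connection, n_per_file: int = 10):
    """Yield (filename, content) pairs, grouping N products per file."""
    cur = conn.cursor()
    products = cur.execute(
        "SELECT id, name FROM venshop_product ORDER BY id").fetchall()
    for group_no, start in enumerate(range(0, len(products), n_per_file), 1):
        parts = []
        for pid, pname in products[start:start + n_per_file]:
            # Level-1 heading marks the start of a new product.
            parts.append(f"# Product: {pname} (ID: {pid})\n")
            parts.append(f"## Model List - {pname}\n")
            models = cur.execute(
                "SELECT id, bianhao, price, status, shuxing "
                "FROM venshop_prosn WHERE product_id = ?", (pid,)).fetchall()
            for mid, bianhao, price, status, shuxing in models:
                # Level-3 heading repeats the product context for each model.
                parts.append(f"### Model: {bianhao} "
                             f"(Belongs to Product: {pname}, ID: {pid})")
                parts.append(f"*   **Internal ID (prosn.id):** {mid}")
                parts.append(f"*   **Belongs to Product:** {pname} (Product ID: {pid})")
                parts.append(f"*   **Price:** {price:.2f}")
                parts.append(f"*   **Status:** {STATUS_TEXT.get(status, 'Unknown')}")
                parts.append("*   **Attributes:**")
                parts.append(format_attributes(shuxing))
                parts.append("")
        yield f"products_group_{group_no:03d}.md", "\n".join(parts)
```

Writing the output is then just `for name, content in export_markdown(conn): open(name, "w", encoding="utf-8").write(content)`, and the generator structure makes it trivial to dump only the first file or two while iterating on the format.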

Summary of Key Lessons Learned:

  1. Think About the End Goal: Always remember these Markdown files are ultimately for AI "consumption," especially needing to pass through Embedding and Chunking smoothly. Design the structure with downstream processing in mind.
  2. Structure is King: Clear, uniform, logical Markdown heading hierarchies are the lifeline for ensuring the AI can "segment" (chunk) correctly.
  3. Don't Lose Context: Every information fragment (especially fine-grained ones like models) must include context identifying which product it belongs to, preventing "orphaned" Chunks.
  4. Automation is a Necessity: With any significant data volume, forget manual processing. Scripts are the only way to get both efficiency and accuracy.
  5. Test-Driven Development (TDD... sorta): After generating a sample set of Markdown files, don't rush to full-scale processing. First, run sample files through your Embedding pipeline (including the chunking step) to check if the resulting Chunks meet expectations. If chunking is incorrect, adjust the Markdown generation logic and structure until satisfied before large-scale generation.
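One cheap smoke test worth running before full-scale generation: verify that every `### Model:` heading in the output actually carries its product context, per the consistency rule above. A tiny regex check along these lines (the pattern mirrors the heading format this post settled on; it is my own sketch, not part of any framework) catches format drift before it "derails" the chunker:

```python
# Smoke-test generated Markdown: flag any "###" heading that is missing the
# "(Belongs to Product: ..., ID: ...)" context suffix. The regex encodes the
# heading format chosen in this post.
import re

HEADING_RE = re.compile(r"^### Model: \S+ \(Belongs to Product: .+, ID: \d+\)$")

def check_model_headings(markdown: str) -> list[str]:
    """Return the '###' heading lines that do NOT match the agreed format."""
    return [line for line in markdown.splitlines()
            if line.startswith("### ") and not HEADING_RE.match(line)]

good = "### Model: 3L110-06 (Belongs to Product: Manual Valve, ID: 270)"
bad = "### Model: 3L110-06"  # missing product context: orphaned-chunk risk
problems = check_model_headings(good + "\n" + bad)
# problems contains only the context-less heading.
```

Run a check like this over every generated file, then feed a sample through the actual embedding/chunking pipeline as described in point 5.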

Finally, I've sorted out the process and thoughts from this endeavor. I hope this record of "pitfalls" can offer some inspiration or reference to friends currently or soon facing similar data conversion challenges. In practice, using this method to generate Markdown files not only makes management relatively easier but also allows the AI to better understand and utilize the structured information, which I believe will significantly help improve the accuracy of subsequent Q&A systems.