
Practical Notes: "Translating" Massive Product Data into an AI-Understandable Markdown Knowledge Base

I've been tinkering with something lately: how to turn our company's massive industrial product data into "food" that AI can "eat" and understand. The ultimate goal is to create a smart customer service or product Q&A bot.

The data at hand mainly consists of two tables: product, which stores basic information for over 8,000 products, and prosn, which records the specific models of each product, about 3 million models in total. Each model also carries a price, weight, serial number, and a set of attribute parameters.

The core requirement is that when a user asks about a model number, AI should be able to quickly and accurately retrieve the relevant information.

After some thought, using Markdown files as the AI's "textbook" (i.e., the knowledge base) seemed like a good idea. Why? Markdown is simple and easy for humans to read, and it's also relatively easy for AI to process. But the question is: how do we elegantly convert the data from these two database tables into structured, AI-friendly Markdown?

I did stumble upon a few pitfalls during the process, but I also figured out some tricks. I'd like to share my tinkering process and the final solution with you all.

Initial Thought: Quick and Dirty, One Product Per .md File?

Intuitively, the clearest solution would be to create a separate Markdown file for each product (one row in the product table). The file name could be something like ProductID_ProductName.md, which is clear at a glance.

I also thought about the structure inside the file, which would look something like this:

```markdown
# Product: XXX Sensor (ID: 123)
**Serial Number:** ABC-1000
**Category:** Pressure Sensor
**Brand:** DALI
... (Basic information such as product description, details, image links, etc.) ...

---
## Model List for This Product
---
### Model: ABC-1001
*   **Number:** ABC-1001
*   **Price:** $500
*   **Weight:** 0.2kg
*   **Attributes:**
    *   Range: 0-10 Bar
    *   Output: 4-20mA
... (Other detailed information for this model) ...
---
### Model: ABC-1002
... (Same structure, list the next model) ...
```

Looks pretty good, right? Clear structure, clean logic.

The Problem: 8,000+ Files in Your Face. Who Can Handle That?

But reality is harsh – there are over 8,000 products! If we really followed this plan, the folder would instantly fill up with more than 8,000 .md files. Just imagine managing them: finding, updating, and maintaining that many files would be a nightmare.

This path is obviously not feasible.

Change of Mind: Can Multiple Products Be Packed into One File?

So, can we consolidate and package the information of multiple products into one Markdown file? For example, combine every 10, 20, or even 50 products into a single file. That would sharply reduce the number of files (at 50 products per file, 8000 / 50 = 160 files), which is very manageable!

This idea feels much more workable! But a new problem arises: with so much content in one file, how does the AI know which piece of information belongs to which product? Will different models' information get mixed together and confuse the AI?

This means the internal structure of the Markdown file has to meet a higher bar: the content must be separated in a very clear and consistent way.

The Key "Epiphany": Don't Forget How AI "Reads" Documents! (Embedding & Chunking)

Right here, a core part of how AI processes documents came to mind: Embedding. Simply put, AI doesn't read Markdown word by word like a human. It usually chunks the document, breaking it into meaningful text fragments (Chunks), and then converts each fragment into a string of numbers (a vector) so that similarity calculations can be used for retrieval and Q&A.

This realization made two things clear:

  1. The Chunking Strategy is crucial! If the splits land badly, for example cutting one complete model's information in half across two Chunks, or mixing partial information from two unrelated models into a single Chunk, the AI is prone to retrieving the wrong fragment or attributing details to the wrong model when answering questions.
  2. The Markdown structure we design must, in turn, serve this chunking process! The goal is to let the chunking tool split the document exactly as we intend (ideally, each model's information ends up as an independent, complete Chunk).

Digging a little deeper, I found that many chunking tools in RAG (Retrieval-Augmented Generation) frameworks support splitting documents by Markdown heading level (#, ##, ###, ####, etc.). That's exactly what we need! We can use the heading levels to organize the content and guide the splitter to cut the document in the right places.
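
As a concrete illustration, here is a minimal sketch using LangChain's MarkdownHeaderTextSplitter, one such header-aware splitter; the sample text and the metadata labels are only illustrative, and any framework with a similar feature should behave the same way.

```python
# Minimal sketch: header-based chunking with LangChain's MarkdownHeaderTextSplitter
# (pip install langchain-text-splitters). Sample text and labels are illustrative.
from langchain_text_splitters import MarkdownHeaderTextSplitter

sample_md = """
# Product: Two-Position Three-Way Hand Pull Valve (ID: 270)

## Model List - Two-Position Three-Way Hand Pull Valve

### Model: 3L110-06 (Belongs to Product: Two-Position Three-Way Hand Pull Valve, ID: 270)
*   **Price (Price):** 27.00

### Model: 3L210-06 (Belongs to Product: Two-Position Three-Way Hand Pull Valve, ID: 270)
*   **Price (Price):** 32.00
"""

# Split on #, ##, ### so that each model ideally becomes its own chunk,
# with the product and section headings carried along as metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "product"), ("##", "section"), ("###", "model")],
    strip_headers=False,  # keep the heading text inside the chunk for extra context
)

for chunk in splitter.split_text(sample_md):
    print(chunk.metadata)        # e.g. {'product': '...', 'section': '...', 'model': '...'}
    print(chunk.page_content[:80])
    print("---")
```

If the metadata on each chunk already names the product and the model, the heading structure is doing its job.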

Final Solution: Clever Use of Heading Levels to Build Structured "AI Food"

Combining the two ideas of "multi-product merged files" and "using heading levels to assist chunking," the final solution is as follows:

  1. File Organization Strategy: Merge every N products (say 10 or 20; N can be tuned based on later test results) into one .md file. File names can follow a standard pattern, such as products_group_001.md, products_group_002.md, and so on.
  2. Internal File Structure (this is the core of the core!):
    • Use the top-level heading # to mark the beginning of each new product. For example: # Product: Two-Position Three-Way Hand Pull Valve (ID: 270). This is the most important separator between products.
    • Use second-level headings ## to organize the different information areas within a product. For example: ## Product Overview, ## Product Image Links, ## Model List - Two-Position Three-Way Hand Pull Valve. This keeps a product's internal information organized.
    • Use third-level headings ### to mark each specific model. This is the key to letting the AI accurately locate and answer model-related questions! For example: ### Model: 3L110-06 (Belongs to Product: Two-Position Three-Way Hand Pull Valve, ID: 270).
      • Pay attention, this is the key point! In the model's ### heading, or right at the start of the content that follows it, you absolutely must write out the product information the model belongs to (product name, product ID)! This gives every chunk that may be split off enough context on its own. Otherwise, if the AI only gets a Chunk for a single model, it may have no idea which product that model belongs to.
    • If a model's information is particularly complex with many fields, you can also use #### headings to subdivide further, such as #### Detailed Parameters and #### Price and Inventory. This gives the resulting Chunks finer granularity and more focused content.

So, the final Markdown file structure looks something like this (based on the previous example):

```markdown
# Product: Two-Position Three-Way Hand Pull Valve (ID: 270)

## Product Overview
*   **Series Number:** 3L
*   **Category:** Hand Pull Valve
*   **Brand:** Airtac
... (Other basic product information)

## Product Image Links
*   /path/to/image1.jpg
*   /path/to/image2.jpg

## Model List - Two-Position Three-Way Hand Pull Valve

### Model: 3L110-06 (Belongs to Product: Two-Position Three-Way Hand Pull Valve, ID: 270)
*   **Internal ID (prosn.id):** 270
*   **Model Number (Bianhao):** 3L110-06
*   **Product Information:** Two-Position Three-Way Hand Pull Valve (Product ID: 270)  <-- **Look! Contextual information is here, very important!**
*   **Price (Price):** 27.00
*   **Interface Size:** PT1/8
*   **Status:** On Sale
... (Other model attributes, parameters, etc.)

### Model: 3L210-06 (Belongs to Product: Two-Position Three-Way Hand Pull Valve, ID: 270)
*   **Internal ID (prosn.id):** 271
*   **Model Number (Bianhao):** 3L210-06
*   **Product Information:** Two-Position Three-Way Hand Pull Valve (Product ID: 270)
*   **Price (Price):** 32.00
*   **Interface Size:** PT1/8
*   **Status:** Presale (Delivery Time: Within 3 Days)
...

---

# Product: High-Speed Cylinder (ID: 271)

## Product Overview
... (Same structure) ...

## Model List - High-Speed Cylinder

### Model: HGC-20-100 (Belongs to Product: High-Speed Cylinder, ID: 271)
*   **Internal ID (prosn.id):** 350
*   **Model Number (Bianhao):** HGC-20-100
*   **Product Information:** High-Speed Cylinder (Product ID: 271)
*   **Price (Price):** 150.00
*   **Bore Diameter:** 20mm
*   **Stroke:** 100mm
...
```

Some "Pits" and Points to Note in Actual Operation:

  • Data "Translation" Is a Must: Codes stored in the database, such as status (e.g., 0, 1, 2), huoqi (e.g., 1, 2, 3, 4 for different delivery times), and pricetype (e.g., 1 for a fixed real price, 2 for negotiable), must be converted into text that humans and AI can read directly (e.g., "Discontinued," "Delivery within 3 days," "Real Price") when generating the Markdown.
  • Associated Information Must Be Resolved: Don't just drop a category_id or pinpai_id into the Markdown; look up the corresponding names in the associated category and brand tables beforehand (such as "Pressure Sensor," "DALI") and write those in to provide richer context.
  • Special Field Formatting: Text stored as "one attribute per line, attribute name = attribute value" in the shuxing field needs to be parsed by a script and converted into a Markdown unordered list. Likewise, the comma-separated image paths in the pic field are best turned into a list as well.
  • Content Should Be Simplified and Focused: Not every database field is useful for Q&A. Fields like seo_title, seo_keywords, views, and buys are mainly for website operation or statistics; they don't help the AI answer questions about the product itself, so consider leaving them out of the Markdown to keep the knowledge base "pure."
  • Multilingual Support: If your product information needs to support English, you can add the English content below the corresponding Chinese block (for example, from the fields with the _en suffix in the database), following the same Markdown structure.
  • Consistency! Consistency! Consistency! It matters enough to say three times: all products and all models must strictly follow exactly the same heading levels and format conventions. Any ad-hoc changes or inconsistent formatting may cause the automatic chunking tool to choke and produce messy Chunks.
  • Script Automation Is a Must: The data volume makes manual conversion impossible, so you have to write a script (Python, PHP, Node.js, or whatever language you're comfortable with). The core logic of the script is roughly the following (a sketch follows after this list):
    1. Connect to the database.
    2. Set up a counter count and a file handle file_handler, which together control how many products go into each file.
    3. Loop through the product records in the venshop_product table.
    4. For each product, query the venshop_prosn table by product_id to fetch all the model records under it.
    5. Following the Markdown structure designed above, assemble the current product's basic information and all of its models' details into a formatted string.
    6. Remember to add a level-1 heading such as # Product Name (ID: xxx) before each product's information.
    7. Add a level-3 heading such as ### Model: Model Name (Belongs to Product: Product Name, ID: xxx) before each model's information, and make sure it contains the key contextual information.
    8. Write the assembled string to the currently open file file_handler.
    9. Increment the product counter count. When count reaches N, close the current file, open a new Markdown file (e.g., bump the number in the file name), and reset count to 0.
    10. After all products are processed, make sure to close the last file.
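
Here is a minimal Python sketch of that loop. It assumes a MySQL database accessed via pymysql; the table names (venshop_product, venshop_prosn) and the columns used below (name, category_name, brand_name, bianhao, price, status, huoqi, pricetype, shuxing), as well as the code-to-text mappings and connection details, are placeholders to adapt to your own schema.

```python
# Simplified export-script sketch. Assumes MySQL + pymysql; table/column names
# and the code-to-text mappings are illustrative placeholders.
import pymysql

PRODUCTS_PER_FILE = 20  # the "N" discussed above; tune it after chunking tests

# Hypothetical code-to-text mappings ("data translation")
STATUS_MAP = {0: "Discontinued", 1: "On Sale", 2: "Presale"}
HUOQI_MAP = {1: "In Stock", 2: "Delivery within 3 days",
             3: "Delivery within 7 days", 4: "Delivery within 15 days"}
PRICETYPE_MAP = {1: "Real Price", 2: "Negotiable"}


def parse_shuxing(raw):
    """Turn 'name=value' lines from the shuxing field into a Markdown list."""
    pairs = (line.split("=", 1) for line in (raw or "").splitlines() if "=" in line)
    return "\n".join(f"    *   {k.strip()}: {v.strip()}" for k, v in pairs)


def render_product(product, models):
    """Assemble one product block: # product heading, ## sections, ### per model."""
    parts = [f"# Product: {product['name']} (ID: {product['id']})", "", "## Product Overview",
             f"*   **Category:** {product.get('category_name', '')}",
             f"*   **Brand:** {product.get('brand_name', '')}", "",
             f"## Model List - {product['name']}", ""]
    for m in models:
        # Repeat the product name/ID in the ### heading so each chunk keeps its context.
        parts += [
            f"### Model: {m['bianhao']} (Belongs to Product: {product['name']}, ID: {product['id']})",
            f"*   **Model Number (Bianhao):** {m['bianhao']}",
            f"*   **Product Information:** {product['name']} (Product ID: {product['id']})",
            f"*   **Price (Price):** {m['price']} ({PRICETYPE_MAP.get(m.get('pricetype'), 'Unknown')})",
            f"*   **Status:** {STATUS_MAP.get(m.get('status'), 'Unknown')}, {HUOQI_MAP.get(m.get('huoqi'), '')}",
            "*   **Attributes:**",
            parse_shuxing(m.get("shuxing", "")),
            "",
        ]
    return "\n".join(parts) + "\n---\n\n"


conn = pymysql.connect(host="localhost", user="user", password="***",
                       database="shop", cursorclass=pymysql.cursors.DictCursor)
cur = conn.cursor()
cur.execute("SELECT * FROM venshop_product ORDER BY id")
products = cur.fetchall()

count, group = 0, 1
out = open(f"products_group_{group:03d}.md", "w", encoding="utf-8")
for product in products:
    cur.execute("SELECT * FROM venshop_prosn WHERE product_id = %s", (product["id"],))
    out.write(render_product(product, cur.fetchall()))
    count += 1
    if count >= PRODUCTS_PER_FILE:   # roll over to the next file
        out.close()
        group, count = group + 1, 0
        out = open(f"products_group_{group:03d}.md", "w", encoding="utf-8")
out.close()  # don't forget the last file
cur.close()
conn.close()
```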

A Few Key Lessons Learned:

  1. Endgame Thinking: Always remember that these Markdown files are ultimately "consumed" by AI, and in particular have to pass cleanly through the Embedding and Chunking stages. Design the structure with that downstream processing in mind.
  2. Structure Is King: A clear, unified, logical Markdown heading hierarchy is the lifeline that lets the chunking tool split the document correctly.
  3. Context Cannot Be Lost: Every piece of information (especially fine-grained pieces like individual models) must carry context identifying what it belongs to (which product), to prevent "orphan" Chunks.
  4. Automation Is a Must: With any non-trivial amount of data, forget manual processing. Just write a script; it guarantees both efficiency and accuracy.
  5. Test-Driven Development (TDD... sorta): After generating part of the Markdown, don't rush to convert everything at once. First take a sample file and run it through your Embedding pipeline (including the chunking step) to check whether the resulting Chunks meet expectations (see the snippet below). If the splits are wrong, adjust the Markdown generation logic and structure until you're satisfied, then generate at scale.
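
As a quick sanity check, here is a small follow-up sketch (reusing the hypothetical MarkdownHeaderTextSplitter setup from the earlier example) that loads one generated sample file and verifies that every chunk still knows which product it belongs to:

```python
# Sanity-check a generated sample file before converting everything,
# reusing the header-based splitter configuration from the earlier sketch.
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "product"), ("##", "section"), ("###", "model")],
    strip_headers=False,
)

with open("products_group_001.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks from the sample file")
for chunk in chunks:
    # Every chunk should carry a product heading in its metadata; an "orphan"
    # chunk here means the Markdown structure (or the splitter config) is off.
    assert "product" in chunk.metadata, f"orphan chunk: {chunk.page_content[:60]!r}"
```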

Finally, that's the whole process and reasoning behind this round of tinkering. I hope this record of my "pits" offers some inspiration or reference to anyone dealing with, or about to deal with, similar data-conversion problems. In practice, the Markdown files generated this way are not only relatively easy to manage, but also let the AI better understand and use the structured information in them. I believe this will go a long way toward improving the accuracy of the downstream Q&A system.