Understanding File Upload over HTTP in Depth

Introduction to Content-Type

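The Content-Type entity header indicates the MIME type (Multipurpose Internet Mail Extensions) of a resource.
A MIME type, usually called a media type or content type, is a string sent along with a file to indicate what kind of file it is. For example, a sound file might be labeled audio/ogg and an image file image/png. Examples: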

Content-Type: text/html; charset=utf-8
Content-Type: multipart/form-data; boundary=something

See the list of MIME types.
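As a small illustration of how the two header values above break down into a media type plus parameters, here is a sketch using Python's standard email package. The helper name parse_content_type is made up for this example and is not part of any library.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a Content-Type header value into (media_type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns [(media_type, ''), (param, value), ...];
    # the first tuple is the media type itself, so skip it.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


html = parse_content_type("text/html; charset=utf-8")
form = parse_content_type("multipart/form-data; boundary=something")
print(html)
print(form)
```

The boundary parameter extracted here is what a server needs in order to split a multipart/form-data body into its parts.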

Content-Type for File Uploads

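multipart/form-data and application/octet-stream are two different HTTP Content-Type values, each used for a different file upload scenario:

  • multipart/form-data is the standard way to send form data and files in an HTTP request.

    With this type, the HTTP request body is split into multiple parts, each containing one form field or one file. The parts are separated by a specific delimiter (the boundary) so that the server can parse the request correctly.

  • application/octet-stream is a generic MIME type that denotes a stream of binary data.

    It is typically used to send binary data without any accompanying metadata, such as image, audio, or video files. With application/octet-stream, the HTTP request body contains the raw binary stream and nothing else.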

application/octet-stream Example

The octet-stream subtype is used to indicate that a body contains arbitrary binary data. It has two optional parameters, TYPE and PADDING.

When a file is uploaded to Huawei OBS object storage with an HTTP PUT request, the file content is the entire body of the PUT request:

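import logging
import os

import aiohttp

log = logging.getLogger(__name__)


async def obs_put(file_path):
    size = os.path.getsize(file_path)
    # Skip files larger than 5 GiB.
    if size > 5 * 1024 * 1024 * 1024:
        log.warning(f"log file {file_path} is too large")
        return

    # generate_http_info is defined elsewhere (not shown here); it returns
    # the base URL, the object URI, and the request headers.
    base_url, uri, headers = await generate_http_info(file_path, size)
    async with aiohttp.ClientSession(base_url) as client:
        # The open file object is the request body; aiohttp streams it.
        with open(file_path, 'rb') as f:
            async with client.put(uri, data=f, headers=headers) as resp:
                assert resp.status == 200, \
                    f"upload {uri} failed! error: {await resp.text()}"
    return f"{base_url}{uri}"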

Note:
If you pass a file object as the data parameter, aiohttp will stream it to the server automatically (see "Streaming uploads" in the aiohttp documentation).
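To make the "body is just the file bytes" point concrete, the following sketch uses only the Python standard library instead of aiohttp and OBS. The local test server, the /object-key path, and the names PutHandler and put_bytes are assumptions made for this illustration; they do not reflect how OBS itself behaves.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = {}  # filled in by the handler so the request can be inspected


class PutHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        received["content_type"] = self.headers["Content-Type"]
        received["body"] = self.rfile.read(length)  # raw body, no parsing needed
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def put_bytes(payload: bytes) -> int:
    """PUT payload to a throwaway local server and return the status code."""
    server = HTTPServer(("127.0.0.1", 0), PutHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request(
            "PUT",
            "/object-key",  # hypothetical object path
            body=payload,
            headers={"Content-Type": "application/octet-stream"},
        )
        resp = conn.getresponse()
        resp.read()
        conn.close()
        return resp.status
    finally:
        server.shutdown()
        server.server_close()


payload = bytes(range(256))
status = put_bytes(payload)
print(status, received["body"] == payload)
```

The server reads exactly Content-Length bytes and gets the payload back unchanged: there are no boundaries or per-part headers to strip.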

Definition of multipart/form-data

In many applications, it is possible for a user to be presented with a form. The user will fill out the form, including information that is typed, generated by user input, or included from files that the user has selected. When the form is filled out, the data from the form is sent from the user to the receiving application. The definition of multipart/form-data is derived from one of those applications.

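Common HTML form elements:

  • Text box: <input type="text">
  • Password box: <input type="password">
  • Checkbox: <input type="checkbox">
  • Radio button: <input type="radio">
  • Drop-down list: <select>
  • Text area: <textarea>

A form can submit its data to the server with one of two methods: GET and POST. The GET method appends the form data to the end of the URL and suits small amounts of non-sensitive data. The POST method places the form data in the HTTP request body and suits large or sensitive data.

Form data must be encoded before it is submitted. HTML forms support two encoding types: application/x-www-form-urlencoded and multipart/form-data. The former is used for ordinary form data (key-value pairs), the latter for forms that include file uploads.

In the application/x-www-form-urlencoded format, form data is encoded as key-value pairs: each key is joined to its value with an equals sign (=), and pairs are separated by an ampersand (&). The format also applies URL encoding (also called percent-encoding) to certain characters. For example, a space is encoded as + and the special character @ is encoded as %40.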
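The encoding rules just described can be tried out with urllib.parse from the Python standard library. This is only a sketch; the field names and values are invented for the example.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form fields, chosen to include a space and an "@".
fields = {"name": "Joe Smith", "email": "joe@example.com"}

encoded = urlencode(fields)   # key=value pairs joined with "&"
decoded = parse_qs(encoded)   # back to {key: [values]}

print(encoded)
print(decoded)
```

The space becomes + and the @ becomes %40, while parse_qs reverses both substitutions on the receiving side.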

A multipart/form-data body contains a series of parts separated by a boundary.

  1. Boundary Parameter of multipart/form-data
    As with other multipart types, the parts are delimited with a boundary delimiter, constructed using CRLF, "--", and the value of the boundary parameter.

  2. Content-Disposition Header Field for Each Part
    Each part MUST contain a Content-Disposition header field [RFC2183] where the disposition type is form-data. The Content-Disposition header field MUST also contain an additional parameter of name; the value of the name parameter is the original field name from the form (possibly encoded; see Section 5.1).
    In most multipart types, the MIME header fields in each part are restricted to US-ASCII; for compatibility with those systems, file names normally visible to users MAY be encoded using the percent-encoding method.

  3. Content-Type Header Field for Each Part
    Each part MAY have an (optional) Content-Type header field, which defaults to text/plain. If the contents of a file are to be sent, the file data SHOULD be labeled with an appropriate media type, if known, or application/octet-stream.

  4. The Charset Parameter for text/plain Form Data
    In the case where the form data is text, the charset parameter for the text/plain Content-Type MAY be used to indicate the character encoding used in that part:

    --AaB03x
    content-disposition: form-data; name="field1"
    content-type: text/plain;charset=UTF-8
    content-transfer-encoding: quoted-printable

    Joe owes =E2=82=AC100.
    --AaB03x

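    Content-Transfer-Encoding states how the data is encoded so that it can pass through different transport protocols. Some protocols were not designed to carry binary data or special characters, so a specific encoding such as Base64 or Quoted-Printable is needed to keep the data intact between sender and receiver.
    For example, an HTML email containing non-ASCII characters typically uses Content-Transfer-Encoding: quoted-printable so that every character is transmitted correctly. Attachments such as images or PDF files are typically sent with Content-Transfer-Encoding: base64.

    Both Base64 and Quoted-Printable exist to convert non-ASCII or binary data into a form that can be handled in an ASCII-only environment, so that the data can travel over protocols such as email that only support ASCII. Email was originally designed to carry text only.

    • Base64: represents binary data using 64 printable characters, turning every 3 bytes of input into 4 output characters. It is used for arbitrary binary data.
    • Quoted-Printable: mainly used to encode the non-ASCII characters in mail. Each non-ASCII byte is converted to = followed by two hexadecimal digits.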
  5. Other Content- Header Fields
    The multipart/form-data media type does not support any MIME header fields in parts other than Content-Type, Content-Disposition and Content-Transfer-Encoding.
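The quoted-printable line in the example under item 4, and the Base64 encoding described next to it, can be checked with Python's standard quopri and base64 modules. This is a small sketch; the three-byte input to Base64 is an arbitrary choice for the example.

```python
import base64
import quopri

# The payload line from the example part above.
qp_line = b"Joe owes =E2=82=AC100."
text = quopri.decodestring(qp_line).decode("utf-8")
print(text)  # the three =XX escapes are the UTF-8 bytes of the euro sign

# Round trip: encode the text again and decode it back.
raw = text.encode("utf-8")
round_trip = quopri.decodestring(quopri.encodestring(raw))

# Base64 maps every 3 input bytes to 4 printable characters.
b64 = base64.b64encode(bytes([0xFF, 0xD8, 0xFF]))
print(b64)
```

Quoted-Printable leaves the ASCII text readable and only escapes the three non-ASCII bytes, whereas Base64 re-encodes everything, which is why it is the usual choice for binary content.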

multipart/form-data Script Example

Multipart-encoded files can be sent with Python's aiohttp module:

import os

import aiohttp


async def upload_attach(ci_platform, attach):
    # ci_platform is assumed to be an aiohttp.ClientSession created with
    # the platform's base URL.
    uri = "/api/enclosure/upload"
    with open(attach, 'rb') as f:
        form = aiohttp.FormData()
        form.add_field(
            'file',
            f,
            filename=os.path.basename(attach),
            content_type='application/octet-stream'
        )
        async with ci_platform.post(uri, data=form) as resp:
            assert resp.status == 200, f"upload {attach} failed!"
            result = await resp.json()
    return result["data"]["url"]

A file upload exchange captured with Wireshark looks like this:

POST /api/enclosure/upload HTTP/1.1
Host: 1.66.115.80:80
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.12 aiohttp/3.8.5
Authorization: Basic 5GFvLldhbmz3ODphN2M6MDQ9MDEqZjBkNjU3YjE5YzBiJTAzYjA0ZTQyNDlk033mNXDi
Content-Length: 284
Content-Type: multipart/form-data; boundary=65ef653de6e740829bf661a98e2d72f5

--65ef653de6e740829bf661a98e2d72f5
Content-Type: application/octet-stream
Content-Disposition: form-data; name="file"; filename="param.txt"
Content-Length: 84

C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -WindowStyle Hidden -File

--65ef653de6e740829bf661a98e2d72f5--
HTTP/1.1 200 OK
Date: Mon, 13 Nov 2024 13:50:04 GMT
Server: WSGIServer/0.2 CPython/3.8.0
Content-Type: application/json
X-Frame-Options: ALLOWALL
Content-Length: 201
X-Content-Type-Options: nosniff
Referrer-Policy: same-origin

{"code": 0, "message": null, "data": {"url": "https://my-mitio-mooklci-uat-uis.ersvp4.bj-mkr.taolife.com/534635982b7ae5a2ea175e4bf0750a43/param.txt", "info": "\u4e0a\u4f20\u6210\u529f!"}}
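To connect the captured request with the rules listed earlier, here is a sketch that assembles a body of the same shape by hand: a "--" plus boundary delimiter line before each part, a Content-Disposition header per part, a blank line, the content, and a closing delimiter with a trailing "--". The function encode_multipart is written for this illustration and is not how aiohttp builds its bodies internally. It assumes that field names and file names contain no quotes or line breaks.

```python
import uuid

CRLF = b"\r\n"


def encode_multipart(fields, files, boundary=None):
    """Build a multipart/form-data body.

    fields: {name: text value}
    files:  {name: (filename, content bytes, media type)}
    Returns (body bytes, Content-Type header value).
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",  # blank line separates part headers from part content
            value.encode("utf-8"),
        ]
    for name, (filename, content, media_type) in files.items():
        disposition = (
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'
        )
        lines += [
            delimiter,
            disposition.encode("utf-8"),
            f"Content-Type: {media_type}".encode("ascii"),
            b"",
            content,
        ]
    lines += [delimiter + b"--", b""]  # closing delimiter, then final CRLF
    return CRLF.join(lines), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"field1": "Joe"},
    {"file": ("param.txt", b"hello", "application/octet-stream")},
    boundary="AaB03x",
)
print(content_type)
print(body.decode("ascii"))
```

Unlike the capture, this sketch does not add a per-part Content-Length header, and the order of the part headers differs. The delimiter structure is the same.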

The meaning of X-Content-Type-Options: nosniff is as follows:
The X-Content-Type-Options response HTTP header is a marker used by the server to indicate that the MIME types advertised in the Content-Type headers should be followed and not be changed. The header allows you to avoid MIME type sniffing by saying that the MIME types are deliberately configured.

Percent-Encoding Option:
percent-encoding (as defined in RFC3986) is offered as a possible way of encoding characters in file names that are otherwise disallowed, including non-ASCII characters, spaces, control characters, and so forth. The encoding is created replacing each non-ASCII or disallowed character with a sequence, where each byte of the UTF-8 encoding of the character is represented by a percent-sign (%) followed by the (case-insensitive) hexadecimal of that byte.
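As a quick illustration of this option, urllib.parse.quote from the Python standard library performs the same byte-wise UTF-8 percent-encoding. The file name below is invented for the example.

```python
from urllib.parse import quote, unquote

filename = "naïve file.txt"  # hypothetical name with a non-ASCII char and a space

encoded = quote(filename)
print(encoded)

# The hexadecimal digits are case-insensitive, so both forms decode the same.
upper = unquote("na%C3%AFve%20file.txt")
lower = unquote("na%c3%afve%20file.txt")
print(upper == lower == filename)
```

The character ï is two bytes in UTF-8 (C3 AF), so it becomes two percent-sequences, while the space becomes %20.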


https://www.tao-wt.fun/upload_file_via_http/
Author: tao-wt@qq.com
Published: January 5, 2024