Steps to Upgrade and Clean Up Kernels
Typical steps to upgrade the kernel version. CentOS upgrade steps:
yum makecache -y
yum update -y
grub2-editenv list
grub2-set-default 'CentOS Linux (3.10.xxxxx.el7.elrepo.x86_64) 7 (Core)' # entry_name
systemctl reboot
Steps to clean up old versions (RHEL or CentOS):
rpm -qa kernel*
# This lists every kernel version currently installed; then remove the corresponding packages manually.
You can remove the unwanted versions directly with yum:
yum remove -y kernel-devel-5.10.216-204.855.amzn2.x86_64 kernel-devel-5.10.218-208.862.amzn2.x86_64 kernel-5.10.216-204.855.amzn2.x86_64 kernel-5.10.218-208.862.amzn2.x86_64
rpm -qa | grep kernel
kernel-tools-5.10.219-208.866.amzn2.x86_64
kernel-headers-5.10.219-208.866.amzn2.x86_64
kernel-devel-5.10.219-208.866.amzn2.x86_64
kernel-5.10.219-208.866.amzn2.x86_64
List the /boot directory to confirm the cleanup:
ls -alh /boot/
total 29M
dr-xr-xr-x 4 root root 4.0K Jul 19 15:02 ./
dr-xr-xr-x 19 root root 268 Jul 1 17:32 ../
-rw-r--r-- 1 root root 174 Jun 18 22:04 .vmlinuz-5.10.219-208.866.amzn2.x86_64.hmac
-rw------- 1 root root 4.5M Jun 18 22:04 System.map-5.10.219-208.866.amzn2.x86_64
-rw-r--r-- 1 root root 141K Jun 18 22:04 config-5.10.219-208.866.amzn2.x86_64
drwxr-xr-x 3 root root 17 Oct 14 2022 efi/
drwx------ 5 root root 79 Jul 19 15:02 grub2/
-rw------- 1 root root 14M Jul 9 15:03 initramfs-5.10.219-208.866.amzn2.x86_64.img
-rw-r--r-- 1 root root 643K Oct 14 2022 initrd-plymouth.img
-rw-r--r-- 1 root root 268K Jun 18 22:05 symvers-5.10.219-208.866.amzn2.x86_64.gz
-rwxr-xr-x 1 root root 9.7M Jun 18 22:04 vmlinuz-5.10.219-208.866.amzn2.x86_64*
Of course, if you removed all of them, you can also reinstall (doge):
yum groupinstall -y "Development Tools"
yum install -y kernel kernel-devel kernel-debug
Ubuntu downgrade. The kernel Ubuntu is currently running cannot be removed directly: install the old version, switch to it, then remove the newer one.
root@ip-172-31-59-13:~# update-initramfs -k all -c
update-initramfs: Generating /boot/initrd.img-5.15.0-1048-aws
update-initramfs: Generating /boot/initrd.img-5.4.0-1126-aws
root@ip-172-31-59-13:~# update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/40-force-partuuid.cfg'
Sourcing file `/etc/default/grub.d/50-cloudimg-settings.cfg'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
GRUB_FORCE_PARTUUID is set, will attempt initrdless boot
Found linux image: /boot/vmlinuz-5.15.0-1048-aws
Found initrd image: /boot/microcode.cpio /boot/initrd.img-5.15.0-1048-aws
Found linux image: /boot/vmlinuz-5.4.0-1126-aws
Found initrd image: /boot/microcode.cpio /boot/initrd.img-5.4.0-1126-aws
Found Ubuntu 20.04.6 LTS (20.04) on /dev/nvme0n1p1
done
Check the available kernel versions:
root@ip-172-31-59-13:$ apt search linux-image | grep 5.4.0 | grep linux-image | grep aws
List all installed kernels:
root@ip-172-31-59-13:~$ dpkg --get-selections | grep linux
console-setup-linux install
libselinux1:amd64 install
linux-aws install
linux-aws-5.15-headers-5.15.0-1048 install
linux-aws-headers-5.4.0-1126 install
linux-base install
linux-headers-5.15.0-1048-aws install
linux-headers-5.4.0-1126-aws install
linux-headers-aws install
linux-image-5.15.0-1048-aws install
linux-image-5.4.0-1126-aws install
linux-image-aws install
linux-modules-5.15.0-1048-aws install
linux-modules-5.4.0-1126-aws install
util-linux install
Install the kernel:
root@ip-172-31-59-13:~$ apt install -y linux-image-5.4.0-1126-aws/focal-updates linux-headers-5.4.0-1126-aws
Set the GRUB entry:
root@ip-172-31-59-13:~$ vim /etc/default/grub
The entry variable should be set in the following format:
Advanced options for Ubuntu>Ubuntu, with Linux 5.4.0-1126-aws
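As a sketch (the entry title is the one above; adjust for your installed kernel), the relevant /etc/default/grub line and the follow-up commands might look like this:

# /etc/default/grub -- select a submenu entry by "parent>child" title
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.4.0-1126-aws"
# regenerate grub.cfg and reboot into the selected kernel
update-grub
systemctl reboot

GRUB accepts the submenu>entry title syntax shown here; using a numeric menu index instead also works.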
Kernel cleanup steps - Version 2. Deb package-manager cleanup steps:
List all installed kernel versions: dpkg --list | grep linux-image
List old kernels and automatically remove all except the current one: sudo apt-get autoremove --purge
To remove an old kernel manually, use: sudo apt-get remove --purge linux-image-X.X.X-X-generic (see the sketch below for finding candidates)
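As a sketch, a one-liner for finding removal candidates; note that meta packages such as linux-image-aws will also show up and should be kept:

# list installed kernel images other than the running one
dpkg --list 'linux-image-*' | awk '/^ii/ {print $2}' | grep -v "$(uname -r)"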
Rpm package-manager cleanup steps
List installed kernels: rpm -qa | grep kernel
Install yum-utils, which provides package-cleanup: sudo yum install yum-utils
Keep only the two newest kernels: sudo package-cleanup --oldkernels --count=2
Linux Memory Management Notes
Notes on the memory management subsystem. Using the crash command. This command requires the debuginfo and kernel-debug packages, and possibly gdb as well.
Enable this repository in the repo configuration: rhel-8-baseos-rhui-debug-rpms
The detailed steps are also in this official Red Hat document: https://access.redhat.com/solutions/9907
yum install -y kernel-debuginfo
# This command installs it, but the package is very large.
crash /boot/vmlinuz-$(uname -r)
Use the crash command to inspect the mapping between physical memory (PM) and virtual memory (VM).
The kernel debug files live in /usr/lib/debug/lib/modules/<kernel-version>/.
Run the crash command:
~ # ❯❯❯ crash
crash 7.3.2-4.el8
Copyright (C) 2002-2022 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [592MB]: patching 107327 gdb minimal_symbol values
KERNEL: /usr/lib/debug/lib/modules/4.18.0-477.13.1.el8_8.x86_64/vmlinux [TAINTED]
DUMPFILE: /proc/kcore
CPUS: 2
DATE: Tue Jul 11 17:07:01 CST 2023
UPTIME: 01:04:34
LOAD AVERAGE: 0.15, 0.03, 0.01
TASKS: 226
NODENAME: center
RELEASE: 4.18.0-477.13.1.el8_8.x86_64
VERSION: #1 SMP Thu May 18 10:27:05 EDT 2023
MACHINE: x86_64 (2199 Mhz)
MEMORY: 7.9 GB
PID: 7657
COMMAND: "crash"
TASK: ffff9ce7d835a800 [THREAD_INFO: ffff9ce7d835a800]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
crash> vm -p [pid]
PID: 913 TASK: ffff9ce7c75fd000 CPU: 0 COMMAND: "sshd"
MM PGD RSS TOTAL_VM
ffff9ce7c11b8000 ffff9ce7c75ae000 7604k 76644k
VMA START END FLAGS FILE
ffff9ce7c759f828 55a7fcc7b000 55a7fcd4c000 8000875 /usr/sbin/sshd
VIRTUAL PHYSICAL
55a7fcc7b000 12026b000
55a7fcc7c000 1201df000
55a7fcc7d000 1201ec000
55a7fcc7e000 1200c7000
55a7fcc7f000 120c43000
55a7fcc80000 10fa79000
55a7fcc81000 11fdd3000
55a7fcc82000 11087f000
55a7fcc83000 11fa8d000
55a7fcc84000 10fe05000
55a7fcc85000 110870000
55a7fcc86000 10fa2c000
55a7fcc87000 10f9fc000
55a7fcc88000 10fdab000
55a7fcc89000 11f296000
55a7fcc8a000 1117ec000
55a7fcc8b000 10fdac000
55a7fcc8c000 120c65000
55a7fcc8d000 12011b000
55a7fcc8e000 110714000
55a7fcc8f000 110c83000
55a7fcc90000 110c90000
55a7fcc91000 110d2b000
55a7fcc92000 120730000
55a7fcc93000 12076f000
55a7fcc94000 1207e8000
55a7fcc95000 110c2f000
55a7fcc96000 110c3c000
55a7fcc97000 120650000
55a7fcc98000 1206c1000
55a7fcc99000 120c67000
55a7fcc9a000 120c0f000
55a7fcc9b000 FILE: /usr/sbin/sshd OFFSET: 20000
55a7fcc9c000 11d46d000
55a7fcc9d000 10fe01000
55a7fcc9e000 10fdb9000
55a7fcc9f000 10fde7000
55a7fcca0000 FILE: /usr/sbin/sshd OFFSET: 25000
# The rest of the output is omitted; it is far too long.
This shows the virtual-to-physical mappings; entries without a PHYSICAL column (shown with FILE/OFFSET instead) are pages not currently mapped to physical memory.
Normally the last three hex digits of the virtual and physical addresses match (the offset within a 4 KiB page); with THP (2 MiB huge pages), the last five match.
vtop shows the stored translation and mapping details for a single virtual address:
crash> vtop 55d5473fc000
VIRTUAL PHYSICAL
55d5473fc000 (not accessible)
The rd command reads memory starting at a given virtual address, for a given number of words:
crash> rd 55d54879d000 100
rd: invalid user virtual address: 55d54879d000 type: "64-bit UVADDR"
Using more memory than was allocated causes an out-of-bounds access. For example, if you allocate 1 GB but write more than that, the data lands in address space that does not belong to the process's allocation, and the kernel raises a segfault to terminate the process.
The fault is not necessarily immediate; part of the overflow may actually succeed before an unmapped page is touched.
Anonymous pages are virtual addresses mapped via mmap with the MAP_ANONYMOUS flag. On the first write to an anonymous page, the kernel maps a physical page into the virtual address and zero-fills it.
vm.overcommit_memory = 0: heuristic overcommit (most reasonable allocations are allowed); 1: always overcommit, with no limit on virtual memory; 2: never overcommit beyond a limit computed from a fixed ratio.
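A quick sketch for checking and switching the policy (the mode values are as described above; CommitLimit only takes effect in mode 2):

# current policy and commit accounting
cat /proc/sys/vm/overcommit_memory
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
# mode 2: refuse allocations beyond swap + overcommit_ratio% of RAM
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80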
Notes on Using the GDB Debugger
First, write a program like this:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int *data;
    data = (int *)malloc(100 * sizeof(int)); // allocate a block of 100 ints
    if (data == NULL) {
        // handle allocation failure
        return 1;
    }
    //data[100] = 0;
    // use the data array...
    // e.g., initialize it
    for (int i = 0; i < 100; ++i) {
        data[i] = 0; // initialize each element to 0
    }
    free(data);
    printf("%d\n", data[2]); // use-after-free: reads freed memory, undefined behavior
}
Compile the program:
╰─>$ gcc ./123.c -g -Og
Run the program to be debugged:
╰─>$ ./a.out
971012533
Try debugging it with gdb:
╰─>$ gdb ./a.out -d .
╰─>$ gdb ./a.out -c ./COREDUMP_FILE
Command notes
list - prints 10 lines of source code; repeat it to page through, 10 lines at a time.
(gdb) list
11 return 1;
12 }
13
14 //data[100] = 0;
15
16 // 使用 data 数组...
17 // 例如,初始化数组
18 for (int i = 0; i < 100; ++i) {
19 data[i ...
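Beyond list, a minimal non-interactive sketch for pulling a backtrace out of a core dump (assumes core dumps are enabled; the core file name depends on kernel.core_pattern and may instead be managed by systemd-coredump):

ulimit -c unlimited          # allow core dumps in this shell
./a.out                      # re-run the crashing program
gdb -batch -ex bt -ex 'info locals' ./a.out ./core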
Adding a Red Hat Node to an EKS Cluster, Based on the Packer Steps
Copied from zhojiew's private repository docs, with permission ~
AWS officially provides build scripts based on the Packer tool.
Here we run the relevant steps manually to build a custom AMI based on RHEL 9; the EKS-optimized AMI is reportedly produced with the same steps.
Build the AMI manually. Clone the repository:
cd /home/ec2-user
sudo yum install git -y
git clone https://github.com/awslabs/amazon-eks-ami.git
Set the environment variables:
KUBERNETES_VERSION=1.26.4
KUBERNETES_BUILD_DATE=2023-05-11
BINARY_BUCKET_NAME=amazon-eks
BINARY_BUCKET_REGION=cn-north-1
DOCKER_VERSION=20.10.23-1.amzn2.0.1
CONTAINERD_VERSION=1.6.*
RUNC_VERSION=1.1.5-1.amzn2
CNI_PLUGIN_VERSION=v0.8.6
PULL_CNI_FROM_GITHUB=true
SONOBUOY_E2E_REGISTRY=""
PAUSE_CONTAINER_VERSION=3.5
CACHE_CONTAINER_IMAGES=false
WORKING_DIR=/tmp/worker
TEMPLATE_DIR=/home/ec2-user/amazon-eks-ami
Copy the files and update the kernel (optional):
mkdir -p $WORKING_DIR
mkdir -p $WORKING_DIR/log-collector-script
mkdir -p $WORKING_DIR/bin
mv $TEMPLATE_DIR/files/* $WORKING_DIR/
mv $TEMPLATE_DIR/log-collector-script/linux/eks-log-collector.sh $WORKING_DIR/log-collector-script/
sudo chmod -R a+x $WORKING_DIR/bin/
sudo mv /tmp/worker/bin/* /usr/bin/
# sudo bash $TEMPLATE_DIR/scripts/upgrade_kernel.sh
KERNEL_VERSION=5.10
sudo grubby \
--update-kernel=ALL \
--args="psi=1"
sudo grubby \
--update-kernel=ALL \
--args="clocksource=tsc tsc=reliable"
sudo reboot
The main build logic lives in the install-worker.sh script:
# sudo bash $TEMPLATE_DIR/scripts/install-worker.sh
export AWS_DEFAULT_OUTPUT="json"
ARCH="amd64"
sudo yum update -y
sudo yum install -y \
chrony \
conntrack \
curl \
ethtool \
ipvsadm \
jq \
nfs-utils \
socat \
unzip \
wget \
yum-utils \
yum-plugin-versionlock \
mdadm \
pigz
# Remove any old kernel versions.
sudo package-cleanup --oldkernels --count=1 -y
# Remove the ec2-net-utils package
if yum list installed | grep ec2-net-utils; then sudo yum remove ec2-net-utils -y -q; fi
sudo mkdir -p /etc/eks/
sudo mv $WORKING_DIR/configure-clocksource.service /etc/eks/configure-clocksource.service
# iptables
sudo mv $WORKING_DIR/iptables-restore.service /etc/eks/iptables-restore.service
# awscli
sudo yum install less unzip jq -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update
complete -C '/usr/local/bin/aws_completer' aws
# systemd
sudo mv "${WORKING_DIR}/runtime.slice" /etc/systemd/system/runtime.slice
Build and install runc:
# install runc and lock version
# sudo yum install -y runc-${RUNC_VERSION}
sudo yum install libseccomp-devel.x86_64 golang -y
go env -w GOPROXY=https://goproxy.io,direct
git clone https://github.com/opencontainers/runc
cd runc
make
sudo make install
Install containerd:
# install containerd and lock version
sudo yum install -y yum-utils device-mapper-persistent-data lvm2
# sudo yum install -y containerd-${CONTAINERD_VERSION}
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
sudo yum install -y containerd # 1.6.21
Configure containerd:
sudo mkdir -p /etc/eks/containerd
sudo mv $WORKING_DIR/containerd-config.toml /etc/eks/containerd/containerd-config.toml
# containerd and related service
sudo mv $WORKING_DIR/kubelet-containerd.service /etc/eks/containerd/kubelet-containerd.service
sudo mv $WORKING_DIR/sandbox-image.service /etc/eks/containerd/sandbox-image.service
sudo mv $WORKING_DIR/pull-sandbox-image.sh /etc/eks/containerd/pull-sandbox-image.sh
sudo mv $WORKING_DIR/pull-image.sh /etc/eks/containerd/pull-image.sh
sudo chmod +x /etc/eks/containerd/pull-sandbox-image.sh
sudo chmod +x /etc/eks/containerd/pull-image.sh
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat << EOF | sudo tee /etc/systemd/system/containerd.service.d/10-compat-symlink.conf
[Service]
ExecStartPre=/bin/ln -sf /run/containerd/containerd.sock /run/dockershim.sock
EOF
cat << EOF | sudo tee -a /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF
cat << EOF | sudo tee -a /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
# skip docker
Configure log rotation:
# logrotate
sudo mv $WORKING_DIR/logrotate-kube-proxy /etc/logrotate.d/kube-proxy
sudo mv $WORKING_DIR/logrotate.conf /etc/logrotate.conf
sudo chown root:root /etc/logrotate.d/kube-proxy
sudo chown root:root /etc/logrotate.conf
sudo mkdir -p /var/log/journal
Download kubelet and aws-iam-authenticator:
## download binaries in the China region
S3_DOMAIN="amazonaws.com.cn"
S3_PATH="s3://amazon-eks/1.26.4/2023-05-11/bin/linux/amd64"
# Verify that the aws-iam-authenticator is at least v0.5.9
BINARIES=(
kubelet
aws-iam-authenticator
)
for binary in ${BINARIES[*]}; do
aws s3 cp $S3_PATH/$binary . --region cn-north-1
sudo chmod +x $binary
sudo mv $binary /usr/bin/
done
Continue configuring services:
# kubernetes
sudo mkdir -p /etc/kubernetes/manifests
sudo mkdir -p /var/lib/kubernetes
sudo mkdir -p /var/lib/kubelet
sudo mkdir -p /opt/cni/bin
CNI_PLUGIN_FILENAME="cni-plugins-linux-${ARCH}-${CNI_PLUGIN_VERSION}"
aws s3 cp --region $BINARY_BUCKET_REGION $S3_PATH/${CNI_PLUGIN_FILENAME}.tgz .
su ...
Comparing Buffered I/O and Direct I/O
Test method. Write a file using buffered I/O:
#!/bin/bash
perf record -T -C 0 -- taskset -c 0 dd if=/dev/zero of=./a.dat bs=4k count=16384
Write a file using direct I/O:
#!/bin/bash
perf record -T -C 0 -- taskset -c 0 dd if=/dev/zero of=./a.dat bs=4k count=16384 oflag=direct
Results. Buffered I/O:
[root@ip-172-31-53-200 perf_records]# ./start_test_bufferio.sh
16384+0 records in
16384+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.118848 s, 565 MB/s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.225 MB perf.data (485 samples) ]
ll -h
-rw-r--r--. 1 root root 64M Jun 1 13:45 a.dat
[root@ip-172-31-53-200 ~]# dstat -tf
----system---- -----cpu0-usage----------cpu1-usage----------cpu2-usage----------cpu3-usage---- dsk/nvme0n1 ---net/lo-----net/eth0- ---paging-- ---system--
time |usr sys idl wai stl:usr sys idl wai stl:usr sys idl wai stl:usr sys idl wai stl| read writ| recv send: recv send| in out | int csw
01-06 13:35:48| 2 0 99 0 0: 1 0 99 0 0: 0 1 98 0 0: 1 0 99 0 0|8192B 35k|1096B 1096B: 968B 828B| 0 0 | 712 971
01-06 13:35:49| 25 9 60 5 0: 0 0 99 0 0: 4 12 84 0 0: 4 8 90 0 0| 0 64M|1096B 1096B: 576B 756B| 0 0 |2283 1311
01-06 13:35:50| 6 1 94 0 0: 1 1 99 0 0: 16 2 83 0 0: 0 1 99 0 0| 0 0 |1096B 1096B: 156B 418B| 0 0 | 954 1018
Direct I/O:
[root@ip-172-31-53-200 perf_records]# ./start_test_directio.sh
16384+0 records in
16384+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 10.4225 s, 6.4 MB/s
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.417 MB perf.data (41489 samples) ]
[root@ip-172-31-53-200 ~]# dstat -tf
----system---- -----cpu0-usage----------cpu1-usage----------cpu2-usage----------cpu3-usage---- dsk/nvme0n1 ---net/lo-----net/eth0- ---paging-- ---system--
time |usr sys idl wai stl:usr sys idl wai stl:usr sys idl wai stl:usr sys idl wai stl| read writ| recv send: recv send| in out | int csw
01-06 13:36:36| 0 1 99 0 0: 1 1 99 0 0: 0 1 100 0 0: 0 1 100 0 0| 0 0 |1097B 1097B: 688B 624B| 0 0 | 622 930
01-06 13:36:37| 3 4 32 61 0: 4 14 81 0 0: 1 0 97 0 0: 0 5 94 0 0| 0 4277k|1095B 1095B: 332B 338B| 0 0 |6434 3133
01-06 13:36:38| 3 3 0 92 0: 1 1 99 0 0: 3 1 96 0 0: 0 0 99 0 0| 0 6421k|1096B 1096B: 52B 174B| 0 0 |8767 4148
01-06 13:36:39| 4 4 0 92 0: 0 0 99 0 0: 4 1 96 0 0: 2 0 100 0 0| 0 6431k|1096B 1096B: 52B 150B| 0 0 |8790 4191
01-06 13:36:40| 4 4 0 91 0: 0 1 99 0 0: 2 1 96 0 0: 0 1 99 0 0| 0 6320k|1096B 1096B: 52B 142B| 0 0 |8744 4092
01-06 13:36:41| 4 4 0 92 0: 1 0 99 0 0: 3 0 96 0 0: 0 0 100 0 0| 0 6216k|1096B 1096B: 52B 142B| 0 0 |8662 4103
01-06 13:36:42| 3 4 0 92 0: 1 1 99 0 0: 2 2 96 0 0: 0 0 99 0 0| 0 7492k|1576B 1576B: 52B 134B| 0 0 |8756 4099
01-06 13:36:43| 3 3 0 91 0: 1 0 99 0 0: 4 1 96 0 0: 0 0 100 0 0| 0 6284k|1096B 1096B: 52B 134B| 0 0 |8720 4077
01-06 13:36:44| 4 2 0 92 0: 0 0 99 0 0: 2 1 96 0 0: 0 0 99 0 0| 0 6296k|1096B 1096B: 52B 134B| 0 0 |8788 4067
01-06 13:36:45| 4 5 0 91 0: 1 0 99 0 0: 4 0 96 0 0: 1 0 99 0 0| 0 6368k|1096B 1096B: 52B 134B| 0 0 |8792 4071
01-06 13:36:46| 3 5 0 92 0: 1 1 99 0 0: 4 1 96 0 0: 0 0 100 0 0| 0 5904k|1096B 1096B: 52B 134B| 0 0 |8576 3893
01-06 13:36:47| 25 7 0 69 0: 0 0 99 0 0: 2 1 96 0 0: 0 0 100 0 0| 0 4811k|1097B 1097B: 364B 763B| 0 0 |7035 3360
01-06 13:36:48| 4 0 96 0 0: 1 0 99 0 0: 22 3 75 1 0: 0 1 100 0 0|2642k 109k|1095B 1095B: 208B 472B| 0 0 | 977 1008
01-06 13:36:49| 0 1 99 0 0: 0 0 100 0 0: 0 0 98 0 0: 0 0 99 0 0| 0 0 |1096B 1096B: 104B 276B| 0 0 | 640 903
Perf sampling results (the recorded profiles are not reproduced here), buffered I/O vs. direct I/O; see the report sketch below.
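To actually compare the two runs, a sketch for reading the recorded data (run in the directory where perf record wrote perf.data); buffered I/O should be dominated by page-cache copy paths, direct I/O by block-layer submission and waiting:

perf report --stdio -i perf.data | head -30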
Parsing VPC Flow Logs
How to read VPC flow logs:
https://docs.amazonaws.cn/vpc/latest/userguide/flow-logs.html#flow-log-records
https://docs.amazonaws.cn/vpc/latest/userguide/flow-logs-records-examples.html#flow-log-example-tcp-flag
The tcp-flags field in a VPC flow log records not the flags of any single TCP packet header, but the bitwise OR of the TCP flags seen across all packets of that flow within the observation window.
TCP flags can be OR-ed during the aggregation interval. For short connections, the flags might be set on the same line in the flow log record, for example, 19 for SYN-ACK and FIN, and 3 for SYN and FIN. For an example, see TCP flag sequence. For general information about TCP flags (such as the meaning of flags like FIN, SYN, and ACK), see TCP segment structure on Wikipedia.
The value in the record is computed as follows, reading the flags from right to left, starting at the 0th power (a decoding sketch follows the table):
FIN 2^0
SYN 2^1
RST 2^2
PSH 2^3
ACK 2^4
URG 2^5
ECE 2^6
CWR 2^7
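A small decoding sketch following the table above; for example 19 = 16 + 2 + 1 = ACK + SYN + FIN:

#!/bin/bash
# decode a VPC flow log tcp-flags value into flag names
val=${1:-19}
flags=(FIN SYN RST PSH ACK URG ECE CWR)
for i in "${!flags[@]}"; do
  (( val & (1 << i) )) && echo "${flags[$i]}"
done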
Linux OS Debugging Notes
Triggering an NMI Unknown interrupt on EC2 Linux. Send a diagnostic request to the EC2 instance to raise an NMI Unknown event in the OS; this event triggers kdump, which records the state of the machine at that moment.
aws ec2 send-diagnostic-interrupt --region cn-north-1 --instance-id i-********************
The captured dump files are stored under /var/crash/:
[root@mysql 5.14.0-284.11.1.el9_2.x86_64]# ll /var/crash/
total 0
drwxr-xr-x. 2 root root 67 Jun 6 05:22 127.0.0.1-2023-06-06-05:22:11
drwxr-xr-x. 2 root root 67 Jun 6 08:58 127.0.0.1-2023-06-06-08:58:20
drwxr-xr-x. 2 root root 67 Jun 9 09:39 127.0.0.1-2023-06-09-09:39:56
Analyzing with the crash command requires installing kernel-debug, kernel-debuginfo, and kernel-devel:
[root@mysql 5.14.0-284.11.1.el9_2.x86_64]# crash /usr/lib/debug/lib/modules/5.14.0-284.11.1.el9_2.x86_64/vmlinux /var/crash/127.0.0.1-2023-06-09-09\:39\:56/vmcore
Related documentation:
New – Trigger a Kernel Panic to Diagnose Unresponsive EC2 Instances; Send a diagnostic interrupt (for advanced users)
Browsing kernel source with cscope
# Download the source code
yum install -y yum-utils
yum download --source kernel
# Extract the source package
rpm2cpio ./kernel-5.14.0-284.11.1.el9_2.src.rpm | cpio -div
tar xf ./linux-5.14.0-284.11.1.el9_2.tar.xz
# Build the cscope database for browsing the source
make cscope ARCH=x86
# Generate the tags file
make tags ARCH=x86
# Browse
cscope -d
Dracut usage and commands
# Add drivers to the initramfs
]$ dracut -f -v --add-drivers "nvme ena" /boot/initramfs-$(uname -r).img $(uname -r)
# Check whether the modules are present in the initramfs
]$ lsinitrd /boot/initramfs-$(uname -r).img | grep -E "nvme|ena"
User-space processes hanging because of security software. Red Hat documentation on this issue:
https://access.redhat.com/solutions/5201171
https://access.redhat.com/solutions/2838901
How to use ftrace, plus a few related commands:
[root@ip-172-31-51-167 ~]$ echo 'func fanotify_get_response +p' > /sys/kernel/debug/dynamic_debug/control
This traces the call and outputs a call graph. It is the kernel's dynamic tracing facility, an old mechanism that predates kprobes; the trace results go directly to dmesg.
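For the ftrace side, a minimal function_graph sketch for the same function (assumes tracefs is mounted at /sys/kernel/tracing, or /sys/kernel/debug/tracing on older setups, and that the function is traceable):

cd /sys/kernel/tracing
echo fanotify_get_response > set_graph_function
echo function_graph > current_tracer
head -50 trace
# reset when done
echo nop > current_tracer
echo > set_graph_function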
[root@ip-172-31-51-167 ~]$ perf trace -s -p 2688
[root@ip-172-31-51-167 ~]$ cd /var/crash/127.0.0.1-2023-08-11-06:53:10
[root@ip-172-31-51-167 ~]$ crash /usr/lib/debug/lib/modules/6.1.34-59.116.amzn2023.x86_64/vmlinux vmcore
[root@ip-172-31-51-167 127.0.0.1-2023-08-11-06:53:10]$ ll /var/crash
total 0
drwxr-xr-x. 2 root root 67 Aug 10 05:52 127.0.0.1-2023-08-10-05:52:16
drwxr-xr-x. 2 root root 67 Aug 10 06:13 127.0.0.1-2023-08-10-06:13:44
drwxr-xr-x. 2 root root 67 Aug 11 05:45 127.0.0.1-2023-08-10-13:03:27
drwxr-xr-x. 2 root root 91 Aug 12 15:05 127.0.0.1-2023-08-11-04:57:13
drwxr-xr-x. 2 root root 67 Aug 11 08:41 127.0.0.1-2023-08-11-06:53:10
drwxr-xr-x. 2 root root 67 Aug 11 20:56 badstop
drwxr-xr-x. 2 root root 41 Aug 11 20:46 crash
Basic grubby usage. Setting kernel parameters:
# Show the parameters of all kernels
$ grubby --info=ALL
# Set the default boot kernel
$ grubby --set-default-index=1
# Show the current default boot kernel
$ grubby --default-kernel
# Remove parameters from all kernels
$ grubby --update-kernel=ALL --remove-args="systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M loglevel=8 crashkernel=512M"
# Add parameters to all kernels
$ grubby --update-kernel=ALL --args="systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M loglevel=8 crashkernel=512M"
# Add parameters to a specific kernel
$ grubby --update-kernel=/boot/vmlinuz-5.9.1-1.el8.elrepo.x86_64 --args="systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M loglevel=8 crashkernel=512M"
[root@ip-172-31-0-170 ~]# sudo kdumpctl status
kdump: Kdump is operational
[root@ip-172-31-0-170 ~]# sudo kdumpctl showmem
kdump: Reserved 256MB memory for crash kernel
[root@ip-172-31-0-170 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/boot/vmlinuz-6.1.34-59.116.amzn2023.x86_64 root=UUID=483d7075-a0f8-4ba8-a951-a668fa079cac ro console=tty0 console=ttyS0,115200n8 nvme_core.io_timeout=4294967295 rd.emergency=poweroff rd.shell=0 selinux=1 security=selinux quiet systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M loglevel=8 crashkernel=512M
Quickly Launch Prometheus and Grafana
Quickly spin up a working Prometheus and Grafana for testing, keeping their data in the current directory so it survives restarts.
Create the directories:
mkdir /opt/monitor
mkdir /opt/monitor/grafana
mkdir /opt/monitor/grafana_data
mkdir /opt/monitor/prometheus
mkdir /opt/monitor/prometheus_data
touch /opt/monitor/docker-compose.yaml
Create the docker-compose file:
---
version: "3"
services:
prometheus:
container_name: prometheus
image: reg.liarlee.site/docker.io/prom/prometheus:latest
restart: always
network_mode: host
environment:
- TZ=Asia/Shanghai
volumes:
# - /opt/monitor/prometheus/prometheus.yaml:/etc/prometheus/prometheus.yml
- /opt/monitor/prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--storage.tsdb.retention.time=90d'
grafana:
container_name: grafana
image: reg.liarlee.site/docker.io/grafana/grafana-oss:main-ubuntu
restart: always
network_mode: host
environment:
- TZ=Asia/Shanghai
volumes:
- /opt/monitor/grafana_data:/var/lib/grafana
- /opt/monitor/grafana/datasource:/etc/grafana/provisioning/datasources
# - /opt/monitor/grafana/grafana.ini:/etc/grafana/grafana.ini
- /etc/localtime:/etc/localtime:ro
user: '472'
Bootstrap the base configuration files:
docker compose up -d
docker cp grafana:/etc/grafana/grafana.ini /opt/monitor/grafana/grafana.ini
docker cp prometheus:/etc/prometheus/prometheus.yml /opt/monitor/prometheus/prometheus.yaml
chown -R 472:472 /opt/monitor/grafana_data
chown -R 472:472 /opt/monitor/grafana
chown -R nobody:nobody /opt/monitor/prometheus_data
chown -R nobody:nobody /opt/monitor/prometheus
docker compose down --remove-orphans
Provision Prometheus as the default datasource:
mkdir -p /opt/monitor/grafana/datasource
touch /opt/monitor/grafana/datasource/datasource.yml
---
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
url: http://localhost:9090
isDefault: true
access: proxy
editable: true
Edit the needed parameters in the copied config files, uncomment the corresponding mounts in the compose file, then restart:
docker compose down --remove-orphans && docker compose up -d
Single-Table Database Testing
Testing prompted by this question: why should a single MySQL table stay under 20 million rows? (A back-of-the-envelope sketch of the usual reasoning follows.)
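For context, a sketch of the reasoning usually cited for the 20M figure, assuming a three-level InnoDB B+tree, 16 KiB pages, ~14 bytes per non-leaf entry (8-byte bigint key plus 6-byte page pointer), and ~1 KiB rows:

# fanout per internal page ~ 16384/14 ~ 1170; rows per leaf page ~ 16384/1024 = 16
echo $(( (16384/14) * (16384/14) * (16384/1024) ))   # => 21902400, i.e. ~20M rows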
Test procedure:
CREATE TABLE test(
id int NOT NULL AUTO_INCREMENT PRIMARY KEY comment 'primary key',
person_id int not null comment 'user id',
person_name VARCHAR(200) comment 'user name',
gmt_create datetime comment 'creation time',
gmt_modified datetime comment 'modification time'
) comment 'person info table';
Insert data:
insert into test values(1,1,'user_1', NOW(), now());
insert into test (person_id, person_name, gmt_create, gmt_modified)
select (@i:=@i+1) as rownum, person_name, now(), now() from test, (select @i:=100) as init;
set @i=1;
-- Test SQL; record the run time of each (a shell timing sketch follows):
select count(*) from test;
select count(*) from test where id=XXX;
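A sketch for collecting the timings from the shell; the mysql client's -vvv flag echoes each result with its elapsed time (assumes a local server and the test database above):

for i in 1 2 3; do
  mysql -vvv test -e 'select count(*) from test;' | grep 'in set'
done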
Check the table's size:
show table status like 'test'\G
Table with 2M rows:
mysql> describe table test;
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
| 1 | SIMPLE | test | NULL | ALL | NULL | NULL | NULL | NULL | 2092640 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
1 row in set, 1 warning (0.00 sec)
MySQL [test]> select count(*) from test;
1 row in set (0.045 sec)
1 row in set (0.050 sec)
1 row in set (0.050 sec)
4M:
mysql> describe table test;
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
| 1 | SIMPLE | test | NULL | ALL | NULL | NULL | NULL | NULL | 4185280 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
1 row in set, 1 warning (0.00 sec)
MySQL [test]> select count(*) from test;
1 row in set (0.126 sec)
1 row in set (0.120 sec)
1 row in set (0.119 sec)
8M:
mysql> describe table test;
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
| 1 | SIMPLE | test | NULL | ALL | NULL | NULL | NULL | NULL | 8370120 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------+
1 row in set, 1 warning (0.00 sec)
MySQL [test]> select count(*) from test;
1 row in set (0.266 sec)
1 row in set (0.266 sec)
1 row in set (0.253 sec)
16M:
mysql> describe table test;
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| 1 | SIMPLE | test | NULL | ALL | NULL | NULL | NULL | NULL | 16337090 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
1 row in set, 1 warning (0.00 sec)
MySQL [test]> select count(*) from test;
1 row in set (0.544 sec)
1 row in set (0.524 sec)
1 row in set (0.523 sec)
32M:
mysql> describe table test;
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
| 1 | SIMPLE | test | NULL | ALL | NULL | NULL | NULL | NULL | 32665301 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------+
1 row in set, 1 warning (0.00 sec)
MySQL [test]> select count(*) from test;
1 row in set (1.068 sec)
1 row in set (1.057 sec)
1 row in set (1.044 sec)
The results scale essentially linearly; the data volume here feels far too small to reveal any inflection point.
mysql> show table status like 'test'\G
*************************** 1. row ***************************
Name: test
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 4182365
Avg_row_length: 48
Data_length: 202063872
Max_data_length: 0
...
Miscellaneous Network Knowledge (Nowhere Better to Put It)
Original version: https://datatracker.ietf.org/doc/html/rfc1180
Chinese version: http://arthurchiao.art/blog/rfc1180-a-tcp-ip-tutorial-zh/
Tuning initcwnd for optimum Performance:
https://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/
https://www.kawabangga.com/posts/5217
Other Linux network stack explanations (Linux kernel networking): https://www.clockblog.life/article/2023/7/4/44.html
https://blogs.runsunway.com/
buffer/cache That Cannot Be Freed
The problem: I came across a case asking why, after echo 3 > /proc/sys/vm/drop_caches, buffer/cache still was not reclaimed and memory usage stayed high.
The commands:
root@ip-172-31-47-174 ~# free -h
total used free shared buff/cache available
Mem: 7.5Gi 1.3Gi 3.5Gi 2.1Gi 2.6Gi 3.8Gi
Swap: 0B 0B 0B
root@ip-172-31-47-174 ~# echo 3 > /proc/sys/vm/drop_caches
root@ip-172-31-47-174 ~# free -h
total used free shared buff/cache available
Mem: 7.5Gi 1.3Gi 3.7Gi 2.1Gi 2.4Gi 3.9Gi
Swap: 0B 0B 0B
root@ip-172-31-47-174 ~# free
total used free shared buff/cache available
Mem: 7833520 1376608 3917368 2160524 2539544 4059944
Swap: 0 0 0
Analysis and answer. Analysis: at first I did not see anything specific wrong and assumed the application genuinely could not give up the cache space because it was in use.
Firefox, for example, uses cache space to store data when it starts.
Answer: the next day I saw the expert's update. The cause: the kernel also counts shared (shmem) memory inside buff/cache, so free's output is correct. Part of the cache had in fact been released; the number did not drop much because the remainder is shmem space, which... cannot be dropped.
I was somewhat shocked by this conclusion at first; I had always assumed the memory in the shared column was counted separately. Looking carefully at the output above, the shared space is indeed almost equal to buff/cache.
Test environment:
OS: Arch Linux x86_64
Kernel: 6.3.1-arch2-1
Software Version: free from procps-ng 3.3.17
Test method: I only wanted to demonstrate how free accounts for this, so I simplified things and just inflated shared memory directly.
# Create a temporary mount point
# /dev/shm would also work, but its capacity is limited to half of physical RAM
sudo mkdir /mnt/tmpfs/
# Mount a tmpfs at /mnt/tmpfs/
sudo mount -t tmpfs -o size=5000m shared /mnt/tmpfs/
# Create a file to occupy that memory
sudo fallocate -l 4G /mnt/tmpfs/file
Testing with the steps above shows that shared memory is indeed also counted inside buff/cache, exactly matching the customer's observation; the command simply works this way.
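A quick cross-check sketch: the Shmem value in /proc/meminfo should track free's shared column, and that amount also sits inside buff/cache:

grep -E '^(Shmem|Cached|Buffers):' /proc/meminfo
free -k | awk 'NR==2 {print "shared =", $5, "kB; buff/cache =", $6, "kB"}'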
Following this thread, the next step is to check /proc/meminfo and the kernel docs; noting that down as the plan.
I checked free's manpage right away and found, sure enough, that it has not been updated to call this out. Summary:
shared      Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
buffers     Memory used by kernel buffers (Buffers in /proc/meminfo)
cache       Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)
buff/cache  Sum of buffers and cache
Looking at /proc/meminfo confirms the value does include Shmem, and the value itself is correct. The question is now why the kernel reports it this way.
sudo cat /proc/meminfo | grep -Ei "mem|cache|buffer|active"
MemTotal: 7833520 kB
MemFree: 4417692 kB
MemAvailable: 4615060 kB
Buffers: 0 kB
Cached: 2550368 kB # this counter includes shmem... free is right
SwapCached: 0 kB
Active: 830560 kB
Inactive: 2397160 kB
Active(anon): 444220 kB
Inactive(anon): 2385016 kB
Active(file): 386340 kB
Inactive(file): 12144 kB
Shmem: 2151884 kB # this reflects the file we created; converted to a file size it is roughly 2 GB
The question now becomes: when exactly did meminfo change its accounting, and why count it this way?
I am not going to hunt for the exact commit; let's just read the code. Hopefully I am reading the right thing.
// https://elixir.bootlin.com/linux/latest/source/fs/proc/meminfo.c
static int meminfo_proc_show(struct seq_file *m, void *v)
{
struct sysinfo i;
...
cached = global_node_page_state(NR_FILE_PAGES) -
total_swapcache_pages() - i.bufferram;
if (cached < 0)
cached = 0;
...
show_val_kb(m, "MemTotal: ", i.totalram);
show_val_kb(m, "MemFree: ", i.freeram);
show_val_kb(m, "MemAvailable: ", available);
show_val_kb(m, "Buffers: ", i.bufferram);
show_val_kb(m, "Cached: ", cached);
show_val_kb(m, "SwapCached: ", total_swapcache_pages());
show_val_kb(m, "Active: ", pages[LRU_ACTIVE_ANON] +
pages[LRU_ACTIVE_FILE]);
show_val_kb(m, "Inactive: ", pages[LRU_INACTIVE_ANON] +
pages[LRU_INACTIVE_FILE]);
show_val_kb(m, "Active(anon): ", pages[LRU_ACTIVE_ANON]);
show_val_kb(m, "Inactive(anon): ", pages[LRU_INACTIVE_ANON]);
show_val_kb(m, "Active(file): ", pages[LRU_ACTIVE_FILE]);
show_val_kb(m, "Inactive(file): ", pages[LRU_INACTIVE_FILE]);
show_val_kb(m, "Unevictable: ", pages[LRU_UNEVICTABLE]);
show_val_kb(m, "Mlocked: ", global_zone_page_state(NR_MLOCK));
You can see the values are pulled from a sysinfo struct, the result is computed into the cached variable, and then printed:
// https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/sysinfo.h#L14
__kernel_ulong_t bufferram; /* Memory used by buffers */
OK, I cannot read any further. The final Cached value is: all file-backed pages in the system, minus SwapCached, minus Buffers. Roughly that; computed this way it does indeed include the shmem portion. Mail archive and explanation:
That’s a reasonable position to take.
Another point of view is that everything in tmpfs is part of the page cache and can be written out to swap, so keeping it as part of Cached is not misleading.
I can see it both ways, and personally, I'd lean towards clarifying the documentation about how shmem is accounted rather than changing how the memory usage is reported.
The quote above, from the linked mail thread, explains why shmem is counted in Cached; it sounds like a new, separate metric should handle this now.
For more, see the reference links; I also watched an expert's walkthrough, which explains it clearly.
References:
https://zhuanlan.zhihu.com/p/586107891
https://www.cnblogs.com/tsecer/p/16290025.html