Auto syncing

author: Bill <bill@billserver.senders.io> 2021-04-25 23:37:11 -0400
committer: Bill <bill@billserver.senders.io> 2021-04-25 23:37:11 -0400
commit: 7a1f212bda7280ec6a6fb16f1e5c1bbda2866f06 (patch)
tree: 36a9dab83c9c66b969945d9eed463d57d170d99f
parent: feceae5ea2c4ca6e10c9f9e5e058287615080892 (diff)
3 files changed, 102 insertions, 3 deletions
diff --git a/gemini/gemlog/2021-04-26-auto-syncing.gmi b/gemini/gemlog/2021-04-26-auto-syncing.gmi
new file mode 100644
index 0000000..416414a
--- /dev/null
+++ b/gemini/gemlog/2021-04-26-auto-syncing.gmi
@@ -0,0 +1,101 @@
+# Auto Syncing
+
+I have a remote server that acts as a sort of DMZ between my friends and my local server. I recently obtained two 16TB drives that I set up in my local server to act as a NAS. The only issue is that my remote server has limited space and its directory layout doesn't match the local one, so I can't just rely on a basic rsync. So I wrote a script that checks the remote directory for new files and downloads them.
+
+## ssync
+
+I called the script ssync and set it up to run from cron every minute (*/1).
+
+=> https://git.senders.io/senders/ssync/tree/ssync [https] ssync (git)
+
+```ssync
+... # variable setup and prechecks
+
+log "Fetching files"
+mkdir -p "$RUN_DIR"
+# List everything on the remote modified since the previous run, relative to REMOTE_DIR
+ssh -i "$KEY_FILE" "$REMOTE" \
+  "find ${REMOTE_DIR} -newermt ${PREV_RUN_DATE} -exec realpath --relative-to ${REMOTE_DIR} {} \;" \
+  >> "$CURGET_FILE"
+# Keep only the paths that are not already in the fetched list
+comm -23 <(sort -u "$CURGET_FILE") <(sort -u "$FETCHED_FILE") > "$FETCH_FILE"
+COUNT=$(wc -l < "$FETCH_FILE")
+
+if [ "$COUNT" -gt 0 ]; then
+  # Syncing
+  log "Found ${COUNT} files to fetch"
+
+  # Record the files as fetched before syncing so overlapping runs skip them
+  cat "$FETCH_FILE" >> "$FETCHED_FILE"
+  log "Wrote files to fetched files"
+  log "Syncing now"
+  # -I already runs one rsync per line, so -n1 is not needed alongside it
+  xargs -P"$PARALLEL" -I '{}' rsync -e "ssh -i $KEY_FILE" \
+    -av \
+    "$REMOTE:${REMOTE_DIR}/{}" "${SRC_DIR}" < "$FETCH_FILE"
+else
+  log "No files to sync"
+fi
+echo "$NEXT_RUN_DATE" > "$LASTRAN_FILE"
+log "Done syncing"
+```
+
+The script relies on 5 main utilities: 
+* ssh
+* find
+* comm
+* xargs
+* rsync
+
+I use ssh to connect to the remote server, find all the files that have been created or modified since the last run, and write that list to a file on my local machine. Then I compare that list against everything I've already fetched, which drops anything that has been fetched (or is currently being fetched).
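+
+As a minimal sketch of that comparison step, the snippet below runs comm -23 over two small lists. The file names and contents are made up for illustration, and temp files stand in for the real list files; it is not taken from ssync itself.
+
+```shell
+# Hypothetical sample data, not real ssync state.
+workdir=$(mktemp -d)
+printf '%s\n' 'c/three.txt' 'a/one.txt' 'b/two.txt' > "$workdir/curget"   # what the remote find reported
+printf '%s\n' 'b/two.txt' > "$workdir/fetched"                            # already fetched, or in flight
+
+# comm expects sorted input, so sort both lists first.
+sort -u "$workdir/curget" > "$workdir/curget.sorted"
+sort -u "$workdir/fetched" > "$workdir/fetched.sorted"
+
+# -2 hides lines found only in the second file and -3 hides lines found in both,
+# leaving the lines that appear only in the first file: the ones still to fetch.
+comm -23 "$workdir/curget.sorted" "$workdir/fetched.sorted" > "$workdir/fetch"
+cat "$workdir/fetch"
+```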
+
+With the new list of files to grab (the output of comm), I xargs those files into rsync with a parallelism of 5 (about what my bandwidth can manage when the files are large).
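+
+To illustrate the fan-out without touching the network, here is a small sketch where echo stands in for rsync. The list contents and temp paths are hypothetical; only the xargs flags mirror what is described above.
+
+```shell
+workdir=$(mktemp -d)
+printf '%s\n' 'a/one.txt' 'c/three.txt' 'd/four.txt' > "$workdir/fetch"
+
+# -I '{}' runs one command per input line, substituting the line for {}.
+# -P 5 allows up to five of those commands to run at the same time.
+# echo is a stand-in for the real rsync invocation.
+xargs -P 5 -I '{}' echo "would fetch: {}" < "$workdir/fetch" > "$workdir/out"
+
+# Parallel jobs can finish in any order, so sort before reading the result.
+sort "$workdir/out"
+```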
+
+Once the sync is complete I record the start time of the script as the date to search from on the next run. This means that if a sync takes 15 minutes, every run that starts during those 15 minutes still looks for files using the date of the last successful run. I did this to make sure we're not missing anything. I like to write my scripts with a margin of error - I'd rather have a duplicate on my local machine than lose something. The fetched file contains a list of everything fetched, or in the process of being fetched, so even though the ssh listing returns those files again, comm weeds them out.
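+
+The bookkeeping described above can be sketched roughly like this. The variable names, the date format, and the temp directory are assumptions for illustration (the real setup lives in the elided part of ssync), and a local find stands in for the one run over ssh.
+
+```shell
+workdir=$(mktemp -d)
+mkdir -p "$workdir/remote"
+lastran_file="$workdir/lastran"
+echo '1970-01-02 00:00:00' > "$lastran_file"   # hypothetical value left by an earlier run
+
+prev_run_date=$(cat "$lastran_file")           # date the search starts from
+next_run_date=$(date '+%Y-%m-%d %H:%M:%S')     # captured before any slow work begins
+
+touch "$workdir/remote/new-file.txt"           # pretend something arrived on the remote
+found=$(find "$workdir/remote" -type f -newermt "$prev_run_date")
+echo "$found"
+
+# ... a long-running sync would go here; runs that start meanwhile still read the old date ...
+
+# Only after the sync finishes does the start time become the new search date.
+echo "$next_run_date" > "$lastran_file"
+```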
+
+Everything goes into an inbox directory that I sort through periodically, cataloging the files into their proper directories.
+
+## Joy
+
+It is very nice to no longer have to think "did I sync this directory yet?". Since I have a limited amount of space on the remote server, it's good to know I can just delete anything a few days old without worry.
+
+I've had this remote server for 8 years and I've never got around to setting this up. So it's a huge relief having this finally.
+
+## Pain
+
+This entire process would've been SO MUCH simpler if I had just made the two directories match in file structure. I could have simply run rsync between the two top-level directories and called it a day.
+
+## find
+
+I use xargs and comm all the time. But find is one of those utilities that, honestly, I never knew had the -exec option. It has been a huge lifesaver in getting everything I had missed off of this remote server.
+
+I wanted to find any files that didn't already exist on my local machine, so on both servers I ran:
+
+```index.sh
+find . -type f -exec basename {} \; | sort > index.txt
+```
+
+This created a list of all of the files on each device. Then I could just comm -23 the two lists and get the ones I still needed to fetch. I could rely on the names not conflicting, so I just did:
+
+```get-realpath.sh
+# Use % as the xargs placeholder so that find's own {} is left for -exec
+cat remote-only.txt | xargs -I% find . -iname '*%' -exec realpath {} \;
+```
+
+And now I knew exactly where the files were that I was missing and I could figure out the best way to fetch all the files.
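+
+Here is a small end-to-end sketch of that index-and-compare approach on two throwaway directory trees. The directory layout and file names are invented for the example, and it prints paths relative to the remote tree rather than calling realpath so the result does not depend on where the temp directory lives.
+
+```shell
+workdir=$(mktemp -d)
+mkdir -p "$workdir/remote/x/y" "$workdir/local/inbox"
+touch "$workdir/remote/x/y/movie.mkv" "$workdir/remote/x/song.flac"   # hypothetical remote files
+touch "$workdir/local/inbox/song.flac"                                # already on the local side
+
+# Index each tree by bare file name; the index files live outside the trees being indexed.
+(cd "$workdir/remote" && find . -type f -exec basename {} \; | sort > "$workdir/remote-index.txt")
+(cd "$workdir/local" && find . -type f -exec basename {} \; | sort > "$workdir/local-index.txt")
+
+# Names that exist only on the remote side.
+comm -23 "$workdir/remote-index.txt" "$workdir/local-index.txt" > "$workdir/remote-only.txt"
+
+# Look each missing name up again to learn where it lives in the remote tree.
+(cd "$workdir/remote" && xargs -I % find . -name '%' < "$workdir/remote-only.txt") > "$workdir/paths.txt"
+cat "$workdir/paths.txt"
+```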
+
+## I spent my weekend shell scripting
+
+I don't know about most of you - but I honestly could not compute without access to the shell. So much of what I do is simplified because I can write a line of commands that execute the actions I want as a single process (through pipelining) and then make a few adjustments and run it again for a new set of data.
+
+=> https://youtu.be/tc4ROCJYbm0?t=341 [youtube] AT&T Archives: The UNIX Operating System (ts=5:41)
+
+I recommend you watch the entire video if you're a fan of unix-style operating systems, but at around 6 minutes Brian Kernighan explains pipelining and the power it provides. I use this video as a reference all the time when asked what I love about my Linux machine, and why I wouldn't want to go back to Windows full time. He breaks it down in such a clear and precise way that I find it useful for explaining to non-technical people, or people unfamiliar with unix or the command line, why and how the command line can be so powerful.
+
+## Time to write
+
+This was a perfect candidate for a script because syncing files over the internet takes time, so spending a few hours perfecting the script ultimately saves me time and gives me peace of mind. Writing a script to look at each file and work out where to move it once it lands on the local machine is not worth the time, though. A simple mv takes seconds at most, but parsing the file name, maybe looking at some metadata, and trying to guess which directory it belongs in has far too many requirements to even list. A human taking a few minutes on the weekend to create some new dirs and move the files into them is the better option, and this is where scripting isn't worth it (yet).
+
+## Conclusion
+
+I know I go into shell scripting in my reply about the shell not being good for automation - and I wouldn't be surprised if I've gushed over it in a few other gemlogs. There is something about the terminal that just clicks with how my brain wants to manage the PC, and I wouldn't want to use a computer without it.
+
+# Links
+
+=> /gemlog/ Gemlog
+=> / Home
diff --git a/gemini/gemlog/index.gmi b/gemini/gemlog/index.gmi
index c0f81c1..e7f661f 100644
--- a/gemini/gemlog/index.gmi
+++ b/gemini/gemlog/index.gmi
@@ -4,6 +4,7 @@ Welcome to my gemlog. I post whenever I do something I feel is worth writing abo
 
 ## My posts
 
+=> 2021-04-26-auto-syncing.gmi 2021-04-26 - Auto Syncing
 => 2021-04-25-stowaway-2021.gmi 2021-04-25 - Stowaway (2021)
 => 2021-04-23-re-the-linux-shell-is-not-a-good-automation-platform.gmi 2021-04-23 - re: The Linux shell is not a good automation platform
 => 2021-04-21-vaccination.gmi 2021-04-21 - I got vaccinated! (part 1)
diff --git a/gemini/index.gmi b/gemini/index.gmi
index f4ea114..0df41ed 100644
--- a/gemini/index.gmi
+++ b/gemini/index.gmi
@@ -56,6 +56,3 @@ And if there is anything critical about this capsule/hosting/security please sen
 
 Thanks! And if you sub, shoot me an email and I'll happily sub back :)
 
-## P.S - Cert migration (2021-04-06)
-
-I migrated my cert! Sorry for the inconvenience!