CS3214 Spring 2022 Exercise 0
In this class, you are required to have familiarity with Unix commands and Unix programming environments. The first part of this exercise is related to making sure you are comfortable in our Unix environment. The second part relates to the use of basic command line and standard I/O facilities from the application developer’s perspective. The third part focuses on the difference between byte and character streams.
1 Using Linux
It is crucial that everybody become productive using a Unix command line, even if the computer you are using daily is a Windows or OSX machine. Working on the command line requires working knowledge of a shell such as bash, fish, or zsh, but it also requires an understanding of the most common system commands and how the shell interacts with these commands and with user programs.
Please do the following, then answer the questions below.
- Remote Terminal Access. Make sure your personal machine has an ssh client installed. Set your machine up for public key authentication when logging on to rlogin.cs.vt.edu. Use ssh-keygen to create a key.
There is also a web interface provided by the department that allows you to create a key pair at https://admin.cs.vt.edu/my-ssh-keys/. In this case, you will not maintain continuous possession of the private key from its inception (and thus should use this key only for your SLO account).
At the end of this step, you should be able to ssh into rlogin without having to type a password.
- Command-line Editing. Make sure you know how to use the command line editing facilities of your shell. For bash users, which most of you are by default, examine the effect of the following keys when editing: ^d, TAB, ^a, ^e, ^r, ^k, and ^w. Then memorize these keystrokes, making them part of your finger memory.
Examine the effect of the following keys when you invoke a program: ^c, ^s, ^q (^x stands for Ctrl-x.)
- Shell Customization. Customize your shell and create a custom prompt and any aliases you may need. A custom prompt typically includes the name of the machine you’re on and at least part of the pathname of the shell’s current directory as when setting PS1 to [\u@\h \W]\$
- Terminal Editors. Make sure you know how to use at least one command line editor, such as vim, nano, pico, or emacs. We recommend vim, an editor that “can match the speed at which you think.”
- Visual Studio Code. Many students set up a remote environment that allows them to use an IDE on their computer. Notably, Microsoft’s Visual Studio Code offers an extension that provides a well-integrated remote environment within the IDE. Although not mandatory, we highly recommend that you do this as well. The TAs will share instructions on how to do that.
Answer the following questions:
- Based on the specific method you chose to set up key authentication you will have created a file with a corresponding name that represents your identity. What is the name of that file and on which computer is it stored?
- You also needed to inform the rlogin system by providing the public key derived from the private key that represents your identity. What entry did you need to add to which file?
- The tilde ~ is a shell shortcut. What is the output of running echo ~ on an rlogin cluster machine? What is the output of running echo ~cs3214?
- Make sure that ~cs3214/bin is part of your PATH variable when you log on. To test it, log off and log on again, then type echo $PATH. When we say ~cs3214/bin must be part of your PATH we are somewhat imprecise – what is the actual string that represents the directory that must be a component of your PATH?
- How many machines are part of the rlogin cluster (Hint: visit http://rlogin.cs.vt.edu/) this semester? Include only those whose names are derived from trees, e.g. “birch.”
- Make sure your bash prompt includes your username, the name of the current machine, and a suffix of the current directory. To prove it, copy the value of your $PS1 variable here.
- Some filenames in your home directory start with a dot. When you type ls, these are not shown. How can you list those files or directories?
- Define an alias for rm, such as alias rm='rm -i' and make sure the alias is in effect every time you log on. To which startup file did you add the alias definition?
- The diff Unix command compares two files line by line. It is typically used to create “patch files” which capture a change made to one or more related files.
When applying this patch and then compiling and running the resulting program, what output do you obtain?
- Which Unix group(s) are you currently a member of on our cluster?
- Under standard Unix permissions, if a directory has permissions drwx------, who can access it?
2 Understanding Command Line Arguments and Standard I/O in Unix
In the past, we observed that some students coming into CS 3214 did not understand how programs access their command line arguments and how they make use of the standard input/output facilities, which present one of the basic abstractions provided by an operating system. Some students came with the mistaken impression that “standard input” and “standard output” represent input or output from/to some kind of “console.”
To practice this knowledge, write a C program that concatenates a combination of given files and/or its standard input stream to its standard output stream. The exact specification is as follows.
Your program should be called concatenate.c.
When compiled and invoked without arguments, it should copy the content of its standard input stream to its standard output stream. “Standard input” and “standard output” are standard streams that are set up by a control program that starts your program (often, the control program is a shell).
When invoked with arguments, it should process the arguments in order. Each argument should be treated as the name of a file. These files should be opened and their content should be written to the standard output stream, in the order in which they are listed on the command line. If the name of any file that is provided is - (a single hyphen), then the program should read and output the content of its standard input stream instead in this place.
If any of the files whose names are given on the command line do not exist, the program’s behavior is undefined.
Your C program may make use of C’s stdio library (e.g., the family of functions including fgetc, etc.), or it may use system calls such as read() or write() directly. You should buffer data to avoid frequent system calls, but you may not assume that it is possible to buffer the entire file content in memory all at once.
Implementation Requirement: to make sure you understand the uniformity provided by the POSIX C API, we require that your program define a function, and then use this function to copy the data contained in files as well as the data it reads from its standard input stream. Your program’s main() function will then call this single function multiple times, as needed. In other words, do not special case standard input/output by providing a separate code path for standard input/output that makes use of facilities such as getchar() that implicitly refer to the standard input stream. Your code should be DRY.
You may use the script test-concat.sh to test your code.
3 Understanding how to access the Standard Input and Output Streams in your Preferred Language
Standard input and output are not concepts that are specific to the use of C. Choose a language that is not C (e.g. C++, Go, Ruby, Java, Python 2, Python 3,
As described in the Bash Hacker’s Wiki, you can use the "$@" shorthand to refer to the script’s arguments, which are passed on to the Java program:
#!/bin/sh
# save this file as wrap-java.sh
java -Xmx120m Concatenate "$@"
You should use test-concat.sh to test by passing the name of your script or executable as an argument.
Java is an exception here: although the JVM is an ordinary Unix process, it makes certain assumptions about how much memory is available to it, which means it will not run well when this memory is limited from the outside. For Java implementations, you should run the test with:
SKIP_MEMORY_LIMIT=yes ./test-concat.sh ./wrap-java.sh
and make sure that the memory your program uses is instead limited in wrap-java.sh via the -Xmx flag.
You are encouraged to read test-concat.sh as it provides more examples of how to run programs on the command line. It also shows the different ways in which a shell can control a program’s standard input streams.
Implementation Requirement: the implementation requirement is the same. Do not special case standard input/output; use a single function. Unlike in C, you need not write this single function yourself: some languages provide a suitable function in their standard library that you may be able to find.
Efficiency. You should use buffered forms of input and output in order to reduce the number of system calls your program makes. For instance, in C, the stdio library provides such buffering by default if you use fgetc() or fread(), whereas if you use the lower-level read() call directly you will need to make sure that you do buffering yourself (in other words, read multiple bytes at once rather than a single byte in each call). The autograder will run your program under a suitable timeout that is designed to eliminate submissions that lack buffering.
Use of Byte Streams. For both parts 2 and 3, your program must not attempt to interpret the content of the streams it reads and writes in any way. In other words, it should output the bytes (octets) that appear in the input as they appear, without making assumptions or processing them in any way. This includes the possible occurrence of the byte value 0x00, which may occur any number of times in the input and must be copied into the output.
Avoid Character-based Input Routines. Many real-world programs process input that is thought to represent characters, which has contributed to the fact that the I/O libraries of some higher-level languages default to the assumption that programmers will want to input and/or output character streams in some valid encoding when accessing file streams. Note that character streams are abstractions built on top of byte streams - at the process/OS boundary all I/O is byte-based (this is true for at least the vast majority of contemporary environments).
For the two implementations of concatenate you’re being asked to implement, do not assume that the input represents characters in any valid encoding. Specifically, the input data may not represent a valid UTF-8 encoding, and therefore, attempts to interpret it as UTF-8 data and decode it will fail for some tests, resulting in exceptions and/or data corruption. This means that you must be careful to avoid the default implementation in those languages that default to imposing a character stream abstraction, which include Python 3 and Java. Instead, you will need to examine their API and find the corresponding constructs that give you access to byte-based streams, which are sometimes referred to as “binary” forms of input or output.
4 Understanding Character-Based I/O
In this part of the exercise, you will implement a simple utility that interprets its standard input stream as a stream of UTF-8 encoded Unicode characters which it then counts. If the
the Unix tool wc -m, which counts the number of Unicode characters in the input stream, except that wc -m does not report an error if the input stream is not in a valid encoding.
Write a program unicodecount.c using only functions that are part of the C standard library. You may use the fgetwc (easiest) or the mbrtowc functions, or identify the length of each encoded Unicode character manually. Your program must use buffering as well. Remember to use the setlocale(3) function to set the character type locale (LC_CTYPE) to "en_US.utf8".
If the standard input stream does not consist of correctly encoded Unicode characters in the UTF-8 transfer encoding, output:
Invalid or incomplete multibyte or wide character
else output the number of Unicode code points (each representing a Unicode character) found in the input stream.
Finally, write the same program in a high-level language of your choice.
Unlike for the concatenate program, you only need to process the program’s standard input stream and you do not need to handle the case where names of files are passed as command line arguments.
You may use the script test-unicodecount.sh to test your code. Independent of the size of the input stream, your program must not use more than 120MB of virtual memory – this is how we will enforce that your program does not attempt to buffer the entire content of its standard input stream in memory.